We are often presented with data in newspapers, magazines, on the Internet, or in meetings, but these data are rarely presented in their entirety. After all, in many cases there are thousands of observations of each variable. It is therefore more common to present summarized data in the form of tables or charts that show the number (or frequency) of observations that fall into a certain range (or bin). In the last chapter, we used this idea to create a graphical depiction of the data in the form of a histogram. But what if you are starting from the summarized data and want to know something about the original data itself?
For example, what if you wish to compute the mean of the data? This is the most frequently used measure of central tendency and is often used as a model of the data. The way in which we compute this measure of central tendency is based on having all of the individual data points in the set of data. In a summarized table of data, though, we do not have the actual values to add up. One thing is certain: we cannot simply average the frequency counts, as this does nothing to account for the actual values of the data, and the frequency counts are not (usually) even in the same units as the data itself. For example, in looking at the table below, we see data on salary distribution at a company. If we average the frequency counts (labeled "Number of Employees") we get 11.8, which means that if the distribution were uniform, there would be 11.8 employees in each salary range. But this number has units of number of people. The average salary must have units of dollars. Somehow, we must estimate the mean based on both the salary ranges and the number of observations in each range.
Salary Range | Number of Employees |
$200,000 - $250,000 | 1 |
$150,000 - $199,999 | 2 |
$100,000 - $149,999 | 5 |
$50,000 - $99,999 | 13 |
$0 - $49,999 | 38 |
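The arithmetic behind the 11.8 figure can be checked with a short sketch. The variable names here are our own, chosen only for illustration; the counts are copied from the table above.

```python
# Frequency counts from the salary table (number of employees per salary range).
counts = [1, 2, 5, 13, 38]

# Total number of employees across all five salary ranges.
total_employees = sum(counts)

# Average of the frequency counts: this is "employees per salary range",
# so its units are people, not dollars.
mean_count = total_employees / len(counts)

print(total_employees)  # 59
print(mean_count)       # 11.8
```

Notice that no salary value ever enters this calculation, which is why the result cannot tell us anything about the average salary.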
Unfortunately, as we’ll discover, once you have only the summarized data, there is no way to get the actual mean of the original data. At best, you are estimating the mean, and your estimate can carry a great deal of error, depending on the size (width) of each bin into which the data have been summarized. These same ideas hold true for estimating the standard deviation of the data, especially since we must first estimate the mean in order to compute the deviations of each observation (or, in this case, each group of observations) from the mean.
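To see how bin width limits what we can know, here is a minimal sketch that brackets the true mean for the salary table. It computes the smallest possible mean (every employee at the bottom of their range) and the largest possible mean (every employee at the top). It also computes a single estimate by placing every employee at the midpoint of their range; that midpoint rule is a common convention we are assuming here, not something established in the discussion so far. All variable names are hypothetical.

```python
# Salary bins from the table: (lower bound, upper bound, number of employees).
bins = [
    (200_000, 250_000, 1),
    (150_000, 199_999, 2),
    (100_000, 149_999, 5),
    (50_000, 99_999, 13),
    (0, 49_999, 38),
]

total = sum(n for _, _, n in bins)

# Smallest possible mean: everyone earns the lower bound of their range.
lowest_mean = sum(lo * n for lo, _, n in bins) / total

# Largest possible mean: everyone earns the upper bound of their range.
highest_mean = sum(hi * n for _, hi, n in bins) / total

# Assumed convention (not established in the text above): place every
# observation at the midpoint of its range to get a single estimate.
midpoint_mean = sum((lo + hi) / 2 * n for lo, hi, n in bins) / total

print(f"true mean lies between {lowest_mean:,.2f} and {highest_mean:,.2f}")
print(f"midpoint estimate: {midpoint_mean:,.2f}")
```

The gap between the two bounds is roughly $50,000, which is about the width of one salary range. Narrower bins would shrink that gap, and wider bins would enlarge it.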
And while it is true that in many cases we have the actual data and can compute the true mean of the data, this is often not the case. Have you ever filled in a customer satisfaction survey? Such surveys often collect demographic data, such as the age of the person filling in the form, but rarely do they ask you to write in your age. It is more common to check off a box marking a range where your age fits (for example, 31-40 years old). In situations like this, the data start as a summarized frequency table; the company collecting the data never has the actual ages of each survey participant. So they must resort to estimating the mean if they need it for other calculations.