4.2 Thinking inside the box

Very often, we find that the measures of central tendency (mean, median and mode) are not enough to describe the data we are exploring. These numbers give us some idea of what a typical data point looks like, but they cannot answer questions like:

How much of the data is less than the average? How much is more than the average?
What is the largest value in the data? What is the smallest value?
Where is "most" of the data? Is it close to the average?
Which measure of central tendency best describes this data?

To answer these questions, we need more tools, which means we need more information. Think about where we started: a collection of data, possibly with thousands of observations of each variable. No human mind can process that much raw data well enough to draw conclusions and make decisions. So we tried the easiest thing possible: reduce all the data to a single statistic that represents its central tendency. Now we can see the limitation of this approach. Any time we reduce thousands of pieces of data to a single number, we lose information about the data. Consider the following statement:

The mean number of children in a U.S. family is 2.2.

Certainly, this does not mean that every family in the U.S. has 2.2 children. In fact, even to claim that the typical family has 2.2 children is a little strange, since the number of children in a family is a discrete numerical quantity. Based on this statement alone, which of the following seems to describe family structures in the U.S. most closely?

Most families in the U.S. have two children. A few families have zero or one child. A few more families have more than two children.
There are more families with two or fewer children than there are families with more than two children.
The number of families with two or fewer children is the same as the number of families with three or more children.

In fact, the mean alone cannot settle the question: each of these statements is consistent with a mean of 2.2. Even the third cannot be ruled out. For example, ten families with 1, 1, 1, 1, 1, 3, 3, 3, 4 and 4 children have a mean of 22/10 = 2.2, with five families on each side of the split. (See if you can build a small data set with mean 2.2 that fits each of the first two statements.) We cannot decide which statement is most accurate without additional information.

One common set of statistics used to get more information about a set of data is the quartiles. The idea behind quartiles is to take the data, put it in order from smallest to largest, and then break it into four quarters, each with the same number of data points in it. We then record the values at the places where the data is broken up, and we call these statistics quartiles. This gives us some idea of how the data is distributed. Graphically, we can represent the quartiles and other information about the spread of the data in a boxplot, a type of graph that contains about seven pieces of information describing the data.
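
The quartile idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set `children` is made up for the example (it is not from any survey), the helper name `five_number_summary` is hypothetical, and the quartile convention used (`method="inclusive"` in the standard library's `statistics.quantiles`) is just one of several in common use. Other conventions can place the cut points slightly differently.

```python
import statistics

# Hypothetical data: number of children in nine families (made up for illustration).
children = [2, 0, 5, 1, 2, 3, 1, 4, 2]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # Step 1: put the data in order from smallest to largest.
    ordered = sorted(data)
    # Step 2: find the three cut points that split the ordered data into quarters.
    # method="inclusive" is an assumed convention; others exist.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    # Step 3: report the cut points together with the smallest and largest values.
    return ordered[0], q1, q2, q3, ordered[-1]


summary = five_number_summary(children)
print("mean:", statistics.mean(children))
print("min, Q1, median, Q3, max:", summary)
```

With nine ordered data points, this convention picks the third, fifth and seventh values as the three quartiles. Unlike the mean on its own, these five numbers show where the middle half of the families falls and how far the extremes lie from it.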

  4.2.1 Definitions and Formulas
  4.2.2 Worked Examples
  4.2.3 Exploration 4B: Relationships Among Data, Statistics, and Boxplots
