4.2.2 Worked Examples


Example 4.4. Making a boxplot when there are an odd number of data points
Consider the list of test scores below:

55, 60, 67, 70, 78, 81, 84, 88, 90, 95, 99

We already determined that the mean of this data is 78.82 and the median is 81. We now divide the list into four equal parts to determine the quartiles. Start by dividing the data into two equal parts, as with finding the median. Then divide each of these into two equal parts. For this data, each quartile should include three data points, since there are 11 total. Notice that the middle data point, the median, is in both the upper half and the lower half of the data when we divide it up.














Lowest 50%
Median
Upper 50%













55 60
67
70 78
81
84 88
90
95 99













Lowest 25%
Lowest 25%
Lowest 25%
Lowest 25%













We now have almost everything that we need to make the boxplot. We just need to check whether there are any outliers in this data. An outlier is more than 1.5 IQR from Q1 or Q3. The interquartile range (IQR) for this data is IQR = Q3 - Q1 = 89 - 68.5 = 20.5. Thus, outliers must be more than 1.5*20.5 = 30.75 from the quartiles. Outliers on the low end would be less than (Q1 - 30.75) = (68.5 - 30.75) = 37.75. Outliers on the high end would be greater than (Q3 + 30.75) = (89 + 30.75) = 119.75. Since there are no data points outside this range, there are no outliers in this data.

To make the boxplot, we simply draw an axis scaled from 55 to 99 (for ease of reading, let’s go from 50 to 100 in steps of 5). We then draw the box part of the graph, extending from Q1 to Q3. We put a vertical line in the box at the median. We add a star or a diamond for the mean, and then we extend the ”whiskers” of the box from the edges out to the minimum and maximum, since there are no outliers. The final result is shown in figure 4.2


PIC


Figure 4.2: Boxplot of test scores



Example 4.5. Reading an Interpreting a Boxplot
Consider the boxplot shown in figure 4.3. It represents the distribution of test scores in a class of 120 students. What can we learn about the class performance from this graph?


PIC


Figure 4.3: Boxplot of test scores for 120 students. What does it say about class performance?


To analyze the graph, we will consider a series of questions:

  1. What is the minimum test score? What is the maximum? What is the range? From the data, the lowest score is a 50 (which is an outlier) and the highest score is a 95. The range is 95 - 50 = 45.
  2. What are the quartiles of the data? The first quartile is 75. The median (second quartile) is about 82 and the third quartile is about 89. The IQR is 89 - 75 = 14. This number is about one-third of the range, indicating a fairly tight spread of data (lots of similar test scores).
  3. How many students scored between Q1 and Q3? We know that 50% of the data is always between Q1 and Q3. This means that 50% of the observations (in this case student test scores) fall between 75 and 89. Since 50% of 120 (the total number of observations) is 0.50*120 = 60, we know that 60 students scored between 75 and 89 on the test. However, this is a little misleading. It is possible that there are multiple students with the same tests score. If these duplications happen to be at the quartiles, then a few more students would be in the 75 to 89 range.
  4. Assuming that a score of 90 is sufficient to earn an ”A”, how many students got an ”A” on the test? This is harder to answer. Notice that the third quartile is 89. This means that 25% of the class (0.25*120 = 30 students) got an 89 or higher. So we only really know that at most 30 students earned an ”A”. It is possible that most of the scores in the third quartile are right at 89 and only a few of them are between 89 and 95, which would lead to a smaller number of A’s on the test.
  5. Did most students do well or poorly on this test? We see that the median is slightly higher than the mean. This shows that more than 50% of the class earned a score above the mean. We do not know exactly what percentage scored above the mean, only that it is between 50% and 75% (since the mean is between Q1 and the median). This indicates that the data is negatively skewed, so that more of it is piled up above the mean than below it. Overall, then, it seems that the class did a little better than average on this test.
  6. What other questions could we ask about the data?


Example 4.6. Side-by-Side Boxplots
Consider the sales data given above in example 3. (Data file C04 Tots.xls [.rda].) Let’s use boxplots to compare the two sales regions and select the region that has the better performance. If you enter the data above into your software and create side-by-side boxplots you should get a graph simlar to the one in figure 4.4.


PIC


Figure 4.4: Side-by-side boxplots of sales from two regions of the Toys-For-Tots company.


As you can see, the boxplot shows that the sales in the North East region are spread over a much greater range of sales figures than the sales in the North Central region. In addition, the highest performing store in the NC region is an outlier and is not at all representative of the region’s performance. However, the lowest 25% of the stores in the NE region are performing worse than all of the stores in the NC region (the minimum for the NC region is about equal to the first quartile for the NE region). By the same token, the upper 25% of stores in the NE region seem to be doing better than all the stores in the NC region (the third quartile of NE region sales is about equal to the maximum for the NC region, if we ignore the outlier). The middle of each region seems to be about the same, with the medians of the two regions almost equal. The mean sales of the NE region (indicated by the small dot) are higher than the mean sales in the MC region, but not by a very significant amount.

Given just these graphs, it might be difficult to determine which region is performing better overall. In general though, it seems that the NE region has more stores performing well than the NC region. Also, the highest performing store is in the NE region. Overall, it looks like the NE region has better sales, but we must remember that the NE region has fewer stores, so each quartile refers to fewer data points. The real question is what is causing the NE region to do better. Is it better management? Less overhead? Wealthier clients? Better marketing? Better service? Some other factor?


Example 4.7. Skewness of data
Generic University offers three sections of its math course for business majors. At the end of the semester, all sections take the same final exam. The boxplots in figure 4.5 show the results of the final separately for each section. The graphs are oriented vertically, rather than horizontally, just for variety. As you can see from the graphs, the minimum scores are the same in each section, as are the maximum scores and the means. Which section did best on the final exam?


PIC


Figure 4.5: Side-by-side boxplots of final exam scores from three sections of a course.


In order to decide which section did best on the final exam, we need to picture how the data itself looks, based on the boxplots. For example, the test scores in section 1 seem to be unevenly spread throughout the range of the data (low score: 40, high score: 99). We can tell this because the median of the data is very close to the upper end of the spread. Half of the students in section 1 scored above 83 on the exam. Even so, the overall mean of this section’s test scores was only 70 because the lower 50% of the class has scores from 83 on down to 40. This unevenness is referred to as skewness. When the mean is smaller than the median, we say the data is negatively skewed because the quantity (Mean - Median) would be less than zero. This is in stark contrast with section 3, where half of the students’ scores are bunched together at the low end of the spread, from 40 to 60, and the top half of the class has scores ranging from 60 up to 99. In this case, the mean is larger than the median, so the data is positively skewed. What about section 2? The data for this section doesn’t seem to be skewed at all; the mean and median are identical. This tells us that half of the students in section 2 scored above 70 and half of the students in section 2 scored below 70. Given all of this, it seems reasonable to conclude that section 1 had the best showing on the exam; more than half of the students in section 1 had exam scores above the median and mean scores of both sections 2 and 3.