3.2 Categorical Data and Means

The ideas in the previous section - modeling a set of data by using the mean and treating the standard deviation as the ”goodness of fit” for the model - will not work if our data is categorical, because we cannot add, subtract, multiply or divide with categorical data like ”Male” and ”Female”. Yet, we often have both categorical and numerical data and wish to know a variety of things about the underlying situation. Consider shipping records for a company that contain data on each shipment: the total weight, during what shift (morning, afternoon, night) it was loaded, truck size (van, semi, other) and so forth. We might want to profile the data to determine what a typical day-shift load looks like in order to help us plan employee schedules. How might we go about this?

One way would involve going through all the data by hand, adding up the information about just the day shift-related shipments, and then creating a report. However, typical data sets can number in the thousands of observations per variable and such a hand-compiled approach is not practicable. Another way might be to use the software to sort the data by shift and look at it that way. In a package like Excel, we could use the ”autofilter” feature to display only the daytime shift information. But, if we also want to know what a typical afternoon and night shift load are like, we would have to repeat all this again for each shift. An additional drawback is that if the data change (a new month is added, for example) we have to start over.

Fortunately, most modern statistical and data analysis software is designed to easily allow users to cross-section their data in a variety of ways. In Excel, the main tool for this is called a Pivot Table; in R, the same tool is accomplished with the reshapeGUI. These can be used to easily produce a report like the following one which has taken a spreadsheet of raw data and created a summary table for how the count of customers varies over two variables (in this case, the day of the week and the location in the restaurant).







Average of Count
Location





Day Cranny Hole Nook Grand Total





D 26.796 19.115 28.622 24.844





F 19.734 8.770 22.135 16.879





H 16.809 8.592 22.322 15.908





M 16.763 12.056 18.849 15.889





S 19.845 15.697 22.289 19.277





T 16.974 12.260 18.414 15.883





W 13.306 8.743 21.852 14.634





Grand Total 18.604 12.176 22.069 17.616






Table 3.1: Pivot table showing average seating in the areas of Over Easy, by day of week.

This table took about 30 seconds to produce from a set of data with almost 200 observations of each variable. Even for larger sets of data, that’s about all it takes. And look at what we can easily learn from this table. One thing that really stands out is that more people sit in the Nook area. This could be for a variety of reasons. For example, this area might have the best window view or be closest to the entrance (or farthest from the entrance). On the other hand, the Hole area doesn’t ever seem to have many people in it. It might be too dark or it could be located right next to the bathrooms.

At the same time, we have to ask if these results are significant. Maybe this data is skewed somehow and more typical data from a longer time period (several months) wouldn’t display such a large difference between the Hole and the Nook areas? To test this, one normally uses statistical tests called chi-square tests or z-tests. Our approach in this book will be to look for overall patterns rather than test for statistical significance. In this case, there is another important consideration before we read too much into the pivot table: Are all three areas of equal size? If the Hole area only has seating for 20 people, then it is often close to full. At the same time, if the Nook area has seating for 50 people, it is actually not being used much at all. This is one reason why it is often vital to introduce a computed variable into the data. In this case, we should probably run statistics not on the actual counts of people in each area. Instead, we could compute the percent capacity of each area (the number of people divided by maximum capacity of the area) and then run our statistics.

  3.2.1 Definitions and Formulas
  3.2.2 Worked Examples
  3.2.3 Exploration 3B: Gender Discrimination Analysis with Pivot Tables