Categorical Data and Means

The ideas in the previous section - modeling a set of data by using the mean and treating the standard deviation as the ”goodness of fit” for the model - will not work if our data is categorical, because we cannot add, subtract, multiply or divide with categorical data like ”Male” and ”Female”. Yet, we often have both categorical and numerical data and wish to know a variety of things about the underlying situation. Consider shipping records for a company that contain data on each shipment: the total weight, during what shift (morning, afternoon, night) it was loaded, truck size (van, semi, other) and so forth. We might want to profile the data to determine what a typical day-shift load looks like in order to help us plan employee schedules. How might we go about this?

One way would involve going through all the data by hand, adding up the information about just the day shift-related shipments, and then creating a report. However, typical data sets can number in the thousands of observations per variable and such a hand-compiled approach is not practicable. Another way might be to use the software to sort the data by shift and look at it that way. In a package like Excel, we could use the ”autofilter” feature to display only the daytime shift information. But, if we also want to know what a typical afternoon and night shift load are like, we would have to repeat all this again for each shift. An additional drawback is that if the data change (a new month is added, for example) we have to start over.

Fortunately, most modern statistical and data analysis software is designed to easily allow users to cross-section their data in a variety of ways. In Excel, the main tool for this is called a Pivot Table; in R, the same tool is accomplished with the reshapeGUI. These can be used to easily produce a report like the following one which has taken a spreadsheet of raw data and created a summary table for how the count of customers varies over two variables (in this case, the day of the week and the location in the restaurant).


Average of Count	Location

Day	Cranny	Hole	Nook	Grand Total

D	26.796	19.115	28.622	24.844

F	19.734	8.770	22.135	16.879

H	16.809	8.592	22.322	15.908

M	16.763	12.056	18.849	15.889

S	19.845	15.697	22.289	19.277

T	16.974	12.260	18.414	15.883

W	13.306	8.743	21.852	14.634

Grand Total	18.604	12.176	22.069	17.616

Table 3.1: Pivot table showing average seating in the areas of Over Easy, by day of week.

This table took about 30 seconds to produce from a set of data with almost 200 observations of each variable. Even for larger sets of data, that’s about all it takes. And look at what we can easily learn from this table. One thing that really stands out is that more people sit in the Nook area. This could be for a variety of reasons. For example, this area might have the best window view or be closest to the entrance (or farthest from the entrance). On the other hand, the Hole area doesn’t ever seem to have many people in it. It might be too dark or it could be located right next to the bathrooms.

At the same time, we have to ask if these results are significant. Maybe this data is skewed somehow and more typical data from a longer time period (several months) wouldn’t display such a large difference between the Hole and the Nook areas? To test this, one normally uses statistical tests called chi-square tests or z-tests. Our approach in this book will be to look for overall patterns rather than test for statistical significance. In this case, there is another important consideration before we read too much into the pivot table: Are all three areas of equal size? If the Hole area only has seating for 20 people, then it is often close to full. At the same time, if the Nook area has seating for 50 people, it is actually not being used much at all. This is one reason why it is often vital to introduce a computed variable into the data. In this case, we should probably run statistics not on the actual counts of people in each area. Instead, we could compute the percent capacity of each area (the number of people divided by maximum capacity of the area) and then run our statistics.

3.2 Categorical Data and Means