3.1 The Mean As A Model

Consider what we have so far: a lot of information in the form of spreadsheets filled with data that we have arranged into variables and observations. But what do we do with all this? Unless you’re really special, you probably can’t learn a lot from looking at a list of one thousand numbers. You would probably learn even less from looking at a thousand observations for each of four different variables. Sets of data in business and science are usually larger than this, so we need a more efficient analysis tool. The tool we will use is a model of the data. A model is a number or formula that represents a set of data. It is not the data itself, but it is meant to capture certain important features of the data that would otherwise not be recognizable in a long list of numbers.

Using models helps us to understand or simplify a situation. Models can also help us make predictions about future events. For example, weather models help us analyze current weather and predict potential future weather patterns. Architectural models help us visualize the design of a building before we commit it to bricks and mortar. In this section we will deal with what is possibly the simplest and most widely used model, the mean of a set of data. Other commonly used models are given by graphs and equations, which we will develop in future chapters, eventually building models that include all sorts of features, such as categorical variables.

Rather than look at the entire set of data at once, we want to look at the data one variable at a time in order to find out what that one variable tells us about the situation in which the data were collected. To make things even easier, we want to reduce the data down to one number that represents the typical data point for that variable. In general, a number used to represent an entire variable is called a statistic. If that statistic is meant to represent the typical data point, we call it an average.

Let’s look at an example. Shown below are the fat and protein counts for 10 of the most popular sandwiches sold at Beef n’ Buns.




Item                             TotalFat (g)   Protein (g)
Super Burger                     39             29
Super Burger w/ Cheese           47             34
Double Super Burger              57             48
Double Super Burger w/ Cheese    65             53
Hamburger                        14             18
Cheeseburger                     18             20
Double Hamburger                 26             31
Double Cheeseburger              34             35
Double Cheeseburger w/ Bacon     37             38
Veggie Burger                    10             14



We can reduce all this data down to the following simple model, which tells us that the “typical” sandwich has 34.7 grams of fat and 32 grams of protein.




Statistic    TotalFat (g)   Protein (g)
Mean         34.7           32.0



The question we should ask ourselves is how well the mean represents a given set of data. Looking at the data above, we see that although the typical sandwich has 34.7 grams of fat, some sandwiches have much higher values than that and some have much lower values.
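As a check on the table of means, here is a minimal Python sketch that recomputes both means from the ten sandwiches listed earlier and then splits the sandwiches by whether their fat content is above or below the mean. The variable names (`sandwiches`, `mean_fat`, and so on) are illustrative choices, not part of the course materials.

```python
from statistics import mean

# (total fat, protein) in grams for the ten sandwiches in the table above.
sandwiches = {
    "Super Burger": (39, 29),
    "Super Burger w/ Cheese": (47, 34),
    "Double Super Burger": (57, 48),
    "Double Super Burger w/ Cheese": (65, 53),
    "Hamburger": (14, 18),
    "Cheeseburger": (18, 20),
    "Double Hamburger": (26, 31),
    "Double Cheeseburger": (34, 35),
    "Double Cheeseburger w/ Bacon": (37, 38),
    "Veggie Burger": (10, 14),
}

fat = [f for f, _ in sandwiches.values()]
protein = [p for _, p in sandwiches.values()]

# The mean is the sum of the values divided by how many values there are.
mean_fat = mean(fat)
mean_protein = mean(protein)
print(f"Mean fat: {mean_fat:.1f} g, mean protein: {mean_protein:.1f} g")

# Which sandwiches sit above the "typical" fat value, and which below?
above = [name for name, (f, _) in sandwiches.items() if f > mean_fat]
below = [name for name, (f, _) in sandwiches.items() if f < mean_fat]
print("Above the mean:", above)
print("Below the mean:", below)
```

No sandwich has exactly the mean fat content, so every item lands in one of the two lists. That is a first hint that the mean alone does not describe the data fully.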

The first step in getting an overall measure for how the data values differ from the mean is to develop a standardized ruler to measure how close the observations are to one another. For example, in a crowd of people, your arm-length is a good measuring stick for “closeness”: If someone is less than one arm-length away from you, you would consider them “close”. However, this distance is not appropriate when driving down the freeway. A more appropriate measuring stick for this situation would be the length of a car. The Federal Aviation Administration has yet another definition of close: aircraft are not allowed within 1000 feet of each other without declaring a “near miss.”

These situations all describe ways of measuring “closeness” that refer to real physical distances. Seldom, however, do managers deal with these kinds of distances. More commonly, they collect data measured in dollars or years. Can we find a way to measure distance that will make sense for almost any situation that managers encounter?

As you’ve probably guessed, we can. To do so, however, we need to decide where to start measuring from. Most of the time we start measuring at zero, but this may not help very much when looking at sales figures in millions of dollars, especially if none of the figures is near zero. Rather than pick a single fixed place from which to always measure, it makes more sense to use a measure of central tendency, namely the mean of the variable.

Once we have selected the mean as the reference point we can then look at the deviation of each observation from the mean: Is each observation above the mean or below the mean? By how much? Thus, we will always be measuring the spread of our data from a central reference point that pertains to that particular set of data.
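To make the idea of a deviation concrete, the short Python sketch below subtracts the mean from each TotalFat value in the Beef n’ Buns table. A positive result means the sandwich is above the mean and a negative result means it is below. The list `fat` simply repeats the table values in the order they appear; the names are illustrative.

```python
# TotalFat values (grams) from the Beef n' Buns table, in table order.
fat = [39, 47, 57, 65, 14, 18, 26, 34, 37, 10]

mean_fat = sum(fat) / len(fat)

# Deviation = observation minus mean.
# Positive: above the mean. Negative: below the mean.
deviations = [x - mean_fat for x in fat]

for x, d in zip(fat, deviations):
    side = "above" if d > 0 else "below"
    print(f"{x:>3} g is {abs(d):4.1f} g {side} the mean")

# The deviations always cancel out, apart from floating-point rounding.
print(f"Sum of deviations: {sum(deviations):.10f}")
```

Because the positive and negative deviations cancel, simply adding them up cannot tell us how spread out the data are. Any useful measure of spread has to deal with the signs first.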

The measuring tool that we will use to measure the spread of our data is called the standard deviation. This number is different for each set of data, but it is calculated through the same formula each time.




Statistic             TotalFat (g)   Protein (g)
Mean                  34.700         32.000
Standard deviation    18.209         12.561



Looking again at the Beef n’ Buns data, we see that while the typical sandwich has 34.7 grams of fat, the majority of sandwiches actually range in fat grams from 16.5 (subtract 18.2 from 34.7) to 52.9 (add 18.2 to 34.7) grams.
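The band that runs one standard deviation below and above the mean can be checked directly. The sketch below uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation, and then counts how many TotalFat values fall inside the band. The variable names are illustrative.

```python
from statistics import mean, stdev

# TotalFat values (grams) from the Beef n' Buns table.
fat = [39, 47, 57, 65, 14, 18, 26, 34, 37, 10]

m = mean(fat)
s = stdev(fat)  # sample standard deviation

# One standard deviation below and above the mean.
low, high = m - s, m + s
within = [x for x in fat if low <= x <= high]

print(f"Mean: {m:.3f}  Standard deviation: {s:.3f}")
print(f"Band: {low:.1f} g to {high:.1f} g")
print(f"{len(within)} of {len(fat)} sandwiches fall inside the band: {within}")
```

For this data set, six of the ten fat values fall inside the band, which is what "the majority" refers to here.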

You have probably encountered the standard deviation before. If you have, you may have thought that the formula was a little complicated and hard to understand. We are going to take a close look at the formula for standard deviation, because if you understand this formula you will understand a lot about statistics. Although the formula looks difficult, you will quickly learn that every piece of it makes sense and has a reason for being there. It wasn’t made up from thin air by some genius. The formula was developed as the simplest possible way to find an appropriate measuring stick for any set of data. In fact, the formula for standard deviation is essentially the best way to measure the average deviation of the data from the mean.
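As a preview of how the pieces fit together, here is a hedged step-by-step sketch in Python. It assumes the sample form of the standard deviation, which divides by n − 1. That assumption is an inference from the table above, since dividing by n − 1 reproduces the 18.209 shown for TotalFat while dividing by n gives about 17.27. The formal definition is left to the section that follows.

```python
from math import sqrt

# TotalFat values (grams) from the Beef n' Buns table.
fat = [39, 47, 57, 65, 14, 18, 26, 34, 37, 10]
n = len(fat)

# Step 1: find the reference point, the mean.
mean_fat = sum(fat) / n

# Step 2: measure each observation's deviation from the mean.
deviations = [x - mean_fat for x in fat]

# Step 3: square the deviations so positives and negatives cannot cancel.
squared = [d ** 2 for d in deviations]

# Step 4: average the squared deviations.
# ASSUMPTION: divide by n - 1 (sample form), which matches the table value.
variance = sum(squared) / (n - 1)

# Step 5: take the square root to return to the original units (grams).
sd = sqrt(variance)

print(f"Standard deviation of TotalFat: {sd:.3f} g")
```

Each step answers one practical problem: where to measure from, how far each point is, how to stop the signs from cancelling, how to summarize, and how to get back to grams.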

  3.1.1 Definitions and Formulas
  3.1.2 Worked Examples
  3.1.3 Exploration 3A: Wait Times at Beef n’ Buns