For most statistical packages, an explanatory variable is the name of a column of data. This name usually sits at the head of its data column in the spreadsheet and appears, as we have seen, in the regression equation. A statistical package carries out regression analysis by regarding all entries in a column under a variable name as numerical data. The data listed under a categorical variable, however, may be in the form of words or letters so that the mathematical operations necessary to perform linear regression would not make any sense.
What we need is a way to convert the categories of a categorical variable into numbers, and we must do it in a way that everyone can agree on; otherwise, the mathematics will not make sense. The key to converting categorical data into numerical data is this: categorical data falls into two or more categories, but no observation is ever in more than one category at a time. In other words, if a variable called "Style of House" has the categories "colonial", "ranch", "split-level", "cape cod", and "other", then any given house (a single observation of "Style of House") can be only one of these types.
What we cannot do is simply number each category. Numerical data is, by its very nature, ordered data; it has a natural structure. In mathematics, 3 is bigger than 2, and 2 is bigger than 1. So how, ahead of time, can we know which category is "bigger" than another? How do we know which category should be numbered 1, which should be 2, and so on? The problem is that this approach tries to pack all the categories into a single variable with different numerical values, which imposes an ordering that does not exist in the data. Since we cannot determine such an ordering ahead of time, we must find another way to convert the categorical data into numerical data.
In order for statistical packages to be able to create regression models, each value of a categorical variable must be translated into its own separate "dummy" variable, such as StyleColonial, StyleRanch, StyleSplitLevel, and so on. These dummy variables can take only the values 1 or 0. For a given observation, exactly one of the dummy variables will equal 1: the one named for the category that the observation fits into. The other dummy variables associated with this categorical variable will be 0, because the observation does not fall into those categories. Essentially, statistical packages, such as StatPro, handle categorical data as switches: either a category applies or it does not; it is "on" (equal to 1) or it is "off" (equal to 0).
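This "one dummy per category, exactly one switched on" idea can be sketched in plain Python. The function name and the exact dummy-variable labels below are illustrative, chosen to match the StyleColonial/StyleRanch naming used above:

```python
# The categories of the "Style of House" variable from the text.
styles = ["colonial", "ranch", "split-level", "cape cod", "other"]

def to_dummies(observation):
    """Convert one categorical observation into a dict of 0/1 dummy variables.

    Exactly one dummy is 1 (the category the observation belongs to);
    all the others are 0.
    """
    return {
        "Style" + s.title().replace(" ", "").replace("-", ""): int(observation == s)
        for s in styles
    }

row = to_dummies("ranch")
# row is {"StyleColonial": 0, "StyleRanch": 1, "StyleSplitLevel": 0,
#         "StyleCapeCod": 0, "StyleOther": 0}
```

Because the categories are mutually exclusive, the dummies for any single observation always sum to 1, which is exactly the "switch" behavior described above.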
We can then use these dummy variables (and not the original categorical variable) to build a regression equation. Each of these dummy variables will have its own coefficient. This allows us to create complex models using all sorts of data. After all, you expect categorical data to be important in most models. If you were trying to predict the cost of shipping a package, for example, the package's weight and destination might be important, but so would its fragility: "fragile" packages would cost more to ship than "durable" packages. The only way to include this characteristic in the model is through dummy variables.