9.2.2 Worked Examples


Example 9.4. Converting two-valued categorical data to dummy variables
A categorical variable must have at least two categories. Suppose a categorical variable has exactly two values, which indicate whether or not the category applies to a particular individual. A good example of this is "Gender", which in our data has two values: male and female. Each person is coded as either male or female (M or F, 0 or 1, etc.), never both. This means that we can create two dummy variables, one for GenderMale and one for GenderFemale. Because no observation can fall into more than one category at the same time, each observation will have exactly one of these two dummy variables equal to 1 and the other equal to 0. So we can go down the list of data and enter 1 or 0 as needed to create our dummy variables.
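The coding step described above can be sketched in a few lines of Python. This is only an illustration: the list of gender codes and the helper name to_dummies are hypothetical and do not come from the text's data.

```python
# Hypothetical sample of gender codes (illustrative only, not data from the text).
genders = ["M", "F", "F", "M", "F"]


def to_dummies(values, categories):
    """Return one list of 0/1 indicators per category, keyed by category."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


dummies = to_dummies(genders, ["M", "F"])
gender_male = dummies["M"]
gender_female = dummies["F"]

# Each observation has exactly one of the two dummy variables equal to 1.
for m, f in zip(gender_male, gender_female):
    assert m + f == 1

print(gender_male)    # [1, 0, 0, 1, 0]
print(gender_female)  # [0, 1, 1, 0, 1]
```

Note that the two lists always sum to 1 row by row, which is why one of them carries no information beyond the other.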


Example 9.5. Converting multi-valued categorical data to dummy variables
What about categorical variables with more than two categories? A good example of this is an employee’s education, which is coded with several category values (0, 2, 4, 6, 8) indicating the level of postsecondary education the employee has had, where 0 indicates no postsecondary education, 2 indicates an associate’s degree, 4 indicates a bachelor’s degree, 6 indicates a master’s degree and 8 indicates a doctorate. Each employee is classified according to the Education categorical variable and is assigned to one and only one of the five possible educational levels. We create one dummy variable for each level (Ed0, Ed2, Ed4, Ed6, Ed8); each employee has a 1 in the dummy variable that matches his or her level and a 0 in the other four. In the end, you would wind up with the following data:

Original data                              Dummy variables
Categorical variable: Education            Five dummy variables (Ed#)
Has five categories: 0, 2, 4, 6, 8

Employee has          Education            Ed0   Ed2   Ed4   Ed6   Ed8
No postsecondary      0                    1     0     0     0     0
Associate’s degree    2                    0     1     0     0     0
Bachelor’s degree     4                    0     0     1     0     0
Master’s degree       6                    0     0     0     1     0
Ph.D.                 8                    0     0     0     0     1
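The same idea extends to five categories. The following Python sketch maps an education code to its five dummy variables; the function name education_dummies is an assumption made for illustration, while the codes and the Ed# names follow the table.

```python
# Education codes used in the example: 0, 2, 4, 6, 8.
EDUCATION_LEVELS = [0, 2, 4, 6, 8]


def education_dummies(code):
    """Map one education code to its five dummy variables Ed0..Ed8."""
    if code not in EDUCATION_LEVELS:
        raise ValueError(f"unknown education code: {code}")
    # Exactly one dummy variable is 1: the one matching the employee's level.
    return {f"Ed{level}": int(code == level) for level in EDUCATION_LEVELS}


# An employee with a bachelor's degree (code 4).
print(education_dummies(4))
# {'Ed0': 0, 'Ed2': 0, 'Ed4': 1, 'Ed6': 0, 'Ed8': 0}
```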







Example 9.6. Regression equations with dummy variables
Suppose we have a database of employee information and are interested in whether "gender" has an effect on an employee’s salary. Such questions are common in gender discrimination lawsuits. (We are not saying that employers purposely compute salaries differently for male and female employees. We are merely saying that after everything is accounted for, it is possible that gender underlies some of the salary differences among employees.) In our hypothetical data, we have three variables: gender, age, and annual salary. A sample of this data is shown below. Gender is a categorical variable with two values: "M" for male and "F" for female. Age is simply the age of the employee. We are using this as a stand-in (or surrogate) variable to include the effects of experience, education, and other time-related factors on salary. Annual salary is coded in actual dollars. We want to build a regression model to predict annual salary.

              Gender   Age   Annual Salary
Employee 1    M        55    57457
Employee 2    F        43    36345
Employee 3    F        25    23564
Employee 4    M        49    38745
Employee 5    F        52    41464
...           ...      ...   ...

First we create dummy variables, "GenderM" and "GenderF". Employee 1 is male, so this observation will have GenderM = 1 and GenderF = 0. Employee 2 will have GenderM = 0 and GenderF = 1, since employee 2 is female. The data now contains five variables: Gender, Age, Annual Salary, GenderM, and GenderF. To build the regression model, we select the explanatory variables that are appropriate. However, we cannot use both dummy variables. Let’s use GenderF in the equation. After all, if GenderF = 0, then we know the employee is male, so we don’t need the other dummy variable. The regression output looks exactly like multiple regression output and can be read in exactly the same way. We find the full regression model to be

Annual Salary = 4667 - 2345*GenderF + 845*Age

When GenderF has value 0 (male employee), the salary is

Annual Salary = 4667 - 2345*(0) + 845*Age = 4667 + 845*Age

When GenderF has value 1 (female employee), the salary is

Annual Salary = 4667 - 2345*1 + 845*Age = 2322 + 845*Age

We can now see that the single regression equation with dummy variables is actually two separate equations, one for each gender:

For a female employee: Annual Salary = 2322 + 845*Age
For a male employee: Annual Salary = 4667 + 845*Age

What do these equations mean? When we control for age, that is, when the ages of the employees are the same, the model predicts that a female employee will earn $2345 per year less than a male employee. Notice that the slope of the two equations (the rate at which salary increases with age) is the same for both male and female employees. What is different is the baseline salary, represented in these equations by the y-intercepts.
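As a quick check of this interpretation, the fitted equation can be evaluated directly. The sketch below uses the coefficients reported above; the function name predict_salary and the choice of ages are assumptions made for illustration.

```python
# Coefficients from the fitted model in the text:
# Annual Salary = 4667 - 2345*GenderF + 845*Age
INTERCEPT = 4667
COEF_GENDER_F = -2345
COEF_AGE = 845


def predict_salary(gender, age):
    """Predicted annual salary for gender 'M' or 'F' at the given age."""
    gender_f = 1 if gender == "F" else 0  # dummy variable GenderF
    return INTERCEPT + COEF_GENDER_F * gender_f + COEF_AGE * age


# At any age, the predicted gap between male and female salaries is the same.
for age in (25, 43, 55):
    male = predict_salary("M", age)
    female = predict_salary("F", age)
    print(age, male, female, male - female)
# 25 25792 23447 2345
# 43 41002 38657 2345
# 55 51142 48797 2345
```

The constant difference of 2345 in the last column is the coefficient on GenderF: the two lines are parallel and differ only in their intercepts.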