Chapter 9
Multiple and Categorical Regression

In this chapter, we will explore relationships that are more realistic: one variable will depend on several other variables. This is the most common scenario in analyzing data. Consider the salary of an employee at a company. Most likely, that salary is based on a combination of factors: educational background, prior experience in a related job, job level in the company, and number of years with the company, just to name a few. Trying to explain salary with any one of these variables alone will leave a large amount of unexplained variation in the model. This is because there are probably several employees with the same educational background (like a Bachelor’s degree) but different experience, and they will make different salaries. If you try to predict salary based only on education, the model will have a great deal of error caused by this spread in the data. Essentially, the problem is caused by trying to account for too much variation in salary with too few variables. In this chapter, we will use multiple linear regression to model relationships in which a single response quantity depends on several explanatory variables at one time. Multiple regression works much like simple linear regression, but it carries more information (more slopes to deal with) and has another measure of validity, called the adjusted R².
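As a small illustration of why an adjusted measure is useful, the sketch below computes the adjusted R² from an ordinary R². It assumes the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k is the number of explanatory variables. The function name and the sample numbers are hypothetical and are not drawn from any data set in this chapter.

```python
def adjusted_r_squared(r_squared, n_observations, n_explanatory):
    """Adjusted R^2 under the standard definition (an assumption here):
    1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    degrees_of_freedom = n_observations - n_explanatory - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than explanatory variables plus one")
    penalty = (n_observations - 1) / degrees_of_freedom
    return 1 - (1 - r_squared) * penalty


# Hypothetical numbers for illustration only:
# the same R^2 from 11 observations, with 2 versus 5 explanatory variables.
two_vars = adjusted_r_squared(0.90, 11, 2)
five_vars = adjusted_r_squared(0.90, 11, 5)
print(round(two_vars, 3), round(five_vars, 3))
```

With the same R², the model that uses more explanatory variables receives the lower adjusted value, which is why the adjusted R² can guide the choice of which variables to keep.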

The second part of this chapter will take us back to looking at categorical data. Up to now, we’ve created models using only numerical variables. Many of the data sets that we are interested in, however, include categorical data. In the past, to analyze such data, we have been forced to “unstack” the data and make several graphs. One can certainly continue in this fashion, but if there are several different categorical variables of interest, the process becomes time-consuming. As it happens, there is an agreed-upon method for converting categorical data into numerical data by introducing dummy variables. You will learn how to create dummy variables and how to build and interpret regression models that use them. By the end of the chapter, you will have a powerful collection of tools for modeling data. You will be able to represent relationships with several variables, using numerical variables, categorical variables, or a combination of the two.
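To give a first feel for the conversion, here is a minimal sketch of one common way to build dummy variables: each category except a chosen reference category gets its own column of 0s and 1s. The helper name `make_dummies`, the category labels, and the choice of reference category are all hypothetical; the chapter's own procedure may differ in its details.

```python
def make_dummies(values, reference):
    """Return one 0/1 column per category, skipping the reference category.

    This is an illustrative helper, not a function from any library.
    """
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference category does not appear in the data")
    columns = {}
    for category in categories:
        if category == reference:
            continue  # the reference category gets no column of its own
        columns[category] = [1 if value == category else 0 for value in values]
    return columns


# Hypothetical education levels for five employees (made-up data).
education = ["Bachelor", "Master", "HighSchool", "Bachelor", "Master"]
dummies = make_dummies(education, reference="HighSchool")
for name, column in dummies.items():
    print(name, column)
```

An employee whose dummy columns are all 0 belongs to the reference category, so no information is lost by leaving that category without a column.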

As a result of this chapter, students will learn

How to read the measures of validity for multiple regression output

What the coefficients in a multiple regression output mean

How graphs can help interpret the validity of a multiple regression model

How multiple regression can handle more complex problems than simple regression

What dummy variables are

What a reference category is

How many equations are really hidden inside a single model with dummy variables

As a result of this chapter, students will be able to

Set up a multiple linear regression model

Write down the regression equations for a multiple regression model

Analyze the accuracy of a multiple regression model

Make predictions, using , from a model

Determine appropriate variables to use, based on the adjusted R² value

Create dummy variables for a set of data (if needed, to construct models with categorical factors)

Construct a model using dummy variables

Identify the reference category in a model

Interpret a model with dummy variables, including all the “hidden equations”

 9.1 Modeling with Proportional Reasoning in Many Dimensions
  9.1.1 Definitions and Formulas
  9.1.2 Worked Examples
  9.1.3 Exploration 9A: Production Line Data
 9.2 Modeling with Qualitative Variables
  9.2.1 Definitions and Formulas
  9.2.2 Worked Examples
  9.2.3 Exploration 9B: Maintenance Cost for Trucks
 9.3 Homework
  Mechanics and Techniques Problems
  Application and Reasoning Problems
 9.4 Memo Problem: Gender Discrimination