In this chapter, we will explore relationships that are more realistic: one variable will depend on several variables. This is the most common scenario in analyzing data. Consider the salary of an employee at a company. Most likely, that salary is based on a combination of factors: educational background, prior experience in a related job, job level in the company, and number of years with the company, just to name a few. Trying to explain salary with any one of these variables alone will leave a large amount of variation unexplained by the model. This is because there are probably several employees with the same educational background (like a Bachelor’s degree) but different experience, and they will earn different salaries. If you try to predict salary based only on education, the model will have a great deal of error caused by this spread in the data. Essentially, the problem is caused by trying to account for too much variation in salary with too few variables. In this chapter, we will use multiple linear regression to model relationships in which a single response variable depends on several explanatory variables at one time. Multiple regression works much like simple linear regression, but it carries more information (there are more slopes to deal with) and has another measure of validity, called the adjusted R².
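The salary argument above can be sketched numerically. The following Python example is only an illustration under stated assumptions: the data are synthetic (generated from a made-up rule, not taken from any real company), the helper name `fit_ols` is hypothetical, and the adjusted R² is computed with the common formula 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k is the number of explanatory variables.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept.

    Returns (coefficients, R^2, adjusted R^2). Hypothetical helper for illustration.
    """
    n, k = X.shape                              # n observations, k explanatory variables
    A = np.column_stack([np.ones(n), X])        # prepend a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)               # variation left unexplained
    ss_tot = float(((y - y.mean()) ** 2).sum()) # total variation in y
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return coef, r2, adj_r2

# Synthetic data (an assumption): salary depends on BOTH education and experience.
rng = np.random.default_rng(0)
n = 200
education = rng.integers(12, 21, size=n).astype(float)  # years of schooling
experience = rng.uniform(0, 30, size=n)                 # years of prior experience
salary = 20000 + 2500 * education + 1200 * experience + rng.normal(0, 5000, size=n)

# Model 1: education only. Model 2: education and experience together.
_, r2_one, adj_one = fit_ols(education.reshape(-1, 1), salary)
coef_two, r2_two, adj_two = fit_ols(np.column_stack([education, experience]), salary)

print(f"education only:         R^2 = {r2_one:.3f}, adjusted R^2 = {adj_one:.3f}")
print(f"education + experience: R^2 = {r2_two:.3f}, adjusted R^2 = {adj_two:.3f}")
print("intercept and slopes (two-variable model):", np.round(coef_two, 1))
```

In this sketch, the education-only model leaves the variation due to experience unexplained, so its R² is much lower than that of the two-variable model. The two-variable fit reports one slope per explanatory variable.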
The second part of this chapter will take us back to looking at categorical data. Up to now, we have created models using only numerical variables. Many of the data sets that we are interested in, however, include categorical data. In the past, to analyze such data, we have been forced to “unstack” the data and make several graphs. One can certainly continue in this fashion, but if there are several different categorical variables of interest, the process becomes time-consuming. As it happens, there is an agreed-upon method for converting categorical data into numerical data by introducing dummy variables. You will learn how to create dummy variables and how to build and interpret regression models that use them. By the end of the chapter, you will have a powerful collection of tools for modeling data. You will be able to represent relationships among several variables, using numerical variables, categorical variables, or a combination of both.
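As a preview of the idea, here is a minimal Python sketch of dummy variables, assuming one common convention: pick one category as the baseline and create a 0/1 column for each remaining category. The tiny data set, the salary rule used to generate it, and the helper name `make_dummies` are all hypothetical and serve only to illustrate the mechanics.

```python
import numpy as np

def make_dummies(values, baseline):
    """Build 0/1 dummy columns, one per category other than the baseline.

    Returns (category names, dummy matrix). Hypothetical helper for illustration.
    """
    categories = [c for c in sorted(set(values)) if c != baseline]
    matrix = np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])
    return categories, matrix

# Made-up data: education level (categorical) and years with the company (numerical).
education = ["HighSchool", "Bachelor", "Master", "Bachelor", "HighSchool", "Master"]
years = np.array([1, 3, 2, 8, 10, 6], dtype=float)

# Made-up rule with no noise, so the regression can recover it exactly.
bump = {"HighSchool": 0.0, "Bachelor": 5000.0, "Master": 12000.0}
salary = np.array([30000.0 + 1000.0 * y + bump[e] for y, e in zip(years, education)])

# "HighSchool" is the baseline, so it gets no column of its own.
categories, dummies = make_dummies(education, baseline="HighSchool")

# Design matrix: intercept, years, then one column per dummy variable.
A = np.column_stack([np.ones(len(years)), years, dummies])
coef, *_ = np.linalg.lstsq(A, salary, rcond=None)

for name, value in zip(["intercept", "years"] + categories, coef):
    print(f"{name:>10}: {value:10.1f}")
```

Each dummy coefficient is read as the estimated difference in salary between that category and the baseline category, with years held fixed. Leaving the baseline out avoids a redundant column, since its membership is already implied when all the other dummies are zero.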
As a result of this chapter, students will learn | As a result of this chapter, students will be able to |
|
|