7.1 Picturing and Quantifying the Relationship Between Two Variables

In many of the previous examples in this book, you have probably been tempted to go too far in your conclusions. For example, suppose you looked at information about the employees at your company and learned that the salaries were negatively skewed and that the ages of the employees were also negatively skewed. You might be tempted to claim that one variable (for instance, age) influences the other variable (in this case, salary).

However, such a claim cannot be justified with the tools we have discussed so far. In fact, the relationship between the two variables could be exactly the opposite of what you claim: it could be that the low salaries are all earned by older employees and that younger employees are making more money. It is even possible that the two variables are entirely unrelated.

All of our tools up to now have analyzed data one variable at a time. In order to speculate about relationships between two or more variables, we need new tools that examine two variables at a time. A graphical tool for this analysis is the scatterplot. This is a two-dimensional graph made up of points, where each point represents a pair of observations, one for each of the two variables you are comparing. A scatterplot lets you quickly spot connections between variables. Such connections are called correlations, and their strength can also be computed numerically with a fairly simple formula based on z-scores.

Consider the employee salary example above. One could speculate that the points representing the salary and age of each employee would show that older employees tend to have higher salaries (after all, they have been working longer, have more experience and have had more opportunities for promotion). If the graph shows this, then there might be a connection between the two variables.
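To make the numerical side of this concrete, here is a minimal sketch of a correlation computed from z-scores. It assumes the common convention in which each z-score uses the sample standard deviation and the products of paired z-scores are summed and divided by n - 1; the formula given later in this chapter may be stated slightly differently. The function name `correlation` and the five age and salary values are made up purely for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation of paired data, computed from z-scores.

    Assumes the sample standard deviation (divisor n - 1) for both
    the z-scores and the final average of their products.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    # Convert each observation to a z-score: distance from the mean in SDs.
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    # Sum the products of paired z-scores and divide by n - 1.
    return sum(a * b for a, b in zip(z_x, z_y)) / (len(xs) - 1)

# Hypothetical data: one (age, salary) pair per employee.
ages = [25, 32, 41, 48, 56]
salaries = [38000, 45000, 52000, 61000, 70000]

r = correlation(ages, salaries)
print(f"correlation between age and salary: {r:.3f}")
```

A value near +1 means the points of the scatterplot fall close to a rising line, a value near -1 means they fall close to a falling line, and a value near 0 means there is no clear linear pattern.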

A word of caution is needed here, and we want to emphasize it as strongly as possible: simply because the correlation between two variables is high does not mean that one variable is causing the changes in the other. Consider the following situation: You are interested in the performance of the stockbrokers at a large investment firm. If you compared the amount of money each broker earned for the firm to the number of cups of coffee that broker drinks each day at work, what would it mean if there were a strong positive correlation? Would it mean that drinking more coffee makes you a better broker? Clearly, this is absurd. What it does mean is that brokers who make more money for the firm also tend to drink more coffee. That's all it means. Why might this be so? There are many possible reasons. It could simply be that the amount of coffee consumed is a surrogate for the number of hours the broker works. More hours worked might lead to more money earned for the firm, and more hours worked will probably also involve drinking more coffee.
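The coffee example can be imitated with a small simulation. The sketch below is only an illustration under invented assumptions: hours worked are drawn at random, and both coffee consumption and money earned for the firm are built from hours plus random noise. Coffee never enters the calculation of earnings, yet the two end up strongly correlated. All of the numbers, and the helper name `pearson_r`, are hypothetical.

```python
import random
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation from z-scores, using the sample standard deviation."""
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)
    products = ((x - mean_x) / sd_x * (y - mean_y) / sd_y
                for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical weekly hours for 200 brokers.
hours = [rng.uniform(30, 70) for _ in range(200)]

# Both variables depend on hours; neither depends on the other.
coffee = [0.1 * h + rng.gauss(0, 0.5) for h in hours]       # cups per day
earned = [2000 * h + rng.gauss(0, 10000) for h in hours]    # dollars for the firm

r_coffee_earned = pearson_r(coffee, earned)
print(f"correlation between coffee and money earned: {r_coffee_earned:.2f}")
```

The strong positive correlation that results says nothing about coffee improving performance. It arises only because hours worked drives both variables.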

For the remainder of this book, we will be dealing with how to represent relationships among variables. Our goal is to develop these relationships into mathematical equations called functions that we can use in our decision-making.

  7.1.1 Definitions and Formulas
  7.1.2 Worked Examples
  7.1.3 Exploration 7A: Predicting the Price of a Home