-
Predicted values (fitted values)
These are the predictions of the y-values made by using the
model equation and the values of the explanatory variables. They are denoted by the
symbol ŷi.
-
Observed values
- These are the actual y-values from the data. They are denoted by the
symbol yi.
-
Residuals
This is the part that is left over after you use the explanatory variables to predict
the y-variable. Each observation has a residual that is not explained by the model
equation. Residuals are denoted by ei and are computed by
ei = yi - ŷi.
Since these are computed from the y-values, it should be clear that the residuals have the
same units as the y, or response, variable.
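As a quick sketch of this computation (the data and the fitted slope and intercept below are made-up values for illustration, not from the text):

```python
# Hypothetical data and fitted line (illustrative values only)
x = [1, 2, 3, 4, 5]                    # explanatory variable
y = [2.1, 3.9, 6.2, 8.1, 9.8]          # observed values yi
b0, b1 = 0.1, 1.96                     # assumed intercept and slope

y_hat = [b0 + b1 * xi for xi in x]     # predicted (fitted) values ŷi
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]   # ei = yi - ŷi
```

Each residual carries the same units as y, since it is just an observed y-value minus a predicted y-value.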
-
Total Variation (Total Sum of Squares, SST)
The total variation in a variable is the sum
of the squares of the deviations from the mean. Thus, the total variation in y
is
SST = Σ(yi - ȳ)².
-
Unexplained variation (Sum of Squares of Residuals, SSR)
The variation in y that is
unexplained is the sum of the squares of the residuals:
SSR = Σei² = Σ(yi - ŷi)².
-
Explained variation (Sum of Squares Explained, SSE)
The total variation in y is
composed of two parts: the part that can be explained by the model, and the part that
cannot be explained by the model. The amount of variation that is explained
is
SSE = Σ(ŷi - ȳ)².
-
Regression Identity
One will note that the Total Variation is equal to the sum of the
Unexplained Variation and the Explained Variation:
SST = SSE + SSR.
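This identity holds whenever the line is fit by least squares (with an intercept). A small numerical check, using made-up data and the usual least-squares formulas:

```python
# Verify SST = SSE + SSR for a least-squares fit (hypothetical data)
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)                   # total variation
ssr = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # unexplained
sse = sum((yhi - ybar) ** 2 for yhi in y_hat)             # explained
```

Up to floating-point rounding, sst equals ssr + sse.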
-
Coefficient of Determination (R²)
- This is a measure of the "goodness of fit" for a regression
equation. It is also referred to as R-squared (R²) and for simple regression models it is the
square of the correlation between the x- and y-variables. R² is really the percentage of the
total variation in the y-variable that is explained by the x-variable. You can compute R²
yourself with the formula
R² = SSE/SST = 1 - SSR/SST.
R² is always a number between 0 and 1. The closer to 1 the number is, the more confident
you can be that the data really does follow a linear pattern. For data that falls
exactly on a straight line, the residuals are all zero, so you are left with R² =
1.
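A sketch showing both routes to R² with made-up data: from the sums of squares, and as the squared correlation between x and y (the two agree for a simple least-squares fit):

```python
import math

# Hypothetical data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)
ssr = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))

r2 = 1 - ssr / sst                     # R² from the sums of squares

# Squared correlation between x and y gives the same number
r = sxy / math.sqrt(sxx * sst)
```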
-
Degrees of Freedom for a linear model
The degrees of freedom for any calculation are the
number of data points left over after you account for the fact that you are estimating certain
quantities of the population based on the sample data. You start with one degree of
freedom for each observation. Then you lose one for each population parameter
you estimate. Thus, in the sample standard deviation, one degree of freedom is
lost for estimating the mean. This leaves you with n - 1. For a linear model, we
estimate the slope and y-intercept, so we lose two degrees of freedom, leaving
n - 2.
-
Standard Error of Estimate (Se)
This is a measure of the accuracy of the model for making
predictions. Essentially, it is the standard deviation of the residuals, except that there are
two population parameters estimated in the model (the slope and y-intercept of the
regression equation), so the number of degrees of freedom is n - 2, rather than the normal
n - 1 for standard deviation:
Se = √(SSR/(n - 2)) = √(Σei²/(n - 2)).
The standard error of estimate can be interpreted as a standard deviation. This means that,
if the residuals are roughly normally distributed, about 68% of the predictions will fall
within one Se of the actual data, 95% within two, and 99.7% within three. And since the
standard error is basically the standard deviation of the residuals, it has the same units as
the residuals, which are the same as the units of the response variable, y.
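A sketch of the computation, using made-up residuals from a fit of n = 5 points; note the divisor is n - 2, not n - 1:

```python
import math

# Hypothetical residuals from a fitted line (illustrative values only)
residuals = [0.04, -0.12, 0.22, 0.16, -0.10]
n = len(residuals)

ssr = sum(e ** 2 for e in residuals)   # sum of squared residuals
se = math.sqrt(ssr / (n - 2))          # two parameters estimated, so n - 2
```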
-
Fitted values vs. Actual values
- This is one of the most useful of the diagnostic graphs that
most statistical packages produce when you perform regression. This graph plots the points
(yi, ŷi). If the model is perfect (R² = 1) then you will have y1 = ŷ1, y2 = ŷ2, and so on, so
that the graph will be a set of points on a perfectly straight line with a slope of 1
and a y-intercept of 0. The further the points on the fitted vs. actual graph are
from a slope of 1, the worse the model is and the lower the value of R² for the
model.
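A tiny sanity check of the perfect-fit case, with made-up data lying exactly on the line y = 2x + 1: every fitted value equals its observed value, so the points (yi, ŷi) sit on the line of slope 1 through the origin.

```python
# Hypothetical data falling exactly on a line (illustrative values only)
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]                      # y = 2x + 1 exactly
b0, b1 = 1, 2                         # intercept and slope of that line

y_hat = [b0 + b1 * xi for xi in x]    # fitted values ŷi
# Every point (yi, ŷi) has yi == ŷi, i.e. it lies on the slope-1 line
```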
-
Residuals vs. Fitted values
- This graph is also useful in determining the quality of the
model. It is a scatterplot of the points (ŷi, ei) = (ŷi, yi - ŷi) and shows the errors
(the residuals) in the model graphed against the predicted values. For a good
model, this graph should show a random scattering of points that is normally
distributed around zero. If you draw horizontal lines indicating one standard error
from zero, two standard errors from zero, and so forth, you should be able to get
roughly 68% of the points in the first band, 95% within the first two bands, and so
forth.
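With only a handful of points the percentages will be rough, but the band check itself is simple to carry out (the residuals below are made-up values):

```python
import math

# Hypothetical residuals (illustrative values only)
residuals = [0.04, -0.12, 0.22, 0.16, -0.10]
n = len(residuals)

# Standard error of estimate, dividing by n - 2
se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Fraction of residuals inside the ±1 Se band around zero
within_one = sum(1 for e in residuals if abs(e) <= se) / n
```

Here 4 of the 5 residuals fall inside ±1 Se; for a large, well-behaved sample this fraction should settle near 68%.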