8.2.1 Definitions and Formulas

Predicted values (fitted values)
These are the predictions of the y-values obtained by plugging the values of the explanatory variables into the model equation. They are denoted by the symbol ŷi.
Observed values
These are the actual y-values from the data. They are denoted by the symbol yi.
Residuals
This is the part that is left over after you use the explanatory variables to predict the y-variable. Each observation has a residual: the part of that observation that is not explained by the model equation. Residuals are denoted by ei and are computed by
e_i = y_i - \hat{y}_i

Since these are computed from the y values, it should be clear that the residuals have the same units as the y, or response, variable.
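For instance, the following short sketch in Python (using NumPy, on a small made-up data set) computes the fitted values and residuals for a least-squares line:

    import numpy as np

    # Hypothetical data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit the least-squares line y-hat = b0 + b1*x
    b1, b0 = np.polyfit(x, y, 1)   # polyfit returns highest power first
    y_hat = b0 + b1 * x            # predicted (fitted) values
    e = y - y_hat                  # residuals, same units as y
    print(e)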

Total Variation (Total Sum of Squares, SST)
The total variation in a variable is the sum of the squares of the deviations from the mean. Thus, the total variation in y is
SST = \sum (y_i - \bar{y})^2

Unexplained variation (Sum of Squares of Residuals, SSR)
The variation in y that is unexplained is the sum of the squares of the residuals:
SSR = \sum (y_i - \hat{y}_i)^2

Explained variation (Sum of Squares Explained, SSE)
The total variation in y is composed of two parts: the part that can be explained by the model, and the part that cannot be explained by the model. The amount of variation that is explained is
SSE = \text{Total Variation} - \text{Unexplained Variation} = \sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2

Regression Identity
One will note that the Total Variation is equal to the sum of the Unexplained Variation and the Explained Variation.
SST  = SSR  + SSE
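This identity can be checked numerically. Here is a minimal Python sketch, again on made-up data, that computes all three sums of squares and confirms that SST = SSR + SSE:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x

    SST = np.sum((y - np.mean(y))**2)   # total variation
    SSR = np.sum((y - y_hat)**2)        # unexplained variation
    SSE = SST - SSR                     # explained variation
    print(SST, SSR + SSE)               # the two agree, up to rounding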

Coefficient of Determination (R2)
This is a measure of the "goodness of fit" for a regression equation. It is also referred to as R-squared (R²), and for simple regression models it is the square of the correlation between the x- and y-variables. R² is the proportion (often quoted as a percentage) of the total variation in the y-variable that is explained by the x-variable. You can compute R² yourself with the formula

R^2 = \frac{\text{Total Variation} - \text{Sum of Squares of Residuals}}{\text{Total Variation}} = \frac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = \frac{SST - SSR}{SST} = \frac{SSE}{SST}

R² is always a number between 0 and 1. The closer it is to 1, the more confident you can be that the data really do follow a linear pattern. For data that fall exactly on a straight line, the residuals are all zero, so you are left with R² = 1.
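A minimal sketch of the computation in Python, on the same made-up data as above; it also confirms that R² equals the square of the correlation for a simple linear model:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x

    SST = np.sum((y - np.mean(y))**2)
    SSR = np.sum((y - y_hat)**2)
    R2 = (SST - SSR) / SST
    r = np.corrcoef(x, y)[0, 1]   # correlation between x and y
    print(R2, r**2)               # equal for a simple linear model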

Degrees of Freedom for a linear model
The degrees of freedom for any calculation are the number of data points left over after you account for the fact that you are estimating certain quantities of the population based on the sample data. You start with one degree of freedom for each observation. Then you lose one for each population parameter you estimate. Thus, in the sample standard deviation, one degree of freedom is lost for estimating the mean, leaving you with n - 1. For a linear model, we estimate the slope and y-intercept, so we lose two degrees of freedom, leaving n - 2.
Standard Error of Estimate (Se)
This is a measure of the accuracy of the model for making predictions. Essentially, it is the standard deviation of the residuals, except that two population parameters are estimated in the model (the slope and y-intercept of the regression equation), so the number of degrees of freedom is n - 2 rather than the usual n - 1 for a standard deviation.
S_e = \sqrt{\frac{\sum e_i^2}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{SSR}{n - 2}}

The standard error of estimate can be interpreted as a standard deviation. This means that roughly 68% of the predictions will fall within one Se of the actual data, 95% within two, and 99.7% within three. And since the standard error is basically the standard deviation of the residuals, it has the same units as the residuals, which are the same as the units of the response variable, y.
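A minimal Python sketch of the computation, on the same made-up data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1, b0 = np.polyfit(x, y, 1)
    e = y - (b0 + b1 * x)          # residuals

    n = len(y)
    Se = np.sqrt(np.sum(e**2) / (n - 2))   # n - 2: slope and intercept estimated
    print(Se)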

Fitted values vs. Actual values
This is one of the most useful of the diagnostic graphs that most statistical packages produce when you perform regression. This graph plots the points (yi, ŷi). If the model is perfect (R² = 1), then you will have y1 = ŷ1, y2 = ŷ2, and so on, so that the graph will be a set of points on a perfectly straight line with a slope of 1 and a y-intercept of 0. The further the points on the fitted vs. actual graph are from that line of slope 1, the worse the model is and the lower the value of R² for the model.
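A graph like this can be produced with a few lines of matplotlib (the data are made up for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x

    plt.scatter(y, y_hat)                 # the points (y_i, y-hat_i)
    lims = [y.min(), y.max()]
    plt.plot(lims, lims)                  # reference line of slope 1, intercept 0
    plt.xlabel("actual y")
    plt.ylabel("fitted y-hat")
    plt.show()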
Residuals vs. Fitted values
This graph is also useful in determining the quality of the model. It is a scatterplot of the points (ŷi, ei) = (ŷi, yi - ŷi) and shows the errors (the residuals) in the model graphed against the predicted values. For a good model, this graph should show a random scattering of points that is roughly normally distributed around zero. If you draw horizontal lines indicating one standard error from zero, two standard errors from zero, and so forth, you should be able to get roughly 68% of the points in the first band, 95% in the first two bands, and so forth.
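A sketch of this graph in matplotlib, with dashed horizontal lines at zero, one Se, and two Se on either side (again on made-up data):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    e = y - y_hat
    Se = np.sqrt(np.sum(e**2) / (len(y) - 2))

    plt.scatter(y_hat, e)                 # the points (y-hat_i, e_i)
    for k in (-2, -1, 0, 1, 2):
        plt.axhline(k * Se, linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()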


[Figure 8.2: Sample initial data from which a regression line can be computed.]



[Figure 8.3: The various quantities involved in regression that are discussed above.]