5/31/12

(AD.B.3) Basic regression: predictions, prediction-errors, and p-values

Objective: To begin using regressions for prediction and understanding of how one variable affects another.

Back to Softball  

In a previous article we learned why an average hitting distance is a good predictor of people's hitting ability. The average is good because it minimizes the sum of squared prediction errors (SSE), which was how we defined accuracy. Then we learned that the SSE can be made even smaller by calculating one average for males and one average for females, and using those gender-specific averages to predict hitting distances.

Figure 1—Recalling the Softball Data

There is another piece of information in our softball data (shown above) which we haven't used: experience playing on a ball team. It would seem advantageous to use information on both gender and experience to predict hitting distances, and we will do just that, using the equation below.

[Equation 1] predicted distance = a0 + a1(male) + a2(experience)

As always, we want predictions which minimize SSE, and that is exactly what regression does. Using Excel, we can estimate a regression where the coefficients a0, a1, and a2 are chosen by Excel to minimize the sum of squared prediction errors. If you are unsure how to estimate this regression in Excel, consult the video demonstration below.

Video 1—Regression Demonstration

As Video 1 demonstrates, the regression estimate is shown in Equation 2, below.

[Equation 2] predicted distance = 75.31 + 67.71(male) + 3.45(experience)
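For readers curious about what "choosing coefficients to minimize SSE" looks like outside Excel, here is a minimal sketch in Python using NumPy's least-squares solver. The six hitters are hypothetical and are not the softball data from Figure 1; their distances are generated, without noise, from Equation 2 so that the solver should hand back the same three coefficients. The names `X_raw`, `distance`, and `sse` are made up for this illustration.

```python
import numpy as np

# Hypothetical hitters (NOT the softball data): columns are male, experience.
X_raw = np.array([
    [0, 0],
    [0, 3],
    [0, 8],
    [1, 0],
    [1, 5],
    [1, 10],
], dtype=float)

# Noise-free distances generated from Equation 2, for illustration only.
distance = 75.31 + 67.71 * X_raw[:, 0] + 3.45 * X_raw[:, 1]

# Add a column of ones so the regression can estimate the intercept a0.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Least squares picks the coefficients that minimize the SSE.
coef, *_ = np.linalg.lstsq(X, distance, rcond=None)
a0, a1, a2 = coef


def sse(coefficients):
    """Sum of squared prediction errors for a given set of coefficients."""
    errors = distance - X @ coefficients
    return float(np.sum(errors ** 2))


print("a0, a1, a2 =", round(a0, 2), round(a1, 2), round(a2, 2))
print("SSE at the estimate:", sse(coef))
print("SSE with the intercept moved by 1:", sse(coef + np.array([1.0, 0.0, 0.0])))
```

Because the made-up distances contain no noise, the SSE at the estimate is essentially zero, and moving any coefficient away from the estimate makes the SSE larger. With real data the minimum SSE would be positive, but the estimate would still be the set of coefficients that makes it as small as possible.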

This equation can predict the hitting distance for either gender and for any number of years of ball experience. A female with no experience is expected to hit about 75.31 feet, which equals the intercept: both male and experience are zero, leaving nothing but the intercept in the equation. Compare this to a male with no experience, who is expected to hit about 143.02 feet. Moving from a female to a male (both with no experience) raises the predicted distance by exactly 67.71 feet, the coefficient for male. This means a male is expected to hit 67.71 feet farther than a female whenever the two have the same ball experience. Finally, a male with five years of experience is expected to hit about 160.27 feet.

Table 1—Regression Predictions

Person   male   experience   predicted distance (feet)
1        0      0            75.31 + 67.71(0) + 3.45(0) = 75.31
2        1      0            75.31 + 67.71(1) + 3.45(0) = 143.02
3        1      5            75.31 + 67.71(1) + 3.45(5) = 160.27
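The arithmetic in Table 1 can also be reproduced with a few lines of Python. This is only a sketch of Equation 2 as a function; the name `predicted_distance` and its parameter names are chosen here for illustration and do not come from the Excel output.

```python
def predicted_distance(male, experience, a0=75.31, a1=67.71, a2=3.45):
    """Equation 2: predicted hitting distance in feet.

    male is 1 for a male hitter and 0 for a female hitter;
    experience is years of experience playing on a ball team.
    """
    return a0 + a1 * male + a2 * experience


# The three hitters from Table 1, as (male, experience) pairs.
table_1_hitters = [(0, 0), (1, 0), (1, 5)]

for male, experience in table_1_hitters:
    feet = predicted_distance(male, experience)
    print(f"male={male}, experience={experience}: {feet:.2f} feet")
```

Changing the two arguments gives a prediction for any other hitter, for example `predicted_distance(0, 5)` for a female with five years of experience.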

From this regression we can determine the impact of each explanatory variable (male and experience) on hitting distance. This is a two-step process: the first step will be mysterious for the present but the second step will be obvious. Consider first the impact of gender: how does hitting distance change when the hitter changes from a female to a male, holding all other variables constant? The phrase "holding all other variables constant" can be succinctly stated in Latin as ceteris paribus, and I will use this phrase frequently. Here it means we compare a male to a female when they both have the same experience.

  1. First, we ask whether the variable is statistically significant. A regression will almost never give you a coefficient of exactly 0, even when the real effect of the variable is zero, so we need a rule to decide whether that is the case. While it will be elaborated upon later, for now the rule is: whenever the p-value for a coefficient is less than 0.05, we say the coefficient is not really zero and should be taken seriously. As shown in the regression output below, this is the case for both male and experience, so we say they are statistically significant and take their coefficients seriously.
  2. Given that an explanatory variable is statistically significant, its coefficient is the expected change in the dependent variable when that explanatory variable increases by one unit, ceteris paribus. The coefficient for male is 67.71, meaning predicted distance increases by 67.71 feet when male increases from 0 to 1: on average, males hit 67.71 feet farther than females with the same experience. The coefficient of 3.45 for experience shows that each additional year of experience raises one's hitting distance by about 3.45 feet. Consequently, when experience rises from 0 to 5 in Table 1, distance rises by (3.45)(5) = 17.25 feet.
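The two steps can be written down as a small Python sketch. The function names `is_significant` and `expected_change` are invented for this illustration, and the p-values passed in below are placeholders, not the values from the regression output in Figure 2.

```python
ALPHA = 0.05  # the cutoff used in step 1


def is_significant(p_value, alpha=ALPHA):
    """Step 1: treat a coefficient as 'not really zero' when its p-value is below alpha."""
    return p_value < alpha


def expected_change(coefficient, units):
    """Step 2: change in predicted distance when a variable rises by `units`, ceteris paribus."""
    return coefficient * units


# Placeholder p-values, only to show how the rule behaves.
for p_value in (0.01, 0.20):
    print(f"p-value {p_value}: significant? {is_significant(p_value)}")

# Coefficients from Equation 2.
print("male from 0 to 1:", expected_change(67.71, 1), "feet")
print("experience from 0 to 5:", round(expected_change(3.45, 5), 2), "feet")
```

Step 2 is only meaningful after step 1: if a coefficient failed the p-value rule, we would not take the computed change seriously.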

Figure 2—Regression Output
from regression predicted distance = a0 + a1(male) + a2(experience)

Graphing Regression Predictions