
(AD.B.7) Basic regression: holding variables constant

Objective: To understand how regression "controls" for certain variables and what happens when important variables are missing in a regression.

Revisiting the gender gap  

Several times these articles have noted that females tend to hit the ball shorter distances and earn less money than males. Lest this be misconstrued as sexism, the female reader should be pleased to know that when single males and females with equal years of experience and background are compared, females earn more money. Note how important it was to compare both genders on a level playing field. To compare a female who took ten years off from work to raise a family with a male of the same age who continued working would be unfair. While the female was caring for the children, the male was acquiring technical knowledge, experience, and contacts. The female's job was probably more important, but the male's job paid better.

Suppose we took these data on salaries earned by agricultural graduates from Kansas State University and estimated the following regression.

[Equation 1] predicted salary = a0 + a1(female)

The individuals in these data have a degree in either agricultural economics, agronomy, animal science, or baking science, and none has a graduate degree. The coefficient a1 tells us the amount by which the average female salary is higher or lower than the average male salary, but since there are no other explanatory variables, this is an unfair comparison, as it does not account for the individuals' work experience. After estimating this regression we find the following.

[Equation 2] predicted salary = 52,216 - 13,543(female)
[p-values]                      (0.00)   (0.00)
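For readers who want to reproduce this kind of estimate, a minimal sketch in Python follows. The file and column names (ksu_salaries.csv, salary, female) are assumptions; the actual Kansas State data are not distributed with this article, so the numbers will differ.

# A sketch of estimating Equation 2 by ordinary least squares.
# "ksu_salaries.csv", "salary", and "female" are hypothetical names.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("ksu_salaries.csv")
model2 = smf.ols("salary ~ female", data=data).fit()
print(model2.summary())   # intercept is a0; the female coefficient is a1, with its p-value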

The low p-value on female signifies that males and females do make different average salaries, and the coefficient of -13,543 says that females make almost $14,000 less on average. To what extent does the a1 coefficient reflect gender differences, and to what extent does it reflect differences in levels of experience? What we want is a regression that can compare the average salaries of males and females who have the same level of experience. To achieve this, we simply add experience as an explanatory variable to the regression.

[Equation 3] predicted salary = 36,841 - 10,103(female) + 1,447(experience)
[p-values]                      (0.00)   (0.00)           (0.03)

Now females make only about $10,000 less; the salary discount associated with being female fell by more than $3,000 once we took experience into account. However, the fact that the female coefficient is still large (in absolute value) means there is still much to learn. Why are females making $10,000 less? Ideally, we could acquire a host of other variables to include in the regression. If we could account for every conceivable explanation for why females are valued less by employers, and could include those explanations as explanatory variables, what would be left is a coefficient indicating the extent to which gender discrimination occurs. It is one thing to pay a woman less because she has less experience; it is quite another to do so simply because she is a woman.
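Under the same hypothetical file and column names as before, a sketch of Equation 3 shows how little changes when experience is added as a second explanatory variable.

# Equation 3: adding experience, so the female coefficient compares
# men and women with the same years of experience.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("ksu_salaries.csv")    # hypothetical file
model3 = smf.ols("salary ~ female + experience", data=data).fit()
print(model3.params["female"])            # salary gap, experience held constant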

Fortunately, our data do have some extra variables we may include. Perhaps females choose different majors than males do, and in particular choose majors associated with lower salaries? To investigate this we estimate the following regression.

[Equation 4] predicted salary = a0 + a1(female) + a2(experience) + a3(agecon) + a4(agronomy) + a5(ansi)

In Equation 4 the regression now includes dummy variables for those with agricultural economics degrees (agecon), agronomy degrees (agronomy), and animal science degrees (ansi). Some of the individuals have a degree in baking and milling science, and they are identified by the cases where agecon = agronomy = ansi = 0. When an array of dummy variables encompasses all the possibilities, one must be excluded from the regression because it is accounted for in the intercept a0. This is why we use either a female or a male dummy variable, but never both.
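A sketch of Equation 4, assuming the degree is stored in a single hypothetical text column named major with values such as "agecon", "agronomy", "ansi", and "baking", illustrates how the excluded category is left to the intercept.

# Equation 4: degree dummies with baking science as the excluded category.
# Treatment coding drops the reference level ("baking"), so the intercept a0
# absorbs the baking and milling science graduates.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("ksu_salaries.csv")    # hypothetical file
model4 = smf.ols(
    "salary ~ female + experience + C(major, Treatment(reference='baking'))",
    data=data,
).fit()
print(model4.summary())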

In the estimate of Equation 4, shown below, the p-values for all the variables corresponding to type of degree are large, suggesting the choice of degree is not correlated with salary. If the choice of major has no impact on salary, one cannot say females are paid less because they choose less lucrative degrees. Indeed, the coefficient a1 barely changes from Equation 3 to Equation 4.

Figure 1—Regression Explaining Salaries

Disentangling two factors  

There are many cases where one uses a regression to predict a variable that several different factors either cause or are correlated with: how can one disentangle the impact of two explanatory variables on a dependent variable? How does one determine the impact of a larger police force on the crime rate when, as the police force grows, the prison population also grows, resulting in longer sentences for convicted felons? Regression is specifically designed to do just that. By including an array of explanatory variables in a regression, one can evaluate the impact of any one explanatory variable, holding the other variables constant.
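The simulated sketch below (invented data, not the article's) makes the point concrete: when the simulated women average less experience, omitting experience from the regression exaggerates the female salary gap, while including it recovers the gap holding experience constant.

# Simulated illustration of "holding a variable constant."
# The true model pays women 5,000 less and pays 1,500 per year of
# experience, and women average three fewer years of experience.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
female = rng.integers(0, 2, n).astype(float)
experience = rng.normal(10 - 3 * female, 2)
salary = 30_000 - 5_000 * female + 1_500 * experience + rng.normal(0, 3_000, n)

naive = sm.OLS(salary, sm.add_constant(female)).fit()
controlled = sm.OLS(salary, sm.add_constant(np.column_stack([female, experience]))).fit()
print(naive.params[1])        # gap ignoring experience: roughly -9,500
print(controlled.params[1])   # gap holding experience constant: roughly -5,000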

This was demonstrated nicely in an interview on The Daily Show. Watch it, and pay attention to how often Dr. Levitt says something like "holding these things constant," and to how he says he uses regression in his studies.

Video 1—Daily Show's Jon Stewart Interviews Economist Steven Levitt
(must use Internet Explorer)