
(AD.B.6) Basic regression: understanding prediction-errors

Objective: To interpret regression predictions and prediction-errors, with emphasis on the difference between in-sample and out-of-sample prediction-errors.

Amazing hits, both expected and unexpected  

A regression for predicting how far people will hit a softball was estimated from these data, resulting in the regression below. One could go into the data and find people who hit surprisingly far and people who hit an embarrassingly short distance. I was the pitcher for most of these hits, and often I could tell whether someone was a good hitter before they swung the bat. Sometimes, though, people who looked like sluggers couldn't hit it past the pitcher's mound, and some whom I would never pick for my team hit it further than I could. There are times when you want to identify people who are expected to behave a certain way, but there are also times when you want to find the unexpected. Regression is ideal for both cases.

[Equation 1] predicted distance = 75.31 + 67.71(male) + 3.45(experience)

Predicting performance 

This regression was estimated from over six hundred observations. If we use the regression to predict the performance of the people in the data, it becomes obvious that while a regression minimizes the sum-of-squared prediction-errors in these data, the predictions are almost never perfectly accurate. Follow how I used the regression to generate predictions, as it will also demonstrate how you name a cell so that it can be used in a formula.

The first thing I do is create a sheet containing both the original softball data and the regression output, as shown below.

Figure 1—The Softball Data and Regression Output Together

Then I want to create a column containing the regression's predicted hitting distance for each observation, and this is particularly easy if I give the cells containing the coefficients names. Figure 2 shows how I name the cell containing the intercept coefa0: I select the intercept, then select the Name Box in the upper-left corner and type "coefa0". Similarly, I name the coefficients for male and experience coefa1 and coefa2, respectively.

Figure 2—Naming Cells in Excel

Next we create a new column for predictions, where we enter the regression formula (using the cell names) to predict the hitting distance for each observation, as shown below. An adjacent column for prediction-errors (also called residuals) is made as well. Remember, the prediction-error always equals the dependent variable's true value minus its prediction.

Figure 3—Entering Predictions
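If you would rather check the spreadsheet's arithmetic outside Excel, the short Python sketch below computes the same two columns. The file name softball.csv and the column names distance, male, and experience are my assumptions for illustration, not the book's actual file.

    import pandas as pd

    # Hypothetical file holding the 616 observations
    df = pd.read_csv("softball.csv")

    # Coefficients copied from Equation 1
    a0, a1, a2 = 75.31, 67.71, 3.45

    # Predicted distance and prediction-error (residual) for every observation
    df["predicted"] = a0 + a1 * df["male"] + a2 * df["experience"]
    df["residual"] = df["distance"] - df["predicted"]  # actual minus predicted

    print(df.head())

The head of the resulting table should match the first rows shown in Figure 3.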

Figure 3 does not show all 616 observations, but let us first look at some of the great hits. The third longest hit (actual hit, not predicted) was 328 feet, by a male with eighteen years of experience. This is not surprising, as it suggests someone who has played ball most of his life (remember, these are college students). Moreover, the regression predicted this person would hit further than anyone else. That is, the males with around eighteen years of experience hit 205 feet on average. This is a conditional average, because it is conditional on a certain gender and amount of experience, and remember that all regression predictions are basically conditional averages.
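To see where the 205 feet comes from, plug his values into Equation 1: predicted distance = 75.31 + 67.71(1) + 3.45(18) = 205.12 feet.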

Although this hit of 328 feet is impressive, it is also expected, given his gender and level of experience. Should we praise him for his hit, if it was expected of someone like him? We will return to that question shortly.

One person hit the ball only eleven feet; it didn't even reach the pitcher. This was the eighth shortest hit, and because it came from a female with no experience, this person was expected to hit one of the shortest distances. A hit of eleven feet is almost laughable, but it was expected, so why should we laugh?
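Equation 1 shows why the prediction is so low: for a female with no experience, the predicted distance is just the intercept, 75.31 + 67.71(0) + 3.45(0) = 75.31 feet.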

Special observations can be identified by a large or small prediction-error 

Prediction-errors are not just indicators of regression performance. They also tell us something about the person. Suppose an unathletic female approaches the plate, holding the bat clumsily, obviously frightened of being laughed at. She pushes her glasses closer to her eyes with a single finger and looks at the pitcher as if she does not know how the ball will approach her. The pitch is a little off, and no one thinks she will be able to hit it even if she tries, but she does hit it, and slams it 350 feet down the field!

Prediction-errors can be used to identify such unusually talented hitters. We can find them by calculating the prediction-errors and looking for the largest value. This value is 200.87, meaning the person hit about 200 feet further than the regression predicted. The hit is particularly spectacular because it defied all expectations. This person is special. Who is this person? A female with fifteen years of experience, and while one would expect even a female with much experience to perform well (the regression predicted 127 feet), no one expected her to hit 328 feet.

Let me be more precise, because this is the key to understanding regression. Within the data, females with around fifteen years of experience hit 127 feet, on average. Yet she hit 328 feet, making her far above average for females who played the same amount of ball.
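Again, Equation 1 makes the numbers concrete: predicted distance = 75.31 + 67.71(0) + 3.45(15) = 127.06 feet, and 127.06 + 200.87 ≈ 328 feet, her actual hit.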

Just as some hitters were above average, there are below-average hitters. In particular, one male with ten years of experience stepped up to the plate and hit only three feet! The regression predicted 177 feet, because the average for males with similar experience was 177 feet, but he hit only three feet. This person may be a particularly bad hitter. Perhaps he had poor eyesight but forgot to wear his contacts? Or it may have just been a particularly bad hit from someone who is nevertheless a good hitter. Even a pro may just graze the ball, sending it straight up in the air to fall right in front of him. The point is that large negative prediction-errors designate hits that were not just poor, but poor relative to what was expected.
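Continuing the hypothetical Python sketch from earlier, one line each finds the most surprising good hit and the most surprising bad hit:

    # Row with the largest residual: a hit far above its conditional average
    print(df.loc[df["residual"].idxmax()])

    # Row with the smallest (most negative) residual: far below its conditional average
    print(df.loc[df["residual"].idxmin()])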

Seeking the extraordinary (and "holding variables constant")

Suppose you wanted to write a book about what separates successful people from their less successful counterparts. You are not interested in the super-rich, but in ordinary people who did fairly well for themselves. The idea is that it is difficult to interview the super-rich for clues about how other people can become equally rich; you would end up making unhelpful suggestions like "start your own social network with a million members". Instead, you want to discover the personalities and backgrounds of the top portion of the middle class. For example, perhaps the managers of a factory tend to be organized, or able to focus, or perhaps they have supportive spouses who enable them to work hard. These are useful hints.

How do you find such people? One obvious answer is to identify people who make large incomes, but this will tend to include only older people who have worked for decades in a field. While you wish to interview those people, you also wish to interview young people who, given their age, have done well. Or perhaps you want to identify the more successful females, but females do not make as much money because they do not have as much experience as males, as many took time off to raise families.

Ideally, you would identify people who, given their work experience and gender, perform above average. Regressions are ideal for such exercises. Our data contain information about graduates of the agricultural economics department at Kansas State University who do not have graduate degrees. The data were collected in 1997, so they are somewhat dated but still serviceable. They contain each graduate's yearly salary, years of work experience, and a dummy variable equal to one for females. Salary is expected to rise with work experience and (unfortunately) fall for females. Take heart, females: more recent data show that when you compare young single males and females with similar levels of experience, females make more! If we estimate a regression predicting salary as a function of experience and gender, we arrive at the following equation.

[Equation 2] predicted salary = 39,967.3398 + 1,205.1795(experience) - 10,000.7799(female)
[p-values]                       (0.00)        (0.00)                   (0.03)

All variables are statistically significant, suggesting they are genuinely correlated with salaries. Each additional year of work experience is associated with a $1,205 increase in salary (on average), and being female is associated with a $10,000 reduction in salary.

It must be noted that the salary discount of $10,000 is not due to less work experience on the part of females, because experience is "accounted for" or "held constant" by the experience variable. That is, whether the female and male being compared both have no work experience, five years, or twenty years (it doesn't matter, so long as they have identical years of experience), females are still expected to make about ten thousand dollars less.
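For example, pick any level of experience, say five years (my choice; any number works). Using the rounded coefficients from Equation 2:

predicted salary (male, 5 years) = 39,967.34 + 1,205.18(5) = 45,993.24
predicted salary (female, 5 years) = 39,967.34 + 1,205.18(5) - 10,000.78 = 35,992.46

The gap is 10,000.78 no matter what experience level you plug in, because experience enters both predictions identically.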

Do not immediately assume that being female "caused" salaries to fall, though gender discrimination certainly exists in some areas (and, at organizations seeking more diversity, it may even favor females). This salary discount is probably "caused" by a host of variables. Perhaps an employer believes mothers with children have less time to devote to the job than fathers with children. I don't know, but what I do know is that when you start adding lots of other variables besides experience, thereby enabling you to hold many more things "constant", the salary discount disappears.

Regardless, the salary discount for females does exist in our regression, and we must accept it whether we like it or not.

Recall the purpose of this exercise: to find people who make large incomes, given their experience and gender. In essence, we are looking for people with large prediction-errors, where the actual salary is much larger than the predicted salary. Using this regression, I predicted the salary for each person in the data, which gives the conditional average for that person (the average salary for people of that gender and years of experience). Then I calculated the prediction-error by subtracting the predicted salary from the actual salary and searched for the largest prediction-error. The figure below shows how I performed the work and the answer I found.

One male with nine years of experience made $150,000 when other males with the same experience made only around $50,814 on average, so his prediction-error of $99,186 is the largest in the sample. There is something special about this person that allows him to make almost three times what one would expect, if all one knew was his work experience and gender.

The female with the largest prediction-error (who had twelve years of experience) made $62,500 when she was predicted to make $44,428. Her prediction-error of $18,072 is relatively low compared to the males' largest prediction-error. It would seem there is something about the market for ag econ / ag business graduates that propels exceptional males into high positions but might hold females back. Or it could be a sample-size problem: with only 28 females but 269 males, we are able to learn far more about males than about females.

Figure 4—Studying Superlative Salaries (given gender and experience)
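For readers who prefer to replicate Figure 4 outside Excel, here is a minimal Python sketch. The file salaries.csv and its column names (salary, experience, female) are assumptions for illustration.

    import pandas as pd

    sal = pd.read_csv("salaries.csv")  # hypothetical file with the 297 graduates

    # Coefficients copied from Equation 2 (rounded)
    b0, b1, b2 = 39967.34, 1205.18, -10000.78

    # Conditional average (predicted salary) and prediction-error for each person
    sal["predicted"] = b0 + b1 * sal["experience"] + b2 * sal["female"]
    sal["residual"] = sal["salary"] - sal["predicted"]

    # Largest prediction-error within each gender
    print(sal.loc[sal.groupby("female")["residual"].idxmax()])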

Out-of-sample predictions:  how John Paulson made billions of dollars

The previous examples used a set of data to estimate a regression and then used that regression to predict the same data from which it was derived. This is referred to as in-sample prediction. Much of the time, though, regressions are used to predict observations outside the estimation sample, called out-of-sample predictions. In these cases, both the predictions and the prediction-errors can be particularly useful.

Consider the case of how John Paulson made billions of dollars using a simple regression and basic economics. It is 2007, and investor John Paulson notices the rise in house prices over the previous five years and wonders whether it is a price bubble, where people bid up the price not because they think the good is more valuable but simply because they see prices rising and assume they will continue to rise. A good's price cannot exceed its true value for long, and when the market realizes the good is over-priced, the bubble "bursts" and the price plummets.

Paulson conducts a thought experiment: suppose it is the year 2000, and we look at how house prices trended in the years before then. What does that trend suggest the real value of houses should be in 2007? Essentially, he used a regression like the one below, where time is a "time trend" variable that increases by one unit with every month that passes.

[Equation 3] predicted house price = a0 + a1(time)

We will replicate John Paulson's thought experiment using data on house prices across the U.S. These data measure house prices as an index, where a 3% increase in the index denotes a 3% increase in house prices. Shown below, the first observation occurs in January 1987, where the variable time is set equal to one. As each month goes by, time increases by one unit. This is a case where the explanatory variable is constructed by the researcher, specifically to document the passing of time numerically.

Figure 5—Index of U.S. House Prices (1987-2011)

Remember, first we want to document the trend in house prices until the year 2000, which means we want to estimate the regression using only data up to December 1999 (when time equals 156). Doing so results in the following regression. The time-trend variable is statistically significant (its p-value is less than 0.05), so house prices were trending upwards over time. Before the 2000s, home prices seemed to increase moderately, but they were certainly in no bubble. Paulson thus assumes Equation 4 represents how the true value of homes should trend over time, and unless something changes to suddenly make homes more valuable, an actual price index far above what the regression predicts is a prediction-error representing an over-valuation of homes.

[Equation 4] predicted house price = 71.24 + 0.1003(time)
[p-values]                          (0.00)   (0.00)

If there is no house price bubble, one would expect the house price index between 2000 and 2007 to increase by only about 0.1003 points per month, if at all. For example, the regression predicts that in January 2007 (when time equals 241) house prices should be 71.24 + 0.1003(241) = 95.41. In reality the house price index was 222.61, so yes, if the real value of homes increases by only about 0.1003 per month, houses were horribly over-priced and would eventually come crashing down. The prediction-error of 222.61 - 95.41 = 127.20 is enormous, considering the largest prediction-error among the in-sample predictions is 12.98! Of course, if the real value of houses had suddenly risen to new heights there would be no bubble, but John Paulson didn't see any reason why houses would increase in value by that much (nor do I). He assumed the large out-of-sample prediction-error denoted a price bubble.
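Here is a sketch of the same calculation in Python, assuming a hypothetical file houses.csv with columns time (1 = January 1987) and price (the index):

    import numpy as np
    import pandas as pd

    houses = pd.read_csv("houses.csv")

    # Estimate the time trend using only data through December 1999 (time <= 156)
    pre2000 = houses[houses["time"] <= 156]
    a1, a0 = np.polyfit(pre2000["time"], pre2000["price"], deg=1)  # slope, intercept

    # Out-of-sample prediction for January 2007 (time = 241) and its prediction-error
    predicted = a0 + a1 * 241
    actual = houses.loc[houses["time"] == 241, "price"].iloc[0]
    print(predicted, actual - predicted)  # roughly 95.41 and 127.20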

Let's create a graph showing the predicted house prices and the actual house prices from 1987 to 2011. The blue line shows the actual price of houses across the U.S. The red line denotes the predictions, and because there was little variability in house prices before 2000, the in-sample prediction-errors (the difference between the red line and the blue line) are relatively small. The red line has a slight incline, representing the incremental increase in house prices over time. When this trend is continued past the year 2000 (into years which were not used to estimate the regression), the prediction-errors become large. When Paulson saw the enormous gap between actual house prices and the out-of-sample predictions, he concluded prices were much higher than was justified by real supply and demand conditions. Not only were prices too high, but they were bound to plummet, and whenever one knows how prices will move in the future, there is money to be made.

Figure 6—In- and Out-of-Sample Predictions of House Prices (1987-2011)
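Continuing the hypothetical sketch from above, a figure like Figure 6 can be approximated with a few lines of matplotlib:

    import matplotlib.pyplot as plt

    # Extend the pre-2000 trend over the full sample (out-of-sample after time = 156)
    houses["trend"] = a0 + a1 * houses["time"]

    plt.plot(houses["time"], houses["price"], "b-", label="Actual house price index")
    plt.plot(houses["time"], houses["trend"], "r-", label="Trend estimated from 1987-1999")
    plt.axvline(156, linestyle="--", color="gray")  # December 1999: estimation cutoff
    plt.xlabel("Months since January 1987")
    plt.ylabel("House price index")
    plt.legend()
    plt.show()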

Paulson did what any good investor would do: he speculated on house prices. But how? Obviously, if you believe house prices will rise, the way to make money is to buy lots of houses, rent them out, and once house prices climb to a high level, turn around and sell those houses. Buy low, sell high: a phrase we all know. But how do you "buy low, sell high" when you believe house prices will fall? You "sell high, buy low," and while that may seem difficult, it is not impossible.

People do this kind of thing with stocks all the time: it's called shorting a stock. A speculator named James finds someone named Andrew who owns stock in the corporation James wishes to short, and James pays Andrew money to "borrow" this stock from him for a period of time. During this time James has possession of the stock, allowing him to sell it, and that's exactly what he does. If James guesses right, the stock price does indeed fall, and when it does, James repurchases the stock and returns it to Andrew. Because James sold the stock at a high price and repurchased it at a low price, he makes money off the difference. Some of these profits are used to pay Andrew for the right to borrow the stock, but the rest he keeps. So, whenever a corporation begins to experience financial difficulties, people begin shorting its stock, causing the price to fall to a level more consistent with the corporation's financial health.
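To attach some hypothetical numbers: suppose James borrows a share from Andrew and sells it for $100. The price then falls to $70, so James buys the share back for $70 and returns it to Andrew, pocketing the $30 difference. If, say, $5 of that goes to Andrew as the borrowing fee, James nets $25. Had the price risen instead, James would have had to repurchase the share at a loss.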

Paulson basically wanted to short houses, but this is difficult to do, so he found another way. All Paulson needed was some financial transaction that makes money when house prices begin to fall. He knew that when house prices fall, it is often because people have trouble paying their mortgages and many houses are put on the market at the same time. So, if he could find something that pays money when people default, he could make money. As you can guess, he did just that.

As the housing bubble continued to grow, people started selling things called credit-default swaps, which were basically just insurance policies to protect lenders. Say someone named Jim sells these credit-default swaps, agreeing to pay a large sum of money to a bank if a homeowner defaults on their mortgage. If the homeowner does not default, Jim pays nothing. Of course, Jim does not sell credit-default swaps for free: he charges a premium, which he receives regardless of whether the homeowner defaults, but he pays an insurance indemnity only if the homeowner does default.
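To attach hypothetical numbers again: suppose Jim charges a bank a $10,000 annual premium to insure a $1,000,000 mortgage. If the homeowner never defaults, Jim simply collects $10,000 each year. If the homeowner defaults, Jim must cover the bank's loss, which could be hundreds of thousands of dollars. The buyer of the swap thus pays a small, certain premium in exchange for a large payoff in the event of default.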

If people began defaulting on their mortgages in large numbers, the owners of credit-default swaps would be paid a lot of money. Moreover, the swaps were cheap to purchase, because when a market is "inside the bubble" few think the price will ever fall. Paulson thus began purchasing lots of credit-default swaps, hoping the bubble would burst. It did, and when it did, Paulson made his firm so much money that it was called one of the greatest trades ever made. Needless to say, Paulson was compensated handsomely.

     Mr. Paulson charged Mr. Pellegrini with figuring out whether homes were, in fact, overpriced. Late at night, in his cubicle, Mr. Pellegrini tracked home prices across the country since 1975. Interest rates seemed to have no bearing on real estate. Grasping for new ideas, Mr. Pellegrini added a "trend line" that clearly illustrated how much prices had surged lately. He then performed a "regression analysis" to smooth the ups and downs.
     The answer was in front of him: Housing prices had climbed a puny 1.4% annually between 1975 and 2000, after inflation. But they had soared over 7% in the following five years, until 2005. The upshot: U.S. home prices would have to drop by almost 40% to return to their historic trend line. Not only had prices climbed like never before, but Mr. Pellegrini's figures showed that each time housing had dropped in the past, it fell through the trend line, suggesting that an eventual drop likely would be brutal.
     ...
     In late 2007, Mr. Pellegrini took his wife on vacation in Anguilla. Stopping at an automated-teller machine in the hotel lobby to withdraw some cash, she checked the balance of their checking account.
     On the screen before her was a figure she had never seen before, at least not on an ATM. It's not clear how many others ever had, either: $45 million, newly deposited in their joint account. It was part of Mr. Pellegrini's bonus that year.
—Gregory Zuckerman. October 31, 2009. "Profiting From the Crash." The Wall Street Journal.