The average is a good prediction
Each year this class goes to a field to hit softballs. Someone pitches a slow pitch, and each person is required to hit three pitches as far as they can. Hitters are told to swing at all pitches within their reach, and not just wait for a good pitch. Afterwards we administer a survey to the hitters to collect an array of information that might explain why some people hit further than others. If one gender can hit further, on average, data on gender might help predict hitting distances. Likewise, information on their experience on a softball / baseball team might also explain why some people hit further than others.
Figure 1—Collecting Statistics On Softball Hits
This article begins a series of articles all devoted to regression. You can think of regression as a technique where you use the values of one or more variables (like gender and experience) to predict the values of another variable (like hitting distance).
The articles frequently employ the softball data because most students can easily envision how information on certain variables should influence hitting distance, making them ideal data for learning the basics of regression. Moreover, once you learn how to apply regression for predicting hitting distances, this knowledge can easily be transferred to other settings. The data on hitting distances can be downloaded here.
Before learning regression, we must understand the features of an average, for as you will see, a regression is just a sophisticated average. When you open the data in Excel it will look like the following.
Figure 2—Glance at Softball Data
If I asked you to predict a student's hitting distance using these data, but did not tell you anything about the hitter, how would you use the data to form a prediction? You would probably just take the average distance. That would be a wise choice. The average provides good predictions in that it minimizes the sum of squared prediction errors (the SSE), where a prediction error is the actual value of the variable minus the prediction.
Even though an average provides good predictions, Figure 2 shows we have data on people's gender and experience playing on a ball team. It would seem obvious that we should use these data to provide different predictions for different people. For example, if the males in the data hit further than the females, we should predict a larger hitting distance when a male steps up to the plate than when a female does. Likewise, someone with little experience on a ball team is likely to hit a shorter distance than someone of the same gender who has played lots of ball.
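To see why the average is a sensible prediction, here is a small Python sketch that compares the average against a few other guesses. The five distances are made-up numbers, not the class data, and the names `distances`, `sse`, and `candidates` are chosen only for this illustration. Here the SSE is computed as the sum of squared prediction errors.

```python
# Hypothetical hitting distances in feet (made-up numbers, not the class data).
distances = [120.0, 95.0, 150.0, 80.0, 130.0]


def sse(prediction, values):
    """Sum of squared prediction errors when the same number is predicted for everyone."""
    return sum((v - prediction) ** 2 for v in values)


average = sum(distances) / len(distances)

# Compare the average against a few other candidate predictions.
candidates = [average - 10, average - 1, average, average + 1, average + 10]
for guess in candidates:
    print(f"prediction = {guess:6.1f}   SSE = {sse(guess, distances):8.1f}")

# The candidate with the smallest SSE should be the average itself.
best = min(candidates, key=lambda g: sse(g, distances))
print("best prediction:", best)
```

Moving the prediction away from the average in either direction raises the SSE, which is what it means to say the average minimizes it.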
Conditional averages are better
If you were truly interested in predicting hitting distances, you could imagine sorting the data by gender and experience, and calculating different averages for the following types of people:
- males with less than three years of experience,
- males with more than three years of experience,
- females with less than three years of experience,
- females with more than three years of experience.
These averages, each conditional upon a certain type of person, are referred to as conditional averages. Your predictions using conditional averages will probably be more accurate than using one average for everyone. In fact, if accuracy is defined as a low SSE, it is mathematically impossible for conditional averages to be less accurate than the single overall average on the data used to compute them: the SSE either falls or, at worst, stays the same.
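The grouping described above can be sketched in a few lines of Python. The six records are invented for illustration, and the helper `group_of` is a hypothetical name. Note one assumption: the list above does not say where someone with exactly three years of experience belongs, so this sketch places them in the more experienced group.

```python
from collections import defaultdict

# Hypothetical records: (gender, years of experience, distance in feet).
hits = [
    ("male", 1, 140.0), ("male", 5, 170.0), ("male", 4, 160.0),
    ("female", 0, 70.0), ("female", 2, 90.0), ("female", 6, 110.0),
]


def group_of(gender, experience):
    # Assumption: exactly three years is placed in the "3+ years" group.
    return (gender, "under 3 years" if experience < 3 else "3+ years")


# Collect the distances belonging to each type of person.
groups = defaultdict(list)
for gender, experience, distance in hits:
    groups[group_of(gender, experience)].append(distance)

# One average per group, plus the single average for everyone.
conditional_avg = {g: sum(d) / len(d) for g, d in groups.items()}
overall_avg = sum(d for _, _, d in hits) / len(hits)

# SSE when everyone gets the overall average versus their group's average.
sse_overall = sum((d - overall_avg) ** 2 for _, _, d in hits)
sse_conditional = sum(
    (d - conditional_avg[group_of(g, e)]) ** 2 for g, e, d in hits
)

print("overall average SSE:    ", round(sse_overall, 2))
print("conditional average SSE:", round(sse_conditional, 2))
```

With these made-up numbers the conditional averages give a much smaller SSE, but even in the worst case they could only tie the overall average, never do worse.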
Averages are good, but conditional averages are better. Imagine you have a huge amount of data, and there are dozens of different ways of grouping people. An example would be data on household income, including information on the occupation of the household adults—think of all the different occupations people have! It would be a lot of work to define 200 different groups from a dataset of 22,000 observations.
Fortunately, there is a much easier way of obtaining conditional averages. All we need to do is construct a formula where the values of certain variables are used to predict a variable of interest. The formula below uses the male dummy variable (equal to 1 for males and 0 for females) and the variable for years of experience to predict hitting distances. Of course, we need the values of a0, a1, and a2 (referred to as coefficients or parameters) before we can predict anything. That is the easy part. As we will see shortly, Excel calculates those coefficients for us, and it chooses the unique coefficient values that minimize the SSE.
[Equation 6] predicted distance = a0 + a1(male) + a2(experience)
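If you are curious what "choosing coefficients to minimize the SSE" looks like outside of Excel, the following is a minimal Python sketch of one standard way to do it, solving the so-called normal equations. It is not a description of how Excel does the calculation internally. The data are made up, so the coefficients it finds will not match the class results, and the names `data`, `solve`, and `predict` are hypothetical.

```python
# Hypothetical records: (male dummy, years of experience, distance in feet).
data = [
    (1, 1, 140.0), (1, 5, 170.0), (1, 4, 160.0),
    (0, 0, 70.0), (0, 2, 90.0), (0, 6, 110.0),
]

# Each row of X is [1, male, experience]; the leading 1 multiplies a0.
X = [[1.0, float(m), float(e)] for m, e, _ in data]
y = [d for _, _, d in data]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Normal equations (X'X) a = X'y; their solution minimizes the SSE.
k = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
a0, a1, a2 = solve(XtX, Xty)


def predict(male, experience):
    """Predicted distance = a0 + a1(male) + a2(experience)."""
    return a0 + a1 * male + a2 * experience


sse_fit = sum((yi - predict(m, e)) ** 2 for (m, e, _), yi in zip(data, y))
print(f"a0 = {a0:.2f}, a1 = {a1:.2f}, a2 = {a2:.2f}, SSE = {sse_fit:.2f}")
```

In practice you would let Excel or a statistics library do this step; the sketch only shows that the coefficients come from a mechanical calculation rather than guesswork.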
The act of specifying this type of equation and choosing coefficients to minimize the SSE is referred to as regression analysis. As we will see in subsequent articles, regression in Excel gives us the following equation for predicting hitting distances. The equation tells us that, holding experience constant, males hit 67.71 feet further on average than females, and that each additional year of experience increases the average hitting distance by 3.45 feet. These "averages" are conditional averages, and at its core, regression is nothing more than a very fancy way of acquiring conditional averages.
[Equation 7] predicted distance = 75.31 + 67.71(male) + 3.45(experience)
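To make the equation concrete, here is a short Python sketch that plugs a few people into it. The coefficients are copied from Equation 7; the function name `predicted_distance` and the three example hitters are hypothetical, chosen only for illustration.

```python
# Coefficients copied from Equation 7 in the text.
A0, A1, A2 = 75.31, 67.71, 3.45


def predicted_distance(male, experience):
    """male is 1 for males and 0 for females; experience is in years."""
    return A0 + A1 * male + A2 * experience


# Hypothetical hitters chosen only for illustration.
female_2yrs = predicted_distance(male=0, experience=2)
male_2yrs = predicted_distance(male=1, experience=2)
male_3yrs = predicted_distance(male=1, experience=3)

print(f"female, 2 years: {female_2yrs:.2f} feet")
print(f"male,   2 years: {male_2yrs:.2f} feet")
print(f"male,   3 years: {male_3yrs:.2f} feet")
```

The two hitters with the same experience differ by exactly the male coefficient, and the two males differ by exactly the experience coefficient, which is how the coefficients should be read.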
Video 1—Short Video on Using Regression to Predict Softball Hits