6/27/12

(AD.B.8) Basic regression: hypothesis testing and p-values

Objective: To use p-values for determining whether a variable is statistically significant in a regression.

Satan and the 9/11 Terrorist Attacks  

While the World Trade Center burned, the black smoke took on a number of different forms, all of them manifestations of randomness. Because the smoke took on random appearances, it looked random...except for one part of the smoke that looked like a face—Satan's face? Could it really be Satan? After all, if Satan were to become active in the world, 9/11 is something he might do.

Figure 1—Satan's Face During 9/11?

Most of us dismissed the possibility that it really was Satan. Yet, if a medieval Englishman saw Figure 1, he would have little doubt that it really was Satan. What makes us so different from our ancestors? The main difference is that we believe in randomness. Most of the world may still be religious, but even the most devout Christian believes some things in the world are not directly caused by God, even if God lets them happen.

Sometimes, however, we make observations about the world but are unable to decide whether they reveal a meaningful pattern or are merely the result of randomness. Enter the world of statistics, which provides tools for discerning when a picture is a face and when it just happens to look like one. That tool is the subject of the present article.

A quick word about miracles: how do we explain this face in the smoke, or anything else that seems a miracle? Easy. Miracles are things that have almost no probability of occurring at any given time. However, each person lives for about 2,459,808,000 (about 2.5 billion) seconds. Even if a miracle has only a minute chance of occurring in any one second, it has a reasonable chance of occurring at some point in your lifetime. This means that we should expect to see things like Satan's head in smoke—or coincidences that seem like they couldn't be mere coincidences—not every time we see smoke, but at some point in our lives.
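
For the curious, the claim is easy to check with a few lines of Python. The one-in-a-billion-per-second rate below is an invented number, chosen purely for illustration:

    # A sketch, assuming a "miracle" has a one-in-a-billion chance of
    # occurring in any given second (an assumed rate, for illustration only).
    per_second = 1e-9                # assumed probability per second
    seconds = 2_459_808_000          # roughly a 78-year lifetime, in seconds
    p_at_least_once = 1 - (1 - per_second) ** seconds
    print(p_at_least_once)           # about 0.91: more likely than not

At those odds, seeing at least one 'miracle' in a lifetime is the expected outcome, not the exception.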

Talking to god: the case of ancient Jews  


     Then the Lord hurled a violent wind on the sea, and such a violent storm arose on the sea that the ship threatened to break apart. The sailors were afraid, and each cried out to his god. They threw the ship's cargo into the sea to lighten the load. Meanwhile, Jonah had gone down to the lowest part of the vessel and had stretched out and fallen into a deep sleep.
     The captain approached him and said, "What are you doing sound asleep? Get up! Call to your god. Maybe this god will consider us, and we won't perish."
     "Come on!" the sailors said to each other. "Let's cast lots. Then we'll know who is to blame for the trouble we're in." So they cast lots, and the lot singled out Jonah. Then they said to him, "Tell us who is to blame for this trouble we're in. What is your business and where are you from? What is your country and what people are you from?"
     ...
     He [Jonah] answered them, "Pick me up and throw me into the sea so it may quiet down for you...".
     ...Then they picked up Jonah and threw him into the sea, and the sea stopped raging...
     Now the Lord had appointed a huge fish to swallow Jonah, and Jonah was in the fish three days and three nights.
—Holman Christian Standard Bible, Jonah.

Most readers have heard the story of Jonah and the whale, and for those who actually read the story, the part about "casting lots" must have been confusing. Unfortunately for us, little historical evidence exists to tell us exactly how lot-casting was performed, but we do know it was basically like rolling a die. For example, if there were only six sailors on the ship, they could assign each man a number, roll a die, and whichever number resulted indicated who was to blame.

If this sounds odd, you are understanding correctly. The reason the casting of lots seems so arbitrary is that—regardless of how religious you are—you view the world like a devout Epicurean. What I mean is that you do not assume every single thing that happens in life is determined by gods. There is some randomness in life, and the reason gambling is fun is that there is randomness. If we lose money in Vegas, we believe, it's not because god spites us.

Many people in ancient history did believe that the will of the gods is manifested in what happens in the physical world. For Jews, who incorporated lot-casting extensively into their culture (to see for yourself, acquire an e-book of the Bible and search for the word "lots" or the phrase "cast lots"), God could be consulted directly by the casting of lots. In battle, if the Jews were conflicted about whether to attack or retreat, they might very well cast lots. That is, they flipped a coin: "heads," attack; "tails," retreat.

If you were an ancient Jew, almost everything we do with statistics and regression would be heresy, because it assumes there is randomness in the world independent of God's will.

Talking to god: Beware the Ides of March  

Romans were always keen to observe the world around them, interpreting odd events like natural disasters as manifestations of the gods' emotions. They would especially observe the behavior of birds. The English word "auspicious" derives from the Latin "auspice": an interpretation of bird behavior as a sign of favor from the gods. The Roman army would take sacred chickens with them, and if the generals were unsure whether to perform an action, they would place crumbs of cake on the bottom of the chickens' cages; the gods would answer "yes" or "no" depending on whether the chickens ate the cake. This is the same as the ancient Jew casting lots, or a modern American rolling a die.

In plays and movies about Julius Caesar, his murder is usually foretold by a soothsayer who says, "Beware the Ides of March"—another way of saying, "Beware of March 15." Of course, Caesar is murdered on this day. The soothsayer would not receive direct prophecy from the gods, but would infer the gods' wishes through omens and auspices. Also included in the clip below (Video 1) is a scene where the priests of Marcus Aurelius inspect an animal carcass in search of omens.

Video 1—Beware the Ides of March, Caesar!

The Romans' superstitious beliefs were exploited by the ruling classes. Great men would invent stories of auspices that occurred on the day of their birth to signify the gods' favor, or would invent an omen when they wanted to prevent others from doing something. This was not all that unusual. Rome's most formidable enemy, Hannibal of Carthage, acquired much of his power by convincing others that he was ordained by the gods to emerge victorious in all battles.

     Dio attributes Hannibal's ability to predict future events to the fact that he understood divination by the inspection of entrails. At those critical moments when confidence in their mission had begun to ebb away from his troops, Hannibal seems to have ensured that some evidence of divine favor was presented by which the stock of Carthaginian self-belief was replenished and the troops were reminded that they were literally following in the footsteps of Heracles [or, Hercules] and his army.
—Miles, Richard. 2011. Carthage Must Be Destroyed. Chapter 10.

To my amazement, some cultures still foretell the future by observing entrails. Among the Aka of India, marriages are not recognized until a particular type of cattle called Mithan has been slaughtered and its liver "read." For example, a couple is predicted to experience an accident but to live a largely happy life if the liver contains a small spot in a certain area.

Figure 2—Foretelling the future by
observing a cow liver (Aka, India)

Talking to god: Ancient China  

The ancient Chinese are most relevant to this article, because their method of talking to the gods is the most similar to modern statistical techniques. The Shang dynasty existed in China around 1500 B.C., and in its pursuit of the gods it used the shoulder blades of cattle or the shells of turtles as "oracle bones." Questions would be carved into the bones—questions asked of the ancestors of the present king, who were the gods of interest. The bones would then be cracked such that the cracks pointed either up or down, constituting a yes or no answer.

But there is more. Questions were sometimes asked on a daily basis, and many of the questions had answers that could later be verified (e.g., is the harvest going to be good this year?), and the bones were stored so that future generations could assess the accuracy of a certain king's ancestors. Yes, the people did check whether the oracle bones gave accurate answers—what audacity!

But there is more. Rather than asking a question once, they might ask it five or ten times to determine whether the answer was clear or ambiguous. If roughly half the answers were yes and half were no, then the gods essentially did not answer. If there were eight yes answers and two no answers, then the gods' answer was an official yes. After all, if 80% of the answers are yes, it is more likely that the gods' true answer is indeed yes.
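
Modern probability lets us check how demanding this informal rule actually was. Here is a minimal sketch in Python, using the ten-question, eight-yes scenario from the paragraph above:

    from math import comb

    # If the bones answer at random (yes or no, each with probability 0.5),
    # how often would 10 questions produce 8 or more yes answers?
    n = 10
    p_eight_plus = sum(comb(n, k) for k in range(8, n + 1)) / 2**n
    print(p_eight_plus)        # about 0.055
    print(2 * p_eight_plus)    # about 0.11, also counting 8-or-more no answers

So a lopsided 8-of-10 answer would arise from pure chance roughly one time in ten. The Shang procedure was, in effect, an informal hypothesis test with a rather generous significance level.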

But there is more. Today we know (or, I believe) that the outcome of the oracle bones is random, and thus the answers are random. If the gods consistently gave accurate answers, the present king had a true and authentic heavenly mandate, because his ancestors were powerful gods. If the gods seemed to know nothing about the future, the present king was a farce, his ancestors were no gods, and the king had to be replaced.

The laws of probability assert that, as time goes by, there will be some rulers whose ancestors happen to predict the future well. Consequently, the rule of a king depended on chance—the rolling of a die—as much as on his quality as a ruler. Kings ruled based on randomness.

Type I Errors and p-values  

Around the time of the Trojan Wars (1500 B.C.) the Shang dynasty arose on the remains of the Xia dynasty, and the Shang left a huge collection of used oracle bones and records of their success at predicting the future. Today, we would view these oracle bones as we would a coin toss: just as likely to give a wrong answer as a right one, with the outcome depending on chance, not on the deified ancestors of the Shang. As researchers inspect the historical evidence, they are likely to assume that the probability of an oracle bone predicting the future accurately (p) is 50%. This is their default, or null, hypothesis. It would take considerable evidence to convince them otherwise.

[Equation 1] Null hypothesis: p = 0.5

[Equation 2] Alternative hypothesis: p ≠ 0.5

Suppose, however, that studies of these bones suggest they are correct 90% of the time, and this percent comes from studying 176 oracle bone predictions. What would you then think of the null hypothesis? You would begin to believe the null is wrong. Perhaps the Shang ancestors are deities, or perhaps the oracle bones were a farce used to trick people into revering the Shang?

It could also be true that the null really is correct and that, just by random chance, 176 coin flips landed on 'heads' 90% of the time. However, the probability of that happening is tiny, and so you reject the null and conclude p is greater than 50%, knowing there is almost no chance of being wrong.
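
How tiny? A short sketch makes the point; 90% of 176 predictions is about 158 correct answers:

    from math import comb

    # Probability of 158 or more correct answers out of 176 attempts if each
    # answer really is a fair coin toss (the null hypothesis, p = 0.5).
    n = 176
    correct = round(0.9 * n)    # 158
    tail = sum(comb(n, k) for k in range(correct, n + 1)) / 2**n
    print(tail)                 # about 2e-29: effectively zero

With odds like that, rejecting the null is about as safe as a conclusion gets.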

Consider another case where only 52% of the oracle bones yielded the correct prediction. The probability of observing 52% accurate predictions in 176 attempts, when the truth is 50%, is quite high, it would seem. You do not reject the null, because if you did, the probability of being wrong would be high.

To discern between the null and alternative hypotheses, statistics uses the concepts of Type I and Type II Errors.

[Equation 3] Null hypothesis: p = 0.5. A Type I Error occurs when you 'reject the null', when in fact the null is true.

[Equation 4] Alternative hypothesis: p ≠ 0.5. A Type II Error occurs when you 'fail to reject the null', when in fact the null is false.

Researchers don't like to settle on a conclusion that has a high likelihood of being wrong, and when we can measure that likelihood we incorporate it into our decision-making strategy. The probability of a Type II Error generally cannot be measured, but the probability of a Type I Error can. As a result, we usually discern between the null and alternative hypotheses by 'rejecting the null hypothesis' only if the probability of a Type I Error is small.

The probability of a Type I Error is called a p-value. We 'reject the null' when the p-value is less than 5%. Sometimes we use 10% and sometimes we use 1%. When the p-value is greater than 5% (or 10%) we 'fail to reject the null'.

Professor Pope's 100,000 coin tosses  

A coin toss, we know, has a 50% chance of 'heads' and a 50% chance of 'tails', yet we also know that in, say, a few hundred coin flips the percent of 'heads' will almost certainly not be exactly 50%. It might be 49.113532%, or 52.0000394%, or ... Even if we flipped a coin 100,000 times, the percent resulting in 'heads' would not be 50%. We know because a man actually tried it in the 1930s.

Yes, a man by the name of Professor Pope Hill flipped a coin 100,000 times. It took him an entire year, but he completed the exercise, finding that 49,855 flips (49.855%) resulted in 'heads' and 50,145 (50.145%) in 'tails' (1, 2). Notice two things. First, we never expected the percents to be exactly 50% for 'heads' or 'tails'. Second, because the number of observations is very large, the percentages are very close to 50%.
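
Today you can replicate Professor Pope's year of labor in a fraction of a second. Here is a minimal simulation sketch in Python (the seed is arbitrary):

    import random

    # Simulate Professor Pope's experiment: 100,000 fair coin flips.
    random.seed(1912)    # arbitrary seed, fixed so the run is reproducible
    tails = sum(random.random() < 0.5 for _ in range(100_000))
    print(tails / 100_000)    # a share close to, but almost surely not, 0.5

Run it with a few different seeds and the share of 'tails' wanders around 50% without ever settling exactly on it.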

Instead of just assuming the true probability of a 'tails' is 50%, let's empirically determine whether this is the case. 'Empirically determine' means we use observation, or data. In statistics we use a rule for deciding whether to say the true probability of 'tails' is or is not 50%. First, we establish the following two hypotheses, where p is the true probability: we cannot observe it directly, but we can make inferences about it using data.

[Equation 5] Null Hypothesis: true probability of tails (p) = 0.5

[Equation 6] Alternative Hypothesis: true probability of tails (p) ≠ 0.5

Then, we say we 'reject the null hypothesis' only if the probability of being wrong (the probability of a Type I Error, or the p-value) is 5% or less (or 10%, or 1%). In what follows, we won't estimate the p-value itself, but instead will infer whether it is less than 5% or greater than 5%.

When Professor Pope finished flipping a coin 100,000 times, it had come up 'tails' 50.145% of the time. If this percentage were highly unlikely to result from a fair coin toss, we would doubt the fairness of the toss. If the percentage is not an unlikely outcome of a fair coin toss, then we have no reason to think the toss was unfair (which is like saying we 'fail to reject the null hypothesis'). So, how likely or unlikely is a percentage of 50.145% in a fair coin toss?

Suppose that the true probability of a 'tails' is 50%. What is the probability we would observe a percentage of 50.145% in a sample of 100,000 flips? Whatever that probability is, it equals the probability of committing a Type I Error if we reject the null hypothesis.

Suppose, for example, there were only a 1% chance that a fair coin toss could produce 'tails' 50.145% of the time. We might then say that the coin toss was not fair, and conclude there was something about his experiment that gave 'tails' a higher probability of being observed.

You might remember from your introductory statistics course that estimates like these follow a standard normal distribution, once we transform them the right way. Using the information in Figure 3 below, we find the value of z in Professor Pope's case is z = (0.50145 - 0.5) / {0.5(1 - 0.5)/100,000}^0.5 = 0.917. This is neither greater than 2 nor less than -2. Thus, Professor Pope's result is consistent with a fair coin toss. If we rejected the null hypothesis and said the coin toss wasn't fair, the probability of being wrong would be greater than 5%. The probability of a Type I Error is thus greater than 5%, and we are not willing to take that chance.

Suppose instead Pope had found that 52% of the coin tosses resulted in 'tails'. The value of z is then z = (0.52 - 0.5) / {0.5(1 - 0.5)/100,000}^0.5 = 12.649. This is far, far greater than 2, which means there is less than a 5% chance of observing 52% 'tails' if the true probability is 0.5. Consequently, we can 'reject the null hypothesis' and say the coin toss is not fair, with only a small chance of being wrong—only a small chance of committing a Type I Error.

One more example. Suppose 49.6% of the tosses were 'tails'. The value of z is then z = (0.496 - 0.5) / {0.5(1 - 0.5)/100,000}^0.5 = -2.53. This is less than -2, so the probability of this result from a fair coin toss is less than 5%. Because the probability of a Type I Error is less than 5%, we go ahead and say the coin toss was not fair, knowing we have only a small chance of being wrong.
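
All three calculations follow a single recipe, sketched below. The plus-or-minus 2 cutoff is the rough two-standard-deviation rule used above; the textbook cutoff for a 5% test is 1.96.

    from math import sqrt

    def z_for_share(p_hat, p_null=0.5, n=100_000):
        # z statistic for an observed share against a hypothesized share
        return (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n)

    for share in (0.50145, 0.52, 0.496):
        z = z_for_share(share)
        verdict = "reject the null" if abs(z) > 2 else "fail to reject the null"
        print(share, round(z, 3), verdict)
    # 0.50145 0.917 fail to reject the null
    # 0.52 12.649 reject the null
    # 0.496 -2.53 reject the null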

Keep in mind that this 'probability of a Type I Error' is called a p-value. This is important for the next section.

Figure 3—The Standard Normal Distribution

Can your telephone number predict your basketball shooting performance?

It's a ridiculous question, I know. To say the last digit of your telephone number can help predict the percent of basketball shots you make is absurd, but watch this. The data here contain information on 171 attempted shots by students in a previous class. Each student was asked to attempt five shots from the free-throw line, and we recorded the percent they made (the dependent variable). Then we asked them to attempt another five free-throw shots. After that, they attempted five shots from the 3-point line (farther from the goal) and then another series of five 3-point shots. Data on each person's gender, years of experience playing on a basketball team, and the last digit of their phone number were also collected. These variables enter the equation below, where male is a dummy variable for male shooters and 3-point is a dummy variable for shots taken from the 3-point line (as opposed to the free-throw line). Estimate this regression and you get the results in Figure 4.

[Equation 7] percent shots made = a0 + a1(male) + a2(3-point) + a3(experience) + a4(last-digit)

Figure 4—Basketball Regression Results
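
We cannot reprint the class's data file here, so the sketch below simulates data with the same structure (including a last-digit effect that is truly zero) and fits the regression with Python's statsmodels package. The variable names and the coefficient values used in the simulation are illustrative assumptions, not the class's actual numbers.

    import numpy as np
    import statsmodels.api as sm

    # Simulated stand-in for the class data: 171 observations, with the true
    # effect of last_digit on shooting percentage set to exactly zero.
    rng = np.random.default_rng(0)
    n = 171
    male = rng.integers(0, 2, n)          # dummy: 1 if male
    three_point = rng.integers(0, 2, n)   # dummy: 1 if shot from 3-point line
    experience = rng.integers(0, 8, n)    # years playing on a team
    last_digit = rng.integers(0, 10, n)   # last digit of phone number
    percent_made = (40 + 9*male - 18*three_point + 2*experience
                    + 0*last_digit + rng.normal(0, 15, n))

    X = sm.add_constant(np.column_stack([male, three_point,
                                         experience, last_digit]))
    print(sm.OLS(percent_made, X).fit().summary())

In the printed summary, the p-value on last_digit should be large (the exact value varies with the seed), while the p-values on the other coefficients should fall well below 0.05, matching the pattern in Figure 4.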

The coefficients tell us that males' shooting percentage is about nine percentage points higher than females', that shooting percentage falls about 18 points when you move from the free-throw line to the 3-point line, that each year of experience increases shooting percentage by almost two points, and that people whose phone number ends in a higher digit shoot better...what?

Yes, the regression predicts that if the last digit of a person's phone number increases from five to six, their shooting percentage increases by 0.238 percentage point. Yet we know that isn't true.

If you flip a coin 100 times, the percent 'heads' will not be exactly 50%. It will be close, but not 50%. Likewise, even if an explanatory variable has no impact on the dependent variable, the probability that its coefficient estimate will equal exactly zero is almost zero. Even if the 'true' coefficient is zero, the estimate of it will not be. So how do we tell when a variable truly impacts the dependent variable? First, we set up two hypotheses for the effect of last-digit.

[Equation 8] Null Hypothesis: a4 = 0

[Equation 9] Alternative Hypothesis: a4 ≠ 0

We then reject the null, and say phone numbers do affect shooting ability, only when the probability we are wrong is small. What is that probability, exactly? It is the p-value given for last-digit in Figure 4. That p-value is 0.728, which means that if we reject the hypothesis a4 = 0 we have a 72.8% chance of being wrong. That is far too high a probability to accept, so we 'fail to reject the null' and conclude a person's phone number has nothing to do with their shooting ability.

Notice that the p-values on all the other coefficients are less than 0.05. This means we can reject the null hypotheses that a1 = 0, a2 = 0, or a3 = 0, and the probability we are wrong is less than 5%. That is a probability we accept, and thus we conclude that a person's gender, the distance from the goal, and their experience playing basketball all impact their shooting ability.

In statistical terms, last-digit is statistically insignificant, while male, 3-point, and experience are statistically significant variables.

P-values and the global warming debate

     If the pictures of those towering wildfires in Colorado haven't convinced you, or the size of your AC bill this summer, here are some hard numbers about climate change: June broke or tied 3,215 high-temperature records across the United States. That followed the warmest May on record for the Northern Hemisphere – the 327th consecutive month in which the temperature of the entire globe exceeded the 20th-century average, the odds of which occurring by simple chance were 3.7 x 10^-99, a number considerably larger than the number of stars in the universe.
—McKibben, Bill. August 2, 2012. "Global Warming's Terrifying New Math." Rolling Stone magazine.

P-values and Mitt Romney's fake supporters

     Last week Zach Green of 140Elect, noticed some strange goings-on with Mitt Romney's Twitter account (@MittRomney). Romney's account, which had been averaging around 2,000 to 5,000 new followers a day, gained 141,000 followers in two days.
     This observation prompted speculation - from Green, Slate, The Huffington Post, CNN, and many others - that the Romney Campaign was buying robot followers, or perhaps (conspiratorially) someone else was buying them to make Romney look bad.
     But actual analysis of these new followers has been limited to manual observation; many do, indeed, look fake. However, high-profile users can be targets for the algorithms that run bot accounts, and some amount of bogus followers is to be expected. We decided to dig into the data of these new followers to see if they differ statistically from the new followers of other accounts similar in size to Romney. We subjected Barack Obama's account, @BarackObama, to the same analysis.
     We developed a simple methodology for testing whether a set of followers is likely to be the product of natural user following behavior or bot networks. This test revealed a significant difference between the distribution of followers among the accounts in Mitt Romney's recent spike and that of similar users in our comparison. It strongly indicates that non-organic processes induced Romney's recent surge in followers. We did not find a similar pattern in Barack Obama's recent followers. The details of these findings are presented below.
     ...
     According to a random sample of 1000 followers from the candidates' accounts, 26.9% of Romney's 150,000 newest followers had fewer than 2 followers. For other accounts of similar size, only 9.6% of new followers had less than 2 followers themselves. The median number of followers for Romney's new followers was 5, whereas the median for the comparison group was 27. This represents a stark, and statistically significant difference. If you are a statistics nerd, like us, you might want to know that the p-value on this was 0.0000. For the rest of the world, this means that there is, essentially, a zero percent chance that the underlying characteristics of Romney's followers are actually the same as the comparison users.
—Furnas, Alexander and Devin Gaffney. July 31, 2012. "Statistical Probability That Mitt Romney's New Twitter Followers Are Just Normal Users: 0%." The Atlantic. www.theatlantic.com.

In case you missed the point...

Many times we estimate a regression and are not sure whether an explanatory variable really belongs in it. Or our objective in a regression is to determine whether one variable helps predict another. In both cases, we use the p-value to make this determination. Whenever the p-value is 0.05 or less, we say the explanatory variable of interest is indeed correlated with the dependent variable. The explanatory variable is statistically significant.

Notice, however, I did not say one variable causes the other to change. Correlation is not causation.

References

(1) Boese, Alex. June 9-10, 2012. "The Pleasures of Suffering for Science." The Wall Street Journal. C3.

(2) Boese, Alex. "Testing the Law of Probability." Weird Universe. Accessed June 10, 2012 at http://www.weirduniverse.net/blog/comments/testing_the_law_of_probability/.

(3) Rymer, Russ (author) and Lynn Johnson (photographer). July 2012. "Vanishing Voices." National Geographic.

(4) Hammond, Kenneth J. 2004. Lecture 2: The First Dynasties. From Yao to Mao: 5,000 Years of Chinese History. The Teaching Company.

(5) Empire. 2002. Arenas Entertainment, et al.