You might have heard that science has a replication crisis. Both experienced researchers and the lay press have commented on the fact that many research findings cannot be replicated by researchers redoing the same experiments. Because replication is a fundamental part of scientific research (it allows us to essentially double-check the initial study), such a discovery suggests that many of the research findings we take for granted might not actually be true.
The most famous example is power posing, where early research suggested that holding a specific pose produced improvements in mood and performance. This led to a best-selling book and a highly popular TED Talk. Only… it turned out that these findings could not be replicated, leading many to believe that they might not be true.
More recently, Brian Wansink, a leading food researcher from Cornell University, was found to have committed scientific misconduct, and was made to leave his post. Wansink’s papers, which have over 20,000 citations, have been largely discredited, with 13 being retracted from the journals they were published in. Wansink was accused of p-hacking, a practice that is largely believed to be one of the main drivers of the (alleged) replication crisis in science. So, just what is p-hacking? In this article, I aim to find out.
The Scientific Method
First, a reminder on how science should work, with specific mention of p-values. When I conduct an experiment, I should (in theory, although nobody really states this explicitly anymore) create two hypotheses: the null hypothesis and the alternative hypothesis. The idea with a study is to set out to falsify the null hypothesis; while we can’t “prove” that something has an effect, we can say that an effect is very likely if we are able to reject the null hypothesis. (The approach detailed here is an example of Null Hypothesis Significance Testing [NHST]. It is perhaps the most well-known method in scientific research, but it’s not the only one, so keep this in mind.)
Let’s work through an example: I have 20 athletes, and I want to understand whether caffeine improves their 1 repetition maximum (1RM) bench press. What I plan to do is get them all to do a 1RM bench press test without caffeine, and get them all to do a 1RM bench press test with caffeine. If they lift more with caffeine than without, then I can state caffeine enhances 1RM bench press performance.
If I was a really good scientist, I’d randomize the order in which they did the test; some athletes would do the caffeine-free test first, followed by the caffeine test, and some the other way around. I would also want to blind the athletes as to whether or not they had consumed caffeine, as through the expectancy or placebo effects, knowing whether or not you consumed caffeine could affect your performance outside of any impact of caffeine. Having viewed the previous research on caffeine and performance, I think that caffeine likely would enhance bench press performance. So, in this case, my null hypothesis is “caffeine will not enhance performance” and my alternative hypothesis is “caffeine will enhance performance.”
After setting up my experiment and my hypotheses, in order to show an effect of caffeine, I want to try and reject the null hypothesis. To do this, I can use a variety of different statistical methods, but the most common and basic is the t-test. One of the outputs from these statistical tests is a p-value. We can use this p-value to guide us on whether or not we can safely reject the null hypothesis. What the p-value tells us (and this is commonly misunderstood) is the chance (or probability, hence “p”) of getting a result at least this extreme if the null hypothesis is correct. It is not the probability that the null hypothesis itself is true.
Let’s return to the caffeine example. I’ve done the 1RM bench press testing of my athletes under both conditions (caffeine and placebo). The average 1RM score when the athletes didn’t have caffeine was 120kg. The average when they did have caffeine was 130kg. We want to use our statistical tests to understand whether the difference in means is likely “real” (i.e., caffeine does enhance performance, which is the alternative hypothesis) or “false” (i.e., caffeine does not actually improve performance, and the difference in means is likely due to chance, random variation, etc., which is the null hypothesis).
Remember, the p-value tells us the chance of getting a result this extreme if the null hypothesis is correct. If I had a p-value of 0.1, then, if the null hypothesis (i.e., caffeine does not enhance performance) were true, there would be a 10% chance of seeing a difference of 10kg or more between the two trials. Similarly, if the p-value was 0.01, that chance would be 1%.
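To make that definition concrete, here is a minimal sketch of how a p-value can be computed by brute force. It uses a sign-flip permutation test rather than the t-test mentioned above, simply because it needs nothing beyond Python's standard library. The eight athletes and their 1RM scores are invented for illustration, and the function name is my own.

```python
import random
from statistics import mean

def paired_permutation_p_value(without, with_caffeine, n_resamples=10_000, seed=0):
    """One-sided p-value: how often chance alone gives a mean gain >= the observed one."""
    rng = random.Random(seed)
    diffs = [w - wo for w, wo in zip(with_caffeine, without)]
    observed = mean(diffs)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # If the null hypothesis is true, each athlete's gain could just as
        # easily have been a loss, so we flip the sign of each difference at random.
        shuffled = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(shuffled) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples

# Hypothetical 1RM scores (kg) for 8 athletes, invented for illustration.
placebo = [118, 122, 115, 125, 120, 119, 123, 118]
caffeine = [121, 124, 114, 129, 123, 122, 124, 121]

p = paired_permutation_p_value(placebo, caffeine)
print(f"p = {p:.3f}")
```

With these made-up numbers, only a small fraction of the random sign flips produce an average gain as large as the observed one, so the p-value comes out well below 0.05. Note what is being counted: results at least this extreme, in a world where the null hypothesis holds.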
False Positives vs. False Negatives
Still with me? Now I need to introduce type I and type II errors. A type I error is where we reject the null hypothesis, but the null hypothesis is actually true. In the caffeine example, we would state that caffeine does have an effect, while in actual fact it does not. We can consider type I errors to be false positives.
A type II error is the opposite; here, we fail to reject the null hypothesis when the alternative hypothesis is actually true. In the caffeine example, this would be saying that caffeine has no effect on 1RM bench press strength, when in fact it actually does. We can consider type II errors to be false negatives.
The p-value threshold we choose essentially sets our risk of committing a type I error. So, the big question is: What is the acceptable risk of committing a type I error? In this case, what should we set our p-value threshold as, before we can reject the null hypothesis and say that caffeine does have a performance-enhancing effect on 1RM bench press strength?
We could argue about this all day long, but the general consensus is that a p-value of 0.05 is the appropriate threshold. With a p-value of 0.05, there is a 5% chance of getting a result at least as extreme as ours if the null hypothesis is true. So, if we reject the null hypothesis whenever p is below 0.05, what we’re effectively saying is that, when there really is no effect of caffeine, we have a 5% chance of stating there is one (i.e., a false positive). Some researchers recommend using a much stricter threshold, such as 0.001 (while others believe p-values are a largely outdated method). There is a balancing act here: The stricter the threshold we choose to accept for a p-value (and therefore, the lower the chance of committing a type I error), the greater the chance of committing a type II error.
Ok, now we’re getting closer to the crux of the problem with p-hacking. If I set my p-value threshold as 0.05, then I’m accepting that there is a 5% chance of me claiming that caffeine has a performance-enhancing effect when it actually doesn’t. This means that, if caffeine truly had no effect and I repeated this experiment 20 times, I would expect about one of those occasions (i.e., 5% of the trials) to give a p-value of <0.05 by chance alone. That would be a false positive: I would be saying caffeine has an effect when actually it doesn’t. This has implications for larger, more complex experiments, when we might have to run multiple statistical tests.
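The 5% figure is easy to check with a simulation. The sketch below runs thousands of imaginary experiments in which caffeine has no effect at all, and counts how many still come out "significant". To stay within the standard library, it swaps the t-test for a simpler z-test that assumes the spread of the data (a standard deviation of 5kg, an invented figure) is known in advance.

```python
import random
from statistics import NormalDist

def false_positive_rate(n_experiments=5000, n_athletes=20, alpha=0.05, sd=5.0, seed=1):
    """Fraction of no-effect experiments that still come out 'significant'."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    false_positives = 0
    for _ in range(n_experiments):
        # The true effect is zero: every difference is pure noise.
        diffs = [rng.gauss(0.0, sd) for _ in range(n_athletes)]
        z = (sum(diffs) / n_athletes) / (sd / n_athletes ** 0.5)
        p = 2 * (1 - standard_normal.cdf(abs(z)))  # two-sided p-value
        if p < alpha:
            false_positives += 1
    return false_positives / n_experiments

rate = false_positive_rate()
print(f"False positive rate: {rate:.3f}")
```

The rate should land close to 0.05, which is exactly what the threshold promises: roughly 1 in 20 experiments on a useless supplement will look like a success.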
Returning to the caffeine example, let’s now introduce 10 genes that we might believe affect how much caffeine influences performance. For each gene, I want to know whether people with one version see a greater performance enhancement than people with the other version—so I need two statistical tests for each gene, and I have 10 genes, leading to 20 statistical tests. If I select my p-value as being 0.05 for each of these, then, by virtue of running many tests, I’m greatly increasing the chances of committing a type I error; the chance of a false positive is 1 in 20, and I’ve done 20 tests (this is an oversimplification, but it helps to demonstrate the points).
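To put a number on "greatly increasing", here is a small sketch of the standard calculation for the chance of at least one false positive across several tests. It assumes the tests are independent of one another, which real genetic tests may not be, so treat the result as a rough guide.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all n tests avoid one with probability (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_tests

fwer_20 = family_wise_error_rate(0.05, 20)
print(f"{fwer_20:.2%}")
```

With a threshold of 0.05 and 20 tests, this works out to roughly 64%. In other words, even if none of the 10 genes did anything, I would be more likely than not to find at least one "significant" result.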
There are a number of ways researchers can correct for this, including the Bonferroni correction. Here, the accepted p-value threshold is divided by the number of significance tests carried out: If I did 20, then my p-value threshold becomes 0.0025 (0.05 ÷ 20); if my p-value is above this, then I cannot reject the null hypothesis, as per usual. This is what I should do, if I’m being honest and scientifically robust.
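The Bonferroni correction is simple enough to sketch in a few lines. The function names and the list of p-values below are my own, invented purely for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that survives the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

threshold = bonferroni_threshold(0.05, 20)

# Hypothetical p-values from 20 tests, invented for illustration.
p_values = [0.03, 0.001] + [0.5] * 18
flags = significant_after_bonferroni(p_values)
print(threshold, flags[:2])
```

Here, a p-value of 0.03 would pass the usual 0.05 threshold but fails the corrected one, while 0.001 survives. The price of the correction is the balancing act mentioned earlier: a stricter threshold means more false negatives.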
What p-hacking entails is doing a number of statistical tests, seeing which are significant, and then selectively reporting the tests you did in your paper. So, in my caffeine and genotype example, I would have done 20 statistical tests, with a p-value of 0.05 as my threshold. Having done these tests, I found that caffeine enhanced performance to a greater extent in subjects with a certain type of one gene, CYP1A2, than in those with the other type of that gene.
The p-value for this statistical test is 0.03; below my threshold of 0.05. All the other tests I ran showed p-values of anywhere between 0.2 and 1.0, meaning that, for those genes, I cannot reject the null hypothesis, and so I have to state that I found no evidence those genes affect the size of performance enhancement seen following intake of caffeine. Because null results aren’t as interesting as positive results, and because journals are biased toward publishing interesting results, I decide to write my paper by just looking at CYP1A2 and caffeine. In my paper, I therefore “pretend” that I’ve only carried out one statistical test. I report the p-value as 0.03, below the threshold of significance (0.05), thereby appearing to demonstrate that caffeine is more performance enhancing in some people than others.
Of course, what I should have done was correct my p-values for multiple hypothesis testing; I actually ran 20 tests (even if I didn’t publish them), meaning my p-value threshold should have been 0.0025. In reality, this gene had no clear effect on the size of performance enhancement following caffeine consumption, but by selectively reporting which statistical tests I did, I can make it seem like it did. And this, in a nutshell, is what p-hacking is.
There are other ways I can p-hack. I might carry out my analysis, find that I’m very close to a significant p-value (whether that’s 0.05 or something else), and then go back into the data and make changes so that I am engineering a “successful” p-value. For example, in my caffeine study, I might find that in the caffeine trial my subjects lifted more weight, but with a p-value of 0.08—close to my threshold of 0.05, but not quite there.
So, I go back and play around with the data: What happens when I remove males from the analysis? Or what if I remove those with more than four years of training history? Or perhaps subject 17 only had a 0.5% improvement while all others had a 6% improvement, leading me to believe that he/she didn’t really try, so I can remove them from the analysis. Often, there are both legitimate and innocent reasons for removing some subjects from data analysis, which is fine—provided it’s not being done to manufacture a significant result.
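To see why this kind of "playing around" matters, here is a rough simulation. Every study in it is one where caffeine truly does nothing, yet I allow myself to try a few exclusion rules and keep whichever gives the smallest p-value. The exclusion rules, the 5kg spread, and the simple z-test (used instead of a t-test to avoid extra libraries) are all assumptions made for the sake of the sketch.

```python
import random
from statistics import NormalDist

SD = 5.0  # assumed known spread of 1RM changes in kg (hypothetical)

def p_value(diffs):
    """Two-sided z-test p-value for 'mean difference is zero', with known SD."""
    z = (sum(diffs) / len(diffs)) / (SD / len(diffs) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def smallest_p_after_exclusions(diffs):
    """Try several exclusion rules and keep whichever gives the smallest p-value."""
    half = len(diffs) // 2
    candidates = [
        diffs,              # everyone
        diffs[:half],       # e.g., remove one subgroup
        diffs[half:],       # e.g., remove the other subgroup
        sorted(diffs)[1:],  # drop the subject with the worst response
    ]
    return min(p_value(c) for c in candidates)

rng = random.Random(2)
n_studies = 3000
hacked = 0
for _ in range(n_studies):
    diffs = [rng.gauss(0.0, SD) for _ in range(20)]  # caffeine has no true effect
    if smallest_p_after_exclusions(diffs) < 0.05:
        hacked += 1

hacked_rate = hacked / n_studies
print(f"'Significant' studies after flexible exclusions: {hacked_rate:.1%}")
```

Even with only four analysis options, the share of "significant" no-effect studies climbs well above the 5% I claimed to be accepting. Each individual exclusion might sound reasonable; the problem is choosing between them after seeing which one works.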
Hypothesizing After the Results are Known
P-hacking also has a close cousin: HARKing, where HARK stands for Hypothesizing After the Results are Known. Here, researchers generate a hypothesis after they have analyzed their data. Again, this is frowned upon: the purpose of a statistical test is often to test a hypothesis, which means that such a hypothesis has to exist before the test is used. Similar to p-hacking, HARKing increases the risk of a type I error, which is one reason why replicating such research often proves impossible, and hence a contributor to the replication crisis.
The world of science is well aware of these issues, and the dangers of them potentially undermining public confidence. There are a number of practices being put in place by the various journals in an attempt to guard against both p-hacking and HARKing. These include open data sharing, where researchers upload their raw data as a supplement to their paper, for all to analyze. A second approach is the pre-registration of study designs; here, researchers state what they are going to do, what their hypothesis is, and how they’re going to analyze their data, before they actually do it—preventing both p-hacking and HARKing.
Another potential solution that has been proposed is to make the threshold required for statistical significance stricter, i.e., to lower the p-value cutoff (although not everyone agrees).
Finally, we could drop p-values altogether. This approach has gained popularity in sports science research in recent years, in part because p-values might not be all that useful to researchers in the field. Instead of p-values, researchers report effect sizes along with the probability that the effect is large enough to matter.
Perhaps the most popular approach here is magnitude-based inference (MBI), developed by Will Hopkins and Alan Batterham. However, MBI has recently been heavily criticized by other statisticians, with at least one journal stating it won’t accept papers that use the method. Nevertheless, the approaches detailed here will hopefully help address the replication crisis and increase public confidence in the scientific process. Given how central science is to society as a whole, this is hugely important.