Over the years, sports coaching and training have become much more scientific. Today, training decisions and processes are expected to have at least some form of evidence base, and the expectation for coaches and support staff (and even athletes) to be scientifically literate has increased. The advent of the internet has also led to an increase in PubMed warriors (myself included), and there is an ever-increasing number of scientific papers that can be found and cited to support a person’s position.
Central to most scientific research is the idea of statistical significance. As part of my day job, I run education programs for fitness professionals from personal trainers to those in elite sports. When I’m presenting data, a common question will be, “Is it significant?” The unsaid assumption is that, if it is not statistically significant, the results are worthless.
Researchers determine statistical significance through the use of p-values, which you will have come across if you’ve read scientific papers. Typically, if the p-value is less than or equal to 0.05 (written as p≤0.05), the effect is said to be statistically significant; if it’s greater than 0.05 (i.e., p>0.05), it is said not to be statistically significant.
The Problems With P-Values
There are a few problems with the p-value approach, and these often lead to misunderstandings of research among sports science practitioners. This, in turn, can lead to the misinterpretation and, therefore, the misuse of research findings.
Let me illustrate this with an example. Let’s say I want to test the effects of caffeine on vertical jump height. What I would likely do is gather a group of athletes—let’s say 40—and get them to do two trials: one with caffeine, one without. I collect the data and then analyze it using a t-test, and I come up with the following conclusion:
Subjects jumped significantly higher (p<0.05) in the caffeine trial than the placebo trial.
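For readers who like to see the mechanics, here is a minimal sketch of that analysis in Python. The jump heights are made up, and the roughly 1.5cm “caffeine boost” is purely an assumption for illustration; because each athlete completes both trials, a paired t-test is the natural choice.

```python
# Minimal sketch of the analysis described above, using made-up jump
# heights (cm) rather than real data. Each athlete is measured in both
# conditions, so a paired t-test is appropriate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_athletes = 40

placebo = rng.normal(loc=45.0, scale=5.0, size=n_athletes)            # placebo-trial jumps (cm)
caffeine = placebo + rng.normal(loc=1.5, scale=2.0, size=n_athletes)  # assumed ~1.5 cm boost

t_stat, p_value = stats.ttest_rel(caffeine, placebo)
print(f"Mean difference: {np.mean(caffeine - placebo):.2f} cm")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With numbers like these, the test comes back comfortably below the 0.05 threshold.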
From this, I can conclude that caffeine improves jump height. But what is the p-value really telling us? Have a think and come up with your own explanation of what the p-value actually is. Seriously, do it—I’m not going anywhere.
Thought about it? Great! I also asked this question on Facebook, and got answers from a number of coaches and support staff—people who I would consider scientifically literate. Far and away, the most common description of the p-value was that it tells us “whether the effect is down to chance or not.” So, in the caffeine example, a p-value of <0.05 tells us that there is a less than 5% chance of these results being down to chance. Why 5%? Well, it’s arbitrary, but it’s a nice round number that has caught on, so it gets used.
Here’s the problem: That explanation is not quite correct.
To explain why, I need to explain what really happens when most people do an experiment. The common method used in research is that of Null Hypothesis Significance Testing (NHST). This means that we have two hypotheses: the null hypothesis and the alternative hypothesis. So, in my caffeine example, my hypotheses are:
- Caffeine improves vertical jump height (alternative hypothesis).
- Caffeine does not improve vertical jump height (null hypothesis).
When I’m calculating the p-value, what I’m really doing is deciding whether or not I can reject the null hypothesis. Scientific research is set up to disprove the null hypothesis. If p≤0.05, I can reject it; if it’s >0.05, I can’t. Rejecting the null hypothesis means I conclude that there is a difference between the conditions.
What the p-value actually tells us is the probability of getting a result at least as extreme as the one we observed, assuming the null hypothesis is true. When p=0.05, this means that, if caffeine really had no effect, there would only be a 5% chance of seeing a difference at least this large; if p=0.01, that chance would be just 1%. Note what this is not: it is not the probability that the results are “down to chance,” and it is not the probability that we have made a false positive (a Type-I error). It only tells us how compatible our data are with the null hypothesis. What the 0.05 threshold does do is cap our long-run false-positive (Type-I error) rate at 5% whenever the null hypothesis is actually true.
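One way to build intuition for this is a quick simulation, sketched below with invented numbers: imagine a world where caffeine truly does nothing, run the jump experiment thousands of times, and count how often the t-test still comes back “significant.”

```python
# Sketch: if the null hypothesis were true (caffeine does nothing), how
# often would a paired t-test on 40 athletes still give p <= 0.05?
# Roughly 5% of the time, which is precisely the false-positive (Type-I)
# rate we accept by choosing the 0.05 threshold. All numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_athletes, n_experiments, alpha = 40, 10_000, 0.05

false_positives = 0
for _ in range(n_experiments):
    placebo = rng.normal(45.0, 5.0, n_athletes)
    caffeine = placebo + rng.normal(0.0, 2.0, n_athletes)  # true effect is zero
    _, p = stats.ttest_rel(caffeine, placebo)
    if p <= alpha:
        false_positives += 1

print(f"Proportion of 'significant' results despite no real effect: "
      f"{false_positives / n_experiments:.3f}")
```

The answer hovers around 0.05: the threshold controls how often we are fooled when there is nothing to find, but a single p-value from a single study does not tell us the probability that we have been fooled.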
P-Values and the Size of the Effect
So far, we have perhaps been largely arguing about semantics, so let me pose a follow-up question. Returning to my caffeine trial, I add a second group of athletes. The first group takes 3mg/kg of caffeine or a placebo; the second group takes 6mg/kg or a placebo. Here is the main finding:
Subjects jumped significantly higher in both the 3mg/kg (p=0.04) and 6mg/kg (p=0.004) caffeine trials compared to placebo.
My question to you is this: Is 6mg/kg of caffeine more effective than 3mg/kg of caffeine at improving vertical jump height?
Again, when I asked this on Facebook, most people answered yes. The correct answer is: you can’t tell. This is the main issue with p-values and NHST; they don’t tell you the size (or, more correctly, the magnitude) of the effect. So, while we can be more confident about correctly rejecting the null hypothesis in the 6mg/kg trial, we can’t be more confident that the effect was greater.
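To make this concrete, here is a small, invented example (not the trial above) in which the condition with the far smaller p-value actually produces the smaller improvement, simply because it was measured more precisely and in more athletes.

```python
# Sketch: a smaller p-value does not mean a bigger effect. In these
# invented numbers, condition A improves jumps by ~2 cm in 20 noisy
# measurements, while condition B improves jumps by only ~1 cm but is
# measured more precisely in 100 athletes. B gets the far smaller
# p-value despite the smaller improvement (and smaller effect size).
import numpy as np
from scipy import stats

def improvements(mean_cm, sd_cm, n, seed):
    """Per-athlete jump improvements with the requested mean and SD."""
    x = np.random.default_rng(seed).normal(size=n)
    x = (x - x.mean()) / x.std(ddof=1)   # rescale to the requested mean and SD
    return mean_cm + sd_cm * x

cond_a = improvements(mean_cm=2.0, sd_cm=4.0, n=20, seed=1)
cond_b = improvements(mean_cm=1.0, sd_cm=2.5, n=100, seed=2)

for name, diff in [("A (2.0 cm gain)", cond_a), ("B (1.0 cm gain)", cond_b)]:
    t, p = stats.ttest_1samp(diff, 0.0)        # paired t-test on the differences
    cohen_d = diff.mean() / diff.std(ddof=1)   # standardized effect size
    print(f"{name}: mean gain = {diff.mean():.2f} cm, d = {cohen_d:.2f}, p = {p:.4f}")
```

Condition B’s p-value comes out far smaller than A’s, yet its average improvement, and its standardized effect size, are both smaller.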
To repeat, the p-value tells us nothing about the size of the effect. Something having greater significance does not necessarily have a greater effect. This is important when it comes to translating science into practice. Whilst 6mg/kg caffeine might significantly (p<0.05) improve vertical jump height, if the size of this effect is just 0.1cm, this might not have any real-world effect. For example, if you’re a high jumper, you can only move the bar higher in 1cm increments, so jumping 0.1cm higher has no real-world impact for you.
Similarly, I once read a research paper examining the effect of a specific type of training on mood. Mood was determined by a questionnaire, with each person scoring themselves out of 10. The training significantly improved (p<0.05) mood, but the average improvement in the training group was 0.2. Given that the scale only moves in whole numbers (1, 2, 3 … 10), an average improvement of 0.2 means that, roughly speaking, only about one subject in five improves by a full point (i.e., going from 1 to 2, or from 9 to 10). So how effective is the training really?
The scientific community is starting to wake up to the issues with p-values and NHST, and I am certainly not the first person to notice them. The American Statistical Association released a statement on this last year (https://doi.org/10.1080/00031305.2016.1154108). Sports scientist Martin Buchheit, from Paris Saint-Germain Football Club, recently authored a great editorial on the subject. In recent years, the godfather of statistical analysis in sports science, Will Hopkins, has proposed the use of Magnitude-Based Inferences (MBIs) to help practitioners understand the true size of the effect of an intervention, in order to determine whether it is useful or not. Journals are slowly starting to move away from reporting only p-values, requiring effect sizes to be reported as well.
All of this allows for the better use of science in sport. Right now, my concern is that athletes and coaches only look for statistical significance, and not real-world significance. An effect can be statistically significant due to a large sample size, but have no real-world effect. Conversely, an intervention can have no significant difference in terms of statistics (usually due to a small sample size), but have a large real-world effect. More pertinently, when comparing two different interventions, the difference in p-values between them doesn’t really tell us anything about the magnitude of these effects, which is more important.
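Both directions are easy to demonstrate with invented summary numbers; the sketch below computes paired t-test p-values directly from a mean change, its standard deviation, and the sample size.

```python
# Sketch: statistical significance is driven as much by sample size as by
# the size of the effect. With invented summary numbers, a trivial 0.1 cm
# gain is "significant" in a huge sample, while a meaningful 2 cm gain is
# not in a small one.
import math
from scipy import stats

def paired_p_value(mean_gain, sd_gain, n):
    """Two-sided p-value for a paired t-test, from summary statistics."""
    t = mean_gain / (sd_gain / math.sqrt(n))
    return 2 * stats.t.sf(abs(t), df=n - 1)

print(f"0.1 cm gain, SD 1.0, n = 2000: p = {paired_p_value(0.1, 1.0, 2000):.4g}")  # tiny effect, 'significant'
print(f"2.0 cm gain, SD 4.0, n = 8:    p = {paired_p_value(2.0, 4.0, 8):.4g}")     # big effect, not 'significant'
```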
Finally, to complicate things further, recent research has illustrated that there is a substantial amount of inter-individual variation in response to an intervention, such that even if an intervention has no statistically significant effect on the group average, the effect can be huge for individuals within the group. As confusing as this might be, having a working knowledge of what a p-value is, and knowing its limitations, is crucial to successfully translating science into practice.
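As a final sketch, suppose (purely hypothetically) that half of a squad are genuine responders to an intervention and half respond negatively. The group average then sits near zero and the significance test sees nothing, even though the intervention matters a great deal to each individual athlete.

```python
# Sketch of inter-individual variation, with invented numbers: half the
# athletes are assumed to gain about 3 cm ("responders") and half to lose
# about 3 cm ("non-responders"). The group mean change sits near zero, so
# a t-test on the average is very unlikely to be significant, even though
# the intervention clearly matters to individuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_athletes = 20

true_response = np.where(np.arange(n_athletes) < n_athletes // 2, 3.0, -3.0)  # cm
observed_change = true_response + rng.normal(0.0, 1.0, n_athletes)            # plus measurement noise

t, p = stats.ttest_1samp(observed_change, 0.0)
print(f"Group mean change: {observed_change.mean():+.2f} cm, p = {p:.2f}")
print(f"Individual changes range from {observed_change.min():+.1f} to {observed_change.max():+.1f} cm")
```

In a case like this, chasing group-level significance alone would hide exactly the information a coach cares about.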
Recommended Reading:
The problem with p values: how significant are they, really?