Understanding P-Values: A Practical Guide for Coaches

Summary

While statistical significance is the basis of most scientific research, there are problems with the way that statistical significance is determined. Coach Craig Pickering takes a look at the use of p-values to make a case for real-world effect and the misinterpretation of research.


P Values

Over the years, sports coaching and training have become much more scientific. Today, training decisions and processes are expected to have at least some form of evidence base, and the expectation for coaches and support staff (and even athletes) to be scientifically literate has increased. The advent of the internet has also led to an increase in PubMed warriors (myself included), and there is an ever-increasing number of scientific papers out there that can be found and cited to support a person’s position.

Central to most scientific research is the idea of statistical significance. As part of my day job, I run education programs for fitness professionals from personal trainers to those in elite sports. When I’m presenting data, a common question will be, “Is it significant?” The unsaid assumption is that, if it is not statistically significant, the results are worthless.

Researchers determine statistical significance through the use of p-values, which you will have come across if you’ve read scientific papers. Typically, if the p-value is said to be less than or equal to 0.05 (represented as p≤0.05), then the effect is said to be statistically significant. If it’s greater than 0.05 (i.e., p>0.05), then it’s said not to be statistically significant.

The Problems With P-Values

There are a few problems with the p-value method, and they often lead sports science practitioners to misunderstand research. This, in turn, can lead to the misinterpretation, and therefore the misuse, of research findings.

Let me illustrate this with an example. Let’s say I want to test the effects of caffeine on vertical jump height. What I would likely do is gather a group of athletes—let’s say 40—and get them to do two trials: one with caffeine, one without. I collect the data and then analyze it using a t-test, and I come up with the following conclusion:

Subjects jumped significantly higher (p<0.05) in the caffeine trial than the placebo trial.

From this, I can conclude that caffeine improves jump height. But what is the p-value really telling us? Have a think and come up with your own explanation of what the p-value actually is. Seriously, do it—I’m not going anywhere.

Thought about it? Great! I also asked this question on Facebook, and got answers from a number of coaches and support staff—people who I would consider scientifically literate. Far and away, the most common description of the p-value was that it tells us “whether the effect is down to chance or not.” So, in the caffeine example, a p-value of <0.05 tells us that there is a less than 5% chance of these results being down to chance. Why 5%? Well, it’s arbitrary, but it’s a nice round number that has caught on, so it gets used.

Here’s the problem: That explanation is not quite correct.

To explain why, I need to explain what really happens when most people do an experiment. The common method used in research is that of Null Hypothesis Significance Testing (NHST). This means that we have two hypotheses: the null hypothesis and the alternative hypothesis. So, in my caffeine example, my hypotheses are:

Caffeine improves vertical jump height (alternative hypothesis).
Caffeine does not improve vertical jump height (null hypothesis).

When I’m calculating the p-value, what I’m really doing is deciding whether or not I can reject the null hypothesis. Scientific research is set up to disprove the null hypothesis. If p≤0.05, I can reject it; if it’s >0.05, I can’t. Rejecting the null hypothesis means concluding that there is a difference between the conditions.

What the p-value actually tells us is the probability of getting a result at least as extreme as this if the null hypothesis is correct. When p=0.05, this means that, if the null hypothesis were true, there would be a 5% chance of getting a result at least this extreme. If p=0.01, that chance would be 1%. It is not the probability that the null hypothesis is true, or that the result is down to chance. The 0.05 threshold, on the other hand, does control errors: if we reject the null hypothesis whenever p≤0.05, then in cases where the null hypothesis really is true we will falsely reject it about 5% of the time. That false positive is known as a Type-I error.
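To make that definition concrete, here is a minimal sketch of an exact sign-flip (permutation) test on paired differences. It is not the t-test from my example, just a simpler stand-in that shows the same logic: assume the null hypothesis is true, then count how often a result at least as extreme as the observed one turns up. The function name and the jump-height numbers are made up for illustration.

```python
from itertools import product
from statistics import mean


def sign_flip_p_value(differences):
    """One-sided exact p-value for paired differences (caffeine minus placebo).

    If caffeine does nothing, each athlete's difference is as likely to be
    negative as positive, so every pattern of signs is equally likely.
    We enumerate all patterns and count how many give a mean difference
    at least as large as the one actually observed.
    """
    observed = mean(differences)
    magnitudes = [abs(d) for d in differences]
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(magnitudes)):
        simulated = mean(s * m for s, m in zip(signs, magnitudes))
        # Small tolerance so floating-point rounding does not drop exact ties.
        if simulated >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical differences in jump height (cm) for 10 athletes.
differences = [1.2, 0.8, -0.3, 1.5, 0.9, 0.4, -0.1, 1.1, 0.7, 0.6]
p = sign_flip_p_value(differences)
print(f"p = {p:.4f}")
```

The number printed is the share of "null hypothesis worlds" that look at least as impressive as the real data. It says nothing about how likely the null hypothesis itself is.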

P-Values and the Size of the Effect

So far, perhaps we have been largely arguing about semantics. Let me introduce the next question for you. Returning to my caffeine trial, I add an additional group of athletes. The first group take 3mg/kg or placebo. The second group take 6mg/kg or placebo. Here is the main finding:

Subjects jumped significantly higher in both the 3mg/kg (p=0.04) and 6mg/kg (p=0.004) caffeine trials compared to placebo.

My question to you is this: Is 6mg/kg of caffeine more effective than 3mg/kg of caffeine at improving vertical jump height?

Again, when I asked this on Facebook, most people answered yes. The correct answer is: you can’t tell. This is the main issue with p-values and NHST; they don’t tell you the size (or, more correctly, the magnitude) of the effect. So, while we can be more confident about correctly rejecting the null hypothesis in the 6mg/kg trial, we can’t be more confident that the effect was greater.


To repeat, the p-value tells us nothing about the size of the effect. Something having greater significance does not necessarily have a greater effect. This is important when it comes to translating science into practice. Whilst 6mg/kg caffeine might significantly (p<0.05) improve vertical jump height, if the size of this effect is just 0.1cm, this might not have any real-world effect. For example, if you’re a high jumper, you can only move the bar higher in 1cm increments, so jumping 0.1cm higher has no real-world impact for you.
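A rough sketch of how this can happen, using invented numbers: below, the low-dose group improves more on average, but the athletes respond very inconsistently, while the high-dose group improves only a little, but almost identically. The helper name and both data sets are hypothetical. With the same number of athletes in each group, a larger t-statistic corresponds to a smaller p-value, so the smaller effect ends up looking "more significant."

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(differences):
    """Return (mean difference, paired t-statistic) for a list of differences."""
    n = len(differences)
    average = mean(differences)
    standard_error = stdev(differences) / sqrt(n)
    return average, average / standard_error


# Hypothetical caffeine-minus-placebo differences in jump height (cm).
low_dose = [4.0, -1.0, 5.0, 0.0, 6.0, -2.0, 3.0, 1.0]   # big but noisy
high_dose = [0.5, 0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.5]    # small but consistent

low_mean, low_t = paired_summary(low_dose)
high_mean, high_t = paired_summary(high_dose)

print(f"3mg/kg: mean difference {low_mean:.2f} cm, t = {low_t:.2f}")
print(f"6mg/kg: mean difference {high_mean:.2f} cm, t = {high_t:.2f}")
```

Reporting the mean difference alongside the test result is what lets a coach judge whether the change matters in practice.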

Similarly, I once read a research paper examining the use of a specific type of training on mood. Mood was determined by a questionnaire, with each person scoring themselves out of 10. The training significantly improved (p<0.05) mood, but the average improvement in the training group was 0.2. Given that the scale used was 1, 2, 3 … 10, an average improvement of 0.2 is equivalent to just one subject in five moving up a single point on the scale (i.e., going from 1 to 2, or 9 to 10). So how effective is the training really?

The scientific community is starting to wake up to the issues with p-values and NHST, and I am certainly not the first person to notice them. The American Statistical Association released a statement on this last year. Sports scientist Martin Buchheit, from Paris Saint-Germain Football Club, recently authored a great editorial on the subject. In recent years, the godfather of statistical analysis in sports science, Will Hopkins, has proposed the use of Magnitude-Based Inferences (MBIs) to help practitioners understand the true size of the effect of an intervention, in order to determine whether it is useful or not. Journals are slowly starting to move away from reporting p-values alone, requiring effect sizes to be reported as well.

All of this allows for the better use of science in sport. Right now, my concern is that athletes and coaches only look for statistical significance, and not real-world significance. An effect can be statistically significant due to a large sample size, but have no real-world effect. Conversely, an intervention can have no significant difference in terms of statistics (usually due to a small sample size), but have a large real-world effect. More pertinently, when comparing two different interventions, the difference in p-values between them doesn’t really tell us anything about the magnitude of these effects, which is more important.
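The sample size point can be sketched with a little arithmetic. For paired data, the t-statistic is the mean difference divided by its standard error, and the standard error shrinks as the number of athletes grows. The function names, the 2.0cm standard deviation, and the threshold of 2.0 are all assumptions for illustration; 2.0 is only a rough stand-in for p≈0.05 with a reasonably large sample, and the exact cut-off depends on the degrees of freedom.

```python
from math import ceil, sqrt


def t_statistic(effect_cm, sd_cm, n):
    """Paired t-statistic for a given mean difference, SD of differences, and n."""
    return effect_cm / (sd_cm / sqrt(n))


def athletes_needed(effect_cm, sd_cm, t_threshold=2.0):
    """Approximate sample size at which the t-statistic reaches the threshold."""
    return ceil((t_threshold * sd_cm / effect_cm) ** 2)


# A trivial 0.1cm effect: far from the threshold with 40 athletes,
# but past it with 2,000.
print(t_statistic(0.1, 2.0, 40))
print(t_statistic(0.1, 2.0, 2000))
print(athletes_needed(0.1, 2.0))

# A large 3cm effect in a noisy group of 5 athletes still falls short.
print(t_statistic(3.0, 4.0, 5))
```

Under these assumptions, a study only needs to be big enough for a 0.1cm change to cross the line, while a 3cm change in a handful of athletes does not.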


Finally, to complicate things further, recent research has illustrated that there is a substantial amount of inter-individual variation in response to an intervention, such that even if an intervention has no statistically significant effect on the group average, the effect can be huge for individuals within that group. As confusing as this might be, having a working knowledge of what a p-value is, and knowing its limitations, is crucial for successfully translating science into practice.

Author

Craig Pickering

As a former professional athlete in both track (100m) and bobsled, Craig competed in five World Championships and two Olympic Games, and he’s one of only eight British athletes to be selected for both a Summer and Winter Olympics. Since retiring, Craig has been working as Head of Sports Science at DNAFit, along with a number of other consultancy roles, including sports coaching. He’s also currently studying for a professional doctorate in Elite Performance.

