The commercialization of sport has led to an increased emphasis on getting an edge over the opponent in any (mostly legal) way possible. Historically, this was achieved through improved training techniques aimed at enhancing physical performance or reducing injury rates. Over the last few years, however, there has been a focus on how the backroom staff collects and utilizes data. This has naturally fed into an increased emphasis on how data is used to make decisions, and more and more sport scientists are tending to “borrow” from other disciplines, such as computer science and statistics, to help them make better use of this data.
As a result, we’ve seen the rise of the data scientist (or at least of sports scientists who are comfortable using data) within sport. Prominent examples include Mladen Jovanovic, whose website I fully recommend, and Sam Robertson, a researcher from Victoria University who is embedded within the Western Bulldogs AFL team as Head of Research and Innovation. Additionally, a number of leading sports organizations, such as the New South Wales Institute of Sport and UK Sport, have recently advertised for data science positions.
Consequently, it is probably a good idea for people involved in sport to at least have some idea of what these roles add to the athlete preparation sphere. In this article, I aim to explore machine learning and its close cousin, data mining, in order to shed some light on what information we can expect to gain from these practices that are emerging in sport.
What Is Machine Learning?
First, some definitions. Machine learning refers to the process by which a computer system uses data to train itself to make better decisions. So, if we input a set of data (such as that from a GPS system) along with injury data across a season, the software will try to create a model that predicts which players got injured. We can then feed in additional information, such as the next season’s injury data, and the computer will again try to predict injuries; this time, though, it will also look for corrections to the calculations it makes in order to improve its predictions. Which calculations were unnecessary, for example, or which data point was previously given too much weight? We can then add more data, such as player wellness scores, ratings of perceived exertion, etc., and the program will continue to make these calculations, refining its output.
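To make the idea of a program "correcting its own calculations" a little more concrete, here is a minimal sketch of one common approach, logistic regression fitted by gradient descent. It is an illustration only, not the method used in any of the studies discussed here. The two inputs (weekly high-speed running and previous injury) and every number in the data are made up for the example.

```python
import math

# Hypothetical, made-up records: [weekly high-speed running (km), previous injury (0/1)]
FEATURES = [[2.1, 0], [2.4, 0], [3.0, 0], [3.2, 1], [4.5, 0], [4.8, 1], [5.1, 1], [5.5, 1]]
INJURED = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up outcomes: 1 = injured that season


def predict_risk(weights, bias, row):
    """Return the model's injury probability for one player."""
    z = bias + sum(w * v for w, v in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(weights, bias, rows, labels):
    """Average penalty for the current predictions (lower is better)."""
    total = 0.0
    for row, y in zip(rows, labels):
        p = min(max(predict_risk(weights, bias, row), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def train(rows, labels, epochs=500, learning_rate=0.1):
    """Fit a logistic regression by repeatedly correcting the weights."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    losses = []
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            error = predict_risk(weights, bias, row) - y  # signed prediction error
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        # Nudge each weight against its share of the error.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        losses.append(mean_log_loss(weights, bias, rows, labels))
    return weights, bias, losses


weights, bias, losses = train(FEATURES, INJURED)
print(f"loss after first pass: {losses[0]:.3f}, after last pass: {losses[-1]:.3f}")
```

Each pass through the data measures how wrong the current predictions are and shifts the weight given to each input accordingly, which is the "too much weight previously" correction described above. The falling loss only shows that the model fits the data it has already seen; it says nothing yet about future seasons.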
The aim of this is to be able to predict what will happen in the future: for example, which player from your youth team will become a world-class player? Which type of training is best for a given athlete? How likely is a given person to become injured, and how does this change with exposure to specific types of competition or training?
As such, the quality of the prediction depends on the quality of the data that is put into the machine: garbage in, garbage out. This is where data mining comes in. Data mining is the extraction of patterns (and therefore knowledge) from large amounts of data. It essentially represents the first step of efficient machine learning: which parts of the data matter, and which can be discarded?
One of the advantages of the machine learning process within sport is that it allows us to better understand non-linear systems. Biological processes tend not to operate in a linear manner. This is important because, if we can only analyze data using linear methods (such as the “r” in standard correlation calculations), our understanding of these processes is hampered. As a simple example, let’s take the recent work of Tim Gabbett and his development of the Acute:Chronic workload ratio. Based on the findings of a number of papers, we now understand that both too much and too little training are risk factors for injury.
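For readers who have not met the Acute:Chronic workload ratio before, the sketch below shows one common way to calculate it: the average daily load over a recent "acute" window divided by the average daily load over a longer "chronic" window. The 7-day and 28-day windows, the function name, and the load values are assumptions for illustration. Published work uses several variants, so treat this as a sketch rather than the definitive formula.

```python
def acute_chronic_ratio(daily_loads, acute_days=7, chronic_days=28):
    """Ratio of recent average load to longer-term average load.

    daily_loads is ordered oldest to newest; the window lengths are
    assumed defaults, not values taken from the article.
    """
    if len(daily_loads) < chronic_days:
        raise ValueError("need at least chronic_days of load data")
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = sum(daily_loads[-chronic_days:]) / chronic_days
    if chronic == 0:
        raise ValueError("chronic load is zero, so the ratio is undefined")
    return acute / chronic


# Made-up loads in arbitrary units (e.g., sRPE x minutes).
steady_block = [400] * 28
sudden_spike = [400] * 21 + [800] * 7

print(acute_chronic_ratio(steady_block))  # load unchanged: ratio of 1.0
print(acute_chronic_ratio(sudden_spike))  # final week doubled: ratio of 1.6
```

A ratio well above 1 means the athlete is currently doing much more than they have been prepared for, and a ratio well below 1 means they are doing much less, which maps onto the "too much and too little" finding above.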
If we plot this on a graph, with training load on the x-axis and injury risk on the y-axis, the result is not a straight line but a curvilinear relationship in the shape of a U. As such, standard linear statistical methods are insufficient for understanding this relationship, and we need to start building slightly more complex models. Adding more and more data types (such as wellness, age, previous injury history, sleep duration, and other aspects associated with an increased injury risk) increases the complexity of the modeling required.
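A quick way to see why a linear measure struggles with a U-shaped relationship is to calculate the correlation coefficient on data that follow a perfect U. In the sketch below, the workload ratios and the risk formula are invented purely for illustration, and the assumption that risk is lowest at a ratio of 1.0 is mine, not a finding from the literature.

```python
import math


def pearson_r(xs, ys):
    """Standard Pearson correlation coefficient between two lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up workload ratios and a made-up U-shaped risk curve centered on 1.0.
workload = [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]
risk = [0.05 + 0.5 * (w - 1.0) ** 2 for w in workload]

# A straight-line view: workload against risk.
r_linear = pearson_r(workload, risk)

# A curved view: squared distance from the assumed "sweet spot" against risk.
distance_from_sweet_spot = [(w - 1.0) ** 2 for w in workload]
r_curved = pearson_r(distance_from_sweet_spot, risk)

print(f"linear r: {r_linear:.3f}, curved r: {r_curved:.3f}")
```

On these invented numbers the linear correlation is essentially zero, even though risk is completely determined by workload, while the curved version picks the relationship up. A correlation of zero therefore does not mean "no relationship"; it can simply mean "no straight-line relationship."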
Another important aspect to consider is the difference between explaining what has happened and predicting what will happen in the future. Explaining why an athlete has previously been injured allows us to identify some potential risk factors. Age, for example, has been found to be a risk factor for hamstring injury. As a result, we can state that age is associated with hamstring injury in athletes. But can we then use this information to predict future injury? To do this, we need what is termed a “holdout set”: a set of data that was not used to build the statistical model and that is used instead to test how well that model predicts new outcomes (the data used to create the model is termed the “training” set).
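The training/holdout idea can be sketched in a few lines. Here a deliberately simple "model" (a single age cutoff) is chosen using two seasons of data and then checked against a third season it has never seen. All records, field names, and the cutoff rule are hypothetical and exist only to show the workflow.

```python
# Made-up player-season records: (season, age, injured).
RECORDS = [
    (1, 21, False), (1, 23, False), (1, 25, True), (1, 29, True),
    (2, 22, False), (2, 24, False), (2, 27, True), (2, 30, True),
    (3, 22, True), (3, 24, False), (3, 28, False), (3, 31, True),
]


def split_by_season(records, holdout_season):
    """Keep one season aside; everything else is training data."""
    training = [r for r in records if r[0] != holdout_season]
    holdout = [r for r in records if r[0] == holdout_season]
    return training, holdout


def accuracy(records, age_cutoff):
    """Share of players for whom 'older than cutoff' matches the outcome."""
    hits = sum((age > age_cutoff) == injured for _, age, injured in records)
    return hits / len(records)


def fit_age_cutoff(training):
    """Pick the cutoff that best explains the training seasons only."""
    candidates = sorted({age for _, age, _ in training})
    return max(candidates, key=lambda cutoff: accuracy(training, cutoff))


training, holdout = split_by_season(RECORDS, holdout_season=3)
cutoff = fit_age_cutoff(training)
print(f"cutoff: {cutoff}")
print(f"training accuracy: {accuracy(training, cutoff):.2f}")
print(f"holdout accuracy: {accuracy(holdout, cutoff):.2f}")
```

With these invented records, the cutoff explains the training seasons perfectly but is right only half the time on the held-out season. The important design choice is that the holdout season plays no part in choosing the cutoff; if it did, the holdout score would no longer tell us anything about the future.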
Obviously, in sport, it is far more important to predict what will happen in the future than explain what has happened in the past. A good example of this is a recent paper from the journal Medicine and Science in Sports and Exercise. Here, researchers collected data from a group of professional soccer players over the course of five seasons. They collected hamstring injury prevalence and severity, “exposure” time (such as time spent training and playing), anthropometric data, and information on a number of different genes. They then plugged this data into a statistical model, finding that the following were significantly associated with hamstring injury during that five-season period:
- Seven genetic variants
- Previous hamstring injury
- Age (with players over 24 more likely to become injured)
Furthermore, if the researchers selected two players at random, the probability that the player with the higher injury risk (as determined by the model) was the one who suffered an injury was around 75%, which is pretty solid. This represents the training data stage.
The next step was to use this model, and its related inputs, to “predict” future injury using holdout data. In this case, the researchers used data from the following season, in which 67 players suffered 31 hamstring injuries. Here, if the researchers selected two players at random, the probability that the player with the higher injury risk (as determined by the model) was the one who suffered an injury was around 50%, which is essentially the same as flipping a coin (i.e., chance). So, while this model was useful in explaining previous hamstring injury, it did not predict future injury rates well at all.
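The "pick two players" probability quoted above can be calculated directly. The sketch below assumes the usual reading of this statistic: one player in each pair was injured and the other was not, and we count how often the model gave the injured player the higher risk score. This is commonly reported as the area under the ROC curve. Whether the paper calculated it in exactly this way is my assumption, and the risk scores are made up.

```python
def concordance(risk_scores, injured):
    """Share of (injured, uninjured) pairs ranked correctly by the model.

    Ties in risk score count as half a correct ranking.
    """
    injured_scores = [s for s, hurt in zip(risk_scores, injured) if hurt]
    healthy_scores = [s for s, hurt in zip(risk_scores, injured) if not hurt]
    if not injured_scores or not healthy_scores:
        raise ValueError("need at least one injured and one uninjured player")
    correct = 0.0
    for hurt_score in injured_scores:
        for healthy_score in healthy_scores:
            if hurt_score > healthy_score:
                correct += 1.0
            elif hurt_score == healthy_score:
                correct += 0.5
    return correct / (len(injured_scores) * len(healthy_scores))


# Made-up model outputs for six players, two of whom were injured.
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
outcomes = [True, True, False, False, False, False]
print(concordance(scores, outcomes))  # 7 of 8 pairs ranked correctly: 0.875

# A model that ranks no better than chance on another made-up group.
print(concordance([0.5, 0.2, 0.6, 0.1], [True, True, False, False]))  # 0.5
```

A value of 1.0 would mean the model always ranks the injured player higher, and 0.5 means it does no better than a coin flip, which is where the hamstring model landed on the holdout season.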
The reasons for this lack of predictive ability are likely varied. The first is that the strength of any predictive model is enhanced by the total amount of data it is trained on. A model trained on 1,000 players will typically outperform a model trained on 100 players. This is obviously problematic in professional sport, because the average first-team squad in most sports ranges from 20 to 50 players, and most teams do not want to share their data.
In individual sports governed by a central federation, it might be easier to overcome the problem of sample size, although, by definition, the number of elite athletes is always going to be very low. Furthermore, sporting injuries are notoriously multifactorial, as demonstrated in a seminal paper by Roald Bahr and Tron Krosshaug. As a result, any statistical model aimed at predicting injury risk would need a great number of data inputs covering the various individual risk factors, whereas the model used to predict hamstring injury in the paper under discussion used only a limited number.
As a result, it’s clear that, for complex outcomes such as injury risk—which is highly multifactorial in nature—we need a large number and range of data inputs. However, for more “simple” outcomes (and by “simple” I mean affected by a small number of variables), less complex models may hold promise. An example of this is muscle fiber type, which is largely influenced by genetic factors.
Understanding an individual’s muscle fiber type may be useful when it comes to selecting various training modalities and variables, but at present there are only a limited number of ways to determine it. We could take a muscle biopsy, which is highly invasive and somewhat damaging to the muscle, or we could use some sort of test, such as a vertical jump, to predict muscle fiber type. A recent paper explored the effectiveness of a model utilizing seven different genetic variants to predict muscle fiber type, finding that it was pretty accurate. As a result, for simpler outcomes, such as muscle fiber type, a less complex model can be useful, while complex outcomes often require a complex model.
From Data to Decision-Making
A further example of how we might be able to utilize machine learning as a way to support better decision-making was reported in a conference paper from late 2017. Here, researchers from Belgium utilized a machine learning tool to optimize training load based on the prediction of session rating of perceived exertion (sRPE). They collected data from 61 training sessions of elite Belgian soccer teams, where the players wore data collection sensors, allowing the researchers to gain insight into metrics such as speed, distance covered, and heart rate.
Additionally, after each training session, the players reported their sRPE for that session. Further inputs, such as environmental temperature, humidity, age, baseline fitness, muscle fiber type, and others were all added to the model. Overall, the model performed well, providing coaches with the ability to predict sRPE before the session occurred, which has some obvious benefits: Individual training session load and intensity can be modified prior to the session occurring based on real-time data to ensure that the required outcomes are met.
Similar results have also been recently reported when attempting to predict the risk of injury in a group of soccer players. Here, the authors utilized a variety of inputs based around individual player anthropometry (e.g., height, weight, age), sporting factors (e.g., position), GPS metrics, and various other workload-related aspects, such as previous training load. Their model could detect around 80% of injuries, which is better than currently available estimation techniques.
Additionally, the model had very few false positives; this means that few of the players flagged as being at high risk of injury went on to stay injury-free. This is important, because incorrectly suggesting a player is at an increased risk of injury can lead to needlessly missed training sessions, and possibly even missed competitions. A machine learning approach utilizing artificial neural networks has also been shown to correctly predict a player’s competitive level (i.e., Premier League vs. Championship) around 70% of the time when data such as passing accuracy and shots were utilized. Early research has also been undertaken to explore the use of machine learning in the development of optimal training programs.
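The two ideas in the injury example above, how many injuries the model detects and how many of its warnings are false alarms, are usually summarized as recall and precision. The sketch below calculates both from a list of flags and outcomes. The ten players and their values are invented, and the function name is my own.

```python
def detection_summary(flagged, injured):
    """Count hits and false alarms, then derive recall and precision."""
    pairs = list(zip(flagged, injured))
    true_pos = sum(1 for f, hurt in pairs if f and hurt)        # flagged and injured
    false_pos = sum(1 for f, hurt in pairs if f and not hurt)   # flagged, stayed healthy
    false_neg = sum(1 for f, hurt in pairs if not f and hurt)   # missed injuries
    detected = true_pos + false_neg
    warned = true_pos + false_pos
    return {
        "false_positives": false_pos,
        # Recall: share of actual injuries the model flagged in advance.
        "recall": true_pos / detected if detected else None,
        # Precision: share of flagged players who really were injured.
        "precision": true_pos / warned if warned else None,
    }


# Made-up squad of ten: five injuries, five flags, one flag wasted.
injured = [True, True, True, True, True, False, False, False, False, False]
flagged = [True, True, True, True, False, True, False, False, False, False]

print(detection_summary(flagged, injured))
```

In this invented squad, the model catches four of the five injuries (recall of 0.8) and one flagged player stays healthy (precision of 0.8). Both numbers matter in practice: high recall with poor precision would mean resting a lot of players who were never going to get hurt.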
Clearly, the use of data mining and machine learning in sport holds promise. If we can predict what will happen in a given circumstance, then we can make interventions to guide us to the desired outcome. This is obviously going to be of great use when it comes to training program design and load management, hopefully improving athlete performance and reducing injury risk. The concept also has wider implications.
For example, these techniques could be utilized when developing tactical frameworks: within a team, which moves and passing networks lead to the greatest success? Data mining can also be used with competition data to better understand the underlying aspects that are most associated with success. For example, if teams that win perform a certain skill better than others, it allows for the use of targeted technical training to ensure that players can effectively execute those crucial skills.
This is undoubtedly an area that will grow in the future, as evidenced by the increasing number of data science roles in sport. As always, because sporting success often relies on an effective supporting team, the ability of each support team member to speak every other member’s “language” is important. As such, it is potentially important for coaches to have at least a bit of working knowledge of data science, especially at the highest level. However, just as importantly, the data science specialists will have to speak the coach’s language. Given the promise this area holds, I look forward to watching it develop.
Practical prelude to machine learning by Kyle Peterson
Predictive modeling of football injuries by Stylianos Kampakis (PhD Thesis)