7  Thesis Discussion

7.1 Interpretation of the results

Sports Scientists are inundated with data from wearables, testing equipment, motion capture, and automated and semi-automated annotated match statistics. Consequently, more sports science research needs to focus on increasing the statistical capabilities of Sports Scientists, so that the quality of statistical analysis can match the increasing quality of the data being collected. The aim of this thesis was to provide Sports Scientists with access to applications of statistical methods that will expand their statistical toolkit to accommodate data sets regularly seen in a sports science context. This thesis explored mixed models for imbalanced data sets with repeated observations where there was inherent variability between athletes, introduced the sports science community to Pareto frontiers as a method to identify extreme values when considering multiple variables of interest, and explored inferential statistics within a Bayesian framework as an alternative to the traditional frequentist framework, especially when working with small samples and small effect sizes. It would have been simpler to simulate data sets to illustrate the statistical concepts explored in this thesis; however, it was decided that doing so would sacrifice some authenticity and translatability into the field. Consequently, a secondary aim of this thesis was to produce research that not only furthers the statistical education of the sports science community, but also contributes to the body of research within each sports discipline explored within this thesis.

This thesis firstly highlighted the need for mixed models when examining the differing movement patterns within the various levels of women’s rugby league and describing the movement patterns and match statistics of the NRLW competition. Secondly, it highlighted the need for Pareto frontiers to visualise the trade-off relationship within batting and bowling in T20 cricket. Next, it highlighted the need for Bayesian inference when examining the effect of β-alanine supplementation on 4-km TT performance. Finally, it highlighted the need for mixed models, Pareto frontiers, and Bayesian inference when understanding the dynamics of short- and long-duration specific running intensities.

One feature of this thesis was the transparency of both the code and the data used within it. There have been calls over the past few years for further adoption of ‘open-science practices’, such as Registered Reports (Caldwell et al., 2020) and the sharing of data and code (Borg et al., 2020). Indeed, some journals are now encouraging the use of data repositories (e.g., FigShare, Open Science Framework) to facilitate these open-science practices; however, such journals are still the minority within the sports science discipline (Borg et al., 2020). By publishing this thesis as an eBook, it is intended that the practices shown within this thesis serve as an example, just as much as the actual code helps other researchers in performing their own analyses.

Another area of notable interest for this thesis was the intentional use of data from women’s sporting codes. While the professionalism of women’s sport is growing, the body of research within each sport in the women’s game is still sparse. Consequently, this thesis presented an opportunity in which statistical concepts could be clearly articulated using quality data, while also addressing the inequality in how frequently research in women’s sports is published. For example, prior to the start of this PhD candidature, there were no studies in women’s rugby league examining the movement patterns at any level of competition. Therefore, programming and prescription for athletes were derived from men’s rugby league research with no objective justification for the translation of that research. Since the commencement of candidature, the movement patterns of women’s international rugby league, the movement patterns and match statistics of the Australian domestic NRLW competition, a comparison of the three levels of competition in Australia, and a study of the physiological characteristics of female rugby league players have all been published.

In Chapter 2 and Chapter 5, mixed models provided insights into female rugby league that previously had to be assumed from research performed in male cohorts. By providing data relevant to female cohorts, these studies can support better decision making by coaches, support staff, and the league administration. As the NRLW only contained four teams throughout the 2018-20 seasons and was played in 60-min matches, it was necessary to understand the movement patterns to guide the expansion of the competition. At the time of writing, the 2023 competition is set to feature 10 teams (National Rugby League, 2022) playing 70-min matches (NRL.com, 2021); our research provides a baseline understanding of the game against which the effect of the expansion on athletes’ movement patterns can be assessed. The applications of these two studies can enable coaches and support staff to better program and prescribe training sessions suitable for these athletes, especially those recently recruited into the expansion teams. Furthermore, these baseline results can serve as a benchmark for lower divisions, providing information on the physiological standards required to compete at the NRLW level.

Pareto frontiers were also introduced to Sports Scientists in Chapter 3 of this thesis. While mixed models and Bayesian inference have been used, albeit somewhat uncommonly, in sports science, as far as we are aware Pareto frontiers have never been used within sports science applications. As sports are consistently focused on identifying athletes that could provide an added edge over their opposition, they require athletes that are substantially different from the ‘average’ athlete in a given sport. This concept is reflected in the introduction of statistics such as ‘wins above replacement’, ‘points above replacement’, and ‘runs above replacement’, which describe the impact a player has relative to a ‘replacement’ player (i.e., if you substituted the player for the average player in the league). Consequently, sports science research should consider methodologies that go beyond identifying the mean and that extract player-level effects to identify the extreme athletes.

In Chapter 4, this thesis encouraged Sports Scientists to consider adopting inferential statistics in a Bayesian framework over the more commonly used frequentist framework. There is a pitfall created by the arbitrary statistical significance threshold of \(\alpha\) = 0.05 typically seen in a frequentist framework. Due to the expected impact of a study, journals are more likely to publish a study in which a significant difference is found than a study in which there were no significant findings. This generates a publication bias within sports science journals (Borg et al., 2023) in which the distribution of published results is skewed towards those that present significant findings, leading to unintended consequences such as p-hacking. Therefore, sports science journals should promote the use of probabilistic statements generated from the posterior distribution of the variable of interest, resulting in less reason to reject papers solely based upon an arbitrary cut-off of \(\alpha\) = 0.05.

One curious observation in both Chapter 3 and Chapter 6 was that variables that were thought to be negatively correlated were in fact positively correlated. While this was initially baffling, further investigation after identifying each athlete suggested that a Simpson’s paradox was present in each case. It is known that there is a negative relationship between the initial sprint time and the resulting repeated-sprint ability decrement score (i.e., an athlete with a faster sprint speed experiences a worse decrement) (Bishop et al., 2001). Similarly, there is also a negative relationship between the repeated-sprint ability decrement score and maximal oxygen uptake (VO2MAX) (i.e., an athlete with a higher VO2MAX has a less severe decrement in repeated-sprint ability) (Rampinini et al., 2009). Consequently, there should be a positive relationship between initial sprint time and VO2MAX (i.e., an athlete with a faster sprint speed displays a lower VO2MAX); curiously, however, no significant relationship is evident (Rampinini et al., 2009). Similarly, studies comparing sprint performance and the Yo-Yo intermittent fitness test have found either no correlation (Castagna et al., 2009; Lockie et al., 2017) or even positive relationships (Hermassi et al., 2015; Ingebrigtsen et al., 2014). While this has baffled researchers in the past (Hermassi et al., 2015), it is possibly due to a Simpson’s paradox in which athletes that are more highly trained will be higher in both attributes than less highly trained athletes. This is evident in one study (Ingebrigtsen et al., 2014), where athletes from a 3rd-division team recorded slower sprint times and lower Yo-Yo intermittent fitness test results than a 1st-division team, yet the correlations calculated between the sprint times and the Yo-Yo intermittent fitness results did not account for the level of competition. To illustrate this point, sample data were generated and are shown in Figure 7.1.

Show the code:
library(tidyverse)
library(patchwork)
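set.seed(99) ## set.seed() interprets its argument as an integer, so the original 99.94 is assumed to have been truncated to 99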
a = c(runif(25, 60, 120), runif(25, 40, 100), runif(25, 20, 60)) ## Generate random number sequences in three groups
b = (1 / a) * c(rnorm(25, 150, 30), rnorm(25, 80, 30), rnorm(25, 10, 20)) ## Generate random correlation within the three groups
df <- data.frame(x = a,
                 y = b,
                 group = rep(c("A", "B", "C"), each = 25)) ## Form into a data frame

plot_1 <- ggplot(df, aes(x = x / 1.2, y = y * 40)) +
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method = "lm",
              se = FALSE,
              linetype = "dashed",
              color = "black") +
  theme_minimal() +
  labs(x = "Variable 1",
       y = "Variable 2",
       color = "Quality") +
  theme(axis.text = element_blank()) ## Generate plot with no grouping

plot_2 <- ggplot(df, aes(x = x / 1.2, y = y * 40, color = group)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "lm",
              se = FALSE,
              linetype = "dashed") +
  theme_minimal()+
  labs(x = "Variable 1",
       y = "Variable 2",
       color = "Quality")+
  theme(legend.position = "right",
        axis.text = element_blank(),
        axis.title.y = element_blank()) ## Generate plot with grouping

(plot_1 + plot_2) + plot_annotation(tag_levels = 'A') ## Combine plots

Figure 7.1: Simpson’s paradox illustrated using sample data where a positive relationship is evident when the whole cohort is combined (A) yet a negative relationship is evident when the cohort is split by level of competition (B).

As seen in panel A of Figure 7.1, when the regression line is fitted across the whole cohort, a positive correlation is present. If these variables are positively correlated, a Pareto frontier may not be as useful, given that the individual who is highest in one variable is also likely to be high in the other variable. However, researchers and practitioners should be aware of confounding variables that could influence the analysis. For example, in panel B of Figure 7.1, if the quality of player (or level of competition) is incorporated into the regression model, then a clear negative relationship is identified between the two variables and the Pareto frontier could provide useful applications across each level of competition (e.g., finding players with movement patterns more representative of a higher level of competition). Consequently, when a relationship is unexpected (e.g., a positive relationship when a negative one was expected), it is recommended that additional demographic variables also be tested in the regression model to check for a Simpson’s paradox and to confirm that the relationship is actually present.
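The recommendation above can be sketched by fitting the regression with and without the grouping variable. The following is a minimal illustration on simulated data, not an analysis from this thesis; the data frame `sim`, its column names, and all numeric values are hypothetical.

```r
set.seed(1)

## Hypothetical data: three levels of competition, with both variables
## increasing across levels but a negative relationship within each level
level <- rep(c("A", "B", "C"), each = 30)
level_mean <- unname(c(A = 10, B = 20, C = 30)[level])
x <- level_mean + rnorm(90, 0, 2)
y <- level_mean - (x - level_mean) + rnorm(90, 0, 1)
sim <- data.frame(x = x, y = y, level = level)

pooled   <- lm(y ~ x, data = sim)          ## Ignores level of competition
adjusted <- lm(y ~ x + level, data = sim)  ## Adjusts for level of competition

coef(pooled)[["x"]]    ## Slope across the whole cohort
coef(adjusted)[["x"]]  ## Slope within levels of competition
```

If the sign of the slope flips once the grouping variable is added, as it does here by construction, a Simpson’s paradox is a plausible explanation for the unexpected pooled relationship.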

7.2 Practical Applications

Even though mixed models are not new to sports science (Dalton-Barron et al., 2020), valuable data are still being underutilised in some studies that analyse longitudinal data with alternative statistical methods (such as RM-ANOVA) that require complete data sets. These methods force researchers to either discard data, summarise the data (i.e., invalidly take the mean for each participant without accounting for the differing number of observations), or impute data, which has its own flaws (Borg et al., 2021). By using a mixed model instead, Sports Scientists can retain the full data set and gain the flexibility to handle data that are missing due to injuries, squad selection, and access to athletes on a given day.
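As a minimal sketch of this point, a random-intercept model can be fitted to an imbalanced data set without discarding or averaging observations. The example assumes the `lme4` package is installed; the data are simulated, and the names `matches`, `athlete`, and `distance` are hypothetical rather than taken from the studies in this thesis.

```r
library(lme4)
set.seed(1)

## Hypothetical imbalanced data: 12 athletes with between 2 and 8 matches each
n_matches <- sample(2:8, 12, replace = TRUE)
athlete_id <- rep(seq_len(12), times = n_matches)
athlete_effect <- rnorm(12, 0, 8) ## Between-athlete variability
matches <- data.frame(
  athlete = factor(athlete_id),
  distance = 100 + athlete_effect[athlete_id] + rnorm(length(athlete_id), 0, 5)
)

## Random intercept for each athlete; every observation is retained
fit <- lmer(distance ~ 1 + (1 | athlete), data = matches)

fixef(fit)              ## Population-level mean
head(coef(fit)$athlete) ## Partially pooled athlete-level means
```

Athletes with few observations are pulled more strongly towards the population mean than athletes with many observations, which is the partial pooling that a per-athlete average cannot provide.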

As Pareto frontiers have not been used within sports science before, we believe their use to identify athletes that possess the best compromise between attributes of interest is only in its infancy. Because the method is not limited to one specific area of sports science, its applications can be wide-ranging, even though this thesis predominantly framed Pareto frontiers in a talent identification context. As referred to in Chapter 3, there have been multiple instances in the last few years (Duthie et al., 2021; Morin et al., 2021; Rudsits et al., 2018) in which researchers have attempted to identify “maximal” efforts when multiple observations of an individual are recorded across varying conditions of the independent variable. To eliminate any sub-maximal efforts, a Pareto frontier can be established to extract only the values that exhibit the best compromise between the variables of interest.
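The extraction of “maximal” efforts can be sketched in base R as follows. This is an illustrative sketch only: the helper `is_pareto()` and the `efforts` data are hypothetical, and it is assumed that higher values are better for both variables. The `psel()` function from the rPref package used elsewhere in this chapter should select the same rows.

```r
## Flag observations that are not dominated, i.e., no other observation is
## at least as high on both variables and strictly higher on at least one
is_pareto <- function(a, b) {
  vapply(seq_along(a), function(i) {
    !any(a >= a[i] & b >= b[i] & (a > a[i] | b > b[i]))
  }, logical(1))
}

## Hypothetical efforts from one athlete: duration (s) and mean speed (m/s)
efforts <- data.frame(
  duration = c(5, 10, 10, 30, 30, 60, 60, 120),
  speed    = c(8.9, 8.1, 7.2, 6.8, 6.9, 6.0, 5.1, 5.4)
)

efforts$maximal <- is_pareto(efforts$duration, efforts$speed)
efforts[efforts$maximal, ] ## Retain only the efforts on the Pareto frontier
```

In this example, the 7.2 m/s effort over 10 s is discarded because the same athlete produced 8.1 m/s over the same duration, so it cannot have been maximal.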

Of particular note, Chapter 6 explored Pareto frontiers using mixed models in a Bayesian framework, bringing together the three pillars of the thesis. As outlined as a limitation in Chapter 3, when multiple observations are present, the uncertainty around each individual and their position with respect to the Pareto frontier needs to be quantified. The mixed model can account for the dependency between observations, while the Bayesian framework enables the probability of an individual being located on the Pareto frontier to be calculated.
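One way such a probability could be computed is to identify the Pareto frontier within each posterior draw and then take the proportion of draws in which each athlete appears on it. The sketch below is an assumption-laden illustration, not the Chapter 6 analysis: the posterior draws are simulated directly from independent normal distributions rather than sampled from a fitted model, and `is_pareto()`, the athlete means, and the standard deviation are all hypothetical.

```r
set.seed(1)

## Flag non-dominated points (higher is better on both variables)
is_pareto <- function(a, b) {
  vapply(seq_along(a), function(i) {
    !any(a >= a[i] & b >= b[i] & (a > a[i] | b > b[i]))
  }, logical(1))
}

## Hypothetical posterior draws for five athletes on two variables
## (rows = draws, columns = athletes); a real analysis would take
## these from the fitted Bayesian mixed model instead
n_draws <- 2000
mean_a <- c(10, 7, 5, 4, 8)
mean_b <- c(4, 7, 9, 4, 7.5)
draws_a <- matrix(rnorm(n_draws * 5, rep(mean_a, each = n_draws), 0.5), nrow = n_draws)
draws_b <- matrix(rnorm(n_draws * 5, rep(mean_b, each = n_draws), 0.5), nrow = n_draws)

## Frontier membership in each draw (rows = athletes, columns = draws)
on_frontier <- vapply(seq_len(n_draws), function(d) {
  is_pareto(draws_a[d, ], draws_b[d, ])
}, logical(5))

## Proportion of draws in which each athlete is on the frontier
prob_frontier <- rowMeans(on_frontier)
names(prob_frontier) <- paste0("athlete_", 1:5)
round(prob_frontier, 2)
```

An athlete whose estimates sit close to another athlete’s receives an intermediate probability, which expresses the uncertainty that a single frontier drawn through point estimates would hide.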

7.3 Further Research

While mixed models have become more common within sports science over the past five years, it seems that the uses of the inferential statistics generated by the mixed model are still relatively limited. However, the estimates arising from these mixed models (e.g., marginal means and conditional means) can also be used for further applications. For example, in Chapter 6, the estimated values from the mixed model were simulated thousands of times to estimate the true population Pareto frontier. By sampling from the mixed model, the random effects can be accounted for and can be used to simulate data sets from the model (Borg et al., 2020). These data sets can then be used for further research (e.g., building Pareto frontiers). A mixed model therefore does not have to be the end of the analysis; rather, it can serve a specific purpose in an analysis pipeline (in this case, providing partial pooling of estimates for an imbalanced data set) and facilitate further analyses later in the pipeline.

When identifying the Pareto frontiers in different data sets, it became apparent that the shape of the frontier could differ substantially. Take, for instance, the following two Pareto frontiers:

Show the code:
library(rPref)
df <- data.frame(
  a = c(1, 2, 3, 4, 6, 8, 10, 6, 2, 2, 4, 2.5, 2),
  b = c(10, 8, 6, 4, 3, 2, 1, 2, 6, 4, 2, 2, 2.5)
) ## Generate concave data set
df <- psel(df, high(a) * high(b), top_level = 99) ## Retrieve Pareto frontier for concave data set

plot_1 <- ggplot() +
  geom_point(data = df %>% filter(.level != 1), aes(x = a, y = b)) +
  geom_point(data = df %>% filter(.level == 1),
             aes(x = a, y = b),
             color = "red") +
  geom_line(data = df %>% filter(.level == 1),
            aes(x = a, y = b),
            color = "red") +
  theme_minimal() +
  labs(x = "Variable 1",
       y = "Variable 2") +
  theme(axis.text = element_blank()) ## Plot concave Pareto frontier

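set.seed(99) ## set.seed() interprets its argument as an integer, so the original 99.94 is assumed to have been truncated to 99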
df2 <- data.frame(a = c(1, 3, 5, 6, 7, 8, 9, 10, runif(20, 0, 7)),
                  b = c(10, 9, 8, 7, 6, 5, 3, 1, runif(20, 0, 7))) ## Generate convex data set
df2 <- psel(df2, high(a) * high(b), top_level = 99) ## Retrieve Pareto frontier for convex data set

plot_2 <- ggplot() +
  geom_point(data = df2 %>% filter(.level != 1), aes(x = a, y = b)) +
  geom_point(data = df2 %>% filter(.level == 1),
             aes(x = a, y = b),
             color = "red") +
  geom_line(data = df2 %>% filter(.level == 1),
            aes(x = a, y = b),
            color = "red") +
  theme_minimal() +
  labs(x = "Variable 1",
       y = "Variable 2") +
  theme(axis.text = element_blank(),
        axis.title.y = element_blank()) ## Plot convex Pareto frontier

(plot_1 + plot_2) + plot_annotation(tag_levels = 'A')

Figure 7.2: The difference between a concave (A) and a convex (B) Pareto frontier.

The Pareto frontier is concave (as seen in panel A of Figure 7.2) when, as the X axis variable increases, the Y axis variable declines sharply and, similarly, as the Y axis variable increases, the X axis variable declines sharply. However, when the Pareto frontier is convex (as seen in panel B of Figure 7.2), as the X axis variable increases, there is only a gradual decline in the Y axis variable. Additionally, as the Y axis variable increases, there is only a gradual decline in the X axis variable.

While initially overlooked, the applications of the shape of the frontier could be wide-reaching. If the scatter plot features players within a team and only a limited number of athletes can be selected from the cohort, it may be difficult to decide which athletes should be selected when multiple athletes lie on the Pareto frontier. In this case, it could be argued that if the data reflected that in panel A of Figure 7.2, the athletes at the extremities of the Pareto frontier are of greater value than those in the middle of the frontier, as any gain in one variable comes at a great cost to the opposing variable. By contrast, if the data reflected that shown in panel B of Figure 7.2, it could be argued that athletes in the middle of the Pareto frontier are of the highest value, as their gains in each variable have come at a relatively low expense to the opposing variable. Further research could investigate the relative locations within the Pareto frontier and their contributions towards team construction. Additionally, Pareto frontiers are only one of many approaches to multiobjective optimisation, and other methods should be explored when applying them in a sports science context.
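The difference in shape can be made concrete by calculating the trade-off between adjacent points on each frontier. The sketch below uses only the hand-entered points from the code for Figure 7.2; it assumes that none of the randomly generated points in panel B lands on the frontier, and the helper `trade_off()` is hypothetical.

```r
## Hand-entered frontier points from the code for Figure 7.2
front_a <- data.frame(a = c(1, 2, 3, 4, 6, 8, 10),
                      b = c(10, 8, 6, 4, 3, 2, 1))      ## Panel A
front_b <- data.frame(a = c(1, 3, 5, 6, 7, 8, 9, 10),
                      b = c(10, 9, 8, 7, 6, 5, 3, 1))   ## Panel B

## Change in variable 2 per unit gain in variable 1 between adjacent points
trade_off <- function(front) {
  front <- front[order(front$a), ]
  diff(front$b) / diff(front$a)
}

trade_off(front_a) ## -2.0 -2.0 -2.0 -0.5 -0.5 -0.5
trade_off(front_b) ## -0.5 -0.5 -1.0 -1.0 -1.0 -2.0 -2.0
```

In panel A the cost of each gain in variable 1 is steepest at the low end and then flattens, whereas in panel B it starts shallow and steepens, which matches the sharp and gradual declines described above.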

Even though Bayesian inference is not new to sports science (Mengersen et al., 2016; Santos-Fernandez et al., 2019), its use is still limited, as are the resources on using Bayesian inference in sports science contexts. In this thesis, the use of informative priors (Chapter 4) was explored, and it was shown how informative priors can assist in decision making for small samples and small effect sizes. Further research can build on this call for informative priors by developing catalogues of standardised, normative reference data to use as the prior distribution for a given variable. These databases could minimise the perceived effect of ‘biasing the prior’ by ensuring that all studies utilise the same prior when determining their conclusions.
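The influence of an informative prior on a small-sample estimate can be sketched with a conjugate normal update, in which the posterior mean is a precision-weighted average of the prior mean and the observed estimate. This is a simplified illustration that assumes a known standard error; it is not the model used in Chapter 4, and the function `posterior_normal()` and all numeric values are hypothetical.

```r
## Conjugate normal update: normal prior, normal likelihood, known standard error
posterior_normal <- function(prior_mean, prior_sd, est, se) {
  w_prior <- 1 / prior_sd^2 ## Precision of the prior
  w_data  <- 1 / se^2       ## Precision of the observed estimate
  post_var <- 1 / (w_prior + w_data)
  c(mean = post_var * (w_prior * prior_mean + w_data * est),
    sd = sqrt(post_var))
}

## Hypothetical small study: observed improvement of 1.0% with a standard error of 1.0%
vague       <- posterior_normal(prior_mean = 0, prior_sd = 10,  est = 1, se = 1)
informative <- posterior_normal(prior_mean = 1, prior_sd = 0.5, est = 1, se = 1)

## Posterior probability that the effect is greater than zero
p_vague       <- unname(1 - pnorm(0, vague["mean"], vague["sd"]))
p_informative <- unname(1 - pnorm(0, informative["mean"], informative["sd"]))
c(vague = p_vague, informative = p_informative)
```

With the same observed data, the informative prior yields a narrower posterior and a more decisive probability statement, which is why a shared, catalogued prior would make conclusions comparable across studies.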

One area that this thesis intentionally did not explore was the application of machine learning techniques within a sports science context. It is noted that machine learning techniques are changing the way data are collected, extracted, and processed (Richter et al., 2021). However, the level of expertise required to understand these techniques may be beyond the scope of a ‘typical’ Sports Scientist. Within organisations, it may be that hybrid roles start to emerge in which there is crossover between a Sports Scientist and a data scientist; however, I do not believe that every applied Sports Scientist needs to understand and be competent to run their own machine learning models. Instead, Sports Scientists should endeavour to learn the general principles surrounding machine learning models so that they can engage with the data scientists building these models, as the frequency with which Sports Scientists utilise software that harnesses machine learning will only grow into the future. Therefore, statistical methodology readily available to applied Sports Scientists was the main focus of this thesis.

7.4 Thesis Conclusion

This thesis encourages Sports Scientists to develop their statistical literacy to ensure the validity and robustness of the statistics presented for decision-making. As repeated measures, imbalanced data sets, multiple variables of interest, small samples, and small effect sizes are commonplace in sports science, this thesis was written to urge Sports Scientists to develop a deeper understanding of, and improved competency in, statistical methods. Additionally, Study 5 serves as a ‘capstone’ study, illustrating how the different statistical concepts explored in this thesis (i.e., mixed models, Pareto frontiers, and Bayesian inference) can be used in tandem to produce novel research. The thesis provides all the data and R code required to run the analyses within it, so that Sports Scientists can replicate the analyses with their own data, and to provide an example to the sports science community of open and transparent research practices. By compiling this thesis as an eBook, it is intended that Sports Scientists can utilise these studies as a resource when conducting their own research.