3 Pareto Frontiers for Multivariate Sports Performance

3.1 Abstract

Athletes often require a mix of physical, physiological, psychological, and skill-based attributes that can conflict when competing at the highest level within their sport. When considering multiple variables in tandem, Pareto frontiers can identify the observations that possess an optimal balance of the desired attributes, especially when these attributes are negatively correlated. This study presents Pareto frontiers as a tool to identify athletes who possess an optimal ranking when considering multiple metrics simultaneously. It explores the trade-off relationship between batting average and strike rate as well as bowling strike rate, economy, and average in Twenty20 cricket. Eight hundred ninety-one matches of Twenty20 cricket from the men’s (MBBL) and women’s (WBBL) Australian Big Bash Leagues were compiled to determine the best batting and bowling performances, both within a single innings and across each player’s Big Bash career. Pareto frontiers identified 12 and seven optimal batting innings performances in the MBBL and WBBL, respectively, with nine and six optimal batting careers, respectively. Pareto frontiers also identified three optimal bowling innings in both the MBBL and WBBL and five and six optimal bowling careers in the MBBL and WBBL, respectively. Each frontier identified players who were not the highest-ranked athlete in any metric when analyzed univariately. Pareto frontiers can be used when assessing talent across multiple metrics, especially when these metrics may be conflicting or uncorrelated. Using Pareto frontiers can identify athletes who may not have the highest ranking on a given metric but have an optimal balance across multiple metrics that are associated with success in a given sport.

The following chapter is a copy of the published manuscript:

Newans, T., Bellinger, P., & Minahan, C. The balancing act: Identifying multivariate sports performance using Pareto frontiers. Frontiers in Sports and Active Living. 4:918946.

As co-author of the paper “The balancing act: Identifying multivariate sports performance using Pareto frontiers”, I confirm that Timothy Newans has made the following contributions:

Study concept and design
Data collection
Data analysis and interpretation
Manuscript preparation

Name: Clare Minahan

Date: 29/03/2023

3.2 Introduction

Many sports require athletes to possess well-developed physiological or mechanical characteristics (e.g., endurance, maximal sprint speed) and/or skill-related attributes (e.g., shot speed and shot accuracy) that are often opposing and/or are not associated. These disparate attributes appear to be more obvious between athletes in team sports when compared to specific disciplines in individual sports that require highly specific attributes for success (e.g., track sprint cycling (Lievens et al., 2021) or marathon running (Jones et al., 2021)). Furthermore, it is not unusual for team sports to include both specialists, such as athletes who possess superior maximal sprint speed and athletes who possess superior endurance capacity, and individual athletes who have a wide range of physiological/mechanical and skill-related attributes. For instance, in stick or racquet sports such as golf and tennis, each athlete is required to balance the accuracy and speed of their strokes (Maquirriain et al., 2016; Wells et al., 2009); while in most team sports, an athlete should develop a balance of speed and endurance (Stølen et al., 2005). Moreover, in sports like cricket, an athlete is required to balance their batting/bowling average with their batting/bowling strike rate or economy (Barr & Kantor, 2004; Patel et al., 2017). In each of these examples, a case could be made for which is the preferred attribute; however, the preferred attribute may differ given the other athletes within a particular team or given a particular situation within a single match. A cohesive team may require a squad of players that differ in their balance of attributes and, therefore, it is apparent that performance analysis requires a multi-faceted approach when selecting prospective players.

While talent identification processes have been extensively reported (Dodd & Newans, 2018; Falk et al., 2004; Johnston et al., 2018; Pion et al., 2015; Pyne et al., 2005; Till et al., 2016), standards and benchmarks are typically reported univariately; that is, each attribute is assessed in isolation. For example, players could be standardized within each attribute (i.e., z-scores) to determine where the athlete sits with respect to the rest of the athletic population (Turner et al., 2019). However, this method has a flaw in that some attributes are negatively correlated, including physiological/mechanical characteristics such as maximal sprint speed and endurance capacity (Sánchez-García et al., 2018). Therefore, if an athlete excels in one attribute, it is likely that this would come at the expense of a conflicting attribute. By assessing talent univariately, athletes are identified when they have specialist skills (e.g., strongest, fastest, fittest, leanest) (Minahan et al., 2021). As a given sport often requires a balance of these attributes, it is reasonable to suggest that talent identification processes should assess talent multivariately, that is, multiple variables in tandem, rather than univariately.

The process of optimizing the balance of multiple attributes is termed “multi-objective optimization”. This technique is attracting increasing interest with recent developments in machine learning algorithms; however, its origins are quite simple mathematically in that it aims to identify the optimal balance of the attributes of interest. If a data point is defined as \(\vec{x_1} \in X\), it is better than another data point \(\vec{x_2} \in X\) if \(f_{i}(\vec{x_1}) \leq f_{i}(\vec{x_2})\) for all metrics \(i \in \{1, 2, \ldots, k\}\) and \(f_{j}(\vec{x_1}) < f_{j}(\vec{x_2})\) for at least one metric \(j \in \{1, 2, \ldots, k\}\), where lower values of each \(f_{i}\) are preferable. Once every point that is bettered in this way has been removed, the remaining points are deemed Pareto-optimal and form what is called the Pareto frontier. There are two key strengths to using Pareto frontiers. Firstly, when balancing multiple attributes, Pareto-optimal observations can be identified with just a few lines of computer code. Secondly, when visualizing a limited number of attributes (i.e., three or fewer), the Pareto frontier is intuitive and can be clearly identified, assisting in the translation and interpretation of the results to coaches and other support staff. This approach has been used in other fields such as designing aircraft with maximum aerodynamic efficiency, maximum range, and minimum weight (Mastroddi & Gemma, 2013). However, there is very limited use of Pareto frontiers within sport (Pérez-Toledano et al., 2019). By using Pareto frontiers, the optimal balance of all these attributes can be identified rather than guessing through siloed univariate analyses.
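
The dominance rule above can be sketched in a few lines of base R. This is a minimal illustration, not the study's implementation (which relies on the rPref package); the function names `dominates` and `pareto_front` and the toy matrix are assumptions introduced here, and every metric is treated as "lower is better" to match the inequalities above.

```r
# TRUE if point a dominates point b when every metric is minimized:
# a is no worse than b on all metrics and strictly better on at least one.
dominates <- function(a, b) {
  all(a <= b) && any(a < b)
}

# Flags the rows of a numeric matrix that no other row dominates.
pareto_front <- function(m) {
  n <- nrow(m)
  vapply(seq_len(n), function(i) {
    !any(vapply(seq_len(n), function(j) {
      j != i && dominates(m[j, ], m[i, ])
    }, logical(1)))
  }, logical(1))
}

# Hypothetical data: three observations measured on two metrics.
toy <- rbind(c(1, 5), c(3, 3), c(4, 4))
front <- pareto_front(toy)
print(front)
```

In this toy set the third observation is bettered by the second on both metrics, so only the first two remain on the frontier.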

To illustrate the concept of Pareto frontiers, the present study used batting and bowling data in Twenty20 (T20) cricket. Like other forms of cricket (i.e., Test and one-day matches), T20 cricket requires players to score as many runs as possible within the allotted 20 overs without being dismissed (i.e., being bowled, caught out, run out etc.). In addition, players of T20 cricket also need to score runs in as few deliveries as possible. Therefore, it is difficult to determine whether 80 runs “off” (i.e., from) 60 balls or 40 runs off 20 balls is of more benefit to a team as their differing risk profiles contribute differently to the formation of the team (Bukiet & Ovens, 2006). For example, early on in an innings the risk-return of attempting to hit six runs off a ball is significantly different than in the final over of an innings. Similarly, a bowler needs to balance taking wickets while also conceding as few runs as possible. For instance, when bowling four overs, it is again difficult to determine whether taking three wickets for 50 runs is of more worth than taking no wickets but only conceding eight runs as the three wickets may not have been worth conceding 50 runs. Therefore, when assessing the quality of players, it is necessary to utilize tools that can analyze these data sets without favoring one metric over another. The concept of Pareto frontiers is one such tool and, therefore, the present study aimed to introduce Pareto frontiers to the sports science community and illustrate how they can identify players with the optimal balance of attributes that can be obfuscated when performing univariate analysis across each metric.
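
The two comparisons in the paragraph above can be checked numerically. The sketch below uses only the figures quoted in the text (80 off 60 versus 40 off 20, and 3/50 versus 0/8 from four overs); the helper `first_dominates` and the variable names are assumptions made for illustration.

```r
# Batting example from the text: more runs and a higher strike rate are both better.
bat <- data.frame(
  innings = c("80 off 60", "40 off 20"),
  runs    = c(80, 40),
  balls   = c(60, 20)
)
bat$strike_rate <- bat$runs / bat$balls * 100

# Bowling example from the text: more wickets and a lower economy are both better.
bowl <- data.frame(
  spell   = c("3/50", "0/8"),
  wickets = c(3, 0),
  runs    = c(50, 8),
  overs   = c(4, 4)
)
bowl$economy <- bowl$runs / bowl$overs

# TRUE if s1 is at least as good as s2 on every score and strictly better on one
# (scores are oriented so that higher is better).
first_dominates <- function(s1, s2) all(s1 >= s2) && any(s1 > s2)

bat_scores  <- cbind(bat$runs, bat$strike_rate)
bowl_scores <- cbind(bowl$wickets, -bowl$economy)  # negate economy so higher is better

# A trade-off exists when neither performance dominates the other.
bat_trade_off  <- !first_dominates(bat_scores[1, ], bat_scores[2, ]) &&
  !first_dominates(bat_scores[2, ], bat_scores[1, ])
bowl_trade_off <- !first_dominates(bowl_scores[1, ], bowl_scores[2, ]) &&
  !first_dominates(bowl_scores[2, ], bowl_scores[1, ])

print(c(batting = bat_trade_off, bowling = bowl_trade_off))
```

Neither performance in either pair dominates the other, which is the situation in which a Pareto frontier retains both rather than forcing a single ranking.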

3.3 Methods

The present study comprised all 489 matches of the first 11 editions of the Men’s Big Bash League (MBBL) and all 402 matches of the first seven editions of the Women’s Big Bash League (WBBL), Australia’s domestic T20 cricket competitions. The MBBL data set contained 423 batters and 313 bowlers, while the WBBL data set contained 214 batters and 159 bowlers. All scorecards were freely available online. Collectively, there were 13,764 individual batting innings with observations ranging from 1 to 113 innings per batter, while there were 10,796 individual bowling innings with observations ranging from 1 to 106 innings per bowler.

Show the code:

library(tidyverse)
library(rPref)
library(patchwork)
library(scatterplot3d)
options(scipen = 999)
mbblbat <- read_csv('www/data/Study_2_pareto_mbblbat.csv') # Men's batting scorecards
mbblbowl <- read_csv('www/data/Study_2_pareto_mbblbowl.csv') # Men's bowling scorecards
wbblbat <- read_csv('www/data/Study_2_pareto_wbblbat.csv') # Women's batting scorecards
wbblbowl <- read_csv('www/data/Study_2_pareto_wbblbowl.csv') # Women's bowling scorecards

To summarize the data, two summary statistics were generated for batting and three summary statistics were generated for bowling. The summary statistics were as follows:

Batting Average: runs scored divided by frequency of dismissal
Batting Strike Rate: runs scored divided by balls faced multiplied by 100
Bowling Average: runs conceded divided by wickets taken
Bowling Strike Rate: balls bowled divided by wickets taken
Bowling Economy: runs conceded divided by overs (i.e., 6 balls) bowled
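
As a worked example of the five definitions above, the following sketch computes each statistic from a set of hypothetical career totals. The numbers and variable names are invented for illustration and are not drawn from the study's data.

```r
# Hypothetical career totals (not taken from the study's data).
runs_scored   <- 450
balls_faced   <- 300
dismissals    <- 15

runs_conceded <- 600
balls_bowled  <- 480
wickets       <- 24

batting_average     <- runs_scored / dismissals            # runs per dismissal
batting_strike_rate <- runs_scored / balls_faced * 100     # runs per 100 balls faced
bowling_average     <- runs_conceded / wickets             # runs conceded per wicket
bowling_strike_rate <- balls_bowled / wickets              # balls bowled per wicket
bowling_economy     <- runs_conceded / (balls_bowled / 6)  # runs conceded per over

print(c(
  batting_average     = batting_average,
  batting_strike_rate = batting_strike_rate,
  bowling_average     = bowling_average,
  bowling_strike_rate = bowling_strike_rate,
  bowling_economy     = bowling_economy
))
```

Note that R returns `Inf` rather than an error when a positive total is divided by zero, so a batter with no dismissals or a bowler with no wickets would need to be handled separately before ranking.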

Show the code:

dismissals_men <- mbblbat %>%
  group_by(id) %>%
  filter(Dismissed == T) %>%
  summarise(Dismissals = n()) ## Calculate number of dismissals

notouts_men <- mbblbat %>%
  group_by(id) %>%
  filter(Dismissed == F) %>%
  summarise(NotOuts = n()) ## Calculate number of not outs

sumBat_men <- mbblbat %>%
  group_by(id, Batter, LastName) %>%
  summarise(
    TotalRuns = sum(R),
    TotalBalls = sum(B),
    Innings = n()
  ) %>%
  ungroup() %>%
  left_join(dismissals_men, by = "id") %>%
  left_join(notouts_men, by = "id") %>% 
  mutate(
    Dismissals = case_when(is.na(Dismissals) ~ as.integer(0),
                           T ~ Dismissals),
    NotOuts = case_when(is.na(NotOuts) ~ as.integer(0),
                           T ~ NotOuts),
    Average = TotalRuns / Dismissals,
    StrikeRate = TotalRuns / TotalBalls * 100) ## Calculate career batting average and strike rate

filtBat_men <- sumBat_men %>%
  filter(Innings >= 15) ## Filter only those with 15 or more batting innings

sumBowl_men <- mbblbowl %>%
  group_by(id, Bowling, LastName) %>%
  summarise(
    Innings = n(),
    Balls = sum(Balls),
    Wickets = sum(W),
    Runs = sum(R)
  ) %>%
  mutate(
    Average = Runs / Wickets,
    Economy = Runs / Balls * 6,
    StrikeRate = Balls / Wickets
  ) ## Calculate bowling average, economy, and strike rate

filtBowl_men <- sumBowl_men %>%
  filter(Balls >= 200) %>%
  ungroup() ## Filter only those with 200 or more bowling deliveries

dismissals_women <- wbblbat %>%
  group_by(id) %>%
  filter(Dismissed == T) %>%
  summarise(Dismissals = n()) ## Calculate number of dismissals

notouts_women <- wbblbat %>%
  group_by(id) %>%
  filter(Dismissed ==  F) %>%
  summarise(NotOuts = n()) ## Calculate number of not outs

sumBat_women <- wbblbat %>%
  group_by(id, Batter, LastName) %>%
  summarise(
    TotalRuns = sum(R),
    TotalBalls = sum(B),
    Innings = n()
  ) %>%
  ungroup() %>%
  left_join(dismissals_women, by = "id") %>%
  left_join(notouts_women, by = "id") %>% 
  mutate(
    Dismissals = case_when(is.na(Dismissals) ~ as.integer(0),
                           T ~ Dismissals),
    NotOuts = case_when(is.na(NotOuts) ~ as.integer(0),
                           T ~ NotOuts),
    Average = TotalRuns / Dismissals,
    StrikeRate = TotalRuns / TotalBalls * 100) ## Calculate career batting average and strike rate

filtBat_women <- sumBat_women %>%
  filter(Innings >= 15) ## Filter only those with 15 or more batting innings

sumBowl_women <- wbblbowl %>%
  group_by(id, Bowling, LastName) %>%
  summarise(
    Innings = n(),
    Balls = sum(Balls),
    Wickets = sum(W),
    Runs = sum(R)
  ) %>%
  mutate(
    Average = Runs / Wickets,
    Economy = Runs / Balls * 6,
    StrikeRate = Balls / Wickets
  ) ## Calculate bowling average, economy, and strike rate

filtBowl_women <- sumBowl_women %>%
  filter(Balls >= 200) %>%
  ungroup()  ## Filter only those with 200 or more bowling deliveries

To understand optimal batting performance, the number of runs scored as well as the rate at which these runs were scored are both of importance in Twenty20 cricket (Barr & Kantor, 2004). As it was expected that there would be a trade-off relationship between these variables, there would be multiple batters that possess an optimal balance of these attributes. Therefore, it was deemed appropriate that a Pareto frontier could be established to identify these batters. Similarly, to identify optimal bowling performance, the number of wickets, runs conceded, and rate at which the wickets and runs are recorded are all of importance (Patel et al., 2017). As the number of wickets taken can come at the expense of runs conceded, it was also expected that no bowler would be optimal in every attribute and, therefore, a Pareto frontier would also be required to identify the bowlers that possess the optimal balance of these bowling attributes.

Consequently, four Pareto frontiers for each competition were established within the data set:

Pareto-optimal Batting Innings

This analysis outlined the highest runs scored within an innings at the highest strike rate.

Show the code:

BatInnPareto_men <- psel(mbblbat %>% filter(R > 0), high(R) * high(SR), top_level = 999) ## Identify Pareto frontier for men's BBL batting innings
BatInnPareto_women <- psel(wbblbat %>% filter(R > 0), high(R) * high(SR), top_level = 999) ## Identify Pareto frontier for women's BBL batting innings

Pareto-optimal Batting Career

This analysis outlined the highest batting average across a career at the highest strike rate. To provide a more accurate career report, batters were required to have played a minimum of 15 innings, which left 158 eligible male batters and 116 eligible female batters.

Show the code:

BatCarPareto_men <- psel(filtBat_men,high(Average) * high(StrikeRate), top_level = 999) %>% 
  filter(Average > 20 | StrikeRate > 100) ## Identify Pareto frontier for men's BBL batting career
BatCarPareto_women <- psel(filtBat_women,high(Average) * high(StrikeRate), top_level = 999) %>% 
  filter(Average > 20 | StrikeRate > 100) ## Identify Pareto frontier for women's BBL batting career

Pareto-optimal Bowling Innings

This analysis outlined the most wickets taken in an innings at the lowest economy.

Show the code:

BowlInnPareto_men <- psel(mbblbowl, high(W) * low(Econ), top_level = 999) ## Identify Pareto frontier for men's BBL bowling innings
BowlInnPareto_women <- psel(wbblbowl, high(W) * low(Econ), top_level = 999) ## Identify Pareto frontier for women's BBL bowling innings

Pareto-optimal Bowling Career

This analysis outlined the lowest bowling average across a career at the lowest economy and lowest strike rate. To provide a more accurate career report, bowlers were required to have bowled a minimum of 200 balls, which left 137 eligible male bowlers and 98 eligible female bowlers.

Show the code:

BowlCarPareto_men <- psel(filtBowl_men, low(Average) * low(StrikeRate) * low(Economy), top_level = 999) %>% 
  filter(Average < 50) ## Identify Pareto frontier for men's BBL bowling career
BowlCarPareto_women <- psel(filtBowl_women, low(Average) * low(StrikeRate) * low(Economy), top_level = 999)  %>% 
  filter(Average < 50) ## Identify Pareto frontier for women's BBL bowling career

All data were analyzed using R (v 4.1.0) statistical software (R Core Team, 2019). Firstly, the dplyr (Wickham et al., 2021) and tidyr (Wickham, 2021) packages were used to format the data into the structure required to identify the Pareto frontiers. The rPref package (Roocks, 2016) was used to determine the Pareto frontiers using the psel function with the “top_level” argument set to 999 to ensure every athlete was assigned to a frontier (see Line 15 of the attached script for an example). Once the frontiers were established, the ggplot2 (Wickham, 2016) and scatterplot3d (Ligges & Mächler, 2003) packages were used to visualize the data and subsequent Pareto frontiers.
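
To make the idea of assigning every athlete to a frontier concrete, the sketch below peels successive frontiers off a toy data set in base R. It is intended only to mimic, conceptually, the `.level` column that the study's `psel(..., top_level = 999)` calls produce; the function `frontier_levels` and the five hypothetical batters are assumptions introduced here, and rPref's internal algorithm may differ.

```r
# Assigns each row of a score matrix (higher is better in every column) to a
# frontier level: level 1 is the Pareto frontier, level 2 is the frontier of
# what remains once level 1 is removed, and so on.
frontier_levels <- function(scores) {
  level <- rep(NA_integer_, nrow(scores))
  current <- 1L
  while (anyNA(level)) {
    remaining <- which(is.na(level))
    non_dominated <- vapply(remaining, function(i) {
      !any(vapply(remaining, function(j) {
        all(scores[j, ] >= scores[i, ]) && any(scores[j, ] > scores[i, ])
      }, logical(1)))
    }, logical(1))
    level[remaining[non_dominated]] <- current
    current <- current + 1L
  }
  level
}

# Hypothetical batters: career average and career strike rate.
toy <- data.frame(
  player      = c("A", "B", "C", "D", "E"),
  average     = c(40, 30, 25, 28, 20),
  strike_rate = c(120, 150, 140, 118, 110)
)
toy$level <- frontier_levels(as.matrix(toy[, c("average", "strike_rate")]))
print(toy)
```

Here A and B form the first frontier, C and D the second, and E the third, so every player receives a level rather than only the Pareto-optimal ones being returned.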

3.4 Results

Men’s Pareto-optimal batting

As seen by the red line in Figure 3.1, 12 innings were identified as Pareto-optimal. That is, no other batter has scored more runs at a faster strike rate than these 12 innings. These innings ranged from 6 runs off 1 ball (i.e., strike rate = 600) to 154 off 64 balls (i.e., strike rate = 240.63). Additionally, the solution of 6 runs off 1 ball has been attained six times. In Figure 3.2, nine Pareto-optimal batting careers were identified (in red) with Andre Russell achieving the highest strike rate (164.07) and Brad Hodge achieving the highest average (42.78), while Joe Clarke, Alex Hales, Glenn Maxwell, Chris Lynn, Ben McDermott, Kevin Pietersen and Mitchell Marsh were all deemed Pareto-optimal due to varying combinations of both metrics.

Show the code:

ggplot(mapping = aes(x = R, y = SR)) +
    geom_point(data = BatInnPareto_men %>%  filter(R > 50 | SR > 100), alpha=0.05, size = 3) +
    geom_point(data = BatInnPareto_men %>% filter(.level == 1), alpha = 0.1, color = "red",size = 3)+
    geom_text(data = BatInnPareto_men %>% filter(.level == 1 & R != 6 & !(LastName %in% c("Coulter-Nile","Rashid","Cutting","Maxwell","McAndrew"))), aes(label = LastName),color = "Red", hjust = "left",nudge_x = 2, nudge_y = 10)+
    geom_text(data = BatInnPareto_men %>% filter(.level == 1 & LastName == "Rashid"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 15)+
    geom_text(data = BatInnPareto_men %>% filter(.level == 1 & LastName == "Coulter-Nile"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 22)+
    geom_text(data = BatInnPareto_men %>% filter(.level == 1 & LastName == "Cutting"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 22)+
    geom_text(data = BatInnPareto_men %>% filter(.level == 1 & LastName == "McAndrew"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 25)+
    geom_text(data = BatInnPareto_men %>% filter(.level == 1 & LastName == "Maxwell"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = -10,nudge_y = 15)+
    geom_line(data = BatInnPareto_men %>% filter(.level == 1), alpha = 0.5, colour = "red")+
    annotate(geom = "text",x = 15,y = 600, label = "6 players", color = "red")+
    scale_y_continuous(breaks = seq(100,600,100))+
    coord_cartesian(xlim = c(0,160))+
    theme_minimal() +
    labs(x = "Runs Scored in an Innings",
         y = "Innings Batting Strike Rate") +
    theme(axis.title = element_text(size = 12,face = "bold"),
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 12, color = "black"))

Figure 3.1: Men’s Pareto-optimal batting within an innings with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were filtered out if both their runs scored was below 50 and their strike rate was below 100.

Show the code:

ggplot(BatCarPareto_men, aes(x = Average, y = StrikeRate)) +
    geom_point(alpha=0.3, size = 3) +
    geom_point(data = BatCarPareto_men %>% filter(.level == 1), alpha = 0.3, color = "red",size = 3)+
    geom_line(data = BatCarPareto_men %>% filter(.level == 1), color = "red")+
    geom_text(data = BatCarPareto_men %>% filter(.level == 1 & LastName == "Clarke"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.6,nudge_y = 1.5)+
    geom_text(data = BatCarPareto_men %>% filter(.level == 1 & LastName == "Hales"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.6,nudge_y = 1.7)+
    geom_text(data = BatCarPareto_men %>% filter(.level == 1 & LastName == "Marsh"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.6,nudge_y = 1)+
    geom_text(data = BatCarPareto_men %>% filter(.level == 1 & LastName == "McDermott"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.6,nudge_y = 1.8)+
    geom_text(data = BatCarPareto_men %>% filter(.level == 1 & LastName == "Pietersen"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.6,nudge_y = 0.9)+
    geom_text(data = BatCarPareto_men %>% filter(.level == 1 & !LastName %in% c("Clarke","Hales","Marsh","McDermott","Pietersen")), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.6)+
    theme_minimal()  +
    coord_cartesian(xlim = c(0,46))+
    labs(x = "Career Batting Average",
         y = "Career Batting Strike Rate") +
    theme(axis.title = element_text(size = 12,face = "bold"),
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 12, color = "black"))

Figure 3.2: Men’s Pareto-optimal batting across a career with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were filtered out if both their average was below 20 and their strike rate was below 100.

Men’s Pareto-optimal bowling

Three Pareto-optimal bowling innings in Figure 3.3 were identified: 1/0 (i.e., 1 wicket for 0 runs conceded) by Jhye Richardson, 3/3 by Mitchell Johnson, and 6/7 achieved by Lasith Malinga. While there were three occurrences of 0/0, none of these are deemed Pareto-optimal as 1/0 by Richardson supersedes this combination. Five Pareto-optimal bowling careers were identified in Figure 3.4 as having the best balance of bowling average, strike rate, and economy, with Adil Rashid achieving the lowest average (14.13), Lasith Malinga achieving the lowest economy (5.40), Mitchell Starc achieving the lowest strike rate (11.25), while Rashid Khan and Hayden Kerr were deemed Pareto-optimal due to a combination of the three metrics.

Show the code:

ggplot() +
  geom_jitter(data = BowlInnPareto_men %>% filter(.level != 1),aes(x = W, y = Econ), alpha = 0.1, size = 3, width = 0.1) +
  geom_point(data = BowlInnPareto_men %>% filter(.level == 1), aes(x = W, y = Econ), alpha = 0.3, shape = 21, size = 3, fill = "red", color = "red") +
  geom_text(data = BowlInnPareto_men %>% filter(.level == 1), aes(x = W, y = Econ, label = LastName), color = "Red", hjust = "left",nudge_x = -0.3,nudge_y = -1)+
  geom_line(data = BowlInnPareto_men %>% filter(.level == 1), aes(x = W, y = Econ), alpha = 0.5, colour = "red")+
  theme_minimal() +
  labs(x = "Wickets in an innings",
       y = "Innings Bowling Economy") +
    coord_cartesian(xlim = c(0,6.3))+
    theme(axis.title = element_text(size = 12,face = "bold"),
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 12, color = "black"))

Figure 3.3: Men’s Pareto-optimal bowling within an innings with the points on the Pareto frontier highlighted in red.

Show the code:

## Colour code the points: palette colour 2 (red) for the first frontier, 1 (black) otherwise
BowlCarPareto_men$color <- case_when(BowlCarPareto_men$.level == 1 ~ 2,
                                     BowlCarPareto_men$.level > 1 ~ 1)
## Label only the Pareto-optimal bowlers
BowlCarPareto_men$Label <- ifelse(BowlCarPareto_men$.level == 1, BowlCarPareto_men$LastName, NA)
## Columns are plotted in the order x = Economy, y = Average, z = StrikeRate, so the axis labels follow that order
MenBowlCarParetoPlot <- scatterplot3d(BowlCarPareto_men[c("Economy","Average","StrikeRate")], type = "h",pch = 16, color=BowlCarPareto_men$color,
                                  xlab="Career Bowling Economy",
                                  ylab="Career Bowling Average",
                                  zlab="Career Bowling Strike Rate")
## Convert the 3D coordinates to 2D plot coordinates so the labels can be placed
zz.coords <- MenBowlCarParetoPlot$xyz.convert(BowlCarPareto_men$Economy, BowlCarPareto_men$Average, BowlCarPareto_men$StrikeRate) 
text(zz.coords$x, 
     zz.coords$y,             
     labels = BowlCarPareto_men$Label,               
     cex = .8, 
     pos = 2,
     col = "red")

Figure 3.4: Men’s Pareto-optimal bowling across a career with the points on the Pareto frontier highlighted in red.

Women’s Pareto-optimal batting

Seven Pareto-optimal innings were identified in Figure 3.5 with extremities ranging from 6 runs off 1 ball (i.e., strike rate = 600) to 114 off 52 balls (i.e., strike rate = 219.23). In Figure 3.6, six Pareto-optimal batting careers were also identified with Laura Kimmince achieving the highest strike rate (144.08), Ellyse Perry achieving the highest average (50.15), with other Pareto-optimal solutions due to varying combinations of both metrics.

Show the code:

ggplot(BatInnPareto_women %>%  filter(R > 50 | SR > 100), aes(x = R, y = SR)) +
    geom_point(alpha=0.05, size = 3) +
    geom_point(data = BatInnPareto_women %>% filter(.level == 1), alpha = 0.1, color = "red",size = 3)+
    geom_text(data = BatInnPareto_women %>% filter(.level == 1 & !(LastName %in% c("Nitschke","Molineux","Devine","Harris"))), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2)+
    geom_text(data = BatInnPareto_women %>% filter(.level == 1 & LastName == "Nitschke"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 10)+
    geom_text(data = BatInnPareto_women %>% filter(.level == 1 & LastName == "Molineux"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 10)+
    geom_text(data = BatInnPareto_women %>% filter(.level == 1 & LastName == "Devine"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 10)+
    geom_text(data = BatInnPareto_women %>% filter(.level == 1 & LastName == "Harris"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 10)+
    geom_line(data = BatInnPareto_women %>% filter(.level == 1), alpha = 0.5, colour = "red")+
    scale_y_continuous(breaks = seq(100,600,100))+
    coord_cartesian(xlim = c(0,125))+
    theme_minimal()  +
    labs(x = "Runs Scored in an Innings",
         y = "Innings Batting Strike Rate") +
    theme(axis.title = element_text(size = 12,face = "bold"),
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 12, color = "black"))

Figure 3.5: Women’s Pareto-optimal batting within an innings with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were filtered out if both their runs scored was below 50 and their strike rate was below 100.

Show the code:

ggplot(BatCarPareto_women, aes(x = Average, y = StrikeRate)) +
    geom_point(alpha=0.3, size = 3) +
    geom_point(data = BatCarPareto_women %>% filter(.level == 1), alpha = 0.3, color = "red",size = 3)+
    geom_line(data = BatCarPareto_women %>% filter(.level == 1), color = "red")+
    geom_text(data = BatCarPareto_women %>% filter(.level == 1 & !LastName %in% c("Kimmince","Lanning","Mooney")), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 0.6)+
    geom_text(data = BatCarPareto_women %>% filter(.level == 1 & LastName == "Kimmince"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 0.6,nudge_y = 1)+
    geom_text(data = BatCarPareto_women %>% filter(.level == 1 & LastName == "Lanning"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 0.6,nudge_y = 1.5)+
    geom_text(data = BatCarPareto_women %>% filter(.level == 1 & LastName == "Mooney"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 0.6,nudge_y = -0.5)+
    coord_cartesian(xlim = c(0,53))+
    theme_minimal()  +
    labs(x = "Career Batting Average",
         y = "Career Batting Strike Rate") +
    theme(axis.title = element_text(size = 12,face = "bold"),
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 12, color = "black"))

Figure 3.6: Women’s Pareto-optimal batting across a career with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were filtered out if both their average was below 20 and their strike rate was below 100.

Women’s Pareto-optimal bowling

Three Pareto-optimal bowling innings in Figure 3.7 were identified: 2/0 by Samantha Bates, 4/2 by Jemma Barsby, and 5/8 achieved by Amanda-Jade Wellington. Six Pareto-optimal bowling careers were identified in Figure 3.8 as having the best balance of bowling average, strike rate, and economy, with Julie Hunter achieving both the lowest average (16.38) and lowest economy (5.16), Harmanpreet Kaur achieving the lowest strike rate (16.00), while Sarah Aley, Darcie Brown, Ruth Johnston, and Hannah Darlington are deemed Pareto-optimal due to a combination of the three metrics.

Show the code:

ggplot(BowlInnPareto_women, aes(x = W, y = Econ)) +
    geom_jitter(data = BowlInnPareto_women %>% filter(.level != 1),aes(x = W, y = Econ), alpha=0.1, size = 3, width = 0.1) +
    geom_point(data = BowlInnPareto_women %>% filter(.level == 1), aes(x = W, y = Econ), alpha = 0.3, shape = 21, size = 3, fill = "red", color = "red") +
    geom_text(data = BowlInnPareto_women %>% filter(.level == 1), aes(x = W, y = Econ, label = LastName), color = "Red", hjust = "left",nudge_x = -0.2,nudge_y = -1)+
    geom_line(data = BowlInnPareto_women %>% filter(.level == 1), aes(x = W, y = Econ), alpha = 0.5, colour = "red")+
    theme_minimal() +
    coord_cartesian(xlim = c(0,5.5))+
    scale_x_continuous(breaks = c(0:5))+
    labs(x = "Wickets in an innings",
         y = "Innings Bowling Economy") +
    theme(axis.title = element_text(size = 12,face = "bold"),
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 12, color = "black"))

Figure 3.7: Women’s Pareto-optimal bowling within an innings with the points on the Pareto frontier highlighted in red.

Show the code:

## Colour code the points: palette colour 2 (red) for the first frontier, 1 (black) otherwise
BowlCarPareto_women$color <- case_when(BowlCarPareto_women$.level == 1 ~ 2,
                                       BowlCarPareto_women$.level > 1 ~ 1)
## Label only the Pareto-optimal bowlers
BowlCarPareto_women$Label <- ifelse(BowlCarPareto_women$.level == 1, BowlCarPareto_women$LastName, NA)
## Columns are plotted in the order x = Economy, y = Average, z = StrikeRate, so the axis labels follow that order
WomenBowlCarParetoPlot <- scatterplot3d(BowlCarPareto_women[c("Economy","Average","StrikeRate")], type = "h",pch = 16, color=BowlCarPareto_women$color,
                                     xlab="Career Bowling Economy",
                                     ylab="Career Bowling Average",
                                     zlab="Career Bowling Strike Rate")
## Convert the 3D coordinates to 2D plot coordinates so the labels can be placed
zz.coords <- WomenBowlCarParetoPlot$xyz.convert(BowlCarPareto_women$Economy, BowlCarPareto_women$Average, BowlCarPareto_women$StrikeRate) 
text(zz.coords$x, 
     zz.coords$y,             
     labels = BowlCarPareto_women$Label,               
     cex = .8, 
     pos = 2,
     col = "red")

Figure 3.8: Women’s Pareto-optimal bowling across a career with the points on the Pareto frontier highlighted in red.

3.5 Discussion

The present study aimed to introduce Pareto frontiers to sports scientists and demonstrate their application in identifying and visualizing the most extraordinary players when considering multiple variables. While it is intuitively recognized that multiple attributes are required for success in sport, identifying the Pareto frontier between these attributes is a simple, yet effective method to identify all the players that possess an optimal balance of these attributes relative to other individuals in the cohort. By analyzing talent multivariately, rather than simply analyzing multiple variables univariately, players can be deemed optimal despite not being objectively highest in a single variable. The present study highlights that when there are conflicting attributes that are of equal interest, the attributes should be viewed in tandem using Pareto frontiers, or else there is a risk that expecting an individual to attain the highest level in both attributes univariately may be unfeasible. This was evident in all eight Pareto frontiers, as at least one athlete was identified in each example who was not the highest-ranked athlete in any metric when analyzed univariately, and yet was deemed Pareto-optimal due to their balance in the metrics of interest. For instance, when observing the career batting Pareto frontier in the WBBL (Figure 3.6), there is an expansive continuum of batters that are all deemed Pareto-optimal as they each have slightly different average/strike rate profiles.

The main advantage of Pareto frontiers highlighted in the present study is identifying athletes who are optimal across multiple metrics even when they are not the highest ranked in any metric. This was most evident in the MBBL, where Chris Lynn, when viewed univariately, has the 14th-highest career batting average (34.54), which is 8.24 runs per innings lower than the highest (Figure 3.2). Similarly, he has the eighth-highest strike rate, striking at 148.84, which is 15.22 runs per 100 balls lower than the highest. However, when both metrics are considered simultaneously and visualized together, it is clear that he is one of the best batters across the 11 seasons of the MBBL.
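The dominance rule behind this result can be illustrated with a minimal base R sketch. The helper `pareto_optimal()` and the five batters below are hypothetical and are not Big Bash data; the sketch assumes that both metrics are "higher is better" and contain no missing values. It is intended only to show how a player ranked second and third on the individual metrics can still sit on the frontier.

```r
# Flag rows on the Pareto frontier when every column is "higher is better".
# `metrics` is a numeric matrix or data frame with one row per player.
pareto_optimal <- function(metrics) {
  m <- as.matrix(metrics)
  vapply(seq_len(nrow(m)), function(i) {
    # Row j dominates row i if it is at least as good in every metric
    # and strictly better in at least one.
    at_least_as_good <- apply(m, 1, function(r) all(r >= m[i, ]))
    strictly_better  <- apply(m, 1, function(r) any(r > m[i, ]))
    !any(at_least_as_good & strictly_better)
  }, logical(1))
}

# Hypothetical career figures (illustrative only).
batters <- data.frame(
  player      = c("A", "B", "C", "D", "E"),
  average     = c(42, 35, 30, 33, 25),
  strike_rate = c(125, 149, 164, 140, 150),
  stringsAsFactors = FALSE
)

batters$on_frontier <- pareto_optimal(batters[, c("average", "strike_rate")])
batters$avg_rank    <- rank(-batters$average)      # 1 = highest average
batters$sr_rank     <- rank(-batters$strike_rate)  # 1 = highest strike rate
print(batters)
```

In this toy cohort, player B is not ranked first on either metric but is not dominated by any other player, so B is flagged as Pareto-optimal alongside the two univariate leaders.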

Pareto frontiers can also be used to provide benchmarks when conflicting attributes are both desirable. By analyzing these attributes in tandem, more realistic expectations can be set for each athlete depending on where they sit in the Cartesian plane. The resulting benchmarks can be more individualized than those available when viewing metrics univariately. Consequently, by viewing these metrics multivariately, different levels of quality within differing squad roles can be better expressed. Pareto frontiers are not limited to extreme values: the second, third, and n-th frontiers can also be visualized to further interrogate where an individual lies within the Cartesian space. For example, in Figure 3.5, it was evident that there is a substantial gap between the first and second WBBL career batting Pareto frontiers.
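One way to obtain the second, third, and n-th frontiers is to "peel" the data: find the first frontier, remove it, and repeat on the remaining players. The base R sketch below illustrates this idea under the assumption that all metrics are "higher is better" and complete; the helpers `pareto_optimal()` and `frontier_level()` and the six batters are hypothetical, and base R is used here only to keep the example self-contained.

```r
# Flag rows on the Pareto frontier when every column is "higher is better".
pareto_optimal <- function(metrics) {
  m <- as.matrix(metrics)
  vapply(seq_len(nrow(m)), function(i) {
    at_least_as_good <- apply(m, 1, function(r) all(r >= m[i, ]))
    strictly_better  <- apply(m, 1, function(r) any(r > m[i, ]))
    !any(at_least_as_good & strictly_better)
  }, logical(1))
}

# Assign each row a frontier level: 1 = first frontier, 2 = the frontier
# of what remains once level 1 is removed, and so on.
frontier_level <- function(metrics) {
  m <- as.matrix(metrics)
  level <- rep(NA_integer_, nrow(m))
  current <- 1L
  while (anyNA(level)) {
    remaining <- which(is.na(level))
    on_front  <- pareto_optimal(m[remaining, , drop = FALSE])
    level[remaining[on_front]] <- current
    current <- current + 1L
  }
  level
}

# Hypothetical career figures (illustrative only).
batters <- data.frame(
  player      = c("A", "B", "C", "D", "E", "F"),
  average     = c(42, 35, 30, 33, 25, 20),
  strike_rate = c(125, 149, 164, 140, 150, 120),
  stringsAsFactors = FALSE
)
batters$level <- frontier_level(batters[, c("average", "strike_rate")])
print(batters)
```

The resulting `level` column could then be mapped to colour or line type in a plot so that the gap between successive frontiers is visible.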

In this study, we chose to observe batting and bowling as purely independent roles within cricket; however, there are also avenues for Pareto frontiers to be established for all-rounders (i.e., players who are picked for both their batting and bowling ability). It should be noted that if an all-rounder Pareto frontier were to be established with batting average and strike rate as well as bowling average, economy, and strike rate, the resulting five-dimensional outputs, while valid and executable, would be difficult to interpret and visualize. For such an analysis, a factor-reduction technique such as principal components analysis should be considered, and the Pareto frontier could be built from the extracted components (e.g., batting and bowling).

This technique could also be used to identify “maximal” efforts when multiple observations of an individual are recorded, which is common in sports science practice. For example, Duthie et al. (2021) sought to define maneuverability by identifying the maximum tortuosity within each 0.5 m·s⁻¹ increment of speed and plotting the line of best fit through these points. In this instance, a Pareto frontier could have been fitted to the data to eliminate the need for an arbitrary selection of 0.5 m·s⁻¹ speed bins. Similarly, Morin et al. (2021) sought to develop an in-situ acceleration-speed profile by identifying the maximum acceleration within each 0.2 m·s⁻¹ of speed and plotting the line of best fit through these points. By using a Pareto frontier, the need for an arbitrary 3 m·s⁻¹ threshold and 0.2 m·s⁻¹ speed bins could have been avoided. Finally, Rudsits et al. (2018) sought to identify the torque-cadence and power-cadence profiles of cyclists by identifying maximal torque or power values in each 5 rpm bin. If a Pareto frontier were used, the model would not require any additional filtering to remove “non-maximal” efforts, as the frontier would have already deemed those points non-optimal.
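The bin-free idea can be sketched in base R for an acceleration-speed profile. The data below are simulated under an assumed linear ceiling and are not taken from any of the cited studies. The sketch keeps only the samples for which no other sample is both faster and accelerating harder, then fits a line through those points; it assumes continuous measurements with no exact duplicates.

```r
set.seed(1)

# Simulated tracking samples (hypothetical): most efforts are sub-maximal.
n         <- 2000
speed     <- runif(n, 0, 9)            # m/s
max_accel <- 6 - 0.65 * speed          # assumed ceiling, m/s^2
accel     <- max_accel * rbeta(n, 2, 2)

# Sort from fastest to slowest; a sample is on the frontier if its
# acceleration exceeds that of every faster sample seen so far.
ord       <- order(-speed, -accel)
s         <- speed[ord]
a         <- accel[ord]
prev_best <- c(-Inf, head(cummax(a), -1))
keep      <- a > prev_best

frontier <- data.frame(speed = s[keep], accel = a[keep])

# Line of best fit through the frontier points only (no speed bins needed).
fit <- lm(accel ~ speed, data = frontier)
print(nrow(frontier))
print(coef(fit))
```

Because the frontier is defined by dominance rather than by bins, no bin width has to be chosen, although the frontier remains sensitive to measurement outliers, which would need to be handled before fitting.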

The present study also illustrated how Pareto frontiers can be used to visualize talent in more than two dimensions. For example, while Darcie Brown has the seventh-lowest bowling average, 11th-lowest economy, and the 18th-lowest strike rate (see Figure 3.8), she can be deemed a Pareto-optimal bowler as there are no other bowlers who supersede her across all three metrics. Similarly, in the MBBL (Figure 3.4), while Rashid Khan has the sixth-lowest average, sixth-lowest economy, and the 18th-lowest strike rate, he can be deemed a Pareto-optimal bowler as there are no other bowlers who supersede him across all three metrics. While there will be some correlations between the three bowling metrics (i.e., average, economy, and strike rate) as the metrics are related (e.g., wickets taken is the denominator of both average and strike rate), visualizing the third dimension is still necessary, as the reader would otherwise need to multiply the x and y values to understand where a bowler sits in the third dimension. This could be further expanded into higher dimensions; however, these dimensions become increasingly difficult to visualize.
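The relationship between the three bowling metrics can be made explicit with a small worked example. The sketch below assumes the standard definitions (average as runs conceded per wicket, economy as runs conceded per six-ball over, and strike rate as balls bowled per wicket), under which average equals economy multiplied by strike rate and divided by six. The four bowlers and the helper `pareto_optimal()` are hypothetical; because lower is better for all three metrics, the values are negated before the dominance check.

```r
# Flag rows on the Pareto frontier when every column is "higher is better".
pareto_optimal <- function(metrics) {
  m <- as.matrix(metrics)
  vapply(seq_len(nrow(m)), function(i) {
    at_least_as_good <- apply(m, 1, function(r) all(r >= m[i, ]))
    strictly_better  <- apply(m, 1, function(r) any(r > m[i, ]))
    !any(at_least_as_good & strictly_better)
  }, logical(1))
}

# Hypothetical career totals (illustrative only).
bowlers <- data.frame(
  bowler  = c("P", "Q", "R", "S"),
  runs    = c(600, 700, 540, 800),
  balls   = c(480, 600, 360, 600),
  wickets = c(30, 28, 20, 25),
  stringsAsFactors = FALSE
)

bowlers$average     <- bowlers$runs / bowlers$wickets      # runs per wicket
bowlers$economy     <- bowlers$runs / (bowlers$balls / 6)  # runs per over
bowlers$strike_rate <- bowlers$balls / bowlers$wickets     # balls per wicket

# Check the identity: average = economy * strike rate / 6.
identity_holds <- isTRUE(all.equal(
  bowlers$average,
  bowlers$economy * bowlers$strike_rate / 6
))

# Lower is better for all three metrics, so negate before testing dominance.
three_metrics <- bowlers[, c("average", "economy", "strike_rate")]
bowlers$on_frontier <- pareto_optimal(-three_metrics)
print(bowlers)
```

For bowler P, for example, an economy of 7.5 and a strike rate of 16 give an average of 7.5 × 16 / 6 = 20, which is the multiplication a reader would otherwise have to perform mentally from a two-dimensional plot.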

It should also be considered that there is some level of uncertainty surrounding each observation in the career Pareto frontiers due to the differing number of observations. For example, Joe Clarke is deemed Pareto-optimal as he is currently striking at 153.82 at an average of 28.94 after 16 innings; however, it is reasonable to assume that his position on the frontier is less certain than that of Chris Lynn, who has 100 observations. Therefore, future research could consider providing confidence or credible intervals around the probability that an individual lies on the Pareto frontier. It would then be feasible to calculate the probability that an individual sits on the first, second, or third frontier.
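One possible way to approximate such a probability, offered here only as a sketch and not as a method used in this study, is a nonparametric bootstrap: resample each player's innings with replacement, recompute the career metrics, and record how often the player lands on the frontier. The innings data, the scoring model, and the helpers `pareto_optimal()` and `career_summary()` below are all hypothetical assumptions.

```r
set.seed(42)

# Flag rows on the Pareto frontier when every column is "higher is better".
pareto_optimal <- function(metrics) {
  m <- as.matrix(metrics)
  vapply(seq_len(nrow(m)), function(i) {
    at_least_as_good <- apply(m, 1, function(r) all(r >= m[i, ]))
    strictly_better  <- apply(m, 1, function(r) any(r > m[i, ]))
    !any(at_least_as_good & strictly_better)
  }, logical(1))
}

# Simulated innings for four hypothetical players with unequal sample sizes.
n_innings <- c(A = 100, B = 16, C = 60, D = 40)
innings <- do.call(rbind, lapply(names(n_innings), function(p) {
  n     <- n_innings[[p]]
  balls <- rpois(n, 20) + 1
  data.frame(player = p,
             balls  = balls,
             runs   = rpois(n, balls * 1.4),
             out    = rbinom(n, 1, 0.8),
             stringsAsFactors = FALSE)
}))

# Career batting average (runs per dismissal) and strike rate (runs per 100 balls).
career_summary <- function(d) {
  data.frame(average     = sum(d$runs) / max(sum(d$out), 1),
             strike_rate = 100 * sum(d$runs) / sum(d$balls))
}

by_player <- split(innings, innings$player)

# One bootstrap replicate: resample innings within each player, then
# recompute the frontier from the resampled career figures.
boot_once <- function() {
  res <- do.call(rbind, lapply(by_player, function(d) {
    career_summary(d[sample(nrow(d), replace = TRUE), ])
  }))
  pareto_optimal(res)
}

n_boot <- 500
hits   <- replicate(n_boot, boot_once())  # players x replicates, logical
prob_on_frontier <- setNames(rowMeans(hits), names(by_player))
print(round(prob_on_frontier, 2))
```

Players with few innings would be expected to show more variable frontier membership across replicates; a Bayesian model that yields credible intervals would be an alternative to this resampling approach.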

While the present study used Twenty20 cricket to illustrate the power and usefulness of Pareto frontiers, the concept can be widely applied within sports science data sets, especially when the variables of interest are uncorrelated or negatively correlated. Pareto frontiers can still be established between two positively correlated metrics; however, it is likely that there will be fewer “hidden” athletes on this frontier, as the athletes who are high in one metric will naturally tend to be high in the other. Future research should apply Pareto frontiers across different avenues within sports performance analysis, such as repeated-sprint ability and dynamic strength index, which have multi-faceted determinants. In addition, there are many other possibilities within sports science whereby Pareto frontiers can reveal athletes who possess the optimal balance of the metrics of interest.
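The effect of correlation on the size of the frontier can be illustrated with a short simulation. The sketch below draws two standard-normal metrics with an assumed correlation, counts the points on the two-dimensional frontier (both metrics "higher is better"), and averages that count over repeated draws. The sample size, correlations, and helper names are arbitrary, hypothetical choices.

```r
set.seed(2022)

# Count frontier points for two "higher is better" metrics:
# sort by x descending, then keep points whose y beats every earlier y.
count_frontier <- function(x, y) {
  ord      <- order(-x, -y)
  y_sorted <- y[ord]
  sum(y_sorted > c(-Inf, head(cummax(y_sorted), -1)))
}

# Average frontier size over repeated samples with correlation `rho`.
simulate_size <- function(rho, n = 300, reps = 200) {
  mean(replicate(reps, {
    x <- rnorm(n)
    y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
    count_frontier(x, y)
  }))
}

sizes <- sapply(c(negative = -0.7, none = 0, positive = 0.7), simulate_size)
print(round(sizes, 1))
```

Under these assumptions, the average number of frontier points is largest when the metrics are negatively correlated and smallest when they are positively correlated, consistent with the expectation that fewer “hidden” athletes emerge when metrics move together.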

Barr, G., & Kantor, B. (2004). A criterion for comparing and selecting batsmen in limited overs cricket. Journal of the Operational Research Society, 55(12), 1266–1274. https://doi.org/10.1057/palgrave.jors.2601800

Bukiet, B., & Ovens, M. (2006). A mathematical modelling approach to one-day cricket batting orders. Journal of Sports Science & Medicine, 5(4), 495–502. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3861747/

Dodd, K., & Newans, T. (2018). Talent identification for soccer: Physiological aspects. Journal of Science and Medicine in Sport, 21(10), 1073–1078. https://doi.org/10.1016/j.jsams.2018.01.009

Duthie, G., Robertson, S., & Thornton, H. (2021). A GNSS-based method to define athlete manoeuvrability in field-based team sports. PLoS One, 16(11), e0260363. https://doi.org/10.1371/journal.pone.0260363

Falk, B., Lidor, R., Lander, Y., & Lang, B. (2004). Talent identification and early development of elite water-polo players: A 2-year follow-up study. Journal of Sports Sciences, 22(4), 347–355. https://doi.org/10.1080/02640410310001641566

Johnston, K., Wattie, N., Schorer, J., & Baker, J. (2018). Talent identification in sport: A systematic review. Sports Medicine, 48(1), 97–109. https://doi.org/10.1007/s40279-017-0803-2

Jones, A., Kirby, B., Clark, I., Rice, H., Fulkerson, E., Wylie, L., Wilkerson, D., Vanhatalo, A., & Wilkins, B. (2021). Physiological demands of running at 2-hour marathon race pace. Journal of Applied Physiology, 130(2), 369–379. https://doi.org/10.1152/japplphysiol.00647.2020

Lievens, E., Bellinger, P., Van Vossel, K., Vancompernolle, J., Bex, T., Minahan, C., & Derave, W. (2021). Muscle typology of world-class cyclists across various disciplines and events. Medicine and Science in Sports and Exercise, 53(4), 816–824. https://doi.org/10.1249/MSS.0000000000002518

Ligges, U., & Mächler, M. (2003). Scatterplot3d - an R package for visualizing multivariate data. Journal of Statistical Software, 8(11), 1–20. http://www.jstatsoft.org

Maquirriain, J., Baglione, R., & Cardey, M. (2016). Male professional tennis players maintain constant serve speed and accuracy over long matches on grass courts. European Journal of Sport Science, 16(7), 845–849. https://doi.org/10.1080/17461391.2016.1156163

Mastroddi, F., & Gemma, S. (2013). Analysis of Pareto frontiers for multidisciplinary design optimization of aircraft. Aerospace Science and Technology, 28(1), 40–55. https://doi.org/10.1016/j.ast.2012.10.003

Minahan, C., Newans, T., Quinn, K., Parsonage, J., Buxton, S., & Bellinger, P. (2021). Strong, fast, fit, lean, and safe: A positional comparison of physical and physiological qualities within the 2020 Australian Women’s Rugby League team. The Journal of Strength and Conditioning Research, 35(Suppl 2), S11–S19. https://doi.org/10.1519/JSC.0000000000004106

Morin, J.-B., Le Mat, Y., Osgnach, C., Barnabò, A., Pilati, A., Samozino, P., & Prampero, P. di. (2021). Individual acceleration-speed profile in-situ: A proof of concept in professional football players. Journal of Biomechanics, 123, 110524. https://doi.org/10.1016/j.jbiomech.2021.110524

Patel, A., Bracewell, P., Gazley, A., & Bracewell, B. (2017). Identifying fast bowlers likely to play test cricket based on age-group performances. International Journal of Sports Science & Coaching, 12(3), 328–338. https://doi.org/10.1177/1747954117710514

Pérez-Toledano, M. Á., Rodriguez, F., García-Rubio, J., & Ibañez, S. J. (2019). Players’ selection for basketball teams, through performance index rating, using multiobjective evolutionary algorithms. PLoS One, 14(9), e0221258. https://doi.org/10.1371/journal.pone.0221258

Pion, J., Lenoir, M., Vandorpe, B., & Segers, V. (2015). Talent in female gymnastics: A survival analysis based upon performance characteristics. International Journal of Sports Medicine, 36(11), 935–940. https://doi.org/10.1055/s-0035-1548887

Pyne, D., Gardner, A., Sheehan, K., & Hopkins, W. (2005). Fitness testing and career progression in AFL football. Journal of Science and Medicine in Sport, 8(3), 321–332. https://doi.org/10.1016/s1440-2440(05)80043-x

R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Roocks, P. (2016). Computing Pareto frontiers and database preferences with the rPref package. The R Journal, 8(2), 393–404. https://doi.org/10.32614/RJ-2016-054

Rudsits, B., Hopkins, W., Hautier, C., & Rouffet, D. (2018). Force-velocity test on a stationary cycle ergometer: Methodological recommendations. Journal of Applied Physiology, 124(4), 831–839. https://doi.org/10.1152/japplphysiol.00719.2017

Sánchez-García, M., Sánchez-Sánchez, J., Rodríguez-Fernández, A., Solano, D., & Castillo, D. (2018). Relationships between sprint ability and endurance capacity in soccer referees. Sports, 6(2), 28. https://doi.org/10.3390/sports6020028

Stølen, T., Chamari, K., Castagna, C., & Wisløff, U. (2005). Physiology of soccer: An update. Sports Medicine (Auckland, N.Z.), 35(6), 501–536. https://doi.org/10.2165/00007256-200535060-00004

Till, K., Cobley, S., Morley, D., O’Hara, J., Chapman, C., & Cooke, C. (2016). The influence of age, playing position, anthropometry and fitness on career attainment outcomes in rugby league. Journal of Sports Sciences, 34(13), 1240–1245. https://doi.org/10.1080/02640414.2015.1105380

Turner, A., Jones, B., Stewart, P., Bishop, C., Parmar, N., Chavda, S., & Read, P. (2019). Total score of athleticism: Holistic athlete profiling to enhance decision-making. Strength & Conditioning Journal, 41(6), 91–101. https://doi.org/10.1519/SSC.0000000000000506

Wells, G., Elmi, M., & Thomas, S. (2009). Physiological correlates of golf performance. The Journal of Strength and Conditioning Research, 23(3), 741–750. https://doi.org/10.1519/JSC.0b013e3181a07970

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. https://ggplot2.tidyverse.org

Wickham, H. (2021). tidyr: Tidy messy data. https://CRAN.R-project.org/package=tidyr

Wickham, H., François, R., Henry, L., & Müller, K. (2021). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr