Appendix A — Pareto Frontiers for Multivariate Cricket Performance

A.1 Abstract

In Twenty20 cricket, there are trade-offs between batting average and strike rate, and among bowling strike rate, economy, and average. This study presents Pareto frontiers as a tool to identify athletes who possess an optimal ranking when considering multiple metrics simultaneously. A total of 884 Twenty20 matches from the Indian Premier League were compiled to determine the best batting and bowling performances, both within a single innings and across each player’s career. Pareto frontiers identified nine optimal batting innings and six optimal batting careers. Pareto frontiers also identified three optimal bowling innings and five optimal bowling careers. Each frontier identified players who were not the highest ranked athlete in any metric when analysed univariately. Pareto frontiers can be used when assessing talent across multiple metrics, especially when these metrics may be conflicting or uncorrelated. Pareto frontiers can identify athletes who may not have the highest ranking on a given metric but have an optimal balance across multiple metrics that are associated with success in a given sport.

The following chapter is a copy of the published manuscript:

Newans, T., Bellinger, P., & Minahan, C. Identifying multivariate cricket performance using Pareto frontiers. MathSport Conference 2022.

As co-author of the paper “Identifying multivariate cricket performance using Pareto frontiers”, I confirm that Timothy Newans has made the following contributions:

  • Study concept and design

  • Data collection

  • Data analysis and interpretation

  • Manuscript preparation

Name: Clare Minahan

Date: 29/03/2023

A.2 Introduction

The need to identify attributes to quantify optimal performance is evident for every sport (Johnston et al., 2018). With the exception of a few single-skill sports (Rienhoff et al., 2013), most athletes require a number of attributes to perform in their given sport. These attributes can encompass physical (Kelly & Williams, 2020), physiological (Dodd & Newans, 2018), mental (Morris, 2000), or skill-based characteristics (Davids et al., 2000), that all can contribute to the performance of a player. Attributes such as speed, endurance, agility, strength, power, and accuracy are common across multiple sports (Davids et al., 2000), and each attribute can have multiple variables seeking to quantify that attribute. As such, coaches and support staff are consistently looking for new variables that could be used to either quantify new attributes of interest or develop more variables to better quantify already-identified attributes with the hope that these new variables can identify previously-hidden talent or interrogate subtle differences between different athletes. However, with the increase in the number of attributes of interest, the likelihood that an athlete excels in every attribute decreases. Consequently, methods are required that can analyse multiple attributes simultaneously, rather than viewing each attribute in isolation.

While traditional statistical techniques in research focus on identifying the mean and standard deviation of a population (Hopkins et al., 2009), sports are typically not interested in the mean during talent identification; rather, they are looking for outliers. That is, coaches and support staff are looking for athletes who sit furthest from the mean in the direction in which success is defined. Therefore, when multiple attributes are of interest, athletes are typically selected by choosing those who sit furthest from the mean within each attribute. While this process can work when variables are positively correlated, it can miss talent when variables are negatively correlated. For instance, at the elite level, there is a negative correlation between maximal sprint speed and endurance capacity (Sánchez-García et al., 2018). However, running-based team sports require athletes to possess both speed and endurance to play at the elite level and, therefore, players necessarily need to trade off between having optimal speed and optimal endurance. Put simply, if both speed and endurance were equally required for success, selecting the top-n sprinters and the top-n endurance runners may not yield the optimal athletes for that sport.

Consequently, both attributes need to be viewed in tandem. The process of optimising the balance of multiple attributes is termed ‘multi-objective optimisation’, which aims to find the best achievable balance of the attributes of interest. Mathematically, for metrics expressed so that lower values are better, a data point \(\vec{x_1} \in X\) is better than (i.e., dominates) another data point \(\vec{x_2} \in X\) if \(f_{i}(\vec{x_1}) \leq f_{i}(\vec{x_2})\) for all metrics \(i \in \{1, 2, \ldots, k\}\) and \(f_{j}(\vec{x_1}) < f_{j}(\vec{x_2})\) for at least one metric \(j \in \{1, 2, \ldots, k\}\). The points that are not dominated by any other point are deemed Pareto-optimal and form what is called the Pareto frontier.
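
The dominance condition above can be sketched in a few lines of base R. This is a minimal illustration rather than the implementation used in the study: the function name `dominates` and the three example points are hypothetical, and every metric is assumed to be one where lower values are better.

```r
# Returns TRUE when point a Pareto-dominates point b, assuming every
# metric is to be minimised (the convention used in the definition above).
dominates <- function(a, b) {
  all(a <= b) && any(a < b)
}

# Hypothetical (average, economy) pairs for three bowlers; not real data.
p1 <- c(20, 7.0)
p2 <- c(22, 7.5)
p3 <- c(18, 8.0)

dominates(p1, p2)  # TRUE: p1 is lower on both metrics
dominates(p1, p3)  # FALSE: p3 has the lower average
dominates(p3, p1)  # FALSE: p1 has the lower economy
```

When higher values are better (e.g., batting average), the same function can be applied after negating the metric, so that "lower is better" still holds.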

In Twenty20 cricket, there are multiple facets within both batting and bowling that can define success. Unlike Test cricket and, to an extent, One-Day cricket, where scoring as many runs as possible regardless of how many deliveries are faced is of most importance, Twenty20 cricket requires batters to score faster (i.e., at a higher strike rate), which, in some cases, can come at the expense of preserving their wicket, and requires bowlers to concede minimal runs. Therefore, there is a trade-off between batting average and strike rate, as well as among bowling economy, average, and strike rate, within Twenty20 cricket. For example, early in an innings the risk-return of attempting to hit six runs off a ball is markedly different from that in the final over of an innings. Similarly, a bowler needs to balance taking wickets with conceding as few runs as possible. For instance, when bowling four overs, it is difficult to determine whether taking three wickets for 50 runs is worth more than taking no wickets but conceding only eight runs, as the three wickets may not have been worth conceding 50 runs. As multiple attributes within each domain are of interest, Pareto frontiers can be used to identify batters and bowlers who may not record the best value in any single variable but display an optimal balance across the attributes. When assessing the quality of players, it is therefore necessary to utilise tools that can analyse these data sets without favouring one metric over another. Accordingly, the present study aimed to use Pareto frontiers to identify the best-performing Twenty20 batters and bowlers.

A.3 Methods

The present study comprised all 884 matches of the first 14 editions of the men’s Indian Premier League (IPL), India’s domestic T20 cricket competition. The data set contained 566 batters and 467 bowlers. Collectively, there were 13,357 individual batting innings with observations ranging from 1-208 innings per batter, while there were 10,925 individual bowling innings with observations ranging from 1-180 innings per bowler.

Show the code:
library(tidyverse)
library(rPref)
library(patchwork)
library(scatterplot3d)
options(scipen = 999)
iplbat <- read_csv('www/data/Study_A_pareto_iplbat.csv') # Men's batting scorecards
iplbowl <- read_csv('www/data/Study_A_pareto_iplbowl.csv') # Men's bowling scorecards

To summarise the data, two summary statistics were generated for batting and three summary statistics were generated for bowling. The summary statistics were as follows:

  • Batting Average: runs scored divided by frequency of dismissal

  • Batting Strike Rate: runs scored per 100 balls faced (i.e., runs scored divided by balls faced, multiplied by 100)

  • Bowling Average: runs conceded divided by wickets taken

  • Bowling Strike Rate: balls bowled divided by wickets taken

  • Bowling Economy: runs conceded divided by overs (i.e., 6 balls) bowled
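
As a worked illustration of the definitions above, the following base R sketch computes each statistic for two performances reported later in this appendix (175 runs off 66 balls; 6 wickets for 12 runs in 3.4 overs). The helper names are hypothetical and are not part of the study's code; `overs_to_balls` assumes the usual scorecard notation in which "3.4" means three overs and four balls.

```r
# Convert scorecard overs notation to balls: "3.4" means 3 overs and 4 balls.
overs_to_balls <- function(overs) {
  parts <- strsplit(as.character(overs), ".", fixed = TRUE)[[1]]
  whole <- as.integer(parts[1])
  extra <- if (length(parts) > 1) as.integer(parts[2]) else 0L
  whole * 6L + extra
}

batting_strike_rate <- function(runs, balls) runs / balls * 100
bowling_average     <- function(runs, wickets) runs / wickets
bowling_strike_rate <- function(balls, wickets) balls / wickets
bowling_economy     <- function(runs, balls) runs / balls * 6

# Batting innings of 175 runs off 66 balls
round(batting_strike_rate(175, 66), 2)   # 265.15

# Bowling innings of 6 wickets for 12 runs in 3.4 overs
balls <- overs_to_balls("3.4")           # 22
bowling_average(12, 6)                   # 2
round(bowling_strike_rate(balls, 6), 2)  # 3.67
round(bowling_economy(12, balls), 2)     # 3.27
```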

Show the code:
dismissals <- iplbat %>%
  group_by(id) %>%
  filter(Dismissed == TRUE) %>%
  summarise(Dismissals = n()) ## Calculate number of dismissals

notouts <- iplbat %>%
  group_by(id) %>%
  filter(Dismissed == FALSE) %>%
  summarise(NotOuts = n()) ## Calculate number of not outs

sumBat <- iplbat %>%
  group_by(id, Batter, LastName) %>%
  summarise(
    TotalRuns = sum(R),
    TotalBalls = sum(B),
    Innings = n()
  ) %>%
  ungroup() %>%
  left_join(dismissals, by = "id") %>%
  left_join(notouts, by = "id") %>%
  mutate(
    ## Batters absent from a lookup table have zero dismissals / not outs
    Dismissals = case_when(is.na(Dismissals) ~ 0L,
                           TRUE ~ Dismissals),
    NotOuts = case_when(is.na(NotOuts) ~ 0L,
                           TRUE ~ NotOuts),
    Average = TotalRuns / Dismissals,
    StrikeRate = TotalRuns / TotalBalls * 100) ## Calculate career batting average and strike rate

filtBat <- sumBat %>%
  filter(Innings >= 20) ## Filter only those with 20 or more batting innings

sumBowl <- iplbowl %>%
  group_by(id, Bowler, LastName) %>%
  summarise(
    Innings = n(),
    Balls = sum(Balls),
    Wickets = sum(W),
    Runs = sum(R)
  ) %>%
  mutate(
    Average = Runs / Wickets,
    Economy = Runs / Balls * 6,
    StrikeRate = Balls / Wickets
  ) ## Calculate bowling average, economy, and strike rate

filtBowl <- sumBowl %>%
  filter(Innings > 20) %>%
  ungroup() ## Filter only those with more than 20 bowling innings

To understand both the batting and bowling attributes within cricket, four Pareto frontiers were established within the data set:

  1. Pareto-optimal Batting Innings

    This analysis outlined the highest runs scored within an innings at the highest strike rate.

Show the code:
BatInnPareto <- psel(iplbat %>% filter(R > 0),high(R)*high(SR),top_level = 999)
  2. Pareto-optimal Batting Career

    This analysis outlined the highest batting average across a career at the highest strike rate. To provide a more accurate career report, batters were required to have played a minimum of 20 innings, which left 163 eligible batters.

Show the code:
BatCarPareto <- psel(filtBat,high(Average)*high(StrikeRate), top_level = 999) %>% 
      filter(Average > 20 | StrikeRate > 100)
  3. Pareto-optimal Bowling Innings

    This analysis outlined the most wickets taken in an innings at the lowest economy.

Show the code:
BowlInnPareto <- psel(iplbowl,high(W)*low(Econ),top_level = 999)
  4. Pareto-optimal Bowling Career

    This analysis outlined the lowest bowling average across a career at the lowest economy and lowest strike rate. To provide a more accurate career report, bowlers were required to have bowled in more than 20 innings, which left 145 eligible bowlers.

Show the code:
BowlCarPareto <- psel(filtBowl,low(Average)*low(StrikeRate)*low(Economy),top_level = 999) %>%
      filter(Average < 50)

The rPref package (Roocks, 2016) was used in R v 4.1.0 (R Core Team, 2019) to determine the Pareto frontiers using the psel function with the ‘top_level’ argument set to 999 to ensure every athlete was assigned to a frontier.
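
To illustrate what assigning every athlete to a frontier means, the base R sketch below peels off successive non-dominated sets: level 1 is the Pareto frontier, level 2 is the frontier of the points that remain once level 1 is removed, and so on. This is a hedged illustration of the idea only and is not rPref's implementation; the function name `pareto_levels` and the toy data are hypothetical, and all columns are assumed to be "lower is better".

```r
# Assign every row of a numeric matrix to a frontier level by repeatedly
# peeling off the non-dominated rows. All columns are treated as "lower is
# better"; negate a column first if higher values are better.
pareto_levels <- function(m) {
  m <- as.matrix(m)
  level <- rep(NA_integer_, nrow(m))
  current <- 1L
  while (anyNA(level)) {
    remaining <- which(is.na(level))
    # A row is dominated if some other remaining row is at least as good on
    # every metric and strictly better on at least one.
    is_dominated <- vapply(remaining, function(i) {
      any(vapply(remaining, function(j) {
        all(m[j, ] <= m[i, ]) && any(m[j, ] < m[i, ])
      }, logical(1)))
    }, logical(1))
    level[remaining[!is_dominated]] <- current
    current <- current + 1L
  }
  level
}

# Hypothetical bowlers (not IPL data)
toy <- data.frame(Average = c(20, 22, 18, 25, 19),
                  Economy = c(7.0, 7.5, 8.0, 8.5, 6.5))
pareto_levels(toy)  # 2 3 1 4 1
```

In this toy example the third and fifth bowlers form the first frontier, because neither is beaten on both metrics by any other bowler.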

A.4 Results

Pareto-optimal Batting Innings

Nine Pareto-optimal innings were identified, with extremities ranging from 6 runs off 1 ball (i.e., strike rate = 600) to 175 runs off 66 balls (i.e., strike rate = 265.15). The solution of 6 runs off 1 ball was attained eight times. The IPL batting innings Pareto frontier is displayed in Figure A.1 and the batters are listed in Table A.1.

Show the code:
ggplot(BatInnPareto, aes(x = R, y = SR)) +
    geom_point(alpha=0.1, size = 3) +
    geom_text(data = BatInnPareto %>% filter(.level == 1 & R != 6 & !(LastName %in% c("de Villiers","Pollard","Russell","Pathan","Miller","Gayle"))), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2)+
    geom_text(data = BatInnPareto %>% filter(.level == 1 & LastName == "de Villiers"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 15)+
    geom_text(data = BatInnPareto %>% filter(.level == 1 & LastName == "Pollard"), aes(label = LastName), color = "darkgreen", hjust = "left",nudge_x = 2,nudge_y = 5)+
    geom_text(data = BatInnPareto %>% filter(.level == 1 & LastName == "Russell"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 5)+
    geom_text(data = BatInnPareto %>% filter(.level == 1 & LastName == "Pathan"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 15)+
    geom_text(data = BatInnPareto %>% filter(.level == 1 & LastName == "Miller"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = 2,nudge_y = 7)+
    geom_text(data = BatInnPareto %>% filter(.level == 1 & LastName == "Gayle"), aes(label = LastName), color = "Red", hjust = "left",nudge_x = -7,nudge_y = 12)+
    geom_line(data = BatInnPareto %>% filter(.level == 1), alpha = 0.5, colour = "red")+
    theme_minimal() +
    annotate(geom = "text",x = 18,y = 600, label = "8 players",color = "red")+
    scale_y_continuous(breaks = seq(100,600,100))+
    coord_cartesian(xlim = c(0,180))+
    labs(x = "Runs Scored in an Innings",
         y = "Innings Batting Strike Rate") +
    theme(axis.title = element_text(size = 16),
          panel.grid.minor.y = element_blank(),
          axis.text = element_text(size = 16, color = "black"))

Figure A.1: Pareto-optimal batting within an innings with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were omitted if both the runs scored were below 50 and the strike rate was below 100.

Show the code:
BatInnPareto %>% 
  filter(.level == 1) %>% 
  mutate(SR = round(SR,2)) %>% 
  select(Batter,R,B,SR,Season,Match = Season.Match.No) %>% 
  arrange(-R) 
Table A.1: List of all Pareto-optimal IPL batting within an innings.
Batter           R (B)      Strike Rate   Match
Chris Gayle      175 (66)   265.15        IPL06 Match 31
David Miller     101 (38)   265.78        IPL06 Match 51
Yusuf Pathan     100 (37)   270.27        IPL03 Match 2
Suresh Raina     87 (25)    348.00        IPL07 Match 59
Andre Russell    48 (13)    369.23        IPL12 Match 17
AB de Villiers   41 (11)    372.72        IPL08 Match 16
Chris Morris     38 (9)     422.22        IPL10 Match 9
Krunal Pandya    20 (4)     500.00        IPL13 Match 17
Numerous         6 (1)      600.00        IPL04 Match 74 (1st occurrence)

Pareto-optimal Batting Career

Six Pareto-optimal batting careers were identified. Andre Russell recorded the highest career batting strike rate with 178.57 runs per 100 balls, while KL Rahul recorded the highest batting average with 47.43 runs per dismissal. The IPL batting career Pareto frontier is displayed in Figure A.2 and the batters are listed in Table A.2.

Show the code:
ggplot(BatCarPareto, aes(x = Average, y = StrikeRate)) +
    geom_point(data = BatCarPareto %>% filter(.level != 1), alpha=0.3, size = 3) +
    geom_point(data = BatCarPareto %>% filter(.level == 1), color = "red", size = 3)+
    geom_line(data = BatCarPareto %>% filter(.level == 1), color = "red")+
    geom_text(data = BatCarPareto %>% filter(.level == 1 & !(LastName %in% c("de Villiers", "Bairstow"))), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.5,nudge_y = 0)+
    geom_text(data = BatCarPareto %>% filter(.level == 1 & LastName == "Bairstow"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.5,nudge_y = 3)+
    geom_text(data = BatCarPareto %>% filter(.level == 1 & LastName == "de Villiers"), aes(label = LastName), color = "red", hjust = "left",nudge_x = 0.5,nudge_y = 1)+
    theme_minimal()  +
    scale_x_continuous(limits = c(2,49),breaks = seq(from = 0,to = 50,by = 10))+
    labs(x = "Career Batting Average",
         y = "Career Batting Strike Rate") +
    theme(axis.title = element_text(size = 16),
          legend.position = "none",
          axis.text = element_text(size = 16, color = "black"))

Figure A.2: Pareto-optimal batting across a career with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were filtered out if both their average was below 20 and their strike rate was below 100.

Show the code:
BatCarPareto %>% 
  filter(.level == 1) %>% 
  select(Batter,Innings,Average,`Strike Rate` = StrikeRate) %>% 
  mutate(across(c(Average, `Strike Rate`), ~ round(.x, 2))) %>% 
  arrange(-Average)
Table A.2: List of all Pareto-optimal IPL batting careers.
Batter           Innings   Average   Strike Rate
KL Rahul         85        47.43     136.38
David Warner     150       41.60     139.97
Jonny Bairstow   28        41.52     142.19
Chris Gayle      141       39.72     148.96
AB de Villiers   170       39.71     151.69
Andre Russell    70        29.31     178.57
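
As a quick check of Table A.2, the base R sketch below confirms that none of the six listed careers dominates another when both average and strike rate are to be maximised. The values are copied from the table; the helper `beats` is a hypothetical name used only for this illustration.

```r
careers <- data.frame(
  Batter = c("KL Rahul", "David Warner", "Jonny Bairstow",
             "Chris Gayle", "AB de Villiers", "Andre Russell"),
  Average = c(47.43, 41.60, 41.52, 39.72, 39.71, 29.31),
  StrikeRate = c(136.38, 139.97, 142.19, 148.96, 151.69, 178.57)
)

# TRUE when batter i is at least as good as batter j on both metrics and
# strictly better on at least one (higher is better for both).
beats <- function(i, j) {
  a <- unlist(careers[i, c("Average", "StrikeRate")])
  b <- unlist(careers[j, c("Average", "StrikeRate")])
  all(a >= b) && any(a > b)
}

n <- nrow(careers)
dominated <- vapply(seq_len(n), function(j) {
  any(vapply(seq_len(n), function(i) beats(i, j), logical(1)))
}, logical(1))

any(dominated)  # FALSE: each career is better than every other on one metric
```

Moving down the table, average falls while strike rate rises, which is the signature of a two-metric frontier: improving one metric requires giving up some of the other.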

Pareto-optimal Bowling Innings

Three Pareto-optimal bowling innings were identified: 2/0 by Suresh Raina, 5/5 by Anil Kumble, and 6/12 achieved by Alzarri Joseph. The IPL bowling innings Pareto frontier is displayed in Figure A.3 and the bowlers are listed in Table A.3.

Show the code:
ggplot(BowlInnPareto, aes(x = W, y = Econ)) +
  geom_jitter(data = BowlInnPareto %>% filter(.level != 1),aes(x = W, y = Econ), alpha=0.1, size = 3, width = 0.1) +
  geom_point(data = BowlInnPareto %>% filter(.level == 1), aes(x = W, y = Econ), alpha = 0.3, shape = 21, size = 3, fill = "red", color = "red") +
  geom_text(data = BowlInnPareto %>% filter(.level == 1), aes(x = W, y = Econ, label = LastName), size = 5, color = "Red", hjust = "left",nudge_x = -0.3,nudge_y = -1)+
  geom_line(data = BowlInnPareto %>% filter(.level == 1), aes(x = W, y = Econ), alpha = 0.5,colour = "red")+
  theme_minimal() +
  labs(x = "Wickets in an innings",
       y = "Innings Bowling Economy") +
    coord_cartesian(xlim = c(0,6.3))+
  theme(axis.title = element_text(size = 16),
        legend.position = "none",
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 16, color = "black"))

Figure A.3: Pareto-optimal bowling within an innings with the points on the Pareto frontier highlighted in red.

Show the code:
BowlInnPareto %>% 
  filter(.level == 1) %>% 
  select(Bowler, Overs = O, Wickets = W, Runs = R, Season, Match = Season.Match.No) %>% 
  arrange(Wickets)
Table A.3: List of all Pareto-optimal IPL bowling innings.
Bowler           Overs   Wickets   Runs   Match
Suresh Raina     0.3     2         0      IPL04 Match 52
Anil Kumble      3.1     5         5      IPL02 Match 2
Alzarri Joseph   3.4     6         12     IPL12 Match 19

Pareto-optimal Bowling Career

Five Pareto-optimal bowling careers were identified, with Doug Bollinger achieving the lowest average, Rashid Khan the lowest economy, and Kagiso Rabada the lowest strike rate. The IPL bowling career Pareto frontier is displayed in Figure A.4 and the bowlers are listed in Table A.4.

Show the code:
BowlCarPareto$color <- case_when(BowlCarPareto$.level == 1 ~ 2,
                                     BowlCarPareto$.level > 1 ~ 1)
BowlCarPareto$Label[BowlCarPareto$.level == 1] <- BowlCarPareto$LastName[BowlCarPareto$.level == 1]
BowlCarParetoPlot <- scatterplot3d(BowlCarPareto[c("Economy","Average","StrikeRate")], type = "h",pch = 16, color=BowlCarPareto$color,
                                  xlab="Career Bowling Economy",
                                  ylab="Career Bowling Average", # y is the second column passed (Average)
                                  zlab="Career Bowling Strike Rate") # z is the third column passed (StrikeRate)
zz.coords <- BowlCarParetoPlot$xyz.convert(BowlCarPareto$Economy, BowlCarPareto$Average, BowlCarPareto$StrikeRate) 
text(zz.coords$x, 
     zz.coords$y,             
     labels = BowlCarPareto$Label,               
     cex = .8, 
     pos = 2,
     col = "red")  

Figure A.4: Pareto-optimal bowling across a career with the points on the Pareto frontier highlighted in red.

Show the code:
BowlCarPareto %>% 
  filter(.level == 1) %>% 
  select(Bowler, Innings, Average, Economy, `Strike Rate` = StrikeRate) %>%
  mutate(across(c(Average:`Strike Rate`), ~ round(.x, 2))) %>% 
  arrange(Average)
Table A.4: List of all Pareto-optimal IPL bowling careers.
Bowler           Innings   Average   Economy   Strike Rate
Doug Bollinger   27        18.73     7.22      15.57
Kagiso Rabada    59        19.71     8.22      14.39
Lasith Malinga   122       19.79     7.14      16.63
Jofra Archer     35        21.33     7.13      17.93
Rashid Khan      86        21.46     6.40      20.12

A.5 Discussion

This study sought to use Pareto frontiers to visualise optimal Twenty20 cricket batting and bowling performances, both within an innings as well as across a career. By analysing performance multivariately, rather than simply analysing multiple variables univariately, players can be deemed optimal despite not being objectively highest in a single variable. When conflicting attributes are of equal interest, Pareto frontiers can view these variables in tandem as the expectations of an individual to attain the highest level in both attributes univariately may be unfeasible. All four Pareto frontiers contained at least one athlete that was not the highest ranked athlete in any metric when analysed univariately, and yet was deemed Pareto-optimal due to their balance in the metrics of interest.

The main advantage of Pareto frontiers highlighted in the present study is identifying athletes who are optimal across multiple metrics even when they are not the highest ranked in any metric. This was most evident for Chris Gayle who, when viewed univariately, has the 9th-highest career batting average (39.72), which is 7.71 runs per dismissal lower than the highest (Figure A.2). Similarly, he has the 14th-highest strike rate (148.96), which is 29.61 runs per 100 balls lower than the highest. However, when both metrics are considered and visualised simultaneously, he is one of the best batters across the 14 seasons of the IPL.

The present study also illustrated how Pareto frontiers can be used to visualise talent in more than two dimensions. For example, while Jofra Archer has the sixth-lowest bowling average, 14th-lowest economy, and 19th-lowest strike rate (see Figure A.4), he can be deemed a Pareto-optimal bowler as no other bowler supersedes him across all three metrics. While there will be some correlation between the three bowling metrics (i.e., average, economy, and strike rate) as the metrics are related (e.g., wickets taken is the denominator of both average and strike rate), visualising the third dimension is still necessary, as the reader would otherwise need to multiply the economy and strike rate values (and divide by six, the number of balls in an over) to understand where a bowler would sit on the average dimension.
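
The relationship between the three bowling metrics follows directly from the definitions in the Methods, and can be written out as a short derivation (with economy expressed in runs per six-ball over):

\[
\text{Average} = \frac{\text{Runs}}{\text{Wickets}} = \frac{\text{Runs}}{\text{Balls}} \times \frac{\text{Balls}}{\text{Wickets}} = \frac{\text{Economy}}{6} \times \text{Strike Rate}
\]

For example, using Doug Bollinger's values from Table A.4, \(\frac{7.22}{6} \times 15.57 \approx 18.74\), which agrees with the tabulated average of 18.73 up to rounding of the inputs.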

In the present study we chose to observe batting and bowling as purely independent roles within cricket; however, there are also avenues for Pareto frontiers to be established for all-rounders within cricket (i.e., players that are picked for both their batting and bowling ability). However, it should be noted that if an all-rounder Pareto frontier were to be established with both batting average and strike rate as well as bowling average, economy, and strike rate, the resulting five-dimensional outputs, while valid and executable, become increasingly difficult to interpret and visualise. To do such an analysis, a factor-reduction technique such as principal components analysis should be considered and the Pareto frontier could be built from the extracted components (e.g., batting and bowling).

While the present study is designed to be an introduction for sports scientists to the concept of Pareto frontiers, it should also be considered that there is some uncertainty surrounding each observation in the career Pareto frontiers due to the differing number of observations. For example, Jonny Bairstow is deemed Pareto-optimal as he is currently striking at 142.19 at an average of 41.52 after 28 innings; however, it is reasonable to assume that his place on the frontier is less certain than that of AB de Villiers, who has 170 observations. Therefore, future research could consider providing confidence or credible intervals around the probability that an individual lies on the Pareto frontier. It would then be feasible to calculate the probability that an individual sits on the first, second, or third frontier.
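
One possible way to approach this, offered here only as a sketch and not as a method used in the present study, is a nonparametric bootstrap: resample each batter's innings with replacement, recompute the career statistics, and record how often each batter lands on the frontier. The innings below are simulated for three hypothetical batters (they are not IPL data), and all function names are assumptions made for illustration.

```r
set.seed(42)  # for reproducibility of this illustration only

# Simulated innings for three hypothetical batters (not IPL data).
make_innings <- function(batter, n, mean_runs, mean_balls) {
  data.frame(
    Batter = batter,
    R = rpois(n, mean_runs),
    B = rpois(n, mean_balls) + 1L,   # at least one ball faced
    Dismissed = runif(n) < 0.8
  )
}
innings <- rbind(
  make_innings("A", 28, 30, 21),
  make_innings("B", 170, 28, 19),
  make_innings("C", 70, 18, 10)
)

# Career average and strike rate from a set of innings. The max(..., 1)
# guard (an assumption of this sketch) avoids dividing by zero dismissals.
career <- function(d) {
  c(Average = sum(d$R) / max(sum(d$Dismissed), 1),
    StrikeRate = sum(d$R) / sum(d$B) * 100)
}

# Logical vector: is each row of m non-dominated (higher is better)?
on_frontier <- function(m) {
  vapply(seq_len(nrow(m)), function(i) {
    !any(vapply(seq_len(nrow(m)), function(j) {
      all(m[j, ] >= m[i, ]) && any(m[j, ] > m[i, ])
    }, logical(1)))
  }, logical(1))
}

# Resample each batter's innings with replacement and recompute the frontier.
by_batter <- split(innings, innings$Batter)
n_boot <- 500
hits <- replicate(n_boot, {
  stats <- t(vapply(by_batter, function(d) {
    career(d[sample(nrow(d), replace = TRUE), ])
  }, numeric(2)))
  on_frontier(stats)
})

# Proportion of bootstrap replicates in which each batter is on the frontier
prob_on_frontier <- setNames(rowMeans(hits), names(by_batter))
prob_on_frontier
```

Under these assumptions, a batter with fewer innings would be expected to show more replicate-to-replicate variation in their career statistics, and therefore a less stable frontier membership, than a batter with many innings.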

While the present study used Twenty20 cricket to illustrate the power and usefulness of Pareto frontiers, the concept can be widely applied within sports science data sets, especially when the variables of interest are uncorrelated or negatively correlated. Pareto frontiers can still be established between two positively correlated metrics; however, it is likely that there will be fewer ‘hidden’ athletes on such a frontier, as the athletes who are high in one metric will naturally be high in the other. Future research should apply Pareto frontiers across other areas of sports performance analysis that have multi-faceted determinants, as there are many other possibilities within sport whereby Pareto frontiers can reveal athletes who possess the optimal balance of the metrics of interest.

A.6 Conclusion

With the proliferation of various physiological, mechanical, and skill-related attributes associated with performance, Pareto frontiers should be used within sports science to visualise multiple performance metrics. By analysing opposing data in tandem, more feasible expectations and benchmarks can be established to reveal talent that may have been missed when analysing multiple metrics univariately.