Appendix A — Pareto Frontiers for Multivariate Cricket Performance
A.1 Abstract
In Twenty20 cricket, there is a trade-off between batting average and strike rate, as well as between bowling strike rate, economy, and average. This study presents Pareto frontiers as a tool to identify athletes who possess an optimal ranking when multiple metrics are considered simultaneously. A total of 884 Twenty20 matches from the Indian Premier League were compiled to determine the best batting and bowling performances, both within a single innings and across each player’s career. Pareto frontiers identified nine optimal batting innings and six optimal batting careers. Pareto frontiers also identified three optimal bowling innings and five optimal bowling careers. Each frontier identified players who were not the highest-ranked athlete in any metric when analysed univariately. Pareto frontiers can be used when assessing talent across multiple metrics, especially when these metrics may be conflicting or uncorrelated. Pareto frontiers can identify athletes who may not have the highest ranking on a given metric but have an optimal balance across multiple metrics that are associated with success in a given sport.
The following chapter is a copy of the published manuscript:
Newans, T., Bellinger, P., & Minahan, C. Identifying multivariate cricket performance using Pareto frontiers. MathSport Conference 2022.
As co-author of the paper “Identifying multivariate cricket performance using Pareto frontiers”, I confirm that Timothy Newans has made the following contributions:
Study concept and design
Data collection
Data analysis and interpretation
Manuscript preparation
Name: Clare Minahan
Date: 29/03/2023
A.2 Introduction
The need to identify attributes to quantify optimal performance is evident for every sport (Johnston et al., 2018). With the exception of a few single-skill sports (Rienhoff et al., 2013), most athletes require a number of attributes to perform in their given sport. These attributes can encompass physical (Kelly & Williams, 2020), physiological (Dodd & Newans, 2018), mental (Morris, 2000), or skill-based characteristics (Davids et al., 2000), all of which can contribute to the performance of a player. Attributes such as speed, endurance, agility, strength, power, and accuracy are common across multiple sports (Davids et al., 2000), and each attribute can have multiple variables seeking to quantify it. As such, coaches and support staff are consistently looking for new variables that could be used either to quantify new attributes of interest or to better quantify already-identified attributes, with the hope that these new variables can identify previously hidden talent or interrogate subtle differences between athletes. However, as the number of attributes of interest increases, the likelihood that an athlete excels in every attribute decreases. Consequently, methods are required that can analyse multiple attributes simultaneously, rather than viewing each attribute in isolation.
While traditional statistical techniques in research focus on identifying the mean and standard deviation of a population (Hopkins et al., 2009), sports are typically not interested in the mean during talent identification processes; rather, they are looking for outliers. That is, coaches and support staff are looking for athletes who sit the furthest away from the mean in the direction in which success is defined. Therefore, when multiple attributes are of interest, athletes are typically selected by choosing those who sit the furthest away from the mean within each attribute. While this process can work when variables are positively correlated, it can miss talent when variables are negatively correlated. For instance, at the elite level, there is a negative correlation between maximal sprint speed and endurance capacity (Sánchez-García et al., 2018). However, running-based team sports require athletes to possess both speed and endurance to play at the elite level and, therefore, players necessarily need to trade off between optimal speed and optimal endurance. Put simply, if both speed and endurance were equally required for success, selecting the top-n sprinters and the top-n endurance runners may not yield the optimal athletes for that sport.
Consequently, both attributes need to be viewed in tandem. The process of optimising the balance of multiple attributes is termed ‘multi-objective optimisation’. Mathematically, it aims to find the data points that offer the best available balance of the attributes of interest. Taking every metric \(f_i\) as one to be minimised (a metric to be maximised can simply be negated), a data point \(\vec{x}_1 \in X\) is better than another data point \(\vec{x}_2 \in X\) if \(f_i(\vec{x}_1) \leq f_i(\vec{x}_2)\) for all metrics \(i \in \{1, 2, \ldots, k\}\) and \(f_j(\vec{x}_1) < f_j(\vec{x}_2)\) for at least one metric \(j \in \{1, 2, \ldots, k\}\). The points for which no better data point exists are deemed Pareto-optimal and form what is called the Pareto frontier.
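The comparison rule above can be sketched in a few lines of base R. This is a minimal illustration rather than the code used in the study: the helper names `dominates` and `pareto_optimal` and the four example points are hypothetical, and every metric is assumed to be one where lower is better.

```r
# Sketch of the comparison rule when every metric is to be minimised.
# `dominates` and `pareto_optimal` are hypothetical helper names.

# TRUE if x1 is no worse than x2 on every metric and strictly better on at least one
dominates <- function(x1, x2) {
  all(x1 <= x2) && any(x1 < x2)
}

# For a numeric matrix with one row per data point,
# flag the rows that no other row dominates
pareto_optimal <- function(m) {
  vapply(seq_len(nrow(m)), function(i) {
    !any(vapply(seq_len(nrow(m)), function(j) {
      j != i && dominates(m[j, ], m[i, ])
    }, logical(1)))
  }, logical(1))
}

# Hypothetical points measured on two metrics, both "lower is better"
pts <- rbind(a = c(1, 5), b = c(2, 3), c = c(4, 1), d = c(3, 4))
front <- pareto_optimal(pts)
names(front) <- rownames(pts)
print(front)
```

Point `d` is the only one flagged as not Pareto-optimal, because `b` is lower on both metrics; `a`, `b`, and `c` each win on at least one metric against every other point.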
In Twenty20 cricket, there are multiple facets within both batting and bowling that can define success. Unlike Test cricket and, to an extent, One-Day cricket, where scoring as many runs as possible regardless of how many deliveries are faced is of most importance, Twenty20 cricket requires batters to score faster (i.e., at a higher strike rate), which, in some cases, can come at the expense of preserving their wicket, and requires bowlers to concede minimal runs. Therefore, there is a trade-off between batting average and strike rate, as well as between bowling economy, average, and strike rate, within Twenty20 cricket. For example, early in an innings the risk-return of attempting to hit six runs off a ball is markedly different from that in the final over of an innings. Similarly, a bowler needs to balance taking wickets against conceding as few runs as possible. For instance, when bowling four overs, it is difficult to determine whether taking three wickets for 50 runs is worth more than taking no wickets but conceding only eight runs, as the three wickets may not have been worth conceding 50 runs. As all attributes within each domain are of interest, Pareto frontiers can be used to identify batters and bowlers who may not rank highest in any single variable but display an optimal balance across them. When assessing the quality of players, it is therefore necessary to use tools that can analyse these data sets without favouring one metric over another. Accordingly, the present study aimed to use Pareto frontiers to identify the best-performing Twenty20 batters and bowlers.
A.3 Methods
The present study comprised all 884 matches of the first 14 editions of the men’s Indian Premier League (IPL), India’s domestic T20 cricket competition. The data set contained 566 batters and 467 bowlers. Collectively, there were 13,357 individual batting innings with observations ranging from 1-208 innings per batter, while there were 10,925 individual bowling innings with observations ranging from 1-180 innings per bowler.
To summarise the data, two summary statistics were generated for batting and three summary statistics were generated for bowling. The summary statistics were as follows:
Batting Average: runs scored divided by frequency of dismissal
Batting Strike Rate: runs scored divided by balls faced, multiplied by 100 (i.e., runs per 100 balls)
Bowling Average: runs conceded divided by wickets taken
Bowling Economy: runs conceded divided by overs bowled
Bowling Strike Rate: balls bowled divided by wickets taken
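As a minimal sketch of how these summary statistics could be computed from career totals, the base R snippet below uses invented figures for two batters and two bowlers (none of the numbers come from the IPL data, and the column names are hypothetical). It assumes batting strike rate is expressed per 100 balls, as reported in the results, and that economy is runs conceded per six-ball over.

```r
# Hypothetical career totals; none of these numbers come from the IPL data
batting <- data.frame(
  batter      = c("Batter A", "Batter B"),
  runs        = c(1200, 800),
  balls_faced = c(900, 500),
  dismissals  = c(30, 32)
)
batting$average     <- batting$runs / batting$dismissals         # runs per dismissal
batting$strike_rate <- 100 * batting$runs / batting$balls_faced  # runs per 100 balls

bowling <- data.frame(
  bowler        = c("Bowler A", "Bowler B"),
  runs_conceded = c(600, 450),
  balls_bowled  = c(480, 300),
  wickets       = c(30, 15)
)
bowling$average     <- bowling$runs_conceded / bowling$wickets             # runs per wicket
bowling$economy     <- bowling$runs_conceded / (bowling$balls_bowled / 6)  # runs per over (assumes six-ball overs)
bowling$strike_rate <- bowling$balls_bowled / bowling$wickets              # balls per wicket

print(batting)
print(bowling)
```

For batting, higher is better on both statistics; for bowling, lower is better on all three, which determines the direction of each preference when the frontiers are built.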
The batting career analysis identified the highest batting average achieved at the highest strike rate across a career. To provide a more accurate career report, batters were required to have played a minimum of 20 innings, which left 163 eligible batters.
The bowling career analysis identified the lowest bowling average achieved at the lowest economy and lowest strike rate across a career. To provide a more accurate career report, bowlers were required to have bowled in more than 20 matches, which left 145 eligible bowlers.
The rPref package (Roocks, 2016) was used in R v 4.1.0 (R Core Team, 2019) to determine the Pareto frontiers using the psel function with the ‘top_level’ argument set to 999 to ensure every athlete was assigned to a frontier.
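To illustrate what it means for every athlete to be assigned to a frontier, the sketch below peels off successive frontiers in base R: the non-dominated rows form level 1, they are removed, the non-dominated rows among the remainder form level 2, and so on. It is not the rPref implementation; `pareto_levels`, the column names, and the five example careers are hypothetical, both metrics are assumed to be "higher is better", and no missing values are assumed.

```r
# Base-R sketch of assigning every row to a frontier level by repeated peeling.
# `pareto_levels` is a hypothetical helper, not part of rPref.
pareto_levels <- function(m) {
  level <- integer(nrow(m))
  remaining <- seq_len(nrow(m))
  current <- 1L
  while (length(remaining) > 0) {
    sub <- m[remaining, , drop = FALSE]
    # A row is on the current frontier if no remaining row is at least as
    # high on every metric and strictly higher on at least one
    on_front <- vapply(seq_len(nrow(sub)), function(i) {
      !any(vapply(seq_len(nrow(sub)), function(j) {
        all(sub[j, ] >= sub[i, ]) && any(sub[j, ] > sub[i, ])
      }, logical(1)))
    }, logical(1))
    level[remaining[on_front]] <- current
    remaining <- remaining[!on_front]
    current <- current + 1L
  }
  level
}

# Hypothetical batting careers (invented values)
careers <- data.frame(
  batter      = c("P", "Q", "R", "S", "T"),
  average     = c(45, 30, 38, 28, 20),
  strike_rate = c(130, 170, 150, 140, 120)
)
careers$level <- pareto_levels(as.matrix(careers[, c("average", "strike_rate")]))
print(careers)
```

Here P, Q, and R form the first frontier, S the second, and T the third. Setting a very large cap on the number of levels, as described above, serves the same purpose of ensuring that no athlete is left without a level.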
A.4 Results
Pareto-optimal Batting Innings
Nine Pareto-optimal innings were identified with extremities ranging from 6 runs off 1 ball (i.e., strike rate = 600) to 175 off 66 balls (i.e., strike rate = 265.15). Additionally, the solution of 6 runs off 1 ball has been attained eight times. The IPL batting innings Pareto frontier is displayed in Figure A.1 and the batters are listed in Table A.1.
Figure A.1: Pareto-optimal batting within an innings with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were omitted if runs scored was below 50 and strike rate was below 100.
Table A.1: List of all Pareto-optimal IPL batting within an innings.
| Batter | R (B) | Strike Rate | Match |
| --- | --- | --- | --- |
| Chris Gayle | 175 (66) | 265.15 | IPL06 Match 31 |
| David Miller | 101 (38) | 265.78 | IPL06 Match 51 |
| Yusuf Pathan | 100 (37) | 270.27 | IPL03 Match 2 |
| Suresh Raina | 87 (25) | 348.00 | IPL07 Match 59 |
| Andre Russell | 48 (13) | 369.23 | IPL12 Match 17 |
| AB de Villiers | 41 (11) | 372.72 | IPL08 Match 16 |
| Chris Morris | 38 (9) | 422.22 | IPL10 Match 9 |
| Krunal Pandya | 20 (4) | 500.00 | IPL13 Match 17 |
| Numerous | 6 (1) | 600.00 | IPL04 Match 74 (1st occurrence) |
Pareto-optimal Batting Career
Six Pareto-optimal batting careers were identified. Andre Russell recorded the highest career batting strike rate with 178.57 runs per 100 balls, while KL Rahul recorded the highest batting average with 47.43 runs per dismissal. The IPL batting career Pareto frontier is displayed in Figure A.2 and the batters are listed in Table A.2.
Figure A.2: Pareto-optimal batting across a career with the Pareto frontier highlighted in red. N.B. For illustrative purposes, points were omitted if average was below 20 and strike rate was below 100.
Table A.2: List of all Pareto-optimal IPL batting careers.
| Batter | Innings | Average | Strike Rate |
| --- | --- | --- | --- |
| KL Rahul | 85 | 47.43 | 136.38 |
| David Warner | 150 | 41.60 | 139.97 |
| Jonny Bairstow | 28 | 41.52 | 142.19 |
| Chris Gayle | 141 | 39.72 | 148.96 |
| AB de Villiers | 170 | 39.71 | 151.69 |
| Andre Russell | 70 | 29.31 | 178.57 |
Pareto-optimal Bowling Innings
Three Pareto-optimal bowling innings were identified: 2/0 by Suresh Raina, 5/5 by Anil Kumble, and 6/12 by Alzarri Joseph. The IPL bowling innings Pareto frontier is displayed in Figure A.3 and the bowlers are listed in Table A.3.
```r
library(ggplot2)
library(dplyr)

# BowlInnPareto: one row per bowling innings, with its frontier level in .level
ggplot(BowlInnPareto, aes(x = W, y = Econ)) +
  # Innings off the first frontier: faint, jittered points
  geom_jitter(data = BowlInnPareto %>% filter(.level != 1),
              aes(x = W, y = Econ), alpha = 0.1, size = 3, width = 0.1) +
  # Innings on the first frontier: red points, labels, and a connecting line
  geom_point(data = BowlInnPareto %>% filter(.level == 1),
             aes(x = W, y = Econ), alpha = 0.3, shape = 21, size = 3,
             fill = "red", color = "red") +
  geom_text(data = BowlInnPareto %>% filter(.level == 1),
            aes(x = W, y = Econ, label = LastName), size = 5, color = "red",
            hjust = "left", nudge_x = -0.3, nudge_y = -1) +
  geom_line(data = BowlInnPareto %>% filter(.level == 1),
            aes(x = W, y = Econ), alpha = 0.5, colour = "red") +
  theme_minimal() +
  labs(x = "Wickets in an innings", y = "Innings Bowling Economy") +
  coord_cartesian(xlim = c(0, 6.3)) +
  theme(axis.title = element_text(size = 16),
        legend.position = "none",
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 16, color = "black"))
```
Figure A.3: Pareto-optimal bowling within an innings with the points on the Pareto frontier highlighted in red.
```r
library(dplyr)

# List the first-frontier bowling innings, ordered by wickets taken
BowlInnPareto %>%
  filter(.level == 1) %>%
  select(Bowler, Overs = O, Wickets = W, Runs = R, Season, Match = Season.Match.No) %>%
  arrange(Wickets)
```
Table A.3: List of all Pareto-optimal IPL bowling innings.
| Bowler | Overs | Wickets | Runs | Match |
| --- | --- | --- | --- | --- |
| Suresh Raina | 0.3 | 2 | 0 | IPL04 Match 52 |
| Anil Kumble | 3.1 | 5 | 5 | IPL02 Match 2 |
| Alzarri Joseph | 3.4 | 6 | 12 | IPL12 Match 19 |
Pareto-optimal Bowling Career
Five Pareto-optimal bowling careers were identified, with Doug Bollinger achieving the lowest average, Rashid Khan the lowest economy, and Kagiso Rabada the lowest strike rate. The IPL bowling career Pareto frontier is displayed in Figure A.4 and the bowlers are listed in Table A.4.
Table A.4: List of all Pareto-optimal IPL bowling careers.
| Bowler | Innings | Average | Economy | Strike Rate |
| --- | --- | --- | --- | --- |
| Doug Bollinger | 27 | 18.73 | 7.22 | 15.57 |
| Kagiso Rabada | 59 | 19.71 | 8.22 | 14.39 |
| Lasith Malinga | 122 | 19.79 | 7.14 | 16.63 |
| Jofra Archer | 35 | 21.33 | 7.13 | 17.93 |
| Rashid Khan | 86 | 21.46 | 6.40 | 20.12 |
A.5 Discussion
This study sought to use Pareto frontiers to visualise optimal Twenty20 cricket batting and bowling performances, both within an innings as well as across a career. By analysing performance multivariately, rather than simply analysing multiple variables univariately, players can be deemed optimal despite not being objectively highest in a single variable. When conflicting attributes are of equal interest, Pareto frontiers can view these variables in tandem as the expectations of an individual to attain the highest level in both attributes univariately may be unfeasible. All four Pareto frontiers contained at least one athlete that was not the highest ranked athlete in any metric when analysed univariately, and yet was deemed Pareto-optimal due to their balance in the metrics of interest.
The main advantage of Pareto frontiers highlighted in the present study is identifying athletes who are optimal across multiple metrics even when they are not the highest ranked in any metric. This was most evident for Chris Gayle who, when viewed univariately, has the 9th-highest career batting average (39.72), which is 7.71 runs per dismissal lower than the highest (Figure A.2). Similarly, he has the 14th-highest strike rate, striking at 148.96, which is 29.61 runs per 100 balls lower than the highest. However, when both metrics are considered and visualised simultaneously, he is one of the best batters across the 14 seasons of the IPL.
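The Chris Gayle example can be checked against the six careers listed in Table A.2. The base R sketch below uses only those six rows, so it is a partial illustration: the full set of 163 eligible batters is not reproduced here, and the variable names are hypothetical.

```r
# Averages and strike rates as listed in Table A.2 (frontier batters only)
careers <- data.frame(
  batter      = c("KL Rahul", "David Warner", "Jonny Bairstow",
                  "Chris Gayle", "AB de Villiers", "Andre Russell"),
  average     = c(47.43, 41.60, 41.52, 39.72, 39.71, 29.31),
  strike_rate = c(136.38, 139.97, 142.19, 148.96, 151.69, 178.57)
)

gayle  <- careers[careers$batter == "Chris Gayle", ]
others <- careers[careers$batter != "Chris Gayle", ]

# Gayle tops neither column, even within this six-row table
tops_average     <- gayle$average == max(careers$average)
tops_strike_rate <- gayle$strike_rate == max(careers$strike_rate)

# A batter would dominate Gayle by matching or beating him on both metrics
# and strictly beating him on at least one
dominated_by <- others$batter[
  others$average >= gayle$average &
    others$strike_rate >= gayle$strike_rate &
    (others$average > gayle$average | others$strike_rate > gayle$strike_rate)
]

print(c(tops_average = tops_average, tops_strike_rate = tops_strike_rate))
print(dominated_by)
```

Both flags are `FALSE` and `dominated_by` is empty: every listed batter with a higher average has a lower strike rate than Gayle, and every listed batter with a higher strike rate has a lower average.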
The present study also illustrated how Pareto frontiers can be used to visualise talent in more than two dimensions. For example, while Jofra Archer has the sixth-lowest bowling average, 14th-lowest economy, and 19th-lowest strike rate (see Figure A.4), he can be deemed a Pareto-optimal bowler as no other bowler supersedes him across all three metrics. While there will be some correlation between the three bowling metrics (i.e., average, economy, and strike rate) as the metrics are related (e.g., wickets taken is the denominator of both average and strike rate), visualising the third dimension is still necessary, as the reader would otherwise need to multiply the x and y values to understand where a bowler sits in the third dimension.
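The relationship between the three bowling metrics can be written out explicitly. Assuming economy is runs conceded per six-ball over and strike rate is balls bowled per wicket (the six-ball over is the only ingredient not stated in the text), the identity is:

\[
\text{Average} = \frac{\text{runs conceded}}{\text{wickets}} = \frac{\text{runs conceded}}{\text{overs}} \times \frac{\text{balls}}{\text{wickets}} \times \frac{\text{overs}}{\text{balls}} = \frac{\text{Economy} \times \text{Strike Rate}}{6}
\]

As a check against Table A.4, Rashid Khan's economy of 6.40 and strike rate of 20.12 give \(6.40 \times 20.12 / 6 \approx 21.46\), which matches his listed average. Small discrepancies for other bowlers would be expected, because the tabulated values are rounded to two decimal places.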
In the present study we chose to observe batting and bowling as purely independent roles within cricket; however, there are also avenues for Pareto frontiers to be established for all-rounders (i.e., players who are picked for both their batting and bowling ability). It should be noted that if an all-rounder Pareto frontier were to be established with both batting average and strike rate as well as bowling average, economy, and strike rate, the resulting five-dimensional outputs, while valid and executable, would become increasingly difficult to interpret and visualise. For such an analysis, a factor-reduction technique such as principal components analysis should be considered, and the Pareto frontier could be built from the extracted components (e.g., batting and bowling).
While the present study is designed to be an introduction for sports scientists to the concept of Pareto frontiers, it should also be considered that there is some level of uncertainty surrounding each observation in the career Pareto frontiers due to the differing number of observations. For example, Jonny Bairstow is deemed Pareto-optimal as he is currently striking at 142.19 at an average of 41.52 after 28 innings; however, it is reasonable to assume that his position on the frontier is more uncertain than that of AB de Villiers, who has 170 observations. Therefore, future research could consider providing confidence or credible intervals around the probability that an individual lies on the Pareto frontier. The probability that an individual sits on the first, second, or third frontier could then be calculated.
While the present study used Twenty20 cricket to illustrate the power and usefulness of Pareto frontiers, the concept can be widely applied within sports science data sets, especially when the variables of interest are uncorrelated or negatively correlated. Pareto frontiers can still be established between two positively correlated metrics; however, it is likely that there will be fewer ‘hidden’ athletes on such a frontier, as athletes who are high in one metric will naturally tend to be high in the other. Future research should apply Pareto frontiers across other areas of sports performance analysis that have multi-faceted determinants, as there are many other settings within sport in which Pareto frontiers can reveal athletes who possess the optimal balance of the metrics of interest.
A.6 Conclusion
With the proliferation of various physiological, mechanical, and skill-related attributes associated with performance, Pareto frontiers should be used within sports science to visualise multiple performance metrics. By analysing opposing data in tandem, more feasible expectations and benchmarks can be established to reveal talent that may have been missed when analysing multiple metrics univariately.
Davids, K., Lees, A., & Burwitz, L. (2000). Understanding and measuring coordination and control in kicking skills in soccer: Implications for talent identification and skill acquisition. Journal of Sports Sciences, 18(9), 703–714. https://doi.org/10.1080/02640410050120087
Dodd, K., & Newans, T. (2018). Talent identification for soccer: Physiological aspects. Journal of Science and Medicine in Sport, 21(10), 1073–1078. https://doi.org/10.1016/j.jsams.2018.01.009
Hopkins, W., Marshall, S., Batterham, A., & Hanin, J. (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine & Science in Sports & Exercise, 41(1), 3–12. https://doi.org/10.1249/MSS.0b013e31818cb278
Johnston, K., Wattie, N., Schorer, J., & Baker, J. (2018). Talent identification in sport: A systematic review. Sports Medicine, 48(1), 97–109. https://doi.org/10.1007/s40279-017-0803-2
Kelly, A., & Williams, C. (2020). Physical characteristics and the talent identification and development processes in male youth soccer: A narrative review. Strength & Conditioning Journal, 42(6), 15–34. https://doi.org/10.1519/ssc.0000000000000576
Morris, T. (2000). Psychological characteristics and talent identification in soccer. Journal of Sports Sciences, 18(9), 715–726. https://doi.org/10.1080/02640410050120096
R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Rienhoff, R., Hopwood, M., Fischer, L., Strauss, B., Baker, J., & Schorer, J. (2013). Transfer of motor and perceptual skills from basketball to darts. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00593
Roocks, P. (2016). Computing Pareto frontiers and database preferences with the rPref package. The R Journal, 8(2), 393–404. https://doi.org/10.32614/RJ-2016-054
Sánchez-García, M., Sánchez-Sánchez, J., Rodríguez-Fernández, A., Solano, D., & Castillo, D. (2018). Relationships between sprint ability and endurance capacity in soccer referees. Sports, 6(2), 28. https://doi.org/10.3390/sports6020028