
Using PITCHf/x to model the dependence of strikeout rate on the predictability of pitch sequences


We develop a model for pitch sequencing in baseball that is defined by pitch-to-pitch correlation in location, velocity, and movement. The correlations quantify the average similarity of consecutive pitches and provide a measure of the batter’s ability to predict the properties of the upcoming pitch. We examine the characteristics of the model for a set of major league pitchers using PITCHf/x data for nearly three million pitches thrown over seven major league seasons. After partitioning the data according to batter handedness, we show that a pitcher’s correlations for velocity and movement are persistent from year-to-year. We also show that pitch-to-pitch correlations are significant in a model for pitcher strikeout rate and that a higher correlation, other factors being equal, is predictive of fewer strikeouts. This finding is consistent with experiments showing that swing errors by experienced batters tend to increase as the differences between the properties of consecutive pitches increase. We provide examples that demonstrate the role of pitch-to-pitch correlation in the strikeout rate model.


The act of hitting a pitch in major league baseball places extraordinary demands on the batter’s visuomotor system. A typical fastball travels from the pitcher to the batter in about 400 milliseconds and, in order to initiate a swing on time, the batter must estimate the time and location for contact during the first 150–200 milliseconds of the pitch trajectory (Gray, 2002a). Even small errors in time or space will lead to failure and, as a result, major league batters frequently swing and miss. Pitchers endeavor to make the batter’s task even more challenging by throwing pitches with a wide variety of characteristics. A pitcher can change speeds by mixing fastballs and off-speed pitches. He can also change the batter’s eye level by mixing pitches that are high and low or disrupt the batter’s balance by locating pitches inside and outside. In addition to changing speed and location, pitchers can impart different spins on the ball which alters its trajectory. Given the difficulty of the hitting task, batters can benefit from being able to predict the characteristics of an upcoming pitch. Consequently, the pitcher’s team expends significant effort involving, for example, elaborate sign sequences and players covering their mouths when discussing strategy to keep the parameters of the next pitch a secret.

A batter’s success in predicting and reacting to the characteristics of a pitch depends on the distribution of pitches that might be thrown. Experiments with experienced batters on simulated pitches have shown, for example, that contact rates improve significantly when pitches are limited to two speeds rather than drawn from a wide range of speeds (Gray, 2002a). Another study has shown that major league strikeout rates increase as a pitcher’s number of distinct pitch types increases (Arthur, 2014a). An important question for the pitcher’s team is the optimal distribution of pitches that should be utilized. This distribution depends on a number of variables including the relative quality of the pitcher’s array of pitches, the batter’s strengths and weaknesses, the count, the score, the inning, the number of outs, the baserunners, and the identity of the following batters. Researchers (Gassko, 2010) (Tango et al., 2007) have proposed the use of game theory to derive an optimized distribution of pitches for a given situation.

In addition to striving for an optimized pitch distribution, a pitcher can also seek advantage by adjusting his sequence of pitches. Gray (2002a, 2002b) performed experiments with college batters to show that the average spatial and temporal error in a batter’s swing for a given pitch has a significant dependence on the speed of the preceding pitches. In particular, the errors are larger when there is a significant difference between the speed of the current pitch and the speed of the prior pitches. Several studies of major league matchups have also demonstrated that pitchers benefit from varying speed and movement from pitch-to-pitch. Bonney (2015) showed that pitchers achieve better results when they reduce velocity by at least five miles per hour after a first-pitch fastball. Glaser (2010) showed that, on average, following an off-speed pitch with the same off-speed pitch is not a good choice. Roegele (2014) showed that pitchers benefit when consecutive pitches have different movement after following the same tunnel (Long et al., 2017) during the first part of their path to the batter. These results are consistent with the strategy of using setup pitches (Greenhouse, 2010) which aim to enhance the probability of a pitcher’s success on a subsequent pitch.

More than half of the pitches thrown in major league baseball in 2014 were some variant of a fastball. The popularity of the fastball stems from its high velocity which limits a batter’s reaction time and from the ability of most major league pitchers to control the fastball with more accuracy than their offspeed pitches. Several studies (Arthur, 2014a) (Cameron, 2009) have found a positive correlation between a pitcher’s fastball velocity and his strikeout rate and we might reasonably expect that pitchers also benefit from fastball movement. Thus, the properties of a pitcher’s fastball will play an important role in determining his strikeout rate. In addition to the intrinsic characteristics of his fastball and other pitches, the previous discussion suggests that distribution and sequencing will also play a role in a pitcher’s success. Lichtman (2013) has shown, for example, that pitchers who throw a high fraction of fastballs suffer a larger decline in performance when they face batters for the second or third time in a game as compared to other pitchers.

In this paper we introduce a set of pitch-to-pitch correlation measures for a pitcher which quantify his tendency to throw consecutive pitches with similar properties and which quantify the degree to which the location, velocity, and movement of his next pitch can be predicted from the characteristics of the previous pitch. Since these measures are derived from estimates of continuous-valued variables, they avoid the loss of information that occurs during classification in methods that analyze pitch type sequences (Arthur, 2014b) (Weinstein, 2015). The utility of the correlation measures is investigated using PITCHf/x (Fast, 2010) parameter estimates for nearly three million pitches thrown by a set of pitchers over the years from 2008 to 2014. Since the handedness (left or right) of the batter and pitcher plays an important role in pitch selection, the pitch-to-pitch correlations are computed separately for each applicable platoon configuration (LHP vs LHB, LHP vs RHB, RHP vs LHB, RHP vs RHB) for each pitcher and year. We show that these pitcher descriptors are repeatable from year-to-year and that the measures derived from velocity and movement provide more year-to-year consistency than the measures derived from location. We also evaluate the use of the correlation measures as explanatory variables within a model for pitcher strikeout rate. The model reveals that, as expected, a pitcher’s strikeout rate increases as his fastball velocity and vertical movement increase. The model also shows that, other factors equal, a pitcher’s strikeout rate decreases as his fastball fraction and pitch-to-pitch correlation increase. We use the example of James Shields and Bartolo Colon to demonstrate the dependence of strikeout rate on these measures of predictability.

2 PITCHf/x data

PITCHf/x is a system that uses two cameras to capture a set of images of pitches thrown in baseball games (Fast, 2010). The system was developed by Sportvision and was available in all thirty major league stadiums at the start of the 2008 season. The PITCHf/x images can be used to estimate the three-dimensional path of a pitch and to derive information about its speed and movement. Pitch information is publicly distributed in real-time by Major League Baseball Advanced Media (MLBAM) using the GameDay application.

Our analysis of PITCHf/x data considers several of the reported attributes for each pitch. The pair (px, pz) specifies the location of a pitch as it crosses home plate where px is the horizontal coordinate and pz is the vertical coordinate relative to an origin at the back vertex of home plate. The positive x-axis points to the right from the catcher’s perspective, the positive y-axis points toward second base, and the positive z-axis points up. The coordinates px and pz are typically reported in feet. The movement of a pitch (pfx_x, pfx_z) is defined as the difference between the pitch location (px, pz) and the theoretical location of a pitch thrown at the same speed that does not deviate from a straight path due to spin (Nathan, 2012). The movement parameters pfx_x and pfx_z are typically reported in inches. The start-speed is an estimate of pitch speed in three dimensions near the release point in miles per hour. Brooks Baseball improves the accuracy of the MLBAM reported values by making small adjustments to the calculations.

Different pitch types have different characteristics. For a right-handed major league pitcher, for example, a four-seam fastball typically has a start-speed above 90 miles per hour with a negative pfx_x and a positive pfx_z while a curveball typically has a start-speed below 80 miles per hour with a positive pfx_x and a negative pfx_z. For a left-handed pitcher, the sign of pfx_x will reverse for these pitch types. In addition to the measured parameters, MLBAM also assigns a label to each pitch such as FF for a four-seam fastball or CU for a curveball.

3 Pitch-to-pitch correlation


We use the measurements described in section 2 to define a vector of descriptors for a pitcher that quantifies the relationship between consecutive pitches in location, movement, and velocity. For a given pitcher in a given season, say Clayton Kershaw in 2014, we consider separately the pitches thrown to left-handed and right-handed batters after intentional balls are removed. Let $(x_i, x_i')$, $i = 1, 2, \ldots, N$, represent all pairs of consecutive pitches that Kershaw threw to a left-handed batter within a single plate appearance in 2014, where $x_i$ is the px coordinate of the first pitch in the pair and $x_i'$ is the px coordinate of the second pitch in the pair. We note that a pitch can appear as the second of a pair and then as the first of the next pair, so that a four-pitch plate appearance will have three pairs. N is the total number of these pairs. Kershaw’s px correlation coefficient rx for consecutive pitches against left-handed batters in 2014 is defined by

$$r_x = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(x_i' - \bar{x}')}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (x_i' - \bar{x}')^2}} \qquad (1)$$

where $\bar{x}$ and $\bar{x}'$ represent the means of the $x_i$ and $x_i'$ values respectively over the N pairs. The correlation coefficient rx, therefore, provides a statistical measure of the relationship between consecutive px values over the N pairs of pitches.

We can also let $(x_i, x_i')$ represent the pz coordinates for pairs of pitches to compute Kershaw’s pz correlation coefficient rz for consecutive pitches against left-handed batters in 2014 using (1). Similarly, we can use pfx_x, pfx_z, and start-speed to compute correlation coefficients for each of these variables which we denote respectively by rmx, rmz, and rs. Thus, for a given pitcher, season, and batter handedness such as Kershaw in 2014 against left-handed batters, we can compute a vector of five correlation coefficients (rx, rz, rmx, rmz, rs) which represents the pitcher’s degree of pitch-to-pitch consistency in location, movement, and velocity.
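The pairing rule and the correlation coefficient can be sketched in a few lines of Python. This is a minimal illustration rather than the processing actually used for the study: the function names and the sample px values are made up, and the data layout (one list of attribute values per plate appearance) is an assumption.

```python
from math import sqrt


def consecutive_pairs(plate_appearances):
    """Return (first, second) value pairs for consecutive pitches.

    Each plate appearance is a list of values of one attribute (for example
    px) in the order the pitches were thrown. Pairs never span two plate
    appearances, so a four-pitch plate appearance yields three pairs.
    """
    pairs = []
    for pitches in plate_appearances:
        pairs.extend(zip(pitches[:-1], pitches[1:]))
    return pairs


def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (first, second) pairs."""
    n = len(pairs)
    mean_first = sum(a for a, _ in pairs) / n
    mean_second = sum(b for _, b in pairs) / n
    cov = sum((a - mean_first) * (b - mean_second) for a, b in pairs)
    var_first = sum((a - mean_first) ** 2 for a, _ in pairs)
    var_second = sum((b - mean_second) ** 2 for _, b in pairs)
    return cov / sqrt(var_first * var_second)


# Made-up px values (feet) for three plate appearances, for illustration only.
px_by_plate_appearance = [
    [0.4, -0.2, 0.5, 0.1],
    [-0.6, 0.3],
    [0.2, 0.1, -0.4],
]
pairs = consecutive_pairs(px_by_plate_appearance)
r_x = pearson_r(pairs)
print(len(pairs), round(r_x, 3))
```

The same two functions would apply unchanged to pz, pfx_x, pfx_z, and start-speed, giving the five-element vector for one pitcher, season, and batter handedness.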

A correlation coefficient r has several important properties. The value of r is always between -1 and +1 with the sign of r being the same as the sign of the slope of the regression line for the set of N points $(x_i, x_i')$. The absolute value |r| measures the strength of the linear relationship between $x_i$ and $x_i'$. If |r| = 1, then the set of points $(x_i, x_i')$ lie exactly on a line and the $x_i'$ of the second pitch in each pair can be exactly predicted using the $x_i$ of the preceding pitch. As |r| decreases toward zero, the ability to predict $x_i'$ from $x_i$ using a linear model decreases. More precisely, the square of the correlation coefficient $r^2$ is the fraction of the variance in the second pitch $x_i'$ that is accounted for by a linear model and the $x_i$ value for the first pitch. We might expect that a pitcher with smaller values of |r| for a given pitch attribute, everything else being equal, would be more effective due to the increased uncertainty that results from using the current pitch to predict the value of that attribute for the next pitch.
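As a rough worked illustration of the variance interpretation, consider two start-speed values reported in Table 2 for RHP versus RHB: the mean over pitcher seasons and the maximum (B. Colon in 2013). Squaring each gives

$$r_s = 0.072066 \;\Rightarrow\; r_s^2 \approx 0.005, \qquad r_s = 0.515570 \;\Rightarrow\; r_s^2 \approx 0.266$$

Under a linear model, the start-speed of the previous pitch would account for about half a percent of the variance in the start-speed of the next pitch for a pitcher whose rs equals the mean, and for about 27 percent in the maximum case.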


In section 4 we will examine the dependence of a pitcher’s strikeout rate on the correlation coefficients defined in section 3.1. As with the correlation coefficients, we compute each pitcher’s strikeout rate separately for each applicable platoon configuration for each year. Before the rate is computed, however, we remove all plate appearances that resulted in a bunt or an intentional walk and we also remove all plate appearances with a pitcher as a batter. The number of remaining plate appearances is referred to as adjusted plate appearances. A pitcher’s strikeout rate PK is then defined as the ratio of strikeouts to adjusted plate appearances. Using considerations (Healey, 2015) that were derived from reliability studies (Carleton, 2013), we consider a pitcher’s strikeout rate to be reliable for a season and platoon configuration if the pitcher had at least 150 adjusted plate appearances for that season and platoon configuration. For this study, we also removed all pitchers that were used strictly as relievers during a season as well as all pitchers who had at least twenty percent of their pitches classified as knuckleballs. Table 1 summarizes the total number of pitcher seasons that satisfy these criteria for each of the four platoon configurations over the years from 2008 to 2014 for which PITCHf/x data was widely available.
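The filtering and the reliability threshold described above can be sketched as follows. The record layout and the outcome labels ("strikeout", "bunt", "intentional_walk") are hypothetical and chosen only for illustration; the 150 adjusted plate appearance threshold comes from the text.

```python
from dataclasses import dataclass

MIN_ADJUSTED_PA = 150  # reliability threshold stated in the text


@dataclass
class PlateAppearance:
    outcome: str             # hypothetical label, e.g. "strikeout" or "bunt"
    batter_is_pitcher: bool  # True when the batter is himself a pitcher


def adjusted_plate_appearances(plate_appearances):
    """Drop bunts, intentional walks, and plate appearances by pitchers."""
    excluded = {"bunt", "intentional_walk"}
    return [
        pa for pa in plate_appearances
        if pa.outcome not in excluded and not pa.batter_is_pitcher
    ]


def strikeout_rate(plate_appearances):
    """Return (P_K, adjusted PA count, reliable flag) for one pitcher,
    season, and platoon configuration."""
    adjusted = adjusted_plate_appearances(plate_appearances)
    if not adjusted:
        return None, 0, False
    strikeouts = sum(pa.outcome == "strikeout" for pa in adjusted)
    count = len(adjusted)
    return strikeouts / count, count, count >= MIN_ADJUSTED_PA
```

Returning the count and the reliability flag alongside the rate keeps the 150 plate appearance rule visible to whatever code later selects pitcher seasons for the analysis.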

Table 1

Number of pitcher seasons for each platoon configuration, 2008 to 2014

Table 2

Pitch-to-pitch correlation statistics for RHP versus RHB, 2008–2014

| Var. | Mean | Std. Dev. | Minimum | Pitcher/Year | Maximum | Pitcher/Year |
|---|---|---|---|---|---|---|
| rx | 0.076589 | 0.051952 | -0.074098 | J. Westbrook/2013 | 0.265521 | I. Kennedy/2013 |
| rz | 0.060740 | 0.046061 | -0.077553 | C. Morton/2010 | 0.220789 | N. Tepesch/2014 |
| rmx | 0.082631 | 0.082353 | -0.148586 | D. Bush/2010 | 0.383200 | J. Marquis/2013 |
| rmz | 0.099813 | 0.089800 | -0.130765 | M. Pineda/2011 | 0.554724 | C.-M. Wang/2008 |
| rs | 0.072066 | 0.088115 | -0.164595 | M. Estrada/2011 | 0.515570 | B. Colon/2013 |

Table 3

Pitch-to-pitch correlation statistics for RHP versus LHB, 2008–2014

| Var. | Mean | Std. Dev. | Minimum | Pitcher/Year | Maximum | Pitcher/Year |
|---|---|---|---|---|---|---|
| rx | 0.088363 | 0.051648 | -0.081521 | B. Bannister/2009 | 0.262060 | L. Hernandez/2010 |
| rz | 0.057892 | 0.048850 | -0.098732 | B. Tomko/2008 | 0.236806 | D. Pauley/2010 |
| rmx | 0.091798 | 0.081257 | -0.121566 | H. Iwakuma/2014 | 0.428561 | J. Marquis/2013 |
| rmz | 0.088761 | 0.089713 | -0.127830 | J. Duchscherer/2008 | 0.456142 | M. Batista/2008 |
| rs | 0.060686 | 0.089174 | -0.152202 | N. Figueroa/2010 | 0.438098 | B. Colon/2014 |

The (rx, rz, rmx, rmz, rs) correlation coefficients were computed for all of the cases represented in Table 1 using the Brooks Baseball adjustments to the PITCHf/x measurements. Tables 2, 3, 4, and 5 present the mean, standard deviation, and minimum and maximum values for each coefficient over the pitcher seasons for each platoon configuration. The Tables also provide the pitcher and year for each minimum and maximum value. We see that the coefficients that are based on movement and speed (rmx, rmz, rs) have larger ranges and standard deviations across pitchers than the coefficients that are based on location (rx, rz).

Table 4

Pitch-to-pitch correlation statistics for LHP versus RHB, 2008–2014

| Var. | Mean | Std. Dev. | Minimum | Pitcher/Year | Maximum | Pitcher/Year |
|---|---|---|---|---|---|---|
| rx | 0.079737 | 0.050875 | -0.063483 | C. Friedrich/2012 | 0.217846 | C. Kershaw/2010 |
| rz | 0.053402 | 0.040659 | -0.089965 | R. Rowland-Smith/2009 | 0.170210 | W. Smith/2012 |
| rmx | 0.105860 | 0.075576 | -0.150204 | E. Bedard/2011 | 0.340786 | T. Glavine/2008 |
| rmz | 0.093271 | 0.077765 | -0.137182 | E. Bedard/2011 | 0.308426 | M. Hampton/2009 |
| rs | 0.051189 | 0.079572 | -0.215225 | J. Outman/2009 | 0.250635 | J. Garcia/2013 |

Table 5

Pitch-to-pitch correlation statistics for LHP versus LHB, 2008–2014

| Var. | Mean | Std. Dev. | Minimum | Pitcher/Year | Maximum | Pitcher/Year |
|---|---|---|---|---|---|---|
| rx | 0.058538 | 0.060958 | -0.079173 | B. Chen/2013 | 0.236886 | J. Garcia/2011 |
| rz | 0.058783 | 0.056556 | -0.078295 | J. Vargas/2013 | 0.235601 | B. Duensing/2011 |
| rmx | 0.076101 | 0.082671 | -0.135050 | C.J. Wilson/2010 | 0.334654 | S. Diamond/2013 |
| rmz | 0.110630 | 0.083982 | -0.055096 | J. Vargas/2014 | 0.365518 | J. Danks/2011 |
| rs | 0.066918 | 0.080587 | -0.111602 | C.C. Sabathia/2011 | 0.300316 | W. Miley/2014 |

3.3 Year-to-year analysis

An important question is the degree to which the statistics defined in section 3.1 represent distinctive and repeatable characteristics of a pitcher. One way to answer this question is to compute year-to-year correlations which measure the consistency of a statistic for pitchers from year-to-year. Specifically, for each platoon configuration we identified the instances of pitchers who satisfied the criteria described in section 3.2 for consecutive seasons. These instances were used to form pairs of consecutive pitcher seasons for each platoon configuration where the second year of a pair was allowed to be the first year of another pair. The total number of pairs for each platoon configuration is given in Table 6. For each of the statistics defined in section 3.1 we computed the year-to-year correlation coefficient for each platoon configuration using these pairs. The results are shown in Table 7. If each pitcher has the same value of a statistic for every pair of consecutive years, then the correlation coefficient will be one. The observed value of the correlation coefficients is less than one due to variation in the statistic measurements that originates from the use of limited sample sizes and from actual changes in pitcher tendencies over time. We see that the year-to-year correlations for statistics derived from movement and speed measurements (rmx, rmz, rs) are larger than the year-to-year correlations for statistics derived from location measurements (rx, rz).
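The construction of consecutive-season pairs can be sketched as below. The dictionary layout keyed by (pitcher, year) and the sample rmz values are made up for illustration; only the pairing rule, in which a season may close one pair and open the next, is taken from the text.

```python
from math import sqrt


def consecutive_season_pairs(stat_by_pitcher_year):
    """Pair each pitcher season with the same pitcher's following season.

    A season can be the second element of one pair and the first element of
    the next. Seasons with no qualifying following season form no pair.
    """
    pairs = []
    for (pitcher, year), value in sorted(stat_by_pitcher_year.items()):
        following = stat_by_pitcher_year.get((pitcher, year + 1))
        if following is not None:
            pairs.append((value, following))
    return pairs


def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (first, second) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)


# Made-up rmz values for qualifying pitcher seasons; not real data.
rmz_by_season = {
    ("A", 2008): 0.10, ("A", 2009): 0.12, ("A", 2010): 0.09,
    ("B", 2009): 0.30, ("B", 2011): 0.28,   # gap year, so no pair for B
    ("C", 2012): 0.05, ("C", 2013): 0.07,
}
season_pairs = consecutive_season_pairs(rmz_by_season)
year_to_year_r = pearson_r(season_pairs)
```

In this toy input pitcher A contributes two overlapping pairs, pitcher B contributes none because 2010 is missing, and pitcher C contributes one.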

Table 6

Number of pairs of pitcher seasons for each platoon configuration, 2008 to 2014

Table 7

Year-to-year correlations for each variable


4 Modeling strikeout rate


4.1.1 Fastball velocity and movement

The large majority of major league pitchers throw a two-seam or four-seam fastball and these pitches are typically assigned one of the four labels FA (fastball), FF (four-seam fastball), FT (two-seam fastball), or SI (sinker) by MLBAM. We computed the average of the variables start-speed, pfx_x, and pfx_z for the pitches with each of these labels for each pitcher in our study for each applicable platoon configuration and year. The variable velo for a pitcher, year, and configuration refers to the largest average start-speed over the four labels. Similarly, the variable max_pfx_z refers to the largest average pfx_z over these labels. For right-handed pitchers, pfx_x is typically negative for these labels and we define min_pfx_x as the minimum value of the average pfx_x over the four labels. The variable max_pfx_x is defined in a similar way for left-handed pitchers for which pfx_x is typically positive for these pitches. The variables velo, max_pfx_z, min_pfx_x (for RHP), and max_pfx_x (for LHP) characterize the velocity and movement of a pitcher’s fastball.
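A minimal sketch of how these fastball variables could be derived from labeled pitches is given below. The tuple layout (label, start-speed, pfx_x, pfx_z) and the sample values are assumptions made for illustration; the four labels and the max/min rules follow the definitions above.

```python
from collections import defaultdict

FASTBALL_LABELS = {"FA", "FF", "FT", "SI"}


def fastball_variables(pitches, right_handed):
    """Compute velo, max_pfx_z, and min_pfx_x (RHP) or max_pfx_x (LHP).

    `pitches` is an iterable of (label, start_speed, pfx_x, pfx_z) tuples for
    one pitcher, year, and platoon configuration. Averages are taken per
    fastball label first, then the extreme average is selected.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0.0, 0])  # speed, pfx_x, pfx_z, count
    for label, start_speed, pfx_x, pfx_z in pitches:
        if label in FASTBALL_LABELS:
            entry = totals[label]
            entry[0] += start_speed
            entry[1] += pfx_x
            entry[2] += pfx_z
            entry[3] += 1
    if not totals:
        return None  # no pitches with a fastball label
    averages = [(s / n, x / n, z / n) for s, x, z, n in totals.values()]
    variables = {
        "velo": max(avg[0] for avg in averages),
        "max_pfx_z": max(avg[2] for avg in averages),
    }
    if right_handed:
        variables["min_pfx_x"] = min(avg[1] for avg in averages)
    else:
        variables["max_pfx_x"] = max(avg[1] for avg in averages)
    return variables


# Made-up pitches for a right-handed pitcher; speeds in mph, movement in inches.
sample_pitches = [
    ("FF", 93.0, -4.0, 9.0), ("FF", 95.0, -6.0, 11.0),
    ("SI", 92.0, -9.0, 5.0), ("SI", 92.0, -9.0, 5.0),
    ("CU", 78.0, 5.0, -6.0),
]
sample_variables = fastball_variables(sample_pitches, right_handed=True)
```

Note that the selected values need not come from the same label: in the sample, velo and max_pfx_z come from the four-seam fastball while min_pfx_x comes from the sinker.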

4.1.2 Pitch mix and sequencing

In addition to the intrinsic properties of individual pitches, a pitcher’s effectiveness also depends on pitch distribution and sequencing. A single parameter that provides a high-level description of pitch distribution is the fastball fraction f. For each pitcher, year, and platoon configuration we define f as the ratio of pitches with a label of FA, FF, FT, or SI to the total number of pitches. The variables rx, rz, rmx, rmz, rs provide a measure of pitch sequencing by capturing pitch-to-pitch correlation in location, movement, and velocity. As defined in section 3.1, these five variables are based on correlations that are computed using all pitches regardless of type.
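The fastball fraction is a simple ratio of label counts. The sketch below assumes the pitches for one pitcher, year, and platoon configuration are available as a list of MLBAM-style label strings; the sample labels are made up.

```python
FASTBALL_LABELS = {"FA", "FF", "FT", "SI"}


def fastball_fraction(labels):
    """Fraction of pitches whose label is FA, FF, FT, or SI."""
    labels = list(labels)
    if not labels:
        return None  # undefined when no pitches were thrown
    fastballs = sum(label in FASTBALL_LABELS for label in labels)
    return fastballs / len(labels)


# Made-up label sequence: five fastball variants out of eight pitches.
sample_labels = ["FF", "FF", "SI", "CU", "CH", "SL", "FT", "FF"]
f = fastball_fraction(sample_labels)  # 5 / 8
```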

4.2 Model estimation

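We use a separate linear regression model for each platoon configuration to approximate pitcher strikeout rate PK using the variables defined in section 4.1. The number of observations for each configuration is given by Table 1. We considered approximations of the form

$$\hat{P}_K = F_1(1, \mathrm{velo}, \mathrm{max\_pfx\_z}, \mathrm{min\_pfx\_x}) + F_2(f, r_x, r_z, r_{mx}, r_{mz}, r_s) \qquad (2)$$

for right-handed pitchers and

$$\hat{P}_K = F_1(1, \mathrm{velo}, \mathrm{max\_pfx\_z}, \mathrm{max\_pfx\_x}) + F_2(f, r_x, r_z, r_{mx}, r_{mz}, r_s) \qquad (3)$$
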
for left-handed pitchers where F1 and F2 are a linear combination of their component variables and first-order cross terms. Tables 8–11 present the models of this form for each platoon configuration that have the most significant variables where significance is defined by a p-value below 0.01. Each table includes the coefficient, standard error, t-statistic, and p-value for each significant variable along with the $R^2$ for the fit. We observe that off-speed pitches typically have a larger impact on same-sided (RHP vs. RHB, LHP vs. LHB) matchups. Therefore, the limited treatment of off-speed pitches by the model may explain the larger $R^2$ values for opposite-sided (RHP vs. LHB, LHP vs. RHB) matchups. Due to considerations presented in section 3.1, we also examined the use of variables defined by the absolute value of the pitch-to-pitch correlation coefficients. Using the original signed values of the coefficients, however, led to models with less error which suggests that a negative correlation benefits a pitcher’s strikeout rate more than a zero correlation.
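To make "component variables and first-order cross terms" concrete, the sketch below expands a set of named variables into the candidate regressors of such a linear combination. It is an illustration under assumptions: the helper name and the sample values are hypothetical, and the selection of significant terms and the least-squares fit itself are not shown.

```python
from itertools import combinations


def with_cross_terms(variables):
    """Return the variables plus every pairwise product (first-order cross term).

    `variables` maps a variable name to its value for one observation. The
    result can serve as one row of a regression design matrix.
    """
    features = dict(variables)
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        features[f"{name_a}*{name_b}"] = a * b
    return features


# Hypothetical values for the two F2 variables discussed for RHP versus LHB.
f2_row = with_cross_terms({"f": 0.6, "rmz": 0.1})

# With all six sequencing and mix variables there are 6 + 15 = 21 candidates.
all_f2 = with_cross_terms(
    {"f": 0.6, "rx": 0.08, "rz": 0.06, "rmx": 0.09, "rmz": 0.1, "rs": 0.07}
)
```

Because the number of candidate terms grows quadratically with the number of variables, retaining only terms that pass a significance threshold, as done for Tables 8–11, keeps the fitted models small.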

Table 8

RHP versus RHB, 796 observations, $R^2 = 0.240$

| Variable | Coefficient | Std. Error | t-Statistic | p-value |
|---|---|---|---|---|

Table 9

RHP versus LHB, 814 observations, $R^2 = 0.342$

| Variable | Coefficient | Std. Error | t-Statistic | p-value |
|---|---|---|---|---|

Table 10

LHP versus RHB, 394 observations, $R^2 = 0.383$

| Variable | Coefficient | Std. Error | t-Statistic | p-value |
|---|---|---|---|---|

Table 11

LHP versus LHB, 180 observations, $R^2 = 0.332$

| Variable | Coefficient | Std. Error | t-Statistic | p-value |
|---|---|---|---|---|

Table 12 presents the mean and maximum absolute error $|P_K - \hat{P}_K|$ over the observations for each platoon configuration. We see that the average absolute error is a few percent for each configuration. The table also provides the pitcher/year associated with the maximum absolute error. For each maximum error case, the approximation $\hat{P}_K$ underestimates $P_K$ and the large error is a result of the pitcher having a highly effective off-speed pitch which is not accounted for by the current model. In particular, Yu Darvish benefited from an exceptional slider, Felix Hernandez from an exceptional changeup, and Clayton Kershaw from an exceptional slider and curveball.
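The summary statistics reported in Table 12 can be computed along the following lines. This is a minimal sketch: the observation tuples (label, actual rate, estimated rate) and their values are made up, and the function name is hypothetical.

```python
def error_summary(observations):
    """Summarize absolute error between actual and estimated strikeout rate.

    `observations` is a list of (label, actual_rate, estimated_rate) tuples.
    Returns the mean absolute error, the maximum absolute error, the label of
    the worst case, and whether the model underestimated that case.
    """
    worst = max(observations, key=lambda obs: abs(obs[1] - obs[2]))
    label, actual, estimated = worst
    mean_error = sum(abs(a - e) for _, a, e in observations) / len(observations)
    return {
        "mean": mean_error,
        "max": abs(actual - estimated),
        "worst_case": label,
        "underestimated": estimated < actual,
    }


# Made-up pitcher seasons; not values from the study.
sample_observations = [
    ("A/2013", 0.30, 0.20),
    ("B/2014", 0.18, 0.20),
    ("C/2012", 0.22, 0.21),
]
summary = error_summary(sample_observations)
```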

Table 12

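Mean and maximum of absolute difference $D = |P_K - \hat{P}_K|$

| Configuration | Observations | Mean D | Maximum D | Pitcher/Year |
|---|---|---|---|---|
| RHP vs RHB | 796 | 0.033049 | 0.176470 | Y. Darvish/2013 |
| RHP vs LHB | 814 | 0.031604 | 0.110285 | F. Hernandez/2013 |
| LHP vs RHB | 394 | 0.028972 | 0.113404 | C. Kershaw/2014 |
| LHP vs LHB | 180 | 0.036079 | 0.155823 | C. Kershaw/2013 |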

We see from Tables 8–11 that the variable velo is significant for each platoon configuration with a positive coefficient which indicates that a one mile per hour increase in velocity corresponds to an increase in $\hat{P}_K$ of between 0.010 and 0.014 depending on the configuration. The fastball fraction f is also significant for each platoon configuration with a negative coefficient which indicates that a larger fraction of fastballs, everything else being equal, leads to a lower strikeout rate.
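To put the velocity coefficient on a more familiar scale, the range quoted above can be converted to strikeouts per nine innings using the figure of 38 batters faced per nine innings that is quoted in section 4.3. This is a back-of-the-envelope illustration, holding all other variables constant:

$$0.010 \times 38 = 0.38 \qquad \text{and} \qquad 0.014 \times 38 \approx 0.53$$

so one additional mile per hour of fastball velocity corresponds to roughly 0.4 to 0.5 additional strikeouts per nine innings in the fitted models.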

For the three configurations with the most observations (Tables 8–10) the fastball vertical movement variable max_pfx_z is significant along with one of the pitch-to-pitch correlation variables. Specifically, rmz is significant for the two configurations involving right-handed pitchers and rmx is significant for the LHP versus RHB configuration. As expected, increased vertical movement and lower pitch-to-pitch correlation are associated with higher strikeout rates. The horizontal movement variable min_pfx_x and the cross term max_pfx_z*min_pfx_x are also significant for the RHP versus RHB configuration.

Figures 1 through 4 illustrate the distribution of the explanatory variables and the estimated F1 and F2 surfaces for the RHP versus LHB platoon configuration considered in Table 9. Figure 1 plots the joint distribution of fastball velocity and vertical movement for the 814 observations for this configuration. The variables are nearly uncorrelated with a correlation coefficient of 0.05. Figure 2 plots F1(1, velo, max_pfx_z) and shows that strikeout rate increases as fastball velocity and vertical movement increase. Figure 3 plots the joint distribution of fastball fraction and pitch-to-pitch correlation in vertical movement. These variables are also nearly uncorrelated with a correlation coefficient of 0.09. Figure 4 plots F2(f, rmz) and shows that strikeout rate decreases as fastball fraction and pitch-to-pitch correlation in vertical movement increase. The shape of the F1 and F2 surfaces will be similar for the other platoon configurations for which these variables are significant.


Figure 1. Joint distribution for max_pfx_z and velo for RHP versus LHB.

Figure 2. F1 surface for RHP versus LHB.

Figure 3. Joint distribution for f and rmz for RHP versus LHB.

Figure 4. F2 surface for RHP versus LHB.

4.3 Impact of pitch-to-pitch correlation

In this section, we examine the role of the pitch-to-pitch correlation variables on strikeout rate in more detail. Table 13 considers the three platoon configurations for which a correlation variable is significant. The value c is the coefficient for the correlation variable and platoon configuration from one of Tables 8–10. The values σ and maxdiff represent the standard deviation and maximum difference over pitchers for the correlation variable and platoon configuration from one of Tables 2–4. The constant 38 is the average number of batters faced per nine innings in major league baseball in 2014. Thus, c * σ * 38 and c * maxdiff * 38 are the changes in the number of strikeouts per nine innings associated with changes of σ and maxdiff in the correlation variable if the other variables are held constant. We see that c * maxdiff * 38 is between one and two strikeouts per nine innings depending on the platoon configuration.
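The arithmetic behind Table 13 can be checked with a short sketch. Because the regression coefficients c are not reproduced here, the code backs out an implied c from the reported c * σ * 38 value and the standard deviation in Tables 2–4; that implied value is an assumption of this illustration, not a coefficient reported by the model tables. The variable and function names are hypothetical.

```python
BATTERS_PER_NINE = 38  # average batters faced per nine innings in 2014 (from the text)

# (configuration, variable, sigma, minimum, maximum, reported c * sigma * 38)
# sigma, minimum, and maximum are copied from Tables 2-4; the last value
# is copied from Table 13.
ROWS = [
    ("RHP vs RHB", "rmz", 0.089800, -0.130765, 0.554724, -0.173),
    ("RHP vs LHB", "rmz", 0.089713, -0.127830, 0.456142, -0.160),
    ("LHP vs RHB", "rmx", 0.075576, -0.150204, 0.340786, -0.270),
]


def implied_coefficient(sigma, k9_change_per_sigma):
    """Back out c from the reported c * sigma * 38 (an inferred value)."""
    return k9_change_per_sigma / (sigma * BATTERS_PER_NINE)


def k9_change(c, delta):
    """Change in strikeouts per nine innings for a change `delta` in r."""
    return c * delta * BATTERS_PER_NINE


k9_for_maxdiff = {}
for config, variable, sigma, minimum, maximum, k9_sigma in ROWS:
    c = implied_coefficient(sigma, k9_sigma)
    maxdiff = maximum - minimum
    k9_for_maxdiff[config] = k9_change(c, maxdiff)
    print(f"{config} {variable}: c ~ {c:.4f}, c * maxdiff * 38 ~ {k9_for_maxdiff[config]:.3f}")
```

The values recovered this way agree with the c * maxdiff * 38 column of Table 13 to within the rounding of the three-decimal inputs.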

Table 13

Dependence of strikeouts per nine innings on correlation variables

| Configuration | Variable | c * σ * 38 | c * maxdiff * 38 |
|---|---|---|---|
| RHP vs RHB | rmz | -0.173 | -1.322 |
| RHP vs LHB | rmz | -0.160 | -1.043 |
| LHP vs RHB | rmx | -0.270 | -1.751 |

Table 14

RHP versus LHB Models for James Shields 2011 and Bartolo Colon 2012

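| Pitcher/Year | velo | max_pfx_z | F1 | rmz | f | F2 |
|---|---|---|---|---|---|---|
| J. Shields/2011 | 91.886 | 8.923 | 0.248 | 0.177 | 0.325 | -0.048 |
| B. Colon/2012 | 92.173 | 8.579 | 0.248 | 0.237 | 0.902 | -0.122 |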

4.4 The Shields/Colon example

As an example of the importance of pitch mix and sequencing we present the case of right-handed pitchers James Shields (2011) and Bartolo Colon (2012) against left-handed batters. The pitchers had similar values for the fastball parameters velo and max_pfx_z which led to identical values for F1 for this configuration as shown in Table 14. Colon’s pitches, however, were much more predictable with a fastball fraction f = 0.902 and a vertical movement correlation rmz = 0.237 compared to f = 0.325 and rmz = 0.177 for Shields. As a result, F2, and therefore the overall $\hat{P}_K$ approximation, is 0.074 higher for Shields. The approximation $\hat{P}_K$ underestimates the actual strikeout rate for each pitcher by a few percent and Shields actually posted a strikeout rate $P_K$ that is 0.086 higher than Colon for this configuration and pair of years. In summary, the two pitchers have nearly identical fastball parameters and the same value for F1 but differences in pitch mix and sequencing led to a significant advantage in strikeout rate for Shields which is predicted by the model.
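The comparison can be reproduced from the F1 and F2 values in Table 14. The sketch assumes that the estimate is the sum F1 + F2, which is consistent with the 0.074 gap quoted above; the conversion to strikeouts per nine innings reuses the 38 batters per nine innings from section 4.3 and is an added illustration.

```python
BATTERS_PER_NINE = 38  # from section 4.3

# F1 and F2 values copied from Table 14 (RHP versus LHB).
TABLE_14 = {
    "J. Shields/2011": {"F1": 0.248, "F2": -0.048},
    "B. Colon/2012": {"F1": 0.248, "F2": -0.122},
}

# Assumption: the estimated strikeout rate is the sum of the two components.
estimate = {name: row["F1"] + row["F2"] for name, row in TABLE_14.items()}

gap = estimate["J. Shields/2011"] - estimate["B. Colon/2012"]
gap_per_nine = gap * BATTERS_PER_NINE

for name, value in estimate.items():
    print(f"{name}: estimated strikeout rate ~ {value:.3f}")
print(f"gap ~ {gap:.3f}, or about {gap_per_nine:.1f} strikeouts per nine innings")
```

Under that additive assumption the estimates are about 0.200 for Shields and 0.126 for Colon, so the entire gap is attributable to F2, that is, to pitch mix and sequencing rather than to fastball velocity or movement.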


The success of a major league pitcher depends on many factors including the velocity and movement of his pitches and on his ability to utilize an effective pitch distribution and sequencing strategy. We have examined the use of pitch-to-pitch correlations for location, velocity, and movement as a measure of pitch sequencing. These correlations characterize the degree to which the properties of an upcoming pitch can be predicted from the properties of the previous pitch. We have derived the pitch-to-pitch correlations for a set of pitchers using PITCHf/x measurements for nearly three million pitches thrown from 2008 to 2014. The data for each pitcher was partitioned according to the handedness of the batter but, in order to maximize sample size, other contextual variables such as the count and inning were not considered. We showed that there is significant year-to-year consistency in the pitch-to-pitch correlation of velocity and movement for all four platoon configurations. We also presented a model that describes the dependence of a pitcher’s strikeout rate on a number of variables that include fastball velocity and movement as well as a fastball fraction descriptor for pitch distribution and the pitch-to-pitch correlation descriptors for pitch sequencing. We showed that a pitcher’s strikeout rate increases as his fastball velocity and vertical movement increase. We also showed that a pitcher’s strikeout rate decreases, other factors equal, as his predictability in terms of fastball fraction and pitch-to-pitch correlation increases.

Since the fastball is the most common pitch in major league baseball, our fastball-centric model was able to capture a significant fraction of the variance in strikeout rate while allowing evaluation of the role of the new pitch-to-pitch correlation descriptors. As might be expected, the largest errors in the model occurred for pitchers with an exceptional offspeed pitch since the benefit of these pitches is not explicitly captured by the model. Thus, a more detailed model could include information about the number, frequency, and physical properties of a pitcher’s offspeed pitches and how well these pitches complement each other and the pitcher’s fastball. Pitch location is another important factor which affects a pitcher’s strikeout rate that could be incorporated into future models. The current model also neglects the impact of a pitcher’s delivery which can be beneficial if, for example, he hides the ball well or detrimental if he inadvertently provides clues about the identity of the upcoming pitch.


We thank Arunav Singh for help with data processing. The PITCHf/x data was obtained from  and the data used for computing strikeout rate was obtained from



Arthur R. (2014a), Entropy and the eephus [Online]. Available:

Arthur R. (2014b), The art and science of sequencing [Online]. Available:

Bonney P. (2015), Defining the pitch sequencing question [Online]. Available:

Cameron D. (2009), Velocity and K/9 [Online]. Available:

Carleton R. (2013), Should I worry about my favorite pitcher? [Online]. Available:

Fast M. (2010), What the heck is PITCHf/x? In Distelheim J., Tsao B., Oshan J., Bolado C. and Jacobs B., editors, The Hardball Times Baseball Annual, pages 153–158. The Hardball Times.

Gassko D. (2010), When a pitcher meets a hitter [Online]. Available:

Glaser C. (2010), The influence of batters’ expectations on pitch perception [Online]. Available:

Gray R. (2002a), Behavior of college baseball players in a virtual batting task, Journal of Experimental Psychology: Human Perception and Performance 28(5), 1131–1148.

Gray R. (2002b), Markov at the bat: A model of cognitive processing in baseball batters, Psychological Science 13(6), 542–547.

Greenhouse J. (2010), Lidge’s pitches [Online]. Available:

Healey G. (2015), Modeling the probability of a strikeout for a batter/pitcher matchup, IEEE Transactions on Knowledge and Data Engineering 27(9), 2415–2423.

Lichtman M. (2013), Pitch types and the times through the order penalty [Online]. Available:

Long J., Judge J. and Pavlidis H. (2017), Introducing pitch tunnels [Online]. Available:

Nathan A. (2012), Determining pitch movement from PITCHf/x data [Online]. Available:

Roegele J. (2014), The effects of pitch sequencing [Online]. Available:

Tango T., Lichtman M. and Dolphin A. (2007), The Book: Playing the Percentages in Baseball. Potomac Books, Dulles, Virginia.

Weinstein M. (2015), Finding value in fastball mixing [Online]. Available: