
Matchup models for the probability of a ground ball and a ground ball hit

Abstract

We develop matchup models for the probability of a ground ball and a ground ball hit using twelve years of major league baseball play-by-play data. The models are based on player descriptors that can be estimated reliably from small samples which facilitates the use of the models for prediction. The model for ground ball probability is obtained by generalizing the log5 model to include both ground ball and strikeout rates for the batter and pitcher. A strikeout rate cross term is shown to be significant in this model which leads to regions of the matchup space, termed matched and mismatched Krate configurations, where either the batter or pitcher is favored relative to the log5 prediction. We also build a model for the probability that a ground ball becomes a hit which separates the contributions of the batter, pitcher, and defense. We show that this probability has a strong dependence on the pitcher’s ground ball and strikeout rates and that the structure of this dependence changes with the platoon configuration. We give a physical justification for the model and provide examples of pitchers with characteristics that significantly lower or raise their expected ground ball hit rates. The new models for the probability of a ground ball and a ground ball hit are tested on out-of-sample data and shown to provide more accurate predictions than alternative models.

1 Introduction

The ability to predict the distribution of outcomes for a batter/pitcher matchup in baseball is useful for informing roster construction and player usage decisions (Koo, 2013). The historical samples that are available for a particular batter/pitcher matchup, however, are typically too small to support accurate prediction (Fox, 2005a) (Stern and Sugano, 2007) (Tango et al., 2007). An alternative approach is to develop predictive models that are based on characteristics of the batter and pitcher. James (1983) with Adams introduced the log5 model that predicts the probability of a binary outcome for a confrontation between two players as a function of the outcome rates for the players and for the environment. The log5 model, which is also known as the James function, has a number of desirable properties (Hammond et al., 2015) and has been used for many years to model the probability of outcomes in baseball (Carleton, 2009) (Fox, 2005b) (Levitt, 1999). It was recently shown using nearly one million observations that the log5 model accurately predicts the probability of a strikeout for a matchup and that incorporating additional explanatory variables can be used to improve the accuracy of the model (Healey, 2015).

About thirty-two percent of batter/pitcher matchups in major league baseball in 2014 resulted in a ground ball. The expected run value of a ground ball is significantly less than the average run value for a matchup in general which makes this outcome a desirable result for a pitcher (Murphy, 2015). Both batters and pitchers have a significant influence on the probability that a confrontation ends with a ground ball. Batters with uppercut swings, for example, will tend to hit fewer ground balls than batters with flatter swings. On the other hand, pitchers who specialize in offerings that are thrown in the lower part of the strike zone with downward movement will tend to induce more ground balls than other pitchers (Lependorf, 2013). Ground ball rates also depend on the platoon configuration for a matchup which is defined by the handedness (left or right) of the batter and pitcher. The ability of batters and pitchers to hit and induce ground balls is a repeatable skill and studies have shown that batter and pitcher ground ball rates can be estimated reliably using small samples (Carleton, 2012) (Carleton, 2013).

We will use twelve years of major league play-by-play data to develop a model for the probability of a ground ball for a batter/pitcher matchup. Starting from the log5 model, which uses the batter and pitcher ground ball rates, we show that an additional strikeout rate cross term is highly significant for all four platoon configurations. This cross term leads to regions of the matchup space that have a significantly higher or lower probability of a ground ball than the standard log5 prediction. These regions occur when the batter and pitcher strikeout rates both deviate significantly from the league average. This is consistent with the work of Morey and Cohen (2015) who also observed differences between log5 estimates and the outcome of simulations for cases where batter and pitcher rates deviate from league averages. We define matched Krate configurations for which ground balls are less likely than log5 predicts and, in addition, we present evidence that these configurations also lead to fewer strikeouts. Thus, matched Krate configurations are favorable for batters for these outcomes. Similarly, we define mismatched Krate configurations which have the opposite property. The new model is evaluated on out-of-sample data.

We will also build a model for the probability that a ground ball becomes a hit. Several researchers have studied the variables that affect the probability that a batted ball in general becomes a hit with particular attention devoted to the influence of the pitcher. McCracken (2001) postulated that there was little, if any, difference in the ability of major league pitchers to affect opponent batting average on batted balls in the field of play (BABIP). While this assertion provided a useful approximation, subsequent research showed that this claim was not strictly correct. Tippett (2003) concluded that a pitcher’s influence on BABIP is significant. He observed, for example, that pitchers with a high strikeout rate tend to allow a lower BABIP which has been confirmed by several subsequent studies (Bradbury, 2005) (Swartz, 2010a). Lichtman (2004) showed that pitchers have considerable control over their ground ball rate which impacts BABIP since ground balls become hits more often than fly balls. Swartz (2010b) used additional data to confirm this conclusion and to further quantify the dependence of BABIP on a pitcher’s ground ball rate. Lichtman (2004) had also speculated that pitchers might be able to control how hard a ball is hit and suggested the use of batted ball speed to investigate this hypothesis. Several years later, HITf/x data (Jensen, 2009) which provides estimates of the speed and direction of batted balls became available. In a 2011 study, Fast (2011a) used HITf/x measurements to show that both batters and pitchers influence the speed of a batted ball in the plane of the playing field and that batters control a larger share of the variance. He also showed (Fast, 2011b) that this speed has a strong correlation with the likelihood that a batted ball becomes a hit. Thus, batters and pitchers can influence their BABIP by affecting both the vertical launch angle and the speed of batted balls. 
The probability that a batted ball becomes a hit also depends on the defensive ability of the team in the field since defenders with greater range will typically allow fewer hits over a given distribution of batted balls.

In this paper, we will develop a predictive model for the probability that a ground ball results in a hit for a batter/pitcher matchup. Log5 is not a useful starting point for this model since the required batter and pitcher ground ball batting averages cannot be estimated reliably using small samples (Carleton, 2012) (Carleton, 2013). Instead, we use alternative explanatory variables in a binary logit model. We show that the probability that a ground ball becomes a hit depends on the platoon configuration and that, for most regions of the parameter space, is negatively correlated with the pitcher’s ground ball and strikeout rates. We also quantify the impact of the pitcher’s infield defense and the running speed of the batter. Our results build on Swartz’s (2010b) analysis of ground ball pitchers and on Lederer’s (2009) observations regarding the variables that impact ground ball BABIP. We present a physical justification for the model that is based on Cartwright’s (2012) analysis of HITf/x data. The model is also tested on out-of-sample data.

2 Matchup Models

2.1 Binary Logit Model

A logit model is often used to characterize the probability of a result in a binary experiment as a function of a set of explanatory variables. We will use this model to represent the probability of a ground ball and its outcome in a matchup between a batter and a pitcher. If we let E represent the probability of a ground ball for a matchup, then the logit model takes the form

$$E = F(c + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n) \tag{1}$$

where $x_1, x_2, \ldots, x_n$ are the explanatory variables and the logistic function

$$F(S) = \frac{1}{1 + e^{-S}} \tag{2}$$

ensures that the probability E is between 0 and 1. Several authors including Wooldridge (2013) provide a more detailed description of the logit and related models.
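
As a minimal sketch, and not the code used for the analysis, equations (1) and (2) could be evaluated in Python as follows. The function names, coefficients, and inputs are illustrative assumptions that do not come from the paper.

```python
import math
from typing import Sequence


def logistic(s: float) -> float:
    """Logistic function F(S) = 1 / (1 + exp(-S)) from equation (2)."""
    return 1.0 / (1.0 + math.exp(-s))


def logit_probability(intercept: float,
                      coefs: Sequence[float],
                      xs: Sequence[float]) -> float:
    """Equation (1): E = F(c + c1*x1 + ... + cn*xn)."""
    if len(coefs) != len(xs):
        raise ValueError("coefs and xs must have the same length")
    score = intercept + sum(c * x for c, x in zip(coefs, xs))
    return logistic(score)


# Made-up coefficients and explanatory variables, for illustration only.
p = logit_probability(-0.5, [1.0, 2.0], [0.3, -0.1])
print(f"E = {p:.4f}")
```

Whatever the value of the linear score, the logistic function maps it into the open interval (0, 1), which is what makes the result interpretable as a probability.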

2.2 Log5 Model

The log5 model (James, 1983) is a standard technique for representing the probability of an outcome in a binary experiment and has been widely used to describe matchups in sports. For our application, we denote the league, batter, and pitcher ground ball rates by $L$, $B$, and $P$ with corresponding odds ratios $L_o = L/(1 - L)$, $B_o = B/(1 - B)$, and $P_o = P/(1 - P)$. The log5 probability $E^*$ of a ground ball for a matchup between a batter and a pitcher satisfies (Healey, 2015)

$$E^* = F(-\ln(L_o) + \ln(B_o) + \ln(P_o)) \tag{3}$$

and, therefore, the log5 model is a special case of the logit model in (1) with $n = 2$, $c = -\ln(L_o)$, $c_1 = 1.0$, $x_1 = \ln(B_o)$, $c_2 = 1.0$, and $x_2 = \ln(P_o)$. The mathematical properties of the log5 model have been examined in detail (Hammond et al., 2015).
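
The logit form of log5 can be checked numerically against the ratio form that follows algebraically from equation (3), namely odds(E*) = Bo Po / Lo. The sketch below assumes made-up rates and hypothetical function names; it is an illustration rather than the implementation used in the paper.

```python
import math


def odds(rate: float) -> float:
    """Odds corresponding to a rate strictly between 0 and 1."""
    return rate / (1.0 - rate)


def log5_logit(league: float, batter: float, pitcher: float) -> float:
    """Equation (3): E* = F(-ln(Lo) + ln(Bo) + ln(Po))."""
    score = -math.log(odds(league)) + math.log(odds(batter)) + math.log(odds(pitcher))
    return 1.0 / (1.0 + math.exp(-score))


def log5_ratio(league: float, batter: float, pitcher: float) -> float:
    """Same quantity written as a ratio, obtained by solving odds(E*) = Bo*Po/Lo."""
    num = batter * pitcher / league
    den = num + (1.0 - batter) * (1.0 - pitcher) / (1.0 - league)
    return num / den


# Illustrative rates only (not taken from the tables in this paper).
L, B, P = 0.32, 0.38, 0.30
print(log5_logit(L, B, P), log5_ratio(L, B, P))
```

One property that is easy to confirm with either form: when the batter's rate equals the league rate, the prediction reduces to the pitcher's rate, and vice versa.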

3 Modeling the Probability of a Ground Ball

3.1 Player Descriptors

We will investigate models in the form of equation (1) for predicting the probability of a ground ball. A first step is to establish a set of descriptors for batters and pitchers that can be used to derive the model explanatory variables. Carleton (2012, 2013) showed that strikeout rate and ground ball rate reach a high reliability at smaller sample sizes than are required for other candidate player descriptor variables. This enables these rates to be estimated reliably for many players using only the observations within a single platoon configuration for a single season. For this reason, batter and pitcher strikeout and ground ball rates will be used to define the model explanatory variables.

Player descriptors will be computed using Retrosheet play-by-play data. Since the information required to compute ground ball rates has only been recorded since 2003, our analysis will consider matchups in major league baseball over the years from 2003 to 2014. Before player descriptors are computed, we remove all plate appearances that resulted in a bunt or an intentional walk and we also remove all plate appearances with a pitcher as a batter. Adjusted plate appearances refer to plate appearance totals after this removal of bunts, intentional walks, and pitchers as batters. For both batters and pitchers, strikeout rate is defined as strikeouts divided by adjusted plate appearances and ground ball rate is defined as ground balls divided by adjusted plate appearances. We note that ground ball rate is often defined as the ratio of ground balls to balls in play, but we instead use adjusted plate appearances in the denominator for consistency with the log5 model.
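
The rate definitions above can be sketched as follows. The record layout is a hypothetical one chosen for illustration; the field names are assumptions and are not Retrosheet field names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PlateAppearance:
    # Hypothetical flags describing one plate appearance.
    is_bunt: bool = False
    is_intentional_walk: bool = False
    batter_is_pitcher: bool = False
    is_strikeout: bool = False
    is_ground_ball: bool = False


def descriptor_rates(pas: Iterable[PlateAppearance]) -> Tuple[int, float, float]:
    """Return (adjusted PA, strikeout rate, ground ball rate).

    Bunts, intentional walks, and pitchers batting are dropped first, and
    both rates use adjusted plate appearances as the denominator.
    """
    adjusted = [
        pa for pa in pas
        if not (pa.is_bunt or pa.is_intentional_walk or pa.batter_is_pitcher)
    ]
    n = len(adjusted)
    if n == 0:
        raise ValueError("no adjusted plate appearances")
    strikeouts = sum(pa.is_strikeout for pa in adjusted)
    ground_balls = sum(pa.is_ground_ball for pa in adjusted)
    return n, strikeouts / n, ground_balls / n


# Tiny made-up sample: two of the six records are removed before counting.
sample = [
    PlateAppearance(is_strikeout=True),
    PlateAppearance(is_ground_ball=True),
    PlateAppearance(is_ground_ball=True),
    PlateAppearance(),
    PlateAppearance(is_intentional_walk=True),
    PlateAppearance(is_bunt=True, is_ground_ball=True),
]
adjusted_pa, so_rate, gb_rate = descriptor_rates(sample)
```

In practice the same function would be applied separately to each player, year, and platoon configuration.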

Strikeout and ground ball rates vary from season to season and also depend on the platoon configuration. Figure 1 shows that strikeout rates have been increasing since 2003 and that rates tend to be higher for same-sided platoon configurations (LHP vs LHB and RHP vs RHB). Figure 2 shows that ground ball rates decreased from 2003 to 2009 but have increased from 2009 to 2014. In addition, same-sided configurations have led to higher ground ball rates over the last few years. We will represent each player and the league using separate strikeout and ground ball rates for each year and for each platoon configuration. Table 1, for example, gives the individual player descriptors for switch-hitter Victor Martinez and right-handed pitcher Felix Hernandez for the 2014 season.

Fig. 1

League average strikeout rate.

Fig. 2

League average ground ball rate.

Table 1

Player descriptors for switch-hitter Victor Martinez and right-handed pitcher Felix Hernandez for 2014

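| Player Name | Role | Year | Configuration | SO Rate | GB Rate |
| --- | --- | --- | --- | --- | --- |
| Victor Martinez | Batter | 2014 | RHP vs LHB | 0.060738 | 0.379610 |
| Victor Martinez | Batter | 2014 | LHP vs RHB | 0.092105 | 0.302632 |
| Felix Hernandez | Pitcher | 2014 | RHP vs RHB | 0.296588 | 0.375328 |
| Felix Hernandez | Pitcher | 2014 | RHP vs LHB | 0.257198 | 0.383877 |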

3.2 Logistic Regression

Following previous work (Healey, 2015), player descriptors for a batter or pitcher will be regarded as reliable if the player amassed at least 150 adjusted plate appearances for a year and platoon configuration. Thus, the data set for analysis will include every plate appearance from 2003 to 2014, excluding bunts, intentional walks, and pitchers as batters, for which reliable player descriptors are available for both the batter and pitcher for the year and platoon configuration. Table 2 summarizes the total number of plate appearances for each platoon configuration that satisfy these criteria.

Table 2

Number of observations used for each platoon configuration over the years 2003 to 2014

| LHP vs LHB | LHP vs RHB | RHP vs LHB | RHP vs RHB |
| --- | --- | --- | --- |
| 25945 | 133351 | 444797 | 480101 |

Using the set of plate appearance observations for a platoon configuration, logistic regression can be used to recover the associated logit model for a set of explanatory variables. We evaluated models that included all combinations of log odds ratio and linear terms with cross terms in ground ball and strikeout rates. The model with the most significant variables is given by

$$E = F(c_0 \ln(L_o) + c_1 \ln(B_o) + c_2 \ln(P_o) + c_3 \hat{B}_K \hat{P}_K) \tag{4}$$

where $L_o$, $B_o$, and $P_o$ are the odds ratios of the league, batter, and pitcher ground ball rates $L$, $B$, and $P$ for the year and platoon configuration and $\hat{B}_K$ and $\hat{P}_K$ are the centered strikeout rates

$$\hat{B}_K = B_K - L_K, \qquad \hat{P}_K = P_K - L_K \tag{5}$$

where $B_K$, $P_K$, and $L_K$ are the strikeout rates for the batter, pitcher, and league for the year and platoon configuration. Equation (4) uses the same explanatory variables as the log5 model in equation (3) with the additional strikeout rate cross term $c_3 \hat{B}_K \hat{P}_K$. We note that the individual strikeout rate terms $\hat{B}_K$ and $\hat{P}_K$ were not significant for the prediction of $E$.
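
To see how the cross term perturbs log5, it may help to fix the first three coefficients at their log5 values ($c_0 = -1$, $c_1 = c_2 = 1$). This is a simplifying assumption made here for illustration, not the fitted model. Since the inverse of the logistic function is the log odds, equation (4) then reduces to

$$\ln\frac{E}{1-E} = \ln\frac{E^*}{1-E^*} + c_3 \hat{B}_K \hat{P}_K \quad\Longrightarrow\quad \frac{E}{1-E} = \frac{E^*}{1-E^*}\, e^{c_3 \hat{B}_K \hat{P}_K},$$

so the odds of a ground ball equal the log5 odds scaled by $e^{c_3 \hat{B}_K \hat{P}_K}$. The scale factor equals 1 whenever either strikeout rate matches the league rate, and its direction depends on the sign of $c_3$ and on whether the two centered rates share a sign.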

Tables 3, 4, 5, and 6 present the results of the logistic regression for each of the four platoon configurations. Each table contains the coefficients $(c_0, c_1, c_2, c_3)$, standard errors, z-statistics, and p-values that result when using the log5 coefficient values ($c_0 = -1.0$, $c_1 = 1.0$, $c_2 = 1.0$, $c_3 = 0.0$) as the null hypothesis. We see that the $c_0$, $c_1$, and $c_2$ coefficient values are close to the log5 values and that the p-values for these coefficients are all above 0.05 except for the $c_1$ coefficient for the RHP versus RHB configuration. Since the null hypothesis includes the log5 coefficient values for $c_0$, $c_1$, and $c_2$, the p-values indicate that we cannot reject the standard log5 coefficient values for eleven of the twelve cases, so we retain them and use the slightly larger value of 1.028248 for $c_1$ for the RHP versus RHB configuration. In addition, the cross term $\hat{B}_K \hat{P}_K$ has a negative coefficient and is highly significant for all four configurations.
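
Because the null hypothesis is the log5 value rather than zero, each z-statistic measures the distance of a coefficient from its log5 value in units of its standard error. The sketch below assumes a two-sided normal test, which is our reading of the tables rather than something the regression output states explicitly, and applies it to the $c_1$ entry reported for RHP versus RHB in Table 6. The function name is hypothetical.

```python
import math
from typing import Tuple


def z_and_p(coef: float, std_err: float, null_value: float) -> Tuple[float, float]:
    """z-statistic against a (possibly non-zero) null and its two-sided normal p-value."""
    z = (coef - null_value) / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# c1 (log odds batter GB rate) for RHP versus RHB, tested against the log5 value 1.0.
z, p = z_and_p(coef=1.028248, std_err=0.010912, null_value=1.0)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The result agrees with the tabulated z-statistic and p-value up to the rounding of the published coefficient and standard error.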

Table 3

Binary logit output, LHP versus LHB, 25945 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| $\ln(L_o)$ | log odds league GB rate | -1.001387 | 0.059800 | -0.023185 | 0.9815 |
| $\ln(B_o)$ | log odds batter GB rate | 1.010243 | 0.036774 | 0.278543 | 0.7806 |
| $\ln(P_o)$ | log odds pitcher GB rate | 1.017145 | 0.048937 | 0.350338 | 0.7261 |
| $\hat{B}_K \hat{P}_K$ | (batter SO rate)*(pitcher SO rate) | -11.77256 | 3.718200 | -3.166198 | 0.0015 |

Table 4

Binary logit output, LHP versus RHB, 133351 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| $\ln(L_o)$ | log odds league GB rate | -0.989104 | 0.030083 | 0.362200 | 0.7172 |
| $\ln(B_o)$ | log odds batter GB rate | 1.014524 | 0.019649 | 0.739164 | 0.4598 |
| $\ln(P_o)$ | log odds pitcher GB rate | 0.969426 | 0.022077 | -1.384855 | 0.1661 |
| $\hat{B}_K \hat{P}_K$ | (batter SO rate)*(pitcher SO rate) | -10.97408 | 2.227715 | -4.926160 | 0.0000 |

Table 5

Binary logit output, RHP versus LHB, 444797 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| $\ln(L_o)$ | log odds league GB rate | -1.019780 | 0.015758 | -1.255232 | 0.2094 |
| $\ln(B_o)$ | log odds batter GB rate | 1.018485 | 0.010419 | 1.774115 | 0.0760 |
| $\ln(P_o)$ | log odds pitcher GB rate | 1.001088 | 0.011290 | 0.096368 | 0.9232 |
| $\hat{B}_K \hat{P}_K$ | (batter SO rate)*(pitcher SO rate) | -8.174651 | 1.203490 | -6.792453 | 0.0000 |

Table 6

Binary logit output, RHP versus RHB, 480101 observations

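| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| $\ln(L_o)$ | log odds league GB rate | -1.018437 | 0.014757 | -1.249335 | 0.2115 |
| $\ln(B_o)$ | log odds batter GB rate | 1.028248 | 0.010912 | 2.588642 | 0.0096 |
| $\ln(P_o)$ | log odds pitcher GB rate | 0.993906 | 0.009630 | -0.632851 | 0.5268 |
| $\hat{B}_K \hat{P}_K$ | (batter SO rate)*(pitcher SO rate) | -5.030661 | 1.056461 | -4.761803 | 0.0000 |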

The strikeout rate cross term is the primary difference between the four-variable model $E$ in equation (4) and the log5 model $E^*$ in equation (3). Let $D = E - E^*$ be the difference between the models for a plate appearance observation. Table 7 presents the mean and maximum values of $|D|$ over all of the plate appearance observations that were used to build the models in Tables 3, 4, 5, and 6. The largest differences exceed 0.07, or seven percentage points, in predicted ground ball probability.

Table 7

Mean and maximum of absolute difference between 4-variable model and log5

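| Pitcher hand | Batter hand | Observations | Mean($\lvert D \rvert$) | Max($\lvert D \rvert$) |
| --- | --- | --- | --- | --- |
| Left | Left | 25945 | 0.007668 | 0.074980 |
| Left | Right | 133351 | 0.004667 | 0.073378 |
| Right | Left | 444797 | 0.003206 | 0.050879 |
| Right | Right | 480101 | 0.002576 | 0.046752 |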

3.3 Matched and Mismatched Krate Configurations

Figures 3 and 4 allow us to examine the differences between $E$ and $E^*$ as a function of the batter and pitcher strikeout rates. Figure 3 plots the $D = E - E^*$ surface as a function of $B_K$ and $P_K$ for the RHP versus LHB configuration for 2014 ($L_K = 0.191$) with the batter and pitcher ground ball rates set to the league average ($B = P = L = 0.320$) for this configuration. The shape of the surface will be similar for the other platoon configurations with the degree of curvature dependent on the size of the $c_3$ coefficient. We will refer to matchups for which $B_K$ and $P_K$ are both significantly below or both significantly above the mean $L_K$ as matched Krate configurations. We will refer to matchups for which $B_K$ and $P_K$ are both significantly different from the mean $L_K$ but are on different sides of $L_K$ as mismatched Krate configurations. Figure 4 shows the structure of the surface along the two orthogonal directions $B_K = P_K$ and $B_K = 0.5 - P_K$. The $B_K = P_K$ curve shows that for matched Krate configurations we will see fewer ground balls than log5 predicts. The $B_K = 0.5 - P_K$ curve shows that for mismatched Krate configurations we will see more ground balls than log5 predicts. The structure of this surface may result from the way that the interaction between the distribution of pitches and swings changes as pitcher and batter strikeout rates change.
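
A minimal sketch of how individual points on the $D = E - E^*$ surface could be evaluated, assuming the Table 5 coefficients for RHP versus LHB and the 2014 league rates quoted above. The three strikeout rate pairs are illustrative choices, not values from the paper, and the function names are hypothetical.

```python
import math


def logistic(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))


def log_odds(rate: float) -> float:
    return math.log(rate / (1.0 - rate))


# Fitted coefficients for RHP versus LHB, copied from Table 5.
C0, C1, C2, C3 = -1.019780, 1.018485, 1.001088, -8.174651
# 2014 league ground ball and strikeout rates for this configuration.
L_GB, L_K = 0.320, 0.191


def difference_from_log5(b_k: float, p_k: float,
                         b_gb: float = L_GB, p_gb: float = L_GB) -> float:
    """D = E - E* for given batter and pitcher strikeout rates."""
    e_star = logistic(-log_odds(L_GB) + log_odds(b_gb) + log_odds(p_gb))
    cross = (b_k - L_K) * (p_k - L_K)
    e = logistic(C0 * log_odds(L_GB) + C1 * log_odds(b_gb)
                 + C2 * log_odds(p_gb) + C3 * cross)
    return e - e_star


matched_low = difference_from_log5(0.10, 0.10)    # both below the league rate
matched_high = difference_from_log5(0.35, 0.35)   # both above the league rate
mismatched = difference_from_log5(0.10, 0.40)     # opposite sides of the league rate
print(matched_low, matched_high, mismatched)
```

With a negative $c_3$, both matched points come out below the log5 prediction and the mismatched point above it, which is the saddle shape described for Figures 3 and 4.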

Fig. 3

E - E* surface for RHP versus LHB, 2014, (B = P = L).

Fig. 4

One-dimensional slices of E - E* surface (RHP versus LHB, 2014).

3.4 The Aoki, Kershaw, and Diamond Example

As an example of the difference between the log5 model and the four-variable model of equation (4), we consider the case of left-handed batter Nori Aoki against left-handed pitchers Clayton Kershaw and Scott Diamond in 2013. The strikeout and ground ball rates for the three players are shown in Table 8 and the league average ground ball and strikeout rates for this year and configuration are $L = 0.329$ and $L_K = 0.232$. We see that the Aoki/Kershaw matchup is a mismatched Krate configuration while the Aoki/Diamond matchup is a matched Krate configuration. Since Diamond’s ground ball rate is significantly higher than Kershaw’s, the log5 ground ball probability for Aoki/Diamond ($E^* = 0.622854$) is significantly higher than for Aoki/Kershaw ($E^* = 0.494198$). However, since Aoki has a low strikeout rate and Kershaw and Diamond have high and low strikeout rates respectively, the strikeout rate cross term will have a significant impact on these matchups. Table 9 shows that the predicted ground ball probability $E$ using equation (4) is significantly different from $E^*$ for both matchups and that $E$ is actually higher for Aoki/Kershaw than for Aoki/Diamond. Thus, even though Aoki/Diamond has a log5 ground ball probability that is about 0.129 higher than for Aoki/Kershaw, the inclusion of the strikeout rate cross term in the model results in a higher predicted ground ball probability for Aoki/Kershaw.

Table 8

Player descriptors for Nori Aoki, Clayton Kershaw, and Scott Diamond for 2013

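| Player Name | Role | Year | Configuration | SO Rate | GB Rate |
| --- | --- | --- | --- | --- | --- |
| Nori Aoki | Batter | 2013 | LHP vs LHB | 0.058824 | 0.604278 |
| Clayton Kershaw | Pitcher | 2013 | LHP vs LHB | 0.387097 | 0.238710 |
| Scott Diamond | Pitcher | 2013 | LHP vs LHB | 0.124183 | 0.346405 |

Table 9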

Comparison of 4-variable model and log5 for LHP vs. LHB matchups for 2013

| Matchup | 4-variable model | log5 | Difference |
| --- | --- | --- | --- |
| Aoki vs. Kershaw | 0.569178 | 0.494198 | 0.074980 |
| Aoki vs. Diamond | 0.568888 | 0.622854 | -0.053966 |
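
The comparison in Table 9 can be approximately reproduced from the Table 3 coefficients and the Table 8 descriptors. The sketch below is an illustration with hypothetical function names, not the original analysis code. It uses the league rates as rounded in the text (0.329 and 0.232), so the output should be expected to match Table 9 only to about three decimal places.

```python
import math


def logistic(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))


def log_odds(rate: float) -> float:
    return math.log(rate / (1.0 - rate))


# Fitted coefficients for LHP versus LHB, copied from Table 3.
C0, C1, C2, C3 = -1.001387, 1.010243, 1.017145, -11.77256
# 2013 league rates for LHP versus LHB, rounded as quoted in the text.
LEAGUE_GB, LEAGUE_K = 0.329, 0.232


def log5(b_gb: float, p_gb: float) -> float:
    """Equation (3) for ground ball probability."""
    return logistic(-log_odds(LEAGUE_GB) + log_odds(b_gb) + log_odds(p_gb))


def four_variable(b_gb: float, b_k: float, p_gb: float, p_k: float) -> float:
    """Equation (4) with centered strikeout rates from equation (5)."""
    cross = (b_k - LEAGUE_K) * (p_k - LEAGUE_K)
    return logistic(C0 * log_odds(LEAGUE_GB) + C1 * log_odds(b_gb)
                    + C2 * log_odds(p_gb) + C3 * cross)


# (strikeout rate, ground ball rate) pairs from Table 8.
AOKI = (0.058824, 0.604278)
PITCHERS = {"Kershaw": (0.387097, 0.238710), "Diamond": (0.124183, 0.346405)}

results = {}
for name, (p_k, p_gb) in PITCHERS.items():
    e = four_variable(AOKI[1], AOKI[0], p_gb, p_k)
    e_star = log5(AOKI[1], p_gb)
    results[name] = (e, e_star)
    print(f"Aoki vs. {name}: E = {e:.4f}, E* = {e_star:.4f}, D = {e - e_star:+.4f}")
```

The ordering reversal is the point of interest: log5 ranks Aoki/Diamond well above Aoki/Kershaw, while the four-variable model places the two matchups almost level with Aoki/Kershaw slightly ahead.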

3.5 Do More Ground Balls Mean Fewer Strikeouts?

Since pitchers induce more ground balls for mismatched Krate configurations than the log5 model predicts, we might reasonably ask whether these additional ground balls come at the expense of strikeouts. We can answer this question by considering a model for strikeout probability $E_K$ of the form

$$E_K = F(c_0 \ln(L_{Ko}) + c_1 \ln(B_{Ko}) + c_2 \ln(P_{Ko}) + c_3 \hat{B}_K \hat{P}_K) \tag{6}$$

where $L_{Ko}$, $B_{Ko}$, and $P_{Ko}$ are the odds ratios of $L_K$, $B_K$, and $P_K$. This model uses the same explanatory variables as log5 for predicting strikeout probability but includes the additional strikeout rate cross term $\hat{B}_K \hat{P}_K$ as in equation (4).

Tables 10 and 11 present the results of the logistic regression for $E_K$ for the RHP versus LHB and RHP versus RHB platoon configurations. The $\hat{B}_K \hat{P}_K$ cross term was not near significance for the platoon configurations that involve left-handed pitchers, which was likely due to the smaller numbers of observations for these cases. As before, the log5 coefficient values ($c_0 = -1.0$, $c_1 = 1.0$, $c_2 = 1.0$, $c_3 = 0.0$) are used to define the null hypothesis. The resulting $c_0$, $c_1$, and $c_2$ coefficients are all close to the log5 values, with only the $\ln(B_{Ko})$ variable for the RHP versus LHB configuration (p-value 0.0351) resulting in a p-value that suggests rejecting the null hypothesis. The $\hat{B}_K \hat{P}_K$ cross term approaches significance for the two cases, with p-values of 0.2342 and 0.0634, and for both cases the $c_3$ coefficient is negative. This suggests that pitchers will achieve more strikeouts than the log5 prediction for mismatched Krate configurations and fewer strikeouts than the log5 prediction for matched Krate configurations.

In summary, the $\hat{B}_K \hat{P}_K$ cross term has a negative coefficient and is significant for all four platoon configurations for predicting ground ball probability, and it borders on significance with a negative coefficient for the two platoon configurations with the most observations for predicting strikeout probability. For mismatched Krate configurations, therefore, pitchers achieve both more ground balls and more strikeouts than log5 predicts. On the other hand, for matched Krate configurations, pitchers achieve fewer ground balls and fewer strikeouts than log5 predicts. Given that ground balls and strikeouts are both positive results for pitchers, the analysis reveals that pitchers are favored for these outcomes relative to log5 for mismatched Krate configurations while batters are favored for matched Krate configurations.

Table 10

Binary logit output, RHP versus LHB, 444797 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| $\ln(L_{Ko})$ | log odds league strikeout rate | -1.017762 | 0.016340 | -1.087001 | 0.2770 |
| $\ln(B_{Ko})$ | log odds batter strikeout rate | 1.022494 | 0.010673 | 2.107479 | 0.0351 |
| $\ln(P_{Ko})$ | log odds pitcher strikeout rate | 0.990792 | 0.011804 | -0.780076 | 0.4353 |
| $\hat{B}_K \hat{P}_K$ | (batter SO rate)*(pitcher SO rate) | -1.736038 | 1.459464 | -1.189504 | 0.2342 |

Table 11

Binary logit output, RHP versus RHB, 480101 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| $\ln(L_{Ko})$ | log odds league strikeout rate | -1.010867 | 0.015366 | -0.707269 | 0.4794 |
| $\ln(B_{Ko})$ | log odds batter strikeout rate | 1.012229 | 0.010118 | 1.208602 | 0.2268 |
| $\ln(P_{Ko})$ | log odds pitcher strikeout rate | 0.993737 | 0.010932 | -0.572909 | 0.5667 |
| $\hat{B}_K \hat{P}_K$ | (batter SO rate)*(pitcher SO rate) | -2.272633 | 1.224326 | -1.856232 | 0.0634 |

3.6 Utility for Out-of-Sample Prediction

We also evaluated the use of the new model for the analysis of out-of-sample data. For this purpose, we used the ground ball rates $L$, $B$, $P$ and the strikeout rates $L_K$, $B_K$, $P_K$ observed in 2014 along with the model presented in Tables 3 to 6, which was derived using 2003-2014 data, to predict the probability of outcomes in 2015. We considered all 2015 matchups which involve a batter and pitcher for which the rates estimated for 2014 were deemed reliable according to the criteria described in section 3.2. Let $E_p^*$ be the predicted ground ball probability for a 2015 matchup using the standard log5 model with 2014 rates and let $E_p$ be the predicted ground ball probability for a 2015 matchup using the four-variable model defined by Tables 3 to 6 using 2014 rates. We evaluated each model according to the log-likelihood of the 2015 matchups using the model. We also considered a baseline model which assigns a predicted ground ball probability for every 2015 matchup as the 2014 league average ground ball rate $L$ for the platoon configuration.

Table 12 compares the three models. We see that the 4-variable model has a larger log-likelihood than log5 for each platoon configuration. We also see that both models perform significantly better than the baseline model which assigns the league average prediction to each matchup. The differences in the log-likelihood for the models can be used to compute a p-value for the use of the 4-variable model over the 3-variable log5 model for this out-of-sample data. For the two configurations involving right-handed pitchers, which have the largest number of observations, the p-values are less than 0.05, which supports the use of the 4-variable model. For the configurations involving left-handed pitchers, the log-likelihood values are only slightly better for the 4-variable model and the p-values exceed 0.2.

Table 12

Log-likelihood for out-of-sample prediction

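| Pitcher hand | Batter hand | Observations | League average | log5 | 4-variable model |
| --- | --- | --- | --- | --- | --- |
| Left | Left | 1083 | -719.6 | -695.7 | -695.6 |
| Left | Right | 5224 | -3301.8 | -3245.9 | -3245.1 |
| Right | Left | 26420 | -16493.0 | -16255.9 | -16253.7 |
| Right | Right | 24598 | -15729.0 | -15542.1 | -15537.6 |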
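
The text does not state which test converts the log-likelihood differences into p-values. One plausible reading, offered here only as an assumption, is a likelihood ratio test with one degree of freedom for the added cross term. The sketch below shows the Bernoulli log-likelihood used for scoring and applies that assumed test to the Table 12 values; the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def log_likelihood(pairs: Iterable[Tuple[float, int]]) -> float:
    """Bernoulli log-likelihood of (predicted probability, 0/1 outcome) pairs."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for p, y in pairs)


def lr_p_value(ll_restricted: float, ll_full: float) -> float:
    """Chi-square (1 d.f.) tail probability of 2 * (ll_full - ll_restricted).

    ASSUMPTION: a one-degree-of-freedom likelihood ratio test.
    For X ~ chi-square(1), P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (ll_full - ll_restricted), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))


# (log5, 4-variable) log-likelihoods from Table 12, keyed by (pitcher hand, batter hand).
TABLE_12 = {
    ("Left", "Left"): (-695.7, -695.6),
    ("Left", "Right"): (-3245.9, -3245.1),
    ("Right", "Left"): (-16255.9, -16253.7),
    ("Right", "Right"): (-15542.1, -15537.6),
}

p_values = {key: lr_p_value(ll_log5, ll_4var)
            for key, (ll_log5, ll_4var) in TABLE_12.items()}
for key, p in p_values.items():
    print(key, f"p = {p:.4f}")
```

Under this assumption the p-values fall below 0.05 for both right-handed pitcher configurations and above 0.2 for both left-handed pitcher configurations, which is the pattern described above.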

4 Modeling the Probability of a Ground Ball Hit

The fate of the ground balls hit by a batter or allowed by a pitcher over the course of a season can have a significant impact on the overall success of the players and their teams. In this section, we consider models for the probability EH that a ground ball results in a hit. As before, bunts are not considered to be ground balls and we exclude plate appearances with pitchers as batters.

4.1 Model Variables

4.1.1 Platoon Configuration

The probability that a ground ball becomes a hit depends on the platoon configuration. Let LA be the league batting average on ground balls which is the ratio of ground ball hits to total ground balls. Figure 5 plots LA for each platoon configuration for the years between 2003 and 2014. We see that platoon configurations involving right-handed batters result in higher values of LA since right-handed batters hit more ground balls to the left side of the infield which require longer throws to first base. We also see that LA depends on the year as, for example, LA rose sharply between 2013 and 2014 for all four platoon configurations. Interestingly, teams deployed an all-time high number of infield shifts in 2014 that were intended to reduce LA (James, 2015). We also note that there is more year-to-year fluctuation in LA for platoon configurations involving left-handed pitchers because these configurations include fewer ground ball observations.

Fig. 5

League batting average on ground balls.

4.1.2 Batter and Pitcher Descriptors

An attempt to model $E_H$ might be based on the associated log5 explanatory variables of batter ground ball batting average, pitcher ground ball batting average, and league ground ball batting average. Batting averages for individual batters and pitchers, however, require a large number of plate appearances to reach a high level of reliability (Carleton, 2012) (Carleton, 2013). Thus, a model for $E_H$ that uses the log5 explanatory variables would be difficult to apply in practice due to the difficulty of obtaining reliable estimates for the batter and pitcher ground ball batting averages. As discussed in section 3.1, the $B$, $B_K$, $P$, and $P_K$ player descriptors can be estimated reliably from small samples and we will consider the use of these descriptors for modeling $E_H$. The probability that a ground ball for a matchup results in a hit also depends on the distribution of the speed and direction of batted balls for the batter and pitcher. Batters who hit harder ground balls, for example, will tend to have a higher ground ball batting average than otherwise similar batters. HITf/x data (Jensen, 2009) can be used to estimate the speed and direction of batted balls, but is not publicly available at this time.

A batter’s running speed also has a significant impact on EH since faster runners beat out more infield hits and force infielders to play shallower which compromises range. A player’s position can be used as a measure of running speed (Lederer, 2009). Centerfielders, for example, are typically faster runners than catchers. Table 13 gives the ground ball averages by position over the years 2003 to 2014. We see that outfielders have the highest ground ball averages and are followed by middle infielders while designated hitters, first basemen, and catchers produce the lowest ground ball averages. Figure 6 plots the ground ball averages by position for the years from 2003 to 2014 and shows that the averages can also vary over time. The ground ball average of designated hitters, for example, declined from 0.244 in 2003 to 0.206 in 2010 but has since increased to 0.234 in 2014. We define the batter positional speed S for a plate appearance as the ratio of ground ball hits to the total number of ground balls that were produced by the batter’s position for that year after removing all plate appearances that involve the current batter. The variable S is not computed separately for each platoon configuration due to the limited number of samples that are available for some position/configuration combinations. Other possible measures for batter speed include the Bill James speed score (James, 1987) which is based on variables such as a player’s number of stolen base attempts, triples, and runs scored per opportunity. We selected the positional speed measure over the Bill James speed score due to the latter’s dependence on variables besides speed. Stolen base attempts, for example, depend on a manager’s tendencies, triples depend on power and good fortune, and runs scored depend on the hitting ability of other batters in a lineup.

Table 13

Ground ball batting average by position over the years 2003 to 2014

| CF | RF | LF | SS | 2B | 3B | DH | 1B | C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.257 | 0.251 | 0.248 | 0.244 | 0.243 | 0.240 | 0.229 | 0.229 | 0.222 |

Fig. 6

League batting average on ground balls by position.

4.1.3 Defense

Team defense will also affect $E_H$ because infielders with greater range will turn more ground balls into outs. We define the infield defense $D$ for a plate appearance as the ratio of ground ball hits allowed to total ground balls allowed by the team in the field during that year after removing all plate appearances that involve the current pitcher. The plate appearances involving the current pitcher are removed to reduce the dependence of $D$ on characteristics of the pitcher that may affect $E_H$ but which are captured by other variables in the model. The variable $D$ is not computed separately for each platoon configuration due to the limited number of samples that are available for some team/configuration combinations. As an example, Figure 7 plots $D$ for each year from 2003 to 2014 for plate appearances involving left-handed pitcher Mark Buehrle. We see that $D$ can change significantly from year to year.
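
Both the positional speed $S$ of section 4.1.2 and the infield defense $D$ are leave-one-out ratios: hits divided by ground balls within a group (a batting position or a fielding team, for one year) after dropping the observations of the individual in question. A minimal sketch under a hypothetical record layout follows; the tuple structure, the function name, and the sample data are assumptions for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical record for one ground ball in a given year:
# (group, individual, is_hit), where group is a fielding team (for D) or a
# batting position (for S) and individual is the pitcher (for D) or batter (for S).
GroundBall = Tuple[str, str, bool]


def leave_one_out_average(ground_balls: Iterable[GroundBall],
                          group: str, individual: str) -> float:
    """Ground ball hit ratio for a group, excluding one individual's records."""
    hits = 0
    total = 0
    for g, who, is_hit in ground_balls:
        if g == group and who != individual:
            total += 1
            hits += int(is_hit)
    if total == 0:
        raise ValueError("no ground balls remain after exclusion")
    return hits / total


# Made-up records: one fielding team with three pitchers, plus an unrelated team.
sample = [
    ("TEAM_A", "pitcher_1", True),
    ("TEAM_A", "pitcher_1", False),
    ("TEAM_A", "pitcher_2", False),
    ("TEAM_A", "pitcher_2", True),
    ("TEAM_A", "pitcher_3", False),
    ("TEAM_A", "pitcher_3", False),
    ("TEAM_B", "pitcher_4", True),
]
d_for_pitcher_1 = leave_one_out_average(sample, "TEAM_A", "pitcher_1")
```

Excluding the individual's own ground balls keeps the measure from absorbing effects that the model attributes to that player through other variables.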

Fig.7

Mark Buehrle ground ball defense by year


4.2 Qualified Batters Experiment

Logistic regression can be applied to the set of ground ball observations for a platoon configuration to recover a logit model for EH using the model variables described in section 4.1. The Qualified Batters Experiment considers all ground balls hit between 2003 and 2014 in a matchup where both the batter and pitcher rates are reliable. As in section 3.2, we use 150 adjusted plate appearances for a year and platoon configuration as a threshold for the batter and pitcher rates to be considered reliable and we exclude bunts and matchups involving pitchers as batters. We also exclude matchups where the batter is a pinch-hitter since we cannot assign the batter position to these matchups which is necessary to use the positional speed (S) descriptor defined in section 4.1.2. We note that pinch-hitters are a relatively rare occurrence and accounted for only about three percent of major league plate appearances in 2015. Table 14 gives the total number of ground ball observations for each platoon configuration that satisfy these criteria.

Table 14

Number of ground ball observations with qualified batters for each platoon configuration over the years 2003 to 2014

| LHP vs LHB | LHP vs RHB | RHP vs LHB | RHP vs RHB |
|---|---|---|---|
| 8951 | 42443 | 144253 | 161498 |

We evaluated models for EH that included various combinations of the variables described in section 4.1. The most general resulting model based on the number of significant variables is given by

$$E_H = F\left(c_0 \ln(L_{Ao}) + c_1 \hat{S} + c_2 \hat{D} + c_3 \hat{P} + c_4 \hat{P}_K + c_5 \hat{P}\hat{P}_K + c_6 \hat{B}_K\right) \tag{7}$$

where $L_{Ao} = L_A/(1 - L_A)$ is the odds corresponding to the league ground ball batting average $L_A$ defined in section 4.1.1 for the year and platoon configuration. $\hat{S}$ and $\hat{D}$ are the centered speed and defense measures for a matchup and year

$$\hat{S} = S - L_A, \qquad \hat{D} = D - L_A \tag{8}$$

where $S$ and $D$ are defined in sections 4.1.2 and 4.1.3 and $L_A$ is the total ground ball average over all platoon configurations for the year. $\hat{B}_K$ and $\hat{P}_K$ are the centered strikeout rates defined in section 3.2 and $\hat{P}$ is the centered pitcher ground ball rate

$$\hat{P} = P - L \tag{9}$$

for the year and platoon configuration.

Tables 15–18 present the results of the logistic regression for the four platoon configurations. Each table contains the coefficients, standard errors, z-statistics, and p-values for the model that uses all of the variables in (7) that are significant with a p-value below 0.05 for the configuration. We see that, as expected, the number of significant variables increases as the number of observations for a configuration increases. The only significant variable that depends on a rate descriptor for the batter is the centered batter strikeout rate $\hat{B}_K$, and the sign of its coefficient varies with the configuration: it is negative for LHP versus LHB and RHP versus LHB but positive for RHP versus RHB. Thus, the utility of $\hat{B}_K$ for modeling $E_H$ is questionable and this variable will not be considered by the model examined in the next section.

Table 15

Binary logit output, LHP versus LHB, 8951 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 1.011717 | 0.023203 | 43.60235 | 0.0000 |
| $\hat{S}$ | centered batter speed | 4.994375 | 2.011967 | 2.482334 | 0.0131 |
| $\hat{B}_K$ | centered batter strikeout rate | -0.873913 | 0.406389 | -2.150433 | 0.0315 |

Table 16

Binary logit output, LHP versus RHB, 42443 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 0.970296 | 0.010219 | 94.94786 | 0.0000 |
| $\hat{D}$ | centered GB defense | 2.414325 | 0.707462 | 3.412654 | 0.0006 |
| $\hat{P}$ | centered pitcher GB rate | -0.791124 | 0.214093 | -3.695229 | 0.0002 |
| $\hat{P}_K$ | centered pitcher strikeout rate | -0.691896 | 0.258133 | -2.680388 | 0.0074 |

Table 17

Binary logit output, RHP versus LHB, 144253 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 1.004087 | 0.005598 | 179.3627 | 0.0000 |
| $\hat{D}$ | centered GB defense | 2.681578 | 0.408219 | 6.568965 | 0.0000 |
| $\hat{S}$ | centered batter speed | 3.412765 | 0.512611 | 6.657618 | 0.0000 |
| $\hat{P}$ | centered pitcher GB rate | -0.845047 | 0.096329 | -8.772530 | 0.0000 |
| $\hat{B}_K$ | centered batter strikeout rate | -0.558533 | 0.117567 | -4.750779 | 0.0000 |

Table 18

Binary logit output, RHP versus RHB, 161498 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 0.995082 | 0.005887 | 169.0257 | 0.0000 |
| $\hat{D}$ | centered GB defense | 2.609964 | 0.380299 | 6.862924 | 0.0000 |
| $\hat{S}$ | centered batter speed | 4.335826 | 0.513569 | 8.442541 | 0.0000 |
| $\hat{P}$ | centered pitcher GB rate | -0.812320 | 0.089557 | -9.070471 | 0.0000 |
| $\hat{P}_K$ | centered pitcher strikeout rate | -0.521551 | 0.126566 | -4.120796 | 0.0000 |
| $\hat{B}_K$ | centered batter strikeout rate | 0.209660 | 0.105096 | 1.994950 | 0.0460 |
| $\hat{P}\hat{P}_K$ | (pitcher GB rate)*(pitcher SO rate) | -2.851151 | 1.339279 | -2.128870 | 0.0333 |
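
The columns of the regression tables are related in a simple way that can be checked by hand. The sketch below assumes that each z-statistic is the coefficient divided by its standard error and that each p-value is a two-sided tail probability of the standard normal distribution; the source does not state this explicitly, but the tabulated values are consistent with it. The function name `wald_z_and_p` is hypothetical.

```python
import math

def wald_z_and_p(coefficient, std_error):
    """Return the z-statistic and a two-sided normal p-value (assumed convention)."""
    z = coefficient / std_error
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Centered batter strikeout rate row of Table 18 (RHP versus RHB).
z, p = wald_z_and_p(0.209660, 0.105096)
```

For this row the computed values are approximately 1.9949 and 0.046, which agrees with the tabulated z-statistic and p-value up to rounding of the inputs.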

4.3 All Batters Experiment

Since the regression results in section 4.2 are limited by sample size, we considered an All Batters Experiment that removes $\hat{B}_K$ from equation (7) to form the model

$$E_H = F\left(c_0 \ln(L_{Ao}) + c_1 \hat{S} + c_2 \hat{D} + c_3 \hat{P} + c_4 \hat{P}_K + c_5 \hat{P}\hat{P}_K\right). \tag{10}$$

Since this model does not depend on a batter rate descriptor, we can remove the restriction that batter rates are reliable for a matchup. This provides more observations to study the role of the other variables in models for $E_H$. Thus, we repeated the experiment described in section 4.2 with the model of (10), keeping the threshold of 150 adjusted plate appearances for the pitcher in a matchup but otherwise considering all ground balls after excluding bunts, pitchers as batters, and pinch-hitters. Table 19 gives the total number of ground ball observations for each platoon configuration that satisfy the criteria. We note that the number of observations for each platoon configuration is larger than for the Qualified Batters Experiment as presented in Table 14. The LHP versus LHB configuration, however, still has a relatively small number of observations which limits its utility for analysis.

Table 19

Number of ground ball observations with all batters for each platoon configuration over the years 2003 to 2014

| LHP vs LHB | LHP vs RHB | RHP vs LHB | RHP vs RHB |
|---|---|---|---|
| 16198 | 106988 | 159807 | 183505 |

The results of the logistic regression for the All Batters Experiment are presented in Tables 20–23. For each platoon configuration, the model is given that uses all of the variables in (10) that have a p-value below 0.05. In contrast to the Qualified Batters Experiment, each of the first five variables in (10) is significant for each platoon configuration except LHP versus LHB. In addition, the $\hat{P}\hat{P}_K$ cross term is significant for the RHP versus RHB configuration and the signs of the coefficients for the significant variables are consistent across the configurations. In particular, the $c_3$ and $c_4$ coefficients are negative wherever they are significant, which causes $E_H$ to decrease as a pitcher’s ground ball and strikeout rates increase except over regions of the RHP versus RHB configuration where the $\hat{P}\hat{P}_K$ cross term has a large impact.

Table 20

Binary logit output, LHP versus LHB, 16198 observations

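| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 1.029265 | 0.017349 | 59.32596 | 0.0000 |
| $\hat{S}$ | centered batter speed | 7.222611 | 1.476478 | 4.891785 | 0.0000 |
| $\hat{P}$ | centered pitcher GB rate | -0.735898 | 0.306736 | -2.399122 | 0.0164 |

Table 21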

Binary logit output, LHP versus RHB, 106988 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 1.003524 | 0.006521 | 153.8952 | 0.0000 |
| $\hat{D}$ | centered GB defense | 1.950666 | 0.451082 | 4.324414 | 0.0000 |
| $\hat{S}$ | centered batter speed | 2.865308 | 0.611616 | 4.684818 | 0.0000 |
| $\hat{P}$ | centered pitcher GB rate | -0.831282 | 0.135617 | -6.129606 | 0.0000 |
| $\hat{P}_K$ | centered pitcher strikeout rate | -0.466355 | 0.162944 | -2.862065 | 0.0042 |

Table 22

Binary logit output, RHP versus LHB, 159807 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 1.008563 | 0.005203 | 193.8450 | 0.0000 |
| $\hat{D}$ | centered GB defense | 2.522635 | 0.389756 | 6.472349 | 0.0000 |
| $\hat{S}$ | centered batter speed | 3.444431 | 0.487936 | 7.059191 | 0.0000 |
| $\hat{P}$ | centered pitcher GB rate | -0.910908 | 0.103393 | -8.810120 | 0.0000 |
| $\hat{P}_K$ | centered pitcher strikeout rate | -0.274027 | 0.136418 | -2.008722 | 0.0446 |

Table 23

Binary logit output, RHP versus RHB, 183505 observations

| Variable | Description | Coefficient | Std. Error | z-Statistic | p-value |
|---|---|---|---|---|---|
| $\ln(L_{Ao})$ | log odds league GB average | 1.010208 | 0.005436 | 185.8355 | 0.0000 |
| $\hat{D}$ | centered GB defense | 2.721076 | 0.358477 | 7.590666 | 0.0000 |
| $\hat{S}$ | centered batter speed | 4.588845 | 0.469654 | 9.770689 | 0.0000 |
| $\hat{P}$ | centered pitcher GB rate | -0.893469 | 0.084441 | -10.58101 | 0.0000 |
| $\hat{P}_K$ | centered pitcher strikeout rate | -0.581187 | 0.119329 | -4.870471 | 0.0000 |
| $\hat{P}\hat{P}_K$ | (pitcher GB rate)*(pitcher SO rate) | -3.714297 | 1.256525 | -2.956007 | 0.0031 |

If we set the batter’s running speed $S$ and the pitcher’s infield defense $D$ to the league average $L_A$, then $\hat{S}$ and $\hat{D}$ vanish from (10), which allows us to focus on the dependence of $E_H$ on the pitcher descriptors $P$ and $P_K$. Figure 8 plots $E_H$ as a function of $P$ and $P_K$ using the coefficients $c_0$, $c_3$, and $c_4$ from Table 22 for the case of 2014 matchups between right-handed pitchers and left-handed batters. We see that $E_H$ decreases as $P$ and $P_K$ increase since $c_3$ and $c_4$ are negative. The shape of the surface will be similar for other years with small adjustments due to changes in the league averages $L_A$, $L$, and $L_K$. Figure 9 plots $E_H$ as a function of $P$ and $P_K$ using the coefficients $c_0$, $c_3$, $c_4$, and $c_5$ from Table 23 for the case of 2014 matchups between right-handed pitchers and right-handed batters. Curvature is added to the surface by the $\hat{P}\hat{P}_K$ cross term which is significant for this platoon configuration.
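
As a rough illustration of how a point on such a surface can be evaluated, the sketch below plugs the Table 23 coefficients into the model of (10). Several things here are assumptions rather than statements from the source: F is taken to be the standard logistic function, the centered strikeout rate is taken to be the pitcher's strikeout rate minus the league strikeout rate, and the league values `LA`, `L`, and `LK` are hypothetical placeholders, not the actual 2014 averages. The pitcher rates for Brandon Webb and Brad Lidge are copied from Table 25.

```python
import math

# Coefficients c0..c5 copied from Table 23 (RHP versus RHB, All Batters Experiment).
# c1 multiplies centered speed and c2 multiplies centered defense, as in (10).
C0, C1, C2, C3, C4, C5 = 1.010208, 4.588845, 2.721076, -0.893469, -0.581187, -3.714297

def logistic(x):
    """Assumed form of F: the standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def ground_ball_hit_prob(league_avg, p, pk, league_p, league_pk, s_hat=0.0, d_hat=0.0):
    """Evaluate the model of (10) for one matchup."""
    p_hat = p - league_p        # centered pitcher ground ball rate
    pk_hat = pk - league_pk     # centered pitcher strikeout rate (assumed definition)
    log_odds = math.log(league_avg / (1.0 - league_avg))
    x = (C0 * log_odds + C1 * s_hat + C2 * d_hat
         + C3 * p_hat + C4 * pk_hat + C5 * p_hat * pk_hat)
    return logistic(x)

# Hypothetical league values, chosen only for illustration (not from the paper).
LA, L, LK = 0.24, 0.30, 0.20

eh_average = ground_ball_hit_prob(LA, L, LK, L, LK)            # league-average pitcher
eh_webb = ground_ball_hit_prob(LA, 0.528708, 0.212919, L, LK)  # high P, Table 25
eh_lidge = ground_ball_hit_prob(LA, 0.132948, 0.514451, L, LK) # low P, high PK, Table 25
```

With these placeholder league values the ordering is `eh_webb < eh_average < eh_lidge`, which matches the signs of C reported for the two pitchers; the magnitudes depend on the true league averages and should not be read as reproductions of the published values.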

Fig.8

$E_H$ surface for RHP versus LHB, 2014 ($\hat{S}=\hat{D}=0$).

Fig.9

$E_H$ surface for RHP versus RHB, 2014 ($\hat{S}=\hat{D}=0$).

We can further examine the dependence of $E_H$ on pitcher characteristics by setting $\hat{S}=0$ and $\hat{D}=0$ in (10) and considering the deviations from league average $C = E_H - L_A$ for each instance of a pitcher in our study with more than 150 adjusted plate appearances for a year and platoon configuration. Table 24 presents the number of pitcher instances, the mean of $|C|$, and the minimum and maximum values of $C$ for the three platoon configurations where at least the first five variables in (10) are significant. We see that the average absolute difference between $E_H$ and $L_A$ is between seven and nine points of ground ball batting average depending on the configuration, and that the largest differences range from about thirty points for the opposite-hand configurations to more than forty points for RHP versus RHB. Table 25 presents the pitcher and year that correspond to the minimum and maximum values of $C$ for each configuration in Table 24. Cases with negative values of $C$ correspond to pitchers with characteristics that reduce ground ball batting average and we see that both $P$ and $P_K$ are well above the league averages of $L$ and $L_K$ for these cases as predicted by figures 8 and 9. Cases with positive values of $C$ correspond to pitchers with characteristics that increase ground ball batting average and the ground ball rate $P$ is well below the league average $L$ for these cases as predicted by the figures. For the case of Brad Lidge in 2004, the large positive value of $C$ benefits from the $\hat{P}\hat{P}_K$ cross term which becomes large, as shown in figure 9 for the RHP versus RHB configuration, for pitchers with a small ground ball rate $P$ and a large strikeout rate $P_K$. The last column in Table 25 is the actual ground ball batting average allowed by each pitcher for the year and platoon configuration. We see that the pitchers with characteristics that reduce $E_H$ (negative values of $C$) allowed ground ball averages that are well below the league average while pitchers with characteristics that increase $E_H$ (positive values of $C$) allowed ground ball averages that are well above the league average.

Table 24

Differences Between EH and LA

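| Pit_Hand | Bat_Hand | Pitcher Instances | Mean($\lvert C \rvert$) | Min($C$) | Max($C$) |
|---|---|---|---|---|---|
| Left | Right | 834 | 0.007055 | -0.030735 | 0.029030 |
| Right | Left | 1587 | 0.007820 | -0.033752 | 0.033010 |
| Right | Right | 1988 | 0.008661 | -0.042953 | 0.042676 |

Table 25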

Pitchers With Large Differences Between EH and LA

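| Pit_Hand | Bat_Hand | Pitcher | Year | $P$ | $P_K$ | $C$ | GB Avg. |
|---|---|---|---|---|---|---|---|
| Left | Right | Jonny Venters | 2011 | 0.484252 | 0.228346 | -0.030735 | 0.203 |
| Left | Right | Brad Hand | 2011 | 0.151832 | 0.141361 | 0.029030 | 0.276 |
| Right | Left | Roy Halladay | 2005 | 0.534965 | 0.174825 | -0.033752 | 0.163 |
| Right | Left | Chris Young | 2008 | 0.116402 | 0.169312 | 0.033010 | 0.318 |
| Right | Right | Brandon Webb | 2006 | 0.528708 | 0.212919 | -0.042953 | 0.186 |
| Right | Right | Brad Lidge | 2004 | 0.132948 | 0.514451 | 0.042676 | 0.304 |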

4.4 Utility for Out-of-Sample Prediction

We also assessed the model developed in section 4.3 for the prediction of out-of-sample data. Using the model in equation (10), we considered the prediction of ground ball hit probabilities for 2015 data using the league rate $L_A$ and the individual pitcher rates $\hat{P}$ and $\hat{P}_K$ for each platoon configuration for 2014 in addition to the speed measure $\hat{S}$ for 2014. We did not use the infield defense measure $\hat{D}$ for the out-of-sample prediction because personnel changes cause ground ball defense to vary widely from year to year (see figure 7). We considered all 2015 matchups that included pitchers with reliable rates for 2014 according to the criteria in section 3.2 after excluding bunts, pitchers as batters, and pinch-hitters. Let $E_{H1}$ be the predicted ground ball hit probability for a 2015 matchup using (10) with only the league average and speed variables from 2014 ($c_2 = c_3 = c_4 = c_5 = 0$) and let $E_{H2}$ be the predicted ground ball hit probability for a model that also includes the individual pitcher variables from 2014 ($c_2 = 0$), where the model coefficients in Tables 20–23 are used for each case. We evaluated each model according to the log-likelihood of the 2015 matchups using the model. We also considered a baseline model which assigns a predicted ground ball hit probability for 2015 matchups as the 2014 league average ground ball hit rate $L_A$ for the platoon configuration.

Table 26 compares the log-likelihood for the predictive models. We see that using the speed measure $\hat{S}$ increases the log-likelihood for each configuration compared to the baseline model. We also see that adding the pitcher descriptors further increases the log-likelihood for each case. As in section 3.6, we can compute p-values to compare the models for this out-of-sample data. For the configurations involving batters and pitchers of opposite hand (LHP versus RHB, RHP versus LHB), the differences in log-likelihood between the league average model and $E_{H1}$ and between $E_{H1}$ and $E_{H2}$ give p-values below 0.10 for each case. For the other configurations, the log-likelihood values have smaller gains as we add variables and the p-values for the transitions exceed 0.15.

Table 26

Log-likelihood for out-of-sample prediction for three models

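| Pit_Hand | Bat_Hand | Observations | Average $L_A$ | $L_A$ + speed $S$ | $L_A$ + $S$ + pitcher GB rate + pitcher SO rate |
|---|---|---|---|---|---|
| Left | Left | 702 | -363.5 | -362.5 | -361.4 |
| Left | Right | 4211 | -2391.6 | -2389.2 | -2386.7 |
| Right | Left | 6744 | -3575.8 | -3572.7 | -3568.3 |
| Right | Right | 7119 | -4066.1 | -4065.9 | -4064.5 |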
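
The log-likelihood used to score the models can be sketched as the sum of Bernoulli log-probabilities over the held-out ground balls. The example below is a minimal illustration only: the function name `log_likelihood`, the outcomes, and the predicted probabilities are hypothetical and are not taken from the 2015 data.

```python
import math

def log_likelihood(outcomes, probabilities):
    """Sum of Bernoulli log-probabilities; outcomes are 1 for a hit, 0 for an out."""
    return sum(
        math.log(p) if y else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )

# Hypothetical held-out ground balls and predictions.
outcomes = [1, 0, 0, 1, 0]
baseline = [0.24] * len(outcomes)          # one league-average rate for every matchup
model = [0.30, 0.20, 0.22, 0.28, 0.21]     # matchup-specific predictions

ll_baseline = log_likelihood(outcomes, baseline)
ll_model = log_likelihood(outcomes, model)
```

A model that assigns higher probabilities to the ground balls that became hits and lower probabilities to those that became outs produces a log-likelihood closer to zero, which is the sense in which the values in Table 26 increase from left to right.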

4.5 Physical Justification

The result that pitchers with high ground ball rates tend to allow a lower batting average on ground balls is consistent with physical intuition. Cartwright (2012) used HITf/x data to examine this phenomenon in detail by considering the distribution of vertical angles of batted balls allowed by a pitcher where a vertical angle of -90° is straight down and a vertical angle of +90° is straight up. He showed that as pitchers achieve a higher ground ball rate, the full distribution of opponent batted balls shifts to smaller vertical angles. This shift tends to make ground balls easier to field because they are hit more directly into the ground with a smaller velocity component in the plane of the playing field. For balls in the air, however, this shift in the distribution turns pop-ups with large vertical angles into fly balls and line drives with smaller vertical angles that are more difficult to field. As a result, pitchers with high ground ball rates tend to achieve the best results on ground balls, but typically allow higher batting averages on balls hit in the air (Swartz, 2010b). Murphy (2015) analyzes some of the tradeoffs related to a pitcher’s ground ball versus fly ball tendencies and explores strategies that pitchers with high ground ball rates can employ to improve their results on balls hit in the air.
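
The geometric part of this argument can be illustrated with a one-line calculation. The sketch below assumes two batted balls with the same exit speed and ignores spin, bounce, and air resistance; the function name `in_plane_speed` and the numerical values are hypothetical and serve only to show how the velocity component in the plane of the field shrinks as the vertical angle becomes more negative.

```python
import math

def in_plane_speed(exit_speed, vertical_angle_deg):
    """Speed component parallel to the field for a given vertical launch angle."""
    return exit_speed * math.cos(math.radians(vertical_angle_deg))

# Two hypothetical ground balls with the same exit speed (arbitrary units).
shallow = in_plane_speed(90.0, -5.0)    # hit nearly level
steep = in_plane_speed(90.0, -30.0)     # hit more directly into the ground
```

Under these assumptions the steeper ground ball retains a smaller share of its speed in the plane of the field, which is the effect the paragraph above attributes to pitchers with high ground ball rates.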

5 Conclusion

We have shown that the probability of a ground ball for a matchup can be predicted using batter and pitcher descriptors that can be estimated reliably from small samples. The resulting predictive model is a generalization of the log5 formula which is based on the batter and pitcher ground ball rates, but the new model also captures the interaction between batter and pitcher strikeout rates. This interaction leads to matched and mismatched Krate configurations which represent sets of matchups for which the batter or pitcher is favored with respect to both ground balls and strikeouts compared to the log5 prediction. We introduced the Aoki/Kershaw/Diamond example to illustrate the principle of matched and mismatched Krate configurations and to demonstrate how ground ball probability is affected for matchups within these configurations. We also tested the model on out-of-sample data.

The outcome of the ground balls hit or allowed by a team can have a large effect on the team’s performance. Log5 is not useful for predicting the probability that a ground ball results in a hit due to the difficulty of obtaining reliable estimates for the component explanatory variables. Instead, we have employed a logit model to show that the probability of a ground ball hit depends on the platoon configuration and a set of alternative variables that separate the influence of the batter, pitcher, and defense. In order to address sample size issues, we defined an All Batters experiment that focuses on variables that depend on the pitcher, his infield defense, and the batter’s running speed. We showed that the probability of a ground ball becoming a hit depends on both the pitcher’s ground ball and strikeout rates. We also showed that the role of the different explanatory variables depends on the platoon configuration. The model was assessed for the prediction of ground ball hit probability on out-of-sample data.

Descriptors that characterize the distribution of batted ball speeds and launch angles for a batter or pitcher could be used to improve the model, but the data required to generate these descriptors is not publicly available at this time. Additional player descriptors, however, can easily be incorporated into the model as they become available. We provide a physical justification for the dependence of ground ball hit probability on a pitcher’s ground ball rate and also give several examples of pitchers that illustrate properties of the model.

Acknowledgment

The data used in this study was obtained from www.retrosheet.org. I thank Tom Tango for his help with this work.

References

1. Bradbury J.C. (2005). Another look at DIPS [Online]. Available: www.hardballtimes.com/another-look-at-dips1.

2. Carleton R. (2009). If you’re happy and you know it, get on base [Online]. Available: www.hardballtimes.com/tht-live/if-youre-happy-and-you-know-it-get-on-base.

3. Carleton R. (2012). It’s a small sample size after all [Online]. Available: www.baseball.prospectus.com/article.php?articleid=17659.

4. Carleton R. (2013). Should I worry about my favorite pitcher? [Online]. Available: www.baseballprospectus.com/article.php?articleid=20516.

5. Cartwright B. (2011). What ground balls can tell us about fly balls. In Distelheim J., Simons G., Hale C., editors, The Hardball Times Baseball Annual, 2012, pages 249–254. ACTA Sports, Chicago.

6. Fast M. (2011a). Who controls how hard the ball is hit? [Online]. Available: www.baseballprospectus.com/article.php?articleid=15532.

7. Fast M. (2011b). How does quality of contact relate to BABIP? [Online]. Available: www.baseballprospectus.com/article.php?articleid=15562.

8. Fox D. (2005a). Tony LaRussa and the search for significance [Online]. Available: www.hardballtimes.com/tony-larussa-and-the-search-for-significance.

9. Fox D. (2005b). A short digression into log5 [Online]. Available: www.hardballtimes.com/a-short-digression-into-log5.

10. Hammond C., Johnson W., & Miller S. (2015). The James function. Mathematics Magazine, 88, 54–71.

11. Healey G. (2015). Modeling the probability of a strikeout for a batter/pitcher matchup. IEEE Transactions on Knowledge and Data Engineering, 27(9), 2415–2423.

12. James B. (1983). The Bill James Baseball Abstract 1983. Ballantine Books, New York, NY.

13. James B. (1987). The Bill James Baseball Abstract 1987. Ballantine Books, New York, NY.

14. James B. (2014). The Bill James Handbook 2015. ACTA Sports, Chicago.

15. Jensen P. (2009). Using HITf/x to measure skill [Online]. Available: www.hardballtimes.com/using-hitf-x-to-measure-skill.

16. Koo A. (2013). More moneyball: Oakland’s other platoon advantage [Online]. Available: www.baseballprospectus.com/article.php?articleid=22435.

17. Lederer R. (2009). BABIP: slicing and dicing groundball out rates [Online]. Available: baseballanalysts.com/archives/2009/01/babip_slicing_a.php.

18. Lependorf D. (2013). Where do ground balls come from? [Online]. Available: www.hardballtimes.com/where-do-ground-balls-come-from.

19. Levitt D. (1999). The batter/pitcher match up [Online]. Available: baseball-thinkfactory.org/btf/scholars/levitt/articles/batter-pitcher-matchup.htm.

20. Lichtman M. (2004). DIPS revisited [Online]. Available: www.baseballthinkfactory.org/primate_studies/discussion/lichtman_2004-02-29_0.

21. McCracken V. (2001). Pitching and defense: How much control do hurlers have? [Online]. Available: www.baseballprospectus.com/article.php?articleid=878.

22. Morey L., & Cohen M. (2015). Bias in the log5 estimation of outcome of batter/pitcher matchups, and an alternative. Journal of Sports Analytics, 1(1), 65–76.

23. Murphy M. (2015). Are groundball pitchers overrated [Online]. Available: www.hardballtimes.com/are-groundball-pitchers-overrated.

24. Stern H., & Sugano A. (2007). Inference about batter-pitcher matchups in baseball from small samples. In Albert J. and Koning R., editors, Statistical Thinking in Sports, pages 153–165. Chapman and Hall/CRC.

25. Swartz M. (2010a). Why SIERA doesn’t throw BABIP out with the bath water [Online]. Available: www.baseballprospectus.com/article.php?articleid=10281.

26. Swartz M. (2010b). Ground-ballers: better than you think [Online]. Available: www.baseballprospectus.com/article.php?articleid=12581.

27. Tango T., Lichtman M., & Dolphin A. (2007). The Book: Playing the Percentages in Baseball. Potomac Books, Dulles, Virginia.

28. Tippett T. (2003). Can pitchers prevent hits on balls in play? [Online]. Available: 207.56.97.150/articles/ipavg2.htm.

29. Wooldridge J. (2013). Introductory Econometrics: A Modern Approach. South-Western, Cengage Learning, Mason, OH, 5th edition.