Would Matt Williams have broken the single-season home run record?
--------------------------------------------------------------------

The 1994 baseball season was particularly exciting: a relatively large number of runs were scored, and several players had great hitting seasons. Some players appeared to have reasonable chances of exceeding the single-season record of 61 home runs, and one player was close to having a .400 batting average. Unfortunately, the season ended on August 11 because of a strike, and fans were left to wonder what would have happened to particular batting records if a full 162-game season had been played.

In particular, let's consider Matt Williams who, on August 11, had hit 43 home runs in his first 445 at-bats. Suppose that, had the strike not occurred, Williams would have stayed uninjured and had 199 additional at-bats during the remainder of the season. (This number is the estimate used in the 1995 Major League Handbook in its simulation of the 1994 season.) Did he have a reasonable chance of hitting more than 61 home runs in total and setting the home run record? In other words, was it likely that Williams would hit at least 19 home runs in his final 199 at-bats?

The answer to this question depends on one's opinion about Williams' home run hitting ability during the last part of the season. One could assume that his rate of hitting home runs would remain similar to the rate displayed during the first part of the season. Or one could regard Williams as unusually "hot" during the first 445 at-bats and expect him to cool down during the remainder of the season.

Suppose that one views individual at-bats during the last part of the season as independent Bernoulli trials, where p is the probability that Williams hits a home run in a single at-bat. What is the interpretation of the home run rate parameter p?
It is the proportion of at-bats in which Williams would hit a home run if he were allowed to take a hypothetical large number of at-bats under identical conditions during this last part of the season. In teaching, it is important to distinguish this probability p from the sample proportion, p-hat, of home runs that Williams might hit during his final 199 at-bats.

Also, the value of p represents Williams' home run ability only during this last part of the season. It is quite possible that Williams' probability of hitting a home run changes during the course of a season. It is also possible that his home run hitting ability changes over the course of his career. His home run probability at the end of the 1994 season could be very different from his chances of hitting home runs in previous years.

Once one understands the meaning of the Bernoulli parameter p, one constructs a probability distribution on a set of plausible values for p that reflects one's beliefs about Williams' home run ability during the remainder of the 1994 season. Since opinions about Williams' ability can vary, let us consider the beliefs of three hypothetical baseball fans, Allan, Bob and Sally. In the following, we discuss the opinions of these three fans about Williams' home run ability, and describe how one can construct a probability distribution for each fan that matches his or her opinions.

The first fan, Allan, believes that Williams' home run ability for the remainder of the 1994 season is best measured by his performance during the first part of the 1994 season. In fact, he thinks that Williams' probability of hitting a home run, p, would be the same over the entire season. In addition, he believes that Williams' home run performance in previous years is irrelevant for learning about his performance during 1994. In other words, Allan believes that Williams' home run probability in 1993 and earlier years is different from his 1994 home run probability.
This could be due to a new batting swing or stance, or perhaps to some extra strength training during the off-season.

Since Allan believes that Williams' probability p remains constant over the entire 1994 season, he will use the batter's home run data from the observed first part of the season to construct his probability distribution. Allan initially knows very little about the value of p, so he considers a large set of possible home run rates .01, .02, .03, ..., .20 and assigns each value in this set the same probability. Using the method described in Section 2, he updates his probabilities with the 1994 data: s = 43 successes and f = 402 failures, where a success is defined as hitting a home run. The revised probabilities for Allan are shown in the table below, together with the probabilities that the other two fans, Bob and Sally, assign to each value of p.

  p      .01  .02  .03  .04  .05  .06  .07  .08  .09  .10  .11  .12  .13  .14
  ---------------------------------------------------------------------------
  Allan    0    0    0    0    0    0  .03  .13  .25  .28  .19  .08  .03  .01
  Bob      0    0    0  .14  .14  .14  .14  .14  .14  .14    0    0    0    0
  Sally  .06  .06  .06  .12  .19  .19  .12  .06  .06  .06    0    0    0    0

The second fan, Bob, has a different opinion about Williams' home run ability for the remainder of the 1994 season. He thinks that Williams' batting performance in his previous major league seasons is relevant for learning about p. So Bob looks at Williams' home run statistics for the previous five seasons. The table below displays the number of home runs, the number of at-bats, and the observed home run rate for these earlier seasons and for the first part of 1994.

For each season, one can also learn about the corresponding season home run probability by the method of Section 2. One can assume a priori that the home run probability is uniform over the grid of values .01, .02, ..., .20, and obtain a posterior distribution for the probability using the observed data from the season. In this table, the column 'TRUE RATES' lists the values of the season home run probability that received a posterior probability of at least .01.
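The grid calculation just described can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (a uniform prior on .01, .02, ..., .20 and independent Bernoulli trials); the function names `grid_posterior` and `plausible_values` are made up for this sketch and are not taken from Section 2.

```python
from math import log, exp

def grid_posterior(successes, failures, grid=None):
    """Posterior over a discrete grid of p values, starting from a uniform prior."""
    if grid is None:
        # The grid used in the text: .01, .02, ..., .20
        grid = [round(0.01 * k, 2) for k in range(1, 21)]
    # Bernoulli log-likelihood at each grid value; the uniform prior is a
    # constant and cancels when the weights are normalized.
    log_like = [successes * log(p) + failures * log(1 - p) for p in grid]
    top = max(log_like)
    weights = [exp(v - top) for v in log_like]  # rescaled for numerical stability
    total = sum(weights)
    return {p: w / total for p, w in zip(grid, weights)}

def plausible_values(posterior, cutoff=0.01):
    """Grid values whose posterior probability is at least `cutoff`."""
    return [p for p, prob in posterior.items() if prob >= cutoff]

# First part of 1994: 43 home runs (successes) and 402 other at-bats (failures).
posterior_1994 = grid_posterior(43, 402)
for p in plausible_values(posterior_1994):
    print(f"p = {p:.2f}   posterior = {posterior_1994[p]:.2f}")
```

Calling `grid_posterior` with the home runs and remaining at-bats of another season gives that season's posterior in the same way. Values whose posterior probability sits close to the .01 cutoff can fall in or out of the plausible set depending on rounding.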
In 1989, for example, Williams hit 18 home runs in 292 at-bats for an observed home run rate of .062. This season's hitting data is consistent with home run probabilities of .04, .05, .06, .07, .08, and .09.

  YEAR   HOME RUNS   AT-BATS   OBS. RATE   TRUE RATES
  ---------------------------------------------------------------------------
  1989      18         292       .062      {.04, .05, .06, .07, .08, .09}
  1990      33         617       .053      {.04, .05, .06, .07, .08}
  1991      34         589       .058      {.04, .05, .06, .07, .08}
  1992      20         529       .038      {.02, .03, .04, .05, .06}
  1993      38         579       .066      {.05, .06, .07, .08, .09}
  1994      43         445       .097      {.07, .08, .09, .10, .11, .12, .13}

So after looking at the calculations summarized in this table, Bob has some idea about Williams' home run probabilities for the years 1989 - 1993 and the first part of the 1994 season. He is reluctant to pool all of the home run data from previous years, since he believes that Williams' home run ability has changed from 1989 through 1994. But he thinks that the value of the home run probability p for the remainder of the 1994 season is among the plausible home run probabilities from previous seasons.

After some reflection, he thinks that p is contained in the set {.04, .05, ..., .10}. He notes that Williams' home run probability in 1992 could have been as low as .02, but he thinks that Williams will be hitting better in the remainder of 1994 than he hit in 1992. Also, it is possible, from the observed data in 1994, that Williams' home run probability could be as high as .13. But Bob feels that this is too big an improvement from the probability values of previous seasons, so he places an upper bound of .10 on Williams' home run probability for the end of 1994. It is hard for Bob to prefer particular values of p, so he decides to assign each value in the set {.04, .05, ..., .10} the same prior probability.

The third fan, Sally, has opinions about Williams different from those of the previous two fans.
She has been particularly impressed with the home runs that Williams has hit in the first part of 1994. However, she thinks that Williams has been "hot" during the first part of the season and that it would be difficult for him to continue to hit home runs at this rate for the remainder of the season. Also, there will be additional pressure placed on Williams during the remainder of the season. The single-season home run record of 61 was set over 30 years ago, and Williams will receive extensive media coverage if he gets close to the record. This extra media pressure could have an adverse effect on Williams' hitting ability during this time. (An alternative explanation for this belief is the well-known regression effect. In this setting, the regression effect is the phenomenon that players who have extreme batting performances in the first half of a season tend to have more average performances during the second half of the season.)

How does Sally construct her prior distribution for p? From looking at Williams' home run record from previous seasons, she thinks that a plausible set of values for p runs from .01 to .10. She feels that Williams will cool down from the .097 rate that he displayed in the first part of the 1994 season. She places the largest probabilities on the values .05 and .06, which are more consistent with Williams' performance during previous years.

Although .05 and .06 are the most probable values of p, Sally thinks that there is a small chance that Williams will rise to the occasion and break the home run record. Also, there is a small chance that Williams will fall into a significant slump and hit very few home runs during the remainder of the season. So she assigns probabilities to p that slowly decrease as one moves away from the most probable values. The extreme home run probabilities of .01 and .10 receive prior probabilities of .06.
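One simple way to build a prior of this shape is to choose relative weights for the values of p and divide by their total. The sketch below is only an assumption about how Sally might proceed: the weights 1, 1, 1, 2, 3, 3, 2, 1, 1, 1 are hypothetical, chosen here because they give probabilities close to her row in the table (.06, .12 and .19 to two decimals).

```python
from fractions import Fraction

# Hypothetical relative weights for p = .01, .02, ..., .10 (an assumption of
# this sketch): largest at .05 and .06, tapering off toward the extremes.
weights = [1, 1, 1, 2, 3, 3, 2, 1, 1, 1]
grid = [Fraction(k, 100) for k in range(1, 11)]

# Dividing each weight by the total turns the weights into probabilities.
total = sum(weights)
sally_prior = {p: Fraction(w, total) for p, w in zip(grid, weights)}

# Prior mean of p: the home run rate Sally expects before seeing further data.
prior_mean = sum(p * prob for p, prob in sally_prior.items())

for p, prob in sally_prior.items():
    print(f"p = {float(p):.2f}   prior = {float(prob):.4f}")
print(f"prior mean of p = {float(prior_mean):.3f}")
```

With these weights the prior mean of p is .055, in line with the home run rates Williams showed in most of his earlier seasons.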
What is the implication of these different beliefs about Williams' home run rate p for the prediction that he will break the home run record? Using the basic formula of Section 2, one can compute the predictive probabilities of the number of home runs y for each of the three fans. These probability distributions are displayed as dotplots below.

The first thing to notice is that, for each prior, there is substantial variability in the number of home runs Williams will hit in the remainder of the season. For example, if Allan's prior is used, then Williams could hit anywhere between 10 and 30 additional home runs. Second, these three predictive distributions vary significantly with respect to location and spread, indicating that one's prior probabilities about Williams' home run rate can have a large impact on one's prediction of his performance during the remainder of the season.

[Dotplots of Allan's, Bob's and Sally's predictive distributions of y, one plot per fan; in each plot the horizontal axis y runs from 0 to 35.]

Specifically, we are interested in computing the probability that Williams will break the home run record, which is the probability that the number of home runs, y, is 19 or greater.
This probability is easily computed from one's predictive distribution: one sums the predictive probabilities for all values of y equal to 19 or greater. For the three distributions corresponding to Allan, Bob and Sally, these probabilities are .571, .205, and .099, respectively. So the three fans assign very different probabilities to the event that Williams would have broken the record.

A baseball fan who looks at the above calculations may wonder whom to believe; is it possible to reconcile the above analyses? The general answer is no. This example illustrates the sensitivity of statistical conclusions to assumptions. To predict Matt Williams' home run performance, one must make some assumptions about how his future home run hitting ability (measured by the probability p) is related to what he displayed in previous seasons and in the first part of the 1994 season. Since these assumptions critically affect the final conclusion, the example forces one to think about one's assumptions more carefully.

Personally, although I was impressed with Williams' home run performance in 1994, I view the season home run record as a very difficult hurdle to overcome, and so I would give the record-breaking event {y greater than or equal to 19} a relatively small probability. The implication of this belief is that I think that Williams' home run probability p for the remainder of the season would be small, and so I would assign probabilities similar to those assigned by Bob or Sally. Likewise, another baseball fan would have to think about her prior beliefs about Williams' home run ability to form her personal prediction of Williams breaking the home run record.
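As a rough check on the three record-breaking probabilities, the calculation can be sketched in Python. This is a sketch under stated assumptions, not the Section 2 formula verbatim: it averages the binomial probability of 19 or more home runs in 199 at-bats over each fan's distribution for p. Allan's distribution is recomputed from a uniform prior and the 43-for-445 data, Bob's is uniform on .04 to .10, and Sally's uses the hypothetical weights 1, 1, 1, 2, 3, 3, 2, 1, 1, 1, which approximate her row in the table.

```python
from math import comb

AT_BATS = 199  # remaining at-bats assumed in the text
NEEDED = 19    # additional home runs needed to pass 61

def binomial_tail(n, p, k):
    """P(Y >= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k, n + 1))

def record_probability(dist):
    """Average the binomial tail probability over a distribution on p."""
    return sum(prob * binomial_tail(AT_BATS, p, NEEDED) for p, prob in dist.items())

def normalize(weights):
    """Scale nonnegative weights so that they sum to one."""
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

# Allan: uniform prior on .01, ..., .20 updated with 43 home runs, 402 other at-bats.
allan = normalize({k / 100: (k / 100)**43 * (1 - k / 100)**402 for k in range(1, 21)})
# Bob: equal prior probability on .04, ..., .10.
bob = normalize({k / 100: 1.0 for k in range(4, 11)})
# Sally: hypothetical weights (an assumption of this sketch) on .01, ..., .10.
sally = normalize({k / 100: w for k, w in zip(range(1, 11), [1, 1, 1, 2, 3, 3, 2, 1, 1, 1])})

for name, dist in [("Allan", allan), ("Bob", bob), ("Sally", sally)]:
    print(f"{name}: P(y >= 19) = {record_probability(dist):.3f}")
```

The printed values should land close to the .571, .205 and .099 quoted above; small differences can arise from rounding in the tabled probabilities and from the assumed weights for Sally.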