How much luck is too much luck?

The pancake of luck and statistical analysis

Nov 25, 2023

The Atlanta Braves traded a mishmash of lottery-ticket, possibly washed players to the Chicago White Sox for Aaron Bummer, a pitcher with a terrible 6.79 ERA. Now, the opinion on this trade, if you asked the Sox fan who happens to live one floor above you, is that the Chicago White Sox won. Mike Soroka! He was good (?)! But if you browse Twitter or follow any of the major analytically minded accounts there, you’ll see the opposite opinion. Bummer had great peripherals - a 3.50-something FIP/xERA/xFIP - and he had nasty stuff. In Atlanta’s bullpen, he’ll be able to shed some of his bad luck and shake off the demons of Guaranteed Rate Field.

It often comes down to luck when evaluating players - pitchers especially. Maybe they just didn’t have their A-game today, and their next start will hopefully be better. Maybe their ERA is a full run or two higher than their FIP. Oh boy, I see a player I must give a 2-year, $10M contract! Voilà, I’m the Rays! For teams, fans, and analysts alike, the skill of determining when a player is unlucky but good is incredibly important. But when does this disconnect between luck and performance occur? How deep does luck run? Is it all luck? Is it skill?

I think there are layers to statistical analysis, and as the analysis gets deeper and deeper, we get closer to true skill. For analyzing a player, there’s a sweet spot for how deep you should go, and it depends on what the purpose is.

Pancakes

Pancakes. They have layers. And so does statistical analysis, depending on how deep into the rabbit hole you fall. The layers have different flavors, but they share one goal: to help evaluate the value of a player.

The most surface level you can get is performance within a single game. If you hit 3 home runs in 4 plate appearances in one game, that’s good! If you get a platinum sombrero (5 strikeouts in one game as a batter), that’s really bad! Ultimately, the goal is to score/prevent runs in games so that the team wins, and the only way to do this is to have good single-game performances. This is nearly the closest you can get in statistical analysis to “Who wins and who loses.” However, there is very little sticking power. Hot and cold streaks might exist, but one single-game performance tells you little about the next one. Single-game performances are driven by skill, and skill can be gauged by looking at some of the deeper layers.

We must then look one layer deeper at the aggregate. Single-game performances do matter, but it’s a long season. Hopefully, it’ll all balance out at the end. The aggregate is stickier than single-game performances, which is why teams don’t base 100% of their decisions on single games. Maybe in the playoffs, but Eddie Rosario was probably not worth the $18M the Braves gave him. Still, aggregates are not single-game performances, and they are not everything. Look at the 2020 Mets, for example. If you got an alien robot who pitched 30 games of 9 innings each with a 3.00 ERA, would you rather have the robot give up 3 runs every game or have it throw a shutout half the time and give up 6 runs the other half? In the half-and-half scenario, would you rather have the robot pitch the shutouts in close games or blowouts? Timing is important. It’s hard to control timing, but there is some skill in bullpen management and lineup construction, such that three-run homers can be maximized and the best bullpen performances come in close games. This is where your ERA lies.
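The robot question can be sketched in a few lines of Python. This is only a toy illustration: the run-support numbers are made up, every run is treated as earned, and a tie is simply counted as "not a win." It shows that both robots land on the same 3.00 ERA while their win totals depend entirely on when the runs are allowed.

```python
# Hypothetical illustration: two "robot" pitchers with identical 3.00 ERAs
# but different game-by-game run distributions.

GAMES = 30
INNINGS_PER_GAME = 9


def era(runs_allowed):
    """ERA = 9 * earned runs / innings pitched (all runs treated as earned)."""
    innings = len(runs_allowed) * INNINGS_PER_GAME
    return 9 * sum(runs_allowed) / innings


def wins(runs_allowed, run_support):
    """Count games where the offense outscores the robot (ties are not wins)."""
    return sum(
        1 for allowed, scored in zip(runs_allowed, run_support) if scored > allowed
    )


steady = [3] * GAMES             # gives up 3 runs every single game
streaky = [0, 6] * (GAMES // 2)  # alternates a shutout with a 6-run outing

# Assumed run support per game: invented numbers, not real team data.
good_offense = [4] * GAMES
weak_offense = [2] * GAMES

for name, robot in (("steady", steady), ("streaky", streaky)):
    print(
        name,
        "| ERA:", era(robot),
        "| wins with 4 runs of support:", wins(robot, good_offense),
        "| wins with 2 runs of support:", wins(robot, weak_offense),
    )
```

Under these assumptions, the steady robot is the better teammate for a good offense and the streaky robot is the better teammate for a weak one, even though the aggregate says they are the same pitcher.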

But then we start pulling out a crystal ball and comparing what did happen with what should have happened. This is your FIP, this is your xFIP, this is your xERA, and this is your SIERA. These do amazing things, and for the most part, they’re pretty good at evaluating a player. But there are flaws. Big flaws. If you’re a sinkerballing, groundball-heavy pitcher, FIP fails you, because FIP ignores a pitcher’s influence on batted balls other than home runs. I used to have a pretty decent-sized gripe with xFIP, and I’ll explain why. xFIP takes FIP and swaps the pitcher’s actual home runs for his flyballs multiplied by a one-size-fits-all, league-average HR/FB rate. On average, a flyball will result in a home run about 10% of the time, but there is variation around that. Pitchers with more meatbally fastballs will have a higher HR/FB rate. Pitchers with a sweeper or a steep VAA fastball will be able to coax infield flyballs, which are good flyballs. Flyballs are not the devil.

xERA looks at exit velocity and launch angle, turns them into a wOBA estimate, and then converts that into an ERA estimate. I have fewer gripes with this, but the two big ones for me are 1) spray angle and 2) facing really good opponents. Including spray angle would make xERA too descriptive and not predictive. Pitchers may have some control over it (location and unexpectedly fast or slow pitches), but spray angle is largely controlled by the hitter. Do you know what else comes from the hitter? Batter skill. If a pitcher (say, Cionel Perez) faced only Aaron Judge, he would have a worse xERA than if he faced only Connor Wong, even if he threw exactly the same pitches. Come to think of it, opponent skill basically affects all of these stats, including ERA. Expected stats are incredibly valuable, but they have major flaws. They aren’t the greatest evaluation of skill, though some people use them that way, and that’s fine for the most part.
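To make the FIP/xFIP gripe concrete, here is a minimal sketch of the commonly published formulas. Treat the specifics as assumptions: the FIP constant changes every season (3.10 is just a placeholder), the 10% league HR/FB rate is the rough figure mentioned above, and both stat lines are invented for illustration. Innings must be true decimals here (180.0), not box-score notation like 180.1.

```python
# Sketch of the commonly published FIP and xFIP formulas.
# FIP_CONSTANT and LEAGUE_HR_PER_FB are assumed placeholder values;
# the real ones are recalculated for each season.

FIP_CONSTANT = 3.10
LEAGUE_HR_PER_FB = 0.10


def fip(hr, bb, hbp, k, ip, constant=FIP_CONSTANT):
    """FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + constant."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fly_balls, bb, hbp, k, ip,
         lg_hr_per_fb=LEAGUE_HR_PER_FB, constant=FIP_CONSTANT):
    """xFIP swaps actual home runs for fly balls times the league HR/FB rate."""
    expected_hr = fly_balls * lg_hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Two hypothetical stat lines: identical except for home runs allowed.
pitcher_a = dict(hr=15, bb=50, hbp=5, k=200, ip=180.0)
pitcher_b = dict(hr=25, bb=50, hbp=5, k=200, ip=180.0)
FLY_BALLS = 200  # assumed to be the same for both pitchers

for name, line in (("A", pitcher_a), ("B", pitcher_b)):
    without_hr = {key: value for key, value in line.items() if key != "hr"}
    print(
        name,
        "| FIP:", round(fip(**line), 2),
        "| xFIP:", round(xfip(FLY_BALLS, **without_hr), 2),
    )
```

The two pitchers get different FIPs but the same xFIP, because xFIP never looks at the home runs they actually allowed. If pitcher A really does have a skill for weak flyballs, xFIP has no way of seeing it.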

We turn back to the crystal ball, and this time we predict. This is your pERA, your Dan Szymborski ZiPS, and your Steamer 600. I’m going to tell you a bit of a story. Over the Christmas break of 2021, I started getting into really deep baseball analysis. I looked at Twitter and saw this great thing from Max’s Sporting Studio called pERA and pwOBA. And I, as a broke teenager, had no way to get access because it was behind a paywall. I also had no clue how to use R, so I tried my very darndest to recreate pERA using a multiple regression formula in Google Sheets. Had it not been for Max Goldstein putting his stats behind a paywall, I wouldn’t be here writing this article. He recently made his pstats public, so I’m very happy that I am finally able to look at those. pERA looks at things like velocity, swing rate, batted ball metrics, percentile exit velocities, and other advanced stats to make its predictions. Predictive stats attempt to look at the future by looking at the past and saying, “Hey, that looks like it’d translate over.” These are typically advanced models that use previous metrics or performance to predict what will happen next year. Dan Szymborski, love that guy. I’m also probably not here without him. These stats might be more distant from single-game performances, but they are definitely sticky. Their sole purpose is to predict the next year.

And then we start getting into the more confusing aspects. We’re starting to lose the ERA-like numbers and are getting to the point where we have to seek out Alex Chamberlain’s pitch leaderboard or the specialty leaderboards on FanGraphs to find the stats. Think about this for a second. A pitcher isn’t good because they have a good FIP or xFIP. Those stats signal that the pitcher is good, but they are not the root reason. Spencer Strider is good not because he has a low FIP but because he can strike many batters out. And why? Because he gets in-zone whiffs thanks to his high velocity, good release point, and gyro slider. The next pancake level is understanding why. Why are these pitchers having good results? Plate discipline, exit velocities, whiffs, chases, called strikes, and contact quality are all things that explain why a player is good. When evaluating a player, I like to look here. It may not be the greatest predictor of actual performance, but it’s a pretty good gauge of skill.

We go one layer deeper and look at the why of the why. Why is this pitcher getting so many whiffs? Well, it’s probably because he throws 102 with a splinker. Why does this batter whiff so much in the zone? Well, it’s probably because there’s a hole in his swing. This is where your Stuff+ and Command+ lie. I will argue that the actual pitch metrics are one layer of pancake deeper than Stuff+ and Command+, because Stuff+ sometimes misses nuances that make pitches better, and the pitch metrics explain why the Stuff+ is good. The deeper the pancake gets, the closer to the root cause you get. I especially like this layer for explaining players because it gets closer to the root cause of their profile. You can see a guy and know that he’s good, but knowing why is just as captivating to me.

Then comes the future. Biomechanics? Bat design? Measuring the abilities of batters and pitchers and then seeing how the limitations of their bodies affect their potential and their performance may be where things are headed. This is a highly specific art that I, for the most part, do not understand. Some players are physically better equipped to do certain things than others. You may be taller, shorter, larger, skinnier, or may have a better ability to supinate, but how can you maximize those things on the mound? It ties heavily into pitch design and swing design. These are the minimal changes that end up making a large difference in performance, but they’re often very hidden. This is the future, but it often requires expensive, expensive technology. That’s why there are organizations such as Driveline and Tread Athletics that make this technology available to players who want to improve.

Maybe it goes one level deeper. There are memes about how sabermetrics all boils down to whether “the player got that dawg in them,” and honestly, for as much flak as traditional scouting gets (due to things like Moneyball), judging based on character might be underrated. All of the layers build on one another, and it feels weird saying this, but it all starts mentally. That was a weird thing to say. A good way to judge a player is to do the traditional scouting but then also do the analytical scouting. It’s the best of both worlds. Both are crucial.

So, where to lie

We have created layers of analysis. The top is more surface-level and closer to the actual single-game performance, but sometimes it falls short when trying to determine the true skill of a player. RBIs are heavily correlated with how well a team does, but most analysts would say that RBIs do not reflect the true skill of a player. Same with wins. Now we look at the bottom of the pancake, and I’d say that the bottom is a good gauge of skill. However, it’s not perfect for determining performance. If a player with perfect mechanics can’t hit the ball, they can’t hit the ball. As we move up, we get closer to performance, and as we move down, we get closer to skill. Moving down the pancake also moves us forward in time. We go through old-school analytics with wins and RBIs, then the Moneyball era, then a couple more evolutions, then the Statcast era, analysis of pitches, the advent of Driveline/Tread, and now the future with biomechanics. It’s a war of analytics, with people who don’t want to let go of the past pushing back against the new minds.

Maybe they don’t have to, and maybe there’s a sweet spot. For something like Cy Young voting, I think the ideal place to be with analytics is right around the FIP/xERA/xFIP range, but I think it’s perfectly valid to base your vote on ERA. A 50/50 blend of ERA and estimators might actually be a very valid way to vote. The Cy Young goes to the best pitcher, and it makes sense to vote for the pitcher with the best performance. It also makes sense to vote for the pitcher with the best skill. That being said, it doesn’t make sense to vote for a pitcher based on whether they have perfect mechanics or whether they “got that dawg in them” or whether they struck out 10 in an 8-inning masterpiece that one day in July. Those are the extremes. Hall of Fame voting, I think, is closer to the ERA range. I think old-school Hall of Fame voters are still looking at those surface-level metrics (RBI/Wins/Saves/H/HR), and I think the voting would be better off if aggregates like the triple slash were involved.

Analyzing players is different. If you’re a team or an analyst, the closer you can get to the bottom layers of the pancake, the better, but it is still important to keep your eyes on the actual performance. The bottom layers can help explain the top layers. They give a profile, they give nuance, and they give a justification. Looking at 115 wRC+ players without knowing their profile is confusing and hard to understand.

Now the top layers. I don’t think it makes sense to let the top layers die. They’re the enjoyment of the game. We want to see hitters hit, pitchers pitch, and our teams win. If everything were numbers, we’d lose 80% of the people who watch baseball. It’s fun to watch ball, and the single-game performances are what draw us in and keep us hooked. The layers below supplement our understanding, but the true enjoyment comes from the top.

Thanks for reading. Will try to write more soon.

Spin Doctor BB
