Modeling Free Agent Contracts in MLB

Ajay Patel
7 min read · Mar 9, 2023


David Freese (USA Today)

A lot of what’s been done in baseball’s public sphere has focused on evaluating players’ capabilities on the field. And for good reason: that’s where 99% of the available data is, and it’s what matters. However, one piece has been missing for a good while: understanding how a player’s past performance can be used to predict his next contract.

This work, done by me and my good friend Ben Wieland, aims to use publicly available data to create contract predictions for free agents. Contemporary ideas of player valuation have largely come through WAR statistics from Fangraphs and Baseball Reference. These values can of course be converted onto a monetary scale, but we shortened that gap here, taking similar sabermetric stats and applying them straight to contract predictions.

Similar work has been done in hockey and football, but we saw no counterpart in baseball. We saw an opportunity to apply conventional metrics to an off-the-field aspect of the sport, and we ran with it. In the rest of this article, you’ll find our process, our results, and our hopes for future work.

The Process

First off, this project wouldn’t be possible without this spreadsheet maintained by Jeff Euston. He’s kept track of every free agent deal since 1991(!) in that sheet. Legend.

We didn’t need that many contracts to test the model on, however, so we trimmed the dataset to players who signed in 2014 or later. This dataset mainly provides our response variables: we trained separate models to predict the number of years on a deal and the AAV (average annual value) of a deal. We used separate models because each of the model types we chose predicts a single dependent variable, so one model could not give us both years and AAV. The two predictions are later multiplied to get a player’s total contract projection.
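To make that last step concrete, here is a minimal Python sketch of combining the two predictions. The function name is ours for illustration and is not taken from the project’s code; the numbers are the David Freese projection quoted later in this article.

```python
def total_contract_value(predicted_years: float, predicted_aav: float) -> float:
    """Total projected value = projected years x projected AAV."""
    return predicted_years * predicted_aav


# David Freese's 2019 projection from this article: 1 year at $4,507,248 AAV.
freese_total = total_contract_value(1, 4_507_248)
print(f"Projected total: ${freese_total:,.0f}")
```

Because the total is a product, an error in either model carries straight through to the total, which is worth keeping in mind when reading the outliers below.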

Next we needed our input data, all of which came from Fangraphs through baseballr. We kept rolling values of input stats such as WAR and wRC+ to capture a player’s recent value, which of course has more influence on a potential free agent deal than stats he put up 10 years ago. I’ll skip over some of the smaller steps we took there and get to the modeling. We built separate models for hitters and pitchers because of the natural differences in how we evaluate those positions. AAV was predicted with a generalized linear model. The inputs for hitters included age, fWAR, ISO, wRC+, K-BB%, and a player’s position group. For pitchers we kept it simple, using just age, fWAR, innings pitched, and whether the player was a starting pitcher or a reliever. Note that AAV values were adjusted for the year a deal was signed, since salaries have generally trended upward over time; the adjustment lets us compare deals on a fairer scale.
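As a rough illustration of two of those preprocessing ideas, the Python sketch below computes a recency-weighted rolling value and rescales an AAV by signing year. Everything specific in it is an assumption for illustration: the weights, the three-season window, the market index, and the function names are hypothetical and are not the values or code we used.

```python
def rolling_value(seasons, weights=(0.5, 0.3, 0.2)):
    """Weighted average of the most recent seasons, newest first.

    The weights are illustrative assumptions; players with fewer
    seasons than weights are averaged over what they have.
    """
    recent = seasons[:len(weights)]
    used = weights[:len(recent)]
    return sum(w * s for w, s in zip(used, recent)) / sum(used)


def adjust_aav(aav, signing_year, market_index, base_year):
    """Rescale an AAV into base-year dollars using a per-year market index."""
    return aav * market_index[base_year] / market_index[signing_year]


# Hypothetical fWAR for a hitter's last three seasons, newest first.
print(rolling_value([4.0, 3.0, 2.0]))

# Hypothetical index: the market in 2014 sat at 90% of its 2022 level.
index = {2014: 0.90, 2022: 1.00}
print(adjust_aav(9_000_000, 2014, index, 2022))
```

A weighted window is only one way to emphasize recent seasons; a plain multi-year average or sum would fit the same description.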

Years were a bit weird to predict because of the number of 1-year deals that get handed out and the shape of the resulting distribution. We tried KNN and a regular Poisson model, but neither represented the distribution well. We settled on a zero-inflated Poisson model, which does a good job accounting for the number of 1-year deals and matches the distribution much better than the other methods we tested. For more insight on this process, I recommend reaching out to either of us.
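For readers who haven’t met the technique, here is a small Python sketch of the zero-inflated Poisson probability mass function. It is a sketch under assumptions, not our fitted model: we assume here that the count being modeled is years beyond the first (so a 1-year deal is the “zero”), and the parameter values are made up.

```python
import math


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a forced zero,
    otherwise it is drawn from a Poisson distribution with mean lam."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def years_pmf(years, lam, pi):
    """Probability of a deal length, treating (years - 1) as the ZIP count.

    Modeling years - 1 is an assumption made for this illustration.
    """
    if years < 1:
        raise ValueError("a contract is at least 1 year long")
    return zip_pmf(years - 1, lam, pi)


# Hypothetical parameters, not fitted values.
lam, pi = 1.5, 0.4
for years in range(1, 6):
    print(years, round(years_pmf(years, lam, pi), 3))

# The mean of a ZIP count is (1 - pi) * lam, so expected length is one more.
print("expected years:", 1 + (1 - pi) * lam)
```

The extra mass at zero is what lets the distribution pile up on 1-year deals while still leaving a long tail for the rare mega deal.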

The Results

Our models did way better than either of us expected. Way better.

This table shows the root mean squared error, mean absolute error, and median absolute error for each of our separate models. Going into this project, we really had no idea what to expect in terms of model performance, because we hadn’t seen anyone do this for baseball before. However, the results were a pleasant surprise. The AAV models are usually within two to four million dollars of a deal’s actual AAV, with a few outliers tugging the mean error away from the median. As for years, the models are about a year off on most deals, which is understandable given the recent explosion of long-term deals in the market alongside the number of short-term, high-AAV deals.
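For anyone who wants to compute the same three metrics, here is a short Python sketch. It does not reproduce our table: it runs on only five deals quoted in this article (Freese, Encarnación, Raley, Harper, Scherzer), using the article’s rounded AAVs and, where only totals are quoted, predicted total divided by predicted years.

```python
import math
import statistics


def error_summary(actual, predicted):
    """Return RMSE, mean absolute error, and median absolute error."""
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    return {
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
        "mae": statistics.fmean(abs_errors),
        "median_ae": statistics.median(abs_errors),
    }


# AAVs in dollars for five deals discussed in this article (rounded figures).
actual = [4_500_000, 20_000_000, 5_000_000, 25_000_000, 43_000_000]
predicted = [4_507_248, 17_821_239, 4_386_416, 43_500_000, 27_000_000]

for name, value in error_summary(actual, predicted).items():
    print(f"{name}: ${value:,.0f}")
```

Even on this tiny hand-picked sample, the two big misses pull the mean and RMSE well above the median, which is the same pattern described above.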

The graph below shows the distribution of the model’s prediction error in AAV.

y axis is count/frequency

Most deals are off by less than five million dollars. The crown jewel of our models was David Freese. (That’s why he’s the cover image, duh.) Our model projected him for a 1-year, $4,507,248 deal in 2019, and he ended up getting a 1-year, $4,500,000 deal from the Dodgers. We missed the mark by only $7,248, which was unreal to see from something we put together.

You can also get a better look at some of those outlier deals in this representation of model error. That outlier all the way at the end? Bryce Harper’s mega deal with the Phillies. Our models had him projected for 10 years at a whopping $43.5 million AAV when in reality he got 13 years at only about $25 million AAV. The reason for the difference could be as simple as Harper wanting stability, a preference the model can’t really account for. We’ll discuss some more outliers in a bit and what we can learn from them. For now, let’s look at prediction error for pitchers.

y axis is count/frequency

We see a very similar trend to hitters here, as most deals were off by less than five million dollars in AAV, a great mark. One of the larger deals the model missed on was Max Scherzer’s recent contract with the Mets. We had him projected for 2 years at $27 million AAV when he got 3 years at roughly $43 million AAV from Uncle Steve. What gives? The 2020 season, that’s what. Because the season was shortened, cumulative stats like WAR from 2020 aren’t directly comparable to those from other years. Our model doesn’t account for this just yet, but it’s definitely an improvement we have in mind.
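One simple way such an adjustment could look is prorating 2020 counting stats to a full schedule, sketched below in Python. This is an assumption about a possible fix, not something our model does today; the function and constant names are hypothetical.

```python
FULL_SEASON_GAMES = 162
# Scheduled games per team for shortened seasons; 2020 was a 60-game season.
SHORT_SEASONS = {2020: 60}


def prorate(value, season):
    """Scale a cumulative stat (e.g. WAR) up to a 162-game pace."""
    games = SHORT_SEASONS.get(season, FULL_SEASON_GAMES)
    return value * FULL_SEASON_GAMES / games


# Hypothetical example: 2.0 WAR in 2020 versus 5.0 WAR in 2019.
print(prorate(2.0, 2020))
print(prorate(5.0, 2019))
```

Straight prorating also scales up the noise in a 60-game sample, so a real adjustment would probably want some regression toward a player’s other seasons.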

What’s the Use?

Player valuation is always tough to put together. It’s hard to look at so many different numbers and come up with one value that represents what a player should be paid. That’s where we believe our work provides its utility. By comparing our predictions to the actual deals, we can measure overpays and underpays. We can help the common fan understand what goes into how much a player was paid, or deserved to be paid.

Being able to see in real time how bad a free agent deal was, rather than waiting three or four years, also gives us a better idea of how to evaluate front offices. We can see which teams are bargain shopping in free agency versus paying up for someone who maybe didn’t deserve it. At the same time, we can compare players’ agents and see who does a good job of making sure their clients get what they deserve, and who doesn’t.

You can see this yourself, as we’ve aggregated all the relevant data into this Google spreadsheet. It contains our predicted years and AAV for every free agent deal we modeled, the actual years and AAV the player got, and the difference between our model’s valuation and what the team paid. We’re working on adding the signing teams to the sheet, but wanted to share what we have now. I’ll highlight some examples that show the skill of the model, but also some things we can’t capture just yet.

Edwin Encarnación was projected for a 3-year, $53,463,717 deal in the 2017 off-season. He got 3 years and $60,000,000, leaving our model off by less than $10,000,000 total. Brooks Raley was projected for a 2-year, $8,772,832 deal in the 2022 off-season. He got 2 years and $10,000,000. Another win for the model. You can browse through more examples in the spreadsheet linked above to see the utility of our work.

One of the main things our model can’t account for is possible or lingering injuries. Take Carlos Correa, for example. Our model projected him for an 8-year, $248,623,063 deal in the 2022 off-season. He ended up getting only a 3-year, $105,000,000 deal from the Twins. While the AAV was close, our model was way off in terms of years. That’s fair enough, since the model didn’t know Correa’s medicals. While deals like this don’t happen all too often, they represent an edge case our model can’t fully account for, and probably won’t be able to.
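The comparisons in this section come down to simple arithmetic, sketched here in Python with the three deals quoted above. The sign convention (actual minus predicted, so positive means the team paid more than projected) is our choice for this illustration and may not match the spreadsheet’s column.

```python
def valuation_gap(actual_total, predicted_total):
    """Actual minus predicted total value; positive means paid above projection."""
    return actual_total - predicted_total


# name: (predicted years, predicted total, actual years, actual total)
deals = {
    "Edwin Encarnación": (3, 53_463_717, 3, 60_000_000),
    "Brooks Raley": (2, 8_772_832, 2, 10_000_000),
    "Carlos Correa": (8, 248_623_063, 3, 105_000_000),
}

for name, (pred_years, pred_total, act_years, act_total) in deals.items():
    gap = valuation_gap(act_total, pred_total)
    # Per-year view: actual AAV minus predicted AAV.
    aav_gap = act_total / act_years - pred_total / pred_years
    print(f"{name}: total gap ${gap:+,.0f}, AAV gap ${aav_gap:+,.0f}")
```

Correa shows why both views matter: the total gap is enormous because of the missed years, while the per-year gap is far smaller.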

Future Work

We’re still in the rudimentary stages of what we envision this project fully becoming. We want to spend more time selecting model features, especially for pitchers, to better represent what teams may be looking at when they create their valuations, as well as what we think matters and what actually does. We’re also looking to add the adjustment for the 2020 season. General model improvements remain as well. We obviously can’t account for injuries, and those affect free agency negotiations, as we saw with Carlos Correa.

Additionally, the scope of our work only covers free agency. Getting a similar tool for players who go through arbitration would be huge. Arbitration spans so much of the average player’s career, and giving the public a look at what goes into the process would be extremely worthwhile.

However, what we’ve done applies public baseball knowledge in a way we haven’t seen anyone else do so far, giving fans a look into the intricacies of free agency and a better way to analyze the deals that get signed. Bridging the knowledge gap between the public and private spheres of baseball is important to us, and by reading, you’ve contributed to that.
