If you’ve watched a soccer match in the past few years, or even followed any soccer-adjacent Twitter accounts, chances are you’ve come across a stat known as xG. xG, or expected goals, is soccer’s way of quantifying the quality of a shot. In its simplest terms, xG measures how often a shot should be expected to be scored, given its distance from the goal, the shot angle, whether the shot was headed, and a variety of other factors. For a game with such low scoring, xG is helpful for sorting out the game-to-game luck that turns a season into a roller coaster.
Currently, there are a few analogs in basketball, though they’re nowhere near as ubiquitous as xG. The Spax publishes its own version of the metric. The problem with its version of expected effective field goal percentage (eFG%), however, is that shot distance isn’t exact. Instead, it’s separated into bins: two-pointers under 10 feet, two-pointers of 10 feet or more, and three-pointers. That’s useful, but nowhere near the specificity we’d want for evaluating players and shots.
The website pbpstats.com also publishes its expected field goal metrics as something termed "shot quality," which makes it basically equivalent to xG. This allows you to see how often the shots generated by a team or a player would go in based on league average shots. There’s also an in-depth explanation that you can read here.
(In the interest of full disclosure, I only stumbled across the pbpstats model after building my initial model. There are a lot of similarities in the methodology, but also a lot of differences, which I think make my model slightly more refined. I’ll get into those in more depth later. I also never really looked at the source code, since my metrics were made in R, and pbpstats uses Python, which I am completely incompetent in.)
During the playoffs, The Athletic’s Seth Partnow has also posted team expected eFG% and actual eFG%, as well as some above/below metrics. He has an article outlining the tenets of the model used, but as it’s described as "proprietary," I don’t have access to the nuts and bolts behind it. I also don’t have access to the Second Spectrum tracking data, which limits certain elements that I could add to my model.
With all that in mind, it's fairly safe to say I’m not attempting to reinvent the wheel here. I am instead trying to make my own wheel out of the tools I have available, and maybe add a few extra spokes. My source code to scrape shots mostly came from Owen J. Phillips’ blog post here. I simply reworked (hijacked might be a more appropriate term) the code to scrape the shot data rather than pull it into charts. With a few loops (and several hours of patience), I pulled almost every shot from the past ten seasons.
That’s great. Unfortunately, simple shot data doesn’t provide any context for what’s happening in the game, and different shots happen in different contexts.
That’s where the play-by-play comes in.
Using the nbastatR package developed by Alex Bresler, I looped through all games for which I had shot data until I’d collected play-by-play for every game. This, of course, took even more patience than the shot scraping, but again, worth it.
Cleaning the Data
At this point, I had a lot of data (literally millions of rows). But there were still some issues to iron out, as well as some items I wanted to add:
- Player position: This was done utilizing Basketball Reference classifications. There is some funky stuff going on (e.g. Giannis is listed as a point guard in the 2015-16 season), but overall this worked well enough.
- Shot angle: I was forced to remember how to do trig for this. The angle is calculated so that shots directly in line with the rim are 0 degrees (radians suck) and shots directly toward the corner are 90 degrees.
- Shot type: One of four types based on the action type from the play-by-play data (layup, dunk, hook, jump).
- Dummy variables for certain classifications of shots (i.e. whether or not a shot was a pullup, an alley-oop, a stepback, etc.) based on the action type.
- Score differential: Originally, I had intended to use the absolute value of the score differential. After running the model a couple of times, I realized I had only converted half of the values to absolute values, which completely screwed with my algorithm. I managed to fix that issue. I also ended up modifying the score differential by the amount of time left in the game to serve as a proxy for game state. After all, a team that’s up 10 in the second quarter is likely to defend harder than a team that’s up 10 with a minute left.
- Shot clock: This had to be approximated for most shots. I used the time elapsed since the prior event to estimate the shot clock value. Since this was based on play-by-play, certain events ended up with shot clock values less than zero; for those, I used the mean non-transition shot clock value.
- Is_transition: This variable was tagged based on the time elapsed before a shot following a steal. Based on the distribution below, the bulk of those shots happened within about six seconds (shot clock >= 18). I went ahead and added the tag for post-defensive-rebound shots as well.
- Year_index: The number of seasons from the base season. The base season is the season we are testing (more on that a little later), so it is set at 0, with prior seasons at -1, -2, and so on to denote distance from the base year. This variable was added mainly to help capture the non-stationarity in shooting across seasons.
- Is_home: Whether or not the shooting player was playing at home. For the bubble games, I made every team the home team, since teams were all playing in the same arena and there was no travel involved.
- Originally, I had planned to add a qualifier for whether or not a shot was assisted, based on the play-by-play commentary. Unfortunately, the only time a shot was marked as assisted was when it was made, which made trying to differentiate between assisted and unassisted shots useless with the data I have. NBA teams are likely privy to better data than this, but, alas, this is the quandary one finds oneself in when utilizing publicly available data.
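For the curious, the feature engineering above can be sketched roughly as follows. This is an illustrative Python sketch, not my actual R code: the rim-centered coordinate convention, the log-time weighting for game state, and the fallback mean shot clock value are all assumptions made for the example.

```python
import math

def shot_angle(loc_x, loc_y):
    """Angle of the shot relative to the rim, in degrees.

    Assumes rim-centered coordinates (an assumption about the scraped
    data): a shot straight in line with the rim (loc_x == 0) is 0
    degrees, and a shot from the corner (loc_y == 0) is 90 degrees.
    """
    return math.degrees(math.atan2(abs(loc_x), max(loc_y, 0)))

def game_state(score_diff, seconds_left):
    """Hypothetical proxy combining score margin and time remaining.

    The idea is that a 10-point margin with a minute left means less
    defensive effort than one in the second quarter. The exact weighting
    in my model differs; this is one plausible form.
    """
    return abs(score_diff) / math.log(seconds_left + 2)

def shot_clock_estimate(seconds_since_prior_event, mean_fallback=14.0):
    """Approximate shot clock at release from the play-by-play time delta.

    Negative estimates (play-by-play quirks) fall back to the mean
    non-transition value (the 14.0 here is an invented placeholder).
    """
    est = 24 - seconds_since_prior_event
    return est if est >= 0 else mean_fallback
```

A straight-on shot gives `shot_angle(0, 20) == 0.0`, while a corner look gives 90 degrees, matching the convention described above.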
These items are pretty different from the pbpstats model, but I think they added more clarity to the model. I also used a single model for all shots, whereas pbpstats uses three separate models for restricted area shots, non-restricted area shots, and threes. Given that my calibration metrics were good, I didn’t see the need to further partition the data, but that may be something worth evaluating in the future.
Okay, this is where the math comes in. I’ll try not to get too far in the weeds, but I feel like it’s important to at least outline how the model was built and the evaluation metrics. If you don’t really care about this, feel free to skip down to the stats section. I promise it won’t hurt my feelings.
With all the data cleaned, and the play-by-play attached to the shot data, it’s time to actually create models. For this, I started by using a tidymodels framework from Thomas Mock. I wanted to start with a linear model and a random forest model, see which one worked better, and then refine it as needed.
Going in, I expected a linear model to fit the best. I’ve done limited work with event data in soccer, but linear modeling has worked best in that situation.
After a quick pass through this dataset, however, it was apparent that a linear model wasn’t the best framework. The log loss was little better than that of a model that simply guessed the average value, and the calibration plot was a mess.
The subsequent random forest model fit far better (even more so when I managed to rectify my error with the score differential). So instead of a linear model, I fitted two tree-based models: a random forest model via tidymodels, and a boosted tree model (xgboost), the basic structure of which was heavily cribbed from this post at Open Source Football by Ben Baldwin. I won’t bore you with the details; suffice it to say that these wound up fitting much more cleanly.
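Since log loss keeps coming up as the yardstick, here's what it actually computes, sketched in Python for illustration (the outcome and probability values below are made up):

```python
import math

def log_loss(outcomes, probs, eps=1e-15):
    """Mean negative log-likelihood of the predicted make probabilities.

    Rewards confident correct predictions and heavily punishes confident
    wrong ones; lower is better.
    """
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)

# A baseline that always predicts a league-average make rate (~46% here,
# an illustrative number) is the bar any shot-quality model has to clear.
makes = [1, 0, 0, 1, 0]
baseline = log_loss(makes, [0.46] * 5)
model = log_loss(makes, [0.8, 0.2, 0.3, 0.7, 0.1])
```

A model whose log loss barely beats the always-predict-the-average baseline, as my linear model did, adds essentially nothing over knowing the league-wide make rate.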
With the model set, we can also look at the most important features for determining the probability a shot will be made. As expected, shot distance is the biggest factor. The player’s position also plays a role, which isn’t surprising. After all, a point guard is more likely to make a 25-foot shot than a center. Game state also has a relatively large effect, which makes sense given what I mentioned before regarding defense for teams leading or trailing by large margins given time left on the clock.
Even with the model structure set, though, I needed a system for selecting the training data for test seasons. Modeling based on the season we’re fitting would lead to issues with overfitting, which we don’t want.
Instead, I wanted to figure out the best range of prior years to train on. To do that, I started with the 2015-16 and 2016-17 seasons as base seasons, and tested several different prior year ranges for projecting forward to the next season.
After toying with several different frameworks, I settled on trailing three seasons (n-3, n-2, n-1) for season n. This led to the best combined results in terms of log loss and calibration. I also discovered how important it was to use a rolling three years, as the strength of the model deteriorates steadily as you get further and further from the base year. By the time you reach year n+3, a model based on years n-3, n-2, and n-1 is poorly calibrated, which is far from a surprise given the shifting landscape in both shot selection and true shooting across the league.
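The rolling-window scheme itself is simple enough to sketch (season labels here are just illustrative year numbers):

```python
def training_window(base_season, n_prior=3):
    """Seasons n-3, n-2, n-1 used to train the model for base season n.

    Each base season gets its own freshly trained model, so the training
    data always tracks the current shooting environment.
    """
    return [base_season - k for k in range(n_prior, 0, -1)]
```

So the model evaluated on the 2020-21 season trains only on 2017-18 through 2019-20, and the window rolls forward one year for each new base season.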
To get a sense of the difference, here’s a look at the calibration plots for year n vs n+3 from the random forest tidymodel. For this, we want our points to be as close to the line as possible.
Now, with those results in mind, we’ll head over to the xgboost model, run several iterations to get our best fit, and then use those parameters as the basis for each of our models, one for each season.
In the end, we end up with all of these:
A quick explanation of what these charts mean: they compare expected FG% (as predicted by the model) on the x-axis against actual FG% on the y-axis. The closer the points are to the line, the better calibrated the model. In other words, when the points are on the line, shots that the model says should go in 40% of the time actually go in 40% of the time, and so on.
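The points on those plots come from binning shots by predicted probability and comparing each bin's average prediction to its actual make rate. A minimal sketch of that binning:

```python
def calibration_bins(probs, outcomes, n_bins=10):
    """Return (mean predicted, actual FG%) pairs per probability bin.

    On a well-calibrated model the two numbers match in every bin,
    i.e. the points sit on the 45-degree line.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in-range
        bins[idx].append((p, y))
    out = []
    for b in bins:
        if b:  # skip empty bins rather than divide by zero
            mean_p = sum(p for p, _ in b) / len(b)
            actual = sum(y for _, y in b) / len(b)
            out.append((mean_p, actual))
    return out
```

Feeding in shots predicted at 25% that actually go in one time in four lands the point exactly on the line; systematic gaps between the two columns are what a messy calibration plot looks like numerically.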
There are a few issues with the 2015-16 and 2017-18 seasons*, but for the most part, I’m pretty happy with how this turned out. It's not quite a perfect 5/7 rating, but 6/8 is probably the next best thing.
(*It’s outside the scope of this post, but I adjusted the expected field goal percentage based on the actuals to make them better fit the scoring environment. The 2015-16 adjustment was pretty straightforward; the 2017-18 one less so.)
The models are built. Now comes the fun part: using those to look not only at player shot quality, but also how well players outperform their expected metrics. As Seth Partnow has discussed, there are huge issues with modeling by player taking the shot, but that doesn't mean we can't gauge a player's skill by comparing actual to expected metrics.
First, let’s take a look at each player’s expected effective field goal percentage and their actual effective field goal percentages, and rank them by players who best outperformed their expected numbers. For this, I’ve also added a SHARP+ metric, which is similar to OPS+ or wRC+ in baseball. League average is set to 100, and players are graded based on how well they did above or below that line. For example, a player with a 105 SHARP+ shoots 5% better than an average player given his shot profile, while a 95 SHARP+ player shoots 5% worse.
I've also filtered out players who took fewer than 200 shots, just for sample size purposes.
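As described above, SHARP+ boils down to the ratio of actual to expected eFG%, indexed to 100. A sketch of the arithmetic (the formula is inferred from my description, and the player numbers below are invented):

```python
def efg(fgm, fg3m, fga):
    """Effective field goal percentage: (FGM + 0.5 * 3PM) / FGA."""
    return (fgm + 0.5 * fg3m) / fga

def sharp_plus(actual_efg, expected_efg):
    """100 = league-average shooting given the player's shot profile.

    105 means 5% better than an average player on the same shots;
    95 means 5% worse.
    """
    return 100 * actual_efg / expected_efg
```

A player whose shot profile projects to a 0.500 expected eFG% but who actually posts 0.525 lands at a SHARP+ of 105.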
These aren't exactly the names I expected, but it's a start.
I started this post by discussing expected goals in soccer, and how this model is basically its basketball equivalent. That’s all true, but there is one important caveat. In soccer, shot volume is relatively low. A single team may take 15 shots over an entire 90-minute match. In basketball, teams take upwards of six times as many shots in half the time.
Because of the lower shot volume, it’s harder to judge which players are actually better than expected at making shots. If an average player is scoring more goals than expected based on xG over a few games, it’s generally expected that said player is due for regression in the form of a cold spell. On the other hand, Lionel Messi, who might be the greatest soccer player of all time, has exceeded his xG every single season. The thing is, it takes a long time, seasons even, for that data to stabilize and for you to be able to draw any conclusions about whether a player actually is an above-average goal-scorer rather than a player on a hot streak.
The difference in shot volume means that shouldn’t be the case in basketball. Instead, we should be able to compare the difference in shots made to expected shots and see which players are better/worse than an average shooter. Since we also have shot values, we can then translate those numbers into points and determine how many points a player is worth above or below average (at least from a shooting perspective).
Before that, though, we also need to add one other component: free throws. This model is a lot simpler than the one for field goals. Rather than using a fancy tree-based model, I simply took the trailing three seasons of free throws, computed the average percentage by position (PG, SG, SF, PF, C), and added those to the data to get free throw points above replacement. Unlike most shots, free throw percentages by position don’t vary all that much from season to season, so I feel like the simple model makes sense here.
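That free throw adjustment amounts to a lookup table and a subtraction. A sketch (the percentages below are invented placeholders, not the actual trailing-three-season averages):

```python
# Hypothetical trailing-three-season FT% by position (numbers invented).
FT_PCT_BY_POS = {"PG": 0.80, "SG": 0.79, "SF": 0.77, "PF": 0.73, "C": 0.68}

def ft_points_above_expected(position, ftm, fta):
    """Free throw points relative to a position-average shooter.

    Positive means the player made more free throws than an average
    player at his position would on the same attempts.
    """
    expected_makes = FT_PCT_BY_POS[position] * fta
    return ftm - expected_makes
```

A center who goes 70-of-100 from the line beats the (hypothetical) positional average of 68% by two points, which then folds into his shooting points total alongside the field goal model.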
WAR: What is it good for?*
*Surely you didn't think you were getting through a post from me, of all people, without this reference?
Baseball fans are likely familiar with WAR, or Wins Above Replacement, which is a single metric meant to encapsulate a player’s value based on everything he does on the field (hitting, defense, base running). Obviously I don’t have the sort of data required to incorporate defense or assists into the equation, but we can use our points above expected based on the expected and actual percentages above and then translate that into wins to get a single number that should be comparable across seasons regardless of scoring environment.
Both Daryl Morey and former Grizzlies front office employee John Hollinger have done work on Pythagorean wins in basketball, which estimates a team’s expected win percentage based on its total points scored and total points allowed. Using that framework, and the fact that we know the average points scored and allowed for the league, we can work backward to calculate the number of points needed to add a single win to a team's win total. Since this is a Grizzlies blog, I felt obligated to use Hollinger’s exponent of 16.5 for my calculations, but Morey's 13.91 works just as well; it simply increases the number of points per win.
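Working backward from the Pythagorean formula, the marginal points needed for one extra win near a .500 baseline come out to roughly 4 × (league points per game) / exponent. A sketch of that conversion (the ~112 PPG league average is approximate, and the SWARm conversion is inferred from my description of the metric):

```python
def points_per_win(league_ppg, exponent=16.5):
    """Marginal total points per added win near .500.

    Comes from differentiating the Pythagorean win expectation
    W = G * PF^k / (PF^k + PA^k) with respect to points scored,
    evaluated where PF == PA.
    """
    return 4 * league_ppg / exponent

def swarm(points_above_expected, ppw):
    """Shooting Wins Above ReplaceMent: shooting points over
    expectation converted to wins."""
    return points_above_expected / ppw
```

With a league average around 112 points per game and Hollinger's 16.5 exponent, this lands at roughly 27 points per win; swapping in Morey's 13.91 pushes it up past 32.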
With points per win solved (for the 2020-21 season, it's about 27 points), we can now calculate how many wins each player was worth. Here are the current season (2020-21) leaders in SWARm (Shooting Wins Above ReplaceMent) for the regular season.
In other words, it looks like Nikola Jokic is a deserving MVP, at least based on shooting alone. Compared to the earlier table, these numbers look much more like what I'd have expected from this season: Jokic and Curry at the top, three Nets, and a few other known good shooters. Vucevic is probably the lone shocker there, and the best reminder of the fact that this is a shooting model only.
If you want a more Grizzlies-centric view, here’s how this season’s players rate in shooting points above replacement.
Before you freak out about who is at the bottom of the list, remember: SHOOTING MODEL ONLY. Plus, in this instance, volume makes a massive difference. Ja is where he is because he’s not a great shooter and he takes a lot of shots, as he should. When you look at his SHARP+ score, however, you can see that he’s also not truly miserable (just 6.4% worse than average). It's his volume that hurts him the most when calculating his wins above replacement, which is a cumulative stat. Far worse is (shocker) Justise Winslow, who racked up two-thirds of Ja's negative wins on one-fifth the shots. Yikes.
This should not be discouraging. We all know Ja leaves something to be desired with his shooting. If anything, it should be viewed as a positive. Given his age, the fact that he’s only slightly below average means that if he were to get to just above average, the metric would love him. Besides, this is only evaluating a single aspect of his game. Ja’s value, even just restricting it to offense, is so much more than just shooting.
Which brings us to…
Issues with the Model
As I stated earlier, this stat is not meant to be the sort of all-in-one metric that WAR is in baseball. (To be fair, even in its current state, WAR frequently gets misused.) Instead, it looks to place a value on a single aspect of a player’s game (shooting) and quantify that into an easy-to-understand metric (wins).
However, for as well as the models are built, there are still aspects which aren’t incorporated into them. Whether a shot was assisted, the position of the defense when the shot was made, the player’s movement prior to taking the shot, the other players on the court with said player (hi, Joe Harris!), etc. are all important pieces of the game which I don’t have available in my data. In the near future I'm hoping to incorporate bits and pieces (lineup data, etc.) to try and extrapolate some sort of defensive/on-off numbers, as well as add some value for shots assisted by each player. That's for another day, though.
There's also the issue of data quality. Seth Partnow, as well as Owen J. Phillips, have discussed the issues with trying to analyze information based on the play-by-play reports and the fact that there are plenty of misclassifications. Unfortunately, I don't have access to Second Spectrum, nor do I have the time to sit and evaluate every one of the 200,000+ shots per season to ensure they're classified correctly. In other words, I'm doing the best I can with the time and resources I have.
In spite of those deficiencies, I think this serves as a reasonable way to evaluate player shooting, which, in a make-or-miss league like the NBA, is incredibly important.