The soccer analytics community has produced a tidal wave of Expected Goals analysis in the last couple of months. Okay, maybe a handful of bloggers and analysts can't quite manage a tidal wave. But still. A wave of some kind.
This post on Cartilage Free Captain is a great summation of the various efforts, though I'm not entirely convinced it will "blow the soccer analytics game wide open". Briefly, an expected goals model is an attempt to determine how many goals a team or player should be expected to score based on the characteristics of the shots they take. In theory you can go even further and get expected goals based on other events (like passing), which would bring Expected Goals fairly close to Runs Created for those familiar with baseball sabermetrics. But for soccer, shots are clearly the right place to start.
An expected goals number has a few uses. First, if a team or player is significantly outscoring their expected number of goals you can take that as evidence that they're uncommonly skilled (in that they convert equivalent shots at a higher rate) or that they've been uncommonly lucky (and can be expected to regress). Second, an accurate expected goals model can be an indication of which is the more dangerous team over the short term (even intra-game), even if no goals have been scored yet.
Michael Caley, the CFC author, has developed a model for both the Premier League and MLS (and he keeps updated advanced MLS statistics here), which at its base relies on assigning shots to various zones, then adds adjustments for key pass type, whether the shot was a header, etc. Martin Eastwood has also developed a model using distance alone, which gets about 85% of the way to an accurate expected goals number for the Premier League.
So I thought I'd try an analysis similar to Martin's, but using MLS shot data. I expect it will come up with results similar to Michael's, but hey, let's find out!
The data set is a collection of 23,902 shots taken in MLS in the last three years, including the location on the pitch from where the shot was taken. Using those locations we can calculate the conversion rate of shots at various distances (in yards), thusly:
As you can see there's a pretty smooth curve, with a dramatic dropoff from the absurdly high rates at tap-in distances and then at about 7-8 yards it starts to level out. This looks a lot like an exponential distribution, meaning taking the logarithms of the values should yield something like a line (again, this is all consistent with what Martin found in the EPL data). Here's the same chart, but taking the log (base 10) of the conversion rates:
That's a pretty good line, though there's a little curve at the beginning that looks more systematic than random, and it obviously gets a little ragged at the end. But for our purposes those aren't areas of particular interest. Though the regression treats each data point equally, the vast majority of shots in MLS are taken from 4 to about 30 yards away, and in that range the line fits quite well. Even with the outliers at the end, the overall R-squared is 0.936, which means distance alone explains about 94% of the variance in conversion rate, without knowing anything about the shooter, the assisting pass, the defense, etc.
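The fit itself is a one-liner once you have conversion rates by distance. Here's a minimal sketch of the log-linear regression; since I'm not reproducing the raw shot data here, the "observed" rates below are synthetic values generated from the fitted line, just to demonstrate the technique.

```python
import numpy as np

# Stand-in for binned MLS data: conversion rate at each distance (yards).
# In the real analysis these come from grouping ~24k shots by distance.
distances = np.arange(4, 31)                  # 4 to 30 yards
rates = 10 ** (-0.0474 * distances - 0.3302)  # synthetic, for demonstration

# Fit a straight line to log10(conversion rate) vs. distance.
slope, intercept = np.polyfit(distances, np.log10(rates), 1)
print(round(slope, 4), round(intercept, 4))   # recovers -0.0474 and -0.3302
```

With real binned data you'd also want to weight each point by the number of shots at that distance, since the regression otherwise treats a 40-yard bin with 10 shots the same as an 8-yard bin with 2,000.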
So if we accept that line as an equation, the conversion rate for a shot from distance d (in yards) is

10^(-0.0474d - 0.3302)

Which looks a little imposing, but computers are doing all the work anyway, so who cares.
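Translated to code, the whole model is a couple of lines. The shot distances below are made-up examples, not real MLS data:

```python
def expected_goal(d):
    """Model v0.1: conversion probability for a shot from d yards out."""
    return 10 ** (-0.0474 * d - 0.3302)

# A team's expected goals is just the sum over its shots' distances.
shot_distances = [6, 12, 18, 25]        # hypothetical shots
xg = sum(expected_goal(d) for d in shot_distances)
print(round(expected_goal(6), 3))       # about 0.243
print(round(xg, 2))
```

So a shot from 6 yards is worth roughly a quarter of a goal, and a 25-yarder about a thirtieth; string a season's worth of shots together and the fractions add up to an expected goal total.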
So that's a dramatically simple Expected Goals model, and yet... and yet. If I apply it to three years of Seattle Sounders shooting, the expected number of (non-penalty, non-own) goals I get is 130.8. The actual number of goals is... 131. So across more than a thousand shots, I can get within 0.2 goals of the correct number knowing nothing but how far away they were.
Not all teams fare so well and I expect that stylistic tendencies by some teams (particularly how much they lean on headers) will require adjusting the model to keep close to that level of accuracy. Here's the 2013 season in actual goals (again, excluding penalty and own goals), expected goals based on what we'll call Model v0.1, and the difference.
| Team | Goals | xG | Difference |
| --- | --- | --- | --- |
| Real Salt Lake | 50 | 39.44 | +10.56 |
There are some big outliers here... well beyond what we could reasonably expect from luck. The teams that fell furthest short of their expected goal numbers were San Jose, Houston, and DC United. Notably, two of those teams are much more reliant on headed shots than a typical MLS team. And there was all kinds of evidence that DC was very unlucky as well as being very bad (and you need both to have a historically low number of wins), so it's not surprising to see them there. At the other end, Real Salt Lake and Vancouver scored quite a few more than expected.
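Producing a table like the one above is a straightforward group-and-sum over the shot log. This is a sketch with a tiny hypothetical shot log; the column names (`team`, `distance`, `goal`) are my assumptions about the data layout, not the actual Opta format:

```python
import pandas as pd

def expected_goal(d):
    return 10 ** (-0.0474 * d - 0.3302)

# Hypothetical shot log; the real data set has ~24k rows.
shots = pd.DataFrame({
    "team": ["RSL", "RSL", "SJ", "SJ"],
    "distance": [8, 20, 10, 14],
    "goal": [1, 0, 0, 0],
})
shots["xg"] = expected_goal(shots["distance"])

# Actual vs. expected goals per team, sorted by over/underperformance.
table = shots.groupby("team").agg(goals=("goal", "sum"), xg=("xg", "sum"))
table["diff"] = table["goals"] - table["xg"]
print(table.sort_values("diff", ascending=False))
```

The player tables further down fall out of the same computation, just grouping on the shooter instead of the team.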
One test to see whether a difference is due to skill or luck is to look at a team over time. RSL, for example, was +11 last year. But in 2012 they were +2. And in 2011 -3. So if they've cracked the code of shot conversion, it was a very recent development. Only one team has consistently shown an ability to beat their expected goal total year to year, and I'll discuss them in a future installment.
We can also look at expected goals at the individual level by looking at the players taking the shots. Here are the top 10 players ordered by how much they outperformed their expected goals:
| Player | xG | Goals | Difference |
| --- | --- | --- | --- |
| Marco Di Vaio | 11.08 | 20 | +8.92 |
Di Vaio is an interesting player to top the list. Of course, he scored a ton of goals. But he also did it with a very distinctive style, hugging the defensive line so tightly that he led the league in offside calls by enough to lap the rest of the field. But when the through balls and balls over the top to him were onside, he was totally free of defenders, which would help account for his higher finishing rate. Opta actually tracks breakaways and through balls as well, so it should be theoretically possible to account for those shots.
There's another argument for a stylistic ability to beat the expected goals number. Nagbe sits in 9th place, but Diego Valeri is in 12th (at +3.65) and Will Johnson in 14th (at +3.61). Having three players in the top 15 suggests that Caleb Porter's possession-and-throughball style may consistently lead to higher-percentage chances.
Now here are the bottom 10 players, those who most significantly missed their expected goal targets:
| Player | xG | Goals | Difference |
| --- | --- | --- | --- |
| Juan Luis Anangono | 4.25 | 2 | -2.25 |
I'll let Galaxy fans expound to you the many ways in which Zardes' shooting was awry last season. Suffice to say he was a disappointment. There's a significant representation of players who primarily shoot with their head, and that doesn't surprise me, since we're not scoring headers any differently. To expand on the theory: a shot from 5 yards out with your head is probably a run-of-the-mill header off a set piece and doesn't have a particularly high chance of going in, since it's almost certainly heavily contested and headers are hard to control. On the other hand, a shot with your foot from 5 yards out is a gift, and those tend to go in at a high rate. I expect that explains Omar's, Chad's, and (to some extent) Bruin's presence on the list, and in the next iteration I'll break out headed conversion rates separately.
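That next iteration amounts to splitting the conversion-rate curve by body part before fitting. A minimal sketch, assuming the shot log carries a boolean `header` flag (my assumption about the data, not confirmed from the original):

```python
import pandas as pd

# Hypothetical shots with a body-part flag.
shots = pd.DataFrame({
    "distance": [5, 5, 5, 5, 6, 6],
    "header": [True, True, False, False, True, False],
    "goal": [0, 1, 1, 1, 0, 1],
})

# Separate conversion rate by distance for headed vs. footed shots;
# each group would then get its own fitted curve.
rates = shots.groupby(["header", "distance"])["goal"].mean()
print(rates)
```

On real data you'd expect the headed curve to sit well below the footed one at short range, which is exactly the adjustment that should pull Gonzalez, Marshall, and friends back toward their expected totals.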
As I said, more to come. But for now I think it's a pretty impressive and easy-to-calculate model, considering it needs only a single piece of data for each shot.