The Role of Shots in Goal Scoring
One of the factors that distinguishes soccer from nearly every other sport is the relative rarity of scoring events. This has significant implications in many different contexts, including tactical decisions, player evaluation, tournament organization, and so on. In particular, it has implications for statistical analysis of the game. Because the goal of a team is to win and winning requires outscoring the opponent, team and player analysis must necessarily be rooted in the ability to produce goals. But because goals are relatively rare, we run into significant sample size problems. Say, for example, a team lost two consecutive games by 0 goals to 1. Is that a bad team? Does it have problems finishing? Or was it just unlucky? Does it matter if the team outshot its opponents 20-5 in those games? The answers to those questions are at the core of soccer analytics.
One step towards reducing the sample size problem would be to rely on stats other than goals. Shots, for example, happen at a much greater rate than goals, and if you can evaluate players or teams in terms of shots you have a much larger statistical base to look at. But if you look only at shots you are ignoring information about how particular teams or players turn shots into goals. For example, 10 shots from a team that converts 1/5 of its shots into goals are worth much more than 10 shots from a team that converts only 1/10 of its shots into goals: about 2 goals on average versus 1.
You can imagine a team or player's rate of goal scoring as being the product of a number of different rates. First there is the rate at which shots are taken (shots/min), which I'll conveniently call Shot Rate. Then there is the rate at which those shots are on goal (shots on goal / shots), which in the tradition of Chris Anderson (and I assume others) at Soccer By the Numbers and Soccer Analysts I'll call Accuracy. Then there is the rate at which shots on goal go in the net (goals / shots on goal) which in the same tradition we'll call Conversion. A team's goal-scoring rate is the product of these three rates.
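To make that decomposition concrete, here's a minimal Python sketch. The function name and the player numbers are made up purely for illustration; they aren't real data. The point is that the intermediate counts cancel, so multiplying the three rates gives you back goals per minute.

```python
def goal_rate(shots, shots_on_goal, goals, minutes):
    """Return (shot_rate, accuracy, conversion, product of the three)."""
    shot_rate = shots / minutes          # shots per minute
    accuracy = shots_on_goal / shots     # share of shots that are on goal
    conversion = goals / shots_on_goal   # share of shots on goal that score
    # Shots and shots on goal cancel out, leaving goals per minute.
    return shot_rate, accuracy, conversion, shot_rate * accuracy * conversion


# Hypothetical player: 60 shots, 24 on goal, 6 goals in 1800 minutes.
shot_rate, accuracy, conversion, product = goal_rate(60, 24, 6, 1800)
print(f"Shot Rate {shot_rate:.4f}, Accuracy {accuracy:.2f}, Conversion {conversion:.2f}")
print(f"Product {product:.5f} vs. goals per minute {6 / 1800:.5f}")
```

Two players with the same goals per minute can get there very differently, which is why it's worth looking at each rate separately.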
So one important step in understanding goal scoring is isolating these rates and examining how teams and players perform on them. In particular, if we can demonstrate that Conversion rates are highly erratic or random, it will bolster the argument for using shots on goal for evaluation. And similarly if we can demonstrate that Accuracy rates are also erratic or random, it will bolster the argument for using shots.
"Randomness" and Skill
Keep in mind that when I say 'randomness' in outcomes, I don't literally mean randomness, as if someone were rolling a die every time a shot was taken. Instead I mean factors that are both extremely unpredictable and largely out of the shooting player's control. For example, when a player takes as hard a shot as possible, extremely small variations in the locations of the laces, moisture at the point of contact, wind, the player's momentum, stability of the plant foot, etc., can add up to significant changes in the direction of the ball, which can easily mean the difference between a goal and a shot off the bar. These factors will almost certainly be forever beyond the scope of statistical analysis (and even if they weren't, sample size issues would render them useless), so instead we group them together (along with similar factors affecting a goalkeeper or defender's ability to intervene) and call it Randomness.
One method we can use to determine how much a particular statistic is governed by randomness is to see how it changes over time. Obviously many stats are heavily affected by opportunity and tactics (for example, forwards will naturally take more shots than defenders, regardless of skill), but if we take a look at a player's stats from one half of a season to the next, we should eliminate most of those factors. Player movement from team to team or position to position during a season is relatively rare, and we don't expect a player to improve or deteriorate significantly skillwise during a single season, so the theory is that any changes in a statistic can be attributed in large part to random factors.
Let's take a couple of stats from other sports as examples. One statistic that we can be pretty confident is primarily governed by skill is the rate of point scoring in basketball. Obviously the outcome of any single shot in basketball is not much different than the outcome in soccer and can be quite random, but there are so many more shots in a game of basketball that the random factors will average out in the course of a single game. We can be pretty confident that a player who scores around a point every 2 minutes in the first half of the season will continue to score at that rate in the second. To illustrate that, here's a plot that correlates first half point-per-minute rates of the top 50 scorers in the league to second half rates for the same players this season:
You can see that the cluster is pretty close to a nice line pointing up at a 45 degree angle. I've highlighted a couple of outliers. Jason Richardson has had a pretty significant drop off from a ppm rate of 0.552 in the first half to about 0.415 in the second half. But in general the results correlate nicely. The Pearson correlation is 0.78, and we'll use that as a baseline for a stat governed largely by skill.
In contrast, we can turn our attention to baseball and look at pitcher BABIP. That's short for Batting Average on Balls In Play, and it refers to the batting average on balls that are hit somewhere in the field of play (not home runs, not fouls, etc.). It's pretty well understood in the baseball Sabermetrics crowd these days that a pitcher's BABIP is largely random. Not completely random, as was originally theorized by Voros McCracken in 2001 (groundball pitchers and flyball pitchers have slightly different rates, for example), but it's mostly random. A pitcher with an unusually high BABIP will almost certainly regress to the mean, which makes it a great tool for evaluating and predicting performance. Here's a plot of pitcher BABIP from one half to the next last season:
That's a pretty striking contrast. This is what the professionals call a 'blob'. Values are clustered roughly around the mean without any line to indicate correlation and the outliers radiate out in a circle. The Pearson correlation is 0.10 in magnitude (it's actually negative, but the sign is meaningless in this case).
So this gives us an idea of the range of correlation rates of stats. A soccer stat that's governed mostly by the skill of the player should have a strong intraseason correlation and one governed mostly by randomness will have a weak one. Another way of putting it is this: suppose you're asked to predict the second half stats of a player for whom you have the first half stats. If it's a stat governed by skill, your best bet is to guess whatever the first half stat was. If it's a stat governed by randomness, your best bet is to guess the league average.
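Most stats fall somewhere between those two extremes, and one simple way to handle the in-between cases is to blend the two guesses, weighting the first half value by the correlation. Here's a rough Python sketch of that idea. It's a textbook regression-to-the-mean estimate that assumes the spread of the stat is about the same in both halves, and the function name and input values are hypothetical, chosen only to illustrate the two extremes.

```python
def predict_second_half(first_half, league_average, correlation):
    """Blend the first-half value with the league average.

    correlation = 1 returns the first-half value unchanged;
    correlation = 0 returns the league average.
    """
    return league_average + correlation * (first_half - league_average)


# Hypothetical skill-driven stat (correlation 0.78, like the basketball example).
skill_guess = predict_second_half(first_half=0.55, league_average=0.45, correlation=0.78)

# Hypothetical luck-driven stat (correlation 0.10, like pitcher BABIP).
luck_guess = predict_second_half(first_half=0.340, league_average=0.300, correlation=0.10)

print(f"skill-driven guess: {skill_guess:.3f}")
print(f"luck-driven guess:  {luck_guess:.3f}")
```

With these made-up inputs, the skill-driven guess stays close to the first half value, while the luck-driven guess lands almost on the league average.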
So now let's look at our three rates. For my population I took every MLS player last season who accumulated at least 10 shots on goal in each half to weed out sample size problems. Also keep in mind that I eliminated penalties from the analysis completely. Penalties are obviously their own special beast and would just corrupt any analysis done on the run of play. Here's the plot of the Shot Rate (that's shots / minute) for those players from one half to the next:
That's a good looking correlation. We see a well clustered diagonal line that indicates that first half Shot Rate is correlated with second half Shot Rate. That point way over at the top right of the graph is Edson Buddle, by the way. He took a lot of shots. The Pearson correlation is 0.70. So we can say in general that Shot Rate is not affected significantly by randomness. Players who shoot a lot will continue to shoot a lot.
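If you want to run this kind of split-half check on your own numbers, here's a minimal sketch of the Pearson calculation in plain Python. The five players and their shots-per-minute values are invented for illustration; they are not the MLS data behind the plot.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up shots-per-minute values for five hypothetical players,
# listed in the same player order for each half of the season.
first_half = [0.020, 0.035, 0.050, 0.028, 0.041]
second_half = [0.024, 0.031, 0.047, 0.030, 0.038]

r = pearson(first_half, second_half)
print(f"split-half correlation: {r:.2f}")
```

A value near 1 means first half rates line up closely with second half rates, and a value near 0 is the blob.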
Now let's look at Accuracy (that's Shots on Goal / Shots). If it's true that there are players who are significantly more or less accurate than others, that should show up as a good correlation here:
But we see the blob again. The Pearson correlation is 0.11, almost as low as pitcher BABIP (!). As an example, in the first half Ryan Johnson had an Accuracy rate of 0.81. In the second half it was 0.47. So either Johnson worsened tremendously as a player in the course of a single season or his Accuracy rate was affected significantly by factors outside his individual skill. In contrast Kheli Dube had a first half Accuracy rate of 0.375. Terrible shooter? His second half Accuracy rate was 0.667. What accounts for this? Well, there are certainly sample size issues as we reduce our population to only shots on goal. But I think the sample is large enough, the number of players is large enough, and the resulting correlation is so low that there's really no denying that luck plays a major role in Accuracy.
But let me be clear that this doesn't mean that there's no inherent difference in skill among players. Fredy Montero can obviously put a ball on frame from 30 yards out much more often than most other players in the league. But when you take into account all of the factors that constrain that action in a real game (the fact that players will only take a shot that has a decent chance of going in, the fact that players will want to hit the ball with power and therefore lose accuracy, the fact that defenders will contest nearly every shot, etc.), then in the run of play those skill differences get averaged out and Accuracy seems to come down largely to factors other than skill.
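To get a feel for how much an Accuracy rate can move on sample size alone, here's a back-of-the-envelope Python sketch. It treats each shot as an independent on-goal or off-goal trial, which is a simplification, and the true Accuracy of 0.5 and the 25 shots are assumed numbers, not taken from the MLS data.

```python
from math import sqrt


def accuracy_standard_error(true_accuracy, shots):
    """Standard error of an observed Accuracy rate over a number of shots,
    treating each shot as an independent on-goal / off-goal trial."""
    return sqrt(true_accuracy * (1 - true_accuracy) / shots)


# Assumed numbers: a true Accuracy of 0.5 and 25 shots in a half-season.
se = accuracy_standard_error(0.5, 25)
low, high = 0.5 - 2 * se, 0.5 + 2 * se
print(f"standard error: {se:.2f}")
print(f"rough 95% range: {low:.2f} to {high:.2f}")
```

Under those assumptions a player whose underlying Accuracy never changes could plausibly post anything from about 0.30 to 0.70 in a given half, which is worth keeping in mind when a single player's numbers swing from one half to the next.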
What about Conversion (that's Goals / Shots on Goal)? This will be of particular interest to Seattle Sounders fans who watched their team pepper the goal frame in three consecutive games only to see one goal come out of it (on a back-post tap-in, of all things). Here's the same plot for Conversion:
Hopefully not a surprise to anyone at this point, but Conversion rates are even less correlated. A player who has a good first half Conversion rate is just as likely to have a terrible second half as a good one. The Pearson correlation is 0.06, which is effectively nothing. Again this isn't a denial of the existence of skill, but an acknowledgement that the skill comes mostly in getting space for a shot, and that conditions, defenders, goalkeepers, and so on have a larger impact on whether the shot goes in. If you don't believe that, take a look at the work Anderson has done on Conversion rates in the EPL this season. It's just not true that the best or most talented teams are the ones with the best Conversion rates. Blackburn leads the league, with Newcastle right alongside. Tottenham, Arsenal, and Chelsea are all in the bottom half.
So what does all of that mean? Obviously a great deal more research has to be done. I suspect that more detailed information about shot quality (location, distance, nearness of defenders, etc) will reveal more significant correlations between skill and results, but in the meantime — if the stats you have to work with are shots, shots on goal, and goals — then in the short run looking just at goals will reveal almost nothing about individual player skill. Instead what we see is that good players get in positions to take dangerous shots and that dangerous shots will, on average, eventually go in.
This is what coaches are talking about when they evaluate their team based on whether they 'created chances'. If you're taking shots (and you're not shooting just for the sake of it, but taking real shots), then a certain proportion of them will eventually go in.
It also means that we're justified in looking at shot data to evaluate players (for example, looking at net shots when a player is on the field, which we'll look at more later this season). And it means we're justified in saying the Sounders have been tremendously unlucky in the first three games of the season and that this, more than anything, has determined their results.