(Crossposted to Seattle Soccer Scene)
At the risk of turning Sounder at Heart into nerd central by adding to malcontenjake's contributions and his own, Jeremiah invited me to add some detail to the simulated MLS season results I dropped into a comment.
At its core, the idea is to predict the final standings by pseudo-randomly simulating out the remaining games in the season. I say pseudo because the idea is to predict as accurately as possible the likely outcome of each game. But it's random because game results are sensitive to the tiniest factors (like, for example, Terry Vaughn's obscene decision to award a penalty after Jason Yeisley fell all over himself in injury time in Dallas), so it's better to take a large number of random results and average out the outcomes. Smarter people than me call these Monte Carlo experiments, and they're frequently used for sporting events, which have a lot of randomness.
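If you want a feel for what "average out the outcomes" means in practice, here's a toy sketch in Python. The goal distribution is completely made up (each team scores 0, 1, or 2 with equal chance); the point is just that any single random result is noise, while the average over many trials is stable.

```python
import random

# Made-up goal distribution: each team scores 0, 1, or 2 with equal chance.
def random_goals():
    return random.choice([0, 1, 2])

# Any one game is noise; averaging many random results is not.
trials = 100_000
draws = sum(random_goals() == random_goals() for _ in range(trials))
draw_rate = draws / trials
print(draw_rate)  # hovers near 1/3
```

A single trial tells you almost nothing, but after a hundred thousand the draw rate settles right where the math says it should.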
It has some benefits over more formulaic models (like PPG) in that it takes into account the results of specific games. For example, a game between two playoff contenders might be critical to the final standings, but it's impossible for both teams to win: a fact that a simulation captures, while a PPG model assigns both teams their average points. On the other hand, the results tend not to be dramatically different from PPG, and the simulation can be a lot of work and headaches to set up.
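To make that difference concrete, here's a hypothetical head-to-head (the team names, the 1.8 PPG figure, and the equal outcome probabilities are all invented for illustration):

```python
import random

# Two contenders who both average 1.8 points per game meet once more.
ppg = 1.8

# A PPG model credits BOTH teams with their average for the game,
# minting 3.6 points from a match that can only ever produce 2 or 3.
ppg_projection = {"Team A": ppg, "Team B": ppg}

# A simulation picks one concrete result instead.
def simulate_head_to_head():
    outcome = random.choice(["A", "B", "draw"])  # placeholder probabilities
    if outcome == "A":
        return {"Team A": 3, "Team B": 0}
    if outcome == "B":
        return {"Team A": 0, "Team B": 3}
    return {"Team A": 1, "Team B": 1}

points = simulate_head_to_head()
assert sum(points.values()) in (2, 3)  # never 3.6
```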
Ken at Sports Club Stats does the same sort of work, and I think it's required reading for any sports fan, but I had a few complaints about the SCS methodology. First, he predicts outcomes based on previous winning percentage, whereas I prefer to predict point totals (or in MLS, goal totals) and determine the wins from that. In a winning percentage model, a team that wins 2-1 and a team that wins 5-0 are given the same credit, when the second team is almost certainly more dangerous. Also, it's not entirely clear what to do with ties. Also also, home-field advantage is a league-wide constant, when in fact each team gets a slightly different advantage (or lack thereof) at home. In a goals model, you capture scoreline domination, you have no problem with ties, you can derive a unique home-field advantage, and it's easier to differentiate a team's defensive contribution (its ability to prevent goals) from its offensive contribution (its ability to score goals). Another complaint is that SCS doesn't weight recent results more heavily. Teams change over time, and no team has changed more dramatically than the Sounders in the last few weeks, so it wouldn't be appropriate to weight their early season results equally with their recent ones.
So I decided to turn my complaints into lemonade and set up my own version, having done something similar for MLB in the past. Without going into ridiculous detail, the fact that soccer scores are clustered around 1 and 0 and are not normally distributed rules out the easy Gaussian (mean and standard deviation) random score generation that might work for the NBA or baseball. So instead I use a sampling algorithm. Imagine that I write each team's goal results for each game on a slip of paper (so LA would get one 4, a 3, and a bunch of 2s, 1s, and 0s) and divide the slips into home and away, and goals allowed and goals scored. Then for each game I take the home team's home scoring pile and dump it into a bucket, dump the away team's away goals allowed pile into the same bucket, swizzle it around, and pull a number. Voila: that team's simulated score result for that game. Obviously it's an overly simplified way to predict a single game's result, and I wouldn't recommend running to Vegas with it, but over a large number of simulations it averages into dividing the good teams from the bad.
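A rough Python sketch of the bucket, with invented goal lists standing in for the real slips of paper (the team names and numbers here are placeholders, not actual season data):

```python
import random

# Made-up season data: goals LA scored in each home game, and goals
# Seattle conceded in each away game.
home_goals_scored = {"LA": [4, 3, 2, 2, 1, 1, 1, 0, 0]}
away_goals_allowed = {"SEA": [0, 0, 1, 1, 1, 2, 2, 3]}

def sample_home_score(home, away):
    # Dump the home team's home-scoring slips and the away team's
    # away-goals-allowed slips into one bucket, swizzle, pull one.
    bucket = home_goals_scored[home] + away_goals_allowed[away]
    return random.choice(bucket)

print(sample_home_score("LA", "SEA"))  # one simulated LA score vs Seattle
```

The away team's score comes from the mirror-image bucket (away goals scored plus home goals allowed), and together the two draws give you one simulated result for the game.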
Once the basic process is in place, you can introduce a couple of weighting factors. First, you could give the offense's results more weight than the defense's results (by giving them twice as many slips of paper in the bucket), depending on how much offensive vs defensive factors contribute to scoring output. Barring running a full regression to figure out the weight, assuming that they're evenly weighted is a fine approximation, and that's what I do. Second, you can introduce a recency bias (by adding duplicates of the slips of paper for recent games). I currently bias it on a sliding scale, so that the first game of the season gets a weight of 1 and the most recent game gets a weight of 2 (and the middle game therefore gets 1.5, etc). Again, it would take some real math to figure out the best weight for more recent results, but double seems fair and, if anything, a little conservative.
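Here's how those two weightings might look in code. The goal lists are invented, the even offense/defense split and the 1-to-2 recency slide are the choices described above, and Python's `random.choices` handles the fractional weights directly:

```python
import random

# Invented slips: home team's goals scored at home, and away team's
# goals allowed away, both listed oldest game first.
offense_slips = [1, 0, 2, 1, 3]
defense_slips = [0, 1, 1, 2, 0]

def recency_weights(n):
    # Sliding scale: weight 1.0 for game 1, 2.0 for game n, linear between
    # (so the middle of five games gets 1.5).
    return [1.0] if n == 1 else [1.0 + i / (n - 1) for i in range(n)]

def sample_weighted_score():
    # Offense and defense slips go into the bucket in equal numbers...
    slips = offense_slips + defense_slips
    # ...but each slip's chance of being drawn scales with recency.
    weights = recency_weights(len(offense_slips)) + recency_weights(len(defense_slips))
    return random.choices(slips, weights=weights, k=1)[0]
```

Tilting the offense/defense balance would just mean scaling one set of weights relative to the other before the draw.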
So you pull your slips over and over for each game, then calculate the standings. Then you do it again. Then about a hundred thousand more times — fortunately computers don't complain about doing repetitive tasks like this — and you average out the final standings from each of those runs. And you get something like this:
|Team|Avg Points|Avg Position|% playoffs|