Expected goals from bookmaker odds

I recently read an interesting paper called The Betting Odds Rating System: Using soccer forecasts to forecast soccer by Wunderlich and Memmert. In their paper they develop a variant of the good old Elo rating system. Instead of using the actual outcomes of each match to calculate the ratings, they use the probabilities of the outcomes, which they get from bookmaker odds.

I was wondering if a similar approach could be used together with the goalmodel package I released a couple of months ago. The models available in the package are models I have written about extensively on this blog, and they all work as follows: You use the number of goals scored to estimate ratings for the goal scoring and goal conceding rates of each team. You then use these ratings to forecast the expected number of goals in the upcoming games. These expected goals can then be used to calculate the probabilities of the outcomes (home win, draw, away win). A crucial step in these calculations is the assumption that the number of goals scored follows the Poisson distribution (or some related distribution, like the negative binomial).

But can we turn this process the other way around, and use bookmaker odds (or odds from other sources) to get expected goals, and maybe also attack and defense ratings like we get from the goalmodel package? I think this is possible. I have written a function in R that takes outcome probabilities and searches for a pair of expected goals that matches the probabilities. You can find it on GitHub (edit: the function is now included in the goalmodel package). This function relies on the Poisson distribution.
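Here is a rough sketch of how such a search could work, assuming independent Poisson distributed goal counts. The function name and interface are just for illustration, and not necessarily what ended up in the goalmodel package.

expg_from_probs <- function(probs, max_goals = 25){
  # probs: vector of (home win, draw, away win) probabilities.
  
  obj <- function(par){
    expg <- exp(par) # Search on the log scale to keep the expected goals positive.
    
    # Probabilities for all score lines up to max_goals goals for each team.
    score_probs <- outer(dpois(0:max_goals, expg[1]), dpois(0:max_goals, expg[2]))
    
    # The implied home win, draw and away win probabilities.
    implied <- c(sum(score_probs[lower.tri(score_probs)]),
                 sum(diag(score_probs)),
                 sum(score_probs[upper.tri(score_probs)]))
    
    # Squared distance to the target probabilities.
    sum((implied - probs)^2)
  }
  
  res <- optim(c(0, 0), obj)
  exp(res$par)
}

# Example: a fairly even match with a slight home advantage.
expg_from_probs(c(0.45, 0.27, 0.28))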

Next, I have expanded the functionality of the goalmodel package so that you can use expected goals for model fitting instead of just observed goals. This is possible by setting the model argument to model = 'gaussian' or model = 'ls'. These two options are currently experimental, and are a bit unstable, so if you use them, make sure to check that the resulting parameter estimates make sense.

I used my implied package to convert bookmaker odds from the 2015-16 English Premier League into probabilities (using the power method), found the expected goals, and then fitted a goalmodel using the least squares method. Here are the resulting parameters, from using both the expected goals and the observed goals:

I wanted to use this season for comparison since this was the season Leicester won unexpectedly, and in the Odds-Elo paper (figure 6) it seemed like the ratings based on the odds were more stable than the ones based on the actual results, which increased drastically during the season. In the attack and defense ratings from the goalmodels we see that Leicester have average ratings (ratings close to 0 are average) in the model based on odds, and much higher ratings in the model based on the actual results. So the goalmodel and Elo ratings seem to agree, basically.
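For reference, the whole pipeline could be put together along these lines. This is a rough sketch: the odds columns assume data from football-data.co.uk, the probabilities are converted to expected goals with a function like the one sketched above, and you should check the implied and goalmodel documentation for the exact arguments.

library(implied)
library(goalmodel)

epl <- read.csv("E0_2015-16.csv") # Hypothetical file with Bet365 odds columns.

# Convert the bookmaker odds to outcome probabilities with the power method.
probs <- implied_probabilities(as.matrix(epl[, c("B365H", "B365D", "B365A")]),
                               method = "power")$probabilities

# Convert each row of outcome probabilities to a pair of expected goals.
expg <- t(apply(probs, 1, expg_from_probs))

# Fit a goalmodel on the expected goals using the least squares method.
fit_odds <- goalmodel(goals1 = expg[, 1], goals2 = expg[, 2],
                      team1 = epl$HomeTeam, team2 = epl$AwayTeam,
                      model = "ls")

# For comparison, a default Poisson model fitted on the observed goals.
fit_obs <- goalmodel(goals1 = epl$FTHG, goals2 = epl$FTAG,
                     team1 = epl$HomeTeam, team2 = epl$AwayTeam)

summary(fit_odds)
summary(fit_obs)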

I also recently discovered another paper, titled Combining historical data and bookmakers' odds in modelling football scores, that tries something similar to what I have done here. They seem to do the same extraction of expected goals from the bookmaker odds as I do, but they don't provide the details. Instead of using the expected goals to fit a model, they fit a model based on the actual scores (similar to what the goalmodel package does), and then take a weighted average of the model-based expected goals and the expected goals from the bookmaker odds.

Fitting Bradley-Terry models using Stan

I have recently played around with Stan, which is excellent software for fitting Bayesian models. It is similar to JAGS, which I have used before to fit some regression models for predicting football results. Stan differs from JAGS in a number of ways. Although there is some resemblance between the two, the model specification languages are not compatible with each other. Stan, for instance, uses static typing. On the algorithmic side, JAGS uses Gibbs sampling to sample from the posterior; Stan does not do Gibbs sampling, but has two other sampling algorithms. In Stan you can also get point estimates by using built-in optimization routines that search for the maximum of the posterior distribution.

In this post I will implement the popular Bradley-Terry model in Stan and test it on some sports data (handball, to be specific).

The Bradley-Terry model is used for making predictions based on paired comparisons. A paired comparison in this context means that two things are compared, and one of them is deemed preferable or better than the other. This occurs for example when studying consumer preferences or ranking sports teams.

The Bradley-Terry model is really simple. Suppose two teams are playing against each other; then the probability that team i beats team j is

\( p(i > j) = \frac{r_i}{r_i + r_j} \)

where \(r_i\) and \(r_j\) are the ratings for the two teams, and should be positive numbers. It is these ratings we want to estimate.

A problem with the model above is that the ratings are not uniquely determined: multiplying all the ratings by the same constant leaves the probabilities unchanged. To overcome this problem the parameters need to be constrained. The most common constraint is to add a sum-to-one constraint

\( \sum_k r_k = 1 \)

I will explore a different constraint below.

Since we are in a Bayesian setting we need to set a prior distribution for the rating parameters. Given the constraints that the parameters should be positive and sum to one, the Dirichlet distribution is a natural choice of prior distribution.

\( r_1, r_2, \dots, r_p \sim Dir(\alpha_1, \alpha_2, \dots, \alpha_p) \)

where the hyperparameters \(\alpha\) are positive real numbers. I will explore different choices of these below.

Here is the Stan code for the Bradley-Terry model:

data {
	int<lower=0> N; // N games
	int<lower=0> P; // P teams
	
	// Each team is referred to by an integer that acts as an index for the ratings vector. 
	int team1[N]; // Indicator arrays for team 1
	int team2[N]; // Indicator arrays for team 2
	int results[N]; // Results. 1 if team 1 won, 0 if team 2 won.
	
	vector[P] alpha; // Parameters for Dirichlet prior.
}

parameters {
	// Vector of ratings for each team.
	// The simplex constrains the ratings to sum to 1 
	simplex[P] ratings;
}

model {
	real p1_win[N]; // Win probabilities for player 1
	
	ratings ~ dirichlet(alpha); // Dirichlet prior.
	
	for (i in 1:N){
		p1_win[i] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]]);
		results[i] ~ bernoulli(p1_win[i]);
	}
}

The way I have implemented the model, you need to supply the hyperparameters for the Dirichlet prior via R (or whatever interface you use to run Stan). The match outcomes should be coded as 1 if team 1 won, and 0 if team 2 won. The two variables team1 and team2 are vectors of integers that are used to reference the corresponding elements in the ratings parameter vector.

Before we fit the model to some data we need to consider what values we should give to the hyperparameters. Each of the parameters of the Dirichlet distribution corresponds to the rating of a specific team. Both the absolute and the relative magnitudes are important to consider. A simple case is when all hyperparameters have the same value. Setting all hyperparameters equal to each other, with a value greater than or equal to 1, implies a prior belief that all the ratings are the same. If they are all between 0 and 1, the prior belief is that the ratings are very different. The magnitude also plays a role here: the greater the magnitudes are, the stronger the prior belief that the ratings are the same.

Let's fit the model to some data. Below are the results from fitting the model to 104 games from the 2016-17 season of the Norwegian women's handball league, with 11 participating teams. I had to exclude six games that ended in a tie, since that kind of result is not supported by the Bradley-Terry model. Extensions exist that handle ties, but that will be for another time.

Below are the results of fitting the model with different sets of priors, together with the league points for comparison. For this purpose I didn't do any MCMC sampling; I only found the MAP estimates using the optimization procedures in Stan.
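In R this can be done roughly like in the sketch below, where the Stan file name and the games data frame are placeholders for however you have stored the model and the data.

library(rstan)

bt_model <- stan_model("bradley_terry_dirichlet.stan") # Hypothetical file name.

# games: data frame with integer team indices (1 to 11) and a result column
# that is 1 if team1 won and 0 if team2 won (ties removed).
stan_data <- list(
  N = nrow(games),
  P = 11,
  team1 = games$team1,
  team2 = games$team2,
  results = games$result,
  alpha = rep(10, 11) # Equal hyperparameters; change these to change the prior.
)

# MAP estimates via Stan's built-in optimizer (no MCMC).
map_fit <- optimizing(bt_model, data = stan_data)
sort(map_fit$par, decreasing = TRUE)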

[Figure: MAP ratings under the different Dirichlet priors, together with the league points.]

For all the priors the resulting ratings give the same ranking. This ranking also corresponds well with the ranking given by the league points, except for Gjerpen and Stabæk, which have switched places. We also clearly see the effect of the magnitude of the hyperparameters. When all the \(\alpha\)'s are 1 the ratings vary from almost 0 to about 0.6. When they are all set to 100 the ratings are almost all the same. If these ratings were used to predict the results of future matches, the magnitudes of the hyperparameters could be tuned using cross validation to find the value that gives the best predictions.

What if we used a different hyperparameter for each team? Below are the results when I set all \(\alpha\)'s to 10, except for the one corresponding to the rating for Glassverket, which I set to 25. We clearly see the impact. Glassverket is now considered to be the best team. This is nice since it demonstrates how actual prior information, if available, can be used to estimate the ratings.

[Figure: MAP ratings with \(\alpha\) = 25 for Glassverket and \(\alpha\) = 10 for the other teams.]

I also want to mention another way to fit the Bradley-Terry model, but without the sum-to-one constraint. The way to do this is by using a technique that the Stan manual calls soft centering. Instead of having a Dirichlet prior which enforces the constraint, we use a normal distribution prior. This prior will not give strict bounds on the parameter values, but will essentially provide a range of probable values they can take. For the model I chose a prior with mean 20 and standard deviation 6.

\( r_1, r_2, \dots, r_p \sim N(\mu = 20, \sigma = 6) \)

The prior mean here is arbitrary, but the standard deviation required some more consideration. I reasoned that the best team would probably be far out in the upper tail of the distribution, approximately three standard deviations above the mean. In this case that would imply a rating of 20 + 3*6 = 38. Similarly, the worst team would probably be rated three standard deviations below the mean, giving a rating of 2. This implies that the best team has a 95% chance of winning against the worst team, since 38/(38 + 2) = 0.95.

Here is the Stan code:

data {
	int<lower=0> N; 
	int<lower=0> P;

	int team1[N]; 
	int team2[N]; 
	int results[N];

	real<lower=0> prior_mean; // sets the (arbitrary) location of the ratings.
	real<lower=0> prior_sd; // sets the (arbitrary) scale of the ratings.

}

parameters {
	real<lower=0> ratings[P];
}

model {
	real p1_win[N];

	// soft centering (see stan manual 8.7)
	ratings ~ normal(prior_mean, prior_sd); 

	for (i in 1:N){
		p1_win[i] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]]);
		results[i] ~ bernoulli(p1_win[i]);
	}
}
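On the R side the only change from before is the data list, which now contains the two prior hyperparameters instead of the Dirichlet parameters:

norm_model <- stan_model("bradley_terry_softcenter.stan") # Hypothetical file name.

stan_data_norm <- list(
  N = nrow(games), P = 11,
  team1 = games$team1, team2 = games$team2, results = games$result,
  prior_mean = 20, prior_sd = 6
)

map_fit_norm <- optimizing(norm_model, data = stan_data_norm)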

And here are the ratings for the handball teams. The ratings are now on a different scale than before and largely match the prior distribution. The ranking given by this model agrees with the one from the model with the Dirichlet prior, with Gjerpen and Stabæk switched relative to the league ranking.

[Figure: MAP ratings from the model with the soft-centering normal prior.]

Tuning the Elo ratings: Initial ratings and inter-league matches

In the last post I discussed how to tune the Elo ratings to give the best predictive power by finding the optimal update factor (the K-factor) and the adjustment for home field advantage. One thing I only mentioned, but did not go into detail about, was that the teams' initial ratings will influence this tuning. In this post I will show how we can find good initial ratings that will also mitigate some other problems associated with Elo ratings.

As far as I can tell, setting the initial ratings does not seem to be much discussed. The Elo system updates the ratings by looking at the difference between the actual results and the results predicted by the rating difference between the two opposing teams. To get this to work in the earliest games in the data, you need to supply some initial ratings.

It is possible to set the initial ratings by hand, using your knowledge about the strengths of the different players and teams. This strategy is, however, difficult to use in practice: you may not have that knowledge, and incomplete knowledge would give inaccurate ratings. The task also becomes more difficult the more teams and players there are in your data. An automatic way to get the initial ratings is of course preferable.

The only automatic way to set the initial ratings I have seen is to set all ratings to be equal. This is what they do at FiveThirtyEight. This simple strategy is obviously not optimal. It is a bit far-fetched to assume that all teams are equally good at the beginning of your data, even if you could argue that you don't really know any better. If you have a lot of data going back a long time, then only the earliest period of your ratings will be unrealistic. After a while the ratings will become more realistic and better reflect the true strengths of the teams.

The unrealistic ratings in the earliest part of the data may also cause a problem if you use this period to find the optimal K-factor. In Elo ratings the K-factor is a parameter that determines how much new games will influence the ratings. A larger K-factor makes the ratings change a lot after a new game, while a low K-factor will make the ratings change only a little after each new game. If you are trying to make a rating system with good predictive ability and use the earliest games with the unrealistic ratings to tune the K-factor, then it will probably be overestimated. This is because a large K-factor will make the ratings change a lot at the beginning, making the ratings improve more quickly. A large K-factor will be good in the earliest part of your data, but later on it may be unrealistically large.

One more challenge with Elo ratings is if you include multiple leagues or competitions in your rating system. Since the Elo ratings are based on the exchange of points, groups of teams that play each other often, such as the teams in the same league, will have ratings that are reasonably well calibrated relative to each other. This is not the case across leagues: the rating difference between two teams from different leagues will not be as well calibrated as the difference between two teams in the same league.

A nice visualization of this is to plot which teams in your data have played each other. In the plot below each team is represented by a circle, and each line between two circles indicates that the two teams have played against each other. The data is from the Premier League and the Championship from 2010; two half-seasons for each division. I haven't added team names to the graph, but the orange circles are the teams that played in the Premier League in both seasons, while the blue circles are teams that played in the Championship or got promoted or relegated.

[Figure: Graph of which teams have played each other. Orange circles are teams that played in the Premier League in both seasons; blue circles are Championship teams and teams that were promoted or relegated.]

We clearly see that the two divisions cluster together, with a lot of comparisons available between the teams within each division. Six teams, those that got promoted or relegated between the two divisions, clearly fall in between the two large clusters. All comparisons between teams in the two divisions have to rely on the information that is available via these six teams. Including all these teams in an Elo rating, starting with all ratings equal, will surely give ratings that are completely wrong at first and that will take some time to become realistic. All point exchange between the divisions will have to happen via the promoted and relegated teams.

I have previously investigated this in the context of regression models, where I demonstrated how including data from the Championship improves the prediction of Premier League matches. See this and this.

So how can we find initial ratings that are realistic and also calibrated between two or more leagues? By using a small amount of data, say one year's worth, less than what you would use to tune the K-factor and home field advantage, you can use an optimization algorithm to find the ratings that best fit the observed outcomes. In doing this you use the formula that converts the ratings to expected outcomes, but not the update formula, so this approach can be seen as a static version of the Elo ratings.

Doing the direct optimization is however not completely straightforward. The Elo rating system is a zero-sum system: no points are added to or removed from the system, only exchanged. This constraint is similar to the sum-to-zero constraint that is sometimes used in regression modeling and analysis of variance. To handle this, we can simply set the rating of one of the teams to the negative sum of the ratings of all the other teams.

A further refinement is to include the home field advantage in the optimization. In cases where the teams have an unequal number of home games, or where some games are played on neutral ground, this will give more accurate ratings. If not, the ratings of the teams with an excess of home games will become unrealistically large.
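Here is a rough sketch of how this static fit could be done in R, with the sum-to-zero constraint handled by letting the last team's rating be the negative sum of the others, and the home field advantage included as an extra parameter. Column names assume data from football-data.co.uk; the function name is my own.

staticElo <- function(data, avgRating = 1500){
  
  home <- as.character(data$HomeTeam)
  away <- as.character(data$AwayTeam)
  teams <- sort(unique(c(home, away)))
  p <- length(teams)
  
  # Observed results for the home team: 1 for a win, 0.5 for a draw, 0 for a loss.
  obsHome <- ifelse(data$FTHG > data$FTAG, 1, ifelse(data$FTHG == data$FTAG, 0.5, 0))
  
  objective <- function(par){
    hfa <- par[1]
    # Sum-to-zero constraint: the last rating is minus the sum of the others.
    ratings <- c(par[-1], -sum(par[-1]))
    names(ratings) <- teams
    
    ratingDiff <- ratings[home] + hfa - ratings[away]
    expHome <- 1 / (1 + 10^(-ratingDiff/400))
    
    # Squared error criterion (the away term is identical, so it is left out).
    sum((expHome - obsHome)^2)
  }
  
  res <- optim(par = rep(0, p), fn = objective, method = "BFGS")
  
  ratings <- c(res$par[-1], -sum(res$par[-1])) + avgRating
  names(ratings) <- teams
  
  list(ratings = sort(ratings, decreasing = TRUE), hfa = res$par[1])
}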

Doing this procedure on the data from the Premier League and the Championship from 2010 that I used to make the graph above, I get the following ratings (with the average rating set to 1500):

[Figure: Optimized initial ratings for the Premier League and Championship teams, centered on an average of 1500.]

The procedure also estimated the home field advantage to be 84.3 points.

The data I used for the initial ratings is the first year of the data I used to tune the K-factor in the previous post. How does using these initial ratings influence this tuning, compared with using the same initial rating for all teams? As expected, the optimal K-factor is smaller. The plot below shows that K=14 is optimal, compared with the K=18.5 I found last time. It is also interesting to see that the ratings with proper initialization are more accurate than those without for the whole range of K's I tested.

[Figure: Mean squared error against the K-factor with optimized initial ratings; the optimum is at K = 14.]

Tuning the Elo ratings: The K-factor and home field advantage

The Elo rating system is quite simple, and therefore easy to implement. In football, FIFA uses it in its women's rankings, and the well-respected website fivethirtyeight.com uses Elo ratings to make predictions for NBA and NFL games. Another cool Elo rating site is clubelo.com.

Three years ago I posted some R code for calculating Elo ratings. Its simplicity also makes it easy to modify and extend to include more realistic aspects of the games and competitions that you want to make ratings for, for example home field advantage. I suggest reading the detailed description of the clubelo ratings to get a feel for how the system can be modified to give improved ratings. I have also discussed some ways to extend the Elo ratings here on this blog.

If you implement your own variant of the Elo ratings it is necessary to tune the underlying parameters to make the ratings as accurate as possible. For example, a K-factor that is too small will give ratings that update too slowly; the ratings will not adapt well to recent developments. Conversely, a K-factor that is too large will put too much weight on the most recent results. The same goes for the extra points added to the home team's rating to account for the home field advantage. If this is poorly tuned, you will get poor predictions.

In order to tune the rating system, we need a way to measure how accurate the ratings are. Luckily the formulation of the Elo system itself can be used for this. The Elo system updates the ratings by looking at the difference between the actual results and the results predicted by the rating difference between the two opposing teams. This difference can be used to tune the parameters of the system. The smaller this difference is, the more accurate the predictions are, so we want to tune the parameters to make this difference as small as possible.

To formulate this more formally, we use the following criterion to assess the model accuracy:

\( \sum_i [ (exp_{hi} - obs_{hi})^2 + (exp_{ai} - obs_{ai})^2 ] \)

where \(exp_{hi}\) and \(exp_{ai}\) are the expected results of match i for the home team and the away team, respectively. These expectations are numbers between 0 and 1, calculated from the ratings of the two teams. \(obs_{hi}\) and \(obs_{ai}\) are the actual results of match i, encoded as 0 for a loss, 0.5 for a draw and 1 for a win. This criterion is the squared error; in practice we will use the mean squared error over all matches.

With this criterion in hand, we can try to find the best K-factor. Using data from the English Premier League as an example, I applied the ratings to the match results from January 1st 2010 to the end of the 2014-15 season, a total of 2048 matches. I tried different values of the K-factor between 7 and 25, in 0.1 increments. Plotting the mean squared error against the K-factor, we see that 18.5 is the best K-factor.

[Figure: Mean squared error against the K-factor; the minimum is at K = 18.5.]
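The search itself could be set up roughly like this: a helper that runs through the matches with a given K-factor and home field advantage and returns the mean squared error, and then a loop over candidate K values. Column names again assume football-data.co.uk data, sorted by date, in a data frame called dta.

eloMSE <- function(data, kfactor, hfa = 0, initialRating = 1500){
  
  home <- as.character(data$HomeTeam)
  away <- as.character(data$AwayTeam)
  teams <- unique(c(home, away))
  ratings <- setNames(rep(initialRating, length(teams)), teams)
  
  sqErrors <- numeric(nrow(data))
  
  for (idx in 1:nrow(data)){
    expHome <- 1 / (1 + 10^((ratings[away[idx]] - (ratings[home[idx]] + hfa))/400))
    
    goalDiff <- data$FTHG[idx] - data$FTAG[idx]
    obsHome <- ifelse(goalDiff > 0, 1, ifelse(goalDiff == 0, 0.5, 0))
    
    # Squared error for both the home and the away side of the match.
    sqErrors[idx] <- (expHome - obsHome)^2 + ((1 - expHome) - (1 - obsHome))^2
    
    # The usual Elo updates.
    ratings[home[idx]] <- ratings[home[idx]] + kfactor*(obsHome - expHome)
    ratings[away[idx]] <- ratings[away[idx]] + kfactor*((1 - obsHome) - (1 - expHome))
  }
  
  mean(sqErrors)
}

# Try K-factors from 7 to 25 in 0.1 increments and plot the error curve.
kk <- seq(7, 25, by = 0.1)
mse <- sapply(kk, function(k) eloMSE(dta, kfactor = k))
plot(kk, mse, type = "l", xlab = "K-factor", ylab = "Mean squared error")
kk[which.min(mse)]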

The K-factor I have found here is, however, probably a bit too large. In this experiment I initialized the ratings for all teams to 1500. This includes the teams that were promoted from the Championship. A more realistic rating system would initialize these teams with a lower rating, perhaps giving them the ratings of the relegated teams.

We can of course use this strategy to also find the best adjustment for the home field advantage. The simple way to add the home field advantage is to add some additional points to the rating of the home team. Here I have used the same number of points in all matches across all seasons, but different strategies are possible. To find the optimal home field advantage I applied the Elo ratings with K=18.5, using different home field advantages.

[Figure: Mean squared error against the home field advantage with K = 18.5; the minimum is at 68.3 points.]

From this plot we see that an additional 68.3 points is the optimal amount to add to the rating for the home team.

One might wonder if finding the best K-factor and home field advantage independently of each other is the best way to do it. When I tried to find the best K-factor with the home field advantage set to 68, I found that the best K was 19.5. This is a bit higher than when the home field advantage was 0. I then tried to find the optimal pair of K and home field advantage by searching over a grid of possible values. Plotting the accuracy of the ratings against both K and the home field advantage in a contour plot, we get the following:

[Figure: Contour plot of the mean squared error against the K-factor and the home field advantage.]

The best K and home field advantage pair can be read from the plot, both of which are a bit higher than the first values I found.

Doing the grid search can take a bit of time, especially if you don’t narrow down the search space by doing some initial tests beforehand. I haven’t really tried it out, but alternating between finding the best K-factor and home field advantage and using the optimal value from the previous round is probably going to be a reasonable strategy here.
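A grid search could be set up along these lines, reusing the eloMSE() helper sketched above (narrowing the ranges with some initial tests first saves a lot of time):

grid <- expand.grid(k = seq(14, 24, by = 0.5), hfa = seq(50, 100, by = 2.5))
grid$mse <- mapply(function(k, h) eloMSE(dta, kfactor = k, hfa = h), grid$k, grid$hfa)
grid[which.min(grid$mse), ]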

BBC’s More Or Less on why the men’s FIFA rankings fail

One of the podcasts I listen to regularly, ‘More Or Less’ from the BBC, had the other day an episode about the (men's) FIFA rankings. In the episode they discuss a shortcoming in the ranking system that makes it possible for a team to lose points (and thus ranking positions) despite winning a match. The reason for this is not fully explained, but looking closer at the descriptions provided at fifa.com I think I see where the problem lies. After each match, rating points are given to the winner (or split if there is a draw). The crucial thing here is that friendly matches (or other non-important matches) give fewer points than important tournament matches. The published ratings are then basically an average over the points earned for the matches played in the last couple of years. That means that winning a friendly match will sometimes yield fewer points than a team's current average, thus decreasing the average.

Unfortunately the episode did not mention the women’s FIFA ranking system which is based on the much better Elo system, used in chess rankings (and which I have written about previously). In this sort of system a win will almost surely give more points, and not less (the worst case scenario for a win is that no points are earned).

Elo ratings in football: Home field advantage

In my first post about Elo ratings in football I posted the code for an R function where you could adjust the ratings to account for home field advantage. The method is simple: some extra points are added to the home team's rating when the match predictions (which are based on the ratings) are calculated. My implementation only supported giving the same fixed amount of extra points to all teams in all games. In other words it is assumed that all teams have the same home field advantage, and that the home field advantage does not change over time. This is of course unrealistic if the point of the ratings is to give as accurate predictions as possible. Still, the method is used in the FIFA Women's World Ranking and in the World Football Elo Ratings.

I know of two ways to implement a more dynamic (and more realistic) home field advantage. The ClubElo ratings, developed by Lars Schiefler (ClubElo is perhaps the best football rating site out there), let the home field advantage change over time, similar to how the ratings themselves change. This is done by updating the home field advantage after each game based on the home team's performance. An article on the ClubElo site describes the details very well.

A rather different method is used in the pi-rating system, developed by Anthony Constantinou. Each team has two ratings, one describing performance when playing at home, the other when playing away. The cool thing about the way this is done is that the two ratings for a team are not calculated separately from each other. It is not the case that only the home matches are used to calculate the home rating and only the away matches to calculate the away rating; after each match both ratings are updated. The home rating of the home team is updated almost like the regular Elo ratings, and the away rating is then also updated based on how much the home rating has changed, scaled by a factor. That way the two ratings are allowed to deviate from each other, giving rise to an adaptive, team-specific home field advantage. The same procedure is of course also applied to the away team's ratings.

The pi-ratings, by the way, are interesting in other ways besides the method for determining the home field advantage. Instead of the ratings being somewhat arbitrary numbers, as in most Elo systems, the pi-ratings directly model goal differences. The details are described in the paper Determining the level of ability of football teams by dynamic ratings based on the relative discrepancies in scores between adversaries. A draft of the paper is also available at the pi-ratings website. While I am at it, I can also recommend Constantinou's other papers on football prediction.

FIFA Women’s World Ranking and goal difference in Elo ratings.

The FIFA rankings for women's national teams use a quite different methodology than the one used for ranking men's national teams. The Women's World Ranking (WWR) is based on the Elo rating system I wrote about in the previous post. The details for the men's ranking can be found here, and the details for the women's ranking can be found here.

One thing that makes the WWR interesting is how goal differences are accounted for in the ratings. This is not something found in the ordinary chess-based Elo ratings. The method used in WWR is to let the ‘winning percentage’ change depending on two things: the goal difference and the number of goals the losing team scored. This is in contrast to the original Elo ratings, where the winning team wins 100% and the losing team wins 0% (a draw is 50%). In WWR a team can never win 100%; the most it can win is 99%. This is the case when the goal difference is 6 or more and the losing team hasn't scored any goals. The table below is from the pdf file linked to above and shows the winning percentage for the losing team.

[Table: Winning percentage for the losing team, by goal difference and number of goals scored by the losing team.]

One strange thing about this table is the column for goal difference 0, implying a draw. I am guessing this is an error, since it would mean that the winning percentage for the losing team is greater than that of the winning team. In the paper I mentioned in my last post, where a number of different rating methods were compared (The predictive power of ranking systems in association football by J. Lasek et al.), it was assumed that a draw would yield both teams 50%, as in the original Elo ratings. That paper also showed that the Women's World Ranking was among the rating systems with the best predictive performance.

Here is a plot showing the win percentage when the losing team has scored 0 and 5 goals. We can see that there is not much to gain for the losing team by scoring one extra goal (assuming the goal difference stays the same, which of course is dubious), and that most of the gain in winning percentage comes when a team scores a goal such that the result goes from a draw to a win.

[Figure: Winning percentage by goal difference when the losing team has scored 0 and 5 goals.]

The table above only goes up to 5 goals for the losing team, but for the sake of implementation it is easy to generalize the rule for how much the losing team gains by scoring an additional goal (while the goal difference stays the same). For example, the losing team gains 0.9 extra percentage points per goal scored when the goal difference is 2. Similarly, the gain is 0.6 percentage points per goal when the goal difference is 5.

Below is an R function to compute the win percentage for football matches. It takes as input two vectors with the number of goals scored by the two opponents and returns a vector of win percentages (numbers between 0 and 1) for the first team.

winPercentageWWR <- function(team1Goals, team2Goals){
  #calculates the win percentage for team 1.

  stopifnot(length(team1Goals) == length(team2Goals))
  
  # Win percentages for team 1 for goal differences -6 to 6 (losing team scoring 0 goals).
  perc <- c(0.01, 0.02, 0.03, 0.04, 0.08, 0.15, 0.50, 0.85, 0.92, 0.96, 0.97, 0.98, 0.99)
  # Adjustment per goal scored by the losing team, indexed by absolute goal difference (0 to 6).
  add <- c(0, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005)
  
  # Cap the goal differences at +/- 6.
  goalDifferences <- (team1Goals - team2Goals)
  goalDifferences[goalDifferences < -6] <- -6
  goalDifferences[goalDifferences > 6] <- 6

  team1WinPercentage <- numeric(length=length(goalDifferences))

  for (idx in 1:length(goalDifferences)){

    team1WinPercentage[idx] <- perc[goalDifferences[idx]+7] -  
      sign(goalDifferences[idx])*min(team1Goals[idx], team2Goals[idx])*add[abs(goalDifferences[idx])+1]
  }
  return(team1WinPercentage) 
}
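A quick example, with a 3-1 win, a 0-0 draw and a 1-2 loss for team 1:

winPercentageWWR(c(3, 0, 1), c(1, 0, 2))
# [1] 0.911 0.500 0.160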

Elo ratings in football

I have previously written about some statistical methods for rating football teams and predicting the results of future matches. One was the least squares method and another was the Poisson regression method. None of these methods make good enough predictions. One problem with them is that they don't incorporate a time perspective: matches played a year ago are given the same importance as the most recent ones. This could however be incorporated by weighting older matches less than newer ones. Another problem, which I mentioned in the second post about Poisson regression, is that teams are treated as categorical variables, which makes it hard to model the fact that a team's ability changes over time.

A different kind of method that has been employed a lot in recent years is the Elo rating system, which was originally developed for rating chess players. The method is rather simple, but I will not explain it in detail here since there are many good explanations of it elsewhere. Wikipedia has a very thorough coverage. The basic principle is that the difference in ratings between the two opposing teams provides a prediction of the result of each game. The ratings are then updated based on how the teams perform. If a team performs better than expected its rating increases; if it performs worse than expected its rating decreases. How much the rating changes depends on an update factor (often referred to as the K-factor).
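For reference, the two formulas at the core of the system, and the ones used in the code at the bottom of this post, are the expected score for team A against team B,

\( E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} \)

and the rating update after the match,

\( R_A' = R_A + K(S_A - E_A) \)

where \(S_A\) is the actual result, coded as 1 for a win, 0.5 for a draw and 0 for a loss.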

Chess and football are of course different in many ways, so the method for rating chess players is not directly suitable for rating football teams. The relative simplicity of the Elo system makes it easy to tweak and adjust to better fit football by incorporating things like home field advantage and goal difference. There are many sites around the Internet that provide different variants of Elo ratings, like the World Football Elo Ratings for national teams and Club Elo and Euro Club Index for club teams. FIFA even uses its own Elo system in its Women's World Ranking.

There has even been some research into different football rating systems. A paper titled The predictive power of ranking systems in association football (pdf) by Jan Lasek and others compared different rating systems. Their conclusion was that the Elo-type systems were in general better at predicting match outcomes than the other types of rating systems.

I figured I wanted to implement a simple Elo rating system for rating football teams. There is already a package in R, PlayerRatings, which implements several different rating systems based on Elo. In my simple implementation there is no adjustment for goal difference, but there is support for home field advantage. All teams start with an initial rating of 1500. Here is what I got when I calculated the ratings for the Premier League in November 2012, based on data going back to 1993. I used an update factor of 24 without any home field advantage. There is no particular reason for these choices, as I did this mostly as a proof of concept.

Team          Rating (November 2012)
Man United 1807
Man City 1767
Chelsea 1696
Arsenal 1658
Tottenham 1645
Everton 1640
Newcastle 1613
Fulham 1591
Liverpool 1567
West Brom 1562
Leeds 1552
Wigan 1543
Swansea 1526
Sunderland 1524
Stoke 1521
Middlesboro 1516
Norwich 1509
Aston Villa 1498
West Ham 1494
Birmingham 1493
Blackpool 1483
Ipswich 1481
Bolton 1479
Charlton 1470
Sheffield United 1458
Blackburn 1450
Reading 1448
Sheffield Weds 1447
Coventry 1447
Middlesbrough 1443
QPR 1440
Barnsley 1439
Portsmouth 1438
Southampton 1437
Oldham 1436
Crystal Palace 1433
Leicester 1433
Nott’m Forest 1430
Hull 1422
Burnley 1418
Wolves 1414
Wimbledon 1413
Watford 1411
Bradford 1406
Swindon 1404
Derby 1297

The table seems reasonable, I think, except for a couple of things. There is a problem related to relegation and promotion. Since I have used data back to 1993, every team that has played in the Premier League since then is given a rating. If a team is relegated to the Championship, its rating will no longer be updated. We can see that this creates some strange results. Take the two lowest rated teams, for example. Derby has not been in the Premier League since the 2007-08 season. Swindon, which is rated about 100 points higher than Derby, has not played in the Premier League since the 1993-94 season! Swindon now play in the fourth level of the English league system. So the ratings for the teams not in the Premier League should be considered invalid.

Relegation and promotion also create a problem with inflated ratings. The Elo system is constructed so that the total number of points in the league stays constant. When a team is promoted it starts with an initial rating of 1500, and if it later gets relegated it will probably have lost some of those points to the other teams in the league. In fact, we see that many of the teams with ratings less than 1500 no longer play in the Premier League. The points they have lost are still present in the league even though the teams aren't. This means that over time the average rating of the teams in the league will increase.

The code I have written takes a data frame as input and works “out of the box” with data from football-data.co.uk. If you are going to use it yourself, you have to make sure the data is sorted by date, as the rating function just loops from top to bottom.

Here is how you can use it:

dta <- read.csv("yourdata.csv")
elo <- eloRating(data=dta)
print(elo)
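If the data is not already in chronological order, something like this should do it (the date format in the football-data.co.uk files varies between seasons, so adjust accordingly):

dta$Date <- as.Date(dta$Date, format = "%d/%m/%y") # or "%d/%m/%Y" for some files.
dta <- dta[order(dta$Date), ]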

And here is the code:


eloRating <- function(home="HomeTeam", away="AwayTeam", homeGoals="FTHG",
                      awayGoals="FTAG", data, kfactor=24, initialRating=1500,
                      homeAdvantage=0){
  
  #Make a list to hold ratings for all teams
  all.teams <- sort(unique(c(as.character(data[[home]]),
                             as.character(data[[away]]))))
  
  ratings <- as.list(rep(initialRating, times=length(all.teams)))
  names(ratings) <- all.teams

  #Loop through data and update ratings
  for (idx in 1:dim(data)[1]){
  
    #get current ratings (as.character in case the team columns are factors)
    homeTeamName <- as.character(data[[home]][idx])
    awayTeamName <- as.character(data[[away]][idx])
    homeTeamRating <- as.numeric(ratings[homeTeamName]) + homeAdvantage
    awayTeamRating <- as.numeric(ratings[awayTeamName])
    
    #calculate expected outcome 
    expectedHome <- 1 / (1 + 10^((awayTeamRating - homeTeamRating)/400))
    expectedAway <- 1 - expectedHome
    
    #Observed outcome
    goalDiff <- data[[homeGoals]][idx] - data[[awayGoals]][idx]
    if (goalDiff == 0){
      resultHome <- 0.5
      resultAway <- 0.5
    }
    else if (goalDiff < 0){
      resultHome <- 0
      resultAway <- 1
    }
    else if (goalDiff > 0){
      resultHome <- 1
      resultAway <- 0
    }
    
    #update ratings
    ratings[homeTeamName] <- as.numeric(ratings[homeTeamName]) + kfactor*(resultHome - expectedHome)
    ratings[awayTeamName] <- as.numeric(ratings[awayTeamName]) + kfactor*(resultAway - expectedAway)
  }
  
  #prepare output
  ratingsOut <- as.numeric(ratings)
  names(ratingsOut) <- names(ratings)
  ratingsOut <- sort(ratingsOut, decreasing=TRUE)

  return(ratingsOut) 
}