A simple re-implementation of the Dixon-Coles model

A couple of years ago I implemented the Dixon-Coles model for predicting football results here on this blog. That series of of blog posts is my most popular since I keep getting comments on it, some four years later.

One of the most common requests is advice on how to expand the model to include additional predictors. Unfortunately with the implementation I posted this was not so straightforward. It relied on some design matrices with dummy-coded variables, which is a standard way of doing things in regression modeling. The DC model isn’t a standard regression modeling problem, so using matrices complicated things. I posted some updates and variant across several posts, which in the end made the code a bit hard to follow and to modify.

Anyway, I’ve had a simpler implementation lying around for a while, and since there’s been far between updates on this blog lately I thought I’d post it.

First load some data from the engsoccerdata package. I’m going to use the 2011-12 season of the English Premier League, so the results can be compared with what I got from the first implementation.

library(dplyr)
library(engsoccerdata)

england %>% 
  filter(Season == 2011,
         tier==1) %>% 
  mutate(home = as.character(home),
         visitor = as.character(visitor))-> england_2011

Next we should create a list of initial parameter values. This will be used as a starting point for estimating the parameters. The list contains vectors of four groups of parameters, the attack and defense parameters of all teams, the home field advantage and the Dixon-Coles adjustment (rho). The attack and defense vector are named so that it is easy to look up the relevant parameter later on.

Notice also that a sum-to-zero constraint has to be added to the defense parameters, so in reality we are estimating one defense parameter less than the number of teams. Check this post for some more explanation of this.

# Make a vector of all team names. 
all_teams <- sort(unique(c(england_2011$home, england_2011$AwayTeam)), decreasing = FALSE)
n_teams <- length(all_teams)

# list of parameters with initial values.
parameter_list <- list(attack = rep(0.2, n_teams),
                      defense = rep(-0.01, n_teams-1),
                      home = 0.1,
                      rho= 0.00)

names(parameter_list$attack) <- all_teams
names(parameter_list$defense) <- all_teams[-1] # the first parameter is computed from the rest.

Next we need a function that calculates the negative log-likelihood function, to be used with R’s built in optimizer.

One trick I use here is to relist the parameters. The optimizer want all parameter values as a single vector. When you have a lot of parameters that group together and is used in different parts of the model, this can quickly create some complicated indexing and stuff. By supplying the original parameter list, plus having named vectors, these problems essentially disappear.

Also notice how the expected goals are now simply computed by looking up the relevant parameters in the parameter list and adding them together. No need for matrix multiplications.

The Dixon-Coles adjustment function tau is the same as in the original implementation.

dc_negloglik <- function(params, goals_home, goals_visitor,
                     team_home, team_visitor, param_skeleton){
  
  # relist, to make things easier.
  plist <- relist(params, param_skeleton)
  
  # There is a sum-to-zero constraint on defense parameters.
  # The defense parameter for the first team is computed from the rest.
  plist$defense <- c(sum(plist$defense)*-1, plist$defense)
  names(plist$defense)[1] <- names(plist$attack[1]) # add name to first element.

  # Home team expected goals
  lambda_home <- exp(plist$attack[team_home] + plist$defense[team_visitor] + plist$home)
  
  # Away team expected goals
  lambda_visitor <- exp(plist$attack[team_visitor] + plist$defense[team_home])
  
  # Dixon-Coles adjustment
  dc_adj <- tau(goals_home, goals_visitor, lambda_home, lambda_visitor, rho = plist$rho)
  
  # Trick to avoid warnings.
  if (any(dc_adj <= 0)){
    return(Inf)
  }
  
  # The log-likelihood
  log_lik_home <- dpois(goals_home, lambda = lambda_home, log=TRUE)
  log_lik_visitor <- dpois(goals_visitor, lambda = lambda_visitor, log=TRUE)
  
  log_lik <- sum((log_lik_home + log_lik_visitor + log(dc_adj)))
  
  return(log_lik*-1)
  
}

To actually estimate the parameters we feed the function, data and initial values to optim, and check the results.


optim_res <- optim(par = unlist(parameter_list), fn=dc_negloglik,
                   goals_home = england_2011$hgoal,
                   goals_visitor = england_2011$vgoal,
                   team_home = england_2011$home, team_visitor = england_2011$visitor,
                   param_skeleton=parameter_list, method = 'BFGS')

# relist, and calculate the remaining parameter. 
parameter_est <- relist(optim_res$par, parameter_list)
parameter_est$defense <- c( sum(parameter_est$defense) * -1, parameter_est$defense)
names(parameter_est$defense)[1] <- names(parameter_est$attack[1]) 

I get the same home field advantage (0.27) and rho (-0.13) as in the original implementation. The other parameters differ, however. This is because of the sum-to-zero constraints are coded in a different way. This should not matter and both ways should give the same predictions.

I have not yet said anything about how to expand the model to include other predictors, but hopefully this implementation should make it easier. You can just add some new arguments to the dc_negloglik function that takes the variables in question as input, and add new parameter vectors to the parameter list as needed. Then the calculations of the expected goals should be modified to include the new parameters and predictors.

The Bayesian Bradley-Terry model with draws

In the previous post I tried out the Stan software to implement two Bayesian versions of the Bradley-Terry (BT) model. One drawback of the Bradley-Terry model is that it can’t handle draws, which seriously hampers its utility in modelling sports data. That was one reason I used handball results rather than football results as the example, since draws are rare in handball.

One (of several) extension of the BT model that can handle draws is the Davidson model. This was developed in the 1970 paper ‘On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments‘. In short, the model adds a new parameter, \(\nu\), which influences the probability of a draw. When \(\nu = 0\), the model becomes the ordinary BT model.

In my Stan implementation below I use a Dirichlet prior on the ratings, like last time. The consequence of this is that the sum of all the ratings is 1. In the BT model this gives us the the nice interpretation that the rating the probability of a team of winning against an hypothetical average team. This property is not exactly carried over to the Davidson model, but a related property is. The ratio of two ratings, \(\pi_1 / \pi_2\), is the probability that team 1 wins against team 2, applies to both the BT model and the Davidson model

In my implementation of the BT model I used the Bernoulli distribution to model the outcomes, which is appropriate when we only have two outcomes. As you can see from the code below, we now have to use the categorical distribution, since we now have three outcomes. I also use an exponential prior on \(\nu\). Admittedly, I have no particular reason for this except that it is the traditional choice for parameters that has to have only positive values.

Anyway, here is the Stan code:

data {
	int<lower=0> N; // N games
	int<lower=0> P; // P teams
	
	// Each team is referred to by an integer that acts as an index for the ratings vector. 
	int team1[N]; // Indicator arrays for team 1
	int team2[N]; // Indicator arrays for team 1
	int results[N]; // Results. 1 if home win, 2 if away won, 3 if a draw. 
	
	real<lower=0> nu_prior_rate;
	
	vector[P] alpha; // Parameters for Dirichlet prior.
}

parameters {
	// Vector of ratings for each player
	// The simplex constrains the ratings to sum to 1 
	simplex[P] ratings;
	
	// Parameter adjusting the probability of draw.
	real<lower=0> nu; 
}


model {
	
	// Array of length 3 vectors for the three outcome probabilies for each game. 
	vector[3] result_probabilities[N];
	real nu_rating_prod;
	
	ratings ~ dirichlet(alpha); // Dirichlet prior on the ratings.
	
	nu ~ exponential(nu_prior_rate); // exponential prior on nu.
	
	for (i in 1:N){
		
		// nu multiplied by the harmonic mean of the ratings.
		nu_rating_prod = sqrt(ratings[team1[i]] * ratings[team2[i]]) * nu;
		
		result_probabilities[i][3] = nu_rating_prod / (ratings[team1[i]] + ratings[team2[i]] + nu_rating_prod);
		result_probabilities[i][1] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]] + nu_rating_prod);
		result_probabilities[i][2] = 1 - (result_probabilities[i][1] + result_probabilities[i][3]);
		
		results[i] ~  categorical(result_probabilities[i]);
	}
	
}

Another thing I wanted to do this time was to do proper MCMC sampling, so we could get the Bayesian posterior credibility intervals. The sampling takes longer time than the optimization procedure I used last time, but it only took a few seconds to get a decent amount of samples.

For the reanalysis of the handball data from last time I set the Dirichlet prior parameters to 5 for all teams, and the rate parameter for the exponential prior on \(\nu\) is 1. We can visualize the estimate of the ratings and their uncertainties (95% intervall) using a forest plot:

The results agree with the ones from last time, but this time we also see that the credibility intervals are rather large. This is perhaps not that surprising, since the amount of data is rather limited. The posterior (mean) point estimate for \(\nu\) is 0.15.

But let’s take a look at some English Premier League football data. With the ordinary BT model this would not work so well since there’s a lot of draws in football. Ignoring them would not be tenable. Below are the ratings, with 95% credibility interval, based on data from the 2015-15 season, using the same prior parameters as in the handball data set. The league points are shown in parenthesis for comparison.

The ratings generally agree with the points, except in a few instances, where a team or two have switched places. Another interesting thing to notice is that the width of the intervals seem to be related to the magnitude of the rating. I am not exactly sure why that is, but I suspect its due to the fact that the ratings are in a sense binomial probabilities, and these are known to have greater variance the closer they are to 0.5.

The point estimate for \(\nu\) is 0.85 for this data set. Compared to the 0.15 for the handball data, it is clear that this reflects the higher overall probability of draws in football. In the handball data set only 6 games ended in a draw, while in the football data set about 20% of the games was a draw.

Fitting Bradley-Terry models using Stan

I have recently played around with Stan, which is an excellent software to fit Bayesian models. It is similar to JAGS, which I have used before to fit some regression models for predicting football results. Stan differs from JAGS in a number of ways. Although there is some resemblance between the two, the model specification languages are not compatible with each other. Stan, for instance, uses static typing. On the algorithmic side, JAGS uses the Gibbs sampling technique to sample from the posterior; Stan does not do Gibbs sampling, but has two other sampling algorithms. In Stan you can also get point estimates by using built-in optimization routines that search for the maximum of the posterior distribution.

In this post I will implement the popular Bradley-Terry machine learning model in Stan and test it on some sports data (handball, to be specific).

The Bradley-Terry model is used for making predictions based on paired comparisons. A paired comparison in this context means that two things are compared, and one of them is deemed preferable or better than the other. This can for example occur when studying consumer preferences or ranking sport teams.

The Bradley-Terry model is really simple. Suppose two teams are playing against each other, then the probability that team i beats team j is

\( p(i > j) = \frac{r_i}{r_i + r_j} \)

where \(r_i\) and \(r_j\) are the ratings for the two teams, and should be positive numbers. It is these ratings we want to estimate.

A problem with the model above is that the ratings are not uniquely determined. To overcome this problem the parameters need to be constrained. The most common constraint is to add a sum-to-one constraint

\( \sum_k r_k = 1 \)

I will explore a different constraint below.

Sine we are in a Bayesian setting we need to set a prior distribution for the rating parameters. Given the constraints that the parameters should be positive and sum-to-one the Dirichlet distribution is a natural choice of prior distribution.

\( r_1, r_2, …, r_p \sim Dir(\alpha_1, \alpha_2, …, \alpha_p) \)

where the hyperparameters \(\alpha\) are positive real numbers. I will explore different choices of these below.

Here is the Stan code for the Bradley-Terry model:

data {
	int<lower=0> N; // N games
	int<lower=0> P; // P teams
	
	// Each team is referred to by an integer that acts as an index for the ratings vector. 
	int team1[N]; // Indicator arrays for team 1
	int team2[N]; // Indicator arrays for team 1
	int results[N]; // Results. 1 if team 1 won, 0 if team 2 won.
	
	vector[P] alpha; // Parameters for Dirichlet prior.
}

parameters {
	// Vector of ratings for each team.
	// The simplex constrains the ratings to sum to 1 
	simplex[P] ratings;
}

model {
	real p1_win[N]; // Win probabilities for player 1
	
	ratings ~ dirichlet(alpha); // Dirichlet prior.
	
	for (i in 1:N){
		p1_win[i] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]]);
		results[i] ~ bernoulli(p1_win[i]);
	}
}

The way I implemented the model you need to supply the hyperparameters for the Dirichlet prior via R (or whatever interface you use to run Stan). The match outcomes should be coded as 1 if team 1 won, 0 if team 2 won. The two variables team1 and team2 are vectors of integers that are used to reference the corresponding parameters in the ratings parameter vector.

Before we fit the model to some data we need to consider what values we should give to the hyperparameters. Each of the parameters of the Dirichlet distribution corresponds to the rating of a specific team. Both the absolute magnitude and the relative magnitudes are important to consider. A simple case is when all hyperparameters have the same value. Setting all hyperparameters to be equal to each other, with a value greater or equal to 1, implies a prior belief that all the ratings are the same. If they are between 0 and 1, the prior belief is that the ratings are really really different. The magnitude also plays a role here. The greater the magnitudes are, the stronger the prior belief that the ratings are the same.

Let’s fit some data to the model. Below are the results from fitting the results from 104 games from the 2016-17 season of the Norwegian women’s handball league, with 11 participating teams. I had to exclude six games that ended in a tie, since that kind of result is not supported by the Bradley-Terry model. Extension exists that handle this, but that will be for another time.

Below are the results of fitting the model with different sets of priors, together with the league points for comparison. For this purpose I didn’t do any MCMC sampling, I only found the MAP estimates using the optimization procedures in Stan.

res_dir

For all the priors the resulting ratings give the same ranking. This ranking also corresponds well with the ranking given by the league points, except for Gjerpen and Stabæk which have switched place. We also clearly see the effect of the magnitude of the hyperparameters. When all the \(\alpha\)‘s are 1 the ratings varies from almost 0 to about 0.6. When they are all set to 100 the ratings are almost all the same. If these ratings were used to predict the result of future matches the magnitudes of the hyperparameters could be tuned using cross validation to find the value that give best predictions.

What if we used a different hyperparameter for each team? Below are the results when I set all \(\alpha\)‘s to 10, except for the one corresponding to the rating for Glassverket, which I set to 25. We clearly see the impact. Glassverket is now considered to be the best team. This is nice since it demonstrates how actual prior information, if available, can be used to estimate the ratings.

res_dir2

I also want to mention another way to fit the Bradley-Terry model, but without the sum-to-one constraint. The way to do this is by using a technique that the Stan manual calls soft centering. Instead of having a Dirichlet prior which enforces the constraint, we use a normal distribution prior. This prior will not give strict bounds on the parameter values, but will essentially provide a range of probable values they can take. For the model I chose a prior with mean 20 and standard deviation 6.

\( r_1, r_2, …, r_p \sim N(\mu = 20, \sigma = 6) \)

The mean hyperprior here is arbitrary, but the standard deviation required some more considerations. I reasoned that the best team would probably be in the top 99 percentile of the distribution, approximately three standard deviations above the mean. In this case this would imply a rating of 20 + 3*6 = 38. Similarly, the worst team would probably be rated three standard deviations below the mean, giving a rating of 2. This implies that the best team has a 95% chance of winning against the worst team.

Here is the Stan code:

data {
	int<lower=0> N; 
	int<lower=0> P;

	int team1[N]; 
	int team2[N]; 
	int results[N];

	real<lower=0> prior_mean; // sets the (arbitrary) location of the ratings.
	real<lower=0> prior_sd; // sets the (arbitrary) scale of the ratings.

}

parameters {
	real<lower=0> ratings[P];
}

model {
	real p1_win[N];

	// soft centering (see stan manual 8.7)
	ratings ~ normal(prior_mean, prior_sd); 

	for (i in 1:N){
		p1_win[i] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]]);
		results[i] ~ bernoulli(p1_win[i]);
	}
}

And here are the ratings for the handball teams. The ratings are now on a different scale than before and largely matches the prior distribution. The ranking given by this model agrees with the model with the Dirichlet prior, with Gjerpen and Stabek switched relative to the league ranking.

res_norm

Which model is the best?

I had a discussion on Twitter a couple of weeks ago about which model is the best for predicting football results. I have suspected that the Dixon & Coles model (DC), which is a modification of the Poisson model, tend to overfit. Hence it should not generalize well and give poorer predictions. I have written about one other alternative to the Poisson model, namely the Conway-Maxwell Poisson model (COMP). This is a model for count data that can be both over-, equi- and underdispersed. It is basically a Poisson model but without the assumption that the variance equals the mean. I have previously done some simple analyses comparing the Poisson, DC and COMP models, and concluded then that the COMP model was superior. The analysis was however a bit to simple, so I have now done a more proper evaluation of the models.

A proper way to evaluatie the models is to do a backtest. For each day there is a game played, the three models are fitted to the available historical data (but not data from the future, that would be cheating) and then used to predict the match outcomes for that day. I did this for two leagues, the English Premier League and German Bundesliga. The models were fitted to data from both the top league and the second tier divisions, since this improves the models, but only the results of the top division was predicted and used in the evaluation. I used a separate home field advantage for the two divisions and the rho parameter in the DC model and the dispersion parameter in the COMP model was estimated using the top division only.

To measure the model’s predictive ability I used the Ranked Probability Score (RPS). This is the proper measure to evaluate predictions for the match outcome in the form of probabilities for home win, draw and away win. The range of the RPS goes from 0 (best possible predictions) to 1 (worst possible prediction). Since the three models actually model the number of goals, I also looked at the probability they gave for the actual score.

For all three models I used the Dixon & Coles method to weight the historical data that is used in training the models. This requires tuning. For both the English and German leagues I backtested the models on different values of the weighting parameter \(\xi\) on the seasons from 2005-06 to 2009-10, with historical data available from 1995. I then used the optimal \(\xi\) for backtesting the seasons 2010-11 up to December 2016. This last validation period covers 1980 Bundesliga matches and 2426 Premier League matches.

Here are the RPS for the three models plottet against \(\xi\). Lower RPS is better and lower \(\xi\) weights more recent data higher.

xi epl

xi bundesliga

The graphs show a couple of things. First, all three models have best predictive ability at the same value of \(\xi\), and that they compare similarly also for non-optimal values of \(\xi\). This makes things a bit easier since we don’t have to worry that a different value of \(\xi\) will alter our evaluations about which model is the best.

Second, there is quite some difference between the models for the German and English data. In the English data the COMP model is clearly best, while the DC is the worst. In the German league, the DC is clearly better, and the COMP and Poisson models are pretty much equally good.

So I used the optimal values of \(\xi\) (0.0021 and 0.0015 for Premier League and Bundesliga, respectively) to validate the models in the data from 2010 and onwards.

Here is a table of the mean RPS for the three models:

rps table

We see that for the both English Premier League and German Bundesliga the DC model offers best predictions. The COMP model comes second in Premier League, but has worst performance in the Bundesliga. It is interesting that the DC model performed worst in the tuning period for the Premier League, now was the best one. For the Bundesliga the models compared similarly as in the tuning period.

I also looked at how often the DC and COMP models had lower RPS than the Poisson model. The results are in this table:

compare with poisson

The COMP model outperformed the Poisson model in more than 60% of the matches in both leagues, while the DC model did so only about 40% of the time.

When looking at the goal scoring probabilities. Here is a table of the sum of the minus log probabilities for the actual scoreline. Here a lower number also indicates better predictions.

score log probabilities

Inn both the Premier League and Bundesliga the Poisson model was best, followed by COMP, with the DC model last.

We can also take a look at the parameter values for the extra parameters the DC and COMP models has. Remember that the DC models is becomes the Poisson model when rho = 0, while the COMP model is the same as the Poisson model when upsilon = 1, and is underdispersed when upsilon is greater than 1.

params ger

params epl

The parameter estimates fluctuates a bit. It is intersting to see that the rho parameter in the DC model tend to be below 1, which gives the opposite direction of what Dixon and Coles found in their 1997 paper. In the Premier League, the parmater makes a big jump to above 0 at the end of the 2013-14 season. The parameter appears to be a bit more consistent in the Bundesliga, but also there we see a short period where the parameter is around 0.

The dispseriosn parameter upsilon also isn’t all that consistent. It is generally closer to 1 in the Bundesliga than in the Premier League. I think this is consistent with why this model was better in the Premier League than in the Bundesliga.

All inn all I think it is hard to conclude which of the three models is the best. The COMP and DC models both adjusts the Poisson model in their own specific ways, and this may explain why the different ways of measuring their predictive abilities are so inconsistent. The DC model seem to be better in the German Bundesliga than in the English Premier League. I don’t think any of the two models are generally better than the ordinary Poisson model, but it could be worthwhile to look more into when the two models are better, and perhaps they could be combined?

The comparison graph part 2

In the last post I wrote about how a graph could be used to explore an important aspect of a data set of football matches, namely whom has played against whom. In this post I will present a more interesting graph. Here is how a graph of 4500 international matches, including friendlies, world cups, and continental cups, from 2010 to 2015:

intgr1

There are 214 teams in this data set, each represented by a circle, and if two teams has played against each other, there is a line drawn between the two circles. It becomes clear when we see this graph that the graph is complicated, with a lot of lines between the circles, and it is hard to make a drawing that shows the structure really well.

There are a few things we can see clearly, though. The first is that the graph is highly connected. All teams are at least indirectly comparable with all other teams. There are no unconnected subgraphs. One measure of how connected the graph is, is the average number of edges the nodes have. In this graph this number is 23.2, which means that each team has on average played against 23 other teams.

On interesting thing we also notice is the “arm” on the right side of the plot, with a handful of teams that is more or less separated from the rest of the teams. These are teams from the Pacific nations, such as Fiji, Samoa and Cook Islands and so on.

In a data set like this we can find some interesting types of indirect comparisons. One example I found in the above graph was Norway and Japan, who has not played against each other in the five year period the data spans, but they have both played against two other teams that link them together: Zambia and Greece.

intgr_norjap

I haven’t found a decent measure of the overall connectedness between two nodes, that incorporates all indirect links of all degrees, but that could be an interesting thing to look at.

Another thing we can do with a graph like this is a cluster analysis. A cluster analysis gives us a broader look at the connectedness in the graph by finding groups of nodes that are more connected to each other that to those in the other groups. In other words we are trying to find groups of countries that play against each other a lot.

A simple clustering of the graph gives the following clusters, with some of the country names shown. The clustering algorithm identified 5 clusters that rather perfectly corresponds to the continents. This is perhaps not so surprising since the continental competitions (including the World Cup qualifications) make up a large portion of the data.

intgr_cluster

The comparison graph

In my last post on Elo ratings I used a graph to illustrate why it is hard to compare the strengths of teams that play in different leagues.

compgraph2

This is based on data from two half-seasons of English Premier League and the Championship. Each team is represented by a node or vertex, which is drawn as a circle. Each edge between the nodes is drawn as line between them and indicates that the two teams have against played each other at least once. I didn’t add team names to the graph, but the orange nodes are the teams that played in the Premier League in both seasons, while the blue nodes are teams that played in the Championship or got promoted or relegated. The graph shows both why comparisons between teams in different leagues can be difficult, and why including data from the Championship can improve the prediction of Premier league matches. In this post I will go more into details about what I call the comparison graph and how it can be used.

We can easily recognize a few patterns in the graph above. The most obvious one is the cluster of several teams where everyone (or nearly everyone) has played against each other. In a fully played season every team has played each other and the graph is said to be fully connected. If the graph is fully connected we should be able to have a good idea about the relative strengths between every team. Here is an example of a fully connected graph representing a fully played season with 10 teams.

graph_example

Another important pattern is the lack of edges between two teams. If two teams hasn’t played each other, but both has played a third team, they are indirectly comparable. Here we see that both Team B and Team C has played team A, but they have not played each other.

graph_example1

If you are going to predict the outcome of a match between Team B and Team C this graph shows you that you should be careful. The information we have of the relative strengths between them is only indirect. This can in some situations where you have very limited data be almost the same as having no data at all. Suppose both Team C and Team B won huge victories over Team A. This would perhaps indicate that Team A is crap, but we would have very little indication which of Team B and Team C is better. If on the other hand Team A beat Team C, and Team B beat Team A, we would have had a strict ordering, so it does not automatically mean that we can’t make anything out of the data.

Another important pattern in a graph is whether there are any disconnected subgraphs. Here we have two or more groups of teams that has played only against other teams within their own group, but not against the teams in the other groups. In the first few rounds of a season we can see patterns like this.

There are a lot of interesting things you can do with the comparison graph, but that will make for a future post.

Tuning the Elo ratings: Initial ratings and inter-league matches

In the last post I discussed how to tune the Elo ratings to make the ratings have the best predictive power by finding the optimal update factor (the K-factor) and adjustment for home field advantage. One thing I only mentioned, but did not go into detail about, was that the teams initial ratings will influence this tuning. In this post I will show how we can find good initial ratings that also will mitigate some other problems associated with Elo ratings.

As far as I can tell, setting the initial ratings does not seem to be much discussed. The Elo system updates the ratings by looking at the difference between the actual results and the results predicted by the rating difference between the two opposing teams. To get this to work in the earliest games in the data, you need to supply some initial ratings.

It is possible to set the initial ratings by hand, using your knowledge about the strengths of the different players and teams. This strategy is however difficult to use in practice, since you may not have that knowledge, which in turn would give incorrect ratings. This task would also become more difficult the more teams and players are in your data. An automatic way to get the initial ratings is of course preferable.

The only automatic way to set the initial ratings I have seen is to set all ratings to be equal. This is what they do at FiveThirtyEight. This simple strategy is obviously not optimal. It is a bit far fetched to assume that all teams are equally good at the beginning of your data, even if you could argue that you don’t really know any better. If you have a lot of data going back a long time, then only the earliest period of your ratings will be unrealistic. After a while the ratings will become more realistic and better reflect the true strengths of the teams.

The unrealistic ratings for the earliest data may also cause a problem if you use this to find the optimal K-factor. In Elo ratings the K-factor is a parameter that determines how much new games will influence the ratings. A larger K-factor makes the ratings change a lot after a new game, while a low K-factor will make the ratings change only a little after each new game. If you are trying to make a rating system with good prediction ability and use the earliest games with the unrealistic ratings to tune the K-factor, then it will probably be overestimated. This is because a large K-factor will make the ratings change a lot at the beginning, making the ratings better quicker. A large K-factor will be good in the earliest part of your data, but after a while it may be unrealistically big.

One more challenge with Elo ratings is if you include multiple leagues or competitions in your rating system. Since the Elo ratings are based on the exchange of points, groups of teams that play each other often, such as the teams in the same league, will have ratings that are reasonable calibrated only between each other. This is not the case when you have teams from different leagues play each other. The rating difference between two teams from two leagues will not be as well-calibrated as those within a league.

A nice visualization of this is to plot which teams in your data have played each other. In the plot below each team is represented by a circle, and each line between the circles indicates that the two teams have against played each other. The data is from Premier League and the Championship from the year 2010; two half-seasons for each division. I haven’t added team names to the graph, but the orange circles are the teams that played in the Premier League in both seasons, while the blue circles are teams that played in the Championship or got promoted or relegated.

comparison graph2 crop

We clearly see that the two division cluster together with a lot of comparisons available between the teams. Six teams, those that got promoted and relegated between the two divisions, are clearly shown to fall in between the two large clusters. All comparisons to be made between teams in the two divisions has to rely on the information that is available via these six teams. Including all these teams in a Elo rating, starting with all ratings equal, will surely be completely wrong and will take some time to be realistic. All point exchange between the divisions will have to happen via the promoted and relegated teams.

I have previously investigated this in the context of regression models, where I demonstrated how including data from the Championship improves the prediction of Premier League matches. Se this and this.

So how can we find the initial ratings that will give realistic ratings that also calibrate the ratings between two or more leagues? By using a small amount of data, say one year worth of data, less that you would use to tune the K-factor and home field advantage, you can use an optimization algorithm to find the ratings that best fits the observed outcomes. In doing this you have to use the formula that converts the ratings to expected outcomes, but you do not use the update formula, so this approach can be seen as a static version of the Elo ratings.

Doing the direct optimization is however not completely straightforward. Elo ratings is a zero-sum system. No points are added or removed from the system, only exchanged. This constraint is similar to the sum-to-zero constraint that is sometimes used in regression modeling and Analysis-of-Variance. To overcome this, we can simply set the rating of one of the teams to the negative sum of the ratings of all the other teams.

A further refinement is to include home field advantage into the optimization. In cases where the teams have unequal number of home games, or some games where no teams play at home, this will create more accurate ratings. If not the ratings for those teams with an excess of home games will become unrealistically large.

Doing this procedure, using data from the Premier League and the Championship from 2010 which I used to make the graph above, I get the following ratings (with the average rating being 1500):

elo inits

The procedure also estimated the home field advantage to be 84.3 points.

The data I used for the initial ratings is the first year of the data I used to tune the K-factor in the previous post. How does using these initial ratings influence the this tuning, compared with using the same initial rating for all teams? As expected, the optimal K-factor is smaller. The plot below shows that K=14 is the optimal K, compared with K=18.5 that I found last time. It is also interesting to see that the ratings with initialization are more accurate for the whole range of K’s I tested, than those without.

kwithinit

Tuning the Elo ratings: The K-factor and home field advantage

The Elo rating system is quite simple, and therefore easy implement. In football, FIFA uses is in its womens rankings and the well respected website fivethirtyeight.com also uses Elo ratings to make predictions for NBA and NFL games. Another cool Elo rating site is clubelo.com.

Three year ago I posted some R code for calculating Elo ratings. Its simplicity also makes it easy to modify and extend to include more realistic aspects of the games and competitions that you want to make ratings for, for example home field advantage. I suggest reading the detailed description of the clubelo ratings to get a feel of how the system can be modified to get improved ratings. I have also discussed some ways to extend the Elo ratings here on this blog as well.

If you implement your own variant of the Elo ratings it is necessary to tune the underlying parameters to make the ratings as accurate as possible. For example, a too small K-factor will give ratings that update too slow. The ratings will not adapt well to more recent developments. Vice versa, a too large K-factor will put too much weight on the most recent results. The same goes for the extra points added to the home team rating to account for the home field advantage. If this is poorly tuned, you will get poor predictions.

In order to tune the rating system, we need a way to measure how accurate the ratings are. Luckily the formulation of the Elo system itself can be used for this. The Elo system updates the ratings by looking at the difference between the actual results and the results predicted by the rating difference between the two opposing teams. This difference can be used to tune the parameters of the system. The smaller this difference is, the more accurate are the predictions, so we want to tune the parameters so that this difference is as small as possible.

To formulate this more formally, we use the following criterion to assess the model accuracy:

\( \sum_i[ (exp_{hi} – obs_{hi})^2 + (exp_{ai} – obs_{ai})^2 ] \)

where \(exp_{hi}\) and \(exp_{ai}\) are the expected results of match i for the home team and the away team, respectively. These expectations are a number between 0 and 1, and is calculated based on the ratings of the two teams. \(obs_{hi}\) and \(obs_{ai}\) are the actual result of match i, encoded as 0 for loss, 0.5 for draw and 1 for a win. This criterion is called the squared error, but we will use the mean squared error.

With this criterion in hand, we can try to find the best K-factor. Using data from the English premier league as an example I applied the ratings on the match results from the January 1st 2010 to the end of the 2014-15 season, a total of 2048 matches. I tried it with different values of the K-factor between 7 and 25, in 0.1 increments. Then plotting the average squared error against the K-factor we see that 18.5 is the best K-factor.

bestk185

The K-factor I have found here is, however, probably a bit too large. In this experiment I initialized the ratings for all teams to 1500. This includes the teams that was promoted from the Championship. A more realistic rating system would initialize these teams with a lower rating, perhaps be given the ratings from the relegated teams.

We can of course us this strategy to also find the best adjustment for the home field advantage. The simple way to add the home field advantage is to add some additional points to the ratings for the home team. Here I have used the same number of points in all matches across all season, but different strategies are possible. To find the optimal home field advantage I applied the Elo ratings with K=18.5, using different home field advantages.

besthfa683

From this plot we see that an additional 68.3 points is the optimal amount to add to the rating for the home team.

One might wonder if finding the best K-factor and home field advantage independent of each other is the best way to do it. When I tried to find the best K-factor with the home field advantage set to 68, I found that the best K was 19.5. This is a bit higher than when the home field advantage was 0. I tried to find the optimal pair of K and home field advantage by looking over a grid of possible values. Plotting the accuracy of the ratings against both K and the home field advantage in a contour we get the following:

besthfak

The best K and home field advantage pair can be read from the plot, both of which is a bit higher than the first values I found.

Doing the grid search can take a bit of time, especially if you don’t narrow down the search space by doing some initial tests beforehand. I haven’t really tried it out, but alternating between finding the best K-factor and home field advantage and using the optimal value from the previous round is probably going to be a reasonable strategy here.

My predictions for the 2016-17 Premier League

This year I am participating in Simon Gleave‘s Premier League prediction competition. It is an interesting initiative, as both statistical models and and more informal approaches are compared.

Last time I participated in something like this was midway trough the last Premier League season for statsbomb.com’s compilation. This time, however, the predictions are made before the first match has been played. To be honest, I think it is futile to try to model and predict an unplayed season since any model based only on previous results will necessarily reproduce what has already happened. This approach will work OK for predicting the result of the next couple of matches midway trough a season, but making predictions for the start of a season is really hard since the teams have brought inn some new players and gotten rid of other and perhaps also changed managers and so on. And not to forget that we also try predict results 9 months into the future.

When May comes and my predictions are completely wrong, I am not going to be embarrassed.

Last time I wanted to use the Conway-Maxwell-Poisson model, but I did not get it to work when I included data from several seasons plus data from the Championship. I still did not get it to work properly, but this time I tried a different approach to estimate the parameters. I ended up using a two-step approach, where I first estimate the attack and defense parameters with the independent Poisson model, and then, keeping those parameters fixed, I estimated the dispersion parameter by itself.

To fit the model I used Premier League data from the 2010-11 season to the 2015-16 season. I also included data from the 2015-16 season of the Championship (including the playoff) to be able to get some information on the promoted teams. I used the Dixon-Coles weighting scheme with \(\xi = 0.0019\). I used a separate parameter for home field advantage for Premier League and the Championship. I also used separate dispersion parameters for the two divisions.

I estimated the dispersion parameter for the Premier League to be 1.103, about the same as I previously estimated in some individual Premier League seasons, indicating some underdispersion in the goals. Interestingly, the dispersion parameter for the Championship was only 1.015.

Anyway, here are my projected league table with expected (or average) point totals. This is completely based on the model, I have not done any adjustments to it.

Team Points
Manchester City 73.70
Arsenal 69.73
Leicester City 64.12
Manchester United 63.95
Chelsea 63.84
Tottenham 62.53
Southampton 60.51
Liverpool 60.37
Everton 51.48
West Ham 51.12
Middlesbrough 46.30
Swansea 44.59
Burnley 44.20
Stoke City 42.99
Hull 42.49
Crystal Palace 41.33
Watford 41.23
Sunderland 39.83
West Bromwich Albion 39.21
Bournemouth 36.37

The Norwegian election survey: Voter turnout across generations and age groups

In the last post I used data from the Norwegian election survey to look at how party preferences changed between generations. One thing I didn’t look at was if there was any differences in participation between the generations. While the Norwegian elections generally has a high turnout, the general trend has been a decline. Some numbers on voter turnout are available from Statistics Norway, and a plot of turnout for the national elections for parliament and the local elections show that this is especially true for the local elections. For the parliament elections there seems to be a sudden drop in turnout at the 1993 election. Before that the general turnout was somewhere between 80% and 85%. From 1993 and onwards it has been somewhere between 75% and 80%.

turnout_elections

This time I decided to only look at the surveys done in connection with the parliamentary elections. This was to avoid much clutter with the differences between the local and national elections. After I gathered the data from the elections surveys using the web analysis tool at the web page for Norwegian Center For Research Data (see the link above), I plotted the voter turnout for each election for each birth cohort (in 7-year groups). In the plot is each election represented by a line. The same thing I did in the previous post, in other words.

turnout_cohort

We clearly see a trend in which younger generations, those born after about 1955, are less likely to vote. But there is also a clear indication that the young voters, when they get older, are more likely to vote. This trend can for example be seen for the 1970-generation. The earliest generations where this group could vote are those lines where the line ends in about 1970. In the earliest elections where this group could vote, more than 25% did not vote, but in the more recent elections only 10% of this generation did not vote.

Also notice how in each election, the oldest group also seem to have a tendency to not vote. This could perhaps be explained by the older population generally has poorer health and will therefore not prioritize to get out to vote. But I also suspect this is partially explained by random variation, as the oldest birth groups have relatively few respondents in the surveys.

We can plot the turnout by age instead of birth year to get a better view of the differences between age groups. Here I used 5-year groups instead of 7. In this plot the lines do seem to align a bit better.

turnout_age

Still another figure we could do is to plot the turnout for different age groups, and then see how this has changed from election to election. Here I have plotted only two age groups, those 25 or younger, and those older than 25. Also shown is the national turnout, which is not from the election survey, but are the official turnout numbers. This is the same as in the first plot above.

turnout_young

We see again that the young voters have lower turnout than the older ones, which by now should be no surprise. In addition, the difference between the young and the old seem follow each other between the elections to a large degree, going up and down in a similar pattern, but it also become noticeably wider from the 1993 election. From just looking at this plot, it could seem as if the lower turnout among the young could explain a lot of the decrease that happened in the 1993 election, but keep in mind that the younger group is a relatively small group. Not pictured in the plot is the uncertainty of the estimates, which gives the unreasonable results in the 1965 and 1985 elections, where both the young and old have higher turnout (as measured by the survey) than the official numbers.

So from looking at these plots, it seems like when people where born, what age you are and which election it is influence whether you vote or not. But the effect of these three aspects is hard, if not impossible, to untangle. The reason for this is simple: How old you are is fully determined by when you are and when you were born. You can of course turn it around and say the same for the two other aspects: If you know two of them, you also know the third. From a modeling point of view this dependency makes it hard to put these three variables in a regression model, but there are some literature out there on how this kind of Age-Period-Cohort analysis (as it is called) could be done.

But does this mean we can’t really learn anything from it? I think we can. The kind of analysis like the one I have done here is of course rather informal and descriptive, no p-values or effect sizes or stuff like that, but I think it is clear that age plays an important role. The third plot, with age on the horizontal axis, looks much nicer than the second plot, with birth year on the horizontal. The lines align rather nicely. We can also see this in the cohort plot, where the 1970-generation had a low turnout in the first elections they could participate in, but in the more recent elections they participate as much as those born before that.

Whether the changes in participation among the young over time is a period effect or a cohort effect is more difficult to say. It seems to covary with the general trend, but it also has it’s own component. This does not seem to play a large role, except perhaps a change at the 1985 election (or among those born in the 1960’s, depending on your view).