# Which model is the best?

I had a discussion on Twitter a couple of weeks ago about which model is the best for predicting football results. I have suspected that the Dixon & Coles model (DC), which is a modification of the Poisson model, tend to overfit. Hence it should not generalize well and give poorer predictions. I have written about one other alternative to the Poisson model, namely the Conway-Maxwell Poisson model (COMP). This is a model for count data that can be both over-, equi- and underdispersed. It is basically a Poisson model but without the assumption that the variance equals the mean. I have previously done some simple analyses comparing the Poisson, DC and COMP models, and concluded then that the COMP model was superior. The analysis was however a bit to simple, so I have now done a more proper evaluation of the models.

A proper way to evaluatie the models is to do a backtest. For each day there is a game played, the three models are fitted to the available historical data (but not data from the future, that would be cheating) and then used to predict the match outcomes for that day. I did this for two leagues, the English Premier League and German Bundesliga. The models were fitted to data from both the top league and the second tier divisions, since this improves the models, but only the results of the top division was predicted and used in the evaluation. I used a separate home field advantage for the two divisions and the rho parameter in the DC model and the dispersion parameter in the COMP model was estimated using the top division only.

To measure the model’s predictive ability I used the Ranked Probability Score (RPS). This is the proper measure to evaluate predictions for the match outcome in the form of probabilities for home win, draw and away win. The range of the RPS goes from 0 (best possible predictions) to 1 (worst possible prediction). Since the three models actually model the number of goals, I also looked at the probability they gave for the actual score.

For all three models I used the Dixon & Coles method to weight the historical data that is used in training the models. This requires tuning. For both the English and German leagues I backtested the models on different values of the weighting parameter $$\xi$$ on the seasons from 2005-06 to 2009-10, with historical data available from 1995. I then used the optimal $$\xi$$ for backtesting the seasons 2010-11 up to December 2016. This last validation period covers 1980 Bundesliga matches and 2426 Premier League matches.

Here are the RPS for the three models plottet against $$\xi$$. Lower RPS is better and lower $$\xi$$ weights more recent data higher.

The graphs show a couple of things. First, all three models have best predictive ability at the same value of $$\xi$$, and that they compare similarly also for non-optimal values of $$\xi$$. This makes things a bit easier since we don’t have to worry that a different value of $$\xi$$ will alter our evaluations about which model is the best.

Second, there is quite some difference between the models for the German and English data. In the English data the COMP model is clearly best, while the DC is the worst. In the German league, the DC is clearly better, and the COMP and Poisson models are pretty much equally good.

So I used the optimal values of $$\xi$$ (0.0021 and 0.0015 for Premier League and Bundesliga, respectively) to validate the models in the data from 2010 and onwards.

Here is a table of the mean RPS for the three models:

We see that for the both English Premier League and German Bundesliga the DC model offers best predictions. The COMP model comes second in Premier League, but has worst performance in the Bundesliga. It is interesting that the DC model performed worst in the tuning period for the Premier League, now was the best one. For the Bundesliga the models compared similarly as in the tuning period.

I also looked at how often the DC and COMP models had lower RPS than the Poisson model. The results are in this table:

The COMP model outperformed the Poisson model in more than 60% of the matches in both leagues, while the DC model did so only about 40% of the time.

When looking at the goal scoring probabilities. Here is a table of the sum of the minus log probabilities for the actual scoreline. Here a lower number also indicates better predictions.

Inn both the Premier League and Bundesliga the Poisson model was best, followed by COMP, with the DC model last.

We can also take a look at the parameter values for the extra parameters the DC and COMP models has. Remember that the DC models is becomes the Poisson model when rho = 0, while the COMP model is the same as the Poisson model when upsilon = 1, and is underdispersed when upsilon is greater than 1.

The parameter estimates fluctuates a bit. It is intersting to see that the rho parameter in the DC model tend to be below 1, which gives the opposite direction of what Dixon and Coles found in their 1997 paper. In the Premier League, the parmater makes a big jump to above 0 at the end of the 2013-14 season. The parameter appears to be a bit more consistent in the Bundesliga, but also there we see a short period where the parameter is around 0.

The dispseriosn parameter upsilon also isn’t all that consistent. It is generally closer to 1 in the Bundesliga than in the Premier League. I think this is consistent with why this model was better in the Premier League than in the Bundesliga.

All inn all I think it is hard to conclude which of the three models is the best. The COMP and DC models both adjusts the Poisson model in their own specific ways, and this may explain why the different ways of measuring their predictive abilities are so inconsistent. The DC model seem to be better in the German Bundesliga than in the English Premier League. I don’t think any of the two models are generally better than the ordinary Poisson model, but it could be worthwhile to look more into when the two models are better, and perhaps they could be combined?

# The comparison graph part 2

In the last post I wrote about how a graph could be used to explore an important aspect of a data set of football matches, namely whom has played against whom. In this post I will present a more interesting graph. Here is how a graph of 4500 international matches, including friendlies, world cups, and continental cups, from 2010 to 2015:

There are 214 teams in this data set, each represented by a circle, and if two teams has played against each other, there is a line drawn between the two circles. It becomes clear when we see this graph that the graph is complicated, with a lot of lines between the circles, and it is hard to make a drawing that shows the structure really well.

There are a few things we can see clearly, though. The first is that the graph is highly connected. All teams are at least indirectly comparable with all other teams. There are no unconnected subgraphs. One measure of how connected the graph is, is the average number of edges the nodes have. In this graph this number is 23.2, which means that each team has on average played against 23 other teams.

On interesting thing we also notice is the “arm” on the right side of the plot, with a handful of teams that is more or less separated from the rest of the teams. These are teams from the Pacific nations, such as Fiji, Samoa and Cook Islands and so on.

In a data set like this we can find some interesting types of indirect comparisons. One example I found in the above graph was Norway and Japan, who has not played against each other in the five year period the data spans, but they have both played against two other teams that link them together: Zambia and Greece.

I haven’t found a decent measure of the overall connectedness between two nodes, that incorporates all indirect links of all degrees, but that could be an interesting thing to look at.

Another thing we can do with a graph like this is a cluster analysis. A cluster analysis gives us a broader look at the connectedness in the graph by finding groups of nodes that are more connected to each other that to those in the other groups. In other words we are trying to find groups of countries that play against each other a lot.

A simple clustering of the graph gives the following clusters, with some of the country names shown. The clustering algorithm identified 5 clusters that rather perfectly corresponds to the continents. This is perhaps not so surprising since the continental competitions (including the World Cup qualifications) make up a large portion of the data.

# Tuning the Elo ratings: Initial ratings and inter-league matches

In the last post I discussed how to tune the Elo ratings to make the ratings have the best predictive power by finding the optimal update factor (the K-factor) and adjustment for home field advantage. One thing I only mentioned, but did not go into detail about, was that the teams initial ratings will influence this tuning. In this post I will show how we can find good initial ratings that also will mitigate some other problems associated with Elo ratings.

As far as I can tell, setting the initial ratings does not seem to be much discussed. The Elo system updates the ratings by looking at the difference between the actual results and the results predicted by the rating difference between the two opposing teams. To get this to work in the earliest games in the data, you need to supply some initial ratings.

It is possible to set the initial ratings by hand, using your knowledge about the strengths of the different players and teams. This strategy is however difficult to use in practice, since you may not have that knowledge, which in turn would give incorrect ratings. This task would also become more difficult the more teams and players are in your data. An automatic way to get the initial ratings is of course preferable.

The only automatic way to set the initial ratings I have seen is to set all ratings to be equal. This is what they do at FiveThirtyEight. This simple strategy is obviously not optimal. It is a bit far fetched to assume that all teams are equally good at the beginning of your data, even if you could argue that you don’t really know any better. If you have a lot of data going back a long time, then only the earliest period of your ratings will be unrealistic. After a while the ratings will become more realistic and better reflect the true strengths of the teams.

The unrealistic ratings for the earliest data may also cause a problem if you use this to find the optimal K-factor. In Elo ratings the K-factor is a parameter that determines how much new games will influence the ratings. A larger K-factor makes the ratings change a lot after a new game, while a low K-factor will make the ratings change only a little after each new game. If you are trying to make a rating system with good prediction ability and use the earliest games with the unrealistic ratings to tune the K-factor, then it will probably be overestimated. This is because a large K-factor will make the ratings change a lot at the beginning, making the ratings better quicker. A large K-factor will be good in the earliest part of your data, but after a while it may be unrealistically big.

One more challenge with Elo ratings is if you include multiple leagues or competitions in your rating system. Since the Elo ratings are based on the exchange of points, groups of teams that play each other often, such as the teams in the same league, will have ratings that are reasonable calibrated only between each other. This is not the case when you have teams from different leagues play each other. The rating difference between two teams from two leagues will not be as well-calibrated as those within a league.

A nice visualization of this is to plot which teams in your data have played each other. In the plot below each team is represented by a circle, and each line between the circles indicates that the two teams have against played each other. The data is from Premier League and the Championship from the year 2010; two half-seasons for each division. I haven’t added team names to the graph, but the orange circles are the teams that played in the Premier League in both seasons, while the blue circles are teams that played in the Championship or got promoted or relegated.

We clearly see that the two division cluster together with a lot of comparisons available between the teams. Six teams, those that got promoted and relegated between the two divisions, are clearly shown to fall in between the two large clusters. All comparisons to be made between teams in the two divisions has to rely on the information that is available via these six teams. Including all these teams in a Elo rating, starting with all ratings equal, will surely be completely wrong and will take some time to be realistic. All point exchange between the divisions will have to happen via the promoted and relegated teams.

I have previously investigated this in the context of regression models, where I demonstrated how including data from the Championship improves the prediction of Premier League matches. Se this and this.

So how can we find the initial ratings that will give realistic ratings that also calibrate the ratings between two or more leagues? By using a small amount of data, say one year worth of data, less that you would use to tune the K-factor and home field advantage, you can use an optimization algorithm to find the ratings that best fits the observed outcomes. In doing this you have to use the formula that converts the ratings to expected outcomes, but you do not use the update formula, so this approach can be seen as a static version of the Elo ratings.

Doing the direct optimization is however not completely straightforward. Elo ratings is a zero-sum system. No points are added or removed from the system, only exchanged. This constraint is similar to the sum-to-zero constraint that is sometimes used in regression modeling and Analysis-of-Variance. To overcome this, we can simply set the rating of one of the teams to the negative sum of the ratings of all the other teams.

A further refinement is to include home field advantage into the optimization. In cases where the teams have unequal number of home games, or some games where no teams play at home, this will create more accurate ratings. If not the ratings for those teams with an excess of home games will become unrealistically large.

Doing this procedure, using data from the Premier League and the Championship from 2010 which I used to make the graph above, I get the following ratings (with the average rating being 1500):

The procedure also estimated the home field advantage to be 84.3 points.

The data I used for the initial ratings is the first year of the data I used to tune the K-factor in the previous post. How does using these initial ratings influence the this tuning, compared with using the same initial rating for all teams? As expected, the optimal K-factor is smaller. The plot below shows that K=14 is the optimal K, compared with K=18.5 that I found last time. It is also interesting to see that the ratings with initialization are more accurate for the whole range of K’s I tested, than those without.

# Tuning the Elo ratings: The K-factor and home field advantage

The Elo rating system is quite simple, and therefore easy implement. In football, FIFA uses is in its womens rankings and the well respected website fivethirtyeight.com also uses Elo ratings to make predictions for NBA and NFL games. Another cool Elo rating site is clubelo.com.

Three year ago I posted some R code for calculating Elo ratings. Its simplicity also makes it easy to modify and extend to include more realistic aspects of the games and competitions that you want to make ratings for, for example home field advantage. I suggest reading the detailed description of the clubelo ratings to get a feel of how the system can be modified to get improved ratings. I have also discussed some ways to extend the Elo ratings here on this blog as well.

If you implement your own variant of the Elo ratings it is necessary to tune the underlying parameters to make the ratings as accurate as possible. For example, a too small K-factor will give ratings that update too slow. The ratings will not adapt well to more recent developments. Vice versa, a too large K-factor will put too much weight on the most recent results. The same goes for the extra points added to the home team rating to account for the home field advantage. If this is poorly tuned, you will get poor predictions.

In order to tune the rating system, we need a way to measure how accurate the ratings are. Luckily the formulation of the Elo system itself can be used for this. The Elo system updates the ratings by looking at the difference between the actual results and the results predicted by the rating difference between the two opposing teams. This difference can be used to tune the parameters of the system. The smaller this difference is, the more accurate are the predictions, so we want to tune the parameters so that this difference is as small as possible.

To formulate this more formally, we use the following criterion to assess the model accuracy:

$$\sum_i[ (exp_{hi} – obs_{hi})^2 + (exp_{ai} – obs_{ai})^2 ]$$

where $$exp_{hi}$$ and $$exp_{ai}$$ are the expected results of match i for the home team and the away team, respectively. These expectations are a number between 0 and 1, and is calculated based on the ratings of the two teams. $$obs_{hi}$$ and $$obs_{ai}$$ are the actual result of match i, encoded as 0 for loss, 0.5 for draw and 1 for a win. This criterion is called the squared error, but we will use the mean squared error.

With this criterion in hand, we can try to find the best K-factor. Using data from the English premier league as an example I applied the ratings on the match results from the January 1st 2010 to the end of the 2014-15 season, a total of 2048 matches. I tried it with different values of the K-factor between 7 and 25, in 0.1 increments. Then plotting the average squared error against the K-factor we see that 18.5 is the best K-factor.

The K-factor I have found here is, however, probably a bit too large. In this experiment I initialized the ratings for all teams to 1500. This includes the teams that was promoted from the Championship. A more realistic rating system would initialize these teams with a lower rating, perhaps be given the ratings from the relegated teams.

We can of course us this strategy to also find the best adjustment for the home field advantage. The simple way to add the home field advantage is to add some additional points to the ratings for the home team. Here I have used the same number of points in all matches across all season, but different strategies are possible. To find the optimal home field advantage I applied the Elo ratings with K=18.5, using different home field advantages.

From this plot we see that an additional 68.3 points is the optimal amount to add to the rating for the home team.

One might wonder if finding the best K-factor and home field advantage independent of each other is the best way to do it. When I tried to find the best K-factor with the home field advantage set to 68, I found that the best K was 19.5. This is a bit higher than when the home field advantage was 0. I tried to find the optimal pair of K and home field advantage by looking over a grid of possible values. Plotting the accuracy of the ratings against both K and the home field advantage in a contour we get the following:

The best K and home field advantage pair can be read from the plot, both of which is a bit higher than the first values I found.

Doing the grid search can take a bit of time, especially if you don’t narrow down the search space by doing some initial tests beforehand. I haven’t really tried it out, but alternating between finding the best K-factor and home field advantage and using the optimal value from the previous round is probably going to be a reasonable strategy here.

# My predictions for the 2016-17 Premier League

This year I am participating in Simon Gleave‘s Premier League prediction competition. It is an interesting initiative, as both statistical models and and more informal approaches are compared.

Last time I participated in something like this was midway trough the last Premier League season for statsbomb.com’s compilation. This time, however, the predictions are made before the first match has been played. To be honest, I think it is futile to try to model and predict an unplayed season since any model based only on previous results will necessarily reproduce what has already happened. This approach will work OK for predicting the result of the next couple of matches midway trough a season, but making predictions for the start of a season is really hard since the teams have brought inn some new players and gotten rid of other and perhaps also changed managers and so on. And not to forget that we also try predict results 9 months into the future.

When May comes and my predictions are completely wrong, I am not going to be embarrassed.

Last time I wanted to use the Conway-Maxwell-Poisson model, but I did not get it to work when I included data from several seasons plus data from the Championship. I still did not get it to work properly, but this time I tried a different approach to estimate the parameters. I ended up using a two-step approach, where I first estimate the attack and defense parameters with the independent Poisson model, and then, keeping those parameters fixed, I estimated the dispersion parameter by itself.

To fit the model I used Premier League data from the 2010-11 season to the 2015-16 season. I also included data from the 2015-16 season of the Championship (including the playoff) to be able to get some information on the promoted teams. I used the Dixon-Coles weighting scheme with $$\xi = 0.0019$$. I used a separate parameter for home field advantage for Premier League and the Championship. I also used separate dispersion parameters for the two divisions.

I estimated the dispersion parameter for the Premier League to be 1.103, about the same as I previously estimated in some individual Premier League seasons, indicating some underdispersion in the goals. Interestingly, the dispersion parameter for the Championship was only 1.015.

Anyway, here are my projected league table with expected (or average) point totals. This is completely based on the model, I have not done any adjustments to it.

Team Points
Manchester City 73.70
Arsenal 69.73
Leicester City 64.12
Manchester United 63.95
Chelsea 63.84
Tottenham 62.53
Southampton 60.51
Liverpool 60.37
Everton 51.48
West Ham 51.12
Middlesbrough 46.30
Swansea 44.59
Burnley 44.20
Stoke City 42.99
Hull 42.49
Crystal Palace 41.33
Watford 41.23
Sunderland 39.83
West Bromwich Albion 39.21
Bournemouth 36.37

# My predictions for the rest of the Premier League season

A couple of weeks ago Constantinos Chappas asked on twitter for predictions for the results of the remaining season of English Premier League:

I had been thinking about posting some predictions about the Premier League around new years, since this season is really exciting and it would be a great opportunity to see how well my models would cope with everything that is currently going on. I have never posted any predictions before, so this will surely be an interesting experience. And I thought Chappas’ initiative was really interesting, so that surely gave me a nice reason to come trough.

Today Chappas posted the combined results from all 15 participants so I thought I could share some of the details behind my contribution.

I originally wanted to use the Conway-Maxwell model I have written about recently, but I had some problems with the estimation procedure, so I instead used a classic Poisson model. I used data on Premier League and Championship results going back to the 2011-12 season. By including data from the Champoionship I hope to get better predictions, like I have demonstrated before. Since I used data from a long time back I used the Dixon-Coles weighting scheme, which make more recent games have a greater impact on the predictions. The weighting parameter $$\xi$$ was set to 0.0019, which gives a bit more weight on more recent games than the 0.0018 I found to be most optimal earlier.

I fitted the model and calculated the probabilities for the remaining games of the season. From these probabilities I simulated the rest of the season ten thousand times. From these simulations we can get the probabilities and expectations for the end of season results.

So how do I predict the league table will look like at the end of the season?

Team Points
Manchester City 75.7
Arsenal 75.2
Tottenham 65.6
Leicester City 64.8
Manchester United 64.3
Liverpool 58.2
West Ham 56.1
Chelsea 54.7
Everton 53.7
Crystal Palace 53.7
Stoke City 52.9
Watford 51.9
Southampton 50.6
West Bromwich Albion 45.8
Norwich City 43.7
Bournemouth 42.9
Swansea City 40.9
Newcastle 34.5
Sunderland 31.5
Aston Villa 23.1

Although I predict 0.2 points more for Manchester City than Arsenal, the probabilities for both of them to win is 47.0%. I also give Tottenham a 2.3% chance, Leicester 2.1% and Manchester United a 1.5%. At last, Liverpool have a 0.1% chance. The other teams have a chance less than 0.04%.

I will come back with an update with my entire table with probabilities for all positions for all teams.

# Underdispersed Poisson alternatives seem to be better at predicting football results

In the previous post I discussed some Poisson-like probability distributions that offer more flexibility than the Poisson distribution. They typically have an extra parameter that controls the variance, or dispersion. The reason I looked into these distributions was of course to see if they could be useful for modeling and predicting football results. I hoped in particular that the distributions that can be underdispersed would be most useful. If the underdispersed distributions describe the data well then the model should predict the outcome of a match better than the ordinary Poisson model.

The model I use is basically the same as the independent Poisson regression model, except that the part with the Poisson distribution is replaced by one of the alternative distributions. Let the $$Y_{ij}$$ be the number of goals scored in game i by team j

$$Y_{ij} \sim f(\mu_{ij}, \sigma)$$
$$log(\mu_{ij}) = \gamma + \alpha_j + \beta_k$$

where $$\alpha_j$$ is the attack parameter for team j, and $$\beta_k$$ is the defense parameter for opposing team k, and $$\gamma$$ is the home field advantage parameter that is applied only if team j plays at home. $$f(\mu_{ij}, \sigma)$$ is one of the probability distributions discussed in the last post, parameterized by the location parameter mu and dispersion parameter sigma.

To these models I fitted data from English Premier League from the 2010-11 season to the 2014-15 season. I also used Bundesliga data from the same seasons. The models were fitted separately for each season and compared to each other with AIC. I consider this only a preliminary analysis and I have therefore not done a full scale testing of the accuracy of predictions where I refit the model before each match day and use Dixon-Coles weighting.

The five probability distributions I used in the above model was the Poisson (PO), negative binomial (NBI), double Poisson (DPO), Conway-Maxwell Poisson (COM) and the Delaporte (DEL) which I did not mention in the last post. All of these, except the Conway-Maxwell Poisson, were easy to fit using the gamlss R package. I also tried two other gamlss-supported models, the Poisson inverse Gaussian and Waring distributions, but the fitting algorithm did not work properly. To fit the Conway-Maxwell Poisson model I used the CompGLM package. For good measure I also fitted the data to the Dixon-Coles bivariate Poisson model (DC). This model is a bit different from the rest of the models, but since I have written about it before and never really tested it I thought this was a nice opportunity to do just that.

The AIC calculated from each model fitted to the data is listed in the following table. A lower AIC indicates that the model is better. I have indicated the best model for each data set in red.

The first thing to notice is that the two models that only account for overdispersion, the Negative Binomial and Delaporte, are never better than the ordinary Poisson model. The other and more interesting thing to note, is that the Conway-Maxwell and Double Poisson models are almost always better than the ordinary Poisson model. The Dixon-Coles model is also the best model for three of the data sets.

It is of course necessary to take a look at the estimates of the parameters that extends the three models from the Poisson model, the $$\sigma$$ parameter for the Conway-Maxwell and double Poisson and the $$\rho$$ for the Dixon-Coles model. Remember that for the Conway-Maxwell a $$\sigma$$ greater than 1 indicates underdispersion, while for the Double Poisson model a $$\sigma$$ less than 1 is indicates underdispersion. For the Dixon-Coles model a $$\rho$$ less than 0 indicates an excess of 0-0 and 1-1 scores and fewer 0-1 and 1-0 scores, while it is the opposite for $$\rho$$ greater than 0.

It is interesting to see that the estimated dispersion parameters indicate underdispersion for all the data sets. It is also interesting to see that the data sets where the parameter estimates are most indicative of equidispersion is where the Poisson model is best according to AIC (Premier League 2013-14 and Bundesliga 2010-11 and 2014-15).

The parameter estimates for the Dixon-Coles model do not give a very consistent picture. The sign seem to change a lot from season to season for the Premier League data, and for the data sets where the Dixon-Coles model was found to be best, the signs were in the opposite direction of what where the motivation described in the original 1997 paper. Although it does not look so bad for the Bundesliga data, this makes me suspect that the Dixon-Coles model is prone to overfitting. Compared to the Conway-Maxwell and double Poisson models that can capture more general patterns in all of the data, the Dixon-Coles model extends the Poisson model to just parts of the data, the low scoring outcomes.

It would be interesting to do fuller tests of the prediction accuracy of these three models compared to the ordinary Poisson model.

# Some alternatives to the Poisson distribution

One important characteristic of the Poisson distribution is that both its expectation and the variance equals parameter $$\lambda$$. A consequence of this is that when we use the Poisson distribution, for example in a Poisson regression, we have to assume that the variance equals the expected value.

The equality assumption may of course not hold in practice and there are two ways in which this assumption can be wrong. Either the variance is less than the expectation or it is greater than the expectation. This is called under- and overdispersion, respectively. When the equality assumption holds, it is called equidispersion.

There are two main consequences if the assumption does not hold: The first is that standard errors of the parameter estimates, which are based on the Poisson, are wrong. This could lead to wrong conclusions when doing inference. The other consequence happens when you use the Poisson to make predictions, for example how many goals a football team will score. The probabilities assigned to each number of goals to be scored will be inaccurate.

(Under- and overdispersion should not be confused with heteroscedasticity in ordinary linear regression. Poisson regression models are naturally heteroscedastic because of the variance-expectation equality. Dispersion refers to what relationship there is between the variance and the expected value, in other words what form the heteroscedasticity takes.)

When it comes to modeling and predicting football results using the Poisson, a good thing would be if the data were actually underdispersed. That would mean that the probabilities for the predicted number of goals scored would be higher around the expectation, and it would be possible to make more precise predictions. The increase in precision would be greatest for the best teams. Even if the data were really overdispersed, we would still get probabilities that more accurately reflect the observed number of goals, although the predictions would be less precise.

This is the reason why I have looked into alternatives to the Poisson model that are suitable to model count data and that are capable of being over- and underdispersed. Except for the negative binomial model there seems to have been little focus on more flexible Poisson-like models in the literature, although there are a handful of papers from the last 15 years with some applied examples.

I should already mention the gamlss package, which is an extremely useful package that can fit a large number of regression type models in R. I like to think of it as the glm function on steroids. It can be used to create regression models for a large number of distributions (50+) and using different forms of dependent variables (for example random effects and splines) and doing regression on distribution parameters other than the usual expectation parameters.

The models that I have considered usually have two parameters. The two parameters are often not easy to interpret, but the distributions can be re-parameterized (which is done in the gamlss package) so that the parameters describe the location (denoted $$\mu$$, often the same as the expectation) and shape (denoted $$\sigma$$, often a dispersion parameter that modifies the association between the expectation and variance). Another typical property is that they equal the Poisson for certain values of the shape parameter.

As I have already mentioned, the kind of model that is most often put forward as an alternative to the Poisson is the Negative binomial distribution (NBI). The advantages of the negative binomial are that is well studied and good software packages exists for using it. The shape parameter $$\sigma > 0$$ determines the overdispersion (relative to the Poisson) so that the closer it is to 0, the more it resembles the Poisson. This is a disadvantage as it can not be used to model underdispersion (or equidispersion, although in practice it can come arbitrarily close to it). Another similar, but less studied, model is the Poisson-inverse Gaussian (PIG). It too has a parameter $$\sigma > 0$$ that determines the overdispersion.

A large class of distributions, called Weighted Poisson distributions, is capable of being both over- and underdispersed. (The terms Weighted in the name comes from a technique used to derive the distribution formulas, not that the data is weighted) A paper describing this class can be found here. The general form of the probability distribution is

$$P(x;\theta,\alpha)=\frac{e^{\mu x+\theta t(x)}}{x!C(\theta,\alpha)}$$

where $$t(x)$$ is one of a large number of possible functions, and $$C(\theta,\alpha)$$ is a normalizing constant which makes sure all probabilities in the distribution sum to 1. Note that I have denoted the two parameters using $$\theta$$ and $$\alpha$$ and not $$\mu$$ and $$\sigma$$ to indicate that these are not necessarily location and shape parameters. I think this and interesting class of distributions that I want to look more into, but since they are not generally implemented in any R package that I know of I will not consider them further now.

Another model that is capable of being over- and underdispersed is the Conway–Maxwell–Poisson distribution (COM), which incidentally is a special case of the class of Weighted Poisson distributions mentioned above (see this paper). The Poisson distribution is a special case of the COM when $$\sigma = 1$$, and is underdispersed when $$\sigma > 1$$ and overdispersed when $$\sigma$$ is between 0 and 1. One drawback with the COM model is that the expected value depends on both parameters $$\mu$$ and $$\sigma$$, although it is dominated by $$\mu$$. This makes the interpretation a bit difficult, but it may not be a problem when making predictions.

Unfortunately, the COM model is not supported by the gamlss package, but there are some other R packages that implements it. I have tried a few of them and the only one that I got to work is CompGLM, which for some reason does not use the location ($$\mu$$) and shape ($$\sigma$$) parameterization.

The Double Poisson (DP) is another interesting distribution which also equals the Poisson distribution when $$\sigma = 1$$, but is overdispersed when $$\sigma > 1$$ and underdispersed when $$\sigma$$ is between 0 and 1. The expectation does not depend on the shape parameter $$\sigma$$, and it is approximately equal to the location parameter $$\mu$$. Another interesting thing about the Double Poisson is that it is belongs to a larger group of distributions called double exponential families which also lets you derive a binomial-like distribution with an extra dispersion parameter which can be useful in a logistic regression setting (see this paper, or this preprint).

In a follow up post I will try to use these distributions in regression models similar to the independent Poisson model.

# A hectic schedule has some effect on the outcome of a football match

It may be that a football team who has had a hectic period with a lot of games will, because of lack of training and restitution, perform poorer. The Wikipedia page for the FA Cup mentions Manchester United’s absence from the cup as a reason for why they won the Premier League by 18 points in the 1999-2000 season. If this is indeed the case, then this is something we could try to exploit in a prediction model.

I basically used the same data and model as I have used before. I used data from the English Championship and the Premier League, and predicted the Premier League games from January 2007 until January 2015 using the independent Poisson model with the Dixon & Coles weighting method (more details on the setup here and here). In addition I constructed a new variable, the number of matches each team has played the last x number of days, were we can use and try different values of x. As a pretentious shorthand I will call this the Match Schedule Intensity Index (MSII). Matches from the FA Cup, Europa Cup and Champions League were also included in the calculations.

As usual I used the ranked probability score (RPS) to assess the prediction accuracy.

I tried four different number of days back in time: 21, 25, 28 and 31 days. I also varied the time weighing parameter $$\xi$$ a bit to see how these things varied together.

Plotting the RPS, number of days back in time and the different values of $$\xi$$ against each other gives the following:

We see that looking back 28 days in time gives the lowest RPS and this the most accurate predictions of the four alternatives. 25 days is almost as good as 28 days, while 21 and 31 days performs poorer than not having the MSII in the model at all. I am not sure how important the drop in RPS is, as the changes are around the 4th and 5th decimal place. It is probably not that much, but on the other hand, this is an average over 3000 matches, and the number of days back in time seems to be a more important parameter than the small changes in $$\xi$$ that I tried.

It is also interesting to see what effect the MSII has on the number of goals scored. I plotted the estimated multiplicative effect for each additional match for all the fitted models from 2007 to 2015 using the best model with 28 days and $$\xi=0.0020$$.

I expected the effect of additional matches to be negative, meaning the more games the team has recently played, the fewer goals will they be expected to score. This seems to be at least halfway true, except for a few dips over on the positive side around 2010 and 2013-2014, and a rather large positive effect from the start in 2007 until 2008. This was a bit surprising, and I don’t know why. It would be interesting to redo the analysis with data going further back in time to see how far back the positive effect goes.

Is the effect large? Not really. The most extreme values of the multiplicative effects for the MSII is around 0.97 and 1.04. These values means that for each match a team has played more in the last four weeks they are expected to score around 3-4% more or fewer goals. This effect is around 10% for a team that has played four matches in four weeks, which is a typical mid-season schedule. This is not that big of a deal for individual matches, but it seem to improve the predictions in the long run. But I also think it is necessary to keep in mind that the effect in seems to be mostly absent in some periods.

# Better prediction, not just for promoted teams

Ian posted an interesting question that had a lot to do with the post I posted last week:

I have implemented the model to make predictions with two different approaches. The first approach is the standard where I use all matches played in a league to predict a match between Team A and Team B. The second approach is to use just matches played by Team A and Team B to predict the outcome of when they both play each other.

Now would you say that the second approach should be more accurate? As surely the only results which matter for predicting the match between Team A and B is of those two teams?

My answer was that regression models use all the data to estimate the parameters, and that the parameter estimates for Team A and Team B probably will be more precise by including matches where neither team is playing. The intuition for this is that both teams play against a whole bunch of other teams during the season, and the more accurate parameter estimates we can get for these other teams, the more information are we going to get from the matches involving either Team A or Team B. One possible way of getting more accurate parameter estimates for all the other teams is to include data from more matches, if available. And at last, more precise parameter estimates should hopefully provide better predictions.

This is not exactly what I demonstrated in the last post. There I just demonstrated that more data, especially related to promoted teams, will give better predictions on average across the whole Premier League. I did not investigate exactly where these improved predictions occur. It could be that all that gain was just related to the improved parameter estimates of the promoted teams.

That is why, prompted by Ian’s comment, I took a closer look at the predictions. Using the model fitted with data from the Premier League and the Championship, with separate home field advantage for the two divisions, I decided to look at how well the predictions were for some Premier League Teams. Recall that this was the model that made the best predictions in the previous post. I decided to look at only the matches between Manchester United, Arsenal, Aston Villa, Chelsea, Liverpool, Everton and Tottenham since these teams have played in Premier League for a long time.

When only looking at these teams, and using Premier League data only, the RPS was 0.24462. When the Championship were included in the data, RPS were a bit smaller, 0.24436. So this means that including more data, not directly related to this group of teams, improved predictions within that group.

I also tried the model without separate home field advantage parameter for the two divisions, and the predictions got worse for this group of teams. This was not the case when looking at the predictions for all Premier League matches, were it got better on average. This demonstrates an important point that I did not mention in my reasoning above: More data is not necessarily a good thing if your model can’t properly handle it.