A couple of years ago I implemented the Dixon-Coles model for predicting football results here on this blog. That series of blog posts is my most popular; I keep getting comments on it, some four years later.

One of the most common requests is advice on how to expand the model to include additional predictors. Unfortunately, with the implementation I posted this was not so straightforward. It relied on design matrices with dummy-coded variables, which is a standard way of doing things in regression modeling. But the DC model isn’t a standard regression problem, so using matrices complicated things. I also posted some updates and variants across several posts, which in the end made the code a bit hard to follow and to modify.

Anyway, I’ve had a simpler implementation lying around for a while, and since updates on this blog have been few and far between lately, I thought I’d post it.

First load some data from the engsoccerdata package. I’m going to use the 2011-12 season of the English Premier League, so the results can be compared with what I got from the first implementation.

library(dplyr)
library(engsoccerdata)

england %>%
  filter(Season == 2011,
         tier == 1) %>%
  mutate(home = as.character(home),
         visitor = as.character(visitor)) -> england_2011

Next we should create a list of initial parameter values. This will be used as a starting point for estimating the parameters. The list contains vectors of four groups of parameters: the attack and defense parameters of all teams, the home field advantage, and the Dixon-Coles adjustment (rho). The attack and defense vectors are named so that it is easy to look up the relevant parameter later on.

Notice also that a sum-to-zero constraint has to be added to the defense parameters, so in reality we are estimating one defense parameter fewer than the number of teams. Check this post for some more explanation of this.

# Make a vector of all team names.
all_teams <- sort(unique(c(england_2011$home, england_2011$visitor)), decreasing = FALSE)
n_teams <- length(all_teams)

# List of parameters with initial values.
parameter_list <- list(attack = rep(0.2, n_teams),
                       defense = rep(-0.01, n_teams - 1),
                       home = 0.1,
                       rho = 0.00)

names(parameter_list$attack) <- all_teams
names(parameter_list$defense) <- all_teams[-1] # the first parameter is computed from the rest.

Next we need a function that calculates the negative log-likelihood, to be used with R’s built-in optimizer.

One trick I use here is to relist the parameters. The optimizer wants all parameter values as a single vector. When you have a lot of parameters that group together and are used in different parts of the model, this can quickly lead to complicated indexing. By supplying the original parameter list, and by using named vectors, these problems essentially disappear.
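To illustrate the trick on its own (with made-up parameter values, not the model’s), relist rebuilds the grouped, named structure from the flat vector that the optimizer works with:

```r
# Toy example of unlist/relist round-tripping a parameter list.
skeleton <- list(attack = c(Arsenal = 0.1, Chelsea = 0.2), home = 0.3)

flat <- unlist(skeleton)           # the single vector the optimizer sees
restored <- relist(flat, skeleton) # back to the original grouped structure

restored$attack["Chelsea"]         # named lookup, no index bookkeeping
```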

Also notice how the expected goals are now simply computed by looking up the relevant parameters in the parameter list and adding them together. No need for matrix multiplications.

The Dixon-Coles adjustment function tau is the same as in the original implementation.
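For reference, this is essentially the tau function from the earlier posts (the τ of the Dixon-Coles paper), included here so the code below is self-contained:

```r
# The Dixon-Coles adjustment function tau, vectorized over matches.
# For negative rho it inflates the probabilities of 0-0 and 1-1 draws
# and deflates those of 1-0 and 0-1 results.
tau <- Vectorize(function(xx, yy, lambda, mu, rho){
  if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
  } else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
  } else if (xx == 1 & yy == 0){return(1 + (mu*rho))
  } else if (xx == 1 & yy == 1){return(1 - rho)
  } else {return(1)}
})
```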

dc_negloglik <- function(params, goals_home, goals_visitor,
                         team_home, team_visitor, param_skeleton){

  # relist, to make things easier.
  plist <- relist(params, param_skeleton)

  # There is a sum-to-zero constraint on defense parameters.
  # The defense parameter for the first team is computed from the rest.
  plist$defense <- c(sum(plist$defense)*-1, plist$defense)
  names(plist$defense)[1] <- names(plist$attack[1]) # add name to first element.

  # Home team expected goals
  lambda_home <- exp(plist$attack[team_home] + plist$defense[team_visitor] + plist$home)

  # Away team expected goals
  lambda_visitor <- exp(plist$attack[team_visitor] + plist$defense[team_home])

  # Dixon-Coles adjustment
  dc_adj <- tau(goals_home, goals_visitor, lambda_home, lambda_visitor, rho = plist$rho)

  # Trick to avoid warnings.
  if (any(dc_adj <= 0)){
    return(Inf)
  }

  # The log-likelihood
  log_lik_home <- dpois(goals_home, lambda = lambda_home, log = TRUE)
  log_lik_visitor <- dpois(goals_visitor, lambda = lambda_visitor, log = TRUE)

  log_lik <- sum(log_lik_home + log_lik_visitor + log(dc_adj))

  return(log_lik*-1)
}

To actually estimate the parameters we feed the function, data and initial values to optim, and check the results.

optim_res <- optim(par = unlist(parameter_list), fn = dc_negloglik,
                   goals_home = england_2011$hgoal,
                   goals_visitor = england_2011$vgoal,
                   team_home = england_2011$home,
                   team_visitor = england_2011$visitor,
                   param_skeleton = parameter_list,
                   method = 'BFGS')

# relist, and calculate the remaining parameter.
parameter_est <- relist(optim_res$par, parameter_list)
parameter_est$defense <- c(sum(parameter_est$defense) * -1, parameter_est$defense)
names(parameter_est$defense)[1] <- names(parameter_est$attack[1])

I get the same home field advantage (0.27) and rho (-0.13) as in the original implementation. The other parameters differ, however. This is because the sum-to-zero constraint is coded in a different way. This should not matter; both ways should give the same predictions.

I have not yet said anything about how to expand the model to include other predictors, but hopefully this implementation makes it easier. You can just add some new arguments to the dc_negloglik function that take the variables in question as input, and add new parameter vectors to the parameter list as needed. Then the calculation of the expected goals should be modified to include the new parameters and predictors.
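As a purely hypothetical sketch (the predictor rest_home and the parameter rest_effect are invented for illustration, not part of the model above), the expected-goals calculation could be extended like this:

```r
# Hypothetical: a new coefficient for a numeric home-team predictor,
# e.g. days of rest before the match. All values here are made up.
plist <- list(attack = c(TeamA = 0.30), defense = c(TeamB = -0.10),
              home = 0.25, rest_effect = 0.02)

rest_home <- 3  # made-up match-level predictor

# The extended expected goals for the home team:
lambda_home <- exp(plist$attack["TeamA"] + plist$defense["TeamB"] +
                   plist$home + plist$rest_effect * rest_home)
```

In dc_negloglik this would mean one extra function argument carrying the predictor, one extra element in parameter_list, and the corresponding term inside exp().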

Hi,

Good job. I think that in the function dc_negloglik the line

plist$defense <- c(sum(plist$defense)-1, plist$defense)

has to be

plist$defense <- c(-sum(plist$defense), plist$defense)

to ensure the sum-to-zero constraint.

Ralph

Yes, that is true! Thank you for pointing this out.

Hi, I’m back! Playing with the program, I found some strange values for the rho parameter. With the England 2011 season file the value is the same as you found. With data from the 2017-2018 season, and for different countries, the value is incredibly huge, about -1e+09.

Do you have any ideas for this?

My compliments for your blog

Ralph

Yes, there seems to be a problem. My first guess when this happens is to check whether the optimization converges: optim_res$convergence should be 0. You could perhaps try other algorithms (using the method argument) or different initial values.

There could also be a problem with your data, for example if you have two separate groups of teams that haven’t played each other. Another problem I can imagine is if you have very few low-scoring games where the DC adjustment kicks in.
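One way to check for such disconnected groups of teams, sketched with the igraph package (an assumption on my part; it is not used elsewhere in the post):

```r
library(igraph)

# Toy schedule with two pairs of teams that never meet each other.
matches <- data.frame(home = c("A", "C"), visitor = c("B", "D"))

g <- graph_from_data_frame(matches, directed = FALSE)
is_connected(g)  # FALSE here: the parameters would not be identifiable
```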

I have the same problem with the same season. The dataset can be found here:

https://www.football-data.co.uk/mmz4281/1718/E0.csv

Have you managed to run DC with the same dataset?

I have tried all the algorithms and initial optim params and nothing works… so frustrating.

Have you tried using the goalmodel package? I recommend using that for the DC model. If it fails, try doing a 2-step estimation.

https://github.com/opisthokonta/goalmodel

It works with your package, isn’t that strange? Does it have to do with constraints set in the original paper?

Maybe. The package uses some additional constraints, like having the sum-to-zero constraint for both the defense and attack parameters.

I am trying to implement this in Java. I have gone through the original paper and am having trouble understanding the likelihood function. Is it one equation, or a system of equations that needs to be maximized as a system? I am new to this, so any help would be appreciated.

You can think of it as a system of equations, two per match. For a specific set of parameter values you calculate the expected goals for each team in each match using the equations with the attack, defense and home field parameter.

This is not true. (4.2) is the likelihood function. Two Poisson distributions are linked together by the inflation/deflation function tau. Rho should always be negative, since lambda, mu > 0 and we always want to inflate 0-0 and 1-1.

Hey, I’m also trying to understand the model. In the DC paper it is written that you calculate the parameters using maximum likelihood on the expected goals (?), but here you say it’s the opposite?

Yes, it is maximum likelihood. The original question here was a bit confusing.

Hey, great post, thanks! I just don’t understand: in the DC paper they state alpha & beta > 0, yet in your model I see some negative values for the attack/defence parameters?

Yes, that is because I do things a bit differently, but the way I do it is equivalent to what they do in the paper. In the paper they formulate the model as expected goals = a*b, but I do it as log(expected goals) = a + b. Or, as in the implementation, expected goals = exp(a + b).
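In other words, the attack parameter here is the log of the paper’s alpha, so it can be negative even though alpha is positive. A quick check of the equivalence (with made-up numbers):

```r
# Multiplicative (paper) and additive-log (this post) parametrizations
# give the same expected goals when a = log(alpha) and b = log(beta).
alpha <- 1.3; beta <- 0.8        # paper-style, both positive
a <- log(alpha); b <- log(beta)  # this post's scale; b is negative

all.equal(alpha * beta, exp(a + b))
```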

Hi, I guess in the code cited below, it is sufficient to only get the home team names, as the visitor teams are the same.

Also the name “AwayTeam” is not used in the dataframe.

all_teams <- sort(unique(c(england_2011$home, england_2011$AwayTeam)), decreasing = FALSE)

Thanks for a nice blog!

-Halvard

You are right, but it can be useful to check both variables, especially when some teams have only played a few games, for example early in the season. The “AwayTeam” variable is a vestige from some other code, and a mistake. Thanks for pointing that out!

In their paper DC use another formula, adding a time decay factor to the games taken into consideration (4.5). Have you figured out how to implement this? Can it easily be added to your code above?

Yes it can be added somewhat easily. Check out this blog post: http://opisthokonta.net/?p=1013

The key is to multiply each term with the weights in the final sum where the log_lik variable is computed.
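A sketch of the weight function, following equation (4.5) of the paper, phi(t) = exp(-xi*t), with t the number of days since the match (the function and argument names are my own):

```r
# Exponential time-decay weights: recent matches count more.
# xi controls how quickly old matches are down-weighted.
dc_weights <- function(dates, current_date = max(dates), xi = 0.0018){
  datediffs <- as.numeric(as.Date(current_date) - as.Date(dates))
  exp(-xi * datediffs)
}

dc_weights(as.Date(c("2012-05-13", "2011-08-13")))
```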

thanks for your reply, sorry I missed that already posted article.

Hey, great posts on this subject. They’re really helping me understand how these prediction models work. I’m having trouble creating the scaling matrix for this version of the model. Code as below:

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda_home, lambda_visitor, parameter_est['RHO']), nrow=2)

probability_matrix[1:2, 1:2] <- as.numeric(probability_matrix)[1:2, 1:2] * scaling_matrix

Error: non-numeric argument to binary operator

Have you any insights into where I’m going wrong?

Thanks so much.

Pat

There seems to be a problem with at least one of the variables not being numeric. If you figure out which one it is, you can check whether the command that created it ran properly.
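For what it is worth, one likely culprit in the snippet above is parameter_est['RHO']: single-bracket indexing of a list returns a list, not a number, and the element is named rho (lower case). A self-contained sketch of the adjustment step, with made-up expected goals and tau defined as in the original implementation:

```r
# Made-up expected goals and rho, for illustration only.
lambda_home <- 1.6; lambda_visitor <- 1.1; rho <- -0.13

# The Dixon-Coles adjustment function, as in the original implementation.
tau <- Vectorize(function(xx, yy, lambda, mu, rho){
  if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
  } else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
  } else if (xx == 1 & yy == 0){return(1 + (mu*rho))
  } else if (xx == 1 & yy == 1){return(1 - rho)
  } else {return(1)}
})

maxgoal <- 6
# Rows: home goals 0..maxgoal, columns: away goals 0..maxgoal.
probability_matrix <- dpois(0:maxgoal, lambda_home) %o% dpois(0:maxgoal, lambda_visitor)

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1),
                             lambda_home, lambda_visitor, rho), nrow = 2)

# No as.numeric() here: probability_matrix is already a numeric matrix,
# and as.numeric() would drop its dimensions.
probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix
```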

Hi, how could I calculate AIC/BIC using your code above? Do you think it’s feasible?

Additionally, regarding the DC weights: if I add the weight function from your other post and the new log-likelihood function, will it run as is?

Thanks for sharing your great work, i appreciate it.

Here is code to get AIC:

length(optim_res$par)*2 - 2*(optim_res$value*-1)
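BIC is analogous (a sketch; the helper function is my own, and it assumes n is the number of matches used in the fit):

```r
# BIC = log(n)*k - 2*logLik. With neg_loglik being the minimized
# negative log-likelihood (optim_res$value), logLik = -neg_loglik.
bic <- function(neg_loglik, k, n) log(n) * k + 2 * neg_loglik

# e.g. bic(optim_res$value, length(optim_res$par), nrow(england_2011))
```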

To do weighted estimation you need to multiply each term in the log-likelihood with the weights, something like this:

log_lik <- sum((log_lik_home + log_lik_visitor + log(dc_adj))*weights)

Hi, thanks for answering. I have a follow-up question. I need to compare the DC model with and without time weights, hence I use your code above to compare AIC/BIC. Using England’s Premier League (2017-2018) data, currentDate = Sys.Date() and xi = 0.0018,

I get AIC = 978

BIC = 1168.

If I don’t use time weights (weights = 1) I get quite different results:

AIC: 2183.601

BIC:2373.568

Does this sound logical? If I use a simple double Poisson, for example, with no weights, I get results comparable to the latter AIC/BIC case (close to 2000).


You can’t compare the model with weights and the model without using AIC. AIC and BIC can only be used to compare different models fitted to the same data with the same weights.

Thanks, I keep learning stuff about statistics from your posts and comments, I appreciate it.

No problem. Those things can be a bit tricky to understand.

Hey, great post again! I was just looking through, and I’ve encountered a few issues at times when the rho estimate is exceptionally wide of the mark (into the high thousands), and thus so are the attack and defence parameters. Do you have any idea how this comes about? Or how you could limit the range of rho in the optimization? In the paper rho is given a clear limit on either side to avoid negatives in the tau function. I know you have included it in your dc_adj, but not when you create your scaling matrix, which can then skew the probability matrix massively (so it doesn’t sum to 1). Thanks again, really enjoy your blog.

Thank you! Yes that sometimes happens, and there could be many reasons why. Sometimes you might have too few data points, perhaps for a particular team. Also if you use weights, and some teams have small weights for all their games, you run into the same problem. Perhaps you should also make sure you have more than just a few games where the DC-adjustment applies.

I also recommend using my goalmodel package for fitting these models, which has become more robust with the latest updates.

Why have you set the initial parameters as

attack = 0.2

defense = -0.01

home = 0.1

rho = 0

Does it matter what they’re set to?

Also, how are home and rho calculated? I can’t see it in the code.

Sorry, I am new and trying to understand this.

It is a good idea to set the starting values to be somewhat realistic, but do the particular numbers I have used matter? No. It might create problems if you start with extreme values, though. Starting with rho = 0 is perhaps a good idea since it corresponds to the default model, but setting it to 0.1 or something should not create problems. Rho and home are estimated together with the other parameters.

Firstly thank you for the great blogs – very sticky!

I found that I could replicate the results you publish for the earlier version of the Dixon-Coles method and the improved version when applied to the premier league data.

When I attempted to apply it to data on international matches, I ran into difficulties. For your first implementation of Dixon-Coles I found that the computation of lambda within DCOptimFn requires the matrix DCm$hometeamDM to have the same number of columns as the number of rows in attack.p (a basic requirement for matrix multiplication).

Am I correct in assuming that your implementations require the input data to be “complete” in the sense that every team has played every other team an equal number of times?

This is mentioned in the Dixon-Coles paper, but with reference to Maher and how he obtained a system of linear equations whose roots are the maximum likelihood estimates. However, they then go on to immediately refer to “greater generality” and teams from all four divisions being included in the likelihood in equation (4.3). The latter is not so clear to me; it possibly suggests the data does not need to be “complete”.

I also notice that in the details section for help on R’s optim function it states “Note that arguments after … must match exactly”. Again this is not completely clear – match in what respect(s)? Arguments after … in your code are goals_home, goals_visitor, team_home, team_visitor.

The data does not necessarily need to be complete in the sense you describe. It seems that, as a minimum, all teams have to have played each other in a way that makes them “connected”. I explored the “connection” bit in this blog post: http://opisthokonta.net/?p=1511

For international matches it can sometimes be a problem that you get unconnected clusters of teams, or teams that are only weakly connected, but the same goes for league data from different divisions, as you write. I think a good rule of thumb is that the better connected two teams are, the better predictions you will get.

More theoretically, a criterion to ensure that you can fit the model is that the design matrix is invertible. This is at least needed to fit linear regression models, but it should apply to GLMs and the DC model as well. We might not use the design matrices directly here, but there is a connection between the “connectedness” I wrote about above and invertible matrices.

In the old implementations I have posted on this blog, which rely on matrix multiplications, it could be a problem if a team has only played one match, which it seems could be the problem here.

Anyway, I would recommend you try the goalmodel package:

https://github.com/opisthokonta/goalmodel

It has some checks to ensure the data is connected.

I forgot to answer the question about optim. You can pass other arguments to optim that are passed along to the dc_negloglik function (or whatever function you want to optimize). I do that here to pass the data to the likelihood function. The names of the arguments you use in optim must match exactly the argument names you defined for dc_negloglik.

Thank you. I am impressed at the speed of response to a comment on a blog you posted so long ago!

Yesterday I managed to stay concentrated long enough to get your second (more recent) Dixon-Coles implementation to work with my international soccer matches data set. Basically I had to make sure that every pair of team names in the data-set, both home and away, matched with a team name in “all_teams”. I could not have, for example, a home team in the data-set that matched with a team name in “all_teams” but the corresponding away team name in the data-set not appearing in “all_teams”. This obviously reduced the size of the data-set. Ordinarily I would not “throw away” data that may be useful.

I had already been careful to match up argument names used in optim with those defined in dc_negloglik.

all_teams should list all team names in the data. In the code I get them automatically from the data, and you should check that your code does that too; otherwise it will cause trouble. You should not have to throw away data, as you say.