A couple of years ago I implemented the Dixon-Coles model for predicting football results here on this blog. That series of of blog posts is my most popular since I keep getting comments on it, some four years later.

One of the most common requests is advice on how to expand the model to include additional predictors. Unfortunately with the implementation I posted this was not so straightforward. It relied on some design matrices with dummy-coded variables, which is a standard way of doing things in regression modeling. The DC model isn’t a standard regression modeling problem, so using matrices complicated things. I posted some updates and variant across several posts, which in the end made the code a bit hard to follow and to modify.

Anyway, I’ve had a simpler implementation lying around for a while, and since there’s been far between updates on this blog lately I thought I’d post it.

First load some data from the engsoccerdata package. I’m going to use the 2011-12 season of the English Premier League, so the results can be compared with what I got from the first implementation.

library(dplyr) library(engsoccerdata) england %>% filter(Season == 2011, tier==1) %>% mutate(home = as.character(home), visitor = as.character(visitor))-> england_2011

Next we should create a list of initial parameter values. This will be used as a starting point for estimating the parameters. The list contains vectors of four groups of parameters, the attack and defense parameters of all teams, the home field advantage and the Dixon-Coles adjustment (rho). The attack and defense vector are named so that it is easy to look up the relevant parameter later on.

Notice also that a sum-to-zero constraint has to be added to the defense parameters, so in reality we are estimating one defense parameter less than the number of teams. Check this post for some more explanation of this.

# Make a vector of all team names. all_teams <- sort(unique(c(england_2011$home, england_2011$visitor)), decreasing = FALSE) n_teams <- length(all_teams) # list of parameters with initial values. parameter_list <- list(attack = rep(0.2, n_teams), defense = rep(-0.01, n_teams-1), home = 0.1, rho= 0.00) names(parameter_list$attack) <- all_teams names(parameter_list$defense) <- all_teams[-1] # the first parameter is computed from the rest.

Next we need a function that calculates the negative log-likelihood function, to be used with R’s built in optimizer.

One trick I use here is to relist the parameters. The optimizer want all parameter values as a single vector. When you have a lot of parameters that group together and is used in different parts of the model, this can quickly create some complicated indexing and stuff. By supplying the original parameter list, plus having named vectors, these problems essentially disappear.

Also notice how the expected goals are now simply computed by looking up the relevant parameters in the parameter list and adding them together. No need for matrix multiplications.

The Dixon-Coles adjustment function tau is the same as in the original implementation.

dc_negloglik <- function(params, goals_home, goals_visitor, team_home, team_visitor, param_skeleton){ # relist, to make things easier. plist <- relist(params, param_skeleton) # There is a sum-to-zero constraint on defense parameters. # The defense parameter for the first team is computed from the rest. plist$defense <- c(sum(plist$defense)*-1, plist$defense) names(plist$defense)[1] <- names(plist$attack[1]) # add name to first element. # Home team expected goals lambda_home <- exp(plist$attack[team_home] + plist$defense[team_visitor] + plist$home) # Away team expected goals lambda_visitor <- exp(plist$attack[team_visitor] + plist$defense[team_home]) # Dixon-Coles adjustment dc_adj <- tau(goals_home, goals_visitor, lambda_home, lambda_visitor, rho = plist$rho) # Trick to avoid warnings. if (any(dc_adj <= 0)){ return(Inf) } # The log-likelihood log_lik_home <- dpois(goals_home, lambda = lambda_home, log=TRUE) log_lik_visitor <- dpois(goals_visitor, lambda = lambda_visitor, log=TRUE) log_lik <- sum((log_lik_home + log_lik_visitor + log(dc_adj))) return(log_lik*-1) }

To actually estimate the parameters we feed the function, data and initial values to optim, and check the results.

optim_res <- optim(par = unlist(parameter_list), fn=dc_negloglik, goals_home = england_2011$hgoal, goals_visitor = england_2011$vgoal, team_home = england_2011$home, team_visitor = england_2011$visitor, param_skeleton=parameter_list, method = 'BFGS') # relist, and calculate the remaining parameter. parameter_est <- relist(optim_res$par, parameter_list) parameter_est$defense <- c( sum(parameter_est$defense) * -1, parameter_est$defense) names(parameter_est$defense)[1] <- names(parameter_est$attack[1])

I get the same home field advantage (0.27) and rho (-0.13) as in the original implementation. The other parameters differ, however. This is because of the sum-to-zero constraints are coded in a different way. This should not matter and both ways should give the same predictions.

I have not yet said anything about how to expand the model to include other predictors, but hopefully this implementation should make it easier. You can just add some new arguments to the dc_negloglik function that takes the variables in question as input, and add new parameter vectors to the parameter list as needed. Then the calculations of the expected goals should be modified to include the new parameters and predictors.

H i,

good job. I think that in function dc_negloglik the line

plist$defense <- c(sum(plist$defense)-1, plist$defense)

have to be

plist$defense <- c(-sum(plist$defense), plist$defense)

to ensure sum to zero constrain

Ralph

Yes, that is true! Thank you for pointing this out.

Hi, I’m back! Playing with the program, I found some different and strange values for rho parameter. With file of england Seasons 2011 the value is the same you found. With values of season 2017 2018 and for different countries the value is incredible huge about -1e+09.

Do you have any ideas for this?

My compliments for your blog

Ralph

Yes there seem to be problem. My initial guess when this happens is to check whether the optimization converges. optim_res$convergence should be 0. You could perhaps try other algorithms (using the method argument) or different initial values.

There could also be a problem with your data, for example if you have two separate groups of teams that hasn’t played each other. Other problems I imagine could be if you have really few cases of low scoring games for where the DC adjustment kicks in.

I have the same problem with the same year-champion. dataset can be found here

https://www.football-data.co.uk/mmz4281/1718/E0.csv

Have you managed to run DC with the same dataset?

I have tried all algorithms/ initial optim params and nothing works.. so frustrating

Have you tried using the goalmodel package? I recommend using that for the DC model. If it fails, try doing a 2-step estimation.

https://github.com/opisthokonta/goalmodel

It works with your package, isn’t it strange? Does it have to do with constraints set in your original paper?

Maybe. The package uses some additional constraints, like having the sum-to-zero constraint for both the defense and attack parameters.

Am trying to implement this in java. I have gone through the original paper and am having trouble understanding the likelihood function. Is it one equation or a system of equations that need to be maximized as a system? Am new to this so any help would be appreciated.

You can think of it as a system of equations, two per match. For a specific set of parameter values you calculate the expected goals for each team in each match using the equations with the attack, defense and home field parameter.

this is not true. (4.2) is the likelihood function. two poisson distributions are linked together by the inflation/deflation function tau. rho should always be negative since lambda, mu >0 and we want to inflate always 0:0, 1:1.

Hey, i’m also trying to understand the model. In the paper of DC it was written that you calculate the parameters using maximum likelihood on the expected goals (?) and here you say it’s the opposite?

Yes, it is maximum likelihood. The original question here was a bit confusing.

Hey, Great post . Thanks! I just don’t understand. In the paper of DC they state alpha & beta > 0 yet in your model I see some values of the attack / defence parameters are negative. ?

Yes, that is because I do things a bit differently, but the way I do it is equivalent with what they do in the paper. In the paper they formulate the model as expected goals = a*b , but I do it as log(expected goals) = a + b. Or as I do in the implementation, expected goals = exp(a + b).

Hi, I guess in the code cited below, it is sufficient to only get the home team names, as the visitor teams are the same.

Also the name “AwayTeam” is not used in the dataframe.

all_teams <- sort(unique(c(england_2011$home, england_2011$AwayTeam)), decreasing = FALSE)

Thanks for a nice blog!

-Halvard

You are right, but it can be useful to check both variables, especially when some teams have only played a few games, for example early in the season. The “AwayTeam” variable is a vestige from some other code, and a mistake. Thanks for pointing that out!

In their paper DC use another formula adding some time decay factor to the games taking into consideration (4.5). Have you figured out how to implement this? Can this easily be added to your code bove?

Yes it can be added somewhat easily. Check out this blog post: http://opisthokonta.net/?p=1013

The key is to multiply each term with the weights in the final sum where the log_lik variable is computed.

thanks for your reply, sorry I missed that already posted article.

Hey, great posts on this subject. Really helping me understand how these prediction models work. Having trouble creating the scaling matrix for this version of the model. Code as below:

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda_home, lambda_visitor, parameter_est['RHO']), nrow=2)

probability_matrix[1:2, 1:2] <- as.numeric(probability_matrix)[1:2, 1:2] * scaling_matrix

Error: non-numeric argument to binary operator

Have you any insights to where i'm going wrong?

Thanks so much.

Pat

There seems to be a problem with at least one of of the variables not being numeric. If you figure out which one it is, you can see if the command that created them was run properly.

Hi, how could i calculate aic/ bic using your code above? Do you think its feasible?

Additionally adding the weights of dc, if i add the weigjt function from your other post and the new log likelihood function, will it run as it is?

Thanks for sharing your great work, i appreciate it.

Here is code to get AIC:

length(optim_res$par)*2 – 2*(optim_res$value*-1)

To do weighted estimation you need to multiply each term in the log-likelihood with the weights, something like this:

log_lik <- sum((log_lik_home + log_lik_visitor + log(dc_adj))*weights)

Hi, thanks for answering. I have a follow up question. I need to compate the SC model with and without time weights, hence I use your code in order to compare AIC/BIC. Using England’s Premier league (2017-2018) data, and currentDate = sys.Date, xi = 0.0018

I get AIC = 978

BIC = 1168.

If I don’t use time weights (weights = 1) I get quit different results:

AIC: 2183.601

BIC:2373.568

Does this sound logical? If I use a simple double poisson for example with no weights, I get results compared to the latter cases of aic/bic (close to 2000)

*DC model

You can’t compare the model with weights and the model without using AIC. AIC and BIC can only be used in comparing different models using the same data and the same weights.

Thanks, I keep learning stuff about statistics from your posts and comments, I appreciate it.

No problem. Those things can be a bit tricky to understand.

Hey great post again! I was just looking through and I’ve encountered a few issues at times when the rho estimate is exceptionally wide of the mark (into the high thousands) and thus so the attack and defence parameters. Do you have any idea how this comes about? Or how you could limit the range of rho in the optimization? In the paper rho is given a clear limit either side to avoid negatives in the tau function. I know you have included it in your dc_adj but not when you create your scaling matrix which can then skew the probabilitie matrix massively (so it doesn’t sum to 1). Thanks again really enjoy your blog.

Thank you! Yes that sometimes happens, and there could be many reasons why. Sometimes you might have too few data points, perhaps for a particular team. Also if you use weights, and some teams have small weights for all their games, you run into the same problem. Perhaps you should also make sure you have more than just a few games where the DC-adjustment applies.

I also recommend using my goalmodel package for fitting these models, which has become more robust with the latest updates.

Why have you set the initial parameters as

home = 0.2

away = -0.1

home = 0.1

rho = 0

Does it matter what they’re set to?

Also how is home and rho calculated? I can’t see it in the code.

Sorry, I am new and trying to understand this.

It is a good idea to set the starting values to be somewhat realistic, but for the particular numbers I have used? No. But it might create problems if you start with extreme values. Starting with rho=0 is perhaps a good idea since it corresponds to the default model, but setting it to 0.1 or something should not create problems. Rho and home is estimated together with the other parameters.