A couple of years ago I implemented the Dixon-Coles model for predicting football results here on this blog. That series of of blog posts is my most popular since I keep getting comments on it, some four years later.

One of the most common requests is advice on how to expand the model to include additional predictors. Unfortunately with the implementation I posted this was not so straightforward. It relied on some design matrices with dummy-coded variables, which is a standard way of doing things in regression modeling. The DC model isn’t a standard regression modeling problem, so using matrices complicated things. I posted some updates and variant across several posts, which in the end made the code a bit hard to follow and to modify.

Anyway, I’ve had a simpler implementation lying around for a while, and since there’s been far between updates on this blog lately I thought I’d post it.

First load some data from the engsoccerdata package. I’m going to use the 2011-12 season of the English Premier League, so the results can be compared with what I got from the first implementation.

library(dplyr) library(engsoccerdata) england %>% filter(Season == 2011, tier==1) %>% mutate(home = as.character(home), visitor = as.character(visitor))-> england_2011

Next we should create a list of initial parameter values. This will be used as a starting point for estimating the parameters. The list contains vectors of four groups of parameters, the attack and defense parameters of all teams, the home field advantage and the Dixon-Coles adjustment (rho). The attack and defense vector are named so that it is easy to look up the relevant parameter later on.

Notice also that a sum-to-zero constraint has to be added to the defense parameters, so in reality we are estimating one defense parameter less than the number of teams. Check this post for some more explanation of this.

# Make a vector of all team names. all_teams <- sort(unique(c(england_2011$home, england_2011$visitor)), decreasing = FALSE) n_teams <- length(all_teams) # list of parameters with initial values. parameter_list <- list(attack = rep(0.2, n_teams), defense = rep(-0.01, n_teams-1), home = 0.1, rho= 0.00) names(parameter_list$attack) <- all_teams names(parameter_list$defense) <- all_teams[-1] # the first parameter is computed from the rest.

Next we need a function that calculates the negative log-likelihood function, to be used with R’s built in optimizer.

One trick I use here is to relist the parameters. The optimizer want all parameter values as a single vector. When you have a lot of parameters that group together and is used in different parts of the model, this can quickly create some complicated indexing and stuff. By supplying the original parameter list, plus having named vectors, these problems essentially disappear.

Also notice how the expected goals are now simply computed by looking up the relevant parameters in the parameter list and adding them together. No need for matrix multiplications.

The Dixon-Coles adjustment function tau is the same as in the original implementation.

dc_negloglik <- function(params, goals_home, goals_visitor, team_home, team_visitor, param_skeleton){ # relist, to make things easier. plist <- relist(params, param_skeleton) # There is a sum-to-zero constraint on defense parameters. # The defense parameter for the first team is computed from the rest. plist$defense <- c(sum(plist$defense)*-1, plist$defense) names(plist$defense)[1] <- names(plist$attack[1]) # add name to first element. # Home team expected goals lambda_home <- exp(plist$attack[team_home] + plist$defense[team_visitor] + plist$home) # Away team expected goals lambda_visitor <- exp(plist$attack[team_visitor] + plist$defense[team_home]) # Dixon-Coles adjustment dc_adj <- tau(goals_home, goals_visitor, lambda_home, lambda_visitor, rho = plist$rho) # Trick to avoid warnings. if (any(dc_adj <= 0)){ return(Inf) } # The log-likelihood log_lik_home <- dpois(goals_home, lambda = lambda_home, log=TRUE) log_lik_visitor <- dpois(goals_visitor, lambda = lambda_visitor, log=TRUE) log_lik <- sum((log_lik_home + log_lik_visitor + log(dc_adj))) return(log_lik*-1) }

To actually estimate the parameters we feed the function, data and initial values to optim, and check the results.

optim_res <- optim(par = unlist(parameter_list), fn=dc_negloglik, goals_home = england_2011$hgoal, goals_visitor = england_2011$vgoal, team_home = england_2011$home, team_visitor = england_2011$visitor, param_skeleton=parameter_list, method = 'BFGS') # relist, and calculate the remaining parameter. parameter_est <- relist(optim_res$par, parameter_list) parameter_est$defense <- c( sum(parameter_est$defense) * -1, parameter_est$defense) names(parameter_est$defense)[1] <- names(parameter_est$attack[1])

I get the same home field advantage (0.27) and rho (-0.13) as in the original implementation. The other parameters differ, however. This is because of the sum-to-zero constraints are coded in a different way. This should not matter and both ways should give the same predictions.

I have not yet said anything about how to expand the model to include other predictors, but hopefully this implementation should make it easier. You can just add some new arguments to the dc_negloglik function that takes the variables in question as input, and add new parameter vectors to the parameter list as needed. Then the calculations of the expected goals should be modified to include the new parameters and predictors.

H i,

good job. I think that in function dc_negloglik the line

plist$defense <- c(sum(plist$defense)-1, plist$defense)

have to be

plist$defense <- c(-sum(plist$defense), plist$defense)

to ensure sum to zero constrain

Ralph

Yes, that is true! Thank you for pointing this out.

Hi, I’m back! Playing with the program, I found some different and strange values for rho parameter. With file of england Seasons 2011 the value is the same you found. With values of season 2017 2018 and for different countries the value is incredible huge about -1e+09.

Do you have any ideas for this?

My compliments for your blog

Ralph

Yes there seem to be problem. My initial guess when this happens is to check whether the optimization converges. optim_res$convergence should be 0. You could perhaps try other algorithms (using the method argument) or different initial values.

There could also be a problem with your data, for example if you have two separate groups of teams that hasn’t played each other. Other problems I imagine could be if you have really few cases of low scoring games for where the DC adjustment kicks in.

Am trying to implement this in java. I have gone through the original paper and am having trouble understanding the likelihood function. Is it one equation or a system of equations that need to be maximized as a system? Am new to this so any help would be appreciated.

You can think of it as a system of equations, two per match. For a specific set of parameter values you calculate the expected goals for each team in each match using the equations with the attack, defense and home field parameter.

this is not true. (4.2) is the likelihood function. two poisson distributions are linked together by the inflation/deflation function tau. rho should always be negative since lambda, mu >0 and we want to inflate always 0:0, 1:1.

Hey, i’m also trying to understand the model. In the paper of DC it was written that you calculate the parameters using maximum likelihood on the expected goals (?) and here you say it’s the opposite?

Yes, it is maximum likelihood. The original question here was a bit confusing.

Hey, Great post . Thanks! I just don’t understand. In the paper of DC they state alpha & beta > 0 yet in your model I see some values of the attack / defence parameters are negative. ?

Yes, that is because I do things a bit differently, but the way I do it is equivalent with what they do in the paper. In the paper they formulate the model as expected goals = a*b , but I do it as log(expected goals) = a + b. Or as I do in the implementation, expected goals = exp(a + b).

Hi, I guess in the code cited below, it is sufficient to only get the home team names, as the visitor teams are the same.

Also the name “AwayTeam” is not used in the dataframe.

all_teams <- sort(unique(c(england_2011$home, england_2011$AwayTeam)), decreasing = FALSE)

Thanks for a nice blog!

-Halvard

You are right, but it can be useful to check both variables, especially when some teams have only played a few games, for example early in the season. The “AwayTeam” variable is a vestige from some other code, and a mistake. Thanks for pointing that out!

In their paper DC use another formula adding some time decay factor to the games taking into consideration (4.5). Have you figured out how to implement this? Can this easily be added to your code bove?

Yes it can be added somewhat easily. Check out this blog post: http://opisthokonta.net/?p=1013

The key is to multiply each term with the weights in the final sum where the log_lik variable is computed.

thanks for your reply, sorry I missed that already posted article.

Hey, great posts on this subject. Really helping me understand how these prediction models work. Having trouble creating the scaling matrix for this version of the model. Code as below:

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda_home, lambda_visitor, parameter_est['RHO']), nrow=2)

probability_matrix[1:2, 1:2] <- as.numeric(probability_matrix)[1:2, 1:2] * scaling_matrix

Error: non-numeric argument to binary operator

Have you any insights to where i'm going wrong?

Thanks so much.

Pat

There seems to be a problem with at least one of of the variables not being numeric. If you figure out which one it is, you can see if the command that created them was run properly.