A couple of years ago I implemented the Dixon-Coles model for predicting football results here on this blog. That series of of blog posts is my most popular since I keep getting comments on it, some four years later.
One of the most common requests is advice on how to expand the model to include additional predictors. Unfortunately with the implementation I posted this was not so straightforward. It relied on some design matrices with dummy-coded variables, which is a standard way of doing things in regression modeling. The DC model isn’t a standard regression modeling problem, so using matrices complicated things. I posted some updates and variant across several posts, which in the end made the code a bit hard to follow and to modify.
Anyway, I’ve had a simpler implementation lying around for a while, and since there’s been far between updates on this blog lately I thought I’d post it.
First load some data from the engsoccerdata package. I’m going to use the 2011-12 season of the English Premier League, so the results can be compared with what I got from the first implementation.
library(dplyr) library(engsoccerdata) england %>% filter(Season == 2011, tier==1) %>% mutate(home = as.character(home), visitor = as.character(visitor))-> england_2011
Next we should create a list of initial parameter values. This will be used as a starting point for estimating the parameters. The list contains vectors of four groups of parameters, the attack and defense parameters of all teams, the home field advantage and the Dixon-Coles adjustment (rho). The attack and defense vector are named so that it is easy to look up the relevant parameter later on.
Notice also that a sum-to-zero constraint has to be added to the defense parameters, so in reality we are estimating one defense parameter less than the number of teams. Check this post for some more explanation of this.
# Make a vector of all team names. all_teams <- sort(unique(c(england_2011$home, england_2011$visitor)), decreasing = FALSE) n_teams <- length(all_teams) # list of parameters with initial values. parameter_list <- list(attack = rep(0.2, n_teams), defense = rep(-0.01, n_teams-1), home = 0.1, rho= 0.00) names(parameter_list$attack) <- all_teams names(parameter_list$defense) <- all_teams[-1] # the first parameter is computed from the rest.
Next we need a function that calculates the negative log-likelihood function, to be used with R’s built in optimizer.
One trick I use here is to relist the parameters. The optimizer want all parameter values as a single vector. When you have a lot of parameters that group together and is used in different parts of the model, this can quickly create some complicated indexing and stuff. By supplying the original parameter list, plus having named vectors, these problems essentially disappear.
Also notice how the expected goals are now simply computed by looking up the relevant parameters in the parameter list and adding them together. No need for matrix multiplications.
The Dixon-Coles adjustment function tau is the same as in the original implementation.
dc_negloglik <- function(params, goals_home, goals_visitor, team_home, team_visitor, param_skeleton){ # relist, to make things easier. plist <- relist(params, param_skeleton) # There is a sum-to-zero constraint on defense parameters. # The defense parameter for the first team is computed from the rest. plist$defense <- c(sum(plist$defense)*-1, plist$defense) names(plist$defense)[1] <- names(plist$attack[1]) # add name to first element. # Home team expected goals lambda_home <- exp(plist$attack[team_home] + plist$defense[team_visitor] + plist$home) # Away team expected goals lambda_visitor <- exp(plist$attack[team_visitor] + plist$defense[team_home]) # Dixon-Coles adjustment dc_adj <- tau(goals_home, goals_visitor, lambda_home, lambda_visitor, rho = plist$rho) # Trick to avoid warnings. if (any(dc_adj <= 0)){ return(Inf) } # The log-likelihood log_lik_home <- dpois(goals_home, lambda = lambda_home, log=TRUE) log_lik_visitor <- dpois(goals_visitor, lambda = lambda_visitor, log=TRUE) log_lik <- sum((log_lik_home + log_lik_visitor + log(dc_adj))) return(log_lik*-1) }
To actually estimate the parameters we feed the function, data and initial values to optim, and check the results.
optim_res <- optim(par = unlist(parameter_list), fn=dc_negloglik, goals_home = england_2011$hgoal, goals_visitor = england_2011$vgoal, team_home = england_2011$home, team_visitor = england_2011$visitor, param_skeleton=parameter_list, method = 'BFGS') # relist, and calculate the remaining parameter. parameter_est <- relist(optim_res$par, parameter_list) parameter_est$defense <- c( sum(parameter_est$defense) * -1, parameter_est$defense) names(parameter_est$defense)[1] <- names(parameter_est$attack[1])
I get the same home field advantage (0.27) and rho (-0.13) as in the original implementation. The other parameters differ, however. This is because of the sum-to-zero constraints are coded in a different way. This should not matter and both ways should give the same predictions.
I have not yet said anything about how to expand the model to include other predictors, but hopefully this implementation should make it easier. You can just add some new arguments to the dc_negloglik function that takes the variables in question as input, and add new parameter vectors to the parameter list as needed. Then the calculations of the expected goals should be modified to include the new parameters and predictors.