In the previous installments in this series on implementing the Dixon-Coles model I complained a bit about the time it took to estimate the parameters. The original implementation in part 1 took about 40 seconds. Now, 40 seconds is not much to complain about; there are plenty of other models and algorithms that take much, much longer to fit (for my master's I had some computations that took several months). Still, I wanted to make a few improvements.
The approach I described in part 3 is quite acceptable, I think, especially since it takes less than a second to fit the model. But I also wanted to improve on my original implementation.
There are several reasons why the estimation procedure was slow. I used a general-purpose optimizer instead of a tailor-made algorithm, and I didn't provide the optimizer with the derivative of the model likelihood function, nor of the function defining the constraint. This means the optimizer has to estimate the derivatives numerically, by doing a lot of evaluations of the two functions with slight changes in the parameters. The most important speed bump, however, is probably how I implemented the constraint that the average of the attack parameters should equal 1.
The alabama package I used relies on a technique based on Lagrange multipliers (the augmented Lagrangian method), which is a very general method for constrained optimization. Instead of relying on general constrained optimization procedures, we can use a trick that is commonly used in linear models with sum-to-zero constrained categorical parameters.
There has been some discussion and confusion in the comments about how categorical variables are coded and how R presents the results of the glm function. A thorough discussion of this is best left for another time, but let me explain how the sum-to-zero constraint is implemented in linear models. We will fit the model with this constraint and then make some adjustments later on to get the correct average-is-one constraint.
The sum-to-zero constraint basically says that the sum of all the parameters for a categorical variable must equal zero:

\[ \theta_1 + \theta_2 + \dots + \theta_n = 0 \]
If, for example, we have three levels, we can write the equation out like this:

\[ \theta_1 + \theta_2 + \theta_3 = 0 \]
If we subtract \( \theta_3 \) and multiply both sides of the equation by minus 1, we get

\[ \theta_3 = -\theta_1 - \theta_2 \]
Notice how we can write one of the parameters as a simple linear function of the other parameters. We can use this result to construct the design matrix for the categorical variable, incorporating the sum-to-zero constraint (exactly which parameter or level we choose to write as a function of the others doesn't matter; the end result does not differ). Suppose, for example, that we have the following four observations of a three-level categorical variable with levels \(A\), \(B\) and \(C\):

\[ A, B, C, C \]
We can then construct the following design matrix, with one row per observation, in the same order as above:

\[ X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \\ -1 & -1 \end{pmatrix} \]
Notice that we only need two columns (i.e. two variables) to encode the three levels. Since the last parameter is a function of the other two, it is redundant. Also notice how the observations in the last two rows, corresponding to the \(C\) observations, will influence the estimation of all the other parameters for this variable. When the two parameters are estimated, the last parameter can be obtained using the result from above, which relates the last parameter to the others.
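As an aside, this is the same coding R itself can produce with the built-in contr.sum contrasts. Here is a minimal sketch, using a made-up factor with the same levels as above (note that the output also contains an intercept column, which we will not use in the Dixon-Coles design matrices):

# a made-up factor with levels A, B and C
x <- factor(c('A', 'B', 'C', 'C'))

# sum-to-zero coding: the last level (C) is coded as -1 in every column
model.matrix(~ x, contrasts.arg = list(x = 'contr.sum'))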
In the Dixon-Coles paper they used the constraint that the average of the attack parameters should be 1. This is not quite the same as the sum-to-zero constraint, but for prediction it does not matter exactly which constraint we use. I will explain later how we can convert the estimates from one constraint to the other.
To use this trick in the Dixon-Coles implementation we need to make the following changes to our code from part 1. Obviously, the first thing we need to change is how the design matrices in the DCmodelData function are computed. We need four matrices now, since the number of parameters estimated directly differs between the attack and the defence parameters. Notice how I chose the team that appears last in the team.names vector to be the one whose parameter is written as a function of the others. The teams get sorted alphabetically, so for the 2011-12 Premier League data this is Wolves.
DCmodelData <- function(df){

  team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))

  # attack, with sum-to-zero constraint
  ## home
  hm.a <- model.matrix(~ HomeTeam - 1, data=df)
  hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
  hm.a <- hm.a[,1:(length(team.names)-1)]

  # away
  am.a <- model.matrix(~ AwayTeam - 1, data=df)
  am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
  am.a <- am.a[,1:(length(team.names)-1)]

  # defence, same as before
  hm.d <- model.matrix(~ HomeTeam - 1, data=df)
  am.d <- model.matrix(~ AwayTeam - 1, data=df)

  return(list(homeTeamDMa=hm.a, homeTeamDMd=hm.d,
              awayTeamDMa=am.a, awayTeamDMd=am.d,
              homeGoals=df$FTHG, awayGoals=df$FTAG,
              teams=team.names))
}
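To convince yourself that the design matrices come out with the expected shapes, a quick check like the following should work, assuming a data frame with HomeTeam, AwayTeam, FTHG and FTAG columns as in part 1 (the file name is just a placeholder):

data <- read.csv('premier-league-2011-12.csv') # placeholder file name
dcm <- DCmodelData(data)

ncol(dcm$homeTeamDMa) # attack design matrix: number of teams minus 1
ncol(dcm$homeTeamDMd) # defence design matrix: number of teams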
Some changes to the DCoptimFn function are also needed, so that it properly handles the changes we made to the design matrices.
# I don't bother showing the rest of the function
nteams <- length(DCm$teams)
attack.p <- matrix(params[3:(nteams+1)], ncol=1) # one column less
defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

# need to multiply with the correct matrices
lambda <- exp(DCm$homeTeamDMa %*% attack.p + DCm$awayTeamDMd %*% defence.p + home.p)
mu <- exp(DCm$awayTeamDMa %*% attack.p + DCm$homeTeamDMd %*% defence.p)
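In case it is unclear how the pieces fit together, here is a sketch of how the complete modified function might look, assuming the tau and DClogLik functions from the earlier installments are already defined (treat this as a sketch rather than the exact code):

DCoptimFn <- function(params, DCm){
  home.p <- params[1]
  rho.p <- params[2]

  nteams <- length(DCm$teams)
  attack.p <- matrix(params[3:(nteams+1)], ncol=1) # one column less
  defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

  # multiply with the new design matrices
  lambda <- exp(DCm$homeTeamDMa %*% attack.p + DCm$awayTeamDMd %*% defence.p + home.p)
  mu <- exp(DCm$awayTeamDMa %*% attack.p + DCm$homeTeamDMd %*% defence.p)

  # optim minimizes by default, so return the negative log-likelihood
  DClogLik(y1=DCm$homeGoals, y2=DCm$awayGoals, lambda, mu, rho.p) * -1
}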
We also need to make the appropriate adjustments to the vectors with the initial parameter values, so that they have the correct lengths.
dcm <- DCmodelData(data)
nteams <- length(dcm$teams)

# initial parameter estimates
attack.params <- rep(0.1, times=nteams-1) # one less parameter
defence.params <- rep(-0.8, times=nteams)
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)

# informative names, skipping the last team
names(par.inits) <- c('HOME', 'RHO',
                      paste('Attack', dcm$teams[1:(nteams-1)], sep='.'),
                      paste('Defence', dcm$teams, sep='.'))
With these changes we can simply use the built-in optim function in R. There is no need for the DCattackConstr function anymore, nor for a third-party package, since the constraint is built right into the design matrices.
res <- optim(par=par.inits, fn=DCoptimFn, DCm=dcm, method='BFGS')
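If you want to verify that the optimizer actually converged, the object returned by optim contains a convergence code (this check is just a suggestion, not something the model needs):

res$convergence # 0 means optim is satisfied that it has converged
head(res$par) # HOME, RHO and the first few attack parameters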
This takes about 6-7 seconds on my laptop, a decent improvement over the 40 seconds it took before. If you take a look at the resulting parameter estimates in res$par, you will see that the attack parameter for Wolves is missing. As I explained earlier, this parameter is easy to find. It is also easy to correct all the parameter estimates so that they become the same as if we had used the mean-is-one constraint on the attack parameters: we increase the attack parameters by one and decrease the defence parameters by one. The reason it is that simple is that the sum-to-zero constraint is equivalent to a mean-is-zero constraint.
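To see why the shift works, write (just as shorthand here) \( \alpha_i \) for an attack parameter, \( \beta_j \) for a defence parameter and \( \gamma \) for the home parameter. Since the expected goals are modelled on the log scale, adding a constant to every attack parameter and subtracting the same constant from every defence parameter leaves the expected goals unchanged:

\[ \exp((\alpha_i + 1) + (\beta_j - 1) + \gamma) = \exp(\alpha_i + \beta_j + \gamma) \]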
parameters <- res$par

# compute the missing attack parameter for Wolves
missing.attack <- sum(parameters[3:(nteams+1)]) * -1

# put it into the parameters vector
parameters <- c(parameters[1:(nteams+1)], missing.attack,
                parameters[(nteams+2):length(parameters)])
names(parameters)[nteams+2] <- paste('Attack.', dcm$teams[nteams], sep='')

# increase attack parameters by one
parameters[3:(nteams+2)] <- parameters[3:(nteams+2)] + 1

# decrease defence parameters by one
parameters[(nteams+3):length(parameters)] <- parameters[(nteams+3):length(parameters)] - 1
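As a final sanity check, the attack parameters should now average to exactly 1 (the grepl call here is just a convenient way to pick out the parameters whose names contain 'Attack'):

# the attack parameters should now average to 1
mean(parameters[grepl('Attack', names(parameters))])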