The Dixon-Coles model, part 4: A trick to speed up estimation

In the previous installments in this series on implementing the Dixon-Coles model I complained a bit about the time it took to estimate the parameters. In the original implementation in part 1 it took about 40 seconds. Now 40 seconds is not much to complain about, there are a whole lot of other models and algorithms that takes much much longer time to fit (for my master’s I had some computations that took several months). Still, I wanted to make a few improvements.

The approach I described in part 3 is quite acceptable, I think, especially since it takes less than a second to fit the model. But still, I wanted to make some improvements to my original implementation.

There are several reasons for the estimation procedure being slow. I used a general purpose optimizer instead of a tailor-made algorithm, and I didn’t provide the optimizer with a function of the derivative of the model likelihood function, nor the function defining the constraint. This means that the optimizer have to estimate the derivatives by doing a lot of evaluations of the two functions with slight changes in the parameters. The most important speed bump, however, is probably due to how I implemented the constraint that all the average of the attack parameters should equal 1.

The alabama package I used relied on a technique called Lagrange multipliers, which is a very general method for constrained optimization. Instead of relying on general constrained optimization procedures, there is a trick commonly used in linear models with sum-to-zero constrained categorical parameters that we also can use here.

There has been some discussion and confusion in the comments about how categorical variables are coded and how R presents the results of the glm function. A thorough discussion of this is best left for another time, but let me explain how the sum-to-zero constraint is implemented in linear models. We will fit the model with this constraint and then make some adjustments later on to get the correct average-is-one constraint.

The sum-to-zero constraint basically says that the sum of all the parameters for a categorical variable must equal to zero:

\( \sum_{i=1} \theta_i = 0 \)

If we for example have three levels, we can write out the equation like this:

\( \theta_1 + \theta_2 + \theta_3 = 0 \)

If we subtract \( \theta_3\) and multiply both sides of the equation by minus 1 we get

\( – \theta_1 – \theta_2 = \theta_3 \)

Notice how we can write one of the parameters as a simple linear function of the other parameters. We can use this result to construct the design matrix for the categorical variable, incorporating the sum-to-zero constraint (exactly which parameter or level we chose to be a function of the others doesn’t matter, the end results does not differ). Suppose we have the following observations of a three-level categorical variable:

\( \begin{bmatrix} A & A & B & B & C & C \end{bmatrix}^T \)

We can then construct the following design matrix:

\( \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ -1 & -1 & \\ -1 & -1 & \end{bmatrix} \)

Notice that we only need two columns (i.e. two variables) to encode the three levels. Since the last parameter is a function of the two other it is redundant. Also notice how the observations in the two last rows, corresponding to the \(C\) observations, will influence the estimation of all the other parameters for this variable. When the two parameters are estimated, the last parameter can be obtained using the result from above relating the last parameter to all the other.

In the Dixon-Coles paper they used the constraint that the average of the attack parameters should be 1. This is not quite the same as the sum-to-zero constraint, but for prediction, it does not matter exactly which constraint we use. Anyway, I will explain later how we can fix this.

To use this trick in the Dixon-Coles implementation we need to make the following changes to our code from part 1. Obviously the first thing we need to change is how the design matrices in the DCmodelData function is computed. We need four matrices now, since the number of parameters estimated directly is different for the attack and defense parameters. Notice how I chose the last of team that appear last in the team.names vector. The teams get sorted alphabetically, so for the 2011-12 Premier League data this is is Wolves.

DCmodelData <- function(df){
  
  team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))
  
  # attack, with sum-to-zero constraint
  ## home
  hm.a <- model.matrix(~ HomeTeam - 1, data=df)
  hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
  hm.a <- hm.a[,1:(length(team.names)-1)]
  
  # away
  am.a <- model.matrix(~ AwayTeam -1, data=df)
  am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
  am.a <- am.a[,1:(length(team.names)-1)]
  
  # defence, same as before 
  hm.d <- model.matrix(~ HomeTeam - 1, data=df)
  am.d <- model.matrix(~ AwayTeam -1, data=df)
  
  return(list(homeTeamDMa=hm.a, homeTeamDMd=hm.d,
    awayTeamDMa=am.a, awayTeamDMd=am.d,
    homeGoals=df$FTHG, awayGoals=df$FTAG,
    teams=team.names)) 
}

Some changes to the DCoptimFn function is also needed, so it properly handles the changes we made to the design matrices.

# I don't bother showing the rest of the function  
nteams <- length(DCm$teams)
attack.p <- matrix(params[3:(nteams+1)], ncol=1) #one column less
defence.p <- matrix(params[(nteams+2):length(params)], ncol=1) 
 
# need to multiply with the correct matrices
lambda <- exp(DCm$homeTeamDMa %*% attack.p + DCm$awayTeamDMd %*% defence.p + home.p)
mu <- exp(DCm$awayTeamDMa %*% attack.p + DCm$homeTeamDMd %*% defence.p)

We also need to make a the appropriate adjustments to the vectors with the initial parameter values, so that they have the correct lengths.

dcm <- DCmodelData(data)
nteams <- length(dcm$teams)

#initial parameter estimates
attack.params <- rep(.1, times=nteams-1) # one less parameter
defence.params <- rep(-0.8, times=nteams)
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)

#informative names
#skip the last team
names(par.inits) <- c('HOME', 'RHO', 
                      paste('Attack', dcm$teams[1:(nteams-1)], sep='.'),
                      paste('Defence', dcm$teams, sep='.'))

With these changes we can simply use the built-in optim function in R. There is no need for the DCattackConstr function anymore, or a third party package, since we built the constraint right into the design matrices.

res <- optim(par=par.inits, fn=DCoptimFn, DCm=dcm, method='BFGS')

This takes about 6-7 seconds on my laptop, a decent improvement to the 40 seconds it took before. If you take a look at the resulting parameter estimates in res$par you will see that the attack parameter for Wolves is missing. As I explained earlier, this parameter is easy to find. It is also easy to correct all the parameter estimates so that they become the same as if we had a mean-is-one constraint on the attack parameters. This is done by increasing the attack parameters by one, and decreasing the defense parameters by one. The reason it is that simple is that the sum-to-zero constraint is equivalent with a mean-is-zero constraint.

parameters <- res$par

#compute Wolves attack parameter
missing.attack <- sum(parameters[3:(nteams+1)]) * -1

#put it in the parameters vector
parameters <- c(parameters[1:(nteams+1)], missing.attack, parameters[(nteams+2):length(parameters)])
names(parameters)[nteams+2] <- paste('Attack.', dcm$teams[nteams], sep='')

#increase attack by one
parameters[3:(nteams+2)] <- parameters[3:(nteams+2)] + 1  

#decrease defence by one
parameters[(nteams+3):length(parameters)] <- parameters[(nteams+3):length(parameters)] - 1 

23 thoughts on “The Dixon-Coles model, part 4: A trick to speed up estimation

  1. Great series of write-ups.

    Could you please explain the exponentiation in the calculation lambda? This doesn’t quite line up with the paper, but I see how the values need to be positive. Shouldn’t the raw data also be exponentiated for dpois()?

    • You are right that the reason for the exponentiation is that the values for lambda should be positive. This technique to make sure the parameters are within the correct domain is called link-functions. Often the link function is expressed in terms of the left hand side of the equation (the response), meaning that we actually make a linear model for the logarithm of the number of goals. Wikipedia calls using the inverse of the link function applied to the predictors for the mean function. Take a look at the table in the link. This is what I have done here, meaning that we don’t have to do anything to the raw data (the number of goals).

      Strangely, I think, this is not discussed in the paper. Implicitly, I think the paper shows that they have used it, since they multiply the attack and defense coefficients instead of adding them together, and mentions that they all have to be greater than 0. I have not tried to fit the model without the link function but my guess is that it would fail.

      • Thank you for the response. Linearising the model seems like it would be a fine explanation were it not for the fact that lambda (or alpha, beta, and gamma) isn’t the predictor; dpois(k, lambda) is. By taking the exponential of lambda, the Poisson distribution becomes dpois(k, exp(lambda)).

        In fact, William Brojanigo (a student of Coles’) does the same in his thesis, but makes it a bit clearer that his model for the home goals is dpois(k, exp(a+b+tau)) which matches what you have. It’s still not quite what Dixon & Coles had, but it makes sense I think when that is made clear.

        Martin Eastwood in his presentation uses a GLM with a log link, so it implicitly does the exponentiation on goals but not on attack/defence parameters.

        I think my issue/confusion was with interpreting lambda. It’s a function of arbitrary parameters, but that function is the expectation value for home goals, so it doesn’t matter what form it has. Checking again, the original Dixon & Coles alpha, beta, and gamma were indeed all strictly positive, so it should also work if one can impose that restriction in the optimiser. In that case it’s a multiplicative model of attack/defence vs the additive one you & Brojanigo have used. They would hopefully both give the same expected means. I suspect you could retain the additive model without the link function if you too enforced the positive restriction.

        I look forward to you (hopefully) following this up with the time-exponentially weighted model. I’m putting some of that R code together myself. I’m also looking into the predictive power of this vs match results and eventually betting odds.

        Thanks again 🙂

  2. Hi. First off. Thanks for taking the time to write your blog. Your articles are always extremely informative. I have a problem that you might be able to guide me on. I’ve just recently started using R and am wanting to recreate the basic Karlis & ntzoufras papers but don’t seem to be able to install bivpois package. I note that you have looked at recreating Dixon Coles but not the newer Karlis papers. Is that because you have had similar issues installing the packages with the latest versions of R?

    • I haven’t tried to install the bivpois package myself, but is seems like it isn’t available from CRAN anymore (although you can find versions of in their archive, dating from 2007). I don’t know why, but my guess is that the package has not been properly updated and become incompatible with newer version of R. You could try to download the package .zip-files and install them from your computer (instead of automatically downloading from CRAN) and see if they work. I think I have seen instructions on this somewhere on StackOverflow.

      But I found this, which looks like an updated version from 2013, that hopefully works. I haven’t tested it, but I would try that first I think.

  3. Great stuff! Sorry I am late to the party.

    I am just curious to know: when you ran your full Dixon-Coles model from part 1 on the 2011 EPL season, what AIC value did you get? I just finally got the Karlis-Ntzoufras (K-N) diagonally-inflated bivariate poisson models up and running. It seems that the K-N model that fits best (diagonally-inflated bivariate poisson model with discrete inflation distribution, Jmax=1) is quite close to Dixon-Coles. Both models inflate the probabilities of only [0,0] and [1,1] scores, the only differences being that Dixon-Coles deflate only [0,1] and [1,0] scores to compensate, while K-N deflates all scores. Also, the K-N model allows for the [0,0] and [1,1] scores to be inflated by different amounts, while Dixon-Coles constrains the increase to be the same for both. The K-N model run on the 2011 EPL season yields an AIC of 2263.

    • Nice to see that someone has tried the K-N model. I have yet to try it out myself. You can find see AIC values in my most recent post where I try some different models. If by the 2011 season you mean the 2011-2012 season, the DC model fit gave an AIC = 2256.7. The KN fit give poorer AIC than even the worst model I tested, the Delaporte model, which seems strange to me. How many parameters, in addition to the home, attack and defense parameters, are there in your model?

      • Yes, that does seem strange. The KN model fit to the 2011(/12) EPL season has 19 attack parameters, 19 defense parameters, 3 constants, one each for lambda1, lambda2, and lambda3 (the home parameter being defined by the difference between the constants for lambda1&2), and 2 inflation parameters, so 19+19+3+2=43 parameters total. One obvious thing I can try is to run KN’s double poisson model (i.e., no lambda3 or inflation parameters) on the same data and see if the AIC I get matches your result for the model without the rho adustment.

        • Update: first I ran a bivariate Poisson model (41 parameters), which yielded an AIC of 2260. Then I ran the (independent) double Poisson (40 parameters), which yielded an AIC of 2258. That seems to agree with your results, so apparently for this particular season the plain vanilla model fits the data quite well.

  4. Hello. I put the code (all below) in R, but after command
    res <- optim(par=par.inits, fn=DCoptimFn, DCm=dcm, method='BFGS')
    an error has occurred:
    Error in optim(par = par.inits, fn = DCoptimFn, DCm = dcm, method = "BFGS") :
    objective function in optim evaluates to length 380 not 1

    Any idea how to fix this?
    Thanks

    library(alabama)
    tau <- Vectorize(function(xx, yy, lambda, mu, rho){
    if (xx == 0 & yy == 0){return(1 – (lambda*mu*rho))
    } else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
    } else if (xx == 1 & yy == 0){return(1 + (mu*rho))
    } else if (xx == 1 & yy == 1){return(1 – rho)
    } else {return(1)}
    })
    DClogLik <- function(y1, y2, lambda, mu, rho=0){
    #rho=0, independence
    #y1: home goals
    #y2: away goals
    sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
    }

    DCmodelData <- function(df){

    team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))

    # attack, with sum-to-zero constraint
    ## home
    hm.a <- model.matrix(~ HomeTeam – 1, data=df)
    hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
    hm.a <- hm.a[,1:(length(team.names)-1)]

    # away
    am.a <- model.matrix(~ AwayTeam -1, data=df)
    am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
    am.a <- am.a[,1:(length(team.names)-1)]

    # defence
    hm.d <- model.matrix(~ HomeTeam – 1, data=df)
    am.d <- model.matrix(~ AwayTeam -1, data=df)

    return(list(homeTeamDMa=hm.a, homeTeamDMd=hm.d,
    awayTeamDMa=am.a, awayTeamDMd=am.d,
    homeGoals=df$FTHG, awayGoals=df$FTAG,
    teams=team.names))
    }
    DCoptimFn <- function(params, DCm){

    home.p <- params[1]
    rho.p <- params[2]

    nteams <- length(DCm$teams)
    attack.p <- matrix(params[3:(nteams+1)], ncol=1)
    defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

    lambda <- exp(DCm$homeTeamDMa %*% attack.p + DCm$awayTeamDMd %*% defence.p + home.p)
    mu <- exp(DCm$awayTeamDMa %*% attack.p + DCm$homeTeamDMd %*% defence.p)
    }
    data <- read.csv(file="C:/Users/Smith/Desktop/Book3.csv", header=TRUE) # EPL season 11-12
    dcm <- DCmodelData(data)
    nteams <- length(dcm$teams)

    #initial parameter estimates
    attack.params <- rep(.1, times=nteams-1)
    defence.params <- rep(-0.8, times=nteams)
    home.param <- 0.06
    rho.init <- 0.03
    par.inits <- c(home.param, rho.init, attack.params, defence.params)

    #informative names
    #skip the last team
    names(par.inits) <- c('HOME', 'RHO',
    paste('Attack', dcm$teams[1:(nteams-1)], sep='.'),
    paste('Defence', dcm$teams, sep='.'))

    res <- optim(par=par.inits, fn=DCoptimFn, DCm=dcm, method='BFGS')

  5. Thanks for the articles but I am having problems with this part of the code
    attack.p <- matrix(params[3:(nteams+1)], ncol=1) #one column less

    Why "One Column Less" ?

    I have modified the code to loop around a list of fixtures and when it exeucutes the attack parameter is missing for the last alphabetically named team in the list

    • When you want to have the attack parameters to sum to zero, as in this case, the last parameter is just the sum of the other attack parameters, multiplied by minus one. This is encoded in the design matrix where the rows where the team corresponding to this parameter have -1’s in the columns corresponding to all the attack parameters.

    • You mean of the DC model or the ordinary Poisson model? I am not really sure, I haven’t tested them thoroughly against each other, but I suspect that the DC model has a tendency to overfit, making the predictions poorer than what the ordinary Poisson model does. See this blog post, where I make some comparisons: http://opisthokonta.net/?p=1210

  6. Hi,

    I used the code you gave me to print out the values for O2.5 and U2.5 goals. However, I am always getting atleast 80% chance of O2.5 even when both teams are not expected to score much. Do you see any problem? Thanks!

    print(paste(“Over 2.5 Goals”, 100 – (sum(probability_matrix[1, 1:3]) + sum(probability_matrix[1:3, 1]) + probability_matrix[2, 2] * 100)))
    print(paste(“Under 2.5 Goals”, (sum(probability_matrix[1, 1:3]) + sum(probability_matrix[1:3, 1]) + probability_matrix[2, 2] * 100)))

    • When you multiply by 100 you only multiply the last term in the sum, which is not what you want to do. You must put parentheses around the sum before multiplying.

    • I guess so, but I am not really familiar with betting strategies, so I am not the best source of advise when it comes to this.

  7. Is there a way of adding in extra parameters to assist with the goal expectancy? i.e. if I have a power rating for each team or similar.

    • That is of course possible. In the full implementation I have given in this series of posts it will take a bit of effort to make the code work for general predictors, but it is possible. The easiest way is to use the approach in part 3 where you first fit a ordinary Poisson model, and then estimate the rho parameter separately. Using the glm function it is straightforward to add additional predictor variables.

Leave a Reply to Mark T Cancel reply

Your email address will not be published. Required fields are marked *