The Dixon-Coles model for predicting football matches in R (part 2)

Part 1 ended with running the optimizer function to estimate the parameters in the model:

library(alabama)
res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

# Take a look at the parameters
res$par

In part 1 I fitted the model to data from the 2011-12 Premier League season. Now it’s time to use the model to make a prediction. As an example I will predict the result of Bolton playing at home against Blackburn.

The first thing we need to do is to calculate the mu and lambda parameters, which is (approximately anyway) the expected number of goals scored by the home and away team. To do this wee need to extract the correct parameters from the res$par vector. Recall that I in the last post gave the parameters informative names that consists of the team name prefixed by either Attack or Defence.
Also notice that I have to multiply the team parameters and then exponentiate the result to get the correct answer.

Update: For some reason I got the idea that the team parameters should be multiplied together, instead of added together, but I have now fixed the code and the results.

# Expected goals home
lambda <- exp(res$par['HOME'] + res$par['Attack.Bolton'] + res$par['Defence.Blackburn'])

# Expected goals away
mu <- exp(res$par['Attack.Blackburn'] + res$par['Defence.Bolton'])

We get that Bolton is expected to score 2.07 goals and Blackburn is expected to score 1.59 goals.

Since the model assumes dependencies between the number of goals scored by the two teams, it is insufficient to just plug the lambda and mu parameters into R’s built-in Poisson function to get the probabilities for the number of goals scored by the two teams. We also need to incorporate the adjustment for the low-scoring results as well. One strategy to do this is to first create a matrix based on the simple independent Poisson distributions:

maxgoal <- 6 # will be useful later
probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

The number of home goals follows the vertical axis and the away goals follow the horizontal.

Now we can use the estimated dependency parameter rho to create a 2-by-2 matrix with scaling factors, that is then element-wise multiplied with the top left elements of the matrix calculated above:

Update: Thanks to Mike who pointed out a mistake in this code.

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda, mu, res$par['RHO']), nrow=2)
probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix

With this matrix it is easy to calculate the probabilities for the three match outcomes:

HomeWinProbability <- sum(probability_matrix[lower.tri(probability_matrix)])
DrawProbability <- sum(diag(probability_matrix))
AwayWinProbability <- sum(probability_matrix[upper.tri(probability_matrix)])

This gives a probability of 0.49 for home win, 0.21 for draw and 0.29 for away win.

Calculating the probabilities for the different goal differences is a bit trickier. The probabilities for each goal difference can be found by adding up the numbers on the diagonals, with the sum of the main diagonal being the probability of a draw.

awayG <- numeric(maxgoal)
 for (gg in 2:maxgoal){
   awayG[gg-1] <- sum(diag(probability_matrix[,gg:(maxgoal+1)]))
 }
awayG[maxgoal] <- probability_matrix[1,(maxgoal+1)]

homeG <- numeric(maxgoal)
  for (gg in 2:maxgoal){
    homeG[gg-1] <- sum(diag(probability_matrix[gg:(maxgoal+1),]))
  }
homeG[maxgoal] <- probability_matrix[(maxgoal+1),1]

goaldiffs <- c(rev(awayG), sum(diag(probability_matrix)), homeG)
names(goaldiffs) <- -maxgoal:maxgoal

It is always nice to plot the probability distribution:

DCBoltonBlackburn

We can also see compare this distribution with the distribution without the Dixon-Coles adjustment (i.e. the goals scored by the two teams are independent):

DCboltonBlackburn2

As expected, we see that the adjustment gives higher probability for draw, and lower probabilities for goal differences of one goal.

52 thoughts on “The Dixon-Coles model for predicting football matches in R (part 2)

  1. I couldn’t get this to run in RStudio but added this line of code and it worked and my results seem to tally up with yours now, does this seem right?

    # RHO parameter
    rhoooo <- res$par['RHO']

    Running this using this years data gives quite different odds to those on offer, I was quite surprised at how different they are.

    • Yes, you are absolutely correct. Thanks for pointing it out. I probably forgot it because I was going to change the embarrassing variable name.

      • I have a similar problem. I cant see how the rho parameter is used in the tau, DClogLik functions when it is initialised after the line res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

        Can you help?

      • I get 0.47 home win, 0.23 draw and 0.29 away win rather than 0.49, 0.21 and 0.29. I know it’s probably down to the mistake outlined by Mike but I can’t correct it.

          • Hi, yes they were the same, and for the expected goals part I got Bolton – 2.070044 and Blackburn – 1.596322.

          • Also I have a couple more questions if you don’t mind.

            How come the defence parameter is negative?

            What is best way to find the optimum value of RHO? Whenever I use difference data sets I get wildly different values. However using a value of around -0.13 gives fairly good results for home, draw and away odds when compared to the betting markets (I manually set lambda and mu to what the markets are expecting).

          • H,, so you got the same expected goals as I got . It seems, as you wrote, that there is something with calculating the probability matrix. Is the scaling_matrix calculated correctly? And how does the probability_matrix change after you scale the 2×2 top left part of it? Could be something there.

            Remember that the defense parameter is used to predict how many goals the opponent is expected to score. Hence, a negative defense parameter mean that the team let in fewer goals, which is equivalent to say that the opponent score fewer. The more negative defense parameter, the fewer goals they concede.

            If you use other sources for calculating expected goals, I suggest you take a look at this post. There I use the regular Poisson regression model to estimate the expected goals, and then estimate the rho parameter by it self. This should work for other ways of calculating expected goals.

            I have also found different values of Rho. See this post where I compare some different models. I suspect that the DC model does a bit of over fitting and perhaps does not generalize that well.

          • Hey! I know I’m a few years late with this one. I think Wilson Raj is right. I’ve been rerunning this model in Python and using your output as a sanity check. And I’m getting the same numbers as them. The probabilities to which you refer in the text come from the Unadjusted model. You can actually see it in the score difference graph. The Unadjusted model has a value close to 0.21 at 0, while the Adjusted model appears closer to 0.23. Typo aside, I really enjoy the blog!

  2. Hi! I ran your code exactly the same as the website on R studio, but instead of Bolton vs Blackburn I tried Arsenal vs Blackburn and I got this:

    > HomeWinProbability
    [1] 0.4186066
    > DrawProbability
    [1] 0.4192366
    > AwayWinProbability
    [1] 0.1621282

    Could you please explain me why? I changed only the names on this code:

    # Expected goals home
    lambda <- exp(res$par['HOME'] * res$par['Attack.Arsenal'] * res$par['Defence.Blackburn'])

    # Expected goals away
    mu <- exp(res$par['Attack.Blackburn'] * res$par['Defence.Arsenal'])

    • I am not sure what exactly you are asking about. Could you be more specific? The most simple answer is that you get different results because you use different teams, which of course have different parameter values associated with them.

      • Sorry if I was not specific.

        What I wanted to say is…why there is 41% of drawing? I mean, it is not too high? I was expecting more % on Home Team and less % on drawing.

        Thank you for your response

        • Perhaps it’s a bit high, considering Arsenal won 7-1 in that match in that season. But you have to remember that the estimates can be considered as averages across the entire data set. Specific circumstances are not taken into account. Also, what makes the Dixon-Coles model special is that it actually increases the draw probabilities. In this case the increase is perhaps a bit too much, but in most cases it is for the better.

          • Nice! Of course it seems more realistic! What was the change on the code? If possible…could you please show it? Thanks and looking forward to your great work!

          • When you calculate lambda and mu, you must add, not multiply, the parameters together. I have updated the code and figures in the text.

  3. Hello, very interesting and useful post! I just have one question: Is it possible to simulate the result of a match using the probability_matrix after we had modified the probabilities of 0:0,1:0,0:1,1:1? and How? I would like to use this model in order to simulate a the second part of a championship using the first part to build the model.
    Thank you,
    Nick

    • Yes it is possible. I quickly threw together this function that simulates goal differences based on a matrix of probabilities:

      probmat_goaldiff_sim <- function(probmat, nsim=10000){
        stopifnot(nrow(probmat) == ncol(probmat))
        sampled_idx <- sample(1:prod(dim(probmat)), size=nsim, replace=TRUE, prob=as.numeric(probmat))
        rownumm <-sampled_idx %% nrow(probmat)
        rownumm[rownumm==0] <- nrow(probmat)
        colnumm <- floor(sampled_idx / nrow(probmat))
        goaldiffs <- (rownumm-1) - (colnumm-1)
        return(goaldiffs)
      }
      
      # Plot the distribution
      plot(table(probmat_goaldiff_sim(probmat)), type='b')
      

      Since I made this real quick I have not tested it as much as I probably should have, but it seems to work as it should.

      • I used the code to simulate the result and feel strange.
        To be simply I call two teams are HomeTeam and AwayTeam.

        HOME: 0.09129187
        RHO: -0.02383217
        Attack.HomeTeam: 0.92095125
        Attack.AwayTeam: 1.46878625
        Defence.HomeTeam: -0.75826977
        Defence.AwayTeam: -1.77056685

        lambda: 0.468451
        mu: 2.035042

        maxgoal: 15

        Home Win: 0.0720454
        Draw: 0.1841713
        Away Win: 0.7437833

        When I used your code to simulate the match (nsim=1000000), the probabilities are:

        Home Win: 0.256828
        Draw: 0.257366
        Away Win: 0.485806

        Are there any possible reasons to explain the figures? Thanks for your reply.

        • It is hard to tell, but my guess is that when you simulate the goal difference distribution, the vector that is returned will sometimes be of different lengths. Some results are extremely rare, like a goal difference of 8, and even a huge simulation will sometimes miss that. So you have to be careful when you calculate the home/draw/away probabilities. But this is just my guess, the problem could of course be elsewhere.

      • What package do you need to download for probmat to work? I’m getting an error of probmat not being identified

        • You need to give the probability_matrix variable to the probmat_goaldiff_sim() function via the probmat argument.

  4. Hi,

    Thanks for your great post.

    I am wondering if using greater number of history data(say the past 5 season) will generate more realistic result.

    Also, can the dixon cole model be used to model half time result?

    • Thank you!

      I have written some posts about how much data you can include. I suggest taking a look at this post about the Dixon-Coles Weighting method. It basicly downweight the influence of older results in a realistic way. I suggest also reading this that I wrote, about including data from lower divisions as well.

      Yes you can use it to predict the half time result. I can think of to ways. The most obvious is to use half time goals instead of full time goals. If you don’t have half time data, one approach is to simply divide the lambda and mu parameters by two before you calculate the probability matrix.

  5. Hi,

    Love your blog. Is there any way we can use this model to calculate probability of BTTS(both teams to score) and Over 2.5 goals? Maybe based on how many goals each team is expected to score. If so, would you be able to provide the R code for that?

    • For BTTS, I am trying the following code, However it doesnt seem to be working.

      bttsProbability <- 0;
      for (i in 1:10) {
      for (j in 1:10) {
      bttsProbability <- bttsProbability + ((homeG[i] * awayG[j])*100)
      }

      }

      • To get the probability that both teams score at least one goal, you can do

        btts <- sum(probability_matrix[-1,-1])

        To get the probability of 2 goals or less you can do

        sum(probability_matrix[1,1:3]) + sum(probability_matrix[1:3,1]) + probability_matrix[2,2]

  6. Hello. Thank you for your blog.
    Can you help me with Dixon-Coles model please?
    The source on github: https://github.com/xznbum/dc-model
    Data for UEFA 2017 Group stage.
    The problem with DC model is wrong probability (DrawProbability, HomeWinProbability, AwayWinProbability).
    Also, in Poisson model there is problem with singularity in glm.
    Thank you.

    • If you only have group stage data you will get problems. You will get several groups of teams that are not comparable against each other, thus making the parameters impossible to estimate. This applies to both models. The glm function will typically report the problem. In the DC model you should look at the convergence code returned from auglag. I bet it reports non-convergence.

    • They show the probability of each goal difference. A positive goal difference means that the home team scores more goals, 0 goal difference is a draw, and negative goal difference is away team win.

  7. I’m trying to find a way to calculate lambda and mu for all matches in my dataset, but I’m struggling with it.

    After obtaining res$par, you use

    # Expected goals home
    lambda <- exp(res$par['HOME'] + res$par['Attack.Bolton'] + res$par['Defence.Blackburn'])

    # Expected goals away
    mu <- exp(res$par['Attack.Blackburn'] + res$par['Defence.Bolton'])

    To calculate for one specific match (ie Bolton x Blackburn).

    If you want a column of lambdas and mus, how would you go about it?

    I've tried several different combinations, for example:
    lambda.p <- exp(res$par['HOME'] + res$par['Attack.'[dta$HomeTeam]] + res$par['Defence.'[dta$AwayTeam]])

    as well as dcm$HomeTeamDM.

    I just keep getting a vector of "NA"

  8. Took me 2 and a half hours of playing around, but I fixed it:

    lambda.p <- exp(res$par['HOME'] + res$par[paste('Attack', dta$HomeTeam, sep='.')] + res$par[paste('Defence', dta$AwayTeam, sep='.')])

    Just posting it here for future references if people run into the same problem.

    Mu.p would work in a similar way.

  9. Hi again!

    I have two questions:

    1) Why dont we consider RHO when calculating mu as in lambda we consider HOME?

    2) What happens if the HomeWinProbability is 0.50 but the max of goaldiffs is equal 0? I mean, goaldiffs = 0 has the max probability.

    Thank you!

    • RHO only changes the probabilities for some low scoring results, while HOME increases the expected number of goals for the home team. RHO works on the results for both teams, but HOME only applies to the home team.

      A specific goal diff, such as 0, will often be smaller than the sum of different goal diffs, which is the case for the home win probability (ie sum of all goal diff probabilities >= 1).

  10. nice work. i will like to know if it is possible to predict the whole game for let say week 5 of the premier league at once, if i have a csv. file containing the fixture. if it is possible please help me with the code to do it.

    • Yes you can predict the results of several games, between any team you like. I am not going to show you any code, since it will probably not work for your specific file.

  11. Love your code. Was wondering if you could show how to plot adjusted vs unadjusted. Would be helpful. Thank you

Leave a Reply

Your email address will not be published. Required fields are marked *