The Dixon-Coles model for predicting football matches in R (part 1)

Please have a look at the improved code for this model that I have posted here.

When it comes to Poisson regression models for football results, the 1997 paper Modelling Association Football Scores and Inefficiencies in the Football Betting Market (pdf) by Dixon and Coles is often mentioned. In this paper the authors describe an improvement of the independent goals model. The improvement consists of modeling a dependence between the probabilities of low-scoring results, where both teams score fewer than two goals. They also improve the model by incorporating a time perspective, so that matches played a long time ago do not have as much influence on the parameter estimates.

The model by Dixon and Coles is not as easy to fit as the independent Poisson model I have described earlier. There is no built-in function in R that can estimate its parameters, and the authors provide few details about how to implement it. Mostly as an exercise, I have implemented the model in R, but without the time down-weighting scheme.

The estimation procedure uses a technique called maximum likelihood. This is perhaps the most commonly used method for estimating parameters in statistical models. The way it works is that you specify a way to calculate the likelihood of your data for a given set of parameters, and then you find the set of parameters that gives the highest possible likelihood of your data. The independent Poisson model is also fitted using maximum likelihood. The difference here is that the likelihood used by Dixon and Coles is non-standard.
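
To make this concrete, here is a minimal sketch, with made-up data, of maximum likelihood estimation for a single Poisson rate. It uses the same pattern we will use below: write down the (negative) log-likelihood and hand it to an optimizer.

# Toy example: estimate the rate of a Poisson distribution by maximum likelihood.
goals <- c(2, 0, 1, 3, 1, 0, 2)  # made-up goal counts
negLogLik <- function(lambda) -sum(dpois(goals, lambda, log=TRUE))
# The minimizer of the negative log-likelihood is the ML estimate,
# which for the Poisson is simply the sample mean.
optimise(negLogLik, interval=c(0.001, 10))$minimum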

The model is similar to the other regression models I have discussed. Each team has an attack and a defense parameter, and from a function of these the expected number of goals for each team in a match is calculated. For the rest of this post I am going to assume you have read the paper. There is a link to it in the first paragraph.

The most obvious thing we have to do is to implement the function referred to by the Greek letter tau (τ). This is the function that, depending on the rho (ρ) parameter, adjusts the probabilities of the low-scoring results.
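
For reference, the function is defined in the paper as

$$\tau_{\lambda,\mu}(x, y) = \begin{cases} 1 - \lambda\mu\rho & \text{if } x = y = 0 \\ 1 + \lambda\rho & \text{if } x = 0,\ y = 1 \\ 1 + \mu\rho & \text{if } x = 1,\ y = 0 \\ 1 - \rho & \text{if } x = y = 1 \\ 1 & \text{otherwise} \end{cases}$$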

tau <- Vectorize(function(xx, yy, lambda, mu, rho){
  if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
  } else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
  } else if (xx == 1 & yy == 0){return(1 + (mu*rho))
  } else if (xx == 1 & yy == 1){return(1 - rho)
  } else {return(1)}
})

We can now make a function for the likelihood of the data. A common trick when implementing likelihood functions is to use the log-likelihood instead. The reason is that when the probabilities for each data point are multiplied together, the product quickly becomes too small for the computer to represent. When the probabilities are log-transformed you can instead just add them together.
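
A quick demonstration of the problem, with simulated probabilities:

set.seed(1)
p <- dpois(rpois(1000, 1.4), 1.4)  # probabilities for 1000 simulated goal counts
prod(p)      # 0: the product underflows
sum(log(p))  # a perfectly ordinary negative number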

What this function does is take the vectors of lambda (expected home goals) and mu (expected away goals), rho, and the vectors of observed home and away goals, and compute the log-likelihood for all the data.

DClogLik <- function(y1, y2, lambda, mu, rho=0){
  #rho=0, independence
  #y1: home goals
  #y2: away goals
  sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
}
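
As a quick sanity check, the function can be called with some made-up numbers:

# Two toy matches, 2-0 and 1-1, with hypothetical expected goals.
DClogLik(y1=c(2, 1), y2=c(0, 1), lambda=c(1.6, 1.3), mu=c(0.8, 1.1), rho=-0.1)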

The team-specific attack and defense parameters are not included in the log-likelihood function. Neither is the code that calculates the expected number of goals for each team in a match (lambda and mu). Before we can calculate these for each match, we need to do some data wrangling. Here is a function that takes a data.frame formatted like the data from football-data.co.uk, and returns a list with design matrices and vectors with the match results.

DCmodelData <- function(df){

  hm <- model.matrix(~ HomeTeam - 1, data=df, contrasts.arg=list(HomeTeam='contr.treatment'))
  am <- model.matrix(~ AwayTeam - 1, data=df)
  
  team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))
  
  return(list(
    homeTeamDM=hm,
    awayTeamDM=am,
    homeGoals=df$FTHG,
    awayGoals=df$FTAG,
    teams=team.names
    )) 
}
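
The design matrices are just indicator (dummy) matrices, with one row per match and one column per team. A small made-up example:

mini <- data.frame(HomeTeam=factor(c('Arsenal', 'Chelsea')),
                   AwayTeam=factor(c('Chelsea', 'Everton')))
# Each row has a 1 in the column of the team playing at home.
model.matrix(~ HomeTeam - 1, data=mini)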

Now we create a function that calculates the log-likelihood from a set of parameters and the data we have. First it calculates the values of lambda and mu for each match, then it passes these, together with the number of goals scored in each match, to the log-likelihood function.

This function needs to be written in such a way that it can be used by another function that will find the parameters that maximize the log-likelihood. First, all the parameters need to be given via a single argument, in the form of a vector (the params argument). Also, the log-likelihood is multiplied by -1, since the optimization function we are going to use only minimizes, but we want to maximize.

DCoptimFn <- function(params, DCm){

  home.p <- params[1]
  rho.p <- params[2]
  
  nteams <- length(DCm$teams)
  attack.p <- matrix(params[3:(nteams+2)], ncol=1)
  defence.p <- matrix(params[(nteams+3):length(params)], ncol=1)
  
  lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)
  mu <- exp(DCm$awayTeamDM %*% attack.p + DCm$homeTeamDM %*% defence.p)
  
  return(
    DClogLik(y1=DCm$homeGoals, y2=DCm$awayGoals, lambda, mu, rho.p) * -1
    )
}

One more thing we need before we start optimizing is a function that helps the optimizer handle the constraint that the attack parameters must average 1, as in the original paper. If this constraint isn’t imposed, it is impossible to find a unique set of parameters that maximizes the likelihood.

DCattackConstr <- function(params, DCm, ...){
  nteams <- length(DCm$teams)
  attack.p <- matrix(params[3:(nteams+2)], ncol=1)
  return((sum(attack.p) / nteams) - 1)
}

Now we are finally ready to find the parameters that maximize the likelihood of our data. First, load the data (in this case from the 2011-12 Premier League), and process it with our DCmodelData function:

dta <- read.csv('FAPL1112.csv')
dcm <- DCmodelData(dta)

Now we need to give a set of initial estimates of our parameters. It is not so important what specific values these are, but they should preferably be of the same order of magnitude as what we think the estimated parameters will be. I set all attack parameters to 0.01 and all defense parameters to -0.08.

#initial parameter estimates
attack.params <- rep(.01, times=nlevels(dta$HomeTeam))
defence.params <- rep(-0.08, times=nlevels(dta$HomeTeam))
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)
#it is also useful to give the parameters some informative names
names(par.inits) <- c('HOME', 'RHO', paste('Attack', dcm$teams, sep='.'), paste('Defence', dcm$teams, sep='.'))

To optimize with an equality constraint (the attack parameters must average 1) we can use the auglag function in the alabama package. This takes about 40 seconds to run on my laptop, much longer than the independent Poisson model fitted with the built-in glm function. This is because the auglag function uses general purpose algorithms that can work with a whole range of home-made functions, while the glm function is implemented with a specific set of models in mind.

library(alabama)
res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

Voilà! The parameters can now be inspected with res$par. In a follow-up post I will show how we can use the model to make predictions of match outcomes.
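
Here is a sketch of how a table like the one below can be pulled out of res$par:

nteams <- length(dcm$teams)
data.frame(Team=dcm$teams,
           Attack=round(res$par[3:(nteams+2)], 2),
           Defence=round(res$par[(nteams+3):length(res$par)], 2))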

Team Attack Defence
Arsenal 1.37 -0.91
Aston Villa 0.69 -0.85
Blackburn 0.94 -0.47
Bolton 0.92 -0.48
Chelsea 1.23 -0.97
Everton 0.94 -1.15
Fulham 0.93 -0.89
Liverpool 0.89 -1.13
Man City 1.56 -1.43
Man United 1.52 -1.31
Newcastle 1.10 -0.88
Norwich 1.02 -0.62
QPR 0.82 -0.65
Stoke 0.64 -0.87
Sunderland 0.86 -0.99
Swansea 0.85 -0.89
Tottenham 1.24 -1.09
West Brom 0.86 -0.88
Wigan 0.81 -0.71
Wolves 0.79 -0.42
Home 0.27
Rho -0.13

128 thoughts on “The Dixon-Coles model for predicting football matches in R (part 1)”

  1. Really good to actually see someone showing how to do this, something there isn’t a ready example for around. Looking forward to the next instalment.

    • Hi,

      Great stuff, and I would just like to know how you capture the defence and attack parameters. I mean, from the data, what variable does it look at to determine either of the parameters?

      • Another interpretation of the attack and defense parameters I like is that attack is the effect of a team on the number of goals scored, and defense is the effect of the team on the number of goals scored when that team is the opposition. Take a look at this post to see how the data can be arranged to make attack (Team) and defense (Opposition) variables:

        http://opisthokonta.net/?p=276

  2. Help with this error: Error in names(par.inits) <- c("HOME", "RHO", paste("Attack", dcm$teams, :
    'names' attribute [44] must be the same length as the vector [40]

    • It’s hard to tell, but by the looks of it, you may have forgotten to run some of the code, or perhaps mixed the code posted here with the code posted in part 4.

      Another possibility is that something went wrong in the DCmodelData function. Perhaps try this to create the attack.params and defence.params vectors: rep(.01, times=length(dcm$teams))

  3. Help me with this error:
    > res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)
    Error in if (xx == 0 & yy == 0) { : missing value where TRUE/FALSE needed

    • Seems to have something to do with the tau-function. Do you have any missing values in one of the vectors with the number of goals?

      • That’s what I thought, but I copied the values as they are.

        tau <- Vectorize(function(xx, yy, lambda, mu, rho){
        if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
        } else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
        } else if (xx == 1 & yy == 0){return(1 + (mu*rho))
        } else if (xx == 1 & yy == 1){return(1 - rho)
        } else {return(1)}
        })

        I don’t know what’s happening.

        • The tau-function is of course fine, but are you sure you don’t have any missing values in your data? I think that is what is causing the error.

          • It is just that I use different names for the arguments in the different functions that mean the same thing. The code could have been better in this respect.

    • Yes, this tends to happen during the optimizing, and the reason is that the optimizer tries a combination of values that makes the tau function negative. Taking the logarithm of a negative number will obviously not work. I am not sure how to fix it, but as long as the optimizer converges I don’t think it is a problem.

  4. Hello.
    First, I was impressed by your post.
    I have a question about the initial parameter values.
    If you don’t mind, can you tell me why you set them up as follows?

    attack.params <- rep(.01, times=nlevels(dta$HomeTeam))
    defence.params <- rep(-0.08, times=nlevels(dta$HomeTeam))
    home.param <- 0.06
    rho.init <- 0.03

    Regards,

    • I just tried out some values and went for some that seemed to work most of the time. They may not work on all data sets and if the optimization fails you should just try some other values.

  5. Great blog!

    I read the original paper by Dixon and Coles, and they claimed that the assumption of independence between scores is reasonable except for the scores 0-0, 1-0, 0-1 and 1-1, and that is why they introduced the tau function in this way. I am not sure if that is the case with any data when we assume independence, or if it is just with their data. How did they detect that the assumption is violated when the scores are less than 2? Because in my data, which is the last season of La Liga (Spanish league), based on the standardized residuals of the chi-square test of independence there is no problem with the independence assumption for scores less than 2. I would really like to know: is it a general thing, or does it vary from one data set to another, and what was the case with your data?

    Thank you, and again your blog is very interesting and helpful.

    • Thank you! Dixon and Coles used English data from 1992-95 (6500+ games), and in their paper they describe how they inspected the independence assumption. They first calculated the expected number of instances of each result under the independence assumption, then they divided this by the observed number of games with that result. They also computed the standard errors for this ratio. They present this analysis in table 2 in their paper. Then I think they just took a look at the table and saw that there were quite a lot more draws among the low scoring games, and fewer 0-1 and 1-0 games.

      I have compared the Dixon-Coles model against some other models here, and from those analyses I suspect that the model tends to overfit the data a bit.

      • Do you know how these standard errors are calculated? I am trying to replicate this in R but am unsure of the method used.

        I am assuming you are referring to Table 1 and Table 2 in the Dixon and Coles paper.

  6. Great blog – I have a question:

    why is your lambda

    lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)

    and not

    lambda <- DCm$homeTeamDM %*% attack.p * DCm$awayTeamDM %*% defence.p * home.p ??

    Why do you exponentiate it? I thought lambda = alpha_i * beta_j * gamma, as in Dixon-Coles?

    Thanks very much

    • Mathematically they are the same. I used the exponentiation since that is what is commonly done in standard Poisson regression. Although I haven’t really tested it, I think the exponentiation will also make the optimization easier, since you are guaranteed not to have a negative lambda for any value of the attack and defense parameters. This is not the case when you multiply the parameters together.

  7. Hi, this blog is amazing, but I’m a little stuck understanding one bit.
    Would you mind explaining this line of code please (in the DCoptimFn function).

    attack.p <- matrix(params[3:(nteams+2)], ncol=1)

    Thanks.

    • Thank you! This line of code creates a 1-column matrix of the attack parameters. To make DCoptimFn work with the optimizer it has to get all parameters in just one vector. When you have a lot of parameters, and they belong to different groups, you have to do a lot of indexing to get the right parameters from the ‘params’ argument. The first two elements of ‘params’ are the home and rho parameters, and then follow all the attack parameters, so the indexing has to start at the third element. The number of attack parameters is the same as the number of teams, so the last attack parameter is the (nteams+2)th element of ‘params’.
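
      For a hypothetical four-team league the layout of ‘params’ would be:

      # params[1]     home advantage
      # params[2]     rho
      # params[3:6]   attack (indices 3 to nteams+2)
      # params[7:10]  defence (indices nteams+3 to length(params))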

      • Thank you so much, that’s cleared a lot up for me. There’s just one more thing I’m still a little unsure about: what data is going into params, i.e. how do we get the home & rho parameters and the attack parameters for each team?

        Thanks!

        • You mean how you estimate them? That is taken care of by the optimizer, which in this case is the auglag function from the alabama package. This function finds the most likely parameters (the likelihood for a given set of parameters is defined by the DCoptimFn function) for the data you give it, which is in this case passed via the DCmodelData function.

  8. In DClogLik I think y1! (i.e. the factorial of y1) should be multiplied by dpois(y1, lambda), and similarly y2! by dpois(y2, mu), because formula 4.3 in the original paper does not have the two factorials in the denominator.

    Am I missing something here?

    Thanks.

    • You are right that they have removed the term log(1/x!) which could (should?) have been in equation 4.3. As they write, that formula is only proportional to the true log-likelihood. The true log-likelihood would have the same form as the log of equation 4.2. The reason they can drop the log(1/x!) term is that it does not depend on the parameters. That is to say, dropping or keeping it does not alter the parameter estimates.

      • Okay! Thanks, I got it.
        One more thing though. If I want to include more parameters, like the number of league titles won by the club or money spent on transfers, those would go in the lambda and mu equations, right?

  9. Awesome posts! One error I’m getting though:

    > res res$par
    Error: object ‘res’ not found

    Any chance you know why? Thanks!

    • Yes, it’s probably because you have not run the line of code that creates the res object, or some error happened when you ran it.

  10. Hi dear,
    Your post is a life saver in my journey into sports analytics.

    I have tried to implement the code you provided. All went well except for this error I get when I run this line:

    #line code
    res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

    #result
    Error in DCm$awayTeamDM %*% defence.p : non-conformable arguments

    Any suggestion on what is going on and how I can fix this? Thanks for your time and help.

    • There seems to be a mismatch between the number of defence parameters and the number of columns in the away team defence matrix. I suggest looking at how these two look, and at the code that created them.

      • Just started to implement this and had the same problem. The reason was a too small data set: not all teams had played each other yet. I added the previous season’s data and things started to work.

        • Thank you for sharing this! In general, I think that using less than a season of data is a bad idea, even if you get the model to work. It’s simply not enough data to get reasonable results.

    • I’d think so. You could try to fit the model to the results from the first half, or you could try to simply divide the lambda and mu parameters (in part 2) by two.

  11. Hi,
    As someone with minimal math skills and zero programming knowledge, where should I start if I want to eventually be able to make a model like this?
    Should I learn R first? Or maybe take some course on statistics?
    I have some experience building a Poisson-Elo based model in Excel (like the one on clubelo.com) but this seems to be on a whole other level.
    I may be wrong, but it doesn’t seem like it would be possible to build a Dixon-Coles model in Excel.

    • I would suggest learning R first. Some of the maths/statistics will of necessity come by itself along the way. If you have implemented Elo-type models in Excel, you should do fine. I think it would be possible to implement the DC model in Excel. I almost managed to implement the ordinary Poisson model in OpenOffice, but it lacked the optimization routines that I think Excel has. Since the DC model is just a tweaked Poisson model, it should be possible.

  12. Hi Jonas,

    I had a tangential question concerning the Dixon-Coles paper.

    How would you go about seeking to replicate the analysis undertaken in Table 2 on page 268? Calculating the ratio is easy to do – but how would you test in R whether or not the particular match outcomes observed in real life in the matrix were statistically different to the joint probability of independent home and away scoring functions?

    I’m looking into doing the model for a non-European league and am interested in testing, rather than assuming, that 0-0, 1-0 and 0-1 are the only outcomes needing adjustment.

    I saw another paper (http://www.ylikerroin.com/file/Complete.pdf) which suggested overlaying a Poisson correction matrix on the estimated probability matrix for the home and away Poisson scoring processes, to potentially correct for more than just the 0-0, 0-1 and 1-0 issues.

    • I guess the standard thing to do would be a Fisher exact test or chi-square test of independence. I am, however, not a great fan of chi-square tests, as I think they are a bit difficult to interpret. A different approach I have used is to fit the models and calculate AIC. The AIC sometimes indicates better fit for the DC model, but it is inconsistent in the direction of the effect. Sometimes the effect is in the opposite direction of what is shown in the paper. I haven’t really tested the DC model properly, but I suspect there is some overfitting going on.

      For other kinds of adjustments, I would suggest looking into the bivpois R package, and the accompanying paper.

      Where in that paper is the correction matrix thing described?

      • Thanks for replying so quickly and for the links.

        The correction method is mentioned in pp53-54 of the paper in my previous post. The author admits the method is ad hoc – and I haven’t gone through the process as yet to determine if the method is any good or not on the data I have.

        The paper analysed the effectiveness of different models by summing the probabilities assigned to the correct game prediction – however I think RPS or RPSS (and comparing its effectiveness against a model of assigning 1/3 probabilities to W/D/W or aggregated bookmaker odds) would be better.

        The league I am focusing on is the Australian A-League. It is different to the major European leagues in that there is no promotion/relegation (at this stage) and it is not uncommon for senior squads to change significantly from season to season due to the operation of a salary cap (with some exemptions) and there being no domestic transfer market (i.e. it is relatively easy to get out of a contract at one Australian A-League club and move to another even whilst under contract).

        • Thank you. I am not quite sure what to think about the “empirical correction” method. I think perhaps it could be viewed as a sort of regularization, where you adjust the probabilities from your model (presumably calculated from knowing who the teams are and the home field advantage, etc) towards the overall mean distribution. This could make your predictions less precise, but perhaps more realistic in the sense that your predictions don’t get too confident.

          One thing to try would be to use the overall mean matrix (M) to adjust the matrix from a model (K) using a correction factor y (between 0 and 1), like this:

          (1-y)*M + y*K

          and try to tune the interpolation factor y for optimal prediction somehow. I am not so sure it will do better, but who knows?

          I once tried to adjust the time-weighting scheme from the DC paper to incorporate an additional “past season” effect, but it did not give better predictions for the Premier League. Perhaps this could be more appropriate for the Australian league?

    • I do things a bit differently from the paper, but it is mathematically the same. In the paper they multiply the coefficients together to get the expected goals. Then you of course can’t have negative coefficients. Here I add the coefficients together and exponentiate the result, similar to what is done in the more common formulations of Poisson regression. Then you can have negative coefficients, since the exponentiation makes the end result positive.

      • Hi again, I was just wondering if you could whip up some code in order to do it the way that Dixon and Coles originally did it, perchance?

        • Maybe. I have made a new implementation that is a lot simpler and does not rely on matrices/linear algebra. It is thus easier to modify and extend. It could easily be modified to use the multiplication operator instead of addition. Maybe.

  13. Hi,

    Can you explain how the probability matrix works? How can I use it to predict the chance of, say, 4 or more goals in a match, or maybe one team scoring 2 or more goals?

    • Each element in the matrix is the probability of that particular score. The probabilities in the matrix sum to (approximately) 1, so it is a probability distribution over all scores. To get the probabilities you want you can sum over all the scores that match your criterion. The probability of the home team scoring 2 or more goals, for example, is the sum of all rows from row 3 (remember that row 1 corresponds to 0 goals) and downwards.

      • Sorry, I still don’t quite understand how it works. Could you maybe show the code for predicting 4 or more goals in the match and the home team scoring 2 or more? Perhaps then I can use that to understand it better.

        • Here’s a quick way to calculate the probability of 4 or fewer goals in total. More than four would be just 1-prb.

          prb <- 0
          for (ii in 1:5){
            prb <- prb + sum(probmat[ii,1:(5-ii+1) ])
          }
          
          • Thanks! Also, how would I calculate the chance of, say, the home team scoring 2 or more goals?

  14. Hi! Great article.

    I have one question.

    Which data are you using? I mean, which columns of the data that football-data.co.uk provides?

    Thank you very much!

        • For this model you can use 0 for all parameters, or those that I used in the post. I have tried to use more informed starting values, for example using the logarithm of the average number of goals scored and conceded for each team. This usually gives faster estimation, but the results are the same.

  15. I’m confused as to where in this code you are looping through all of the games. It seems as if the attack parameters aren’t being updated after each match during optimization.

    I’m curious because I’m attempting to implement something similar in Python using scipy.optimize.minimize.

    Thanks!

    • In the DCoptimFn function the expected goals (lambda and mu) for each game are calculated with matrix operations, so there is no explicit loop construct in the code. In DClogLik I use R’s vectorized operations. In R all strings, numerics and logicals (bools in Python) are vectors (1D arrays), and all common operations (+, -, *, == etc.) work elementwise on these vectors, so there is often no need to write explicit loops, like in this case.
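
      For example:

      c(1, 2, 3) + c(10, 20, 30)  # elementwise addition: 11 22 33
      c(0, 1) == c(0, 5)          # elementwise comparison: TRUE FALSE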

      • Thank you for the reply!

        The last problem I’m having is with my lambda and mu calculations. For some reason sometimes mu and/or lambda are negative, and attempting to do the log of a negative is not possible as it returns a math domain error.

        I’m doing the same lambda and mu calculations in Python (with matrix operations).

        Any help would be greatly appreciated!

  16. When calculating the attack and defence ratings does it take into account the strength of the opponent in each game?
    eg. Scoring goals against a team with a good defence rating counts for more compared to scoring against a team with a weak defence.

    • Yes! The predicted number of goals scored by a team is a function of the attack strength of that team and the defense strength of the opposition. All parameters are estimated simultaneously, to give the optimal fit to the data. The opposite of what you write is also true: scoring goals against a team with a good defense will not only improve the attack rating, but also reduce the opposition’s defense rating. But this is done over all matches simultaneously, giving a sort of average.

  17. Hi, firstly, thanks for the great blog; it has helped me a lot.

    I had a question on using your above code and adding an allowance for time weighting. (This is more R related than theoretical but any help would be appreciated.) I have tried this by doing the following:

    1.) adding a weights vector to the list created by the ‘DCmodelData’ function (with all weights set to 1 to start with) .

    2.) Passing this vector to the ‘DClogLik’ function from the ‘DCoptimFn’ function.

    3.) Multiplying each part of the sum() in ‘DClogLik’ by the weight.

    Then I run the optimiser as before, which runs without error but only for 3 iterations and does not produce meaningful results. Do you have any idea why the optimiser would produce such a different solution given such a minor addition to the code?

    Thanks.
    Graeme

    • That’s strange. What convergence code do you get? The weights should all be positive. Are yours? Are you sure the weights are multiplied correctly? And have you looked at the other posts about the DC model I have written? There might be something in those.

  18. Hi,

    This is great stuff, but what would it look like if you optimised it using the glm function (how could I write such code)? The reason I ask is that I don’t have the alabama package in my R.

    • Take a look at this post to fit a simple model using the glm function. Also take a look at this post where I fit the DC model using the glm function. Lastly, take a look at this post, where I fit the complete DC model using the built-in optim function. Have you tried installing the package using the command install.packages(“alabama”)?

  19. Is there any way to convert this model to Python easily? Or is there already code out there for a Dixon-Coles model in Python?

    Thanks!

  20. Hi, great blog!
    I found Martin Eastwood’s code, which worked fine in my R. I was excited to find your code for applying the Dixon & Coles adjustment in R. I copy-pasted and ran all your code in R and have not been able to get to the end. I have received the same error messages posted by people previously. Could some of the problem be caused by the team.names/teams component? I am just getting a null vector in teams. What is the purpose of the first line with team.names and of teams=team.names? Also, running the function unique on the data frame gives me a null vector.

    team.names <- unique(c(df$HomeTeam), (df$AwayTeam))

    return(list(
    homeTeamDM=hm,
    awayTeamDM=am,
    homeGoals=df$FTHG,
    awayGoals=df$FTAG,
    teams=team.names

    • What error message are you talking about? Can you post it here?

      The first line merges the vector of home teams and away teams, and finds all unique team names. This creates a vector of the names of all teams in the data set. This is then put in the list that is returned by the DCmodelData function.

  21. Hi,

    You are amazing! This post series is great!

    By the way, I have a question. Is there a way to use non-integer values in this model? I have some ratings which I want to use instead of goals. (Or maybe a way to include them?)

    PS: The current model I use is like part 4 of your articles.

    Thanks for your support!

    • By the way, this is the error that I get when I use non-integer values

      “Error in optim(par = par.inits, fn = DCoptimFn, DCm = dcm, method = “BFGS”) :
      initial value in ‘vmmin’ is not finite”

      With goals it runs correctly, but I think changing a parameter in the optim function could make this work? Or the par.inits initial values? As far as I understand, the function is evaluating NaN, NA, Inf… but I want to replace goals with this: “[1] 3.07 0.75 0.73 0.41 0.70 0.64 1.71 0.27 1.75 1.99 3.19 2.18 0.29 2.22 1.04 2.14
      [17] 2.43 0.91 0.69 2.06 1.27 0.85 1.40 0.44 0.53 0.29 2.13 2.06 1.62 0.85 1.19 4.68” (sample data)

      Thanks!!

        • You can’t use non-integer values in this model. This is because the model is based on the Poisson distribution, which is only valid for non-negative integers. Perhaps a Gaussian distribution is more appropriate in your case. The easiest thing would probably be to fit a linear regression model using the built-in lm function.

        • Ok! So instead of

          “m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())"

          Must be

          "m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=GAUSSIAN())"

          (Part 3 of your articles)

          • Yes, but you need to put gaussian in lower case. You will get a warning saying that it is inappropriate to use the Gaussian family with the glm function and that you should use the lm function instead.

  22. Awesome and very useful post – thanks a lot!

    I get a problem though – when using data from the site you refer to, some of the csv files contain rows with NAs in FTHG and FTAG, and are just empty in HomeTeam, AwayTeam etc.

    Now, it is not a problem just removing them; the problem is that when I try to find the parameters as you do, the data thinks that there is a HomeTeam and AwayTeam named "" (blank). You wouldn’t expect it to optimize parameters for this team, but it does! When optimizing using the Premier League 2014/15 season, I get a "" team with attack parameter 0.8301 (its defence parameter is just the initial value we chose from the start).

    Do you know what happens here and why? I’ve tried to change the levels of both HomeTeam and AwayTeam, putting Arsenal on the blank, but this gives me different parameters. If I put any other team in the levels, I again get new, different parameters!

    Hope you can help!

    • I think the problem is that loading csv files makes the HomeTeam and AwayTeam columns of the factor type. So even if you remove the games with blank team names, the blank team name is still kept as a possible value. Try to load the data using stringsAsFactors=FALSE in the read.table or read.csv functions. By the way, take a look at my most recent blog post for a simpler implementation.
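
      Something like this (the file name is just an example):

      dta <- read.csv('E0.csv', stringsAsFactors=FALSE)
      dta <- dta[dta$HomeTeam != '', ]  # drop the rows with blank team names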

  23. Sorry if it seems like I’m nitpicking, but I’m new to R, so I was just wondering if you could tell me why, in DClogLik, you have done sum(a + b + c) rather than just sum(a, b, c) or (a + b + c)? Again, sorry if this is a silly question, but I wanted to know if this method made a difference to the output.

    • I did it because each element in the three vectors have a value that corresponds to the same match, so it felt natural to add them together, before all terms are summed again. It could be of interest to inspect each of the terms, especially for debugging, but you are correct that this is not necessary in this particular case.

      • Thank you! Also, in your most recent post about this I was wondering if you could explain the intuition behind the sum to zero constraint on the defense parameters?

        • The constraint is actually not necessary in the most recent implementation, because I forgot to include the intercept. Anyway, there is a general problem when including categorical predictors in a regression model: without some constraints you cannot uniquely estimate the parameters. For instance, you could add some random value to the intercept, and then just subtract this value from the defense parameters, and still get the same predicted score. With the sum-to-zero constraint (or some other constraint, such as setting one of the parameters to zero), you bypass this problem.

  24. Hi, could you tutor me on how to implement the model in R? I’m willing to pay you at a really good rate. I just need help with the model.

  25. Hey, I have a question.

    When we maximize the joint probability function and try to get the derivatives, how do we incorporate the tau function, given that there are four different cases of tau depending on what the goals are? Do we maximize the joint probability with the FOUR different tau functions SEPARATELY, giving us four different equations, and just use the one we need depending on what score we are calculating? Or is there a way to get tau into the joint completely, giving us just one equation? I’m confused here because you don’t explain the maximization process in detail.

    Thanks

    • I don’t calculate the derivatives analytically, only numerically. In this post I use the auglag function, but in the other posts I use the optim function in R. If a function returning the derivative is not supplied to the auglag and optim functions, they use a type of finite differences approximation. Which tau-function is used depends on the score, so for each match only one of them is used.

  26. I didn’t understand very well how the data set should look. When I saw dcm <- DCmodelData(dta), I saw that dcm is a list. OK. But I didn’t understand the structure of dcm$homeTeamDM, which is like zeros and ones. Could you explain it to me? I believe I don’t understand how the data set must be. I am trying to replicate it with other variables, like shots and goalkeeper’s defense.

  27. Hi, great work. I would like to know why “DClogLik <- function(y1, y2, lambda, mu, rho=0){
    #rho=0, independence
    #y1: home goals
    #y2: away goals
    sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
    }"
    has no return statement.
    Thanks.

    • If there is no return statement, the result of the last expression is returned. I agree that it is confusing, and it is bad practice not to use return().

  28. Hi Jonas, John again. What is contrasts.arg=list(HomeTeam=’contr.treatment’)? What does it do? I removed it from the code but I didn’t see any change.

  29. Hi! This is still a great and incredibly helpful post, thank you! I know I’m very late, but I see you are still answering questions. I have a small one pertaining to the last line of code. Why don’t we have to specify the ‘params’ and ‘DCm’ arguments of the DCoptimFn and DCattackConstr functions within the auglag function? I see that DCm is specified at the end of the auglag call, and I understand that DCoptimFn and DCattackConstr take par=par.inits for their ‘params’ argument. But why doesn’t something like res <- auglag(par=par.inits, fn=DCoptimFn(par.inits, dtm), heq=DCattackConstr(par.inits,dtm)) work? I guess it’s just the nature of the auglag function to pass the ‘par’ argument as the first argument to the functions specified at fn and heq?

    • Yes, you are right, the auglag function has a similar interface to the built-in optim function. You only have to specify the name of the function, and the optimizer gives the par argument as the first argument to the specified functions.

  30. How can you extend the model to account for more factors, like friendlies, weather, coach changes and such? Because we only have the attack strength, defense strength and the home advantage. So, how would you go about accounting for those? Can you give an example? Would we also have to multiply those by the attack and defense values?

  31. I noticed that the Dixon-Coles paper defined lambda as a multiplicative combination of the attack/defence/home parameters, but you have implemented an exp(attack + defence + home) lambda. I am aware that this is multiplicative in the sense that you get exp(attack)*exp(defence)*exp(home), but I was just wondering why you did what you did?

    Cheers

    • It makes it easier to estimate the parameters. In the multiplicative formulation you need to make sure the parameters aren’t negative. The numerical optimisation used for estimating the parameters works better if the parameters don’t have bounds on them. Using the exp(attack + defence + home) formulation is also the standard way of formulating the Poisson regression model, and I think it is best to have the two models formulated the same way.

  32. Hi Jonas,
    in your beautiful package goalmodels I see that the parameter eta1 is calculated as “intercept + attack1 + home advantage – defense2”. Why? Shouldn’t it be “… + defense2”? With “… – defense2”, when defense2 decreases (strong defense) the expected goals of team1 increase, instead of decrease. Can you explain this to me? Maybe I’m wrong. Thank you very much.

    • It is true that it is minus defense. The standard glm formulation uses plus, but for this particular application many use minus. The reason is that you can then interpret greater attack and defense ratings as the team being better. A higher defense rating with the minus formulation means that the opposition will be expected to score fewer goals.

      Using plus or minus does not matter for the predictive ability; it is just a matter of taste.

  33. I have tried using the code in part 4 on 2013/2014 EPL data. However, it says

    res <- optim(par=par.inits, fn=DCoptimFn, DCm = dcm, method='BFGS')
    Error in fn(par, …) : object 'Dcm' not found

    What is DCm, and is there any way to define it in the earlier stages to avoid this? Also, when trying to get mu, my value always ends up NULL. I don’t see what the problem is. My earlier code is:

    DClogLik <- function(y1, y2, lambda, mu, rho=0){
    #rho=0, independence
    #y1: home goals
    #y2: away goals
    sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
    }

    #Format the data similar to the data frame in S1314 and return the list
    DCmodelData <- function(df){
    team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))
    # attack, with sum-to-zero constraint
    ## home
    hm.a <- model.matrix(~ HomeTeam - 1, data=df)
    hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
    hm.a <- hm.a[,1:(length(team.names)-1)]

    # away
    am.a <- model.matrix(~ AwayTeam -1, data=df)
    am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
    am.a <- am.a[,1:(length(team.names)-1)]

    hm.d <- model.matrix(~ HomeTeam - 1, data=df)
    am.d <- model.matrix(~ AwayTeam -1, data=df)

    return(list(
    HomeTeamDMa=hm.a, HomeTeamDMd=hm.d,
    AwayTeamDMa=am.a, AwayTeamDMd=am.d,
    HomeGoals=df$FTHG,
    AwayGoals=df$FTAG,
    teams=team.names
    ))
    }

    #Compute a function that find parameters that maximise the log-likelihood
    DCoptimFn <- function(params, DCm){

    home.p <- params[1]
    rho.p <- params[2]

    nteams <- length(Dcm$teams)
    attack.p <- matrix(params[3:(nteams+1)], ncol=1)
    defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

    lambda <- exp(DCm$HomeTeamDMa %*% attack.p + DCm$AwayTeamDMd %*% defence.p + home.p)
    mu <- exp(DCm$AwayTeamDMa %*% attack.p + DCm$HomeTeamDMd %*% defence.p)

    return(DClogLik(y1=DCm$HomeGoals, y2=DCm$AwayGoals, lambda, mu, rho.p) * -1)
    }

  34. My code is a combination of parts 1 and 4, but I seem to be getting NULL for both lambda and mu. Could you please check why:

    #Implement the function ‘tau’, which is dependent on the parameter Rho.
    tau <- Vectorize(function(xx, yy, lambda, mu, rho){
    if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
    } else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
    } else if (xx == 1 & yy == 0){return(1 + (mu*rho))
    } else if (xx == 1 & yy == 1){return(1 - rho)
    } else {return(1)}
    })

    #Compute the log-likelihood function from the above function
    DClogLik <- function(y1, y2, lambda, mu, rho=0){
    #rho=0, independence
    #y1: home goals
    #y2: away goals
    sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
    }

    #Format the data similar to the data frame in S1314 and return the list
    DCmodelData <- function(df){
    team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))
    # attack, with sum-to-zero constraint
    ## home
    hm.a <- model.matrix(~ HomeTeam - 1, data=df)
    hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
    hm.a <- hm.a[,1:(length(team.names)-1)]

    # away
    am.a <- model.matrix(~ AwayTeam -1, data=df)
    am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
    am.a <- am.a[,1:(length(team.names)-1)]

    hm.d <- model.matrix(~ HomeTeam - 1, data=df)
    am.d <- model.matrix(~ AwayTeam -1, data=df)

    return(list(
    HomeTeamDMa=hm.a, HomeTeamDMd=hm.d,
    AwayTeamDMa=am.a, AwayTeamDMd=am.d,
    HomeGoals=df$FTHG,
    AwayGoals=df$FTAG,
    teams=team.names
    ))
    }

    #Compute a function that find parameters that maximise the log-likelihood
    DCoptimFn <- function(params, DCm){

    home.p <- params[1]
    rho.p <- params[2]

    nteams <- length(Dcm$teams)
    attack.p <- matrix(params[3:(nteams+1)], ncol=1)
    defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

    lambda <- exp(DCm$HomeTeamDMa %*% attack.p + DCm$AwayTeamDMd %*% defence.p + home.p)
    mu <- exp(DCm$AwayTeamDMa %*% attack.p + DCm$HomeTeamDMd %*% defence.p)

    return(DClogLik(y1=DCm$HomeGoals, y2=DCm$AwayGoals, lambda, mu, rho.p) * -1)
    }

    #Load the data and use the DCmodelData function
    data <- read.csv('S1112.csv')
    dcm <- DCmodelData(data)
    nteams <- length(dcm$teams)

    #Initial parameter estimates
    attack.params <- rep(0.01, times=nteams-1)
    defence.params <- rep(-0.08, times=nteams)
    home.param <- 0.06
    rho.init <- 0.03
    par.inits <- c(home.param, rho.init, attack.params, defence.params)

    #Give the parameters some informative names
    names(par.inits) <- c('HOME', 'RHO', paste('Attack', dcm$teams[1:(nteams-1)], sep='.'), paste('Defence', dcm$teams, sep='.'))

    #Use the auglag function to optimise the function
    res <- optim(par=par.inits, fn=DCoptimFn, DCm = dcm, method='BFGS')
    res$par

    #Run the code to see the results of the attacks and defense parameters for each team
    parameters <- res$par

    #compute Wolves attack parameter
    missing.attack <- sum(parameters[3:(nteams+1)]) * -1
    res$missing.attack

    #put it in the parameters vector
    parameters <- c(parameters[1:(nteams+1)], missing.attack, parameters[(nteams+2):length(parameters)])
    names(parameters)[nteams+2] <- paste('Attack.', dcm$teams[nteams], sep='')

    #increase attack by one
    parameters[3:(nteams+2)] <- parameters[3:(nteams+2)] + 1

    #decrease defence by one
    parameters[(nteams+3):length(parameters)] <- parameters[(nteams+3):length(parameters)] - 1

    # Expected Home goals
    lambda <- exp(parameters['HOME'] + res$par['Attack.Chelsea'] + parameters['Defence.Tottenham'])
    res$lambda

    # Expected Away goals
    mu <- exp(parameters['Attack.Tottenham'] + parameters['Defence.Chelsea'])
    res$mu

  35. Hello, I really liked your article, thanks for it. Tell me, what formula did you use to calculate RHO (I know that this is the Spearman coefficient), and what data did you use to calculate it? (I don’t know R or Python, so I don’t understand it in the code.) Is αi the average number of goals scored per match? Is βj the average number of goals conceded per match? Is γ the average home goals per match divided by the average away goals per match?

    • RHO in this case is not the correlation coefficient, but a particular parameter of the Dixon-Coles model. It controls how much the probabilities of the scorelines 1-1 and 0-0 are adjusted relative to 0-1 and 1-0. See the original paper for the details. There is no simple formula to get an estimate for it, so you have to estimate it using maximum likelihood, which is what the code does.

      • I looked at the original article and found this condition
        https://photos.app.goo.gl/VH4YhN6Tqd8spWy49

        how exactly to calculate it is not written there, so this is the way I found:

        I calculated p in this way.
        We take the data:
        m – the number of matches in the English Premier League over 3 seasons, which is 1140 matches

        x – the number of matches where the score was 1-0, 0-1 or 0-0; for example, let there be 300 matches

        It is Poisson distributed: https://photos.app.goo.gl/uQyQ8SVzK9mhhzz38

        After that, calculate the lambda for these Poisson matches using the maximum likelihood method:

        https://photos.app.goo.gl/WGkGE6Qtpnd4eTDC8

        As can be seen from the calculations, lambda is the ratio of the sum of the number of occurrences of the event x to the total number of matches m.

        At the end, to find p, I have to divide lambda by m (the number of matches in 3 seasons).

        Did I calculate correctly?

          • I don’t think this is correct. I am a bit confused about what you mean. In the maximum likelihood calculation, x_i is the count in observation i, and what you end up with is that the estimate for lambda is simply the average. So in that calculation x is not the number of occurrences of 0-0, 0-1, 1-0 and 1-1. I think the way to go is to assume known lambda and mu (the two parameters in the original paper), and then, in the iid setting, try to find the maximum likelihood estimate of p. If that works, you can try to find p in the case where lambda_i and mu_i are different for each match i.

  36. I like your articles; they’re rich and clear, better than any source I have found on the internet so far. I don’t use R for sure; I’m using Excel and trying to implement the Dixon and Coles model in it, since I noticed it has an optimization add-in (Excel Solver, GRG Nonlinear). I have already calculated mu and lambda using initial attack, defense, home advantage and rho parameters to enable the optimizer to find starting points, and created the log-likelihood function using VBA (including the tau function) for every match (EPL 2011/2012). But I have these questions: am I going to maximize the total log-likelihood? If so, does it mean I also have to add each team’s attack and defense parameters, home advantage and rho together for each match, to obtain a well organised table with team and parameter columns over which my optimizer will search? Or is there something I’m missing here? Thank you so much.

    • My experience with this is limited; I tried doing something similar in LibreOffice once, but it didn’t have any good optimization routines, so bear that in mind. You seem to be on the right track. For each match (row) you calculate mu and lambda (from the attack, defence and hfa parameters for the given teams, from a separate table). With mu and lambda, you then compute the log-likelihood for each match. You may also apply the DC adjustment. Then you sum the log-likelihoods over all games for the given parameters. Then you need to maximize the summed log-likelihood over the parameter table.

      • Thank you so much for your reply. Everything is working smoothly: total likelihood -1087.359295, home advantage 0.272891286 and rho -0.133663948. It seems we got the same outcomes here; the Excel Solver seems to be really nice. Now I’m looking forward to adding a time down-weighting factor. I guess I’m going to add its log to the log-likelihood of each match, right?
        Also, I tried separating home/away defense and attack parameters, and I ended up with total likelihood -1068.625693, home advantage 0.17576459 and rho -0.172846829. But I have this problem: in the original Dixon and Coles paper they seem to have only the average of the home attack parameters equal to 1, whereas above you’ve said that all attack parameters must average 1 (average of all attack parameters = 1). What’s the difference here?
