The Dixon-Coles model for predicting football matches in R (part 3)

About a month ago Martin Eastwood of the pena.lt/y blog put up some slides from a talk he gave about predicting football results in R. He presented the independent Poisson regression model in detail, and how to implement it. He also briefly mentioned and showed the bivariate adjustments in the Dixon-Coles model. I was curious about how he had implemented it, since I had just finished my own implementation. In the comments he said that he used a two-stage approach, first estimating the attack and defense parameters using the independent Poisson model, and then estimating the rho parameter by itself.

This method may be less accurate than fitting the complete model, but it will probably be more accurate than the independent Poisson model. It is without a doubt faster and easier to implement.

We start by loading the data and then making a new data.frame that contains two rows per match, as described in my post about the independent Poisson model.

dta <- read.csv('FAPL1112.csv')

# Data formatted for the independent model
# Store in new variable, we need the data in original format later
dta.indep <- data.frame(Team=as.factor(c(as.character(dta$HomeTeam),
                            as.character(dta$AwayTeam))),
                            Opponent=as.factor(c(as.character(dta$AwayTeam),
                            as.character(dta$HomeTeam))),
                            Goals=c(dta$FTHG, dta$FTAG),
                            Home=c(rep(1, dim(dta)[1]), rep(0, dim(dta)[1])))

Now fit the model:

m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

Now that we have estimated the attack, defense and home parameters, we can use the built-in functions in R to calculate the expected home and away scores (lambda and mu).

To calculate lambda and mu, we use the fitted function. I organized the data so that all the rows with the goals scored by the home teams come before all the rows with the goals scored by the away teams. What's more, the match in the first row of the home-team part corresponds to the match in the first row of the away-team part, so it is easy to match up the corresponding expectations.

expected <- fitted(m)
home.expected <- expected[1:nrow(dta)]
away.expected <- expected[(nrow(dta)+1):(nrow(dta)*2)]

To estimate the rho parameter we can use the tau and DClogLik functions we defined in part 1. We just wrap them inside a function that we pass to the built-in optimizer in R:

DCoptimRhoFn <- function(par){
  rho <- par[1]
  DClogLik(dta$FTHG, dta$FTAG, home.expected, away.expected, rho)
}

res <- optim(par=c(0.1), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')

The optimization finishes in an instant. As before, we get the parameter values by looking at res$par. The estimated rho parameter is -0.126, which is reassuringly not that different from the -0.134 we got from the full model. This is also about the same value Justin Worrall gets at his sportshacker.net blog.

To make predictions we can reuse most of the code from part 2. The only substantial difference is how we calculate the expected goals, which is a bit simpler this time:

# Expected goals home
lambda <- predict(m, data.frame(Home=1, Team='Bolton', Opponent='Blackburn'), type='response')

# Expected goals away
mu <- predict(m, data.frame(Home=0, Team='Blackburn', Opponent='Bolton'), type='response')
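From here you can build the Dixon-Coles-adjusted probability matrix just like in part 2. As a minimal sketch, assuming the tau function from part 1 is loaded and taking rho from the optimizer result above:

maxgoal <- 6
rho <- res$par[1]

# Independent Poisson probability matrix (rows are home goals, columns are away goals),
# then the Dixon-Coles adjustment of the four low-scoring outcomes.
probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))
scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda, mu, rho), nrow=2)
probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix

# Home win, draw and away win probabilities.
HomeWinProbability <- sum(probability_matrix[lower.tri(probability_matrix)])
DrawProbability <- sum(diag(probability_matrix))
AwayWinProbability <- sum(probability_matrix[upper.tri(probability_matrix)])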

This two-stage approach is much faster and simpler. We don’t have to manually create the design matrices and use matrix algebra to calculate the expected scores. We also don’t have to write as much code to keep track of all the parameters. I haven’t really compared all the different models against each other, so I can’t say which one makes the best predictions, but my guess is that this two-stage approach gives results similar to the fully specified Dixon-Coles model.

70 thoughts on “The Dixon-Coles model for predicting football matches in R (part 3)”

  1. Great articles! I really enjoy your posts. Do you have any thoughts on how I would go about producing a matrix of expected goals by team for each possible opponent?

    • Just wrap the code to make the matrix inside a function that takes the result from the optimizer and the names of the two teams as arguments. You can do something like this:

      DCpredict <- function(res, homeTeam, awayTeam){
      
        # Make some strings to extract the correct variables from res.
        homeAttack <- paste('Attack', homeTeam, sep='.')
        homeDefence <- paste('Defence', homeTeam, sep='.')
        awayAttack <- paste('Attack', awayTeam, sep='.')
        awayDefence <- paste('Defence', awayTeam, sep='.')
        
        lambda <- exp(res$par['HOME'] + res$par[homeAttack] + res$par[awayDefence])
        mu <- exp(res$par[awayAttack] + res$par[homeDefence])
        
        # Paste code here that calculates the probability_matrix
        
        return(probability_matrix)
      }
      

      Then you can simply loop through all opponents for, say, Bolton. Suppose you have all the opponents in the all.opponents variable; then you can do something like this:

      for (opponent in all.opponents){
        DCpredict(res, 'Bolton', opponent)
      }
      

      To get all matrices with Bolton as the away team, just swap the two last arguments in the function.
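      If all you need is the expected goals rather than the full probability matrices, a rough sketch using the glm fit m from this post could look something like this (the diagonal, a team against itself, is of course meaningless):

      teams <- levels(dta.indep$Team)

      # Expected home goals for every Team (rows) against every Opponent (columns).
      expg.home <- outer(teams, teams, function(tm, opp)
        predict(m, data.frame(Home=1, Team=tm, Opponent=opp), type='response'))
      dimnames(expg.home) <- list(Team=teams, Opponent=teams)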

  2. Thank you for posting these articles.

    I have some questions about implementing the model as you have done here.

    If I understand this correctly…

    res <- optim(par=c(0.1), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')

    gives you the rho value which you can get by calling res$par[1]. This rho value is calculated by using the DCoptimRhoFn function which passes the initial 0.1 rho value to DClogLik. However the DClogLik function sets rho to equal 0 so the initial value of rho is never used.

    Also, you state res$par[1] is the estimated rho value but it isn't used anywhere in the model from what I can see.

    Also, after lambda and mu are calculated, am I correct in assuming to get the probabilities per goal, I can just do something like:

    home_goals_prob <- dpois(0:10,lambda)
    away_goals_prob <- dpois(0:10, mu)

    Finally, I get warning messages stating:
    In log(tau(y1, y2, lambda, mu, rho)) : NaNs produced

    • The warnings I think you can just ignore. I think they come from the optimizer trying some values that make the tau function return negative values (for which the logarithm is not defined).

      The DClogLik function does not set rho to 0, it just has that as the default value in case you don’t supply it. You are right that the rho parameter is not used in the model fitted by the glm function. That is why I have called this a two-stage approach: first fit the regular independent Poisson model, then use that fitted model as a basis for estimating the rho parameter. To calculate the probabilities properly, using the Dixon-Coles adjustment (involving rho), you have to use the code I posted in part 2.

      • I understand.

        I seem to have got it working now. The probability for a home win has decreased and the probability for a draw has increased when compared to the glm fitted model.

        Thanks for the clarification.

  3. I have a question about the GLM model.

    It is created using

    m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

    When I get a summary of the model, Arsenal aren't listed as a team and when I query for them, I get an error but the system is still able to produce predictions for them.

    Why is that? I would like to see the attack/defence strengths for Arsenal.

    • R automatically sets the first level of a categorical factor as the baseline or control. This makes all the other parameters interpretable only as the teams’ abilities relative to Arsenal. A more natural parameterization is to use the sum-to-zero constraint, which I explain in more detail in part 4. Which parameterization you use does not influence the predictions, only how the parameters are interpreted.

      Also, check out the comments on my first installment on the independent Poisson model, where I got the same question.

      • I see. So Arsenal are set to have a baseline attack/defence rating of 0?

        Also on a side note, I have implemented the model to make predictions with two different approaches.

        The first approach is the standard where I use all matches played in a league to predict a match between Team A and Team B.

        The second approach is to use just matches played by Team A and Team B to predict the outcome of when they both play each other.

        Both approaches produce different probabilities. Sometimes they are similar depending on the teams involved but sometimes they are vastly different.

        Now I assume they are different because of the calculated Home advantage using the glm function. With the first approach, the home advantage is calculated using the previous results of all teams but with the second approach the home advantage is calculated using just the previous results of Team A and B.

        Now would you say that the second approach should be more accurate? Surely the only results which matter for predicting the match between Team A and B are those of the two teams themselves?

        • It is not just the home field advantage that makes the probabilities different. The beauty of regression models is how the parameters are estimated by considering all the data. Two specific teams only play each other a few times a year, seldom more than four times. This of course gives too little information to make good estimates and predictions. By using all the data you exploit the fact that both Team A and Team B also play against Team C, Team D and so on. In technical terms this is called indirect comparisons. You can see how I use this in the post I just published. Also see this post over at 11tegen11.

          Perhaps I should write a post explaining in more detail how all this works.

          Edit: I guess you could say Arsenal’s parameters are 0. Equivalently you could say that the information about Arsenal is incorporated in the intercept (along with the overall average number of goals scored) and the home field advantage. If you add a -1 in your glm-formula, the intercept is removed, and you get the Arsenal attack parameter, but not the defense/opponent parameter.

          • Perhaps I wasn’t clear.

            What I meant by first approach is including results for all teams so when predicting the result of Team A vs Team B, you will use all the previous matches played by Team A, B, C, D…X, Y, Z.

            So Team C vs Team J would be included when predicting Team A vs Team B. This is the standard approach I have seen.

            The alternative I am suggesting is to only use past results which involve Team A and Team B. So we would have results of Team A vs Team E, Team A vs Team P and Team B vs Team Y, Team B vs Team U. We don’t care about Team N vs Team R or Team W vs Team K for example.

            Wouldn’t this work better? How would the result of Team U vs Team O influence Team A vs Team B? Surely we only care about the matches Team A and Team B have played against other teams?

          • OK, I think I understand better now. I have not really tested what you describe, but my guess is that the predictions will be poorer. Look at it this way: if both Team A and Team B play against both Team U and Team O, then those matches would be more informative the more you knew about the parameters of Team U and Team O. A match between Team U and Team O would almost certainly make those parameters more precise. And also any indirect comparison between the two could be informative.

            I am not saying this holds true in all circumstances. Some matches are not representative for a team, especially those played in certain competitions, or when the team has a lot of injured players. But still I think most league matches should be considered representative.

          • This is a response to your latest message to me.

            I understand what you’re saying and will try both approaches to see which works best.

            And again, just to confirm, a baseline attack/strength from the glm function would be equal to 0, correct?

          • After reading your posts, I looked into using the sum-to-zero constraint to get the attack and defence parameters for the baseline team (Arsenal in this case).

            model <- glm(Goals ~ Home + Team + Opponent - 1, family=poisson(link=log), data=yrdta, contrasts=list(Team="contr.sum"))

            Now this just returned the attack parameters, the defence was still missing.

            I then changed the constraint to be

            contrasts=list(Opponent="contr.sum")

            However the defence parameter is still not provided.

            Am I correct to think that it won't be possible to get a defence parameter? And if so then the next best thing is just to consider the attack/defence of the baseline team to be 0 if I require both strengths?

          • You do get a defense parameter for Arsenal. It is probably the one called Opponent1. What you don’t get is the defense parameter for the last level of the Opponent factor. I don’t know why R does this, it is really annoying. This is probably the team whose name comes last when the names are sorted alphabetically. If your model is saved in myfittedmodel you can figure out which team is last by looking at myfittedmodel$xlevels$Opponent. To get a list of all the defense/opponent parameters for each team you can try this code:

            OpponentCoeffs <- myfittedmodel$coefficients[grep('Opponent', names(myfittedmodel$coefficients))]
            OpponentCoeffs <- append(OpponentCoeffs, sum(OpponentCoeffs) * -1)
            names(OpponentCoeffs) <-  myfittedmodel$xlevels$Opponent
            
  4. First off, love your very informative and interesting posts.
    I’ve been effectively replicating your approach with the Python Statsmodels package and am hung up on a problem with the output. The model summary shows the same parameters for an individual team and its corresponding opponent, with the home parameter being zero. I’ve replicated the same approach in R and am getting the same issue. It seems like a data setup problem, but I am struggling to find it.

    Thanks
    Martin

    • It is hard to tell what the problem is, but are you sure you have formatted the data and the “home” variable like I describe here?

  5. Great work,

    I implemented both of your methods on the last season of La Liga and I got strange estimates of rho. First I used the long method to estimate rho and got rho = 0.009, which is very small and does not affect the probabilities, so I get pretty much the same probabilities as from the independent Poisson model. Then I implemented the two-stage estimate and got rho = -0.04! A significant difference, isn’t it?

    However, I returned to the original paper and found that “the parameters will be estimable only if there is information from matches between teams of different divisions” (page 271, second paragraph), so I added the second division and ran both codes again and got -0.08 in both.

    Do you think that’s the reason why it didn’t work at first?

    • I don’t think that is such a big difference. I got more different values between different seasons of the English Premier League. Have you checked the standard error? I think what they say about estimable parameters refers to the defense and attack parameters when you use data from different leagues and divisions. If you have two groups of teams where those in one group never play against any in the other, it will be impossible to estimate the relative differences between the teams in the two groups.

  6. One of the best articles I have ever read.

    One question: I’m testing the Brazilian Serie A, and when I generate the prob.matrix, the probabilities don’t sum to 100%. Do you know why?

    • It is difficult to say, how much difference is there? Try to set the maxgoal variable to 10 or something.

        • I guess it is possible, in principle at least. But I am not so sure it is a good idea; PHP is not really made for this kind of thing, I think. But what do I know? Perhaps there are some good PHP libraries out there for doing statistics.

  7. I’ve just come across this excellent series of posts and really hope you’re still reading the post comments!

    You mention that the following line gives the attack and defense figures for each team:

    m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

    Could you clarify exactly how I can extract the attack/defense figures from that, so I can ultimately list each team and their individual attack/defense strengths?

    I read your discussion with Ian above but as a newbie to R, I'm still not really clear on how to do this I'm afraid.

    Thanks!

    • You can use the summary function to take a look at the parameters and other model diagnostics:

      summary(m)

      If you want just the coefficients you can use the coef function:

      coef(m)

      Remember that the model uses one of the teams as the reference (usually the first team when sorted alphabetically), so this team does not have a separate parameter. The abilities of this team are absorbed into the intercept, so you should add the intercept to the other attack and defense parameters, and use the intercept as the parameter for the missing team.
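      For example, something like this should line the coefficients up by team (here the reference team’s defence parameter is implicitly 0, and the intercept serves as its attack parameter):

      cc <- coef(m)

      # Attack: intercept for the reference team, intercept + Team coefficient for the rest.
      attack <- c(cc['(Intercept)'], cc['(Intercept)'] + cc[grep('^Team', names(cc))])
      names(attack) <- m$xlevels$Team

      # Defence: 0 for the reference team, the Opponent coefficients for the rest.
      defence <- c(0, cc[grep('^Opponent', names(cc))])
      names(defence) <- m$xlevels$Opponent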

      • Great thanks, that works.

        Have you done much research into the accuracy of this model’s predictions?

        I’m extracting the probabilities of each scoreline like so:

        cs <- dpois(0:10, lambda) %o% dpois(0:10, mu)

        Then dividing 100 by the probability to obtain decimal odds of each scoreline.

        If I compare the 0-0 scoreline with the Correct Score market on Betfair, there is a fairly large discrepancy in most cases, for example today's Swansea vs Liverpool match on Betfair is offered at 25, whereas this model predicts 20.75.

        As you may know, the Betfair markets are generally very efficient so it's hard to believe the market has it so "wrong".

        Using countrywide data rather than just one league seems to improve things slightly and definitely helps on matches where one of the teams has recently been promoted / relegated (which makes sense), but there's still more of a discrepancy than I would have thought.

        I tried using 5 years’ worth of data rather than just the last season’s worth but, confusingly, the results came out exactly the same, which I cannot explain. I would have thought a larger sample would produce at least slightly different results!

        • Remember that this model only relies on historical results to make predictions, nothing more. So there is a lot of information about injuries, change of manager, form etc that you miss out on that the betfair numbers probably reflect.

          Using more data from different leagues/divisions will generally improve things, but adding more historical data will probably not. Results from 5 years ago will probably not be that informative for predicting the result of the next match. You can try to down-weight old results like I describe here:
          http://opisthokonta.net/?p=1013

          • Brilliant, I had been wondering how to weight the more recent results.

            I’ve implemented that and will see what effect it has this week, thanks!

          • Just one more question:

            Given an already existing poisson distribution showing the probability of various scorelines, is it possible to work backwards and get the expected goals for each team?

            So at the moment I have lambda and mu (home and away expected goals respectively) and I use them to create a poisson distribution showing each scoreline:

            cs <- dpois(0:10, lambda) %o% dpois(0:10, mu)

            Is it possible to do the reverse and calculate lambda and mu from an existing poisson distribution of scorelines?

          • Yes, that is possible. First you sum all the rows or columns of the cs matrix to get the univariate marginal Poisson distribution. Recovering the Poisson parameter is then just a matter of finding the expected value of this univariate distribution, which is simply the sum over all goal counts of the probability of that count times the count: p(0 goals) * 0 + p(1 goal) * 1 + p(2 goals) * 2, etc.

            Here is a simple R command to recover lambda:

            lambda_recovered <- sum(rowSums(cs) * 0:10)
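            And the away expectation can be recovered the same way from the column sums:

            mu_recovered <- sum(colSums(cs) * 0:10)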

          • Hi opisthokonta,
            First of all, thank you for this awesome article.
            I have a question about the data employed in this model, and I want to know how I could integrate other data into this model (injuries, change of manager, form of players, weather, etc.).
            Any details or examples of code from your part would really help me. Thank you.

          • Yes, you could. If you want to add another variable you first have to preprocess it like in the first block of the code, which is just a matter of stacking the columns for the variable for the home and away teams. Then you can just add the variable to the model formula in the glm function.
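            As a purely hypothetical sketch, suppose you had HomeRest and AwayRest columns giving each team’s days of rest before the match. You could then do something like:

            # Stack the home and away values in the same order as the rest of dta.indep.
            dta.indep$Rest <- c(dta$HomeRest, dta$AwayRest)

            m2 <- glm(Goals ~ Home + Team + Opponent + Rest,
                      data=dta.indep, family=poisson())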

          • Hi Jonas,
            I’m currently analyzing your 4 articles on the Dixon-Coles model for predicting football matches in R. It’s awesome work, and I think implementing the Dixon-Coles model from A to Z helps many of us.
            I’m not a statistician, but I have some knowledge of maths and statistics. I’m trying to develop a web solution to implement your model and try to improve it. I want to ask you some questions, specifically concerning the data used and the parameters.
            1- Budgets in club football can fluctuate massively over short periods of time. A new chairman => new funds, etc., and that can significantly affect the results from one season to another. Therefore it seems ill-advised to me to take into account the data of too many seasons in the model (it’s theoretical, but your studies seem to confirm this).
            So when we add more historical data it will probably not improve the prediction. The last season should be the basis. Can you confirm this?
            2- Concerning the parameters you have used in the model (an attack and a defense parameter), you said to me two days ago that we can add other parameters such as injuries, change of manager, form, weather, etc.
            Do you have any idea how we can quantify these parameters to fit them into the model? I mean, how can we choose a set of initial estimates for these parameters?
            For example, if I have a transfer parameter which captures the new season’s transfers and their influence on the strength of both teams, how can I evaluate the initial estimates of this parameter so that it influences the model appropriately?
            3- To what extent do you think these parameters can affect the basic predictions (which rely only on historical results)?
            Thank you in advance for your responses.

          • Blindly adding more historical data would probably make the predictions worse, but if you have read the paper by Dixon and Coles you will see that they have a weighting scheme to make sure old data doesn’t get too much influence. I have implemented it here. There are other ways of incorporating the time element, for example using what is called dynamic regression models.

            Quantifying stuff like player or manager change is perhaps difficult in a model like this. Perhaps a fully Bayesian model would be more appropriate, where you could adjust the priors based on the uncertainty induced by a change of manager. Stuff like weather is easier: you could add a single variable for the amount of rain, for example, or a dummy variable to indicate sunny weather.

            It is hard to tell how this will influence the predictions; I have tried a few different things on this blog. Rain does not seem to have much impact (although I am far from 100% sure of this), but including a variable for a hectic match schedule does improve predictions.

          • Hi Jonas,

            How are you? Hope you are fine. I carefully read and analyzed your 4 articles. I have a specific question about all of this work. Do you think it’s possible to use this model (I refer to your work) as a “base” and implement Bayesian networks and neural networks with the “gRain” or “bnlearn” and “neuralnet” packages in R to make the predictions more accurate? More specifically, if we implement this model (Dixon-Coles with time-weighted Poisson regression and the parameters I’ve added, which can be easily quantified):
            1- How can I add to this model a parameter calculated on the basis of a Bayesian network? I mean, how can I quantify the parameter in THIS model and give it a value corresponding to the Bayesian network’s output?
            2- Suppose we have a first prediction result with all that. How could a neural network be implemented to help us get a more accurate prediction? Is the implemented model totally independent of a NN model, or can it run in parallel?
            Thank you in advance.

            Good night.

          • I am not all that familiar with neural networks, and I am not completely sure I understand your questions. So you are going to use a neural net to create some sort of measure of player form, injuries, weather conditions, etc., and then give these to the DC model? I guess it could work, but then why not just use the NN to calculate the predictions? It is possible to combine the predictions of several prediction routines, but I don’t know a lot of specifics on that.

          • Hi Jonas, I’m not going to use a neural net for that but a Bayesian net, to create some sort of measure of player form, injuries, tactical efficiency/deficiency, etc., and then give these to the DC model. My question is: how do you think I can integrate the final BN result into this DC model? How can I give a value to these parameters in the DC model?

          • If you use the two-step method in this post, where you first estimate the regular Poisson regression model, and then estimate the rho parameter, it is really simple to extend the model with more terms. You just add the new covariates calculated from the BN model to the dta.indep data frame, and then just add the covariates to the model formula in the glm function.

  8. Hi, loving the articles. Just wondering: in your parameters for the DC model, the attack variables (alpha) aren’t in the same format as in the paper. For example the alphas range from 1.1 to 1.9, which I understand is because of the varying leagues, but I don’t understand how you obtain these from the 0.0 we have for Arsenal, for example.
    Similarly, I’m unsure why the defence parameters are negative.
    Thanks

    • Thank you! I have done a couple of things differently from the paper. In the paper they multiply the parameters together, whereas I add them together and then exponentiate. The two approaches are mathematically equivalent and will give the same model performance, but the way I do it is the more common approach (as in a generalized linear model). The raw parameter values are therefore not directly comparable. If you exponentiate them you should get values closer to those in the paper. The other thing is that I have included an intercept, which the Arsenal attack performance is baked into. Again, when you use the model to predict you will get the same answer, since the intercept is always there and the other parameter values take that into account. Take a look at part 4 where I do things a bit differently, closer to the way they do it in the paper.
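      For example, exponentiating the fitted coefficients puts them on the multiplicative scale used in the paper:

      exp(coef(m))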

      • Thanks for the quick response even though the original post is over a year old!
        Would you be able to explain how I could incorporate their original method to obtain the alpha and beta values that they obtained? I am trying to prepare a comprehensive report on the DC and Maher models and use them to predict future seasons.
        For the Maher model it was simple to use rpois to simulate a season, but I am unsure how to do this for a non-independent model.
        Thanks again

          • I’ve managed to get a simulation for the probability matrix as you’ve given, but how would I then use that to simulate a game?
            In the Maher/Poisson model it’s simply 2 for loops with
            rpois(1, k2*params$Alpha[i]*params$Beta[j])
            But obviously now the scores aren’t independent, so I’m unsure how to use the probability matrix to simulate a result from those probabilities.

          • If you want to simulate scores, such as 2-1, you can try this code:

            # Quickly create a probmat 
            probmat <- dpois(0:15, 5.2) %o% dpois(0:15, 1.2)
            
            # Sample 500 coordinates from the probmat.
            nn <- prod(dim(probmat))
            ss <- sample(1:nn, size=500, replace=TRUE, prob=as.numeric(probmat))
            
            # Figure out which coordinates were sampled.
            # -1 because the matrix coordinates count from 
            # 1, but goals start at 0.
            aa <- arrayInd(ss, .dim=dim(probmat)) - 1
            colnames(aa) <- c('Row team', 'Column team')
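            One way to turn the sampled scores into simulated match outcomes could be something like this:

            # Tally simulated results (the row team plays the column team).
            outcome <- sign(aa[, 'Row team'] - aa[, 'Column team'])
            table(factor(outcome, levels=c(1, 0, -1),
                         labels=c('Row team win', 'Draw', 'Column team win')))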
            
  9. Could be a very simple question for you, but it had puzzled me for quite some time –
    Do you know how Dixon and Coles got their standard errors in Table 1 (empirical estimates of each score probability for joint and marginal probability functions)? They wrote “standard errors on the basis of an underlying multinomial model are shown in parentheses”.

  10. Hi, really enjoy the articles! Some really great stuff!

    I had a question: do you know how to create a matrix to join to the dta data frame showing the DC probability estimates for Home, Draw and Away for all games over the season? E.g. when Arsenal play Manchester United, what the model’s probability estimates would be for Home, Draw or Away, but for all games across the season?

    • Thanks! When I use the independent format (as in dta.indep, where each match is represented by two rows), I create an ID number for each match in dta that I then carry over to dta.indep. This makes it much easier to keep track of the matches and the related information (results, probabilities etc) across data formats.

      • Thanks for the speedy reply, I will try that!

        I also read in a previous article that you mentioned model evaluation by back-testing, predicting the probabilities of the next game using only previous game data. I wondered if you had any R code showing how to carry that out, as I can’t get my head around how I could do this?

        I was then hoping to try to build a graph like in Dixon and Coles, showing how teams’ defence and attack parameters change dynamically over time. Is this something you have also been able to do?

        • Take a look at this comment by Oliver, he gives some more details. There is no reason to give R code for this, since it is fairly straightforward to do, and you would probably have to tweak it a lot to suit your needs anyway.

  11. Sorry one more thing, in the response to Brad “Do you have any thoughts on how I would go about producing a matrix of expected goals by team for each possible opponent?”

    I have tried the code, to try and get all matrices with Bolton as the home team, but I am having no luck here either… I have a feeling it is probably to do with how I am producing all.opponents, but I am unsure what the correct solution would be?

    all.opponents <- c(paste('Attack', dcm$teams, sep='.'), paste('Defence', dcm$teams, sep='.') )

    DCpredict <- function(res, homeTeam, awayTeam){

    # Make some strings to extract the correct variables from res.
    homeAttack <- paste('Attack', homeTeam, sep='.')
    homeDefence <- paste('Defence', homeTeam, sep='.')
    awayAttack <- paste('Attack', awayTeam, sep='.')
    awayDefence <- paste('Defence', awayTeam, sep='.')

    lambda <- exp(res$par['HOME'] + res$par[homeAttack] + res$par[awayDefence])
    mu <- exp(res$par[awayAttack] + res$par[homeDefence])

    # Paste code here that calculates the probability_matrix

    maxgoal <- 6 # will be useful later
    probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

    scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda, mu, res$par['RHO']), nrow=2)
    probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix

    return(probability_matrix)
    }

    for (opponent in all.opponents){
    DCpredict(res, 'Bolton', opponent)
    }

    • I am not sure what error you get, but it may be the all.opponents vector that is the problem, as you say. Have you taken a look at it? You seem to be adding ‘Attack’ and ‘Defence’ to the names, which I don’t think is correct. You should be able to just use the team names in DCpredict.

  12. Hi,
    I want to add the time-weighting functionality (with xi) to this two-stage approach (life is much easier with this approach since I’m not using R).
    Unfortunately I’m struggling to achieve this, since your original method for adding time weights uses the DClogLik function (which is not used in this two-stage approach).
    Could you explain how to add this?

    • In my example you could calculate the weights and use them in the glm function via the weights argument. If your software supports GLMs with weighted observations it should not be difficult.

      Edit: I saw in your other comment that you use Matlab. I haven’t used Matlab myself, so I don’t know if it has support for weighted GLMs, but I would be surprised if it didn’t.
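      In R, a rough sketch of the weighting could look like this (assuming dta has a Date column; the xi value is just an example and should be tuned by back-testing):

      xi <- 0.0018
      days_since <- as.numeric(max(dta$Date) - dta$Date)
      match_weights <- exp(-xi * days_since)

      # Each match appears twice in dta.indep (all home rows first, then all away rows),
      # so the weights are repeated.
      m_weighted <- glm(Goals ~ Home + Team + Opponent, data=dta.indep,
                        family=poisson(), weights=rep(match_weights, 2))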

  13. Thanks a lot for your answer(s)! I did not realize it was that simple! I’m indeed using Matlab since I don’t have any experience with R (although I’m learning it now). The Matlab functions are quite similar though: glm = fitglm in Matlab, and it accepts weights as well.
    I finished the two-stage approach and will convert the real DC model to Matlab in a while.

  14. What probabilities did you get while using this method? In part two you mentioned 0.49 for a home win, 0.21 for a draw and 0.29 for an away win.
    Unfortunately I’m getting different results (although they are close), but I want to verify my model 🙂

    Home: 0.4650271
    Draw: 0.2337383
    Away: 0.2946523

    goal expectations:
    mu: 1.614786
    lambda: 2.051267
    Rho: -0.1259698

    • I don’t remember, and I don’t have the script on this computer. The results should be similar with both methods, as yours are. You also get the same estimate of rho that I got, which is reassuring. When you calculate the home, draw and away probabilities you have to remember to calculate the full probability matrix and derive the results from that, like in part 2. Otherwise the results will be just the same as the independent Poisson model.

  15. Hello, I ran the code and got an error at res:

    Error in optim(par = c(0.1), fn = DCoptimRhoFn, control = list(fnscale = -1), :
    initial value in ‘vmmin’ is not finite

    Could you help me please?

    • It looks like the DCoptimRhoFn function returns strange values. A few things you can try: set a different starting value for par, or try one of the other algorithms in optim by setting the method argument. Also try evaluating the DCoptimRhoFn function for some different values to check that it works properly (returns a finite number).
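      For example, something like this:

      # Check that the objective is finite for a few plausible rho values.
      sapply(c(-0.15, -0.1, -0.05, 0), DCoptimRhoFn)

      # Try a different starting value.
      res <- optim(par=c(-0.05), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')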

  16. Would you be able to provide code for fitting a model with half the season only, and then show how to use Dixon-Coles to predict the second half of the season? I can’t seem to get my head around this.

    • If you use the goalmodel package to fit your model, you can use the predict functions to make predictions. This is the easiest way to do what you want. Just split your data.frame with the results and fixtures in two. Use the first part to fit the model, and the second for making predictions.

      https://github.com/opisthokonta/goalmodel

      • I have gone through this, but I don’t understand what code to write to divide the data.frame in two. I would be really grateful if you could provide a short example showing how you use one half to fit the model, and where in the code you use the second half of the data frame. It would be really helpful.

        • If your data.frame is in a variable called df, you can use square brackets for subsetting, so for example:

          df_firsthalf <- df[1:180,]
          df_secondhalf <- df[181:380,]

          Or, if you have a date variable in your data frame that is correctly formatted as a Date type, you can do:

          df_first <- df[df$date <= '2018-12-31',]
          df_second <- df[df$date > '2018-12-31',]

          Since you have problems with such a basic task in R, I would recommend you go through an R for beginners tutorial or something.

  17. I fit this poisson model for s1819.

    df <- apply(s1819, 1, function(row){data.frame(team=c(row['HomeTeam'], row['AwayTeam']), opponent=c(row['AwayTeam'], row['HomeTeam']), goals=c(row['FTHG'], row['FTAG']), home=c(1, 0))})

    df <- do.call(rbind, df)

    head(df)

    df$goals <- as.numeric(df$goals)

    str(df)

    df_firsthalf <- head(df, nrow(df)/2)
    df_secondhalf <- tail(df, nrow(df)/2)

    # Fitting the Poisson model using the first half of the data
    model <- glm(goals~ home + team + opponent, family=poisson(link=log), data=df_firsthalf)
    summary(model)

    #expected number of goals for both teams
    predictHome <- predict(model, data.frame(home=1, team="Southampton", opponent="West Ham"), type="response")
    predictAway <- predict(model, data.frame(home=0, team="West Ham", opponent="Southampton"), type="response")

    print(predictHome)
    print(predictAway)

    Is that correct for the prediction part? Will it predict the second half with the way I have done it?

    • It’s hard to tell if the data.frame is split correctly, since I don’t know the ordering of the data in the data frame. Anyway, the prediction part is wrong, since you only give the predict function data frames with Southampton vs West Ham.
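      As a rough sketch, assuming df_secondhalf keeps the stacked format (two rows per match, with the home row first), you could get the expected goals for every second-half row in one call and pair up consecutive rows afterwards:

      expg_secondhalf <- predict(model, newdata=df_secondhalf, type='response')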
