When it comes to Poisson regression models for football results, the 1997 paper *Modelling Association Football Scores and Inefficiencies in the Football Betting Market* (pdf) by Dixon and Coles is often mentioned. In this paper the authors describe an improvement of the independent goals model. The improvement consists of modeling a dependence between the probabilities for the number of goals less than 2 for both teams. They also improve the model by incorporating a time perspective, so that matches played a long time a go does not have as much influence on the parameter estimates.

The model by Dixon and Coles is not as easy to fit as the independent Poisson model I have described earlier. There is no built-in function in R that can estimate it’s parameters, and the authors provide little details about how to implement it. Mostly as an exercise, I have implemented the model in R, but without the time down-weighting scheme.

The estimating procedure uses a technique called maximum likelihood. This is perhaps the most commonly used method for estimating parameters in statistical models. The way it works is that you specify a way to calculate the likelihood of your data for a given set of parameters, and then you need to find the set of parameters that gives the highest possible likelihood of your data. The independent Poisson model is also fitted using a maximum likelihood method. The difference here is that the likelihood used by Dixon and Coles is non-standard.

The model is pretty much similar to other regression models I have discussed. Each team has an attack and a defense parameter, and from a function of these the expected number of goals for each team in a match is calculated. For the rest of this post I am going to assume you have read the paper. There is a link to it in the first paragraph.

The most obvious thing we have to do is to implement the function referred to by the greek letter Tau. This is the function that, dependent on the Rho parameter, computes the degree in which the probabilities for the low scoring goals changes.

tau <- Vectorize(function(xx, yy, lambda, mu, rho){ if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho)) } else if (xx == 0 & yy == 1){return(1 + (lambda*rho)) } else if (xx == 1 & yy == 0){return(1 + (mu*rho)) } else if (xx == 1 & yy == 1){return(1 - rho) } else {return(1)} })

We can now make a function for the likelihood of the data. A common trick when implementing likelihood functions is to use the log-likelihood instead. The reason is that when the probabilities for each data point for a given set of parameters are multiplied together, they will be too small for the computer to handle. When the probabilities are log-transformed you can instead just add them together.

What this function does is that it takes the vectors of mu (expected home goals) and lambda (expected away goals), Rho, and the vectors of observed home and away goals, and computes the log-likelihood for all the data.

DClogLik <- function(y1, y2, lambda, mu, rho=0){ #rho=0, independence #y1: home goals #y2: away goals sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu))) }

The team specific attack and defense parameters are not included in the log-likelihood function. Neither is the code that calculates the expected number of goals for each team in a match (lambda and mu). Before we can calculate these for each match, we need to do some data wrangling. Here is a function that takes a data.frame formated like the data from football-data.co.uk, and returns a list with design matrices and vectors with the match results.

DCmodelData <- function(df){ hm <- model.matrix(~ HomeTeam - 1, data=df, contrasts.arg=list(HomeTeam='contr.treatment')) am <- model.matrix(~ AwayTeam -1, data=df) team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam))) return(list( homeTeamDM=hm, awayTeamDM=am, homeGoals=df$FTHG, awayGoals=df$FTAG, teams=team.names )) }

Now we create a function that calculates the log-likelihod from a set of parameters and the data we have. First it calculates the values for lambda and mu for each match, then it passes these and the number of goals scored in each match to the log-likelihood function.

This function needs to be written in such a way that it can be used by another function that will find the parameters that maximizes the log-likelihood. First, all the parameters needs to be given to a single argument in the form of a vector (the *params* argument). Also, the log-likelihood is multiplied by -1, since the optimization function we are going to use only minimizes, but we want to maximize.

DCoptimFn <- function(params, DCm){ home.p <- params[1] rho.p <- params[2] nteams <- length(DCm$teams) attack.p <- matrix(params[3:(nteams+2)], ncol=1) defence.p <- matrix(params[(nteams+3):length(params)], ncol=1) lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p) mu <- exp(DCm$awayTeamDM %*% attack.p + DCm$homeTeamDM %*% defence.p) return( DClogLik(y1=DCm$homeGoals, y2=DCm$awayGoals, lambda, mu, rho.p) * -1 ) }

One more thing we need before we start optimizing is a function that helps the optimizer handle the constraint that all the attack parameters must sum to 1. If this constraint isn’t given, it will be impossible to find a unique set of parameters that maximizes the likelihood.

DCattackConstr <- function(params, DCm, ...){ nteams <- length(DCm$teams) attack.p <- matrix(params[3:(nteams+2)], ncol=1) return((sum(attack.p) / nteams) - 1) }

Now we are finally ready to find the parameters that maximizes the likelihood based on our data. First, load the data (in this case data from the 2011-12 premier league), and properly handle it with our *DCmodelData* function:

dta <- read.csv('FAPL1112.csv') dcm <- DCmodelData(dta)

Now we need to give a set of initial estimates of our parameters. It is not so important what specific values these are, but should preferably be in the same order of magnitude as what we think the estimated parameters should be. I set all attack parameters to 0.1 and all defense parameters to -0.8.

#initial parameter estimates attack.params <- rep(.01, times=nlevels(dta$HomeTeam)) defence.params <- rep(-0.08, times=nlevels(dta$HomeTeam)) home.param <- 0.06 rho.init <- 0.03 par.inits <- c(home.param, rho.init, attack.params, defence.params) #it is also usefull to give the parameters some informative names names(par.inits) <- c('HOME', 'RHO', paste('Attack', dcm$teams, sep='.'), paste('Defence', dcm$teams, sep='.'))

To optimize with equality constraints (all attack parameters must sum to 1) we can use the auglag function in the alabama package. This takes about 40 seconds to run on my laptop, much longer than the independent Poisson model fitted with the built in *glm* function. This is because the *auglag* function uses some general purpose algorithms that can work with a whole range of home-made functions, while the *glm* function is implemented with a specific set of models in mind.

library(alabama) res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

Voilà! Now the parameters can be found by the command *res$par*. In a follow-up post I will show how we can use the model to make prediction of match outcomes.

Team | Attack | Defence |
---|---|---|

Arsenal | 1.37 | -0.91 |

Aston Villa | 0.69 | -0.85 |

Blackburn | 0.94 | -0.47 |

Bolton | 0.92 | -0.48 |

Chelsea | 1.23 | -0.97 |

Everton | 0.94 | -1.15 |

Fulham | 0.93 | -0.89 |

Liverpool | 0.89 | -1.13 |

Man City | 1.56 | -1.43 |

Man United | 1.52 | -1.31 |

Newcastle | 1.10 | -0.88 |

Norwich | 1.02 | -0.62 |

QPR | 0.82 | -0.65 |

Stoke | 0.64 | -0.87 |

Sunderland | 0.86 | -0.99 |

Swansea | 0.85 | -0.89 |

Tottenham | 1.24 | -1.09 |

West Brom | 0.86 | -0.88 |

Wigan | 0.81 | -0.71 |

Wolves | 0.79 | -0.42 |

Home | 0.27 | |

Rho | -0.13 |

Really good to actually see someone showing how to do this, something there isn’t a ready example for around. Looking forwards to the next instalment.

could please show me how should I store the data

help with this error, Error in names(par.inits) <- c("HOME", "RHO", paste("Attack", dcm$teams, :

'names' attribute [44] must be the same length as the vector [40]

It’s hard to tell, but by the looks of it, you may have forgotten to run some of the code, or perhaps mixed the code posted here with the code posted in part 4.

Another possibility is that something went wrong in the DCmodelData function. Perhaps try this to create the attack.params and defense.params vectors: rep(.01, times=length(dcm$teams))

help me with this error:

> res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

Error in if (xx == 0 & yy == 0) { : missing value where TRUE/FALSE needed

Seems to have something to do with the tau-function. Do you have any missing values in one of the vectors with the number of goals?

That´s what I thought, but I copy the value as they are.

tau <- Vectorize(function(xx, yy, lambda, mu, rho){

if (xx == 0 & yy == 0){return(1 – (lambda*mu*rho))

} else if (xx == 0 & yy == 1){return(1 + (lambda*rho))

} else if (xx == 1 & yy == 0){return(1 + (mu*rho))

} else if (xx == 1 & yy == 1){return(1 – rho)

} else {return(1)}

})

I don´t know what´s happening.

The tau-function is of course fine, but are you sure you don’t have any missing values in your data? I think that is what is causing the error.

is it “xx” replaced to “y1” and “yy” replaced to “y2”?

It is just that I use different names for the arguments in the different functions that means the same. The code could have been better with respect to this.

R shows warning message

In log(tau(y1, y2, lambda, mu, rho)) : NaNs produced

Yes this tend to happen during the optimizing, and the reason is that the optimizer tries a combination of values that makes the tau function be negative. Taking the logarithm of a negative number will obviously not work. I am not sure how to fix it, but as long as the optimizer converges I don’t think it is a problem.

Is there any research on scoring process? I do appreciate if available.

http://wwwf.imperial.ac.uk/~ejm/M3S4/Problems/football.pdf

Hello.

first, i was impressed your post.

i have some question about params inits.

if you don’t mind, can you tell me the reason why you set up the value like as follow?? attack.params <- rep(.01, times=nlevels(dta$HomeTeam))

defence.params <- rep(-0.08, times=nlevels(dta$HomeTeam))

home.param <- 0.06

rho.init <- 0.03

regards,

I just tried out some values and went for some that seemed to work most of the time. They may not work on all data sets and if the optimization fails you should just try some other values.

Great blog!

I read the original paper by Dixon and Coles, and they claimed that the assumption of independence between scores is reasonable except for the scores 0-0, 1-0, 0-1 and 1-1, and that why they introduced the tua function in this way. I am not sure if that the case with any data when we assume independence or it is just with their data. How did they detect that the assumption is violated in the case when the score is less than 2? Because in my data, which is the last season of La liga (Spanish League) based on the standardized residuals of the chi-square test of independence there is no problem with independence assumption with scores less than 2. I would really like to know, is it a general idea? or it varies from data to another, and what was the case with your data?

Thank you, and again your blog is very interesting and helpful.

Thank you! Dixon and Coles used English data from 1992-95 (6500+ games) and their paper they describe how they inspected the independence assumption. They first calculated the expected number instances for each result under the independence assumption, then they divided this by the observed number of games with this result. They also computed the standard errors for this ratio. They present this analysis in table 2 in their paper. Then I think they just took a look at the table and saw that there were quite lot more draws for the low scoring games, and fewer 0-1 and 1-0 games.

I have compared the Dixon-Coles model against some other models here, and from those analyses I suspect that the model tend to overfit to the data a bit.

Great blog – i have a question:

why is your lambda

lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)

and not

lambda <- DCm$homeTeamDM %*% attack.p * DCm$awayTeamDM %*% defence.p * home.p ??

why do you exponentiate it? i thought lambda = alpha_i * beta_j * gamma as in Dixon Coles?

Thanks very much

Mathematically they are the same. I used the exponentiation since that is what is commonly done in standard Poisson regression. Although I haven’t really tested it, I think the exponentiation also will make the optimization easier, since you are guarantied not to have a negative lambda for any value of the attack and defense parameteres. This is not the case when you multiply the parameters together.

Thank you very much for your reply.

Can I just check that once you get your parameters, you should really log it, right?

Thanks

I think you exponentiate it… (rather than log it)..

can you confirm Jonas?

Thanks!

You can just have your parameters as is, but when you are doing predictions, you should exponentiate the linear predictors, as shown in part 2: http://opisthokonta.net/?p=913

Hi, this blog is amazing, but I’m a little stuck understanding one bit.

Would you mind explaining this line of code please (in the DCoptimFn function).

attack.p <- matrix(params[3:(nteams+2)], ncol=1)

Thanks.

Thank you! This line of code creates a 1-column matrix of the attack parameters. To make the DCoptimFn work with the optimizer it has to get all parameters in just one vector. When you have a lot of parameters, and they belong to different groups, you have to do a lot of indexing to get the right parameters from the ‘params’ argument. The first two elements of ‘params’ are the home and rho parameters and then follows all the attack parameters, so the indexing has to start at the third element. The number of attack parameters is the same as the number of teams, so the last attack parameter is then the nteam+2th element in ‘params’.

Thank you so much, that’s cleared a lot up for me, there’s just one more thing I’m still a little unsure about. What data is going into params, i.e how do we get the home & rho parameters and the attack parameters for each team?

Thanks!

You mean how you estimate them? That is taken care by the optimizer, which in this case is the auglag function from the alabama package. This function find the most likely parameters (the likelihhod for a given set of parameters is defined by the DCoptimFn function) for the data you give it, which is in this case passed via the DCmodeData function.

Thanks!!!

In DClogLik I think y1!( i.e factorial of y1) should be multiplied to dpois(y1,lambda) and similarly with y2 and dpois(y2, mu) because the formula 4.3 in the original paper does not have the 2 factorials in the denominator.

Am I missing something here?

Thanks.

You are right that they have removed the term log(1/x!) which could (should?) have been in equation 4.3. As they write, that formula is only proportional to the true log-likelihood. The true log likelihood would have the same form as the log of equation 4.2. The reason they can drop the log(1/x!) term is that is does not depend on the parameters. That is to say, dropping or keeping it does not alter the parameter estimates.

Okay! thanks I got it.

One more thing though. If I want to include more parameters like number of league titles won by the club,or money spent on transfers, those would go in the lambda and mu equations right?

Yes you can put other covariates in the model as well by adding them to the mu and lambda equations.

Awesome posts! One error I’m getting though:

> res res$par

Error: object ‘res’ not found

any chance you know why? thanks!

Yes, it’s probably because you have not run the line of code that creates the res object, or some error happened when you ran it.

Hi dear,

Your post is a life saver in my journey on sport analytics,

I have try to implemented the code your provided all went well except for this error i am getting when I run this line

#line code

res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

#result

Error in DCm$awayTeamDM %*% defence.p : non-conformable arguments

any suggestion on what is going on and how i can improve this,thanks for your time and help

There seem to be a mismatch between the number of defence parameters and the number of columns in the away team defence matrix. I suggest looking at how these two look, and at the code that crated these two.

Hi, is there anyway we can use the model for first half predictions?

I’d think so. You could try to fit the model to the results from the first half, or you could try to simply divide the lambda and mu parameters (in part 2) by two.

Hi,

As someone with minimal math skills and zero programming knowledge, where should I start if I want to eventually be able to make a model like this?

Should I learn R first? Or maybe take some course on statistics?

I have some experience building a Poisson-Elo based model in Excel (like the one on clubelo.com) but this seems to be on a whole other level.

I may be wrong, but it doesn’t seem like it would be possible to build a Dixon-Coles model in Excel.

I would suggest learning R first. Some of the maths/statistics will out of neccesity come by itself along the way. If you have implemented Elo-type models in excel, you should do fine. I think it would be possible to implement the DC model in excel. I alsmost managed to implement the ordinary Poisson model in OpenOffice, but it lacked optimization routines that I think Excel have. Since the DC model is just a tweaked Poisson model, it should be possible.

Hi, since football data does not have all leagues, I made my own csv that is formatted like this: http://prntscr.com/dd0dle.

For some reason, I am getting the following errors:Error in eval(expr, envir, enclos) : object ‘HomeTeam’ not found

Actually, I have fixed that. The spreadsheet can be found here:http://www.sharecsv.com/dl/7c37cd087c37fc79fab623d886aed4f6/DE1.csv. But the probabilities I am getting are all NA. Do you know why?

never mind, fixed it!

Cool!

Hi Jonas,

I had a tangential question concerning the Dixon-Coles paper.

How would you go about seeking to replicate the analysis undertaken in Table 2 on page 268? Calculating the ratio is easy to do – but how would you test in R whether or not the particular match outcomes observed in real life in the matrix were statistically different to the joint probability of independent home and away scoring functions?

I’m looking into doing the model for a non-European league and just interested to test rather than assume that only 0-0, 1-0 and 0-1 outcomes are the only ones needing adjustment.

I saw another paper (http://www.ylikerroin.com/file/Complete.pdf) which suggested adopting a poisson correction matrix be overlayed the estimated probability matrix for home and away poisson scoring processes to potentially correct for more than just 0-0, 0-1 and 1-0 issues.

I guess the standard thing to do would be a Fisher Exact test or Chi-Square test of independence. I am, however, not a great fan of Chi-sqaure tests, as I think they are a bit difficult to interpret. A different approach I have used is to fit the models and calculate AIC. The AIC sometimes indicate better fit for the DC model, but it is inconsistent with the direction the effect. Sometimes the effect is in the opposite direction of what is shown in the paper. I haven’t really tested the DC model properly, but I suspect there is some overfitting going on.

For other kinds of adjustments, I would suggest looking into the bivpois R package, and the accompanying paper.

Where in that paper are the correction matrix thing described?

Thanks for replying so quickly and for the links.

The correction method is mentioned in pp53-54 of the paper in my previous post. The author admits the method is ad hoc – and I haven’t gone through the process as yet to determine if the method is any good or not on the data I have.

The paper analysed the effectiveness of different models by summing the probabilities assigned to the correct game prediction – however I think RPS or RPSS (and comparing its effectiveness against a model of assigning 1/3 probabilities to W/D/W or aggregated bookmaker odds) would be better.

The league I am focusing on is the Australian A-League. It is different to the major European leagues in that there is no promotion/relegation (at this stage) and it is not uncommon for senior squads to change significantly from season to season due to the operation of a salary cap (with some exemptions) and there being no domestic transfer market (i.e. it is relatively easy to get out of a contract at one Australian A-League club and move to another even whilst under contract).

Thank you. I am not quite sure what to think about the “empirical correction” method. I think perhaps it could be viewed as a sort of regularization, where you adjust the probabilities from your model (presumably calculated from knowing who the teams are and the home field advantage, etc) towards the overall mean distribution. This could make your predictions less precise, but perhaps more realistic in the sense that your predictions don’t get too confident.

One thing to try would be to use the overall mean matrix (M) to adjust the matrix from a model (K) using a correction factor y (between 0 and 1), like this:

(1-y)*M + y*K

and try to tune the interpolation factor y for optimal prediction somehow. I am not so sure it will do better, but who knows?

I once tried to adjust the time-weighting scheme from the DC paper to incorporate an additional “past season” effect, but it did not give better predictions for the Premier League. Perhaps this could be more appropriate for the Australian league?

Hi, I’m just wondering why the defence parameters are negative here whilst in the original paper they are positive?

I do things a bit differently from the paper, but it is mathematically the same. In the paper they multiply the coefficients together to get the expected goals. Then you of course can’t have negative coefficients. Here I add the coefficients together and exponentiate the result, similar to what is done in the more common formulations of poisson regressions. Then you can have negative coefficients, since the exponentiation makes the end result positive.

Hi again, I was just wondering if you could whip up some code in order to do it in the way that Dixon and Coles originally did it per chance?

Maybe. I have made a new implementation that is a lot simpler and not relying on matrices/linear algebra. It is thus easier to modify and extend. This could easy be modified to use the multiplication operator instead of addition. Maybe.

Hi,

Can you explain how the probability matrix works? How can I use it to predict the chance of say 4 or more goals in match or maybe one team scoring 2 or more goals?

Each element in the matrix is the probability of that particular score. The sum of all probabilities in the matrix sum to (approximately) 1, so it is a probability distribution over all scores. To get the probabilities you want you can sum over all the scores that match your criterion. The probability of the home team scoring more than 2 goals is the sum of all rows from row 3 (remember that row 1 correspond to 0 goals) and downwards, for example.

sorry, i still don’t quite understand how it works. Could you maybe show the code for predicting 4 or more goals in the match and home team scoring 2 or more? Perhaps then I can use that to understand it better.

Heres a quick way to calculate the probability of 4 or less goals in total. More than four would be just 1-prb.

thanks!, also how would I be able to calculate the chance for say home team to score 2 or more goals?

Hi! Great article.

I have one question.

Which data are you using? I mean, which columns of the data that football-data.co.uk provides?

Thank you very much!

I use the HomeTeam, AwayTeam, FTHG and FTAG.

Thank you for your answer!

How do I set initial parameters? I mean, how can I set a reasonable initial parameters?

For this model you can use 0 for all parameters, or those that I used in the post. I have tried to use more informed starting values, for example using the logarithm of the average number of goals scored and conceded for each team. This usually gives faster estimation, but the results are the same.

I’m confused as to where in this code you are looping through all of the games. It seems as if the attack parameters aren’t being updated after each match during optimization.

I’m curious because I’m attempting to implement something similar in Python using scipy.optimize.minimize.

Thanks!

In the DCoptimFn function the expected goals (lambda and mu) for each game is calculated with matrix opperations, so there is no explicit loop construct in the code. In the DClogLik I use R’s vectorized operations. In R all strings, numerics and logicals (bool in Python) are 1D arrays, and all common operations (+, -, *, == etc) work elementwise on these arrays, so there is often no need to write explicit loops, like in this case.

Thank you for the reply!

The last problem I’m having is with my lambda and mu calculations. For some reason sometimes mu and/or lambda are negative, and attempting to do the log of a negative is not possible as it returns a math domain error.

I’m doing the same lambda and mu calculations in Python (with matrix operations).

Any help would be greatly appreciated!

That’s weird. Are you sure you haven’t forgotten to exponentiate them?

When calculating the attack and defence ratings does it take into account the strength of the opponent in each game?

eg. Scoring goals against a team with a good defence rating counts for more compared to scoring against a team with a weak defence.

Yes! The predicted number of goals scored by a team is a function of the attack strength of that team, and the defense strength of the opposition. All parameters are estimated simultaneously, to give the optimal fit to the data. The opposite of what you write is also true. Scoring goals against a team with good defense will not only improve the attack rating, but also reduce the oppositions defense rating. But this is done over all matches simultaneously, giving a sort of average.

Hi, firstly thanks for the great blog it has helped me a lot.

I had a question on using your above code and adding an allowance for time weighting. (This is more R related than theoretical but any help would be appreciated.) I have tried this by doing the following:

1.) adding a weights vector to the list created by the ‘DCmodelData’ function (with all weights set to 1 to start with) .

2.) Passing this vector to the ‘DClogLik’ function from the ‘DoptimFCn’ function.

3.) Multiplying each part the sum() in the ‘DClogLik’ by the weight.

Then running the optimiser as before which runs without error but only for 3 iterations and does not produce meaningful results. Do you have any idea why the optimiser would produce such a different solution given such a minor addition to the code?

Thanks.

Graeme

Thats strange. What convergence code do you get? The weights should all be positive. Are yours? Are you sure the weights are multiplied correctly? And have you looked at the other posts about the DC model I have written? There might be something in those.