About a moth ago Martin Eastwood of the pena.lt/y blog put up some slides from a talk he gave about predicting football results in R. He presented in detail the independent Poisson regression model, and how to implement it. He also briefly mentioned and showed the bivariate adjustments in the Dixon-Coles model. I was curious about how he had implemented it since I had just finished my own implementation. In the comments he said that he used a two-stage approach, first estimating the attack and defense parameters using the independent Poisson model, and then estimating the rho parameter by it self.

This method may be less accurate than fitting the complete model, but it will probably be more accurate than the independent Poisson model. It is without a doubt faster and easier to implement.

We start with loading the data, and then making a new *data.frame *that contains two rows per match, as described in my post about the independent Poisson model.

dta <- read.csv('FAPL1112.csv') # Data formated for the independent model # Store in new variable, we need the data in original format later dta.indep <- data.frame(Team=as.factor(c(as.character(dta$HomeTeam), as.character(dta$AwayTeam))), Opponent=as.factor(c(as.character(dta$AwayTeam), as.character(dta$HomeTeam))), Goals=c(dta$FTHG, dta$FTAG), Home=c(rep(1, dim(dta)[1]), rep(0, dim(dta)[1])))

Now fit the model:

m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

Since we now have estimated the attack, defense and home parameters we can use the built-in functions in R to calculate the expected home and away scores (lambda and mu).

To calculate lambda and mu, we use the *fitted* function. I organized the data so that all the rows with the goals scored by the home team comes before all the rows with the goals by the away teams. Whats more, the match in the first row in the home team part corresponds to the match in the first row in the away team part, so it is easy to get the corresponding expectations correct.

expected <- fitted(m) home.expected <- expected[1:nrow(dta)] away.expected <- expected[(nrow(dta)+1):(nrow(dta)*2)]

To estimate the rho parameter we can use the *tau* and *DClogLik* function we defined in part 1. We just wrap it inside a function we pass to the built in optimizer in R:

DCoptimRhoFn <- function(par){ rho <- par[1] DClogLik(dta$FTHG, dta$FTAG, home.expected, away.expected, rho) } res <- optim(par=c(0.1), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')

The optimization finishes in an instant. As before we get the parameter values by looking at res$par. The estimated rho parameter is -0.126, which is reassuringly not that different from the -0.134 we got from the full model. This is is also about the same values Justin Worrall gets at his sportshacker.net blog.

To make predictions we can reuse most of the code from part 2. The only substantial difference is how we calculate the expected goals, which is a bit simpler this time:

# Expected goals home lambda <- predict(m, data.frame(Home=1, Team='Bolton', Opponent='Blackburn'), type='response') # Expected goals away mu <- predict(m, data.frame(Home=0, Team='Blackburn', Opponent='Bolton'), type='response')

This two-stage approach is much faster and simpler. We don’t have to manually create the design matrices and use matrix algebra to calculate the expected scores. We also don’t have to write as much code to keep track of all the parameters. I haven’t really compared all the different models against each other, so I can’t say which one makes the best predictions, but my guess is that this two-stage approach gives results similar to the fully specified Dixon-Coles model.

Great articles! I really enjoy your posts. Do you have any thoughts on how I would go about producing a matrix of expected goals by team for each possible opponent?

Just wrap the code to make the matrix inside a function that takes the result from the optimizer and the names of the two teams as arguments. You can do something like this:

Then you can simply loop trough all opponents for, say, Bolton. Suppose you have all opponents in the all.opponents variable, you can do something like this:

To get all matrices with Bolton as the away team, just swap the two last arguments in the function.

Thanks! I will try that out.

Thank you for posting these articles.

I have some questions about implementing the model as you have done here.

If I understand this correctly…

res <- optim(par=c(0.1), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')

gives you the rho value which you can get by calling res$par[1]. This rho value is calculated by using the DCoptimRhoFn function which passes the initial 0.1 rho value to DClogLik. However the DClogLik function sets rho to equal 0 so the initial value of rho is never used.

Also, you state res$par[1] is the estimated rho value but it isn't used anywhere in the model from what I can see.

Also, after lambda and mu are calculated, am I correct in assuming to get the probabilities per goal, I can just do something like:

home_goals_prob <- dpois(0:10,lambda)

away_goals_prob <- dpois(0:10, mu)

Finally, I get warning messages stating:

In log(tau(y1, y2, lambda, mu, rho)) : NaNs produced

The warnings I think you can just ignore. I think they come from the optimizer trying some values that makes the tau-function return negative values (for which the logarithm is not defined).

The DClogLik-function does not set rho to be 0, it just has that as the default value is case you don’t supply it. You are right that the rho parameter is not used in the model fitted by the glm-function. That is why I have called this a two stage approach: First fit the regular independent Poisson model, then use that fitted model as a basis for estimating the rho parameter. To calculate the probabilities properly, using the Dixon-Coles adjustment (involving the rho), you have to use the code I posted in part 2.

I understand.

I seem to have got it working now. The probability for a home win has decreased and the probability for a draw has increased when compared to the glm fitted model.

Thanks for the clarification.

I have a question about the GLM model.

It is created using

m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

When I get a summary of the model, Arsenal aren't listed as a team and when I query for them, I get an error but the system is still able to produce predictions for them.

Why is that? I would like to see the attack/defence strengths for Arsenal.

R automatically sets the first level of a categorical factor as the

baselineorcontrol. This actually makes all the other parameters interpretable only as the teams abilities relative to Arsenal. A more natural parameterization is to use the sum-to-zero constraint, which I explain in more details in part 4. What parameterization you use does not influence the predictions, only how the parameters are interpreted.Also, check out the comments on my first installment on the independent Poisson model, where I got the same question.

I see. So Arsenal are set to have a baseline attack/defence rating of 0?

Also on a side note, I have implemented the model to make predictions with two different approaches.

The first approach is the standard where I use all matches played in a league to predict a match between Team A and Team B.

The second approach is to use just matches played by Team A and Team B to predict the outcome of when they both play each other.

Both approaches produce different probabilities. Sometimes they are similar depending on the teams involved but sometimes they are vastly different.

Now I assume they are different because of the calculated Home advantage using the glm function. With the first approach, the home advantage is calculated using the previous results of all teams but with the second approach the home advantage is calculated using just the previous results of Team A and B.

Now would you say that the second approach should be more accurate? As surely the only results which matter for predicting the match between Team A and B is of those two teams?

It is not just the home field advantage that makes the probabilities different. The beauty of regression models is how the parameters are estimated by considering

allthe data. Two specific teams only play each other at most a few times a year, seldom more than four times. This of course gives too little information to make good estimates and predictions. By using all the data you exploit the fact that both Team A and Team B also plays against Team C, Team D and so on. In technical terms this is calledindirect comparisons. You can see how I use this in the post I just published. Also see this post over at 11tegen11.Perhaps I should write a post explaining in more detail how all this works.

Edit: I guess you could say Arsenal’s parameters are 0. Equivalently you could say that the information about Arsenal is incorporated in the intercept (along with the overall average number of goals scored) and the home field advantage. If you add a -1 in your glm-formula, the intercept is removed, and you get the Arsenal attack parameter, but not the defense/opponent parameter.

Perhaps I wasn’t clear.

What I meant by first approach is including results for all teams so when predicting the result of Team A vs Team B, you will use all the previous matches played by Team A, B, C, D…X, Y, Z.

So Team C vs Team J would be included when predicting Team A vs Team B. This is the standard approach I have seen.

The alternative I am suggesting is to only use past results which involve Team A and Team B. So we would have results of Team A vs Team E, Team A vs Team P and Team B vs Team Y, Team B vs Team U. We don’t care about Team N vs Team R or Team W vs Team K for example.

Wouldn’t this work better? How would the result of Team U vs Team O influence Team A vs Team B? Surely we only care about the matches Team A and Team B have played against other teams?

Ok, I think I understand better now. I have not really tested what you describe, but my guess is that the predictions will be poorer. Look at it this way. If both Team A and Team B plays against both Team U and Team O, then those matches would be more informative the more you knew about the parameters of Team U and Team O. A match between Team U and Team O would almost certainly make those parameters more precise. And also any indirect comparison between the two could be informative.

I am not saying this holds true in all circumstances. Some matches are not representative for a team, especially those played in certain competitions, or when the team has a lot of injured players. But still I think most league matches should be considered representative.

This is a response to your latest message to me.

I understand what you’re saying and will try both approaches to see which works best.

And again, just to confirm, a baseline attack/strength from the glm function would be equal to 0, correct?

After reading your posts, I looked to using the sum-to-zero constraint to get the attack and defence parameters for the baseline team (Arsenal in this case).

model <- glm(Goals ~ Home + Team + Opponent – 1, family=poisson(link=log), data=yrdta, contrasts=list(Team="contr.sum"))

Now this just returned the attack parameters, the defence was still missing.

I then changed the constraint to be

contrasts=list(Opponent="contr.sum")

However the defence parameter is still not provided.

Am I correct to think that it won't be possible to get a defence parameter? And if so then the next best thing is just to consider the attack/defence of the baseline team to be 0 if I require both strengths?

You do get a defense parameter for Arsenal. It is probably the one called Opponent1. What you don’t get is the defense parameter for the

lastlevel in the opponent factor. I don’t know why R does this, it is really annoying. This is probably the team with a name that comes last when the names are sorted alphabetically. If your model is saved in myfittedmodel you can figure out which team is last by looking at myfittedmode$xlevels$Opponent. To get a list of all the defense/opponent parameters for each team you can try this code:First off, love your very informative and interesting posts.

I’ve been effectively replicating your approach with the Python Statsmodels package and am hung up on a problem with the output. The model summary shows the same parameters for an individual team and its corresponding opponent with the home parameter being zero. I’ve replicated the same approach in R and am getting the same issue. Seems like a data setup problem but struggling to find it ?

Thanks

Martin

It is hard to tell what the problem is, but are you sure you have formatted the data and the “home” variable like I describe here?

Great work,

I implement your both methods on the last season of La Liga and I got strange estimates of the rho. First I use the long method to estimates rho and i get rho=0.009 which is very small and does not effect the probability which means I have the quite similar probabilities as the one from Poisson independence model. Then I implement the two-stage estimates and i got rho= -0.04!! significant difference, isn’t?

However, I returned to the original paper and I found that “the parameters will be estimable only if there is information from matches between teams of different divisions” page 271, second paragraph, so i add the second devision and run both codes again and get -0.08 in both.

Do you think that’s the reason why it did’t work first.

I don’t think that is such a big difference. I got more different values between different seasons of English Premier League. Have your checked the standard error? I think what they say about the estimable parameters refers to the defense and attack parameters when you use data from different leagues and divisions. If you have two groups of teams that where those in one group never play against any in the other, it will be impossible to estimate the relative differences between the teams in the two groups.

One of the best articles i ever read.

One question: i’m testing the brazilian serie A, and when i generate the prob.matrix, i can’t get 100%. Do you know why?

It is difficult to say, how much difference is there? Try to set the maxgoal variable to 10 or something.

It worked, thanks!!

I have one more question? Is it possible to translate those codes to php?

I guess it is possible, in principle at least. But I am not so sure it is a good idea, php is not really made to do this kind of things, I think. But what do I know? Perhaps there are some good php libraries out there for doing statistics.

I’m thinking in a way to put it into c#.

Thanks again, you are the best!

I’ve just come across this excellent series of posts and really hope you’re still reading the post comments!

You mention that the following line gives the attack and defense figures for each team:

m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

Could you clarify exactly how I can extract the attack/defense figures from that, so I can ultimately list each team and their individual attack/defense strengths?

I read your discussion with Ian above but as a newbie to R, I'm still not really clear on how to do this I'm afraid.

Thanks!

You can use the summary function to take a look at the parameters and other model diagnostics:

summary(m)

If you want just the coefficients you can use the coef function:

coef(m)

Remember that the model uses one of the teams as reference (usually the first team when sorted alphabetically), so this team does not have a separate parameter. The abilities for this team is absorbed in the intercept, so you should add the intercept to all the other defense and attack parameters, and use the intercept at the parameters for the missing team.

Great thanks, that works.

Have you done much research into the accuracy of this model’s predictions?

I’m extracting the probabilities of each scoreline like so:

cs <- dpois(0:10, lambda) %o% dpois(0:10, mu)

Then dividing 100 by the probability to obtain decimal odds of each scoreline.

If I compare the 0-0 scoreline with the Correct Score market on Betfair, there is a fairly large discrepancy in most cases, for example today's Swansea vs Liverpool match on Betfair is offered at 25, whereas this model predicts 20.75.

As you may know, the Betfair markets are generally very efficient so it's hard to believe the market has it so "wrong".

Using countrywide data rather than just one league seems to improve things slightly and definitely helps on matches where one of the teams has recently been promoted / relegated (which makes sense), but there's still more of a discrepancy than I would have thought.

I tried using 5 years' worth of data rather than just the last seasons' worth but, confusingly, the results came out exactly the same, which I cannot explain. I would have thought a larger sample would produce at least slightly different results!

Remember that this model only relies on historical results to make predictions, nothing more. So there is a lot of information about injuries, change of manager, form etc that you miss out on that the betfair numbers probably reflect.

Using more data from different leagues/divisions will generally improve things, but adding more historical data will probably not. Results from 5 years ago will probably not be that informative for predicting the result of the next match. You can try to down-weight old results like I describe here:

http://opisthokonta.net/?p=1013

Brilliant, I had been wondering how to weight the more recent results.

I’ve implemented that and will see what effect it has this week, thanks!

Just one more question:

Given an already existing poisson distribution showing the probability of various scorelines, is it possible to work backwards and get the expected goals for each team?

So at the moment I have lambda and mu (home and away expected goals respectively) and I use them to create a poisson distribution showing each scoreline:

cs <- dpois(0:10, lambda) %o% dpois(0:10, mu)

Is it possible to do the reverse and calculate lambda and mu from an existing poisson distribution of scorelines?

Yes that is possible. First you sum all the rows or columns of the cs matrix to get the univariate marginal Poisson distribution. To recover the Poisson parameter is just a matter of finding the expected value of this univariate distribution, which is simply just the sum over all the probabilities for a the number of goals times the number of goals: p(0 goals) * 0 + p(1 goal) * 1 + p(2 goals) * 2 etc.

Here is a simple R command to recover lambda:

lambda_recovered <- sum(rowSums(cs) * 0:10)

Superb, thanks for all your help!

Hi opisthokonta,

First of all thank u for this awsome article.

I have a question about the data employed in this model and I want to know how I could integrate other data in this model (injuries, change of manager, form of players, weather, etc…)

Any details or examples of code from ur part will really help me. Thank u

Yes you could. If you want to add another variable you first have to preprocess it like in the first bulk of the code, which is just a matter of stacking the columns for the variable of the home and away teams. The you can just add the variable to the model definition formula in the glm function.

Hi Jonas,

I’m currently analyzing your 4 articles on Dixon-Coles model for predicting football matches in R. It’s an awesome work and I think implementing the dixon-Coles model from A to Z help many of us.

I’m not statistician but I have some knowledge of maths and statistics. I’m trying to develop a web solution to implement your model and try to improve it. I want to ask you some questions specifically concerning the data used and the parameters.

1- Budgets in club football can fluctuate massively over short periods of time. A new chairman => New funds, etc. … and that’s can affect significatively the results from a season to onother… Therefore it would be ill-advised to me to take into account the data of too many seasons in the model (It’s theoretical but your studies seems to confirm this)

So when we add more historical data it will probably have no incidence on the prediction. The last season should be the basis. Can u confirm it?

2- Concerning the parameters which you have used in the model (an attack and a defense parameter), u’ve said to me two days ago that we can add other parameters such as injuries, change of manager, form, weather, etc. …

Do you Have any idea how can we quantify these parameters to adjust them to the model ? I mean how can we choose a set of initial estimates of these parameters?

For example, if I have a transfer parameter which references the new season transfers and their influence on the strenght of both teams. how can I evaluate the initial estimates of this parameter to influence appropriately the model?

3- To what extent do u think these parameters can have an incidence on the basic predictions (which only relies on historical results to make the predictions?)

Thank u in advance for your responses.

Blindly adding more historical data would probably make the predictions worse, but if you have read the paper by Dixon and Coles you will see that they have a weighting scheme to make sure old data doesn’t get too much influence. I have implemented it here. There are other ways of incorporating the time element, for example using what is called dynamic regression models.

Quantifying stuff like player or manager change is perhaps difficult in a model like this. Perhaps a fully Bayesian model would be more appropriate where you could adjust the priors based on the uncertainty induced by change of manage. Stuff like weather is easier, you could add a single variable for the amount of rain, for example, or a dummy variable to indicate sunny weather.

It is hard to tell how this will influence the predictions, I have tried a few different thing on this blog. Rain does not seem to have much impact (although I am far from 100% sure on this), but including a variable for hectic match schedule does improve predictions.

Hi Jonas,

How are you? Hope fine. I carefully read and analyzed your 4 articles. I have a specific question about all of this work. Do you think it’s possible to use this model (I refer to your work) as a “base” and implement Bayes & neural networks with “gRain” or “bnlearn” & “neuralnet” package in R to make the predictions more accurate? More specifically, if we implement this model (Dixon-coles with time-weighted Poisson regression and parameters which I’ve added and which can be easely quantified).

1- how can I do to add in this model a parameter calculated on the basis of a bayesian network ? I mean how can I quantify the parameter in THIS model and give it a corresponding “value” to the bayesian return value?

2- Suppose we have a First result of Prediction with all that, How a Neural Network can be implemented to help us to have a more accurate prediction? Is the model implemented is totally independent of a NN model or it can run in parallel?

Thank u in advance.

Good night.

I am not all that familiar with neural networks, and I am not completely sure I understand you questions. So you are going to use a neural net to create some sort of measure of player form, injuries, weather conditions, etc and the give these to the DC model? I guess it could work, but then why not just use the NN to calculate the predictions? It is possible to combine the predictions of several prediction routines, but I don’t know a lot of specifics on that.

Hi Jonas, I’m not going to use a neural net for that but Bayesian net to create some sort of measure of player form, injuries, tactic efficiency/deficiency, etc and then give these to the DC model? My question is how do u think I can integrate the final BM result to this DC model? How Can I give a value to these parameters into the DC model.

If you use the two-step method in this post, where you first estimate the regular Poisson regression model, and then estimate the rho parameter, it is really simple to extend the model with more terms. You just add the new covariates calculated from the BN model to the dta.indep data frame, and then just add the covariates to the model formula in the glm function.

Hi,

there is a possibility to put it on stata language?

Thanks for your work.

Sorry, I am not familiar with stata.

Hi, loving the articles, just wondering in your parameters for the DC model the attack variables (alpha) aren’t in the same format as in the paper. For example the alpha range from 1.1 to 1.9, which I understand is because of the varying leagues, but I don’t understand how you obtain these from the 0.0 we have from Arsenal for example.

Similarly I’m unsure why the defence parameters are negative

Thanks

Thank you! I have done a couple of thing different from the paper. In the paper they multiply the parameters together, whereas I add them together and then exponentiate them. The two approaches are mathematically equivalent and will give the same model performance, but the way I do it is the more common approach (like in a generalized linear model). The raw parameter values are therefore not directly comparable. If you exponentiate them you should get values closer to those in the paper. The other thing is that I have included an intercept, which the Arsenal attack performance is baked into. Again, when you use the model to predict, you will get the same answer since the intercept is always there, and the other parameter values takes that into account. Take a look at part 4 where I do things a bit differently, closer to the way they do it in the paper.

Thanks for the quick response even though the original post is over a year old!

Would you be able to explain how I would be able to incorperate their original method to obtain the alpha and beta values that they obtained as I am trying to prepare a comprehensive report on the DC and Maher models and use them to predict future seasons.

For the Maher model it was simple to use rpois to simulate a season but I am unsure how to do this for a non independent model.

Thanks again

You could re-implement the model more in line with what they do, but i think the easiest to get at least “close enough” would be to discard the intercept and exponentiate the coefficients.

If you want to directly simulate the goal distribution I posted some code here to do that.

I’ve managed to get a simulation for the probability matrix as you’ve given, but how would I then use that to simulate a game?

In the Maher/Poisson model it’s simply 2 for loops with

rpois(1, k2*params$Alpha[i]*params$Beta[j])

But obviously now the score isn’t independent, so I’m unsure how to use the probability matrix to simulate a result from those probabilities?

If you want to simulate scores, such as 2-1, you can try this code:

Could be a very simple question for you, but it had puzzled me for quite some time –

Do you know how Dixon and Coles got their standard errors in Table 1 (Empirical estimates of each score probability for joint and marginal probability functions)? They wrote “standard errors on the basis of an underlying multinomial model are show in parentheses”.

They probably used the inverse of the hessian matrix. You can see how you can do it in this answer at stackexchange.

Hi, really enjoy the articles! Some really great stuff!

I had a question if you knew how to create a matrix to join to the dta data frame to show the DC probability estimates for Home, Draw and Away in all games over the season. e.g When Arsenal play Manchester united, what the model probability estimates would be for Home, Draw or Away but for all games across the season?

Thanks! When I use the independent format (as in dta.indep, where each match is represented by two rows), I create an ID number for each match in dta that I then carry over to dta.indep. This makes it much easier to keep track of the matches and the related information (results, probabilities etc) across data formats.

Thanks for the speedy reply, I will try that!

I also read in a previous article you mentioned model estimation by back testing, to predict the probability results of the next game using only previous game data. I wondered if you had any R code of how to carry that out, as I can’t get my head around how I could do this?

I was then hoping of trying to build a graph like in Dixon and Coles showing how teams defence and attack parameters change dynamically over time, if this is something you have also been able to do?

Take a look at this comment by Oliver, he gives some more details. There is no reason to give R code for this, since it is fairly straightforward to do, and you would probably have to tweak it a lot to suit your needs anyway.

Sorry one more thing, in the response to Brad “Do you have any thoughts on how I would go about producing a matrix of expected goals by team for each possible opponent?”

I have tried the code, to try and get all matrices with Bolton as the home team, but I am having no luck here too… I have a feeling it is probably to do with how I am producing all.opponents, but I am unsure what the correct solution would be ?

all.opponents <- c(paste('Attack', dcm$teams, sep='.'), paste('Defence', dcm$teams, sep='.') )

DCpredict <- function(res, homeTeam, awayTeam){

# Make some strings to extract the correct variables from res.

homeAttack <- paste('Attack', homeTeam, sep='.')

homeDefence <- paste('Defence', homeTeam, sep='.')

awayAttack <- paste('Attack', awayTeam, sep='.')

awayDefence <- paste('Defence', awayTeam, sep='.')

lambda <- exp(res$par['HOME'] + res$par[homeAttack] + res$par[awayDefence])

mu <- exp(res$par[awayAttack] + res$par[homeDefence])

# Paste code here that calculates the probability_matrix

maxgoal <- 6 # will be useful later

probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda, mu, res$par['RHO']), nrow=2)

probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix

return(probability_matrix)

}

for (opponent in all.opponents){

DCpredict(res, 'Bolton', opponent)

}

I am not sure what error you get, but it may be the all.opponent vector that is the problem, as you say. Have you taken a look at it? You seem to be adding attack and Defence to the names, which I don’t think is correct. You should be able to just use the team name in DCpredict.

I solved it now thanks!

Hi,

I want to add the time weight functionality (with Xi) in this two stage approach (life is much easier while using this approach since I’m not using R).

Unfortunately I’m struggling to archeive this since your original method to add a time weight uses the DClogLik functionality (which is not used in this two stage approach).

Could you explain how to add this?

In my example you could calculate the weights, and use them in the glm function via the weights argument. If your software support GLM’s with weighted observations it should not be difficult.

Edit: I saw in you other comment that you use matlab. I haven’t used matlab myself, so I don’t know if matlab has support for weighted GLMs, but I would be surprised if they didn’t.

Thanks a lot for for your answer(s)! I did not realize it was that simple! I’m indeed using Matlab since I don’t have any experience with R (although I’m learning it know). Matlab fuctions however are quite similar. glm = fitglm in Matlab and does contain the weigh function as well.

I finished the two stage approach and will convert the real DC model to Matlab in a while.