About a moth ago Martin Eastwood of the pena.lt/y blog put up some slides from a talk he gave about predicting football results in R. He presented in detail the independent Poisson regression model, and how to implement it. He also briefly mentioned and showed the bivariate adjustments in the Dixon-Coles model. I was curious about how he had implemented it since I had just finished my own implementation. In the comments he said that he used a two-stage approach, first estimating the attack and defense parameters using the independent Poisson model, and then estimating the rho parameter by it self.

This method may be less accurate than fitting the complete model, but it will probably be more accurate than the independent Poisson model. It is without a doubt faster and easier to implement.

We start with loading the data, and then making a new *data.frame *that contains two rows per match, as described in my post about the independent Poisson model.

dta <- read.csv('FAPL1112.csv') # Data formated for the independent model # Store in new variable, we need the data in original format later dta.indep <- data.frame(Team=as.factor(c(as.character(dta$HomeTeam), as.character(dta$AwayTeam))), Opponent=as.factor(c(as.character(dta$AwayTeam), as.character(dta$HomeTeam))), Goals=c(dta$FTHG, dta$FTAG), Home=c(rep(1, dim(dta)[1]), rep(0, dim(dta)[1])))

Now fit the model:

m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

Since we now have estimated the attack, defense and home parameters we can use the built-in functions in R to calculate the expected home and away scores (lambda and mu).

To calculate lambda and mu, we use the *fitted* function. I organized the data so that all the rows with the goals scored by the home team comes before all the rows with the goals by the away teams. Whats more, the match in the first row in the home team part corresponds to the match in the first row in the away team part, so it is easy to get the corresponding expectations correct.

expected <- fitted(m) home.expected <- expected[1:nrow(dta)] away.expected <- expected[(nrow(dta)+1):(nrow(dta)*2)]

To estimate the rho parameter we can use the *tau* and *DClogLik* function we defined in part 1. We just wrap it inside a function we pass to the built in optimizer in R:

DCoptimRhoFn <- function(par){ rho <- par[1] DClogLik(dta$FTHG, dta$FTAG, home.expected, away.expected, rho) } res <- optim(par=c(0.1), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')

The optimization finishes in an instant. As before we get the parameter values by looking at res$par. The estimated rho parameter is -0.126, which is reassuringly not that different from the -0.134 we got from the full model. This is is also about the same values Justin Worrall gets at his sportshacker.net blog.

To make predictions we can reuse most of the code from part 2. The only substantial difference is how we calculate the expected goals, which is a bit simpler this time:

# Expected goals home lambda <- predict(m, data.frame(Home=1, Team='Bolton', Opponent='Blackburn'), type='response') # Expected goals away mu <- predict(m, data.frame(Home=0, Team='Blackburn', Opponent='Bolton'), type='response')

This two-stage approach is much faster and simpler. We don’t have to manually create the design matrices and use matrix algebra to calculate the expected scores. We also don’t have to write as much code to keep track of all the parameters. I haven’t really compared all the different models against each other, so I can’t say which one makes the best predictions, but my guess is that this two-stage approach gives results similar to the fully specified Dixon-Coles model.