# The Dixon-Coles model, part 4: A trick to speed up estimation

In the previous installments in this series on implementing the Dixon-Coles model I complained a bit about the time it took to estimate the parameters. In the original implementation in part 1 it took about 40 seconds. Now, 40 seconds is not much to complain about; there are a whole lot of other models and algorithms that take much, much longer to fit (for my master’s I had some computations that took several months). Still, I wanted to make a few improvements.

The approach I described in part 3 is quite acceptable, I think, especially since it takes less than a second to fit the model. But still, I wanted to make some improvements to my original implementation.

There are several reasons for the estimation procedure being slow. I used a general purpose optimizer instead of a tailor-made algorithm, and I didn’t provide the optimizer with a function for the derivative of the model likelihood function, nor for that of the function defining the constraint. This means that the optimizer has to estimate the derivatives by doing a lot of evaluations of the two functions with slight changes in the parameters. The most important speed bump, however, is probably how I implemented the constraint that the average of the attack parameters should equal 1.

The alabama package I used relied on a technique called Lagrange multipliers, which is a very general method for constrained optimization. Instead of relying on general constrained optimization procedures, there is a trick commonly used in linear models with sum-to-zero constrained categorical parameters that we also can use here.

There has been some discussion and confusion in the comments about how categorical variables are coded and how R presents the results of the glm function. A thorough discussion of this is best left for another time, but let me explain how the sum-to-zero constraint is implemented in linear models. We will fit the model with this constraint and then make some adjustments later on to get the correct average-is-one constraint.

The sum-to-zero constraint basically says that the sum of all the parameters for a categorical variable must equal zero:

$$\sum_{i=1} \theta_i = 0$$

If we for example have three levels, we can write out the equation like this:

$$\theta_1 + \theta_2 + \theta_3 = 0$$

If we subtract $$\theta_3$$ and multiply both sides of the equation by minus 1 we get

$$- \theta_1 - \theta_2 = \theta_3$$

Notice how we can write one of the parameters as a simple linear function of the other parameters. We can use this result to construct the design matrix for the categorical variable, incorporating the sum-to-zero constraint (exactly which parameter or level we choose to be a function of the others doesn’t matter, the end result does not differ). Suppose we have the following observations of a three-level categorical variable:

$$\begin{bmatrix} A & A & B & B & C & C \end{bmatrix}^T$$

We can then construct the following design matrix:

$$\begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ -1 & -1 \\ -1 & -1 \end{bmatrix}$$

Notice that we only need two columns (i.e. two variables) to encode the three levels. Since the last parameter is a function of the two others, it is redundant. Also notice how the observations in the two last rows, corresponding to the $$C$$ observations, will influence the estimation of all the other parameters for this variable. When the two parameters are estimated, the last parameter can be obtained using the result from above relating the last parameter to all the others.
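Here is a minimal sketch of how the design matrix above could be built in R. The variable names (obs, dm, dm.builtin) and the two parameter values are made up for illustration. The manual construction mirrors the approach used later in this post, and the comparison with R's built-in contr.sum contrasts is only there as a sanity check.

```r
# Observations of a three-level categorical variable, as in the text
df <- data.frame(obs = factor(c('A', 'A', 'B', 'B', 'C', 'C')))

# Manual construction: start from one indicator column per level ...
dm <- model.matrix(~ obs - 1, data = df)
# ... set the rows belonging to the last level to -1, and drop its column
dm[df$obs == 'C', ] <- -1
dm <- dm[, 1:2]

# R's built-in sum-to-zero contrasts give the same two columns
# (the first column of that matrix is the intercept, which we drop)
dm.builtin <- model.matrix(~ obs, data = df,
                           contrasts.arg = list(obs = 'contr.sum'))[, -1]

# Given two (made-up) estimates, the third parameter is minus their sum
theta <- c(0.3, -0.1)
theta3 <- -sum(theta)
```

The point of the check is only that the hand-made matrix and the contr.sum matrix contain the same numbers, so either route should give the same fit.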

In the Dixon-Coles paper they used the constraint that the average of the attack parameters should be 1. This is not quite the same as the sum-to-zero constraint, but for prediction, it does not matter exactly which constraint we use. Anyway, I will explain later how we can fix this.
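To see why the choice of constraint should not matter for prediction, it may help to write out the expected number of home goals in the log-linear form used in this series. The symbols are my own shorthand and do not appear in the code: $$\gamma$$ is the home advantage, $$\alpha_i$$ the attack parameter of the home team, $$\beta_j$$ the defence parameter of the away team, and $$c$$ any constant.

$$\lambda = \exp(\gamma + \alpha_i + \beta_j) = \exp\big(\gamma + (\alpha_i + c) + (\beta_j - c)\big)$$

Adding $$c$$ to every attack parameter and subtracting it from every defence parameter leaves every $$\lambda$$ (and, by the same argument, every $$\mu$$) unchanged. A mean-is-zero and a mean-is-one constraint on the attack parameters therefore describe the same fitted model, which is why the sum-to-zero trick can be used for the fitting.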

To use this trick in the Dixon-Coles implementation we need to make the following changes to our code from part 1. Obviously the first thing we need to change is how the design matrices in the DCmodelData function are computed. We need four matrices now, since the number of parameters estimated directly is different for the attack and defense parameters. Notice how I chose the team that appears last in the team.names vector. The teams get sorted alphabetically, so for the 2011-12 Premier League data this is Wolves.

DCmodelData <- function(df){

team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))

# attack, with sum-to-zero constraint
## home
hm.a <- model.matrix(~ HomeTeam - 1, data=df)
hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
hm.a <- hm.a[,1:(length(team.names)-1)]

## away
am.a <- model.matrix(~ AwayTeam -1, data=df)
am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
am.a <- am.a[,1:(length(team.names)-1)]

# defence, same as before
hm.d <- model.matrix(~ HomeTeam - 1, data=df)
am.d <- model.matrix(~ AwayTeam -1, data=df)

return(list(homeTeamDMa=hm.a, homeTeamDMd=hm.d,
awayTeamDMa=am.a, awayTeamDMd=am.d,
homeGoals=df$FTHG, awayGoals=df$FTAG,
teams=team.names))
}


Some changes to the DCoptimFn function are also needed, so it properly handles the changes we made to the design matrices.

# I don't bother showing the rest of the function
nteams <- length(DCm$teams)
attack.p <- matrix(params[3:(nteams+1)], ncol=1) #one column less
defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

# need to multiply with the correct matrices
lambda <- exp(DCm$homeTeamDMa %*% attack.p + DCm$awayTeamDMd %*% defence.p + home.p)
mu <- exp(DCm$awayTeamDMa %*% attack.p + DCm$homeTeamDMd %*% defence.p)

We also need to make the appropriate adjustments to the vectors with the initial parameter values, so that they have the correct lengths.

# dta is the data frame loaded with read.csv in part 1
dcm <- DCmodelData(dta)
nteams <- length(dcm$teams)

#initial parameter estimates
attack.params <- rep(.1, times=nteams-1) # one less parameter
defence.params <- rep(-0.8, times=nteams)
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)

#informative names
#skip the last team
names(par.inits) <- c('HOME', 'RHO',
paste('Attack', dcm$teams[1:(nteams-1)], sep='.'), paste('Defence', dcm$teams, sep='.'))


With these changes we can simply use the built-in optim function in R. There is no need for the DCattackConstr function anymore, or a third party package, since we built the constraint right into the design matrices.
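The idea of building the constraint into the parametrization can be illustrated with a toy problem that has nothing to do with football. This is only a sketch under made-up assumptions: the target values and the names objective, res.toy and theta.hat are mine. We look for three numbers that are as close as possible to the targets while summing to zero, and we only let optim see two free parameters.

```r
# Toy problem (not the football model): find three parameters that are as
# close as possible to these made-up targets, subject to summing to zero
target <- c(1, 2, 6)

# Only the first two parameters are free; the third is minus their sum,
# so the sum-to-zero constraint holds for every value optim tries
objective <- function(free) {
  theta <- c(free, -sum(free))
  sum((theta - target)^2)
}

res.toy <- optim(par = c(0, 0), fn = objective, method = 'BFGS')

# Recover the full parameter vector, including the implied third parameter
theta.hat <- c(res.toy$par, -sum(res.toy$par))
```

For this toy problem the constrained solution is known to be the targets minus their mean, so theta.hat can be checked against target - mean(target). The actual fit of the Dixon-Coles model follows the same pattern, just with many more parameters: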

res <- optim(par=par.inits, fn=DCoptimFn, DCm=dcm, method='BFGS')


This takes about 6-7 seconds on my laptop, a decent improvement on the 40 seconds it took before. If you take a look at the resulting parameter estimates in res$par you will see that the attack parameter for Wolves is missing. As I explained earlier, this parameter is easy to find. It is also easy to correct all the parameter estimates so that they become the same as if we had a mean-is-one constraint on the attack parameters. This is done by increasing the attack parameters by one, and decreasing the defense parameters by one. The reason it is that simple is that the sum-to-zero constraint is equivalent to a mean-is-zero constraint.

parameters <- res$par

#compute Wolves attack parameter
missing.attack <- sum(parameters[3:(nteams+1)]) * -1

#put it in the parameters vector
parameters <- c(parameters[1:(nteams+1)], missing.attack, parameters[(nteams+2):length(parameters)])
names(parameters)[nteams+2] <- paste('Attack.', dcm$teams[nteams], sep='')

#increase attack by one
parameters[3:(nteams+2)] <- parameters[3:(nteams+2)] + 1

#decrease defence by one
parameters[(nteams+3):length(parameters)] <- parameters[(nteams+3):length(parameters)] - 1

# The Dixon-Coles model for predicting football matches in R (part 3)

About a month ago Martin Eastwood of the pena.lt/y blog put up some slides from a talk he gave about predicting football results in R. He presented in detail the independent Poisson regression model, and how to implement it. He also briefly mentioned and showed the bivariate adjustments in the Dixon-Coles model. I was curious about how he had implemented it since I had just finished my own implementation. In the comments he said that he used a two-stage approach, first estimating the attack and defense parameters using the independent Poisson model, and then estimating the rho parameter by itself.

This method may be less accurate than fitting the complete model, but it will probably be more accurate than the independent Poisson model. It is without a doubt faster and easier to implement.

We start with loading the data, and then making a new data.frame that contains two rows per match, as described in my post about the independent Poisson model.

dta <- read.csv('FAPL1112.csv')

# Data formatted for the independent model
# Store in new variable, we need the data in original format later
dta.indep <- data.frame(Team=as.factor(c(as.character(dta$HomeTeam),
as.character(dta$AwayTeam))),
Opponent=as.factor(c(as.character(dta$AwayTeam),
as.character(dta$HomeTeam))),
Goals=c(dta$FTHG, dta$FTAG),
Home=c(rep(1, dim(dta)[1]), rep(0, dim(dta)[1])))

Now fit the model:

m <- glm(Goals ~ Home + Team + Opponent, data=dta.indep, family=poisson())

Since we now have estimated the attack, defense and home parameters we can use the built-in functions in R to calculate the expected home and away scores (lambda and mu).

To calculate lambda and mu, we use the fitted function. I organized the data so that all the rows with the goals scored by the home team come before all the rows with the goals by the away teams. What's more, the match in the first row in the home team part corresponds to the match in the first row in the away team part, so it is easy to get the corresponding expectations correct.

expected <- fitted(m)
home.expected <- expected[1:nrow(dta)]
away.expected <- expected[(nrow(dta)+1):(nrow(dta)*2)]

To estimate the rho parameter we can use the tau and DClogLik functions we defined in part 1. We just wrap them inside a function we pass to the built-in optimizer in R:

DCoptimRhoFn <- function(par){
rho <- par[1]
DClogLik(dta$FTHG, dta$FTAG, home.expected, away.expected, rho)
}

res <- optim(par=c(0.1), fn=DCoptimRhoFn, control=list(fnscale=-1), method='BFGS')

The optimization finishes in an instant. As before we get the parameter values by looking at res$par. The estimated rho parameter is -0.126, which is reassuringly not that different from the -0.134 we got from the full model. This is also about the same value Justin Worrall gets at his sportshacker.net blog.

To make predictions we can reuse most of the code from part 2. The only substantial difference is how we calculate the expected goals, which is a bit simpler this time:

# Expected goals home
lambda <- predict(m, data.frame(Home=1, Team='Bolton', Opponent='Blackburn'), type='response')

# Expected goals away
mu <- predict(m, data.frame(Home=0, Team='Blackburn', Opponent='Bolton'), type='response')


This two-stage approach is much faster and simpler. We don’t have to manually create the design matrices and use matrix algebra to calculate the expected scores. We also don’t have to write as much code to keep track of all the parameters. I haven’t really compared all the different models against each other, so I can’t say which one makes the best predictions, but my guess is that this two-stage approach gives results similar to the fully specified Dixon-Coles model.

# The Dixon-Coles model for predicting football matches in R (part 2)

Part 1 ended with running the optimizer function to estimate the parameters in the model:

library(alabama)
res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

# Take a look at the parameters
res$par

In part 1 I fitted the model to data from the 2011-12 Premier League season. Now it’s time to use the model to make a prediction. As an example I will predict the result of Bolton playing at home against Blackburn.

The first thing we need to do is to calculate the lambda and mu parameters, which are (approximately anyway) the expected numbers of goals scored by the home and away team. To do this we need to extract the correct parameters from the res$par vector. Recall that in the last post I gave the parameters informative names that consist of the team name prefixed by either Attack or Defence.
Also notice that I have to add the team parameters together and then exponentiate the result to get the correct answer.

Update: For some reason I got the idea that the team parameters should be multiplied together, instead of added together, but I have now fixed the code and the results.

# Expected goals home
lambda <- exp(res$par['HOME'] + res$par['Attack.Bolton'] + res$par['Defence.Blackburn'])

# Expected goals away
mu <- exp(res$par['Attack.Blackburn'] + res$par['Defence.Bolton'])

We get that Bolton is expected to score 2.07 goals and Blackburn is expected to score 1.59 goals.

Since the model assumes dependencies between the number of goals scored by the two teams, it is insufficient to just plug the lambda and mu parameters into R’s built-in Poisson function to get the probabilities for the number of goals scored by the two teams. We also need to incorporate the adjustment for the low-scoring results. One strategy to do this is to first create a matrix based on the simple independent Poisson distributions:

maxgoal <- 6 # will be useful later
probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

The number of home goals follows the vertical axis and the away goals follow the horizontal.

Now we can use the estimated dependency parameter rho to create a 2-by-2 matrix with scaling factors, that is then element-wise multiplied with the top left elements of the matrix calculated above:

Update: Thanks to Mike who pointed out a mistake in this code.

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda, mu, res$par['RHO']), nrow=2)
probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix


With this matrix it is easy to calculate the probabilities for the three match outcomes:

HomeWinProbability <- sum(probability_matrix[lower.tri(probability_matrix)])
DrawProbability <- sum(diag(probability_matrix))
AwayWinProbability <- sum(probability_matrix[upper.tri(probability_matrix)])


This gives a probability of 0.49 for home win, 0.21 for draw and 0.29 for away win.
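The steps above can be collected into a single function. What follows is only a sketch: outcome.probs is a hypothetical helper name of mine, the tau function is rewritten with ifelse so that the block is self-contained, and the input values are illustrative rather than taken from res$par.

```r
# Low-score adjustment, written so it works on vectors of scores
tau <- function(xx, yy, lambda, mu, rho) {
  ifelse(xx == 0 & yy == 0, 1 - lambda * mu * rho,
  ifelse(xx == 0 & yy == 1, 1 + lambda * rho,
  ifelse(xx == 1 & yy == 0, 1 + mu * rho,
  ifelse(xx == 1 & yy == 1, 1 - rho, 1))))
}

# Hypothetical helper: match outcome probabilities from lambda, mu and rho
outcome.probs <- function(lambda, mu, rho, maxgoal = 6) {
  # rows are home goals 0..maxgoal, columns are away goals 0..maxgoal
  pm <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))
  # adjust the four low-scoring results in the top left corner
  scaling <- matrix(tau(c(0, 1, 0, 1), c(0, 0, 1, 1), lambda, mu, rho), nrow = 2)
  pm[1:2, 1:2] <- pm[1:2, 1:2] * scaling
  c(home = sum(pm[lower.tri(pm)]),
    draw = sum(diag(pm)),
    away = sum(pm[upper.tri(pm)]))
}

# Illustrative values only, roughly in the range of the estimates above
probs <- outcome.probs(lambda = 2.07, mu = 1.59, rho = -0.13)
```

Note that the three probabilities will not sum to exactly one, because scores above maxgoal are cut off. Increasing maxgoal makes the missing probability mass smaller.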

Calculating the probabilities for the different goal differences is a bit trickier. The probabilities for each goal difference can be found by adding up the numbers on the diagonals, with the sum of the main diagonal being the probability of a draw.

awayG <- numeric(maxgoal)
for (gg in 2:maxgoal){
awayG[gg-1] <- sum(diag(probability_matrix[,gg:(maxgoal+1)]))
}
awayG[maxgoal] <- probability_matrix[1,(maxgoal+1)]

homeG <- numeric(maxgoal)
for (gg in 2:maxgoal){
homeG[gg-1] <- sum(diag(probability_matrix[gg:(maxgoal+1),]))
}
homeG[maxgoal] <- probability_matrix[(maxgoal+1),1]

goaldiffs <- c(rev(awayG), sum(diag(probability_matrix)), homeG)
names(goaldiffs) <- -maxgoal:maxgoal
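The same sums can probably be obtained more compactly by grouping the cells of the matrix by their goal difference. The sketch below is an alternative to the loops above, not the code I originally used. The lambda and mu values are made up and the Dixon-Coles adjustment is left out to keep the block self-contained.

```r
# Illustrative score matrix from two independent Poisson distributions
# (lambda and mu are made-up values, not estimates)
maxgoal <- 6
lambda <- 1.8
mu <- 1.2
probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

# Home goals minus away goals for every cell of the matrix
# (rows are home goals, columns are away goals)
goal.diff <- row(probability_matrix) - col(probability_matrix)

# Sum the probabilities within each goal difference
goaldiffs.alt <- tapply(probability_matrix, goal.diff, sum)
```

The result is named by goal difference, from -maxgoal to maxgoal, with negative values being away wins, just like the goaldiffs vector above.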


It is always nice to plot the probability distribution:

We can also compare this distribution with the distribution without the Dixon-Coles adjustment (i.e. the goals scored by the two teams are independent):

As expected, we see that the adjustment gives higher probability for draw, and lower probabilities for goal differences of one goal.

# The Dixon-Coles model for predicting football matches in R (part 1)

Please have a look at the improved code for this model that I have posted here.

When it comes to Poisson regression models for football results, the 1997 paper Modelling Association Football Scores and Inefficiencies in the Football Betting Market (pdf) by Dixon and Coles is often mentioned. In this paper the authors describe an improvement of the independent goals model. The improvement consists of modeling a dependence between the probabilities for the number of goals less than 2 for both teams. They also improve the model by incorporating a time perspective, so that matches played a long time ago do not have as much influence on the parameter estimates.

The model by Dixon and Coles is not as easy to fit as the independent Poisson model I have described earlier. There is no built-in function in R that can estimate its parameters, and the authors provide few details about how to implement it. Mostly as an exercise, I have implemented the model in R, but without the time down-weighting scheme.

The estimating procedure uses a technique called maximum likelihood. This is perhaps the most commonly used method for estimating parameters in statistical models. The way it works is that you specify a way to calculate the likelihood of your data for a given set of parameters, and then you need to find the set of parameters that gives the highest possible likelihood of your data. The independent Poisson model is also fitted using a maximum likelihood method. The difference here is that the likelihood used by Dixon and Coles is non-standard.
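As a small illustration of the idea, here is a sketch of maximum likelihood for the simplest possible case, a single Poisson mean. The goal counts are made up, and neg.loglik and fit are names I chose for this example.

```r
# Made-up goal counts, for illustration only
goals <- c(0, 1, 1, 2, 3, 0, 2, 1)

# Negative log-likelihood of a Poisson model with a single mean parameter
neg.loglik <- function(lambda) {
  -sum(dpois(goals, lambda, log = TRUE))
}

# One-dimensional search for the lambda that minimizes the negative
# log-likelihood, which is the same as maximizing the likelihood
fit <- optimize(neg.loglik, interval = c(0.01, 10))

# For this simple model the maximum likelihood estimate is the sample mean,
# so the two numbers should agree up to the tolerance of the search
c(estimate = fit$minimum, sample.mean = mean(goals))
```

The Dixon-Coles model works the same way in principle, only with many more parameters and a likelihood that has to be written by hand.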

The model is pretty much similar to other regression models I have discussed. Each team has an attack and a defense parameter, and from a function of these the expected number of goals for each team in a match is calculated. For the rest of this post I am going to assume you have read the paper. There is a link to it in the first paragraph.

The most obvious thing we have to do is to implement the function referred to by the Greek letter Tau. This is the function that, dependent on the Rho parameter, computes the degree to which the probabilities for the low scoring results change.

tau <- Vectorize(function(xx, yy, lambda, mu, rho){
if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
} else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
} else if (xx == 1 & yy == 0){return(1 + (mu*rho))
} else if (xx == 1 & yy == 1){return(1 - rho)
} else {return(1)}
})


We can now make a function for the likelihood of the data. A common trick when implementing likelihood functions is to use the log-likelihood instead. The reason is that when the probabilities for each data point for a given set of parameters are multiplied together, they will be too small for the computer to handle. When the probabilities are log-transformed you can instead just add them together.

What this function does is that it takes the vectors of lambda (expected home goals) and mu (expected away goals), Rho, and the vectors of observed home and away goals, and computes the log-likelihood for all the data.

DClogLik <- function(y1, y2, lambda, mu, rho=0){
#rho=0, independence
#y1: home goals
#y2: away goals
sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
}
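The numerical problem that motivates the log-likelihood can be seen in a small sketch. The data here are made up (2000 observations that all equal one goal, with an arbitrary mean of 1.5) and are only meant to show the underflow.

```r
# 2000 made-up observations, each with one goal, and an arbitrary mean of 1.5
goals <- rep(1, 2000)
lambda <- 1.5

# Multiplying the probabilities underflows to exactly zero
likelihood <- prod(dpois(goals, lambda))

# Summing the log-probabilities stays well within floating point range
log.likelihood <- sum(dpois(goals, lambda, log = TRUE))
```

A likelihood of zero is useless for an optimizer, since it is the same for all parameter values, while the log-likelihood still changes smoothly with lambda.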


The team specific attack and defense parameters are not included in the log-likelihood function. Neither is the code that calculates the expected number of goals for each team in a match (lambda and mu). Before we can calculate these for each match, we need to do some data wrangling. Here is a function that takes a data.frame formatted like the data from football-data.co.uk, and returns a list with design matrices and vectors with the match results.

DCmodelData <- function(df){

hm <- model.matrix(~ HomeTeam - 1, data=df, contrasts.arg=list(HomeTeam='contr.treatment'))
am <- model.matrix(~ AwayTeam -1, data=df)

team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))

return(list(
homeTeamDM=hm,
awayTeamDM=am,
homeGoals=df$FTHG, awayGoals=df$FTAG,
teams=team.names
))
}


Now we create a function that calculates the log-likelihood from a set of parameters and the data we have. First it calculates the values for lambda and mu for each match, then it passes these and the number of goals scored in each match to the log-likelihood function.

This function needs to be written in such a way that it can be used by another function that will find the parameters that maximize the log-likelihood. First, all the parameters need to be given to a single argument in the form of a vector (the params argument). Also, the log-likelihood is multiplied by -1, since the optimization function we are going to use only minimizes, but we want to maximize.

DCoptimFn <- function(params, DCm){

home.p <- params[1]
rho.p <- params[2]

nteams <- length(DCm$teams)
attack.p <- matrix(params[3:(nteams+2)], ncol=1)
defence.p <- matrix(params[(nteams+3):length(params)], ncol=1)

lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)
mu <- exp(DCm$awayTeamDM %*% attack.p + DCm$homeTeamDM %*% defence.p)

return(
DClogLik(y1=DCm$homeGoals, y2=DCm$awayGoals, lambda, mu, rho.p) * -1
)
}

One more thing we need before we start optimizing is a function that helps the optimizer handle the constraint that the average of the attack parameters must equal 1. If this constraint isn’t given, it will be impossible to find a unique set of parameters that maximizes the likelihood.

DCattackConstr <- function(params, DCm, ...){
nteams <- length(DCm$teams)
attack.p <- matrix(params[3:(nteams+2)], ncol=1)
return((sum(attack.p) / nteams) - 1)
}
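The sign flip in DCoptimFn is one of two common ways to maximize with an optimizer that minimizes by default. As a sketch on a toy function (the function f and the result names are made up, and this is not the football likelihood), both ways look like this with the built-in optim function:

```r
# Toy function with its maximum at x = 2 (not related to the football data)
f <- function(x) -(x - 2)^2

# Option 1: minimize the negated function, as DCoptimFn does
res.min <- optim(par = 0, fn = function(x) -f(x), method = 'BFGS')

# Option 2: let optim maximize by setting a negative fnscale
res.max <- optim(par = 0, fn = f, method = 'BFGS', control = list(fnscale = -1))
```

Both calls should end up close to x = 2. I use the first option here because the function gets passed to an optimizer from another package, where multiplying by -1 is the safest bet.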


Now we are finally ready to find the parameters that maximize the likelihood based on our data. First, load the data (in this case data from the 2011-12 Premier League), and properly handle it with our DCmodelData function:

dta <- read.csv('FAPL1112.csv')
dcm <- DCmodelData(dta)


Now we need to give a set of initial estimates of our parameters. It is not so important what specific values these are, but they should preferably be in the same order of magnitude as what we think the estimated parameters should be. In the code below I set all attack parameters to 0.01 and all defense parameters to -0.08.

#initial parameter estimates
attack.params <- rep(.01, times=nlevels(dta$HomeTeam))
defence.params <- rep(-0.08, times=nlevels(dta$HomeTeam))
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)
#it is also useful to give the parameters some informative names
names(par.inits) <- c('HOME', 'RHO', paste('Attack', dcm$teams, sep='.'), paste('Defence', dcm$teams, sep='.'))


To optimize with equality constraints (the attack parameters must average 1) we can use the auglag function in the alabama package. This takes about 40 seconds to run on my laptop, much longer than the independent Poisson model fitted with the built-in glm function. This is because the auglag function uses some general purpose algorithms that can work with a whole range of home-made functions, while the glm function is implemented with a specific set of models in mind.

library(alabama)
res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)


Voilà! Now the parameters can be found by the command res$par. In a follow-up post I will show how we can use the model to make predictions of match outcomes.

| Team | Attack | Defence |
|---|---|---|
| Arsenal | 1.37 | -0.91 |
| Aston Villa | 0.69 | -0.85 |
| Blackburn | 0.94 | -0.47 |
| Bolton | 0.92 | -0.48 |
| Chelsea | 1.23 | -0.97 |
| Everton | 0.94 | -1.15 |
| Fulham | 0.93 | -0.89 |
| Liverpool | 0.89 | -1.13 |
| Man City | 1.56 | -1.43 |
| Man United | 1.52 | -1.31 |
| Newcastle | 1.10 | -0.88 |
| Norwich | 1.02 | -0.62 |
| QPR | 0.82 | -0.65 |
| Stoke | 0.64 | -0.87 |
| Sunderland | 0.86 | -0.99 |
| Swansea | 0.85 | -0.89 |
| Tottenham | 1.24 | -1.09 |
| West Brom | 0.86 | -0.88 |
| Wigan | 0.81 | -0.71 |
| Wolves | 0.79 | -0.42 |
| Home | 0.27 | |
| Rho | -0.13 | |

# Two Bayesian regression models for football results

Last fall I took a short introduction course in Bayesian modeling, and as part of the course we were going to analyze a data set of our own. I of course wanted to model football results. The inspiration came from a paper by Gianluca Baio and Marta A. Blangiardo, Bayesian hierarchical model for the prediction of football results (link).

I used data from Premier League from 2012 and wanted to test the predictions on the last half of the 2012-13 season. With this data I fitted two models: One where the number of goals scored were modeled using the Poisson distribution, and one where I modeled the outcome directly (as home win, away win or draw) using an ordinal probit model. As predictors I used the teams as categorical predictors, meaning each team will be associated with two parameters.

The Poisson model was pretty much the same as the first and simplest model described in the Baio and Blangiardo paper, but with slightly more informed priors. What makes this model interesting and different from the independent Poisson model I have written about before, apart from being estimated using Bayesian techniques, is that each match is not considered as two independent events when the parameters are estimated.
Instead a correlation is implicitly modeled by specifying the priors in a smart way (see figure 1 in the paper, or here), thereby modeling the number of goals scored like a sort-of-bivariate Poisson.

Although I haven’t had time to look much into it yet, I should also mention that Baio and Blangiardo extended their model and used it this summer to model the World Cup. You can read more at Baio’s blog.

The ordinal probit model exploits the fact that the outcomes for a match can be thought to be on an ordinal scale, with a draw (D) considered to be ‘between’ a home win (H) and an away win (A). An ordinal probit model is in essence an ordinary linear regression model with a continuous response mu, that is coupled with a set of threshold parameters. For any value of mu the probabilities for the categories are determined by the cumulative normal distribution and the threshold values. This is perhaps best explained with help from a figure:

Here we see an example where the predicted outcome is 0.9, and the threshold parameters have been estimated to 0 and 1.1. The areas under the curve between the thresholds are then the probabilities of the different outcomes.

To model the match outcomes I use a model with the same structure in the predictors as the Poisson model above. Since the outcomes are given as Away, Draw and Home, the home field advantage is not needed as a separate term. This is instead implicit in the coefficients for each team. This gives the coefficients a different interpretation from the above model. The two coefficients here can be interpreted as the ability when playing at home and the ability when playing away.

To get this model to work I had to set the constraint that the threshold separating Away and Draw was below the Draw-Home threshold. This implies that a good team would be expected to have a negative Away coefficient and a positive Home coefficient. Also, the intercept parameter had to be fixed to an arbitrary value (I used 2).
To estimate the parameters and make predictions I used JAGS through the rjags package.

For both models, I used the most credible match outcome as the prediction. How good were the predictions for the last half of the 2012-13 season? The results are shown in the confusion tables below.

Confusion matrix for Poisson model

| actual/predicted | A | D | H |
|---|---|---|---|
| A | 4 | 37 | 11 |
| D | 1 | 35 | 14 |
| H | 0 | 38 | 42 |

Confusion matrix for ordinal probit model

| actual/predicted | A | D | H |
|---|---|---|---|
| A | 19 | 0 | 33 |
| D | 13 | 0 | 37 |
| H | 10 | 0 | 70 |

The Poisson model got the result right in 44.5% of the matches while the ordinal probit got it right in 48.9%. The ordinal probit was thus better than the Poisson model, but it completely failed to even consider a draw as an outcome. Ordinal probit, however, does seem to be able to predict away wins, which the Poisson model was poor at.

Here is the JAGS model specification for the ordinal probit model.

model {

for( i in 1:Nmatches ) {

pr[i, 1] <- phi( thetaAD - mu[i] )
pr[i, 2] <- max( 0 , phi( (thetaDH - mu[i]) ) - phi( (thetaAD - mu[i]) ) )
pr[i, 3] <- 1 - phi( (thetaDH - mu[i]) )

y[i] ~ dcat(pr[i, 1:3])

mu[i] <- b0 + homePerf[teamh[i]] + awayPerf[teama[i]]
}

for (j in 1:Nteams){
homePerf.p[j] ~ dnorm(muH, tauH)
awayPerf.p[j] ~ dnorm(muA, tauA)

#sum to zero constraint
homePerf[j] <- homePerf.p[j] - mean(homePerf.p[])
awayPerf[j] <- awayPerf.p[j] - mean(awayPerf.p[])
}

thetaAD ~ dnorm( 1.5 , 0.1 )
thetaDH ~ dnorm( 2.5 , 0.1 )

muH ~ dnorm(0, 0.01)
tauH ~ dgamma(0.1, 0.1)

muA ~ dnorm(0, 0.01)
tauA ~ dgamma(0.1, 0.1)

#predicting missing values
predictions <- y[392:573]
}

And here is the R code I used to run the above model in JAGS.
library('rjags')
library('coda')

#load the data
dta <- read.csv('PL_1213.csv')

#Remove the match outcomes that should be predicted
to.predict <- 392:573 #this is row numbers
observed.results <- dta[to.predict, 'FTR']
dta[to.predict, 'FTR'] <- NA

#list that is given to JAGS
data.list <- list(
teamh = as.numeric(dta[,'HomeTeam']),
teama = as.numeric(dta[,'AwayTeam']),
y = as.numeric(dta[, 'FTR']),
Nmatches = dim(dta)[1],
Nteams = length(unique(c(dta[,'HomeTeam'], dta[,'AwayTeam']))),
b0 = 2 #fixed
)

#MCMC settings
parameters <- c('homePerf', 'awayPerf', 'thetaDH', 'thetaAD', 'predictions')
adapt <- 1000
burnin <- 1000
nchains <- 1
steps <- 15000
thinsteps <- 5

#Fit the model
#script.name is a string with the file name where the JAGS script is.
jagsmodel <- jags.model(script.name, data=data.list, n.chains=nchains, n.adapt=adapt)
update(jagsmodel, n.iter=burnin)
samples <- coda.samples(jagsmodel, variable.names=parameters,
n.chains=nchains, thin=thinsteps,
n.iter=steps)

#Save the samples
save(samples, file='bayesProbit_20131030.RData')

#print summary
summary(samples)

# The R code for the home field advantage and traveling distance analysis.

I was asked in the comments on my post Does traveling distance influence home field advantage? to provide the R code I used, because Klemens of the rationalsoccer blog wanted to do the analysis on some of his own data. I have refactored it a bit to make it easier to use.

First load the data with the coordinates I posted last year.

dta.stadiums <- read.csv('stadiums.csv')

I also assume you have data formatted like the data from football-data.co.uk in a data frame called dta.matches.

First we need a way to calculate the distance (in kilometers) between the two coordinates. This is a function that does that.

coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
#Calculates the distance between two WGS84 coordinates.
#
#http://en.wikipedia.org/wiki/Haversine_formula
#http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
#clamp sqrt(h) to 1 before taking asin, to guard against rounding errors
c <- 2 * asin(pmin(1, sqrt(h)))
d <- radius * c
return(d)
}

Next, we need to find the coordinates where each match is played, and the coordinates for where the visiting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.

coord.home <- dta.stadiums[match(dta.matches$HomeTeam, dta.stadiums$FDCOUK), c('Latitude', 'Longitude')]
coord.away <- dta.stadiums[match(dta.matches$AwayTeam, dta.stadiums$FDCOUK), c('Latitude', 'Longitude')]

dta.matches$Distance <- coordinate.distance(coord.home$Latitude, coord.home$Longitude,
coord.away$Latitude, coord.away$Longitude)
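A quick way to gain some confidence in a distance function like this is to try it on cases where the answer is known. The sketch below uses a compact stand-in called haversine.km (a hypothetical name, written only so the block runs on its own) and made-up coordinates.

```r
# Hypothetical compact version of the haversine distance, for checking only.
# Inputs are in decimal degrees; the result is in kilometers.
haversine.km <- function(lat1, long1, lat2, long2, radius = 6371) {
  to.rad <- pi / 180
  dlat <- (lat2 - lat1) * to.rad
  dlong <- (long2 - long1) * to.rad
  h <- sin(dlat / 2)^2 + cos(lat1 * to.rad) * cos(lat2 * to.rad) * sin(dlong / 2)^2
  2 * radius * asin(pmin(1, sqrt(h)))
}

# Same point twice: the distance should be zero
d.same <- haversine.km(53.46, -2.29, 53.46, -2.29)

# A quarter of the way around the equator: should equal radius * pi / 2
d.quarter <- haversine.km(0, 0, 0, 90)

# The function is vectorised, so whole columns can be passed at once
d.vec <- haversine.km(c(0, 0), c(0, 0), c(0, 0), c(90, 180))
```

Because the function is vectorised, the Distance column can be computed for all matches in one call, without a loop.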


Here are two functions that are needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the home field advantage per match.

avgerage.gd <- function(dta){
#Calculates the average goal difference for each team.

all.teams <- unique(c(levels(dta$HomeTeam), levels(dta$AwayTeam)))
average.goal.diff <- numeric(length(all.teams))
names(average.goal.diff) <- all.teams
for (t in all.teams){
idxh <- which(dta$HomeTeam == t)
goals.for.home <- dta[idxh, 'FTHG']
goals.against.home <- dta[idxh, 'FTAG']

idxa <- which(dta$AwayTeam == t)
goals.for.away <- dta[idxa, 'FTAG']
goals.against.away <- dta[idxa, 'FTHG']

n.matches <- length(idxh) + length(idxa)
total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) - sum(goals.against.home) - sum(goals.against.away)

average.goal.diff[t] <- total.goal.difference / n.matches
}
return(average.goal.diff)
}

matchwise.hfa <- function(dta, avg.goaldiff){
#Calculates the matchwise home field advantage based on the average goal
#difference for each team.

n.matches <- dim(dta)[1]
hfa <- numeric(n.matches)
for (idx in 1:n.matches){
#Look up by team name (as.character avoids indexing by factor codes)
hometeam.avg <- avg.goaldiff[as.character(dta[idx,'HomeTeam'])]
awayteam.avg <- avg.goaldiff[as.character(dta[idx,'AwayTeam'])]
expected.goal.diff <- hometeam.avg - awayteam.avg
observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG']
hfa[idx] <- observed.goal.diff - expected.goal.diff
}
return(hfa)
}
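
To make the two steps concrete, here is a minimal sketch in Python (the code in this post is otherwise R) that applies the same calculation to three made-up matches. The function names and the results are hypothetical and only meant as an illustration of the arithmetic, not as a replacement for the functions above.

```python
from collections import defaultdict

def average_goal_diff(matches):
    """Average goal difference per team.

    Each match is a tuple (home, away, home_goals, away_goals).
    """
    total = defaultdict(int)
    played = defaultdict(int)
    for home, away, hg, ag in matches:
        total[home] += hg - ag
        total[away] += ag - hg
        played[home] += 1
        played[away] += 1
    return {team: total[team] / played[team] for team in total}

def matchwise_hfa(matches, avg_gd):
    """Observed minus expected goal difference, seen from the home team."""
    return [(hg - ag) - (avg_gd[home] - avg_gd[away])
            for home, away, hg, ag in matches]

# Made-up results, not real data
matches = [("A", "B", 2, 0), ("B", "C", 1, 1), ("C", "A", 0, 1)]
avg_gd = average_goal_diff(matches)
hfa = matchwise_hfa(matches, avg_gd)
print(avg_gd)  # {'A': 1.5, 'B': -1.0, 'C': -0.5}
print(hfa)     # [-0.5, 0.5, 1.0]
```

In the first match team A won by two goals at home, but based on the averages it was expected to win by 2.5, so the home field advantage for that match comes out as -0.5.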


In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (phew!). The home field advantage is put into the new column HFA.

dta.matches$HFA <- numeric(dim(dta.matches)[1])
seasons <- levels(dta.matches$Season)

for (i in 1:length(seasons)){
season.l <- dta.matches$Season == seasons[i]
h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
dta.matches$HFA[season.l] <- h
}


At last we can do the linear regression and make a nice little plot.

m <- lm(HFA ~ Distance, data=dta.matches)
summary(m)

plot(dta.matches$Distance, dta.matches$HFA, xlab='Distance (km)', ylab='Difference from expected goals', main='Home field advantage vs traveling distance')
abline(m, col='red')


# Poor man’s parallel processing

Here’s a nice trick I learned on how you could implement simple parallel processing capabilities to speed up computations. This trick is only applicable in certain simple cases though, and does not scale very well, so it is best used in one-off scripts rather than in scripts that are used routinely or by others.

Suppose you have a list or an array that you are going to loop through. Each of the elements in the list takes a long time to process and each iteration is NOT dependent on the result of any of the previous iterations. This is exactly the kind of situation where this trick is applicable.

The trick is to save the result for each iteration in a file whose name is unique to the iteration, and at the beginning of each iteration you simply check if that file already exists. If it does, the script skips to the next iteration. If it doesn’t, you create the file. This way you could run many instances of the script simultaneously, without doing the same iteration twice.

With this trick the results will be spread across different files, but if they are named and formatted in a consistent way it is not hard to go through the files and merge them into a single file.

Here is how it could be done in python:

import os.path

myList = ['bill', 'george', 'barack', 'ronald']

for president in myList:

    fileName = 'result_{}'.format(president)

    if os.path.isfile(fileName):
        print('File {} already exists, continues to the next iteration'.format(fileName))
        continue

    #Create the file right away so other instances skip this iteration
    f = open(fileName, 'w')

    #myResults is the object where your results are stored
    f.write(myResults)
    f.close()



And in R:


myList <- c('bill', 'george', 'barack', 'ronald')

for (president in myList){

file.name <- paste('results', president, sep='_')

if (file.exists(file.name)){
cat('File', file.name, 'already exists, continues to the next iteration\n')
next
}

file.create(file.name)

#Save the my.result object
save(my.result, file=file.name)
}
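
One weakness of checking for the file first and creating it afterwards is that two instances can both find the file missing before either of them has created it. A possible refinement, sketched here in Python, is to let the operating system do the check and the creation in a single step by opening the file in exclusive creation mode ('x'), which raises FileExistsError if the file is already there. The process function and the temporary directory are placeholders made up for this example.

```python
import os
import tempfile

def process(name):
    # Placeholder for the slow computation
    return name.upper()

def run_once(names, out_dir):
    """Process each name unless another instance has already claimed it."""
    done = []
    for name in names:
        path = os.path.join(out_dir, 'result_{}'.format(name))
        try:
            # Mode 'x' creates the file and fails if it already exists
            f = open(path, 'x')
        except FileExistsError:
            continue
        with f:
            f.write(process(name))
        done.append(name)
    return done

out_dir = tempfile.mkdtemp()
presidents = ['bill', 'george', 'barack', 'ronald']
first = run_once(presidents, out_dir)
second = run_once(presidents, out_dir)  # everything is already claimed
print(first, second)
```

The second call does nothing, since every result file was claimed by the first call. As with the original trick, a file that exists only tells you that an iteration has been started, not that it finished successfully.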


# FIFA Women’s World Ranking and goal difference in Elo ratings.

The FIFA rankings for women’s national teams use a quite different methodology than the one used in ranking men’s national teams. The Women’s World Ranking (WWR) is based on the Elo rating system I wrote about in the previous post. The details for the men’s ranking can be found here, and the details for the women’s ranking can be found here.

One thing that makes the WWR interesting is how goal differences are accounted for in the ratings. This is not something found in the ordinary chess-based Elo ratings. The method used in WWR is to let the ‘winning percentage’ change depending on two things: the goal difference and the number of goals the losing team scored. This is in contrast to the original Elo ratings where the winning team wins 100% and the losing team wins 0% (a draw is 50%). In WWR a team can never win 100%; the most it can win is 99%. This is the case when the goal difference is 6 or more and the losing team hasn’t scored any goals. The table below is from the pdf file linked to above and shows what percentage the losing team wins.

One strange thing about this table is the column for goal difference 0, implying a draw. I am guessing this is an error since it means that the winning percentage for the losing team will be greater than for the winning team. In the paper I mentioned in my last post where a number of different rating methods were compared (The predictive power of ranking systems in association football by J. Lasek et al.), it was assumed that a draw would yield both teams 50%, as in the original Elo ratings. That paper also showed that the Women’s World Ranking was among the rating systems with the best predictions.

Here is a plot showing the win percentage when the losing team has scored 0 and 5 goals. We can see that there is not much gained for the losing team by scoring one extra goal (assuming the goal difference stays the same, which of course is dubious), and most of the gain in winning percentage comes when a team scores a goal such that the score goes from a draw to a win.

The table above only goes up to 5 goals for the losing team, but for the sake of implementation it is easy to generalize the rule about how much the losing team gains by scoring an additional goal (with the goal difference staying the same). For example, the losing team will gain 0.9 extra winning percentage points when the goal difference is 2. Similarly, the gain is 0.6 percentage points when the goal difference is 5.

Below is an R function to compute the win percentage for football matches. It takes as input two vectors with the number of goals scored by the two opponents and returns a vector of win percentages (a number between 0 and 1) for the first team.

winPercentageWWR <- function(team1Goals, team2Goals){
#calculates the win percentage for team 1.

stopifnot(length(team1Goals) == length(team2Goals))

perc <- c(0.01, 0.02, 0.03, 0.04, 0.08, 0.15, 0.50, 0.85, 0.92, 0.96, 0.97, 0.98, 0.99)
add <- c(0, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005)

goalDifferences <- (team1Goals - team2Goals)
goalDifferences[goalDifferences < -6] <- -6
goalDifferences[goalDifferences > 6] <- 6

team1WinPercentage <- numeric(length=length(goalDifferences))

for (idx in 1:length(goalDifferences)){

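#Adjust for the goals scored by the losing team (this adjustment is
#reconstructed from the rule described in the text above).
losingTeamGoals <- min(team1Goals[idx], team2Goals[idx])
adjustment <- add[abs(goalDifferences[idx])+1] * losingTeamGoals
team1WinPercentage[idx] <- perc[goalDifferences[idx]+7] - sign(goalDifferences[idx]) * adjustment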
}
return(team1WinPercentage)
}


# Elo ratings in football

I have previously written about some statistical methods for rating football teams and predicting the results of future matches. One was the least squares method and another was the Poisson regression method. Neither of these methods makes good enough predictions. One problem with them is that they don’t incorporate a time perspective. Matches played a year ago are given the same importance as the most recent ones. This could however be incorporated by weighting the older matches less than newer matches. Another problem that I mentioned in the second post about Poisson regression is that teams are treated as categoricals, which makes it hard to model the fact that a team’s ability changes over time.

A different kind of method that has been employed a lot in recent years is the Elo rating system, which was originally developed for rating chess players. The method is rather simple, but I will not explain it in detail here since there are many good explanations of it elsewhere. Wikipedia has a very thorough coverage. The basic principle is that the difference in ratings between the two opposing teams provides a prediction for the result of each game. The rating is then updated based on how the teams perform. If a team performs better than expected the rating increases, and if it performs worse than expected the rating decreases. How much the rating changes depends on an update factor (often referred to as the K-factor).
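
As a small worked example of the principle, here is a sketch in Python of a single rating update. It uses the usual Elo expectation formula with a scale of 400 and an update factor of 24; the function names are made up for the illustration.

```python
def expected_score(rating, opponent_rating):
    """Expected result (between 0 and 1) given the rating difference."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

def update(rating, expected, observed, k=24):
    """New rating after observing a result (1 = win, 0.5 = draw, 0 = loss)."""
    return rating + k * (observed - expected)

home, away = 1500, 1500
exp_home = expected_score(home, away)
exp_away = 1 - exp_home

# Home win: observed result 1 for the home team, 0 for the away team
new_home = update(home, exp_home, 1)
new_away = update(away, exp_away, 0)
print(exp_home, new_home, new_away)  # 0.5 1512.0 1488.0
```

Two equally rated teams are each expected to get 0.5, so a win moves 12 points from the loser to the winner. Had the winner been rated much higher beforehand, its expected result would have been closer to 1 and fewer points would have changed hands.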

Chess and football are of course different in many ways so the method for rating chess players is not directly suitable for rating football teams. The relative simplicity of the Elo system makes it easy to tweak and adjust to better fit football by incorporating things like home field advantage and goal difference. There are many sites around the Internet that provide different variants of Elo ratings, like the World Football Elo Ratings for national teams and Club Elo and Euro Club Index for club teams. FIFA even uses its own Elo system in its Women’s World Ranking.

There has even been some research into different football rating systems. A paper titled The predictive power of ranking systems in association football (pdf) by Jan Lasek and others compared different rating systems. Their conclusion was that the different Elo type systems in general were better at predicting match outcomes than other types of rating systems.

I figured I wanted to implement a simple Elo rating system for rating football teams. There is already a package in R, PlayerRatings, which implements several different rating systems based on Elo. In my simple implementation there is no adjustment for goal difference, but I have support for home field advantage. All teams start with an initial rating of 1500. Here is what I got when I calculated the ratings for Premier League in November 2012 based on data going back to 1993. I used an update factor of 24 without any home field advantage. There is no particular reason for these choices as I did this mostly as a proof of concept.

Rating (November 2012)
Man United 1807
Man City 1767
Chelsea 1696
Arsenal 1658
Tottenham 1645
Everton 1640
Newcastle 1613
Fulham 1591
Liverpool 1567
West Brom 1562
Leeds 1552
Wigan 1543
Swansea 1526
Sunderland 1524
Stoke 1521
Middlesboro 1516
Norwich 1509
Aston Villa 1498
West Ham 1494
Birmingham 1493
Blackpool 1483
Ipswich 1481
Bolton 1479
Charlton 1470
Sheffield United 1458
Blackburn 1450
Sheffield Weds 1447
Coventry 1447
Middlesbrough 1443
QPR 1440
Barnsley 1439
Portsmouth 1438
Southampton 1437
Oldham 1436
Crystal Palace 1433
Leicester 1433
Nott’m Forest 1430
Hull 1422
Burnley 1418
Wolves 1414
Wimbledon 1413
Watford 1411
Swindon 1404
Derby 1297

The table seems reasonable I think, except for a couple of things. There is a problem related to relegation and promotion. Since I have used data back to 1993, every team that has played in the Premier League is given a rating. If a team is relegated to the Championship, its rating will no longer be updated. We can see that this creates some strange results. Take the two lowest rated teams for example. Derby has not been in the Premier League since the 2007-2008 season. Swindon, which is rated about 100 points higher than Derby, has not played in the Premier League since the 1993-1994 season! Swindon now play in the fourth level of the English league system. So the ratings for the teams not in the Premier League should be considered invalid.

Relegation and promotion also create a problem with inflated ratings. The Elo system is created so that the total number of points in the league should be constant. When a team is promoted it starts with an initial rating of 1500, and if it later gets relegated it will probably have lost some of those points to the other teams in the league. In fact, we see that many of the teams with ratings less than 1500 no longer play in the Premier League. The points they have lost are still present in the league even though the team isn’t. This means that over time the average rating of the teams in the league will increase.

The code I have written takes a data frame as input and works “out of the box” with data from football-data.co.uk. If you are going to use it yourself you have to make sure the data is sorted by date as the rating function just loops from top to bottom.

Here is how you can use it:

dta <- read.csv("yourdata.csv")
elo <- eloRating(data=dta)
print(elo)


And here is the code:


eloRating <- function(home="HomeTeam", away="AwayTeam", homeGoals="FTHG",
awayGoals="FTAG", data, kfactor=24, initialRating=1500,
homeAdvantage=0){
#homeAdvantage (argument name assumed) is added to the home team's rating.

#Make a list to hold ratings for all teams
all.teams <- levels(as.factor(union(levels(as.factor(data[[home]])),
levels(as.factor(data[[away]])))))

ratings <- as.list(rep(initialRating, times=length(all.teams)))
names(ratings) <- all.teams

#Loop through data and update ratings
for (idx in 1:dim(data)[1]){

#get current ratings
homeTeamName <- as.character(data[[home]][idx])
awayTeamName <- as.character(data[[away]][idx])
homeTeamRating <- as.numeric(ratings[homeTeamName]) + homeAdvantage
awayTeamRating <- as.numeric(ratings[awayTeamName])

#calculate expected outcome
expectedHome <- 1 / (1 + 10^((awayTeamRating - homeTeamRating)/400))
expectedAway <- 1 - expectedHome

#Observed outcome
goalDiff <- data[[homeGoals]][idx] - data[[awayGoals]][idx]
if (goalDiff == 0){
resultHome <- 0.5
resultAway <- 0.5
}
else if (goalDiff < 0){
resultHome <- 0
resultAway <- 1
}
else if (goalDiff > 0){
resultHome <- 1
resultAway <- 0
}

#update ratings
ratings[homeTeamName] <- as.numeric(ratings[homeTeamName]) + kfactor*(resultHome - expectedHome)
ratings[awayTeamName] <- as.numeric(ratings[awayTeamName]) + kfactor*(resultAway - expectedAway)
}

#prepare output
ratingsOut <- as.numeric(ratings)
names(ratingsOut) <- names(ratings)
ratingsOut <- sort(ratingsOut, decreasing=TRUE)

return(ratingsOut)
}


# Predicting football results with Poisson regression pt. 2

In part 1 I wrote about the basics of the Poisson regression model for predicting football results, and briefly mentioned what our data should look like. In this part I will look at how we can fit the model and calculate probabilities for the different match outcomes. I will also discuss some problems with the model, and hint at a few improvements.

Fitting the model with R
When we have the data in an appropriate format we can fit the model. R has a built-in function glm() that can fit Poisson regression models. The code for loading the data, fitting the model and getting the summary is simple:

#load data

#fit model and get a summary
model <- glm(Goals ~ Home + Team + Opponent, family=poisson(link=log), data=yrdta)
summary(model)


The summary function for fitting the model with data from Premier League 2011-2012 season gives us this (I have removed portions of it for space reasons):

(Edit September 2014: There were some errors in the estimates in the original version of this post. This was because I made some mistakes when I formatted the data as described in part one. Thanks to Derek in the comments for pointing this out.)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.45900    0.19029   2.412 0.015859 *
Home                 0.26801    0.06181   4.336 1.45e-05 ***
TeamAston Villa     -0.69103    0.20159  -3.428 0.000608 ***
TeamBlackburn       -0.40518    0.18568  -2.182 0.029094 *
TeamBolton          -0.44891    0.18810  -2.387 0.017003 *
TeamChelsea         -0.13312    0.17027  -0.782 0.434338
TeamEverton         -0.40202    0.18331  -2.193 0.028294 *
TeamFulham          -0.43216    0.18560  -2.328 0.019886 *
-----
OpponentSunderland  -0.09215    0.20558  -0.448 0.653968
OpponentSwansea      0.01026    0.20033   0.051 0.959135
OpponentTottenham   -0.18682    0.21199  -0.881 0.378161
OpponentWest Brom    0.03071    0.19939   0.154 0.877607
OpponentWigan        0.20406    0.19145   1.066 0.286476
OpponentWolves       0.48246    0.18088   2.667 0.007646 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


The Estimate column is the most interesting one. We see that the overall mean is e^0.459 ≈ 1.58 and that the home advantage is e^0.268 ≈ 1.31 (remember that we actually estimate the logarithm of the expectation, therefore we need to exponentiate the coefficients to get interpretable numbers). If we want to predict the result of a match with Aston Villa at home against Sunderland we could plug the estimates into our formula, or use the predict() function in R. We need to do this twice: once to predict the number of goals Aston Villa is expected to score, and once for Sunderland.

#aston villa
predict(model, data.frame(Home=1, Team="Aston Villa", Opponent="Sunderland"), type="response")
# 0.9453705

#for sunderland. note that Home=0.
predict(model, data.frame(Home=0, Team="Sunderland", Opponent="Aston Villa"), type="response")
# 0.999
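
The prediction for Aston Villa can also be reproduced by hand from the coefficients shown in the summary output, which may help clarify what predict() does. The sketch below is in Python and only uses the four estimates that are visible in the table; the variable names are made up. The same cannot be done for Sunderland here, since the coefficients it would need are in the part of the output that was removed.

```python
import math

# Coefficients copied from the summary output above
intercept = 0.45900
home = 0.26801
team_aston_villa = -0.69103
opponent_sunderland = -0.09215

# Linear predictor on the log scale (Home = 1), then back-transform
log_expected = intercept + home * 1 + team_aston_villa + opponent_sunderland
expected_goals = math.exp(log_expected)
print(round(expected_goals, 3))  # 0.945
```

The small difference from the value predict() gives beyond the third decimal is due to the rounding of the printed coefficients.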


We see that Aston Villa is expected to score on average 0.945 goals, while Sunderland is expected to score on average 0.999 goals. We can plot the probabilities for the different number of goals against each other:

We can see that Aston Villa has a slightly higher probability of scoring no goals than Sunderland. Sunderland also has a slightly higher probability for most other numbers of goals. Both teams have about the same probability of scoring exactly one goal. In general the pattern we see in the plot is consistent with what we would expect considering the expected number of goals.

Match result probabilities
Now that we have our expected number of goals for the two opponents in a match, we can calculate the probabilities for the three outcomes: home win (H), draw (D) and away win (A). But before we continue, there is an assumption in our model that needs to be discussed, namely the assumption that the goals scored by the two teams are independent. This may not be obvious since surely we have included information about who plays against whom when we predict the number of goals for each team. But remember that each match is included twice in our data set, and the way the regression method works, each observation is assumed to be independent of the others. We’ll see later that this can cause some problems.

The most obvious way to calculate the probabilities of the three outcomes is perhaps to look at goal differences. If we can calculate the probabilities for goal differences (home goals minus away goals) of exactly 0, less than 0, and greater than 0, we get the probabilities we are looking for. I will explain two ways of doing this, both yielding the same result (in theory at least): by using the Skellam distribution and by simulation.

Skellam distribution
The Skellam distribution is the probability distribution of the difference of two independent Poisson distributed variables, in other words, the probability distribution for the goal difference. R does not support it natively, but the VGAM package does. For our example the distribution looks like this:

If we do the calculations we get the probabilities for home win, draw, away win to be 0.329, 0.314, 0.357 respectively.

#Away
sum(dskellam(-100:-1, predictHome, predictAway)) #0.3574468
#Home
sum(dskellam(1:100, predictHome, predictAway)) #0.3289164
#Draw
sum(dskellam(0, predictHome, predictAway)) #0.3136368
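
The same three probabilities can be obtained without the Skellam distribution by summing the joint probabilities of all scorelines, which is possible because the two goal counts are assumed independent. Here is a minimal sketch of that approach in Python. The function names are made up, the cut-off at 20 goals per team is an arbitrary choice, and 0.999 is the rounded prediction for Sunderland, so the results can differ slightly in the last decimals from the ones above.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probabilities(lam_home, lam_away, max_goals=20):
    """Home win, draw and away win probabilities for independent Poisson goals."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away

# Expected goals from the predictions earlier in the post
p_home, p_draw, p_away = outcome_probabilities(0.9453705, 0.999)
print(p_home, p_draw, p_away)
```

The intermediate table of scoreline probabilities is also useful in itself, for example if you want the probability of a specific result such as 1-1.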


Simulation
The second method we can use is simulation. We simulate a number of matches (10000 in our case) by having the computer draw random numbers from the two Poisson distributions and look at the differences. We get the probabilities for the different outcomes by calculating the proportion of different goal differences. The independence assumption makes this easy since we can simulate the number of goals for each team independently of each other.

set.seed(915706074)
nsim <- 10000
homeGoalsSim <- rpois(nsim, predictHome)
awayGoalsSim <- rpois(nsim, predictAway)
goalDiffSim <- homeGoalsSim - awayGoalsSim
#Home
sum(goalDiffSim > 0) / nsim #0.3275
#Draw
sum(goalDiffSim == 0) / nsim # 0.3197
#Away
sum(goalDiffSim < 0) / nsim #0.3528


The results differ a tiny bit from what we got from using the Skellam distribution. It is still accurate enough to not cause any big practical problems.
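
How large a difference should we expect? A proportion p estimated from n independent simulations has a standard error of sqrt(p(1-p)/n). The following sketch, written in Python, computes this for the three outcomes using the numbers reported above and expresses each difference in units of standard errors.

```python
import math

nsim = 10000
# Probabilities from the Skellam calculation and from the simulation above
skellam = {'home': 0.3289164, 'draw': 0.3136368, 'away': 0.3574468}
simulated = {'home': 0.3275, 'draw': 0.3197, 'away': 0.3528}

z_scores = {}
for outcome, p in skellam.items():
    std_error = math.sqrt(p * (1 - p) / nsim)
    z_scores[outcome] = (simulated[outcome] - p) / std_error
    print(outcome, round(std_error, 4), round(z_scores[outcome], 2))
```

With 10000 simulations the standard error is a little under 0.005 for all three outcomes, and all the differences are within two standard errors, so the discrepancy is about what we would expect from random sampling alone. Halving the error would require four times as many simulations.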

How good is the model at predicting match outcomes?
The Poisson regression model is not considered to be among the best models for predicting football results. It is especially poor at predicting draws. Even when the two teams are expected to score the same number of goals it rarely manages to assign the highest probability to a draw. In one attempt I used Premier League data from the last half of one season and the first half of the next season to predict the second half of that season (without refitting the model after each match day). It assigned the highest probability to the right outcome in 50% of the matches, but never once predicted a draw.

Let’s look at some other problems with the model and suggest some improvements.

One major problem I see with the model is that the predictor variables are categorical. This is a constraint that makes inefficient use of the available data since we get rather few data points per parameter (i.e. per team). The model does not, for example, understand that some teams are more similar to each other than others, and instead views each team in isolation. There have been some attempts at using Bayesian methods to incorporate information on which teams are better and which are poorer. See for example this blog. If the teams instead could be reduced to numbers (by using some sort of rating system) we would get fewer parameters to estimate. We could then also incorporate an interaction term, something that is almost impossible with the categorical predictor variables we have. The interaction term in this case would be the effect of a team under- or overestimating its opponent.

(As an aside, we could in fact interpret the coefficients in our model as a form of rating of a team’s offensive and defensive strength.)

Another way the model can be improved is to incorporate a time aspect. The most obvious way to do this is perhaps to add weights to the matches such that more recent matches are more important than matches far back in time.
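
One possible weighting scheme, shown here only as an illustration, is to let the weight of a match decay exponentially with the number of days since it was played. Both the exponential form and the value of the decay parameter xi in this Python sketch are assumptions made for the example, not something I have tuned or tested, and the dates are made up.

```python
import math
from datetime import date

def match_weights(match_dates, today, xi=0.002):
    """Exponentially decaying weights.

    xi (per day) is an arbitrary illustrative value; larger values
    make old matches count less.
    """
    return [math.exp(-xi * (today - d).days) for d in match_dates]

# Made-up match dates
dates = [date(2012, 11, 1), date(2012, 5, 1), date(2011, 11, 1)]
weights = match_weights(dates, today=date(2012, 11, 1))
print([round(w, 3) for w in weights])
```

A match played today gets weight 1 and older matches get gradually smaller weights. In R, weights like these could be supplied to glm() through its weights argument.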

A further improvement would be to look at the level of individual players, and not at a team as a whole. For example, a team with many injured players in a match will most likely perform worse than you would otherwise expect. One could use weights to down-weight the contribution of matches where this is a problem. A much more powerful idea would be to combine data on match lineups with a rating system for players. This could be used to infer a rating for the whole team in a specific match. In addition to correcting for injured players it would also account for new players on a team and players leaving a team. The biggest problem with this approach is the lack of available data in a format that is easy to handle.

I don’t think any of the improvements I have discussed here will solve the problem of predicting draws since it originates in the independent Poisson assumption, although I think they could improve predictions in general. To counter the problem of predicting draws I think a very different model would have to be used. I would also like to mention that the improvements I have suggested here are rather general, and could be incorporated in many other prediction models.