The Dixon-Coles model for predicting football matches in R (part 2)

Part 1 ended with running the optimizer function to estimate the parameters in the model:

library(alabama)
res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

# Take a look at the parameters
res\$par

In part 1 I fitted the model to data from the 2011-12 Premier League season. Now it’s time to use the model to make a prediction. As an example I will predict the result of Bolton playing at home against Blackburn.

The first thing we need to do is to calculate the mu and lambda parameters, which is (approximately anyway) the expected number of goals scored by the home and away team. To do this wee need to extract the correct parameters from the res\$par vector. Recall that I in the last post gave the parameters informative names that consists of the team name prefixed by either Attack or Defence.
Also notice that I have to multiply the team parameters and then exponentiate the result to get the correct answer.

Update: For some reason I got the idea that the team parameters should be multiplied together, instead of added together, but I have now fixed the code and the results.

# Expected goals home
lambda <- exp(res\$par['HOME'] + res\$par['Attack.Bolton'] + res\$par['Defence.Blackburn'])

# Expected goals away
mu <- exp(res\$par['Attack.Blackburn'] + res\$par['Defence.Bolton'])

We get that Bolton is expected to score 2.07 goals and Blackburn is expected to score 1.59 goals.

Since the model assumes dependencies between the number of goals scored by the two teams, it is insufficient to just plug the lambda and mu parameters into R’s built-in Poisson function to get the probabilities for the number of goals scored by the two teams. We also need to incorporate the adjustment for the low-scoring results as well. One strategy to do this is to first create a matrix based on the simple independent Poisson distributions:

maxgoal <- 6 # will be useful later
probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

The number of home goals follows the vertical axis and the away goals follow the horizontal.

Now we can use the estimated dependency parameter rho to create a 2-by-2 matrix with scaling factors, that is then element-wise multiplied with the top left elements of the matrix calculated above:

Update: Thanks to Mike who pointed out a mistake in this code.

scaling_matrix <- matrix(tau(c(0,1,0,1), c(0,0,1,1), lambda, mu, res\$par['RHO']), nrow=2)
probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix

With this matrix it is easy to calculate the probabilities for the three match outcomes:

HomeWinProbability <- sum(probability_matrix[lower.tri(probability_matrix)])
DrawProbability <- sum(diag(probability_matrix))
AwayWinProbability <- sum(probability_matrix[upper.tri(probability_matrix)])

This gives a probability of 0.49 for home win, 0.21 for draw and 0.29 for away win.

Calculating the probabilities for the different goal differences is a bit trickier. The probabilities for each goal difference can be found by adding up the numbers on the diagonals, with the sum of the main diagonal being the probability of a draw.

awayG <- numeric(maxgoal)
for (gg in 2:maxgoal){
awayG[gg-1] <- sum(diag(probability_matrix[,gg:(maxgoal+1)]))
}
awayG[maxgoal] <- probability_matrix[1,(maxgoal+1)]

homeG <- numeric(maxgoal)
for (gg in 2:maxgoal){
homeG[gg-1] <- sum(diag(probability_matrix[gg:(maxgoal+1),]))
}
homeG[maxgoal] <- probability_matrix[(maxgoal+1),1]

goaldiffs <- c(rev(awayG), sum(diag(probability_matrix)), homeG)
names(goaldiffs) <- -maxgoal:maxgoal

It is always nice to plot the probability distribution: We can also see compare this distribution with the distribution without the Dixon-Coles adjustment (i.e. the goals scored by the two teams are independent): As expected, we see that the adjustment gives higher probability for draw, and lower probabilities for goal differences of one goal.

The Dixon-Coles model for predicting football matches in R (part 1)

Please have a look at the improved code for this model that I have posted here.

When it comes to Poisson regression models for football results, the 1997 paper Modelling Association Football Scores and Inefficiencies in the Football Betting Market (pdf) by Dixon and Coles is often mentioned. In this paper the authors describe an improvement of the independent goals model. The improvement consists of modeling a dependence between the probabilities for the number of goals less than 2 for both teams. They also improve the model by incorporating a time perspective, so that matches played a long time a go does not have as much influence on the parameter estimates.

The model by Dixon and Coles is not as easy to fit as the independent Poisson model I have described earlier. There is no built-in function in R that can estimate it’s parameters, and the authors provide little details about how to implement it. Mostly as an exercise, I have implemented the model in R, but without the time down-weighting scheme.

The estimating procedure uses a technique called maximum likelihood. This is perhaps the most commonly used method for estimating parameters in statistical models. The way it works is that you specify a way to calculate the likelihood of your data for a given set of parameters, and then you need to find the set of parameters that gives the highest possible likelihood of your data. The independent Poisson model is also fitted using a maximum likelihood method. The difference here is that the likelihood used by Dixon and Coles is non-standard.

The model is pretty much similar to other regression models I have discussed. Each team has an attack and a defense parameter, and from a function of these the expected number of goals for each team in a match is calculated. For the rest of this post I am going to assume you have read the paper. There is a link to it in the first paragraph.

The most obvious thing we have to do is to implement the function referred to by the greek letter Tau. This is the function that, dependent on the Rho parameter, computes the degree in which the probabilities for the low scoring goals changes.

tau <- Vectorize(function(xx, yy, lambda, mu, rho){
if (xx == 0 & yy == 0){return(1 - (lambda*mu*rho))
} else if (xx == 0 & yy == 1){return(1 + (lambda*rho))
} else if (xx == 1 & yy == 0){return(1 + (mu*rho))
} else if (xx == 1 & yy == 1){return(1 - rho)
} else {return(1)}
})

We can now make a function for the likelihood of the data. A common trick when implementing likelihood functions is to use the log-likelihood instead. The reason is that when the probabilities for each data point for a given set of parameters are multiplied together, they will be too small for the computer to handle. When the probabilities are log-transformed you can instead just add them together.

What this function does is that it takes the vectors of mu (expected home goals) and lambda (expected away goals), Rho, and the vectors of observed home and away goals, and computes the log-likelihood for all the data.

DClogLik <- function(y1, y2, lambda, mu, rho=0){
#rho=0, independence
#y1: home goals
#y2: away goals
sum(log(tau(y1, y2, lambda, mu, rho)) + log(dpois(y1, lambda)) + log(dpois(y2, mu)))
}

The team specific attack and defense parameters are not included in the log-likelihood function. Neither is the code that calculates the expected number of goals for each team in a match (lambda and mu). Before we can calculate these for each match, we need to do some data wrangling. Here is a function that takes a data.frame formated like the data from football-data.co.uk, and returns a list with design matrices and vectors with the match results.

DCmodelData <- function(df){

hm <- model.matrix(~ HomeTeam - 1, data=df, contrasts.arg=list(HomeTeam='contr.treatment'))
am <- model.matrix(~ AwayTeam -1, data=df)

team.names <- unique(c(levels(df\$HomeTeam), levels(df\$AwayTeam)))

return(list(
homeTeamDM=hm,
awayTeamDM=am,
homeGoals=df\$FTHG,
awayGoals=df\$FTAG,
teams=team.names
))
}

Now we create a function that calculates the log-likelihod from a set of parameters and the data we have. First it calculates the values for lambda and mu for each match, then it passes these and the number of goals scored in each match to the log-likelihood function.

This function needs to be written in such a way that it can be used by another function that will find the parameters that maximizes the log-likelihood. First, all the parameters needs to be given to a single argument in the form of a vector (the params argument). Also, the log-likelihood is multiplied by -1, since the optimization function we are going to use only minimizes, but we want to maximize.

DCoptimFn <- function(params, DCm){

home.p <- params
rho.p <- params

nteams <- length(DCm\$teams)
attack.p <- matrix(params[3:(nteams+2)], ncol=1)
defence.p <- matrix(params[(nteams+3):length(params)], ncol=1)

lambda <- exp(DCm\$homeTeamDM %*% attack.p + DCm\$awayTeamDM %*% defence.p + home.p)
mu <- exp(DCm\$awayTeamDM %*% attack.p + DCm\$homeTeamDM %*% defence.p)

return(
DClogLik(y1=DCm\$homeGoals, y2=DCm\$awayGoals, lambda, mu, rho.p) * -1
)
}

One more thing we need before we start optimizing is a function that helps the optimizer handle the constraint that all the attack parameters must sum to 1. If this constraint isn’t given, it will be impossible to find a unique set of parameters that maximizes the likelihood.

DCattackConstr <- function(params, DCm, ...){
nteams <- length(DCm\$teams)
attack.p <- matrix(params[3:(nteams+2)], ncol=1)
return((sum(attack.p) / nteams) - 1)
}

Now we are finally ready to find the parameters that maximizes the likelihood based on our data. First, load the data (in this case data from the 2011-12 premier league), and properly handle it with our DCmodelData function:

dcm <- DCmodelData(dta)

Now we need to give a set of initial estimates of our parameters. It is not so important what specific values these are, but should preferably be in the same order of magnitude as what we think the estimated parameters should be. I set all attack parameters to 0.1 and all defense parameters to -0.8.

#initial parameter estimates
attack.params <- rep(.01, times=nlevels(dta\$HomeTeam))
defence.params <- rep(-0.08, times=nlevels(dta\$HomeTeam))
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)
#it is also usefull to give the parameters some informative names
names(par.inits) <- c('HOME', 'RHO', paste('Attack', dcm\$teams, sep='.'), paste('Defence', dcm\$teams, sep='.'))

To optimize with equality constraints (all attack parameters must sum to 1) we can use the auglag function in the alabama package. This takes about 40 seconds to run on my laptop, much longer than the independent Poisson model fitted with the built in glm function. This is because the auglag function uses some general purpose algorithms that can work with a whole range of home-made functions, while the glm function is implemented with a specific set of models in mind.

library(alabama)
res <- auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=dcm)

Voilà! Now the parameters can be found by the command res\$par. In a follow-up post I will show how we can use the model to make prediction of match outcomes.

Team Attack Defence
Arsenal 1.37 -0.91
Aston Villa 0.69 -0.85
Blackburn 0.94 -0.47
Bolton 0.92 -0.48
Chelsea 1.23 -0.97
Everton 0.94 -1.15
Fulham 0.93 -0.89
Liverpool 0.89 -1.13
Man City 1.56 -1.43
Man United 1.52 -1.31
Newcastle 1.10 -0.88
Norwich 1.02 -0.62
QPR 0.82 -0.65
Stoke 0.64 -0.87
Sunderland 0.86 -0.99
Swansea 0.85 -0.89
Tottenham 1.24 -1.09
West Brom 0.86 -0.88
Wigan 0.81 -0.71
Wolves 0.79 -0.42
Home 0.27
Rho -0.13

Two Bayesian regression models for football results

Last fall I took a short introduction course in Bayesian modeling, and as part of the course we were going to analyze a data set of our own. I of course wanted to model football results. The inspiration came from a paper by Gianluca Baio and Marta A. Blangiardo Bayesian hierarchical model for the prediction of football results (link).

I used data from Premier League from 2012 and wanted to test the predictions on the last half of the 2012-23 season. With this data I fitted two models: One where the number of goals scored where modeled using th Poisson distribution, and one where I modeled the outcome directly (as home win, away win or draw) using an ordinal probit model. As predictors I used the teams as categorical predictors, meaning each team will be associated with two parameters.

The Poisson model was pretty much the same as the first and simplest model described in Baio and Blangiardo paper, but with slightly more informed priors. What makes this model interesting and different from the independent Poisson model I have written about before, apart from being estimated using Bayesian techniques, is that each match is not considered as two independent events when the parameters are estimated. Instead a correlation is implicitly modeled by specifying the priors in a smart way (see figure 1 in the paper, or here), thereby modeling the number of goals scored like a sort-of-bivariate Poisson.

Although I haven’t had time to look much into it yet, I should also mention that Baio and Blangiardo extended their model and used it this summer to model the World Cup. You can read more at Baio’s blog.

The ordinal probit model exploits the fact that the outcomes for a match can be thought to be on an ordinal scale, with a draw (D) considered to be ‘between’ a home win (H) and an away win (A). An ordinal probit model is in essence an ordinary linear regression model with a continuous response mu, that is coupled with a set of threshold parameters. For any value of mu the probabilities for any category is determined by the cumulative normal distribution and the threshold values. This is perhaps best explained with help from a figure: Here we see an example where the predicted outcome is 0.9, and the threshold parameters has been estimated to 0 and 1.1. The area under the curve is then the probability of the different outcomes.

To model the match outcomes I use a model inspired by the structure in the predictors as the Poisson model above. Since the outcomes are given as Away, Draw and Home, the home field advantage is not needed as a separate term. This is instead implicit in the coefficients for each team. This gives the coefficients a different interpretation from the above model. The two coefficients here can be interpreted as the ability when playing at home and the ability when playing away.

To get this model to work I had to set the constrains that the threshold separating Away and Draw were below the Draw-Home threshold. This implies that a good team would be expected to have a negative Away coefficient and a positive Home coefficient. Also, the intercept parameter had to be fixed to an arbitrary value (I used 2).

To estimate the parameters and make predictions I used JAGS trough the rjags package.

For both models, I used the most credible match outcome as the prediction. How well were the last half of the 2012-13 season predictions? The results are shown in the confusion table below.

Confusion matrix for Poisson model

 actual/predicted A D H A 4 37 11 D 1 35 14 H 0 38 42

Confusion matrix for ordinal probit model

 actual/predicted A D H A 19 0 33 D 13 0 37 H 10 0 70

The Poisson got the result right in 44.5% of the matches while the ordinal probit got right in 48.9%. This was better than the Poisson model, but it completely failed to even consider draw as an outcome. Ordinal probit, however, does seem to be able to predict away wins, which the Poisson model was poor at.

Here is the JAGS model specification for the ordinal probit model.

model {

for( i in 1:Nmatches ) {

pr[i, 1] <- phi( thetaAD - mu[i]  )
pr[i, 2] <- max( 0 ,  phi( (thetaDH - mu[i]) ) - phi( (thetaAD - mu[i]) ) )
pr[i, 3] <- 1 - phi( (thetaDH - mu[i]) )

y[i] ~ dcat(pr[i, 1:3])

mu[i] <- b0 + homePerf[teamh[i]] + awayPerf[teama[i]]
}

for (j in 1:Nteams){
homePerf.p[j] ~ dnorm(muH, tauH)
awayPerf.p[j] ~ dnorm(muA, tauA)

#sum to zero constraint
homePerf[j] <- homePerf.p[j] - mean(homePerf.p[])
awayPerf[j] <- awayPerf.p[j] - mean(awayPerf.p[])
}

thetaAD ~ dnorm( 1.5 , 0.1 )
thetaDH ~ dnorm( 2.5 , 0.1 )

muH ~ dnorm(0, 0.01)
tauH ~ dgamma(0.1, 0.1)

muA ~ dnorm(0, 0.01)
tauA ~ dgamma(0.1, 0.1)

#predicting missing values
predictions <- y[392:573]
}

And here is the R code I used to run the above model in JAGS.

library('rjags')
library('coda')

#Remove the match outcomes that should be predicted
to.predict <- 392:573 #this is row numbers
observed.results <- dta[to.predict, 'FTR']
dta[to.predict, 'FTR'] <- NA

#list that is given to JAGS
data.list <- list(
teamh = as.numeric(dta[,'HomeTeam']),
teama = as.numeric(dta[,'AwayTeam']),
y = as.numeric(dta[, 'FTR']),
Nmatches = dim(dta),
Nteams = length(unique(c(dta[,'HomeTeam'], dta[,'AwayTeam']))),
b0 = 2 #fixed
)

#MCMC settings
parameters <- c('homePerf', 'awayPerf', 'thetaDH', 'thetaAD', 'predictions')
burnin <- 1000
nchains <- 1
steps <- 15000
thinsteps <- 5

#Fit the model
#script name is a string with the file name where the JAGS script is.
update(jagsmodel, n.iter=burnin)

samples <- coda.samples(jagsmodel, variable.names=parameters,
n.chains=nchains, thin=thinsteps,
n.iter=steps)

#Save the samples
save(samples, file='bayesProbit_20131030.RData')

#print summary
summary(samples)

Predicting football results with Adaptive Boosting

Adaptive Boosting, usually referred to by the abbreviation AdaBoost, is perhaps the best general machine learning method around for classification. It is what’s called a meta-algorithm, since it relies on other algorithms to do the actual prediction. What AdaBoost does is combining a large number of such algorithms in a smart way: First a classification algorithm is trained, or fitted, or its parameters are estimated, to the data. The data points that the algorithm misclassifies are then given more weight as the algorithm is trained again. This procedure is repeated a large number of times (perhaps many thousand times). When making predictions based on a new set of data, each of the fitted algorithms predict the new response value, and a the most commonly predicted value is then considered the overall prediction. Of course there are more details surrounding the AdaBoost than this brief summary. I can recommend the book The Elements of Statistical Learning by Hasite, Tibshirani and Friedman for a good introduction to AdaBoost, and machine learning in general.

Although any classification algorithm can be used with AdaBoost, it is most commonly used with decision trees. Decision trees are intuitive models that make predictions based on a combination of simple rules. These rules are usually of the form “if predictor variable x is greater than a value y, then do this, if not, do that”. By “do this” and “do that” I mean continue to a different rule of the same form, or make a prediction. This cascade of different rules can be visualized with a chart that looks sort of like a tree, hence the tree metaphor in the name. Of course Wikipedia has an article, but The Elements of Statistical Learning has a nice chapter about trees too.

In this post I am going to use decision trees and AdaBoost to predict the results of football matches. As features, or predictors I am going to use the published odds from different betting companies, which is available from football-data.co.uk. I am going to use data from the 2012-13 and first half of the 2013-14 season of the English Premier League to train the model, and then I am going to predict the remaining matches from the 2013-14 season.

Implementing the algorithms by myself would of course take a lot of time, but luckily they are available trough the excellent Python scikit-learn package. This package contains lots of machine learning algorithms plus excellent documentation with a lot of examples. I am also going to use the pandas package for loading the data.

import numpy as np
import pandas as pd

dta_fapl2012_2013 = pd.read_csv('FAPL_2012_2013_2.csv', parse_dates=)
dta_fapl2013_2014 = pd.read_csv('FAPL_2013-2014.csv', parse_dates=)

dta = pd.concat([dta_fapl2012_2013, dta_fapl2013_2014], axis=0, ignore_index=True)

#Find the row numbers that should be used for training and testing.
train_idx = np.array(dta.Date < '2014-01-01')
test_idx = np.array(dta.Date >= '2014-01-01')

#Arrays where the match results are stored in
results_train = np.array(dta.FTR[train_idx])
results_test = np.array(dta.FTR[test_idx])

Next we need to decide which columns we want to use as predictors. I wrote earlier that I wanted to use the odds for the different outcomes. Asian handicap odds could be included as well, but to keep things simple I am not doing this now.

feature_columns = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
'IWD', 'IWA','LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
'SOH', 'SOD', 'SOA', 'SBH', 'SBD', 'SBA', 'SJH', 'SJD',
'SJA', 'SYH', 'SYD','SYA', 'VCH', 'VCD', 'VCA', 'WHH',
'WHD', 'WHA']

For some bookmakers the odds for certain matches is missing. In this data this is not much of a problem, but it could be worse in other data. Missing data is a problem because the algorithms will not work when some values are missing. Instead of removing the matches where this is the case we can instead guess the value that is missing. As a rule of thumb we can say that an approximate value for some variables of an observation is often better than dropping the observation completely. This is called imputation and scikit-learn comes with functionality for doing this for us.

The strategy I am using here is to fill inn the missing values by the mean of the odds for the same outcome. For example if the odds for home win from one bookmaker is missing, our guess of this odds is going to be the average of the odds for home win from the other bookmakers for that match. Doing this demands some more work since we have to split the data matrix in three.

from sklearn.preprocessing import Imputer

#Column numbers for odds for the three outcomes
cidx_home = [i for i, col in enumerate(dta.columns) if col[-1] in 'H' and col in feature_columns]
cidx_draw = [i for i, col in enumerate(dta.columns) if col[-1] in 'D' and col in feature_columns]
cidx_away = [i for i, col in enumerate(dta.columns) if col[-1] in 'A' and col in feature_columns]

#The three feature matrices for training
feature_train_home = dta.ix[train_idx, cidx_home].as_matrix()
feature_train_draw = dta.ix[train_idx, cidx_draw].as_matrix()
feature_train_away = dta.ix[train_idx, cidx_away].as_matrix()

#The three feature matrices for testing
feature_test_home = dta.ix[test_idx, cidx_home].as_matrix()
feature_test_draw = dta.ix[test_idx, cidx_draw].as_matrix()
feature_test_away = dta.ix[test_idx, cidx_away].as_matrix()

train_arrays = [feature_train_home, feature_train_draw,
feature_train_away]

test_arrays = [feature_test_home, feature_test_draw,
feature_test_away]

imputed_training_matrices = []
imputed_test_matrices = []

for idx, farray in enumerate(train_arrays):
imp = Imputer(strategy='mean', axis=1) #0: column, 1:rows
farray = imp.fit_transform(farray)
test_arrays[idx] = imp.fit_transform(test_arrays[idx])

imputed_training_matrices.append(farray)
imputed_test_matrices.append(test_arrays[idx])

#merge the imputed arrays
feature_train = np.concatenate(imputed_training_matrices, axis=1)
feature_test = np.concatenate(imputed_test_matrices, axis=1)

Now we are finally ready to use the data to train the algorithm. First an AdaBoostClassifier object is created, and here we need to give supply a set of arguments for it to work properly. The first argument is classification algoritm to use, which is the DecisionTreeClassifier algorithm. I have chosen to supply this algorithms with the max_dept=3 argument, which constrains the training algorithm to not apply more than three rules before making a prediction.

The n_estimators argument tells the algorithm how many decision trees it should fit, and the learning_rate argument tells the algorithm how much the misclassified matches are going to be up-weighted in the next round of decision three fitting. These two values are usually something that you can experiment with since there is no definite rule on how these should be set. The rule of thumb is that the lower the learning rate is, the more estimators you neeed.

The last argument, random_state, is something that should be given if you want to reproduce the model fitting. If this is not specified you will end up with slightly different trained algroithm each time you fit them. See this question on Stack Overflow for an explanation.

At last the algorithm is fitted using the fit() method, which is supplied with the odds and match results.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

DecisionTreeClassifier(max_depth=3),
n_estimators=1000,
learning_rate=0.4, random_state=42)

We can now see how well the trained algorithm fits the training data.

import sklearn.metrics as skm

print skm.confusion_matrix(list(training_pred), list(results_train))

This is the resulting confusion matrix:

 Away Draw Home Away 164 1 0 Draw 1 152 0 Home 0 0 152

We see that only two matches in the training data is misclassified, one away win which were predicted to be a draw and one draw that was predicted to be an away win. Normally with such a good fit we should be wary of overfitting and poor predictive power on new data.

Let’s try to predict the outcome of the Premier League matches from January to May 2014:

print skm.confusion_matrix(list(test_pred), list(results_test))
 Away Draw Home Away 31 19 12 Draw 13 10 22 Home 20 14 59

It successfully predicted the right match outcome in a bit over half of the matches.

The R code for the home field advantage and traveling distance analysis.

I was asked in the comments on my Does traveling distance influence home field advantage? to provide the R code I used, because Klemens of the rationalsoccer blog wanted to do the analysis on some of his own data. I have refactored it a bit to make it easier to use.

First load the data with the coordinates I posted last year.

I also assume you have data formated like the data from football-data.co.uk in a data frame called dta.matches.

First wee need a way to calculate the distance (in kilometers) between the two coordinates. This is a function that does that.

coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
#Calculates the distance between two WGS84 coordinates.
#
#http://en.wikipedia.org/wiki/Haversine_formula
#http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
c <- 2 * pmin(1, asin(sqrt(h)))
d <- radius * c
return(d)
}

Next, we need to find the coordinates where each match is played, and the coordinates for where the visting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.

c('Latitude', 'Longitude')]
c('Latitude', 'Longitude')]

dta.matches\$Distance <- coordinate.distance(coord.home\$Latitude, coord.home\$Longitude,
coord.away\$Latitude, coord.away\$Longitude)

Here are two functions that is needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the the home field advantage per match.

avgerage.gd <- function(dta){
#Calculates the average goal difference for each team.

all.teams <- unique(c(levels(dta\$HomeTeam), levels(dta\$AwayTeam)))
average.goal.diff <- numeric(length(all.teams))
names(average.goal.diff) <- all.teams
for (t in all.teams){
idxh <- which(dta\$HomeTeam == t)
goals.for.home <- dta[idxh, 'FTHG']
goals.against.home <- dta[idxh, 'FTAG']

idxa <- which(dta\$AwayTeam == t)
goals.for.away <- dta[idxa, 'FTAG']
goals.against.away <- dta[idxa, 'FTHG']

n.matches <- length(idxh) + length(idxa)
total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) - sum(goals.against.home) - sum(goals.against.away)

average.goal.diff[t] <- total.goal.difference / n.matches
}
return(average.goal.diff)
}

matchwise.hfa <- function(dta, avg.goaldiff){
#Calculates the matchwise home field advantage based on the average goal
#difference for each team.

n.matches <- dim(dta)
hfa <- numeric(n.matches)
for (idx in 1:n.matches){
hometeam.avg <- avg.goaldiff[dta[idx,'HomeTeam']]
awayteam.avg <- avg.goaldiff[dta[idx,'AwayTeam']]
expected.goal.diff <- hometeam.avg - awayteam.avg
observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG']
hfa[idx] <- observed.goal.diff - expected.goal.diff
}
return(hfa)
}

In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (puh!). The home field advantage is out into the new column HFA.

dta.matches\$HFA <- numeric(dim(dta.matches))
seasons <- levels(dta.matches\$Season)

for (i in 1:length(seasons)){
season.l <- dta.matches\$Season == seasons[i]
h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
dta.matches\$HFA[season.l] <- h
}

At last we can do the linear regression and make a nice little plot.

m <- lm(HFA ~ Distance, data=dta.matches)
summary(m)

plot(dta.matches\$Distance, dta.matches\$HFA, xlab='Distance (km)', ylab='Difference from expected goals', main='Home field advantage vs traveling distance')
abline(m, col='red')

The minimum violations ranking method

One informative benchmark when ranking and rating sports teams is how many times the ranking has been violated. A ranking violation occurs when a team beats a higher ranked team. Ideally no violations would occur, but in practice this rarely happens. In many cases it is unavoidable, for example in this three team competition: Team A beats team B, team B beats team C, and team C beats team A. In this case, for any of the 6 possible rankings of these three teams at least one violation would occur.

Inspired by this, one could try to construct a ranking with as few violations as possible. A minimum violations ranking (MVR), as it is called. The idea is simple and intuitive, and has been put to use in ranking American college sport teams. The MinV ranking by Jay Coleman is one example.

MV rankings have some other nice properties other than being just an intuitive measure. A MV ranking is the best ranking in terms of backwards predictions. It can also be a method for combining several other rankings, by using the other rankings as the data.

Despite this, I don’t think MV rankings are that useful in the context of football. The main reason for this is that football has a large number of draws and as far as I can tell, a draw has no influence on a MV ranking. A draw is therefore equivalent with no game at all and provides no information.

MV rankings also has another problem. In many cases there can be several rankings that satisfies the MV criterion. This of course depends on the data, but it seems nevertheless to be quite common, such as in the small example above.

Unfortunately, I have not found any software packages that can find a MV ranking. One algorithm is described in this paper (paywall), but I haven’t tried to implemented it myself. Most other MVR methods I have seen seem to be based on defining a set of mathematical constraints and then letting some optimization software search for solutions. See this paper for an example.

Does traveling distance influence home field advantage?

A couple of weeks ago I posted a data set with the location of the stadiums for many of the football teams in Europe. One thing I wanted to use the dataset for was to see if the traveling distance between two teams (as measured by the distance between the two team’s home stadium) influenced home field advantage.

To calculate the home field advantage for each match i did the following: For each team, the average goal difference during the season are calculated (goals scored minus goals conceded divided by the number of matches). Then the expected goal difference for a match is the difference between the average goal differences (home minus away). The home field advantage is then the observed goal difference minus the expected goal difference.

In the 2012-13 Premier League season, for example, Chelsea scored 75 goals and conceded 39 goals in total. Everton scored 55 and conceded 40 goals. Both teams played 38 matches during the season. On average Chelsea had a goal difference of per match of 0.947 and Everton’s average were 0.395. With Chelsea meeting Everton at home the expected goal difference is 0.947 – 0.395 = 0.553. The actual outcome for this match was 2-1, a goal difference of 1. The home field advantage for this match is then 1 – 0.553 = 0.447.

Using data from the 2011-12 and 2012-13 seasons from the top divisions from Spain, France Germany, and the 2012-13 from England I used the stadium coordinates to calculate the traveling distance for the visiting team and the home field advantage. Plotting these two against each other, and drawing the least squares line gives this: There is a great deal of noise in this plot, to put it mildly. The slope of the red line is 0.00006039. This is the estimated increase in number of goals the home team scores for each kilometer the away team has traveled. This is not significantly different from 0 (p-value = 0.646). The intercept, where the red line crosses the vertical axis is 0.4, meaning that the home team is estimated to score 0.4 more goals than expected, if the opposing team has traveled 0 kilometers. This is highly significant (p-value = 1.71e-11).

To be honest, I am a bit surprised to see such a clear lack of effect of traveling distance. I did not expect a particularly strong, or even very significant effect, but I had hoped to see at least a hint at something. Perhaps one reason for the lack of effect is that traveling distance is not necessarily the same as traveling time as longer distances may be covered by air, making them comparable to shorter travels by land.

It should be kept in mind that these results should only apply to the leagues included in the data. It could be that traveling distance could have a significant effect on longer distances, for example in international competitions such as the Champions League or between national teams.

BBC’s More Or Less on why the men’s FIFA rankings fail

One of the podcasts I listen to regularly, ‘More Or Less’ from the BBC, had the other day an episode about the (men’s) FIFA rankings. In the episode they discuss a shortcoming in the ranking system that makes it possible for a team to loose points (and thus ranking position) despite winning a match. The reason for this is not fully explained, but looking closer at the descriptions provided at fifa.com I think I see where the problem lies. After each match, rating points are given to the winner (or split if there is a draw). The crucial thing here is that friendly matches (or other non-important matches) gives fewer points than important tournament matches. The published ratings then are basically an average over the points earned for the matches played in the last couple of years. That means that winning a friendly match sometimes will yield fewer than a team’s average points, thus decreasing the average.

Unfortunately the episode did not mention the women’s FIFA ranking system which is based on the much better Elo system, used in chess rankings (and which I have written about previously). In this sort of system a win will almost surely give more points, and not less (the worst case scenario for a win is that no points are earned).

Dataset: Football stadiums with geographic coordinates

Here is a dataset I have put together with the location and capacity for the stadiums for about 160 European teams. The teams are from England, Scotland, France, Germany and Spain. The data is taken from Wikipedia and should be correct for the last couple of seasons. The French team Lille’s stadium is the current from the 2012 season, and Nice’s stadium is not the current one, but the one they had until the end of last season.

Some of the coordinates are more accurate than others, but I think they should be accurate enough to at least give an indication of the town the team comes from. That is probably true for the teams that have recently moved to another stadium as well; they are probably within the same town. The Guardian has looked into how far English Premier League clubs have moved, and it is usually less than 10 kilometers.

The data table contains 8 columns. The FDCOUK column contains the names of the teams as they appear in data from football-data.co.uk/. Since the names in the FDCOUK column often is abbreviated, more complete names are found in the Team column. There is also a column with the name of the stadium, but I have not been consistent with the naming with regards to traditional names vs. sponsored names.

What can this data set be used for? One thing I want to look into is whether traveling distance for the visiting team in a match influences the home field advantage. I have a couple of other ideas as well, but that will be for another time.

Edit: Updated dataset after I found a trailing whitespace in the FDCOUK column.
Edit December 16 2013: Added two Spanish teams.
Edit March 2nd 2015: Added 27 English teams.

Elo ratings in football: Home field advantage

In my first post about Elo ratings in football I posted the code for a R function where you could adjust the ratings to account for home field advantage. The method is simple: Some extra points are added to the home teams rating when the match predictions (which is based on the ratings) are calculated. My implementation did only support giving a the same fixed amount of extra points to all teams in all games. In other words it is assumed that all teams have the same home field advantage, and that the home field advantage does not change over time. This is of course unrealistic if the point of the ratings is to give as accurate predictions as possible. Still the method is used in the FIFA Women’s World Ranking and in the World Football Elo Ratings.

I know of two ways to implement a more dynamic (and more realistic) home field advantage. The ClubElo ratings (which is perhaps the best football rating site out there), developed by Lars Schiefler, let the home field advantage change over time, similar to how the ratings change. This is done by updating the home field advantage after each game based on the home team’s performance. An article on the ClubElo site describes the details very well.

A rather different method is used in the pi-rating system, developed by Anthony Constantinou. Each team have two ratings, one describing performance when playing at home, the other when playing away. The cool thing about the way this is done is that the two ratings for a team are not calculated separately from each other. It is not the case that only the home matches are used to calculate the home rating and away matches to calculate the away rating; After each match both ratings are updated. The home rating for the home team is updated almost like the regular Elo ratings, and the away rating is then also updated based on how much the home rating has changed, scaled by a factor. That way the two ratings are allowed to deviate from each other, giving rise to an adaptive, team specific home field advantage. The procedure is of course also applied to the away team’s ratings.

The pi-ratings, by the way, are interesting in other ways besides the method for determining the home field advantage. Instead of the ratings being somewhat arbitrary numbers, like most Elo systems, the pi-ratings directly models goal differences. The details are described in the paper Determining the level of ability of football teams by dynamic ratings based on the relative discrepancies in scores between adversaries. A draft of the paper is also available at the pi-ratings website. While I am at it, I can also recommend Constantinou’s other papers on football prediction.