# Predicting football results with Adaptive Boosting

Adaptive Boosting, usually referred to by the abbreviation AdaBoost, is perhaps the best general-purpose machine learning method around for classification. It is what’s called a meta-algorithm, since it relies on other algorithms to do the actual prediction. What AdaBoost does is combine a large number of such algorithms in a smart way: First a classification algorithm is trained, or fitted (or its parameters are estimated), on the data. The data points that the algorithm misclassifies are then given more weight as the algorithm is trained again. This procedure is repeated a large number of times (perhaps many thousands of times). When making predictions on a new set of data, each of the fitted algorithms predicts the new response value, and the most commonly predicted value is then considered the overall prediction. Of course there are more details surrounding AdaBoost than this brief summary. I can recommend the book The Elements of Statistical Learning by Hastie, Tibshirani and Friedman for a good introduction to AdaBoost, and to machine learning in general.
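To make the reweighting idea concrete, here is a minimal sketch of discrete AdaBoost with decision stumps as the underlying algorithm. The function names and the toy setup are mine, not from any library; real implementations (like the one used below) have many more refinements.

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=50):
    """Minimal discrete AdaBoost with decision stumps. Labels y must be -1 or +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # exhaustively pick the stump (feature, threshold, sign) with
        # the smallest weighted training error
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = max(err, 1e-10)                     # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # this stump's vote weight
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified points
        w /= w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Weighted vote over all fitted stumps."""
    total = np.zeros(len(X))
    for (j, thr, sign), alpha in zip(stumps, alphas):
        total += alpha * sign * np.where(X[:, j] > thr, 1, -1)
    return np.sign(total)
```

The key lines are the `alpha` computation and the weight update: points the current stump gets wrong have their weights multiplied by `exp(alpha)`, so the next stump is forced to pay more attention to them.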

Although any classification algorithm can be used with AdaBoost, it is most commonly used with decision trees. Decision trees are intuitive models that make predictions based on a combination of simple rules. These rules are usually of the form “if predictor variable x is greater than a value y, then do this, if not, do that”. By “do this” and “do that” I mean continue to a different rule of the same form, or make a prediction. This cascade of different rules can be visualized with a chart that looks sort of like a tree, hence the tree metaphor in the name. Of course Wikipedia has an article, but The Elements of Statistical Learning has a nice chapter about trees too.
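As a toy illustration of such a cascade of rules (the rules and numbers here are entirely made up, not learned from any data), a tree with two levels could look like this:

```python
def predict_result(home_odds, away_odds):
    """A hypothetical two-level decision tree for match outcomes."""
    if home_odds > 3.0:           # bookmakers think a home win is unlikely
        if away_odds > 3.5:       # ...and an away win is unlikely too
            return 'Draw'
        return 'Away win'
    return 'Home win'             # low home odds: the home team is the favorite
```

A fitted decision tree is just such a nest of if-else rules, with the features and thresholds chosen automatically from the data.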

In this post I am going to use decision trees and AdaBoost to predict the results of football matches. As features, or predictors I am going to use the published odds from different betting companies, which is available from football-data.co.uk. I am going to use data from the 2012-13 and first half of the 2013-14 season of the English Premier League to train the model, and then I am going to predict the remaining matches from the 2013-14 season.

Implementing the algorithms myself would of course take a lot of time, but luckily they are available through the excellent Python scikit-learn package. This package contains lots of machine learning algorithms plus excellent documentation with a lot of examples. I am also going to use the pandas package for loading the data.

import numpy as np
import pandas as pd

#Load one season per CSV file, downloaded from football-data.co.uk
#(the file names here are mine)
dta_fapl2012_2013 = pd.read_csv('fapl_2012_2013.csv', parse_dates=['Date'], dayfirst=True)
dta_fapl2013_2014 = pd.read_csv('fapl_2013_2014.csv', parse_dates=['Date'], dayfirst=True)

dta = pd.concat([dta_fapl2012_2013, dta_fapl2013_2014], axis=0, ignore_index=True)

#Find the row numbers that should be used for training and testing.
train_idx = np.array(dta.Date < '2014-01-01')
test_idx = np.array(dta.Date >= '2014-01-01')

#Arrays with the match results (H, D or A)
results_train = np.array(dta.FTR[train_idx])
results_test = np.array(dta.FTR[test_idx])


Next we need to decide which columns we want to use as predictors. I wrote earlier that I wanted to use the odds for the different outcomes. Asian handicap odds could be included as well, but to keep things simple I am not doing this now.

feature_columns = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
                   'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
                   'SOH', 'SOD', 'SOA', 'SBH', 'SBD', 'SBA', 'SJH', 'SJD',
                   'SJA', 'SYH', 'SYD', 'SYA', 'VCH', 'VCD', 'VCA', 'WHH',
                   'WHD', 'WHA']


For some bookmakers the odds for certain matches are missing. In this data this is not much of a problem, but it could be worse in other data sets. Missing data is a problem because the algorithms will not work when some values are missing. Instead of removing the matches where this is the case, we can guess the values that are missing. As a rule of thumb, an approximate value for some variables of an observation is often better than dropping the observation completely. This is called imputation, and scikit-learn comes with functionality for doing it for us.

The strategy I am using here is to fill in the missing values with the mean of the odds for the same outcome. For example, if the home win odds from one bookmaker are missing, our guess is the average of the home win odds from the other bookmakers for that match. Doing this demands some more work, since we have to split the data matrix in three.

from sklearn.preprocessing import Imputer

#Column numbers for odds for the three outcomes
cidx_home = [i for i, col in enumerate(dta.columns) if col[-1] == 'H' and col in feature_columns]
cidx_draw = [i for i, col in enumerate(dta.columns) if col[-1] == 'D' and col in feature_columns]
cidx_away = [i for i, col in enumerate(dta.columns) if col[-1] == 'A' and col in feature_columns]

#The three feature matrices for training
feature_train_home = dta.ix[train_idx, cidx_home].as_matrix()
feature_train_draw = dta.ix[train_idx, cidx_draw].as_matrix()
feature_train_away = dta.ix[train_idx, cidx_away].as_matrix()

#The three feature matrices for testing
feature_test_home = dta.ix[test_idx, cidx_home].as_matrix()
feature_test_draw = dta.ix[test_idx, cidx_draw].as_matrix()
feature_test_away = dta.ix[test_idx, cidx_away].as_matrix()

train_arrays = [feature_train_home, feature_train_draw,
                feature_train_away]

test_arrays = [feature_test_home, feature_test_draw,
               feature_test_away]

imputed_training_matrices = []
imputed_test_matrices = []

for idx, farray in enumerate(train_arrays):
    imp = Imputer(strategy='mean', axis=1) #axis 0: columns, 1: rows
    farray = imp.fit_transform(farray)
    test_arrays[idx] = imp.fit_transform(test_arrays[idx])

    imputed_training_matrices.append(farray)
    imputed_test_matrices.append(test_arrays[idx])

#merge the imputed arrays
feature_train = np.concatenate(imputed_training_matrices, axis=1)
feature_test = np.concatenate(imputed_test_matrices, axis=1)


Now we are finally ready to use the data to train the algorithm. First an AdaBoostClassifier object is created, and here we need to supply a set of arguments for it to work properly. The first argument is the classification algorithm to use, which is the DecisionTreeClassifier algorithm. I have chosen to supply this algorithm with the max_depth=3 argument, which constrains the training algorithm to not apply more than three rules before making a prediction.

The n_estimators argument tells the algorithm how many decision trees it should fit, and the learning_rate argument tells the algorithm how much the misclassified matches are going to be up-weighted in the next round of decision tree fitting. These two values are usually something you can experiment with, since there is no definite rule on how they should be set. The rule of thumb is that the lower the learning rate is, the more estimators you need.

The last argument, random_state, is something that should be given if you want to reproduce the model fitting. If this is not specified you will end up with a slightly different trained algorithm each time you fit it. See this question on Stack Overflow for an explanation.

At last the algorithm is fitted using the fit() method, which is supplied with the odds and match results.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adb = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=1000,
    learning_rate=0.4, random_state=42)

adb = adb.fit(feature_train, results_train)



We can now see how well the trained algorithm fits the training data.

import sklearn.metrics as skm

training_pred = adb.predict(feature_train)
print skm.confusion_matrix(list(training_pred), list(results_train))


This is the resulting confusion matrix:

          Away  Draw  Home
    Away   164     1     0
    Draw     1   152     0
    Home     0     0   152

We see that only two matches in the training data are misclassified: one away win that was predicted to be a draw, and one draw that was predicted to be an away win. Normally, with such a good fit we should be wary of overfitting and poor predictive power on new data.

Let’s try to predict the outcome of the Premier League matches from January to May 2014:

test_pred = adb.predict(feature_test)
print skm.confusion_matrix(list(test_pred), list(results_test))

          Away  Draw  Home
    Away    31    19    12
    Draw    13    10    22
    Home    20    14    59

It successfully predicted the right match outcome in half of the matches (100 of 200).
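The overall accuracy can be read off the confusion matrix directly, since the correct predictions sit on the diagonal:

```python
import numpy as np

# The confusion matrix for the test predictions above
cm = np.array([[31, 19, 12],
               [13, 10, 22],
               [20, 14, 59]])

accuracy = np.trace(cm) / float(cm.sum())  # correct predictions / all predictions
print(accuracy)  # 0.5
```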

# Identifying gender bias in candidate lists in proportional representation elections

The Norwegian parliamentary elections use a system of proportional representation. Each county has a number of seats in parliament (based on its number of inhabitants and its area), and the number of seats given to each party is almost proportional to the number of votes the party receives in that county. Since each party can win more than one seat, the parties have to prepare a ranked list of people to be elected, where the top name is given the first seat, the second name is given the second seat, etc.

Proportional representation systems like the Norwegian one have been shown to be associated with greater gender balance in parliaments than other systems (see table 1 in this paper). The proportion of women in the Norwegian Storting has also increased over the last 30 years:

Data source: Statistics Norway, table 08219.

At the 1981 election, 26% of the elected representatives were women. At the 2013 election, the proportion was almost 40%. One mechanism that can explain this persistent female underrepresentation is that men are overrepresented at the top of the electoral lists. Inspired by a bioinformatics method called Gene Set Enrichment Analysis (GSEA), I am going to put this hypothesis to the test.

The method is rather simple. Explained in general terms, this is how it works: First you need to calculate a score which represents the degree of overrepresentation of a category near the top of the list. Each time you encounter an instance belonging to the category you are testing, you increase the score; otherwise you decrease it. To make the score a measure of overrepresentation at the top of the list, the increases and decreases must be weighted accordingly. The maximum of this ‘running sum’ is the test statistic. Here I have chosen the weight function $$\frac{1}{\sqrt{i}}$$, where i is the position of the candidate on the list (number 1 is the top candidate).

To calculate the p-value, the same thing is done repeatedly with different random permutations of the list. The proportion of times the score from these randomizations is greater than or equal to the observed score is then the p-value.
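A minimal sketch of how this could be implemented (the function names are mine, and a candidate list is represented as a sequence of 'M'/'F' strings):

```python
import numpy as np

def top_bias_score(genders, category='M'):
    #Running sum: +1/sqrt(i) for each candidate in the category,
    #-1/sqrt(i) otherwise, where i is the 1-based list position.
    #The maximum of the running sum is the test statistic.
    score, max_score = 0.0, 0.0
    for i, g in enumerate(genders, start=1):
        w = 1.0 / np.sqrt(i)
        score += w if g == category else -w
        max_score = max(max_score, score)
    return max_score

def permutation_pvalue(genders, category='M', n_perm=10000, seed=1):
    #P-value: the proportion of random shufflings of the list with a
    #score greater than or equal to the observed score.
    rng = np.random.RandomState(seed)
    observed = top_bias_score(genders, category)
    shuffled = list(genders)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if top_bias_score(shuffled, category) >= observed:
            count += 1
    return observed, count / float(n_perm)
```

For a list like ['M', 'M', 'F', 'F'] the observed score is 1 + 1/√2 ≈ 1.71, the highest achievable for that gender composition, so the p-value is simply the probability that a random shuffle also puts both men on top.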

I am going to use this method on the election lists from Hordaland county from the 1981 and 2013 election. Hordaland had 15 seats in 1981, and 16 seats in 2013. 3 (20 %) women were elected in 1981 and 5 (31.3 %) in 2013. The election lists are available from the Norwegian Social Science Data Services and the National Library of Norway.

Here are the results for each party at the two elections:

    Party  2013          1981
    Ap     1     (0.43)  3.58 (0.49)
    Frp    3.28  (0.195) 3.56 (0.49)
    H      1.018 (0.66)  3.17 (0.35)
    Krf    1.24  (0.43)  2.32 (0.138)
    Sp     2.86  (0.49)  2.86 (0.48)
    Sv     1     (0.24)  0.29 (0.72)
    V      1.49  (0.59)  1.37 (0.29)

The number shown is the score, with the p-value in parentheses. A higher score means a greater overrepresentation of men at the top of the list.

Even if we ignore problems with multiple testing, none of the parties have a significant overrepresentation of men at the top of the list if the traditional significance threshold of $$p \le 0.05$$ is used. This is perhaps unexpected, as at least the gender balance among the elected candidates after the 1981 election is significantly biased (p = 0.018, one-sided exact binomial test).

This really tells us that the method is not powerful enough to make inferences about these kinds of data. I think one possible improvement would be to somehow score all the lists in combination to find an overall gender bias. One could also try a different null model. The one I have used here randomly shuffles the list in question, maintaining the bias in the gender ratio (if any). Instead, the observed score could be compared to random samplings where each gender is sampled with equal probability.

My final thought is that this whole significance testing approach is inappropriate. Even if the bias is statistically insignificant, it is still there to influence the gender ratio of the elected members of parliament. From looking at some of the lists and their scores, I would say that all scores greater than 1 at least indicate a positive bias towards having more men at the top.

# The R code for the home field advantage and traveling distance analysis.

I was asked in the comments on my post Does traveling distance influence home field advantage? to provide the R code I used, because Klemens of the rationalsoccer blog wanted to do the analysis on some of his own data. I have refactored it a bit to make it easier to use.

First load the data with the coordinates I posted last year.

dta.stadiums <- read.csv('stadiums.csv')


I also assume you have data formatted like the data from football-data.co.uk in a data frame called dta.matches.

First we need a way to calculate the distance (in kilometers) between two coordinates. This is a function that does that.

coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
  #Calculates the distance (in kilometers) between two WGS84 coordinates.
  #
  #http://en.wikipedia.org/wiki/Haversine_formula
  #http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
  dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
  dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
  h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
  c <- 2 * asin(pmin(1, sqrt(h)))
  d <- radius * c
  return(d)
}


Next, we need to find the coordinates where each match is played, and the coordinates for where the visiting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.

coord.home <- dta.stadiums[match(dta.matches$HomeTeam, dta.stadiums$FDCOUK),
                           c('Latitude', 'Longitude')]
coord.away <- dta.stadiums[match(dta.matches$AwayTeam, dta.stadiums$FDCOUK),
                           c('Latitude', 'Longitude')]

dta.matches$Distance <- coordinate.distance(coord.home$Latitude, coord.home$Longitude,
                                            coord.away$Latitude, coord.away$Longitude)


Here are two functions that are needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the home field advantage per match.

avgerage.gd <- function(dta){
  #Calculates the average goal difference for each team.
  all.teams <- unique(c(levels(dta$HomeTeam), levels(dta$AwayTeam)))
  average.goal.diff <- numeric(length(all.teams))
  names(average.goal.diff) <- all.teams

  for (t in all.teams){
    idxh <- which(dta$HomeTeam == t)
    goals.for.home <- dta[idxh, 'FTHG']
    goals.against.home <- dta[idxh, 'FTAG']

    idxa <- which(dta$AwayTeam == t)
    goals.for.away <- dta[idxa, 'FTAG']
    goals.against.away <- dta[idxa, 'FTHG']

    n.matches <- length(idxh) + length(idxa)
    total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) -
      sum(goals.against.home) - sum(goals.against.away)
    average.goal.diff[t] <- total.goal.difference / n.matches
  }
  return(average.goal.diff)
}

matchwise.hfa <- function(dta, avg.goaldiff){
  #Calculates the matchwise home field advantage based on the average goal
  #difference for each team.
  n.matches <- dim(dta)[1]
  hfa <- numeric(n.matches)

  for (idx in 1:n.matches){
    hometeam.avg <- avg.goaldiff[dta[idx,'HomeTeam']]
    awayteam.avg <- avg.goaldiff[dta[idx,'AwayTeam']]
    expected.goal.diff <- hometeam.avg - awayteam.avg
    observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG']
    hfa[idx] <- observed.goal.diff - expected.goal.diff
  }
  return(hfa)
}


In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (puh!). The home field advantage is put into the new column HFA.

dta.matches$HFA <- numeric(dim(dta.matches)[1])
seasons <- levels(dta.matches$Season)

for (i in 1:length(seasons)){
  season.l <- dta.matches$Season == seasons[i]
  h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
  dta.matches$HFA[season.l] <- h
}


At last we can do the linear regression and make a nice little plot.

m <- lm(HFA ~ Distance, data=dta.matches)
summary(m)

plot(dta.matches$Distance, dta.matches$HFA,
     xlab='Distance (km)', ylab='Difference from expected goals',
     main='Home field advantage vs traveling distance')
abline(m, col='red')


# Poor man’s parallel processing

Here’s a nice trick I learned on how you could implement simple parallel processing capabilities to speed up computations. This trick is only applicable in certain simple cases though, and does not scale very well, so it is best used in one-off scripts rather than in scripts that are used routinely or by others.

Suppose you have a list or an array that you are going to loop through. Each of the elements in the list takes a long time to process, and each iteration is NOT dependent on the result of any of the previous iterations. This is exactly the kind of situation where this trick is applicable.

The trick is to save the result for each iteration in a file whose name is unique to the iteration, and at the beginning of each iteration you simply check if that file already exists. If it does, the script skips to the next iteration. If it doesn’t, you create the file. This way you could run many instances of the script simultaneously, without doing the same iteration twice.

With this trick the results will be spread across different files, but if they are named and formatted in a consistent way it is not hard to go through the files and merge them into a single file.

Here is how it could be done in python:

import os.path

myList = ['bill', 'george', 'barack', 'ronald']

for president in myList:

    fileName = 'result_{}'.format(president)

    if os.path.isfile(fileName):
        print('File {} already exists, continues to the next iteration'.format(fileName))
        continue

    f = open(fileName, 'w')

    #myResults is the object where your results are stored
    f.write(myResults)
    f.close()



And in R:


myList <- c('bill', 'george', 'barack', 'ronald')

for (president in myList){

  file.name <- paste('results', president, sep='_')

  if (file.exists(file.name)){
    cat('File', file.name, 'already exists, continues to the next iteration\n')
    next
  }

  file.create(file.name)

  #Save the my.result object
  save(my.result, file=file.name)
}


# Gender differences in ski jumping at the Olympics

I had a discussion with some friends the other day about separate sports competitions for men and women. In some sports, like curling, it seems rather unnecessary to have separate competitions. At least assuming the reason for gendered competitions is that being a male or a female may give the competitor an obvious advantage. One sport where we didn’t think it was obvious was ski jumping, so I decided to look at some numbers.

This year’s Olympics was the first time women competed in ski jumping, so I decided to do a quick comparison of the results from the final round of the men’s and women’s finals.

This is what I came up with:

What we see are the estimated distributions of the jump distances for men and women. The mode for the women seems to be a little lower than the mode for the men. We also see that there is much more variability among the women jumpers than among the men, and that the women’s distribution has a longer right tail. Still, it looks like the best female jumpers are on par with the best male jumpers, and vice versa.

The numbers I used here are not adjusted for wind conditions and other relevant factors, so I will not draw any firm conclusions. I hope to have time to look more into this later, using data from more competitions, adjusting for wind etc.

# The minimum violations ranking method

One informative benchmark when ranking and rating sports teams is how many times the ranking has been violated. A ranking violation occurs when a team beats a higher ranked team. Ideally no violations would occur, but in practice this rarely happens. In many cases it is unavoidable, for example in this three team competition: Team A beats team B, team B beats team C, and team C beats team A. In this case, for any of the 6 possible rankings of these three teams at least one violation would occur.
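The three-team example can be checked by brute force; this little sketch (my own, not from any ranking package) enumerates all six rankings and counts the violations of each:

```python
from itertools import permutations

# Each pair is (winner, loser) from the example above
results = [('A', 'B'), ('B', 'C'), ('C', 'A')]

def violations(ranking, results):
    #A violation occurs when the winner is ranked below the loser
    pos = {team: i for i, team in enumerate(ranking)}
    return sum(1 for winner, loser in results if pos[winner] > pos[loser])

min_violations = min(violations(r, results) for r in permutations('ABC'))
print(min_violations)  # 1: no ranking of A, B and C avoids all violations
```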

Inspired by this, one could try to construct a ranking with as few violations as possible. A minimum violations ranking (MVR), as it is called. The idea is simple and intuitive, and has been put to use in ranking American college sport teams. The MinV ranking by Jay Coleman is one example.

MV rankings have some other nice properties besides being an intuitive measure. A MV ranking is the best ranking in terms of backwards prediction: no other ranking correctly orders more of the matches already played. It can also be a method for combining several other rankings, by using the other rankings as the data.

Despite this, I don’t think MV rankings are that useful in the context of football. The main reason for this is that football has a large number of draws, and as far as I can tell a draw has no influence on a MV ranking. A draw is therefore equivalent to no game at all and provides no information.

MV rankings also have another problem. In many cases there can be several rankings that satisfy the MV criterion. This of course depends on the data, but it seems nevertheless to be quite common, such as in the small example above.

Unfortunately, I have not found any software packages that can find a MV ranking. One algorithm is described in this paper (paywall), but I haven’t tried to implement it myself. Most other MVR methods I have seen seem to be based on defining a set of mathematical constraints and then letting some optimization software search for solutions. See this paper for an example.

# Does traveling distance influence home field advantage?

A couple of weeks ago I posted a data set with the location of the stadiums for many of the football teams in Europe. One thing I wanted to use the dataset for was to see if the traveling distance between two teams (as measured by the distance between the two team’s home stadium) influenced home field advantage.

To calculate the home field advantage for each match I did the following: For each team, the average goal difference during the season is calculated (goals scored minus goals conceded, divided by the number of matches). The expected goal difference for a match is then the difference between the two teams’ average goal differences (home minus away). The home field advantage is the observed goal difference minus the expected goal difference.

In the 2012-13 Premier League season, for example, Chelsea scored 75 goals and conceded 39 goals in total. Everton scored 55 and conceded 40 goals. Both teams played 38 matches during the season. On average Chelsea had a goal difference per match of 0.947, and Everton’s average was 0.395. With Chelsea meeting Everton at home, the expected goal difference is 0.947 – 0.395 = 0.553. The actual outcome of this match was 2-1, a goal difference of 1. The home field advantage for this match is then 1 – 0.553 = 0.447.
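The worked example above as a few lines of Python (numbers taken straight from the text):

```python
# Chelsea vs Everton, 2012-13 Premier League (38 matches each)
chelsea_avg = (75 - 39) / 38.0       # Chelsea's average goal difference per match
everton_avg = (55 - 40) / 38.0       # Everton's average goal difference per match

expected_gd = chelsea_avg - everton_avg   # expected goal difference, Chelsea at home
observed_gd = 2 - 1                       # the match ended 2-1

hfa = observed_gd - expected_gd           # home field advantage for this match
print(round(hfa, 3))  # 0.447
```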

Using data from the 2011-12 and 2012-13 seasons of the top divisions in Spain, France and Germany, and the 2012-13 season in England, I used the stadium coordinates to calculate the traveling distance for the visiting team and the home field advantage for each match. Plotting these two against each other, and drawing the least squares line, gives this:

There is a great deal of noise in this plot, to put it mildly. The slope of the red line is 0.00006039. This is the estimated increase in the number of goals the home team scores for each kilometer the away team has traveled. It is not significantly different from 0 (p-value = 0.646). The intercept, where the red line crosses the vertical axis, is 0.4, meaning that the home team is estimated to score 0.4 more goals than expected if the opposing team has traveled 0 kilometers. This is highly significant (p-value = 1.71e-11).

To be honest, I am a bit surprised to see such a clear lack of effect of traveling distance. I did not expect a particularly strong, or even very significant effect, but I had hoped to see at least a hint at something. Perhaps one reason for the lack of effect is that traveling distance is not necessarily the same as traveling time as longer distances may be covered by air, making them comparable to shorter travels by land.

It should be kept in mind that these results should only apply to the leagues included in the data. It could be that traveling distance could have a significant effect on longer distances, for example in international competitions such as the Champions League or between national teams.

# BBC’s More Or Less on why the men’s FIFA rankings fail

One of the podcasts I listen to regularly, ‘More Or Less’ from the BBC, had the other day an episode about the (men’s) FIFA rankings. In the episode they discuss a shortcoming in the ranking system that makes it possible for a team to lose points (and thus ranking positions) despite winning a match. The reason for this is not fully explained, but looking closer at the descriptions provided at fifa.com I think I see where the problem lies. After each match, rating points are given to the winner (or split if there is a draw). The crucial thing here is that friendly matches (or other non-important matches) give fewer points than important tournament matches. The published ratings are then basically an average of the points earned for the matches played in the last couple of years. That means that winning a friendly match will sometimes yield fewer points than a team’s average, thus decreasing the average.
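A small numeric illustration of the problem (the point values here are made up; only the averaging mechanism matters):

```python
# Points earned in recent important matches (made-up numbers)
past_points = [1800, 1700, 1750]
avg_before = sum(past_points) / float(len(past_points))

# A friendly win awards far fewer points than an important match
friendly_win = 500
avg_after = (sum(past_points) + friendly_win) / float(len(past_points) + 1)

print(avg_before, avg_after)  # the average, and thus the rating, drops after a win
```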

Unfortunately the episode did not mention the women’s FIFA ranking system which is based on the much better Elo system, used in chess rankings (and which I have written about previously). In this sort of system a win will almost surely give more points, and not less (the worst case scenario for a win is that no points are earned).
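For comparison, here is the basic Elo update rule (a generic sketch; the women’s FIFA ranking uses a modified version with football-specific adjustments):

```python
def elo_update(rating, opp_rating, score, k=30):
    """One Elo update. score: 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected = 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))
    return rating + k * (score - expected)

# A win can never lower the rating, since the expected score is always below 1
print(elo_update(1500, 1500, 1))  # 1515.0
```

In the worst case (a huge favorite beating a very weak opponent) the expected score is close to 1 and the winner gains almost nothing, but never loses points.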

# Dataset: Football stadiums with geographic coordinates

Here is a dataset I have put together with the locations and capacities of the stadiums of about 130 European teams. The teams are from England, Scotland, France, Germany and Spain. The data is taken from Wikipedia and should be correct for the last couple of seasons. The French team Lille’s stadium is the current one from the 2012 season, and Nice’s stadium is not the current one, but the one they had until the end of last season.

I have also added a column with the team names as they are used in the data from football-data.co.uk.

Some of the coordinates are more accurate than others, but I think they should be accurate enough to at least give an indication of the town the team comes from. That probably holds for the teams that have changed stadiums as well, since the new stadium is most likely in the same town.

What can this data set be used for? One thing I want to look into is whether traveling distance for the visiting team in a match influences the home field advantage. I have a couple of other ideas as well, but that will be for another time.