Predicting football results with Adaptive Boosting

Adaptive Boosting, usually referred to by the abbreviation AdaBoost, is perhaps the best general-purpose machine learning method around for classification. It is what’s called a meta-algorithm, since it relies on other algorithms to do the actual prediction. What AdaBoost does is combine a large number of such algorithms in a smart way: First a classification algorithm is trained, or fitted, or its parameters are estimated, on the data. The data points that the algorithm misclassifies are then given more weight as the algorithm is trained again. This procedure is repeated a large number of times (perhaps many thousands of times). When making predictions based on a new set of data, each of the fitted algorithms predicts the new response value, and the value that gets the most (weighted) votes is then considered the overall prediction. Of course there are more details to AdaBoost than this brief summary. I can recommend the book The Elements of Statistical Learning by Hastie, Tibshirani and Friedman for a good introduction to AdaBoost, and to machine learning in general.
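To make this a bit more concrete, here is a rough, hand-written sketch of the reweight-and-vote loop described above (roughly the multi-class SAMME variant of AdaBoost). It is only meant as an illustration, with function and variable names of my own choosing; later in the post scikit-learn does all of this for us.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=50, learning_rate=1.0):
	#Start with every observation carrying the same weight
	n_obs = len(y)
	classes = np.unique(y)
	weights = np.full(n_obs, 1.0 / n_obs)
	trees, alphas = [], []
	for _ in range(n_rounds):
		tree = DecisionTreeClassifier(max_depth=1)
		tree.fit(X, y, sample_weight=weights)
		miss = (tree.predict(X) != y)
		err = max(np.dot(weights, miss), 1e-10)
		if err >= 1.0 - 1.0 / len(classes):
			break #no better than random guessing, stop
		#Trees with a low error rate get a bigger say in the final vote
		alpha = learning_rate * (np.log((1.0 - err) / err) + np.log(len(classes) - 1.0))
		weights *= np.exp(alpha * miss) #up-weight the misclassified points
		weights /= weights.sum()
		trees.append(tree)
		alphas.append(alpha)
	return trees, alphas, classes

def adaboost_vote(trees, alphas, classes, X):
	#Each fitted tree casts a vote for a class, weighted by its alpha
	votes = np.zeros((len(X), len(classes)))
	for tree, alpha in zip(trees, alphas):
		pred = tree.predict(X)
		for k, cls in enumerate(classes):
			votes[pred == cls, k] += alpha
	return classes[np.argmax(votes, axis=1)]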

Although any classification algorithm can be used with AdaBoost, it is most commonly used with decision trees. Decision trees are intuitive models that make predictions based on a combination of simple rules. These rules are usually of the form “if predictor variable x is greater than a value y, then do this, if not, do that”. By “do this” and “do that” I mean continue to a different rule of the same form, or make a prediction. This cascade of different rules can be visualized with a chart that looks sort of like a tree, hence the tree metaphor in the name. Of course Wikipedia has an article, but The Elements of Statistical Learning has a nice chapter about trees too.
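To make the rule cascade concrete, here is a made-up example of the kind of rules a small tree could encode for a football match, written out as plain Python. The variables and thresholds are invented purely for illustration; a real tree is of course estimated from the data.

#A made-up illustration of the rules encoded by a small decision tree
def predict_result(home_odds, away_odds):
	#First rule: do the bookmakers doubt the home side?
	if home_odds > 2.5:
		#Second rule: is the away side a clear favourite?
		if away_odds < 2.0:
			return 'A' #predict away win
		else:
			return 'D' #predict draw
	else:
		return 'H' #predict home win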

In this post I am going to use decision trees and AdaBoost to predict the results of football matches. As features, or predictors, I am going to use the published odds from different betting companies, which are available from football-data.co.uk. I am going to use data from the 2012-13 season and the first half of the 2013-14 season of the English Premier League to train the model, and then I am going to predict the remaining matches from the 2013-14 season.

Implementing the algorithms myself would of course take a lot of time, but luckily they are available through the excellent Python scikit-learn package. This package contains lots of machine learning algorithms plus excellent documentation with a lot of examples. I am also going to use the pandas package for loading the data.

import numpy as np
import pandas as pd

dta_fapl2012_2013 = pd.read_csv('FAPL_2012_2013_2.csv', parse_dates=[1])
dta_fapl2013_2014 = pd.read_csv('FAPL_2013-2014.csv', parse_dates=[1])

dta = pd.concat([dta_fapl2012_2013, dta_fapl2013_2014], axis=0, ignore_index=True)

#Boolean masks for the rows that should be used for training and testing.
train_idx = np.array(dta.Date < '2014-01-01')
test_idx = np.array(dta.Date >= '2014-01-01')

#Arrays with the match results (FTR is the full-time result: H, D or A)
results_train = np.array(dta.FTR[train_idx])
results_test = np.array(dta.FTR[test_idx])
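Before going further it can be useful to check how many matches ended up in each set, and how the match outcomes are distributed in the training data; something like this:

#Number of matches in the training and test sets
print(train_idx.sum())
print(test_idx.sum())

#Distribution of full-time results (H, D, A) in the training data
print(dta.FTR[train_idx].value_counts())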

Next we need to decide which columns we want to use as predictors. I wrote earlier that I wanted to use the odds for the different outcomes. In the data from football-data.co.uk each odds column is named with an abbreviation for the bookmaker followed by H, D or A for home win, draw and away win odds (B365H is, for example, Bet365’s odds for a home win). Asian handicap odds could be included as well, but to keep things simple I am not doing this now.

feature_columns = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
					'IWD', 'IWA','LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
					'SOH', 'SOD', 'SOA', 'SBH', 'SBD', 'SBA', 'SJH', 'SJD',
					'SJA', 'SYH', 'SYD','SYA', 'VCH', 'VCD', 'VCA', 'WHH',
					'WHD', 'WHA']

For some bookmakers the odds for certain matches are missing. In this data set this is not much of a problem, but it could be worse in other data. Missing data is a problem because the algorithms will not work when some values are missing. Instead of removing the matches where this is the case, we can guess the values that are missing. As a rule of thumb, an approximate value for some of the variables of an observation is often better than dropping the observation completely. This is called imputation, and scikit-learn comes with functionality for doing this for us.

The strategy I am using here is to fill in the missing values with the mean of the odds for the same outcome in the same match. For example, if the odds for a home win from one bookmaker are missing, our guess of those odds is going to be the average of the odds for a home win from the other bookmakers for that match. Doing this demands some more work, since we have to split the data matrix in three.

from sklearn.preprocessing import Imputer

#Column numbers for odds for the three outcomes 
cidx_home = [i for i, col in enumerate(dta.columns) if col in feature_columns and col.endswith('H')]
cidx_draw = [i for i, col in enumerate(dta.columns) if col in feature_columns and col.endswith('D')]
cidx_away = [i for i, col in enumerate(dta.columns) if col in feature_columns and col.endswith('A')]

#The three feature matrices for training
feature_train_home = dta.ix[train_idx, cidx_home].as_matrix()
feature_train_draw = dta.ix[train_idx, cidx_draw].as_matrix()
feature_train_away = dta.ix[train_idx, cidx_away].as_matrix()

#The three feature matrices for testing
feature_test_home = dta.ix[test_idx, cidx_home].as_matrix()
feature_test_draw = dta.ix[test_idx, cidx_draw].as_matrix()
feature_test_away = dta.ix[test_idx, cidx_away].as_matrix()

train_arrays = [feature_train_home, feature_train_draw,
				feature_train_away]
									
test_arrays = [feature_test_home, feature_test_draw,
				feature_test_away]

imputed_training_matrices = []
imputed_test_matrices = []

for idx, farray in enumerate(train_arrays):
	#axis=1 imputes along rows, so a missing value is replaced by the mean
	#of the other bookmakers' odds for the same match
	imp = Imputer(strategy='mean', axis=1)
	farray = imp.fit_transform(farray)
	#the row-wise imputation is done independently for the test matrices
	test_arrays[idx] = imp.fit_transform(test_arrays[idx])
	
	imputed_training_matrices.append(farray)
	imputed_test_matrices.append(test_arrays[idx])

#merge the imputed arrays
feature_train = np.concatenate(imputed_training_matrices, axis=1)
feature_test = np.concatenate(imputed_test_matrices, axis=1)
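Before moving on it is worth checking that the imputation actually got rid of all the missing values; both of these should print False:

print(np.isnan(feature_train).any())
print(np.isnan(feature_test).any())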

Now we are finally ready to use the data to train the algorithm. First an AdaBoostClassifier object is created, and here we need to supply a set of arguments for it to work properly. The first argument is the classification algorithm to use, which is the DecisionTreeClassifier algorithm. I have chosen to supply this algorithm with the max_depth=3 argument, which constrains the training algorithm to not apply more than three rules before making a prediction.

The n_estimators argument tells the algorithm how many decision trees it should fit, and the learning_rate argument tells the algorithm how much the misclassified matches are going to be up-weighted in the next round of decision tree fitting. These two values are usually something you can experiment with, since there is no definite rule on how they should be set (a small cross-validation sketch of such experimentation follows after the model fitting below). The rule of thumb is that the lower the learning rate is, the more estimators you need.

The last argument, random_state, is something that should be given if you want to be able to reproduce the model fitting. If it is not specified you will end up with a slightly different trained algorithm each time you fit it. See this question on Stack Overflow for an explanation.

At last the algorithm is fitted using the fit() method, which is supplied with the odds and match results.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adb = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=1000,
    learning_rate=0.4, random_state=42)

adb = adb.fit(feature_train, results_train)
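As mentioned above, there is no definite rule for how to set n_estimators and learning_rate, so cross-validation is a reasonable way to experiment with them. Here is a minimal sketch using scikit-learn’s GridSearchCV; the parameter grid is just an example of mine, and with many trees this can take quite a while to run:

from sklearn.grid_search import GridSearchCV #sklearn.model_selection in newer versions

param_grid = {'n_estimators': [200, 500, 1000],
				'learning_rate': [0.2, 0.4, 0.8]}

grid = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), random_state=42),
					param_grid, cv=5)
grid = grid.fit(feature_train, results_train)

print(grid.best_params_)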

We can now see how well the trained algorithm fits the training data.

import sklearn.metrics as skm

training_pred = adb.predict(feature_train)
print(skm.confusion_matrix(list(training_pred), list(results_train)))

This is the resulting confusion matrix, where the rows are the predicted outcomes and the columns are the actual results:

     Away Draw Home
Away  164    1    0
Draw    1  152    0
Home    0    0  152

We see that only two matches in the training data are misclassified: one away win that was predicted to be a draw, and one draw that was predicted to be an away win. Normally, with such a good fit we should be wary of overfitting and poor predictive power on new data.

Let’s try to predict the outcome of the Premier League matches from January to May 2014:

test_pred = adb.predict(feature_test)
print(skm.confusion_matrix(list(test_pred), list(results_test)))

This gives the following confusion matrix, again with the predicted outcomes as rows and the actual results as columns:

     Away Draw Home
Away   31   19   12
Draw   13   10   22
Home   20   14   59

It predicted the right match outcome in about half of the matches (100 of the 200 matches in the test set).
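The accuracy can also be computed directly:

print(skm.accuracy_score(list(results_test), list(test_pred)))

For comparison, the naive strategy of always predicting a home win would have been right in 93 of the 200 matches, so the model does at least a little better than that.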