Predicting football results with Adaptive Boosting

Adaptive Boosting, usually referred to by the abbreviation AdaBoost, is perhaps the best general machine learning method around for classification. It is what’s called a meta-algorithm, since it relies on other algorithms to do the actual prediction. What AdaBoost does is combining a large number of such algorithms in a smart way: First a classification algorithm is trained, or fitted, or its parameters are estimated, to the data. The data points that the algorithm misclassifies are then given more weight as the algorithm is trained again. This procedure is repeated a large number of times (perhaps many thousand times). When making predictions based on a new set of data, each of the fitted algorithms predict the new response value, and a the most commonly predicted value is then considered the overall prediction. Of course there are more details surrounding the AdaBoost than this brief summary. I can recommend the book The Elements of Statistical Learning by Hasite, Tibshirani and Friedman for a good introduction to AdaBoost, and machine learning in general.

Although any classification algorithm can be used with AdaBoost, it is most commonly used with decision trees. Decision trees are intuitive models that make predictions based on a combination of simple rules. These rules are usually of the form “if predictor variable x is greater than a value y, then do this, if not, do that”. By “do this” and “do that” I mean continue to a different rule of the same form, or make a prediction. This cascade of different rules can be visualized with a chart that looks sort of like a tree, hence the tree metaphor in the name. Of course Wikipedia has an article, but The Elements of Statistical Learning has a nice chapter about trees too.

In this post I am going to use decision trees and AdaBoost to predict the results of football matches. As features, or predictors I am going to use the published odds from different betting companies, which is available from football-data.co.uk. I am going to use data from the 2012-13 and first half of the 2013-14 season of the English Premier League to train the model, and then I am going to predict the remaining matches from the 2013-14 season.

Implementing the algorithms by myself would of course take a lot of time, but luckily they are available trough the excellent Python scikit-learn package. This package contains lots of machine learning algorithms plus excellent documentation with a lot of examples. I am also going to use the pandas package for loading the data.

import numpy as np
import pandas as pd

dta_fapl2012_2013 = pd.read_csv('FAPL_2012_2013_2.csv', parse_dates=[1])
dta_fapl2013_2014 = pd.read_csv('FAPL_2013-2014.csv', parse_dates=[1])

dta = pd.concat([dta_fapl2012_2013, dta_fapl2013_2014], axis=0, ignore_index=True)

#Find the row numbers that should be used for training and testing.
train_idx = np.array(dta.Date < '2014-01-01')
test_idx = np.array(dta.Date >= '2014-01-01')

#Arrays where the match results are stored in
results_train = np.array(dta.FTR[train_idx])
results_test = np.array(dta.FTR[test_idx])

Next we need to decide which columns we want to use as predictors. I wrote earlier that I wanted to use the odds for the different outcomes. Asian handicap odds could be included as well, but to keep things simple I am not doing this now.

feature_columns = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
					'IWD', 'IWA','LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
					'SOH', 'SOD', 'SOA', 'SBH', 'SBD', 'SBA', 'SJH', 'SJD',
					'SJA', 'SYH', 'SYD','SYA', 'VCH', 'VCD', 'VCA', 'WHH',
					'WHD', 'WHA']

For some bookmakers the odds for certain matches is missing. In this data this is not much of a problem, but it could be worse in other data. Missing data is a problem because the algorithms will not work when some values are missing. Instead of removing the matches where this is the case we can instead guess the value that is missing. As a rule of thumb we can say that an approximate value for some variables of an observation is often better than dropping the observation completely. This is called imputation and scikit-learn comes with functionality for doing this for us.

The strategy I am using here is to fill inn the missing values by the mean of the odds for the same outcome. For example if the odds for home win from one bookmaker is missing, our guess of this odds is going to be the average of the odds for home win from the other bookmakers for that match. Doing this demands some more work since we have to split the data matrix in three.

from sklearn.preprocessing import Imputer

#Column numbers for odds for the three outcomes 
cidx_home = [i for i, col in enumerate(dta.columns) if col[-1] in 'H' and col in feature_columns]
cidx_draw = [i for i, col in enumerate(dta.columns) if col[-1] in 'D' and col in feature_columns]
cidx_away = [i for i, col in enumerate(dta.columns) if col[-1] in 'A' and col in feature_columns]

#The three feature matrices for training
feature_train_home = dta.ix[train_idx, cidx_home].as_matrix()
feature_train_draw = dta.ix[train_idx, cidx_draw].as_matrix()
feature_train_away = dta.ix[train_idx, cidx_away].as_matrix()

#The three feature matrices for testing
feature_test_home = dta.ix[test_idx, cidx_home].as_matrix()
feature_test_draw = dta.ix[test_idx, cidx_draw].as_matrix()
feature_test_away = dta.ix[test_idx, cidx_away].as_matrix()

train_arrays = [feature_train_home, feature_train_draw,
				feature_train_away]
									
test_arrays = [feature_test_home, feature_test_draw,
				feature_test_away]

imputed_training_matrices = []
imputed_test_matrices = []

for idx, farray in enumerate(train_arrays):
	imp = Imputer(strategy='mean', axis=1) #0: column, 1:rows
	farray = imp.fit_transform(farray)
	test_arrays[idx] = imp.fit_transform(test_arrays[idx])
	
	imputed_training_matrices.append(farray)
	imputed_test_matrices.append(test_arrays[idx])

#merge the imputed arrays
feature_train = np.concatenate(imputed_training_matrices, axis=1)
feature_test = np.concatenate(imputed_test_matrices, axis=1)

Now we are finally ready to use the data to train the algorithm. First an AdaBoostClassifier object is created, and here we need to give supply a set of arguments for it to work properly. The first argument is classification algoritm to use, which is the DecisionTreeClassifier algorithm. I have chosen to supply this algorithms with the max_dept=3 argument, which constrains the training algorithm to not apply more than three rules before making a prediction.

The n_estimators argument tells the algorithm how many decision trees it should fit, and the learning_rate argument tells the algorithm how much the misclassified matches are going to be up-weighted in the next round of decision three fitting. These two values are usually something that you can experiment with since there is no definite rule on how these should be set. The rule of thumb is that the lower the learning rate is, the more estimators you neeed.

The last argument, random_state, is something that should be given if you want to reproduce the model fitting. If this is not specified you will end up with slightly different trained algroithm each time you fit them. See this question on Stack Overflow for an explanation.

At last the algorithm is fitted using the fit() method, which is supplied with the odds and match results.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adb = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=1000,
    learning_rate=0.4, random_state=42)

adb = adb.fit(feature_train, results_train)

We can now see how well the trained algorithm fits the training data.

import sklearn.metrics as skm

training_pred = adb.predict(feature_train)
print skm.confusion_matrix(list(training_pred), list(results_train))

This is the resulting confusion matrix:

Away Draw Home
Away 164 1 0
Draw 1 152 0
Home 0 0 152

We see that only two matches in the training data is misclassified, one away win which were predicted to be a draw and one draw that was predicted to be an away win. Normally with such a good fit we should be wary of overfitting and poor predictive power on new data.

Let’s try to predict the outcome of the Premier League matches from January to May 2014:

test_pred = adb.predict(feature_test)
print skm.confusion_matrix(list(test_pred), list(results_test)) 
Away Draw Home
Away 31 19 12
Draw 13 10 22
Home 20 14 59

It successfully predicted the right match outcome in a bit over half of the matches.

Poor man’s parallel processing

Here’s a nice trick I learned on how you could implement simple parallel processing capabilities to speed up computations. This trick is only applicable in certain simple cases though, and does not scale very well, so it is best used in one-off scripts rather than in scripts that is used routinely or by others.

Suppose you have a list or an array that you are going to loop trough. Each of the elements in the list takes a long time to process and each iteration is NOT dependent on the result of any of the previous iterations. This is exactly the kind of situation where this trick is applicable.

The trick is to save the result for each iteration in a file whose name is unique to the iteration, and at the beginning of each iteration you simply check if that file already exists. If it does, the script skips to the next iteration. If it doesn’t, you create the file. This way you could run many instances of the script simultaneously, without doing the same iteration twice.

With this trick the results will be spread across different files, but if they are named and formated in a consistent way it is not hard to go trough the files and merge them into a single file.

Here is how it could be done in python:

import os.path

myList = ['bill', 'george', 'barack', 'ronald']

for president in myList:
	
	fileName = 'result_{}'.format(president)
	
	if os.path.isfile(fileName):
		print('File {} already exists, continues to the next iteration')
		continue
	
	f = open(filename, 'w')

	#Do your thing here

	#myResults is the object where your results are stored
	f.write(myResults)
	f.close()
	

And in R:


myList <- c('bill', 'george', 'barack', 'ronald')

for (president in myList){

	file.name <- paste('results', president, sep='_')

	if (file.exists(file.name)){
		cat('File', file.name, 'already exists, continues to the next iteration\n')
		next
	}

	file.create(file.name)

	#Do your thing here
	
	#Save the my.result object
	save(my.result)
}

L-system in Python

The other day I was skimming trough articles on different subjects on Wikipedia. After a while I stumbled across the article on L-systems. In short, L-systems is method for generating fractals using simple string manipulations. You start out with an initial string, and for each character (called variable in L-system terminology) in that string you apply a set of rules that change that character into a set of new characters. Then you repeat n times. The resulting string is then a recipe for a graph or geometric structure. See the Wikipedia article for more details and examples.

I decided to try to implement this in Python. I only implemented the string manipulation engine and not a graphing tool. It was fairly easy, and the first method I came up with takes only about 15 lines of code and relies on recursion. A method without recursion is a couple of lines more, but probably more efficient. The user have to supply their own substitution rules by using a Python dictionary. Each key in the dictionary corresponds to a variable, and constants are defined implicitly by the production rules.

Here are two example from the Wikipedia article:


#Algae example
algaeRules = { "A" : "AB",
		"B" : "A",}
		
algaeExample = lsystem("A", 7, algaeRules)
print(algaeExample)

#ABAABABAABAABABAABABAABAABABAABAAB
#Sierpinski triangle example
sierpinskiRules = {"A": "B-A-B",
					"B": "A+B+A"}

sierpinskiExample = lsystem("A", 3, sierpinskiRules)
print(sierpinskiExample)
#B-A-B+A+B+A+B-A-B-A+B+A-B-A-B-A+B+A-B-A-B+A+B+A+B-A-B

Here is the code for the lsystem() function, including the function that does not use recursion.

def _applyRules(letter, rules):
	"""Internal function for applying production rules on a single character."""
	try:
		return rules[letter]
	except:
		return letter

def lsystem(init, n, rulesDict):
	"""Return the n-th iteration of a L-system (generated recursively)."""
	
	if n <= 0:
		return init
	if n == 1:
		return "".join([_applyRules(i, rulesDict) for i in init])
	else:
		newCond = "".join([_applyRules(i, rulesDict) for i in init])
		return lsystem(newCond, n-1, rulesDict)
	
	

def lsystemNonRecursive(init, n, rulesDict):
	"""Return the n-th iteration of a L-system."""
	
	if n <= 0:
		return init
	else:
		currentCond = init
		while True:
			currentCond = "".join([_applyRules(i, rulesDict) for i in currentCond])
			n -= 1
			if n == 0:
				return currentCond

Unicode csv files in Python 2.x

In some recent web scraping projects I extracted some data from a HTML document and saved it in a .csv file, using the csv module in Python. I used the BeautifulSoup module to parse and navigate the HTML, and since BS always encodes text in unicode, there was some real hassle when I tried to write special (non-ASCII) characters to the csv file since the csv module does not support unicode.

The documentation to the csv module provides some solutions to the problem, but I found that the easiest solution was to just install jdunck’s unicodecsv module. It has the same interface as the regular csv module, which is great. This means that if you already have a script that uses the regular module you can just write import unicodecsv as csv (or whatever you imported csv as) and it should work.

I guess Python 3.x does not have this problem since all strings by default are unicode strings.

A brainfuck interpreter in Python.

So I implemented this small brainfuck interpreter in python. This implementation uses 8-bit memory cells and does not allow for for values outside the 0-255 range. If this happens a ValueError will be raised. By default, the memory tape is 5000 cells long, but this can be set by the user.

Python 2.6 or later is required.

Example:

bf = Brainfuck(100) #only 100 cells on the tape
twoPlusThree = "++>+++<[>+<-]"
bf.run_command(twoPlusThree)

And here is the implementation:

class Brainfuck:
	def __init__(self, tape_length=5000):
		self.tape = bytearray(tape_length)
		self.tape_pointer = 0
		
	def run_command(self, cmd):
		cmd_pointer = 0

		running = True
		while running:
			if self.tape_pointer > len(self.tape) or self.tape_pointer < 0 or cmd_pointer < 0 or cmd_pointer > len(cmd) -1:
				break

			if cmd[cmd_pointer] == "+":
				self.tape[self.tape_pointer] += 1
			elif cmd[cmd_pointer] == "-":
				self.tape[self.tape_pointer] -= 1
			elif cmd[cmd_pointer] == "<":
				self.tape_pointer -= 1
			elif cmd[cmd_pointer] == ">":
				self.tape_pointer += 1
			elif cmd[cmd_pointer] == ".":
				print(chr(self.tape[self.tape_pointer]))
			elif cmd[cmd_pointer] == ",":
				self.tape[self.tape_pointer] = int(raw_input())
			elif cmd[cmd_pointer] == "[":
				if int(self.tape[self.tape_pointer]) == 0:
					lbcounter = 0
					searching = True			
					while searching:
						cmdpointer +=1
						if cmd[cmd_pointer] == "[":
							lbcounter += 1
						elif lbcounter == 0 and cmd[cmd_pointer] == "]":
							searching = False 
						elif cmd[cmd_pointer] == "]":
							lbcounter -= 1
			elif cmd[cmd_pointer] == "]":
				if int(self.tape[self.tape_pointer]) != 0:
					rbcounter = 0
					searching = True			
					while searching:
						cmd_pointer -=1
						if cmd[cmd_pointer] == "]":
							rbcounter += 1
						elif rbcounter == 0 and cmd[cmd_pointer] == "[":
							searching = False
						elif cmd[cmd_pointer] == "[":
							rbcounter -= 1
			
			cmd_pointer += 1