Adaptive Boosting, usually referred to by the abbreviation AdaBoost, is perhaps the best general machine learning method around for classification. It is what’s called a meta-algorithm, since it relies on other algorithms to do the actual prediction. What AdaBoost does is combine a large number of such algorithms in a smart way: First a classification algorithm is trained (or fitted, or has its parameters estimated) on the data. The data points that the algorithm misclassifies are then given more weight as the algorithm is trained again. This procedure is repeated a large number of times (perhaps many thousands of times). When making predictions based on a new set of data, each of the fitted algorithms predicts the new response value, and the most commonly predicted value, with each vote weighted by how well that algorithm did during training, is then considered the overall prediction. Of course there are more details to AdaBoost than this brief summary. I can recommend the book *The Elements of Statistical Learning* by Hastie, Tibshirani and Friedman for a good introduction to AdaBoost, and machine learning in general.
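To make the reweighting idea concrete, here is a minimal sketch of the procedure for the simplest case: two classes coded as -1 and +1, with a one-rule classifier (a threshold on a single variable) as the algorithm being boosted. The function names and the synthetic data are made up for illustration, and this is not how scikit-learn implements it internally.

```python
import numpy as np

def fit_stump(X, y, w):
    """Find the one-variable threshold rule with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(X, j, thr, sign):
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_fit(X, y, n_rounds=20):
    w = np.full(len(y), 1.0 / len(y))  # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this rule
        pred = stump_predict(X, j, thr, sign)
        w = w * np.exp(-alpha * y * pred)      # misclassified points gain weight
        w = w / w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def adaboost_predict(ensemble, X):
    # Weighted vote over all fitted rules
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Synthetic data (an assumption): the class depends on the sum of two variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

ensemble = adaboost_fit(X, y, n_rounds=20)
train_accuracy = (adaboost_predict(ensemble, X) == y).mean()
print(train_accuracy)
```

No single threshold rule can separate these two classes well, but the weighted combination of twenty of them does considerably better on the training data.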

Although any classification algorithm can be used with AdaBoost, it is most commonly used with decision trees. Decision trees are intuitive models that make predictions based on a combination of simple rules. These rules are usually of the form “if predictor variable x is greater than a value y, then do this, if not, do that”. By “do this” and “do that” I mean continue to a different rule of the same form, or make a prediction. This cascade of different rules can be visualized with a chart that looks sort of like a tree, hence the tree metaphor in the name. Of course Wikipedia has an article, but *The Elements of Statistical Learning* has a nice chapter about trees too.
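Written out as code, a small decision tree is just a set of nested if-statements. The sketch below uses two rules on betting odds; the thresholds are invented for illustration and are not taken from any fitted model.

```python
def predict_from_odds(home_odds, away_odds):
    """Toy decision tree with made-up thresholds."""
    # Rule 1: low home odds means the home team is the favourite
    if home_odds > 2.0:
        # Rule 2: otherwise, check whether the away team is the favourite
        if away_odds > 2.5:
            return "draw"
        return "away"
    return "home"

print(predict_from_odds(home_odds=1.5, away_odds=6.0))
```

A fitted tree has the same shape, except that the training algorithm picks the variables and thresholds from the data.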

In this post I am going to use decision trees and AdaBoost to predict the results of football matches. As features, or predictors, I am going to use the published odds from different betting companies, which are available from football-data.co.uk. I am going to use data from the 2012-13 season and the first half of the 2013-14 season of the English Premier League to train the model, and then I am going to predict the remaining matches from the 2013-14 season.

Implementing the algorithms myself would of course take a lot of time, but luckily they are available through the excellent Python scikit-learn package. This package contains lots of machine learning algorithms plus excellent documentation with a lot of examples. I am also going to use the pandas package for loading the data.

```python
import numpy as np
import pandas as pd

dta_fapl2012_2013 = pd.read_csv('FAPL_2012_2013_2.csv', parse_dates=[1])
dta_fapl2013_2014 = pd.read_csv('FAPL_2013-2014.csv', parse_dates=[1])

dta = pd.concat([dta_fapl2012_2013, dta_fapl2013_2014], axis=0, ignore_index=True)

# Find the row numbers that should be used for training and testing.
train_idx = np.array(dta.Date < '2014-01-01')
test_idx = np.array(dta.Date >= '2014-01-01')

# Arrays where the match results are stored in
results_train = np.array(dta.FTR[train_idx])
results_test = np.array(dta.FTR[test_idx])
```

Next we need to decide which columns we want to use as predictors. I wrote earlier that I wanted to use the odds for the different outcomes. Asian handicap odds could be included as well, but to keep things simple I am not doing this now.

```python
feature_columns = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA',
                   'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',
                   'PSH', 'PSD', 'PSA', 'SOH', 'SOD', 'SOA',
                   'SBH', 'SBD', 'SBA', 'SJH', 'SJD', 'SJA',
                   'SYH', 'SYD', 'SYA', 'VCH', 'VCD', 'VCA',
                   'WHH', 'WHD', 'WHA']
```

For some bookmakers the odds for certain matches are missing. In this data this is not much of a problem, but it could be worse in other data. Missing data is a problem because the algorithms will not work when some values are missing. Instead of removing the matches where this is the case, we can guess the value that is missing. As a rule of thumb, an approximate value for some variables of an observation is often better than dropping the observation completely. This is called imputation, and scikit-learn comes with functionality for doing this for us.

The strategy I am using here is to fill in the missing values with the mean of the odds for the same outcome. For example, if the odds for a home win from one bookmaker are missing, our guess is going to be the average of the home win odds from the other bookmakers for that match. Doing this demands some more work, since we have to split the data matrix in three.

```python
from sklearn.preprocessing import Imputer

# Column numbers for odds for the three outcomes
cidx_home = [i for i, col in enumerate(dta.columns) if col[-1] in 'H' and col in feature_columns]
cidx_draw = [i for i, col in enumerate(dta.columns) if col[-1] in 'D' and col in feature_columns]
cidx_away = [i for i, col in enumerate(dta.columns) if col[-1] in 'A' and col in feature_columns]

# The three feature matrices for training
feature_train_home = dta.ix[train_idx, cidx_home].as_matrix()
feature_train_draw = dta.ix[train_idx, cidx_draw].as_matrix()
feature_train_away = dta.ix[train_idx, cidx_away].as_matrix()

# The three feature matrices for testing
feature_test_home = dta.ix[test_idx, cidx_home].as_matrix()
feature_test_draw = dta.ix[test_idx, cidx_draw].as_matrix()
feature_test_away = dta.ix[test_idx, cidx_away].as_matrix()

train_arrays = [feature_train_home, feature_train_draw, feature_train_away]
test_arrays = [feature_test_home, feature_test_draw, feature_test_away]

imputed_training_matrices = []
imputed_test_matrices = []

for idx, farray in enumerate(train_arrays):
    imp = Imputer(strategy='mean', axis=1)  # 0: columns, 1: rows
    farray = imp.fit_transform(farray)
    test_arrays[idx] = imp.fit_transform(test_arrays[idx])

    imputed_training_matrices.append(farray)
    imputed_test_matrices.append(test_arrays[idx])

# Merge the imputed arrays
feature_train = np.concatenate(imputed_training_matrices, axis=1)
feature_test = np.concatenate(imputed_test_matrices, axis=1)
```
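Recent releases of pandas and scikit-learn no longer ship `.ix`, `.as_matrix()` or `Imputer`, and as far as I know the replacement `SimpleImputer` has no `axis` argument. If you are on a newer setup, the row-wise mean imputation itself is easy to sketch with plain NumPy. The function name and the small example matrix below are made up for illustration.

```python
import numpy as np

def impute_row_mean(odds):
    """Replace each missing value with the mean of the other values in its row."""
    odds = np.asarray(odds, dtype=float)
    row_means = np.nanmean(odds, axis=1, keepdims=True)
    return np.where(np.isnan(odds), row_means, odds)

# Two matches, home win odds from three bookmakers, one value missing
home_odds = np.array([[2.0, np.nan, 2.2],
                      [1.5, 1.6, 1.7]])

imputed = impute_row_mean(home_odds)
print(imputed)
```

Note that a match where all bookmakers are missing would still come out as missing, so such rows would need separate handling.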

Now we are finally ready to use the data to train the algorithm. First an AdaBoostClassifier object is created, and here we need to supply a set of arguments for it to work properly. The first argument is the classification algorithm to use, which here is the DecisionTreeClassifier algorithm. I have chosen to supply this algorithm with the `max_depth=3` argument, which constrains the training algorithm to not apply more than three rules before making a prediction.

The `n_estimators` argument tells the algorithm how many decision trees it should fit, and the `learning_rate` argument tells the algorithm how much the misclassified matches are going to be up-weighted in the next round of decision tree fitting. These two values are usually something that you can experiment with, since there is no definite rule on how they should be set. The rule of thumb is that the lower the learning rate is, the more estimators you need.
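One way to experiment with these two settings is to fit the model for a few combinations and compare the accuracy on data held back from training. The sketch below does this on synthetic three-class data from `make_classification`, which stands in for the odds matrix; the grid of values is arbitrary and only meant as a starting point.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the odds matrix (an assumption)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25,
                                              random_state=0)

scores = {}
for learning_rate in (0.1, 0.4, 1.0):
    for n_estimators in (50, 200):
        model = AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=3),
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            random_state=42)
        model.fit(X_fit, y_fit)
        # Accuracy on the held-back part of the data
        scores[(learning_rate, n_estimators)] = model.score(X_val, y_val)

for (learning_rate, n_estimators), accuracy in sorted(scores.items()):
    print(learning_rate, n_estimators, round(accuracy, 3))
```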

The last argument, `random_state`, is something that should be given if you want to reproduce the model fitting. If this is not specified you may end up with a slightly different trained algorithm each time you fit it. See this question on Stack Overflow for an explanation.
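A quick way to convince yourself that the seed makes the fit reproducible is to fit the same model twice with the same `random_state` and compare the predictions. This is a small sketch on synthetic data; the helper `fit_and_predict` is a name I made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (an assumption, not the football data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=1)

def fit_and_predict(seed):
    """Fit a fresh model with the given seed and predict the training data."""
    model = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=3),
        n_estimators=50,
        learning_rate=0.4,
        random_state=seed)
    return model.fit(X, y).predict(X)

same_seed_agrees = np.array_equal(fit_and_predict(42), fit_and_predict(42))
print(same_seed_agrees)
```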

At last the algorithm is fitted using the `fit()` method, which is supplied with the odds and match results.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

adb = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=1000,
    learning_rate=0.4,
    random_state=42)

adb = adb.fit(feature_train, results_train)
```

We can now see how well the trained algorithm fits the training data.

```python
import sklearn.metrics as skm

training_pred = adb.predict(feature_train)
print(skm.confusion_matrix(list(training_pred), list(results_train)))
```

This is the resulting confusion matrix:

| | Away | Draw | Home |
|---|---|---|---|
| Away | 164 | 1 | 0 |
| Draw | 1 | 152 | 0 |
| Home | 0 | 0 | 152 |

We see that only two matches in the training data are misclassified: one away win which was predicted to be a draw, and one draw that was predicted to be an away win. Normally with such a good fit we should be wary of overfitting and poor predictive power on new data.
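A common way to check for overfitting before touching the test data is cross-validation: fit on part of the training data and score on the part left out. Below is a minimal sketch with scikit-learn's `cross_val_score` on synthetic data (an assumption, not the football data). For matches ordered in time, a chronological split like the one used in this post is arguably more appropriate than the shuffle-free five-fold split used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the odds matrix (an assumption)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    learning_rate=0.4,
    random_state=42)

# Accuracy on the same data the model was fitted to
training_accuracy = model.fit(X, y).score(X, y)

# Accuracy on held-out folds; each fold is predicted by a model that never saw it
cv_scores = cross_val_score(model, X, y, cv=5)

print(training_accuracy, cv_scores.mean())
```

A large gap between the two numbers is the warning sign to look for.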

Let’s try to predict the outcome of the Premier League matches from January to May 2014:

```python
test_pred = adb.predict(feature_test)
print(skm.confusion_matrix(list(test_pred), list(results_test)))
```

| | Away | Draw | Home |
|---|---|---|---|
| Away | 31 | 19 | 12 |
| Draw | 13 | 10 | 22 |
| Home | 20 | 14 | 59 |

It successfully predicted the right match outcome in half of the matches (100 of 200).
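The hit rate can be read off the confusion matrix: the correctly predicted matches are the ones on the diagonal. A small sketch with the numbers from the table above typed in by hand:

```python
import numpy as np

# Confusion matrix for the test data, copied from the table above
cm = np.array([[31, 19, 12],
               [13, 10, 22],
               [20, 14, 59]])

n_correct = np.trace(cm)   # sum of the diagonal
n_matches = cm.sum()       # all predicted matches
accuracy = n_correct / n_matches

print(n_correct, n_matches, accuracy)
```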

“It successfully predicted the right match outcome in a bit over half of the matches. ”

why a bit over? looks like 50/50, doesn’t it?

Is it true this machine is predicting perfectly?

It is only able to perfectly predict the results it uses to estimate the parameters of the algorithm, which gives a completely unrealistic picture of how well the algorithm works for predicting future results. As you can see in the last table, about 50% of future results are correctly predicted.

How do you interpret the result matrix?

Take a look here:

https://en.wikipedia.org/wiki/Confusion_matrix

Thank you for this useful article.

I have a question

Can this code be converted to the R programming language?

Yes they could. I haven’t used adaptive boosting and tree classifiers in R myself, but they should be available from one of the many packages out there. Perhaps check out the caret package that implements wrappers around these kinds of algorithms from other packages. I would be surprised if these weren’t available there.

Ok! Thank you for the answer.

With your permission, I want to ask you another question. Can artificial neural network code be used for match score prediction? Is there an R package you can recommend in this regard?

I guess they could, but I have never used ANNs so I cannot recommend any packages for them.