# Statistics I don’t believe in

Oslo is one of the candidate cities for hosting the 2022 winter Olympics, and of course there is a lot of discussion going on whether it costs too much etc. Opinion polls are conducted regularly, and the current status is that the majority of people don’t want Oslo to host the games. A couple of weeks ago the office dealing with the application put up the data from all polls conducted this year. What is weird is that they choose to illustrate the article detailing the data with the results from the question Do you wish that Norway should continue to participate in the winter Olympics and Paralympics? where only 73% of the respondents answered yes and 19% answered no.

First of all I thought the question was silly. Of course no one want Norway not to participate, but then the results show that almost one in five don’t want Norway to participate. At first i thought that it might be just a fluke, but then I downloaded the data and saw that this question had been asked on three occasions, and the results were pretty similar:

Even after seeing the consistent trend, I still don’t believe the numbers are correct. I think this shows one of the potential pitfalls in opinion polls. Since it is really hard to get the nuances of peoples beliefs in questions with just a few possible answers, it may be tempting to ask more detailed questions. The danger is that the questions become more confusing, or they even become irrelevant. I think both these things are can explain what is going on here.

# Two Bayesian regression models for football results

Last fall I took a short introduction course in Bayesian modeling, and as part of the course we were going to analyze a data set of our own. I of course wanted to model football results. The inspiration came from a paper by Gianluca Baio and Marta A. Blangiardo Bayesian hierarchical model for the prediction of football results (link).

I used data from Premier League from 2012 and wanted to test the predictions on the last half of the 2012-23 season. With this data I fitted two models: One where the number of goals scored where modeled using th Poisson distribution, and one where I modeled the outcome directly (as home win, away win or draw) using an ordinal probit model. As predictors I used the teams as categorical predictors, meaning each team will be associated with two parameters.

The Poisson model was pretty much the same as the first and simplest model described in Baio and Blangiardo paper, but with slightly more informed priors. What makes this model interesting and different from the independent Poisson model I have written about before, apart from being estimated using Bayesian techniques, is that each match is not considered as two independent events when the parameters are estimated. Instead a correlation is implicitly modeled by specifying the priors in a smart way (see figure 1 in the paper, or here), thereby modeling the number of goals scored like a sort-of-bivariate Poisson.

Although I haven’t had time to look much into it yet, I should also mention that Baio and Blangiardo extended their model and used it this summer to model the World Cup. You can read more at Baio’s blog.

The ordinal probit model exploits the fact that the outcomes for a match can be thought to be on an ordinal scale, with a draw (D) considered to be ‘between’ a home win (H) and an away win (A). An ordinal probit model is in essence an ordinary linear regression model with a continuous response mu, that is coupled with a set of threshold parameters. For any value of mu the probabilities for any category is determined by the cumulative normal distribution and the threshold values. This is perhaps best explained with help from a figure:

Here we see an example where the predicted outcome is 0.9, and the threshold parameters has been estimated to 0 and 1.1. The area under the curve is then the probability of the different outcomes.

To model the match outcomes I use a model inspired by the structure in the predictors as the Poisson model above. Since the outcomes are given as Away, Draw and Home, the home field advantage is not needed as a separate term. This is instead implicit in the coefficients for each team. This gives the coefficients a different interpretation from the above model. The two coefficients here can be interpreted as the ability when playing at home and the ability when playing away.

To get this model to work I had to set the constrains that the threshold separating Away and Draw were below the Draw-Home threshold. This implies that a good team would be expected to have a negative Away coefficient and a positive Home coefficient. Also, the intercept parameter had to be fixed to an arbitrary value (I used 2).

To estimate the parameters and make predictions I used JAGS trough the rjags package.

For both models, I used the most credible match outcome as the prediction. How well were the last half of the 2012-13 season predictions? The results are shown in the confusion table below.

Confusion matrix for Poisson model

 actual/predicted A D H A 4 37 11 D 1 35 14 H 0 38 42

Confusion matrix for ordinal probit model

 actual/predicted A D H A 19 0 33 D 13 0 37 H 10 0 70

The Poisson got the result right in 44.5% of the matches while the ordinal probit got right in 48.9%. This was better than the Poisson model, but it completely failed to even consider draw as an outcome. Ordinal probit, however, does seem to be able to predict away wins, which the Poisson model was poor at.

Here is the JAGS model specification for the ordinal probit model.

model {

for( i in 1:Nmatches ) {

pr[i, 1] <- phi( thetaAD - mu[i]  )
pr[i, 2] <- max( 0 ,  phi( (thetaDH - mu[i]) ) - phi( (thetaAD - mu[i]) ) )
pr[i, 3] <- 1 - phi( (thetaDH - mu[i]) )

y[i] ~ dcat(pr[i, 1:3])

mu[i] <- b0 + homePerf[teamh[i]] + awayPerf[teama[i]]
}

for (j in 1:Nteams){
homePerf.p[j] ~ dnorm(muH, tauH)
awayPerf.p[j] ~ dnorm(muA, tauA)

#sum to zero constraint
homePerf[j] <- homePerf.p[j] - mean(homePerf.p[])
awayPerf[j] <- awayPerf.p[j] - mean(awayPerf.p[])
}

thetaAD ~ dnorm( 1.5 , 0.1 )
thetaDH ~ dnorm( 2.5 , 0.1 )

muH ~ dnorm(0, 0.01)
tauH ~ dgamma(0.1, 0.1)

muA ~ dnorm(0, 0.01)
tauA ~ dgamma(0.1, 0.1)

#predicting missing values
predictions <- y[392:573]
}


And here is the R code I used to run the above model in JAGS.

library('rjags')
library('coda')

#Remove the match outcomes that should be predicted
to.predict <- 392:573 #this is row numbers
observed.results <- dta[to.predict, 'FTR']
dta[to.predict, 'FTR'] <- NA

#list that is given to JAGS
data.list <- list(
teamh = as.numeric(dta[,'HomeTeam']),
teama = as.numeric(dta[,'AwayTeam']),
y = as.numeric(dta[, 'FTR']),
Nmatches = dim(dta)[1],
Nteams = length(unique(c(dta[,'HomeTeam'], dta[,'AwayTeam']))),
b0 = 2 #fixed
)

#MCMC settings
burnin <- 1000
nchains <- 1
steps <- 15000
thinsteps <- 5

#Fit the model
#script name is a string with the file name where the JAGS script is.
update(jagsmodel, n.iter=burnin)

samples <- coda.samples(jagsmodel, variable.names=parameters,
n.chains=nchains, thin=thinsteps,
n.iter=steps)

#Save the samples
save(samples, file='bayesProbit_20131030.RData')

#print summary
summary(samples)


# A short summer update

Hi everybody, I haven’t had written much lately as I have been busy with other things, so this is just a short update on what is going on with my life.

This May I handed in my M.Sc. thesis in bioinformatics at the Norwegian University of Life Sciences. In it I investigated the prevalence of cis-regulatory elements (CREs) in plant promoters across 25 species. CREs are short DNA sites (typically around 12 base pairs long) that plays a role in gene regulation. They typically work by having high binding affinity for proteins that somehow regulates RNA transcription. I used genomic data (sequences and annotations) from the PLAZA server. I then predicted CREs by using motifs from the JASPAR, AtcisDB and PLACE databases. I then investigated whether predicted CREs that was found to be conserved in many gene families were more co-expressed than CREs that were not conserved and conserved only in a few families. I found that in general, this was not the case.

Right after my thesis was done I started my new job as a statistician in the Department of Health Services Research at Akershus University Hospital right outside Oslo. I am very exited about this as I get to work on a lot of interesting and meaningful research projects with a bunch of really great people.

I also spent a week backpacking in Belgium this summer with my girlfriend. The atmosphere was great in Brussels when Belgium beat the US in the World Cup, the weather was nice when we hiked in the Ardennes near the French border and everything was excellent when we rode our bikes in Bruges. Antwerp had the best food and the most beautiful train station I have ever been to. The most bizarre thing we experienced was perhaps the Bruggian love for embroidered cats dressed as humans:

Until next time!

# Predicting football results with Adaptive Boosting

Adaptive Boosting, usually referred to by the abbreviation AdaBoost, is perhaps the best general machine learning method around for classification. It is what’s called a meta-algorithm, since it relies on other algorithms to do the actual prediction. What AdaBoost does is combining a large number of such algorithms in a smart way: First a classification algorithm is trained, or fitted, or its parameters are estimated, to the data. The data points that the algorithm misclassifies are then given more weight as the algorithm is trained again. This procedure is repeated a large number of times (perhaps many thousand times). When making predictions based on a new set of data, each of the fitted algorithms predict the new response value, and a the most commonly predicted value is then considered the overall prediction. Of course there are more details surrounding the AdaBoost than this brief summary. I can recommend the book The Elements of Statistical Learning by Hasite, Tibshirani and Friedman for a good introduction to AdaBoost, and machine learning in general.

Although any classification algorithm can be used with AdaBoost, it is most commonly used with decision trees. Decision trees are intuitive models that make predictions based on a combination of simple rules. These rules are usually of the form “if predictor variable x is greater than a value y, then do this, if not, do that”. By “do this” and “do that” I mean continue to a different rule of the same form, or make a prediction. This cascade of different rules can be visualized with a chart that looks sort of like a tree, hence the tree metaphor in the name. Of course Wikipedia has an article, but The Elements of Statistical Learning has a nice chapter about trees too.

In this post I am going to use decision trees and AdaBoost to predict the results of football matches. As features, or predictors I am going to use the published odds from different betting companies, which is available from football-data.co.uk. I am going to use data from the 2012-13 and first half of the 2013-14 season of the English Premier League to train the model, and then I am going to predict the remaining matches from the 2013-14 season.

Implementing the algorithms by myself would of course take a lot of time, but luckily they are available trough the excellent Python scikit-learn package. This package contains lots of machine learning algorithms plus excellent documentation with a lot of examples. I am also going to use the pandas package for loading the data.

import numpy as np
import pandas as pd

dta = pd.concat([dta_fapl2012_2013, dta_fapl2013_2014], axis=0, ignore_index=True)

#Find the row numbers that should be used for training and testing.
train_idx = np.array(dta.Date < '2014-01-01')
test_idx = np.array(dta.Date >= '2014-01-01')

#Arrays where the match results are stored in
results_train = np.array(dta.FTR[train_idx])
results_test = np.array(dta.FTR[test_idx])


Next we need to decide which columns we want to use as predictors. I wrote earlier that I wanted to use the odds for the different outcomes. Asian handicap odds could be included as well, but to keep things simple I am not doing this now.

feature_columns = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH',
'IWD', 'IWA','LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA',
'SOH', 'SOD', 'SOA', 'SBH', 'SBD', 'SBA', 'SJH', 'SJD',
'SJA', 'SYH', 'SYD','SYA', 'VCH', 'VCD', 'VCA', 'WHH',
'WHD', 'WHA']


For some bookmakers the odds for certain matches is missing. In this data this is not much of a problem, but it could be worse in other data. Missing data is a problem because the algorithms will not work when some values are missing. Instead of removing the matches where this is the case we can instead guess the value that is missing. As a rule of thumb we can say that an approximate value for some variables of an observation is often better than dropping the observation completely. This is called imputation and scikit-learn comes with functionality for doing this for us.

The strategy I am using here is to fill inn the missing values by the mean of the odds for the same outcome. For example if the odds for home win from one bookmaker is missing, our guess of this odds is going to be the average of the odds for home win from the other bookmakers for that match. Doing this demands some more work since we have to split the data matrix in three.

from sklearn.preprocessing import Imputer

#Column numbers for odds for the three outcomes
cidx_home = [i for i, col in enumerate(dta.columns) if col[-1] in 'H' and col in feature_columns]
cidx_draw = [i for i, col in enumerate(dta.columns) if col[-1] in 'D' and col in feature_columns]
cidx_away = [i for i, col in enumerate(dta.columns) if col[-1] in 'A' and col in feature_columns]

#The three feature matrices for training
feature_train_home = dta.ix[train_idx, cidx_home].as_matrix()
feature_train_draw = dta.ix[train_idx, cidx_draw].as_matrix()
feature_train_away = dta.ix[train_idx, cidx_away].as_matrix()

#The three feature matrices for testing
feature_test_home = dta.ix[test_idx, cidx_home].as_matrix()
feature_test_draw = dta.ix[test_idx, cidx_draw].as_matrix()
feature_test_away = dta.ix[test_idx, cidx_away].as_matrix()

train_arrays = [feature_train_home, feature_train_draw,
feature_train_away]

test_arrays = [feature_test_home, feature_test_draw,
feature_test_away]

imputed_training_matrices = []
imputed_test_matrices = []

for idx, farray in enumerate(train_arrays):
imp = Imputer(strategy='mean', axis=1) #0: column, 1:rows
farray = imp.fit_transform(farray)
test_arrays[idx] = imp.fit_transform(test_arrays[idx])

imputed_training_matrices.append(farray)
imputed_test_matrices.append(test_arrays[idx])

#merge the imputed arrays
feature_train = np.concatenate(imputed_training_matrices, axis=1)
feature_test = np.concatenate(imputed_test_matrices, axis=1)


Now we are finally ready to use the data to train the algorithm. First an AdaBoostClassifier object is created, and here we need to give supply a set of arguments for it to work properly. The first argument is classification algoritm to use, which is the DecisionTreeClassifier algorithm. I have chosen to supply this algorithms with the max_dept=3 argument, which constrains the training algorithm to not apply more than three rules before making a prediction.

The n_estimators argument tells the algorithm how many decision trees it should fit, and the learning_rate argument tells the algorithm how much the misclassified matches are going to be up-weighted in the next round of decision three fitting. These two values are usually something that you can experiment with since there is no definite rule on how these should be set. The rule of thumb is that the lower the learning rate is, the more estimators you neeed.

The last argument, random_state, is something that should be given if you want to reproduce the model fitting. If this is not specified you will end up with slightly different trained algroithm each time you fit them. See this question on Stack Overflow for an explanation.

At last the algorithm is fitted using the fit() method, which is supplied with the odds and match results.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

DecisionTreeClassifier(max_depth=3),
n_estimators=1000,
learning_rate=0.4, random_state=42)



We can now see how well the trained algorithm fits the training data.

import sklearn.metrics as skm

print skm.confusion_matrix(list(training_pred), list(results_train))


This is the resulting confusion matrix:

 Away Draw Home Away 164 1 0 Draw 1 152 0 Home 0 0 152

We see that only two matches in the training data is misclassified, one away win which were predicted to be a draw and one draw that was predicted to be an away win. Normally with such a good fit we should be wary of overfitting and poor predictive power on new data.

Let’s try to predict the outcome of the Premier League matches from January to May 2014:

test_pred = adb.predict(feature_test)
print skm.confusion_matrix(list(test_pred), list(results_test))

 Away Draw Home Away 31 19 12 Draw 13 10 22 Home 20 14 59

It successfully predicted the right match outcome in a bit over half of the matches.

# Identifying gender bias in candidate lists in proportional representation elections

The Norwegian parliamentary elections uses a system of proportional representation. Each county has a number of seats in parliament (based on number of inhabitants and area), and the number of seats given to each party almost proportional to the number of votes the party receives on that county. Since each party can win more than one seat the parties has to prepare a ranked list of people to be elected, where the top name is given the first seat, the second name given the second seat etc.

Proportional representation systems like the Norwegian one has been show to be associated with greater gender balance in parliaments than other systems (see table 1 in this paper). Also, the proportion of women in the Norwegian Storting has also increased the last 30 years:

Data source: Statistics Norway, table 08219.

At the 1981 election, 26% of the elected representatives where women. At the 2013 election, the proportion was almost 40%. One mechanism that can explain this persistent female underrepresentation is that men are overrepresented at the top of the electoral lists. Inspired by a bioinformatics method called Gene Set Enrichment (GSEA) I am going to put this hypothesis to the test.

The method is rather simple. Explained in general terms, this is how it works: First you need to calculate a score witch represents the degree of overrepresentation of a category near the top of the list. Each time you encounter an instance belonging to the category your testing you increase the score, otherwise you decrease it. To make the score be a measure of overrepsentation at the top of the list the increase and decrease must be weighted accordingly. The maximum score of this ‘running sum’ is the test statistic. Here I have chosen the function $$\frac{1}{\sqrt(i)}$$ where i is the number the candidate is on the list (number 1 is the top candidate).

To calculate the p-value the same thing is done again repeatedly with different random permutations of the list. The proportion of times the score from these randomizations are greater or equal to the observed score is then the p-value.

I am going to use this method on the election lists from Hordaland county from the 1981 and 2013 election. Hordaland had 15 seats in 1981, and 16 seats in 2013. 3 (20 %) women were elected in 1981 and 5 (31.3 %) in 2013. The election lists are available from the Norwegian Social Science Data Services and the National Library of Norway.

Here are the results for each party at the two elections:

 Party 2013 1981 Ap 1 (0.43) 3.58 (0.49) Frp 3.28 (0.195) 3.56 (0.49) H 1.018 (0.66) 3.17 (0.35) Krf 1.24 (0.43) 2.32(0.138) Sp 2.86 (0.49) 2.86 (0.48) Sv 1 (0.24) 0.29 (0.72) V 1.49 (0.59) 1.37 (0.29)

The number shown is the score, while the p-value is in parenthesis. A higher score means a higher over representation of men at the top of the list.

Even if we ignore problems with multiple testing, none of the parties have a significant over representation of men at the top if the traditional significance threshold of $$p \le 0.05$$ is used. This is perhaps unexpected, as at least the gender balance in the elected candidates after the 1981 election is significantly biased (p = 0.018, one sided exact binomial test).

This really tells us that this method is not really powerful enough to make inferences about this kinds of data. I think one possible improvement would be to somehow score all lists in combination to find an overall gender bias. One could also try a different null model. The one I have used here has randomly shuffled the list in question, maintaining the bias in gender ratio (if any). Instead a the observed score could be compared to random samplings where each gender were sampled with equal probabilities.

My final thought is that this whole significance testing approach is inappropriate. Even if the bias is statistical insignificant, it is still there to influence the gender ratio of the elected members of parliament. From looking at some of the lists and their scores, I will say that all scores greater than 1 at least indicate a positive bias towards having more men at the top.

# The R code for the home field advantage and traveling distance analysis.

I was asked in the comments on my Does traveling distance influence home field advantage? to provide the R code I used, because Klemens of the rationalsoccer blog wanted to do the analysis on some of his own data. I have refactored it a bit to make it easier to use.

First load the data with the coordinates I posted last year.

dta.stadiums <- read.csv('stadiums.csv')


I also assume you have data formated like the data from football-data.co.uk in a data frame called dta.matches.

First wee need a way to calculate the distance (in kilometers) between the two coordinates. This is a function that does that.

coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
#Calculates the distance between two WGS84 coordinates.
#
#http://en.wikipedia.org/wiki/Haversine_formula
#http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
c <- 2 * pmin(1, asin(sqrt(h)))
return(d)
}


Next, we need to find the coordinates where each match is played, and the coordinates for where the visting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.

coord.home <- dta.stadiums[match(dta.matches$HomeTeam, dta.stadiums$FDCOUK),
c('Latitude', 'Longitude')]
coord.away <- dta.stadiums[match(dta.matches$AwayTeam, dta.stadiums$FDCOUK),
c('Latitude', 'Longitude')]

dta.matches$Distance <- coordinate.distance(coord.home$Latitude, coord.home$Longitude, coord.away$Latitude, coord.away$Longitude)  Here are two functions that is needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the the home field advantage per match. avgerage.gd <- function(dta){ #Calculates the average goal difference for each team. all.teams <- unique(c(levels(dta$HomeTeam), levels(dta$AwayTeam))) average.goal.diff <- numeric(length(all.teams)) names(average.goal.diff) <- all.teams for (t in all.teams){ idxh <- which(dta$HomeTeam == t)
goals.for.home <- dta[idxh, 'FTHG']
goals.against.home <- dta[idxh, 'FTAG']

idxa <- which(dta$AwayTeam == t) goals.for.away <- dta[idxa, 'FTAG'] goals.against.away <- dta[idxa, 'FTHG'] n.matches <- length(idxh) + length(idxa) total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) - sum(goals.against.home) - sum(goals.against.away) average.goal.diff[t] <- total.goal.difference / n.matches } return(average.goal.diff) } matchwise.hfa <- function(dta, avg.goaldiff){ #Calculates the matchwise home field advantage based on the average goal #difference for each team. n.matches <- dim(dta)[1] hfa <- numeric(n.matches) for (idx in 1:n.matches){ hometeam.avg <- avg.goaldiff[dta[idx,'HomeTeam']] awayteam.avg <- avg.goaldiff[dta[idx,'AwayTeam']] expected.goal.diff <- hometeam.avg - awayteam.avg observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG'] hfa[idx] <- observed.goal.diff - expected.goal.diff } return(hfa) }  In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (puh!). The home field advantage is out into the new column HFA. dta.matches$HFA <- numeric(dim(dta.matches)[1])
seasons <- levels(dta.matches$Season) for (i in 1:length(seasons)){ season.l <- dta.matches$Season == seasons[i]
h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
dta.matches$HFA[season.l] <- h }  At last we can do the linear regression and make a nice little plot. m <- lm(HFA ~ Distance, data=dta.matches) summary(m) plot(dta.matches$Distance, dta.matches\$HFA, xlab='Distance (km)', ylab='Difference from expected goals', main='Home field advantage vs traveling distance')
abline(m, col='red')


# Poor man’s parallel processing

Here’s a nice trick I learned on how you could implement simple parallel processing capabilities to speed up computations. This trick is only applicable in certain simple cases though, and does not scale very well, so it is best used in one-off scripts rather than in scripts that is used routinely or by others.

Suppose you have a list or an array that you are going to loop trough. Each of the elements in the list takes a long time to process and each iteration is NOT dependent on the result of any of the previous iterations. This is exactly the kind of situation where this trick is applicable.

The trick is to save the result for each iteration in a file whose name is unique to the iteration, and at the beginning of each iteration you simply check if that file already exists. If it does, the script skips to the next iteration. If it doesn’t, you create the file. This way you could run many instances of the script simultaneously, without doing the same iteration twice.

With this trick the results will be spread across different files, but if they are named and formated in a consistent way it is not hard to go trough the files and merge them into a single file.

Here is how it could be done in python:

import os.path

myList = ['bill', 'george', 'barack', 'ronald']

for president in myList:

fileName = 'result_{}'.format(president)

if os.path.isfile(fileName):
print('File {} already exists, continues to the next iteration')
continue

f = open(filename, 'w')

#myResults is the object where your results are stored
f.write(myResults)
f.close()



And in R:


myList <- c('bill', 'george', 'barack', 'ronald')

for (president in myList){

file.name <- paste('results', president, sep='_')

if (file.exists(file.name)){
cat('File', file.name, 'already exists, continues to the next iteration\n')
next
}

file.create(file.name)

#Save the my.result object
save(my.result)
}


# Gender differences in ski jumping at the Olympics

I had a discussion with some friends the other day about separate sports competitions for men and women. In some sports, like curling, it seems rather unnecessary to have separate competitions. At least assuming the reason for gendered competitions is that being a male or a female may give the competitor an obvious advantage. One sport where we didn’t think it was obvious was ski jumping, so I decided to look at some numbers.

This year’s Olympics was the first time women competed in ski jumping so decided to do a quick comparison of the results from the final round in the men’s and women’s final.

This is what I came up with:

What we see are the estimated distributions for the jump distances for men and women. The mode for the women seems to be a little lower than the mode for the men. We also see that there is much more variability among the women jumpers than among the men and that the women’s distribution have a longer right tail. Still, it looks like the best female jumpers are on par with the best male jumpers and vice verca.

The numbers I used here are not adjusted for wind conditions and other relevant factors, so I will not draw any firm conclusions. I hope to have time to look more into this later, using data from more competitions, adjusting for wind etc.

# The minimum violations ranking method

One informative benchmark when ranking and rating sports teams is how many times the ranking has been violated. A ranking violation occurs when a team beats a higher ranked team. Ideally no violations would occur, but in practice this rarely happens. In many cases it is unavoidable, for example in this three team competition: Team A beats team B, team B beats team C, and team C beats team A. In this case, for any of the 6 possible rankings of these three teams at least one violation would occur.

Inspired by this, one could try to construct a ranking with as few violations as possible. A minimum violations ranking (MVR), as it is called. The idea is simple and intuitive, and has been put to use in ranking American college sport teams. The MinV ranking by Jay Coleman is one example.

MV rankings have some other nice properties other than being just an intuitive measure. A MV ranking is the best ranking in terms of backwards predictions. It can also be a method for combining several other rankings, by using the other rankings as the data.

Despite this, I don’t think MV rankings are that useful in the context of football. The main reason for this is that football has a large number of draws and as far as I can tell, a draw has no influence on a MV ranking. A draw is therefore equivalent with no game at all and provides no information.

MV rankings also has another problem. In many cases there can be several rankings that satisfies the MV criterion. This of course depends on the data, but it seems nevertheless to be quite common, such as in the small example above.

Unfortunately, I have not found any software packages that can find a MV ranking. One algorithm is described in this paper (paywall), but I haven’t tried to implemented it myself. Most other MVR methods I have seen seem to be based on defining a set of mathematical constraints and then letting some optimization software search for solutions. See this paper for an example.

# Does traveling distance influence home field advantage?

A couple of weeks ago I posted a data set with the location of the stadiums for many of the football teams in Europe. One thing I wanted to use the dataset for was to see if the traveling distance between two teams (as measured by the distance between the two team’s home stadium) influenced home field advantage.

To calculate the home field advantage for each match i did the following: For each team, the average goal difference during the season are calculated (goals scored minus goals conceded divided by the number of matches). Then the expected goal difference for a match is the difference between the average goal differences (home minus away). The home field advantage is then the observed goal difference minus the expected goal difference.

In the 2012-13 Premier League season, for example, Chelsea scored 75 goals and conceded 39 goals in total. Everton scored 55 and conceded 40 goals. Both teams played 38 matches during the season. On average Chelsea had a goal difference of per match of 0.947 and Everton’s average were 0.395. With Chelsea meeting Everton at home the expected goal difference is 0.947 – 0.395 = 0.553. The actual outcome for this match was 2-1, a goal difference of 1. The home field advantage for this match is then 1 – 0.553 = 0.447.

Using data from the 2011-12 and 2012-13 seasons from the top divisions from Spain, France Germany, and the 2012-13 from England I used the stadium coordinates to calculate the traveling distance for the visiting team and the home field advantage. Plotting these two against each other, and drawing the least squares line gives this:

There is a great deal of noise in this plot, to put it mildly. The slope of the red line is 0.00006039. This is the estimated increase in number of goals the home team scores for each kilometer the away team has traveled. This is not significantly different from 0 (p-value = 0.646). The intercept, where the red line crosses the vertical axis is 0.4, meaning that the home team is estimated to score 0.4 more goals than expected, if the opposing team has traveled 0 kilometers. This is highly significant (p-value = 1.71e-11).

To be honest, I am a bit surprised to see such a clear lack of effect of traveling distance. I did not expect a particularly strong, or even very significant effect, but I had hoped to see at least a hint at something. Perhaps one reason for the lack of effect is that traveling distance is not necessarily the same as traveling time as longer distances may be covered by air, making them comparable to shorter travels by land.

It should be kept in mind that these results should only apply to the leagues included in the data. It could be that traveling distance could have a significant effect on longer distances, for example in international competitions such as the Champions League or between national teams.