Poor man’s parallel processing

Here’s a nice trick I learned on how to implement simple parallel processing to speed up computations. The trick is only applicable in certain simple cases and does not scale very well, so it is best used in one-off scripts rather than in scripts that are used routinely or by others.

Suppose you have a list or an array that you are going to loop through. Each element in the list takes a long time to process, and no iteration depends on the result of any previous iteration. This is exactly the kind of situation where the trick is applicable.

The trick is to save the result of each iteration in a file whose name is unique to the iteration, and at the beginning of each iteration you simply check if that file already exists. If it does, the script skips to the next iteration. If it doesn’t, you create the file. This way you can run many instances of the script simultaneously without doing the same iteration twice.

With this trick the results will be spread across different files, but if they are named and formatted in a consistent way it is not hard to go through the files and merge them into a single file.

Here is how it could be done in python:

import os.path

myList = ['bill', 'george', 'barack', 'ronald']

for president in myList:

	fileName = 'result_{}'.format(president)

	if os.path.isfile(fileName):
		print('File {} already exists, skipping to the next iteration'.format(fileName))
		continue

	#Create the file right away so other instances skip this iteration
	f = open(fileName, 'w')

	#Do your thing here

	#myResults is the string where your results are stored
	f.write(myResults)
	f.close()
	

And in R:


myList <- c('bill', 'george', 'barack', 'ronald')

for (president in myList){

	file.name <- paste('results', president, sep='_')

	if (file.exists(file.name)){
	cat('File', file.name, 'already exists, skipping to the next iteration\n')
		next
	}

	file.create(file.name)

	#Do your thing here
	
	#Save the my.result object to the file
	save(my.result, file=file.name)
}
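Afterwards the per-iteration files can be merged. Here is a minimal Python sketch, assuming the result_* naming from the example above (one result per file):

```python
import glob

def merge_results(pattern, outfile):
    # Concatenate every result file matching the pattern into a single
    # file, one line per result, sorted so the order is reproducible.
    with open(outfile, 'w') as out:
        for name in sorted(glob.glob(pattern)):
            with open(name) as f:
                out.write(f.read().rstrip('\n') + '\n')
```

Calling merge_results('result_*', 'all_results.txt') then collects everything into a single file.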

Gender differences in ski jumping at the Olympics

I had a discussion with some friends the other day about separate sports competitions for men and women. In some sports, like curling, it seems rather unnecessary to have separate competitions. At least assuming the reason for gendered competitions is that being a male or a female may give the competitor an obvious advantage. One sport where we didn’t think it was obvious was ski jumping, so I decided to look at some numbers.

This year’s Olympics was the first time women competed in ski jumping, so I decided to do a quick comparison of the results from the final round of the men’s and women’s competitions.

This is what I came up with:
[Figure: skijump — estimated distributions of jump distances for men and women]

What we see are the estimated distributions of the jump distances for men and women. The mode for the women seems to be a little lower than the mode for the men. We also see that there is much more variability among the women jumpers than among the men, and that the women’s distribution has a longer right tail. Still, it looks like the best female jumpers are on par with the best male jumpers and vice versa.

The numbers I used here are not adjusted for wind conditions and other relevant factors, so I will not draw any firm conclusions. I hope to have time to look more into this later, using data from more competitions, adjusting for wind etc.

The minimum violations ranking method

One informative benchmark when ranking and rating sports teams is how many times the ranking has been violated. A ranking violation occurs when a team beats a higher ranked team. Ideally no violations would occur, but in practice this rarely happens. In many cases it is unavoidable, for example in this three team competition: Team A beats team B, team B beats team C, and team C beats team A. In this case, for any of the 6 possible rankings of these three teams at least one violation would occur.

Inspired by this, one could try to construct a ranking with as few violations as possible. A minimum violations ranking (MVR), as it is called. The idea is simple and intuitive, and has been put to use in ranking American college sport teams. The MinV ranking by Jay Coleman is one example.

MV rankings have some nice properties besides being an intuitive measure. An MV ranking is the best ranking in terms of backwards prediction. It can also be used to combine several other rankings, by using the other rankings as the data.

Despite this, I don’t think MV rankings are that useful in the context of football. The main reason is that football has a large number of draws, and as far as I can tell a draw has no influence on an MV ranking. A draw is therefore equivalent to no game at all and provides no information.

MV rankings also have another problem: in many cases several rankings satisfy the MV criterion. This of course depends on the data, but it nevertheless seems to be quite common, as in the small example above.
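For a handful of teams this is easy to see by brute force over all possible rankings. A small Python sketch, using the three-team example from above:

```python
from itertools import permutations

def count_violations(ranking, results):
    # results: list of (winner, loser) pairs; a violation occurs
    # whenever a team beats a team ranked above it
    pos = {team: i for i, team in enumerate(ranking)}
    return sum(1 for winner, loser in results if pos[winner] > pos[loser])

def min_violations_rankings(teams, results):
    # Exhaustive search over all rankings; feasible only for few teams
    scored = [(count_violations(r, results), r) for r in permutations(teams)]
    best = min(score for score, _ in scored)
    return best, [r for score, r in scored if score == best]

# Team A beats B, B beats C, and C beats A
results = [('A', 'B'), ('B', 'C'), ('C', 'A')]
best, rankings = min_violations_rankings(['A', 'B', 'C'], results)
```

Here best comes out as 1, and three different rankings achieve it, illustrating that the MV criterion often does not single out a unique ranking.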

Unfortunately, I have not found any software packages that can find an MV ranking. One algorithm is described in this paper (paywall), but I haven’t tried to implement it myself. Most other MVR methods I have seen are based on defining a set of mathematical constraints and then letting some optimization software search for solutions. See this paper for an example.

Does traveling distance influence home field advantage?

A couple of weeks ago I posted a data set with the locations of the stadiums of many of the football teams in Europe. One thing I wanted to use the dataset for was to see if the traveling distance between two teams (measured as the distance between the two teams’ home stadiums) influences home field advantage.

To calculate the home field advantage for each match I did the following: For each team, the average goal difference during the season is calculated (goals scored minus goals conceded, divided by the number of matches). The expected goal difference for a match is then the difference between the two averages (home minus away). The home field advantage is the observed goal difference minus the expected goal difference.

In the 2012-13 Premier League season, for example, Chelsea scored 75 goals and conceded 39 goals in total. Everton scored 55 and conceded 40 goals. Both teams played 38 matches during the season. On average Chelsea had a goal difference per match of 0.947, and Everton’s average was 0.395. With Chelsea meeting Everton at home, the expected goal difference is 0.947 – 0.395 = 0.553. The actual outcome of this match was 2-1, a goal difference of 1. The home field advantage for this match is then 1 – 0.553 = 0.447.
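The arithmetic of the example can be written out directly:

```python
# Chelsea vs Everton, Premier League 2012-13 (figures from the text)
chelsea_avg = (75 - 39) / 38   # average goal difference per match
everton_avg = (55 - 40) / 38

expected_gd = chelsea_avg - everton_avg  # Chelsea playing at home
observed_gd = 2 - 1                      # the match ended 2-1

home_field_advantage = observed_gd - expected_gd
print(round(home_field_advantage, 3))    # 0.447
```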

Using data from the 2011-12 and 2012-13 seasons of the top divisions in Spain, France and Germany, and the 2012-13 season in England, I used the stadium coordinates to calculate the traveling distance for the visiting team and the home field advantage for each match. Plotting these two against each other, and drawing the least squares line, gives this:

[Figure: hfadistance — home field advantage plotted against the away team’s traveling distance, with least squares line]

There is a great deal of noise in this plot, to put it mildly. The slope of the red line is 0.00006039, the estimated increase in the number of goals the home team scores for each kilometer the away team has traveled. This is not significantly different from 0 (p-value = 0.646). The intercept, where the red line crosses the vertical axis, is 0.4, meaning that the home team is estimated to score 0.4 more goals than expected if the opposing team has traveled 0 kilometers. This is highly significant (p-value = 1.71e-11).

To be honest, I am a bit surprised to see such a clear lack of effect of traveling distance. I did not expect a particularly strong or even very significant effect, but I had hoped to see at least a hint of something. Perhaps one reason for the lack of effect is that traveling distance is not necessarily the same as traveling time, as longer distances may be covered by air, making them comparable to shorter trips by land.

It should be kept in mind that these results only apply to the leagues included in the data. Traveling distance could still have an effect over longer distances, for example in international competitions such as the Champions League or between national teams.

BBC’s More Or Less on why the men’s FIFA rankings fail

One of the podcasts I listen to regularly, ‘More Or Less’ from the BBC, recently had an episode about the (men’s) FIFA rankings. In the episode they discuss a shortcoming in the ranking system that makes it possible for a team to lose points (and thus ranking positions) despite winning a match. The reason for this is not fully explained, but looking closer at the descriptions provided at fifa.com I think I see where the problem lies. After each match, rating points are given to the winner (or split if there is a draw). The crucial thing is that friendly matches (and other non-important matches) give fewer points than important tournament matches. The published ratings are then basically an average of the points earned for the matches played in the last couple of years. That means that winning a friendly match will sometimes yield fewer points than a team’s current average, thus decreasing the average.
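A small numerical illustration of the averaging problem (the point values are made up for the example, not actual FIFA figures):

```python
# A team's points from its last four matches (hypothetical values)
points = [900, 850, 880, 870]
old_average = sum(points) / len(points)   # 875.0

# A friendly win is worth fewer points than the team's current average
friendly_win = 500
new_average = sum(points + [friendly_win]) / (len(points) + 1)  # 800.0

# The average -- and hence the published rating -- drops despite the win
```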

Unfortunately, the episode did not mention the women’s FIFA ranking system, which is based on the much better Elo system used in chess rankings (and which I have written about previously). In this sort of system a win will almost surely give more points, and never fewer (the worst case scenario for a win is that no points are earned).

Dataset: Football stadiums with geographic coordinates

Here is a dataset I have put together with the locations and capacities of the stadiums of about 130 European teams. The teams are from England, Scotland, France, Germany and Spain. The data is taken from Wikipedia and should be correct for the last couple of seasons. The stadium listed for the French team Lille is their current one, in use from the 2012 season, while the stadium listed for Nice is not their current one, but the one they had until the end of last season.

I have also added a column with the team names as they are used in the data from football-data.co.uk.

Some of the coordinates are more accurate than others, but they should be accurate enough to at least give an indication of the town a team comes from. That probably also holds for the teams that have changed stadiums, since the new stadium is probably within the same town.

What can this data set be used for? One thing I want to look into is whether traveling distance for the visiting team in a match influences the home field advantage. I have a couple of other ideas as well, but that will be for another time.
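One ingredient needed for that is the distance between two pairs of coordinates. Here is a minimal great-circle (haversine) distance function in Python:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two points in decimal degrees
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius
```

For example, London to Paris comes out at roughly 343 km.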

Download data

Edit: Updated dataset after I found a trailing whitespace in the FDCOUK column.
Edit December 16 2013: Added two Spanish teams.

Elo ratings in football: Home field advantage

In my first post about Elo ratings in football I posted the code for an R function where you could adjust the ratings to account for home field advantage. The method is simple: some extra points are added to the home team’s rating when the match predictions (which are based on the ratings) are calculated. My implementation only supported giving the same fixed number of extra points to all teams in all games. In other words, it is assumed that all teams have the same home field advantage, and that the home field advantage does not change over time. This is of course unrealistic if the point of the ratings is to give as accurate predictions as possible. Still, the method is used in the FIFA Women’s World Ranking and in the World Football Elo Ratings.
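In Python (the original post used R), the fixed-bonus adjustment amounts to adding a constant to the home rating inside the usual Elo expected score; the 100-point bonus below is just an illustrative value:

```python
def expected_home_score(home_rating, away_rating, home_advantage=100):
    # Standard Elo expected score, with a fixed bonus added to the
    # home team's rating before the prediction is calculated
    adjusted = home_rating + home_advantage
    return 1 / (1 + 10 ** ((away_rating - adjusted) / 400))
```

With equal ratings and no bonus this gives 0.5, while a 100-point bonus lifts the home team’s expected score to about 0.64.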

I know of two ways to implement a more dynamic (and more realistic) home field advantage. The ClubElo ratings (perhaps the best football rating site out there), developed by Lars Schiefler, let the home field advantage change over time, similar to how the ratings change. This is done by updating the home field advantage after each game, based on the home team’s performance. An article on the ClubElo site describes the details very well.

A rather different method is used in the pi-rating system, developed by Anthony Constantinou. Each team has two ratings, one describing performance when playing at home, the other when playing away. The cool thing is that the two ratings are not calculated separately from each other. It is not the case that only the home matches are used to calculate the home rating and the away matches the away rating; after each match both ratings are updated. The home rating of the home team is updated almost like a regular Elo rating, and the away rating is then also updated based on how much the home rating changed, scaled by a factor. That way the two ratings are allowed to deviate from each other, giving rise to an adaptive, team-specific home field advantage. The same procedure is of course also applied to the away team’s ratings.
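A loose sketch of that update scheme (not Constantinou’s exact formulas; the learning rate and the cross-update factor are made-up values for illustration):

```python
def update_team(ratings, error, at_home, learning_rate=0.1, cross_factor=0.3):
    # ratings: dict with 'home' and 'away' ratings for one team
    # error: observed minus expected goal difference, seen from this team
    primary = 'home' if at_home else 'away'
    secondary = 'away' if at_home else 'home'
    change = learning_rate * error
    ratings[primary] += change                   # updated almost like regular Elo
    ratings[secondary] += cross_factor * change  # the other rating follows, scaled
    return ratings

# A home team that beat expectations by one goal
team = update_team({'home': 0.0, 'away': 0.0}, error=1.0, at_home=True)
```

The two ratings move together but not in lockstep, which is what lets a team-specific home field advantage emerge over time.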

The pi-ratings are, by the way, interesting in other ways besides the method for determining the home field advantage. Instead of the ratings being somewhat arbitrary numbers, as in most Elo systems, the pi-ratings directly model goal differences. The details are described in the paper Determining the level of ability of football teams by dynamic ratings based on the relative discrepancies in scores between adversaries. A draft of the paper is also available at the pi-ratings website. While I am at it, I can also recommend Constantinou’s other papers on football prediction.

Degenerate DNA sequences as regular expressions

DNA molecules are described using a string made up of four different letters, each representing a nucleotide base on one strand of the double helix: A, C, G and T. Sometimes, however, there is a need to represent several possible nucleotides in a given position. One example arises when the sequencing does not work perfectly. Another is when a binding site for proteins is described.

There are several ways to represent these ambiguous or degenerate positions. Take for example this description of the AGL1 binding site motif from AGRIS:

NTT(A/G/T)CC(A/T)(A/T)(A/T)(A/T)NNGG(A/T)AAN

In this sequence we see two different methods for describing the degenerate positions. In some positions we see the letter N, which means that any of the four nucleotides matches in that position. In the positions where more than one, but not all four, bases are allowed, this is represented using parentheses and slashes. Despite the meaning being obvious, this convention is as far as I know considered non-standard. The use of the letter N, however, is an accepted standard, and there are standard one-letter representations for all possible two- and three-nucleotide ambiguities. For example, the letter D is used to represent a position where A, G and T are allowed. This means that the fourth position in the above motif could be represented using D instead of (A/G/T).

Suppose now that you want to find out if this motif occurs in a DNA sequence. A simple text search with the motif as it is written above will obviously not do. One solution is to turn the motif into a regular expression. There are many ways the motif could be described as a regular expression, but I will take advantage of the fact that the motif already is very similar to a regular expression pattern. The only thing we need to do is convert it into the correct syntax.

In regular expressions the ambiguities can be described using square brackets. Position 4 in the motif becomes [AGT] instead of (A/G/T). A simple find-and-replace, replacing the parentheses with square brackets and removing the slashes, will do the trick for this kind of notation. The positions described with the single letter N can similarly be replaced with [ACGT].

But wait! What if the sequence you want to search itself contains ambiguities? Let’s at least hope the ambiguities are represented using the single-letter standard and not the parentheses-and-slash method. First, to avoid matching wrong motifs, we have to establish that an A in the motif only matches an A in the sequence, and not a D, for example, even if an A is possible in that position. This is already the case, as an A in a regular expression of course does not match anything other than an A.

What we need to do is change the square bracket ambiguities in the regular expression so that they also match the appropriate ambiguity codes in the target sequence. Take the letter N, which is now encoded as [ACGT]. Of course we need to add an N, but that’s not all: we also need to add all the other letters in the code. N therefore becomes [ACGTRYMKSWHBVDN]. Similarly, D becomes [AGTDRWK].

But would it not be easier to represent D, which according to the link above means ‘not C’, as [^C]? The problem is that [^C] can match letters that do not exclude C. [^C] would for instance match N, which stands for all four nucleotides, including C.

Here is a complete table of regular expression patterns for all degenerate DNA bases:

Nucleotide Regexp pattern
B [CGTBSKY]
D [AGTDRWK]
H [ACTHMYW]
K [GTK]
M [ACM]
N [ACGTBDHKMNRSVWY]
R [AGR]
S [CGS]
V [ACGVMSR]
W [ATW]
Y [CTY]

Using the simple conversion from above (suitable when the target sequence contains no ambiguity codes other than N), the AGL1 binding site motif becomes
[ACGTN]TT[AGT]CC[AT][AT][AT][AT][ACGTN][ACGTN]GG[AT]AA[ACGTN]

Using the full table instead, so that all ambiguity codes in the target sequence are matched as well, it becomes
[ACGTBDHKMNRSVWY]TT[AGTDRWK]CC[ATW][ATW][ATW][ATW][ACGTBDHKMNRSVWY][ACGTBDHKMNRSVWY]GG[ATW]AA[ACGTBDHKMNRSVWY]
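The conversion using the table can be automated. A small Python sketch:

```python
import re

# Regular expression class for each degenerate base (from the table above);
# plain A, C, G and T map to themselves
DEGENERATE = {
    'B': '[CGTBSKY]',  'D': '[AGTDRWK]', 'H': '[ACTHMYW]',
    'K': '[GTK]',      'M': '[ACM]',     'N': '[ACGTBDHKMNRSVWY]',
    'R': '[AGR]',      'S': '[CGS]',     'V': '[ACGVMSR]',
    'W': '[ATW]',      'Y': '[CTY]',
}

def motif_to_regex(motif):
    # Convert a one-letter-code motif into a regular expression that
    # also matches degenerate positions in the target sequence
    return ''.join(DEGENERATE.get(base, base) for base in motif)

# The AGL1 motif written with one-letter codes
pattern = motif_to_regex('NTTDCCWWWWNNGGWAAN')
```

Searching with re.search(pattern, sequence) then finds the motif in a sequence that may itself contain ambiguity codes.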

How to determine which football team is best? Statistical power and experimental design

Luck plays a significant part in a football match. Because of this we cannot be absolutely sure that the winning team in a match is the best one. Some researchers have taken a closer look at this by viewing a football match as an experiment used to determine which team is the best (Soccer matches as experiments: how often does the ‘best’ team win? by G. K. Skinner & G. H. Freeman, link). They found that in matches where the goal difference was less than about 3 or 4 goals, we could in general not be more than 90% sure that the best team won. This led the authors to call a football match “a badly designed experiment”.

While a single match could only hope to determine which of two teams is the best, we need a lot more matches to determine the best team among several: we need a competition. There are several ways the teams can play against each other in a competition. Perhaps the most common format is the all-versus-all format found in most national leagues, where every team plays against every other team. Another common type is the knockout tournament, the format used in the final stage of many international competitions such as the FIFA World Cup.

If we suppose the goal of a competition is to determine the best team, we can view the competition as an experimental setup. Both competition formats have pros and cons. In an all-versus-all league (hereafter just referred to as a league) the teams usually play each other twice during the season, once at each team’s home field. We thus get a repetition of each pairing, and we also get to control for home field advantage. This may or may not be the case in knockout tournaments (hereafter just referred to as tournaments). In the knockout stage of the FIFA World Cup the teams facing each other play only a single match, and no team except the one representing the host nation has a home field advantage (if it reaches the knockout stage, that is). The UEFA Champions League knockout stage operates with 2-leg ties, where the two teams play each other twice.

The question of whether different types of competitions are better or worse at correctly identifying the best team relates to the statistical concept of power. In short, the power of an experimental procedure is the probability of confirming the alternative hypothesis when the alternative is true. In terms of identifying the best team, the hypotheses can be stated as

H0: Team X is not the best team
H1: Team X is the best team

So what we want to figure out is the probability of team X winning the competition if it truly is the best team. The power of an experiment depends on a couple of factors: the number of observations, the size of the effect, and the experimental procedure itself. In a football competition the number of observations and the procedure are greatly confounded, since the number of matches is determined by the competition format. Many more matches are played in a league than in a tournament: in a league with N teams, N(N-1) matches have to be played, compared to only N-1 matches in a 1-leg knockout tournament (where each team plays at most log2(N) of them). The effect size in this context is how much better the best team is than the other teams.

Power analysis can be rather difficult to do analytically except in the simplest models, so one approach is to do simulations. For the simulations I did here, I decided to use Elo ratings (which I have written about here) to generate some ratings and then simulate a competition. This way we know which team is the best, and by simulating the competition many times over we can estimate the probability that the best team wins. The Elo ratings can be used directly to calculate the chances of winning and losing a match and are therefore a simple way to do this. Elo ratings have some drawbacks, however. The most obvious one is that it is impossible to calculate the probability of a draw. This may be a problem for the simulations of the league competitions. Hopefully the results do not suffer too much because of this in the long run, since the probability of winning, as calculated by the ratings, includes half of the probability of drawing.

For the simulations I generated uniformly distributed ratings for 16 teams. By changing the upper and lower bounds of the uniform distribution we can change the competitiveness of the league; we can think of this as the variability in effect size. I used two sets of bounds: one where the ‘win percentage’ between a team at the upper bound and a team at the lower bound was 90%, and one where it was 75%. New ratings were generated for each simulation, and for each setting 100000 competitions were simulated. For the tournament simulations I also looked at two different initial seedings. One was completely random. The other was better informed: each team from the top half was initially matched up against a team from the bottom half, but otherwise the seeding was random.
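A stripped-down Python version of the simulation (far fewer simulations than in the real experiment, draws ignored, and the rating bounds are just illustrative values):

```python
import random

def win_prob(rating_a, rating_b):
    # Elo expected score for team A against team B
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def league_winner(ratings):
    # All-versus-all, each pair meeting twice; ties in wins broken arbitrarily
    n = len(ratings)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                if random.random() < win_prob(ratings[i], ratings[j]):
                    wins[i] += 1
                else:
                    wins[j] += 1
    return max(range(n), key=lambda i: wins[i])

def knockout_winner(ratings):
    # 1-leg knockout tournament with a completely random draw
    teams = list(range(len(ratings)))
    random.shuffle(teams)
    while len(teams) > 1:
        teams = [a if random.random() < win_prob(ratings[a], ratings[b]) else b
                 for a, b in zip(teams[::2], teams[1::2])]
    return teams[0]

random.seed(1)
n_draws, n_sim = 20, 100
league_hits = cup_hits = 0
for _ in range(n_draws):
    # New ratings for each batch of simulated competitions
    ratings = [random.uniform(1400, 1800) for _ in range(16)]
    best = max(range(16), key=lambda i: ratings[i])
    league_hits += sum(league_winner(ratings) == best for _ in range(n_sim))
    cup_hits += sum(knockout_winner(ratings) == best for _ in range(n_sim))

p_league = league_hits / (n_draws * n_sim)
p_cup = cup_hits / (n_draws * n_sim)
```

Even with this small number of simulations, the best team wins the league clearly more often than it wins the tournament.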

Here are the results:

Rank League (90) League (75) Unseeded (90) Unseeded (75) Seeded (90) Seeded (75)
1 0.3763 0.2523 0.2121 0.1368 0.2196 0.1438
2 0.2432 0.1972 0.1732 0.1236 0.1804 0.1296
3 0.1561 0.1488 0.1397 0.1095 0.1434 0.1124
4 0.0991 0.1127 0.1114 0.0965 0.1167 0.101
5 0.0556 0.0841 0.0891 0.0854 0.0924 0.0897
6 0.0344 0.0635 0.0702 0.0746 0.0732 0.079
7 0.0178 0.0455 0.0544 0.0658 0.0578 0.07
8 0.0096 0.0316 0.041 0.0565 0.0432 0.0595
9 0.0046 0.0223 0.0315 0.05 0.0211 0.0426
10 0.002 0.0152 0.0245 0.0444 0.0165 0.0376
11 0.0009 0.0104 0.0178 0.0374 0.0114 0.0313
12 0.0003 0.0067 0.0124 0.0318 0.0093 0.0287
13 0.0001 0.0044 0.0092 0.0279 0.0062 0.0238
14 0 0.0028 0.0063 0.0235 0.0042 0.0201
15 0 0.0014 0.0045 0.0203 0.0028 0.0168
16 0 0.0011 0.0027 0.0161 0.0018 0.0141

The league format is, unsurprisingly, much better at determining the best team than the tournaments. What I found most surprising was how little effect the seeding in a tournament has: for both the more and the less competitive tournaments, the chance of correctly identifying the best team increases by less than one percentage point.

FIFA Women’s World Ranking and goal difference in Elo ratings.

The FIFA rankings for women’s national teams use a quite different methodology than the one used for ranking men’s national teams. The Women’s World Ranking (WWR) is based on the Elo rating system I wrote about in the previous post. The details of the men’s ranking can be found here, and the details of the women’s ranking can be found here.

One thing that makes the WWR interesting is how goal differences are accounted for in the ratings. This is not something found in the ordinary chess-based Elo ratings. The method used in the WWR is to let the ‘winning percentage’ depend on two things: the goal difference and the number of goals the losing team scored. This is in contrast to the original Elo ratings, where the winning team wins 100% and the losing team wins 0% (a draw is 50%). In the WWR a team can never win 100%; the most it can win is 99%, which is the case when the goal difference is more than 6 and the losing team hasn’t scored any goals. The table below is from the pdf file linked to above and shows how many percent the losing team wins.

[Table: wwrtable — the losing team’s winning percentage by goal difference and goals scored]

One strange thing about this table is the column for goal difference 0, implying a draw. I am guessing this is an error, since it means that the winning percentage for the losing team would be greater than for the winning team. In the paper I mentioned in my last post, where a number of different rating methods were compared (The predictive power of ranking systems in association football by J. Lasek et al.), it was assumed that a draw would yield both teams 50%, as in the original Elo ratings. That paper also showed that the Women’s World Ranking was among the rating systems with the best predictions.

Here is a plot showing the win percentage when the losing team has scored 0 and 5 goals. We can see that the losing team does not gain much by scoring one extra goal (assuming the goal difference stays the same, which of course is dubious); most of the gain in winning percentage comes when a team scores a goal that takes the result from a draw to a win.

[Figure: WWRwinpercentage — win percentage when the losing team has scored 0 and 5 goals]

The table above only goes up to 5 goals for the losing team, but for the sake of implementation it is easy to generalize the rule for how much the losing team gains by scoring an additional goal (with the goal difference staying the same). For example, the losing team gains 0.9 extra winning percentage points per goal when the goal difference is 2. Similarly, the gain is 0.6 percentage points per goal when the goal difference is 5.

Below is an R function to compute the win percentage for football matches. It takes as input two vectors with the number of goals scored by the two opponents and returns a vector of win percentages (numbers between 0 and 1) for the first team.

winPercentageWWR <- function(team1Goals, team2Goals){
  #calculates the win percentage for team 1.

  stopifnot(length(team1Goals) == length(team2Goals))
  
  perc <- c(0.01, 0.02, 0.03, 0.04, 0.08, 0.15, 0.50, 0.85, 0.92, 0.96, 0.97, 0.98, 0.99)
  add <- c(0, 0.01, 0.009, 0.008, 0.007, 0.006, 0.005)
  
  goalDifferences <- (team1Goals - team2Goals)
  goalDifferences[goalDifferences < -6] <- -6
  goalDifferences[goalDifferences > 6] <- 6

  team1WinPercentage <- numeric(length=length(goalDifferences))

  for (idx in 1:length(goalDifferences)){

    team1WinPercentage[idx] <- perc[goalDifferences[idx]+7] -  
      sign(goalDifferences[idx])*min(team1Goals[idx], team2Goals[idx])*add[abs(goalDifferences[idx])+1]
  }
  return(team1WinPercentage) 
}