How to determine which football team is best? Statistical power and experimental design

Luck plays a significant part in a football match. Because of this, we cannot be absolutely sure that the winning team in a match is the better one. Some researchers have taken a closer look at this by viewing a football match as an experiment used to determine which team is the best (Soccer matches as experiments: how often does the ‘best’ team win? by G. K. Skinner & G. H. Freeman, link). They found that in matches where the goal difference was less than about 3 or 4 goals, we could in general not be more than 90% sure that the best team won. This led the scientists to call a football match “a badly designed experiment”.

A single match can at best determine which of two teams is the better one. To determine the best team among several we need a lot more matches: we need to hold a competition. There are several ways in which the teams can play against each other in a competition. Perhaps the most common is the all versus all format we find in most national leagues, where every team plays against every other team in the league. Another common type of competition is the knockout tournament. This is the format used in the last stage of many international competitions, like the FIFA World Cup.

If we suppose the goal of a competition is to determine the best team, we can see the competition as an experimental setup. Both competition formats have pros and cons. In an all versus all league (hereafter just referred to as a league) the teams often play each other twice during the season, once at each team’s home field. We thus get a repetition of each pair of teams, and we also get to control for home field advantage. This may or may not be the case in knockout tournaments (hereafter just referred to as tournaments). In the knockout stage of the FIFA World Cup the teams facing each other play only a single match. Also, no team except the one representing the host nation has a home field advantage (if it reaches the knockout stage, that is). The UEFA Champions League knockout stage uses 2-leg matches, where each pair of teams plays each other twice.

The question of whether different types of competitions are better or worse at correctly identifying the best team relates to the statistical concept of power. In short, the power of an experimental procedure is the probability of rejecting the null hypothesis, and thereby accepting the alternative hypothesis, when the alternative is true. In terms of identifying the best team the hypotheses can be stated as

H0: Team X is not the best team
H1: Team X is the best team

So what we want to figure out is the probability of team X winning the competition if it truly is the best team. The power of an experiment depends on a couple of factors: the number of observations, the size of the effect and the experimental procedure itself. In a football competition the number of observations and the procedure are heavily confounded, since the number of matches is central to the competition format. A lot more matches are played in a league than in a tournament. In a league with N teams where every pair meets twice, N(N-1) matches have to be played. A 1-leg tournament needs only N-1 matches in total, spread over log2 N rounds, so the eventual winner plays just log2 N matches. With 16 teams that is 240 league matches against 15 tournament matches. The effect size in this context is how good the best team is compared to the other teams.
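To make the difference in the number of observations concrete, here is a small sketch of the match counts for the two formats. The function names are my own, and the knockout functions assume the number of teams is a power of two and that every round is a single match.

```python
import math

def league_matches(n_teams: int) -> int:
    """Matches in a double round robin: each ordered (home, away) pair plays once."""
    return n_teams * (n_teams - 1)

def knockout_matches(n_teams: int) -> int:
    """Matches in a 1-leg knockout: every match eliminates exactly one team."""
    return n_teams - 1

def knockout_rounds(n_teams: int) -> int:
    """Rounds in a 1-leg knockout (assumes n_teams is a power of two)."""
    return int(math.log2(n_teams))

print(league_matches(16), knockout_matches(16), knockout_rounds(16))
```

The number of rounds is also the number of matches the eventual tournament winner plays, which is why a single upset is so costly in that format.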

Power analysis can be rather difficult to do analytically except in the simplest models. One way to do a power analysis is therefore to do simulations. For the simulations I did here I decided to use Elo ratings (which I have written about here) to generate some ratings and then simulate a competition. Since we generate the ratings ourselves, we know which team is the best. By simulating the competition many times over we get an estimate of the probability that the best team wins. The Elo ratings can be used directly to calculate the chances of winning and losing a match, and are therefore a simple way to do this. Elo ratings have some drawbacks, however. The most obvious one that comes to my mind is that the ratings alone do not give the probability of a draw. This may be a problem for the simulations of the league competitions. Hopefully, the results do not suffer too much because of this in the long run, since the probability of winning, as calculated from the ratings, includes half of the probability of drawing.
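For reference, this is a minimal sketch of how a win probability can be calculated from two ratings. It assumes the standard Elo logistic curve with a scale of 400 rating points. The exact variant used for the simulations is not spelled out here, so treat the constant and the function name as assumptions.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of team A against team B.

    Assumes the standard Elo formula with a 400-point scale. The value
    is P(win) + 0.5 * P(draw), so it is used here as a plain win
    probability in a model without draws.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(elo_win_probability(1600, 1500))
```

With these assumptions a 100-point advantage gives a win probability of about 0.64, and two equally rated teams each get 0.5.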

For the simulations I generated uniformly distributed ratings for 16 teams. By changing the upper and lower bounds of the uniform distribution we can change the competitiveness of the competition. I used two sets of bounds: one where the ‘win percentage’ between the two bounds (the chance that a team rated at the upper bound beats a team rated at the lower bound) was 90%, and one where it was 75%. We can think of this as the variability in effect size. New ratings were generated for each simulation, and each result is based on 100000 simulated competitions. For the tournament simulations I also looked at two different initial seedings. One was a completely random draw. The other was better informed: each team in the top half was initially matched up against one of the teams in the bottom half. Otherwise it was random.
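The procedure described above can be sketched roughly as follows. This is not the code behind the results, only an illustration under several assumptions of mine: the standard 400-point Elo curve, an arbitrary lower bound of 1500, no home field advantage, a league decided by number of wins with ties broken at random, and a seeded draw where only the first round is constrained. All function names are hypothetical.

```python
import math
import random

def elo_win_probability(r_a, r_b):
    """Standard Elo expected score (400-point scale assumed)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def rating_spread(win_pct):
    """Rating gap at which the stronger team wins with probability win_pct."""
    return 400.0 * math.log10(win_pct / (1.0 - win_pct))

def play(r_a, r_b, rng):
    """True if the first team wins; draws are not modelled."""
    return rng.random() < elo_win_probability(r_a, r_b)

def league_winner(ratings, rng):
    """Double round robin; most wins takes the title, ties broken at random."""
    n = len(ratings)
    wins = [0] * n
    for home in range(n):
        for away in range(n):
            if home != away:
                winner = home if play(ratings[home], ratings[away], rng) else away
                wins[winner] += 1
    best = max(wins)
    return rng.choice([i for i in range(n) if wins[i] == best])

def knockout_winner(ratings, rng, seeded=False):
    """1-leg knockout; the number of teams must be a power of two."""
    n = len(ratings)
    order = sorted(range(n), key=lambda i: ratings[i], reverse=True)
    if seeded:
        # Each top-half team meets a random bottom-half team in round one.
        top, bottom = order[: n // 2], order[n // 2 :]
        rng.shuffle(bottom)
        pairs = list(zip(top, bottom))
        rng.shuffle(pairs)  # bracket position of each pairing is random
        bracket = [team for pair in pairs for team in pair]
    else:
        bracket = order[:]
        rng.shuffle(bracket)
    while len(bracket) > 1:
        bracket = [
            a if play(ratings[a], ratings[b], rng) else b
            for a, b in zip(bracket[0::2], bracket[1::2])
        ]
    return bracket[0]

def estimate_power(simulate, n_teams=16, win_pct=0.90, n_sims=2000, seed=1):
    """Share of simulated competitions won by the top-rated team."""
    rng = random.Random(seed)
    spread = rating_spread(win_pct)
    hits = 0
    for _ in range(n_sims):
        # New ratings for every simulated competition.
        ratings = [rng.uniform(1500.0, 1500.0 + spread) for _ in range(n_teams)]
        best = max(range(n_teams), key=lambda i: ratings[i])
        hits += simulate(ratings, rng) == best
    return hits / n_sims

if __name__ == "__main__":
    print("league  ", estimate_power(league_winner))
    print("unseeded", estimate_power(knockout_winner))
    print("seeded  ", estimate_power(lambda r, g: knockout_winner(r, g, seeded=True)))
```

The default of 2000 simulations keeps the sketch fast in pure Python. Getting stable estimates to several decimals needs far more runs, and the numbers it produces will only match the reported ones to the extent that the assumptions above match the original setup.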

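Here are the results. Each column shows, for teams of each true rank (1 being the best), the proportion of simulated competitions won by the team of that rank: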

Rank  League (90)  League (75)  Unseeded (90)  Unseeded (75)  Seeded (90)  Seeded (75)
   1       0.3763       0.2523         0.2121         0.1368       0.2196       0.1438
   2       0.2432       0.1972         0.1732         0.1236       0.1804       0.1296
   3       0.1561       0.1488         0.1397         0.1095       0.1434       0.1124
   4       0.0991       0.1127         0.1114         0.0965       0.1167       0.1010
   5       0.0556       0.0841         0.0891         0.0854       0.0924       0.0897
   6       0.0344       0.0635         0.0702         0.0746       0.0732       0.0790
   7       0.0178       0.0455         0.0544         0.0658       0.0578       0.0700
   8       0.0096       0.0316         0.0410         0.0565       0.0432       0.0595
   9       0.0046       0.0223         0.0315         0.0500       0.0211       0.0426
  10       0.0020       0.0152         0.0245         0.0444       0.0165       0.0376
  11       0.0009       0.0104         0.0178         0.0374       0.0114       0.0313
  12       0.0003       0.0067         0.0124         0.0318       0.0093       0.0287
  13       0.0001       0.0044         0.0092         0.0279       0.0062       0.0238
  14       0.0000       0.0028         0.0063         0.0235       0.0042       0.0201
  15       0.0000       0.0014         0.0045         0.0203       0.0028       0.0168
  16       0.0000       0.0011         0.0027         0.0161       0.0018       0.0141

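The league format is, unsurprisingly, much better at determining the best team than the tournaments. With the 90% bounds the best team wins the league in about 38% of the simulations, compared to about 21-22% for the tournaments. With the 75% bounds the figures are about 25% and 14%. What I found most surprising was how little effect the seeding in a tournament has. For both the more and the less competitive settings, the chance of correctly identifying the best team increases by less than one percentage point.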