# The Bayesian Bradley-Terry model with draws

In the previous post I tried out the Stan software to implement two Bayesian versions of the Bradley-Terry (BT) model. One drawback of the Bradley-Terry model is that it can’t handle draws, which seriously hampers its utility in modelling sports data. That was one reason I used handball results rather than football results as the example, since draws are rare in handball.

One (of several) extension of the BT model that can handle draws is the Davidson model. This was developed in the 1970 paper ‘On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments‘. In short, the model adds a new parameter, $$\nu$$, which influences the probability of a draw. When $$\nu = 0$$, the model becomes the ordinary BT model.

In my Stan implementation below I use a Dirichlet prior on the ratings, like last time. The consequence of this is that the sum of all the ratings is 1. In the BT model this gives us the the nice interpretation that the rating the probability of a team of winning against an hypothetical average team. This property is not exactly carried over to the Davidson model, but a related property is. The ratio of two ratings, $$\pi_1 / \pi_2$$, is the probability that team 1 wins against team 2, applies to both the BT model and the Davidson model

In my implementation of the BT model I used the Bernoulli distribution to model the outcomes, which is appropriate when we only have two outcomes. As you can see from the code below, we now have to use the categorical distribution, since we now have three outcomes. I also use an exponential prior on $$\nu$$. Admittedly, I have no particular reason for this except that it is the traditional choice for parameters that has to have only positive values.

Anyway, here is the Stan code:

data {
int<lower=0> N; // N games
int<lower=0> P; // P teams

// Each team is referred to by an integer that acts as an index for the ratings vector.
int team1[N]; // Indicator arrays for team 1
int team2[N]; // Indicator arrays for team 1
int results[N]; // Results. 1 if home win, 2 if away won, 3 if a draw.

real<lower=0> nu_prior_rate;

vector[P] alpha; // Parameters for Dirichlet prior.
}

parameters {
// Vector of ratings for each player
// The simplex constrains the ratings to sum to 1
simplex[P] ratings;

// Parameter adjusting the probability of draw.
real<lower=0> nu;
}

model {

// Array of length 3 vectors for the three outcome probabilies for each game.
vector[3] result_probabilities[N];
real nu_rating_prod;

ratings ~ dirichlet(alpha); // Dirichlet prior on the ratings.

nu ~ exponential(nu_prior_rate); // exponential prior on nu.

for (i in 1:N){

// nu multiplied by the harmonic mean of the ratings.
nu_rating_prod = sqrt(ratings[team1[i]] * ratings[team2[i]]) * nu;

result_probabilities[i][3] = nu_rating_prod / (ratings[team1[i]] + ratings[team2[i]] + nu_rating_prod);
result_probabilities[i][1] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]] + nu_rating_prod);
result_probabilities[i][2] = 1 - (result_probabilities[i][1] + result_probabilities[i][3]);

results[i] ~  categorical(result_probabilities[i]);
}

}


Another thing I wanted to do this time was to do proper MCMC sampling, so we could get the Bayesian posterior credibility intervals. The sampling takes longer time than the optimization procedure I used last time, but it only took a few seconds to get a decent amount of samples.

For the reanalysis of the handball data from last time I set the Dirichlet prior parameters to 5 for all teams, and the rate parameter for the exponential prior on $$\nu$$ is 1. We can visualize the estimate of the ratings and their uncertainties (95% intervall) using a forest plot:

The results agree with the ones from last time, but this time we also see that the credibility intervals are rather large. This is perhaps not that surprising, since the amount of data is rather limited. The posterior (mean) point estimate for $$\nu$$ is 0.15.

But let’s take a look at some English Premier League football data. With the ordinary BT model this would not work so well since there’s a lot of draws in football. Ignoring them would not be tenable. Below are the ratings, with 95% credibility interval, based on data from the 2015-15 season, using the same prior parameters as in the handball data set. The league points are shown in parenthesis for comparison.

The ratings generally agree with the points, except in a few instances, where a team or two have switched places. Another interesting thing to notice is that the width of the intervals seem to be related to the magnitude of the rating. I am not exactly sure why that is, but I suspect its due to the fact that the ratings are in a sense binomial probabilities, and these are known to have greater variance the closer they are to 0.5.

The point estimate for $$\nu$$ is 0.85 for this data set. Compared to the 0.15 for the handball data, it is clear that this reflects the higher overall probability of draws in football. In the handball data set only 6 games ended in a draw, while in the football data set about 20% of the games was a draw.

# The comparison graph

In my last post on Elo ratings I used a graph to illustrate why it is hard to compare the strengths of teams that play in different leagues.

This is based on data from two half-seasons of English Premier League and the Championship. Each team is represented by a node or vertex, which is drawn as a circle. Each edge between the nodes is drawn as line between them and indicates that the two teams have against played each other at least once. I didn’t add team names to the graph, but the orange nodes are the teams that played in the Premier League in both seasons, while the blue nodes are teams that played in the Championship or got promoted or relegated. The graph shows both why comparisons between teams in different leagues can be difficult, and why including data from the Championship can improve the prediction of Premier league matches. In this post I will go more into details about what I call the comparison graph and how it can be used.

We can easily recognize a few patterns in the graph above. The most obvious one is the cluster of several teams where everyone (or nearly everyone) has played against each other. In a fully played season every team has played each other and the graph is said to be fully connected. If the graph is fully connected we should be able to have a good idea about the relative strengths between every team. Here is an example of a fully connected graph representing a fully played season with 10 teams.

Another important pattern is the lack of edges between two teams. If two teams hasn’t played each other, but both has played a third team, they are indirectly comparable. Here we see that both Team B and Team C has played team A, but they have not played each other.

If you are going to predict the outcome of a match between Team B and Team C this graph shows you that you should be careful. The information we have of the relative strengths between them is only indirect. This can in some situations where you have very limited data be almost the same as having no data at all. Suppose both Team C and Team B won huge victories over Team A. This would perhaps indicate that Team A is crap, but we would have very little indication which of Team B and Team C is better. If on the other hand Team A beat Team C, and Team B beat Team A, we would have had a strict ordering, so it does not automatically mean that we can’t make anything out of the data.

Another important pattern in a graph is whether there are any disconnected subgraphs. Here we have two or more groups of teams that has played only against other teams within their own group, but not against the teams in the other groups. In the first few rounds of a season we can see patterns like this.

There are a lot of interesting things you can do with the comparison graph, but that will make for a future post.

# The underdispersed Conway-Maxwell Poisson distribution and goal differences

I have unfortunately not had the time to look more closely at the performance for the underdispersed count distributions that I in my last post found to be useful for predicting football results. Here I am taking a quick look into how the Conway-Maxwell distribution (COM) influences the predicted goal differences compared to the Poisson distribution.

Using data from the 2010-11 Premier League season I fitted the both the Poisson model and the COM model. The estimated dispersion parameter for the COM model indicated that there was less variability in the actual goals scored than implied by the Poisson distrubtion. I used the code I posted here to compute the probability distributions for the goal-differences for five matches in the season.

Let’s first look at the goal difference distribution for Arsenal playing at home against Manchester City. Both teams were in the top of the final table, and the actual result for this game in the 2010-11 season was 0-0. Comparing the distributions from the Poisson and COM models we see that they are pretty much identical.

For Aston Villa vs. Sunderland, which placed 9th and 10th on the table in 2010-11, we also see that there is not much difference between the two models. Although there is a slight increase in the probability for the actual result in that game, I don’t think it is of much importance.

Let’s compare the models using two teams from the bottom of the table, Wigan vs. Wolverhampton. Again, not much difference. Also note that there is basically no change in the probability for the actual result.

OK, so far the comparisons have been based on teams of similar strengths. But take Blackburn (15th on the table) vs. Liverpool (6th). Now we see that the COM model and Poisson model differ a bit. Here the COM model does a worse prediction of the actual result compared to the Poisson model. But only considering the league positions, the skewing of the distribution in favor of Liverpool in this case may not be totally unreasonable.

The last plot is compares the distributions between Chelsea (2nd on the table) vs. Birmingham (18th). Here we clearly see that there is a substantial difference in the prediction between the two models. The COM model favors Chelsea much more than the Poisson model does, which in this case give a much higher probability for the correct result.

From just looking at these few plots, I think we can conclude that the (underdispersed) COM model differs from the Poisson model where there is a greater difference in strength between the two sides.

# Statistics I don’t believe in

Oslo is one of the candidate cities for hosting the 2022 winter Olympics, and of course there is a lot of discussion going on whether it costs too much etc. Opinion polls are conducted regularly, and the current status is that the majority of people don’t want Oslo to host the games. A couple of weeks ago the office dealing with the application put up the data from all polls conducted this year. What is weird is that they choose to illustrate the article detailing the data with the results from the question Do you wish that Norway should continue to participate in the winter Olympics and Paralympics? where only 73% of the respondents answered yes and 19% answered no.

First of all I thought the question was silly. Of course no one want Norway not to participate, but then the results show that almost one in five don’t want Norway to participate. At first i thought that it might be just a fluke, but then I downloaded the data and saw that this question had been asked on three occasions, and the results were pretty similar:

Even after seeing the consistent trend, I still don’t believe the numbers are correct. I think this shows one of the potential pitfalls in opinion polls. Since it is really hard to get the nuances of peoples beliefs in questions with just a few possible answers, it may be tempting to ask more detailed questions. The danger is that the questions become more confusing, or they even become irrelevant. I think both these things are can explain what is going on here.

# A short summer update

Hi everybody, I haven’t had written much lately as I have been busy with other things, so this is just a short update on what is going on with my life.

This May I handed in my M.Sc. thesis in bioinformatics at the Norwegian University of Life Sciences. In it I investigated the prevalence of cis-regulatory elements (CREs) in plant promoters across 25 species. CREs are short DNA sites (typically around 12 base pairs long) that plays a role in gene regulation. They typically work by having high binding affinity for proteins that somehow regulates RNA transcription. I used genomic data (sequences and annotations) from the PLAZA server. I then predicted CREs by using motifs from the JASPAR, AtcisDB and PLACE databases. I then investigated whether predicted CREs that was found to be conserved in many gene families were more co-expressed than CREs that were not conserved and conserved only in a few families. I found that in general, this was not the case.

Right after my thesis was done I started my new job as a statistician in the Department of Health Services Research at Akershus University Hospital right outside Oslo. I am very exited about this as I get to work on a lot of interesting and meaningful research projects with a bunch of really great people.

I also spent a week backpacking in Belgium this summer with my girlfriend. The atmosphere was great in Brussels when Belgium beat the US in the World Cup, the weather was nice when we hiked in the Ardennes near the French border and everything was excellent when we rode our bikes in Bruges. Antwerp had the best food and the most beautiful train station I have ever been to. The most bizarre thing we experienced was perhaps the Bruggian love for embroidered cats dressed as humans:

Until next time!