Calculating the probability for both teams to score in R

I recently read an interesting paper that was published last year, by Igor Barbosada Costa, Leandro Balby Marinho and Carlos Eduardo Santos Pires: Forecasting football results and exploiting betting markets: The case of “both teams to score. In the paper they try out different approaches for predicting the probability of both teams to score at least one goal (“both teams to score”, or BTTS for short). A really cool thing about the paper is that they actually used my goalmodel R package. This is the first time I have seen my package used in a paper, so im really excited bout that! In addition to using goalmodel, they also tried a few other machine learning approaches, where insted of trying to model the number of goals by using the Poisson distribution, they train the machine learning models directly on the both-teams-to-score outcome. They found that both approaches had similar performance.

I have to say that I personally prefer to model the scorelines directly using Poisson-like models, rather than trying to fit a classifier to the outcome you want to predict, whether it is over/under, win/draw/lose, or both-teams-to-score. Although I can’t say I have any particularly rational arguments in favour of the goal models, I like the fact that you can use the goal model to compute the probabilities of any outcome you want from the same model, without having to fit separate models for every outcome you are interested in. But then again, goalmodels are of course limited by the particlar model you choose (Poisson, Dixon-Coles etc), which could give less precice predictions on some of these secondary outcomes.

Okay, so how can you compute the probability of both teams to score using the Poisson based models from goalmodel? Here’s what the paper says:

This is a straightforward approach where they have the matrix with the probabilities of all possible scorelines, and then just add together the probabilities that correspond to the outcome of at least one team to score no goals. And since this is the exact opposite outcome (the complement) of both teams to score, you take one minus this probability to get the BTTS probability. I assume thay have used the predict_goals() function to get the matrix in question from a fitted goal model. In theory, the Poisson model allows to an infinite amount of goals, but in practice it is sufficient to just compute the matrix up to 10 or 15 goals.

Heres a small self-contained example of how the matrix with the score line probabilities is computed by the predict_goals() function, and how you can compute the BTTS probability from that.

# Expected goals by the the two oppsing teams.
expg1 <- 1.1
expg2 <- 1.9

# The upper limit of how many goals to compute probabilities for.
maxgoal <- 15

# The "S" matrix, which can also be computed by predict_goals().
# Assuming the independent Poisson model.
probmat <- dpois(0:maxgoal, expg1) %*% t(dpois(0:maxgoal, expg2))

# Compute the BTTS probability using the formula from the paper.
prob_btts <- 1 - (sum(probmat[2:nrow(probmat),1]) + sum(probmat[1,2:ncol(probmat)]) + probmat[1,1])


In this example the probability of both teams to score is 56.7%.

In many models, like the one used above, it is assumed that, the goal scoring probabilites for the two oppsing teams are statistically independent (given that you provide the expected goals). For this type of model there is a much simpler formula for computing the BTTS probability. Again, you compute the probability of at each of the teams to not score, and then take 1 minus these probabilities. And the probability of at least one team to score is just the product of the probability of both teams to score. In mathematical notation this is

$$P(BTTS) = (1 – P(X = 0)) \times (1 – P(Y = 0))$$

where X is the random variable for the number of goals scored by the first team, and Y is the corresponding random variable for the second team. The R code for this formula for the independent Poisson model is

prob_btts_2 <- (1 - dpois(0, lambda = expg1)) * (1 - dpois(0, lambda = expg2))


which you can verify gives the same result as the matrix-approach. This formula only works with the statistical independence assumption. The Dixon-Coles and bivariate Poisson (not in the goalmodel package) models are notable models that does not have this assumption, but instead have some dependence (or correlation) between the goal scoring probabilities for the two sides.

I have also found a relatively simple formula for BTTS probability for the Dixon-Coles model. Recall that the Dixon-Coles model applies an adjustment to low scoring outcomes (less than two goals scored by either team), shifting probabilities to (or from) 0-0 and 1-1 to 1-0 and 0-1 outcomes. The amount of probability that is shifted depends on the parameter called rho. For the BTTS probability, it is the adjustment to the probability of the 1-1 outcome that is of interest. The trick is basically to subtract the probability of the 1-1 outcome for the underlying independent model without the Dixon-Coles adjustment, and then add back the Dixon-Coles adjusted 1-1 probability. The Dixon-Coles adjustment for the 1-1 outcome is simply 1 – rho, which does not depend on the expected goals of the two sides.

Here is some R code that shows how to apply the adjustment:


rho <- 0.13

# 1-1 probability for the independent Poisson model.
p11 <- dpois(1, lambda = expg1) * dpois(1, lambda = expg2)

dc_correction <- (p11 * (1-rho)) - p11

# Apply the corrections
prob_btts_dc  <- prob_btts_2 + dc_correction


If you run this, you will see that the BTTS probability decreases to 55.4% when rho = 0.13.

I have added two functions for computing BTTS probabilities in the new version 0.6 of the goalmodel package, so be sure to check that out. The predict_btts() function works just like the other predict_* functions in the package, where you give the function a fitted goalmodel, together with the fixtures you want to predict, and it gives you the BTTS probability. The other function is pbtts(), which works independently of a fitted goalmodel. Instead you just give it the expected goals, and other paramters like the Dixon-Coles rho parameter, directly.

The Bayesian Bradley-Terry model with draws

In the previous post I tried out the Stan software to implement two Bayesian versions of the Bradley-Terry (BT) model. One drawback of the Bradley-Terry model is that it can’t handle draws, which seriously hampers its utility in modelling sports data. That was one reason I used handball results rather than football results as the example, since draws are rare in handball.

One (of several) extension of the BT model that can handle draws is the Davidson model. This was developed in the 1970 paper ‘On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments‘. In short, the model adds a new parameter, $$\nu$$, which influences the probability of a draw. When $$\nu = 0$$, the model becomes the ordinary BT model.

In my Stan implementation below I use a Dirichlet prior on the ratings, like last time. The consequence of this is that the sum of all the ratings is 1. In the BT model this gives us the the nice interpretation that the rating the probability of a team of winning against an hypothetical average team. This property is not exactly carried over to the Davidson model, but a related property is. The ratio of two ratings, $$\pi_1 / \pi_2$$, is the probability that team 1 wins against team 2, applies to both the BT model and the Davidson model

In my implementation of the BT model I used the Bernoulli distribution to model the outcomes, which is appropriate when we only have two outcomes. As you can see from the code below, we now have to use the categorical distribution, since we now have three outcomes. I also use an exponential prior on $$\nu$$. Admittedly, I have no particular reason for this except that it is the traditional choice for parameters that has to have only positive values.

Anyway, here is the Stan code:

data {
int<lower=0> N; // N games
int<lower=0> P; // P teams

// Each team is referred to by an integer that acts as an index for the ratings vector.
int team1[N]; // Indicator arrays for team 1
int team2[N]; // Indicator arrays for team 1
int results[N]; // Results. 1 if home win, 2 if away won, 3 if a draw.

real<lower=0> nu_prior_rate;

vector[P] alpha; // Parameters for Dirichlet prior.
}

parameters {
// Vector of ratings for each player
// The simplex constrains the ratings to sum to 1
simplex[P] ratings;

// Parameter adjusting the probability of draw.
real<lower=0> nu;
}

model {

// Array of length 3 vectors for the three outcome probabilies for each game.
vector[3] result_probabilities[N];
real nu_rating_prod;

ratings ~ dirichlet(alpha); // Dirichlet prior on the ratings.

nu ~ exponential(nu_prior_rate); // exponential prior on nu.

for (i in 1:N){

// nu multiplied by the harmonic mean of the ratings.
nu_rating_prod = sqrt(ratings[team1[i]] * ratings[team2[i]]) * nu;

result_probabilities[i][3] = nu_rating_prod / (ratings[team1[i]] + ratings[team2[i]] + nu_rating_prod);
result_probabilities[i][1] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]] + nu_rating_prod);
result_probabilities[i][2] = 1 - (result_probabilities[i][1] + result_probabilities[i][3]);

results[i] ~  categorical(result_probabilities[i]);
}

}


Another thing I wanted to do this time was to do proper MCMC sampling, so we could get the Bayesian posterior credibility intervals. The sampling takes longer time than the optimization procedure I used last time, but it only took a few seconds to get a decent amount of samples.

For the reanalysis of the handball data from last time I set the Dirichlet prior parameters to 5 for all teams, and the rate parameter for the exponential prior on $$\nu$$ is 1. We can visualize the estimate of the ratings and their uncertainties (95% intervall) using a forest plot:

The results agree with the ones from last time, but this time we also see that the credibility intervals are rather large. This is perhaps not that surprising, since the amount of data is rather limited. The posterior (mean) point estimate for $$\nu$$ is 0.15.

But let’s take a look at some English Premier League football data. With the ordinary BT model this would not work so well since there’s a lot of draws in football. Ignoring them would not be tenable. Below are the ratings, with 95% credibility interval, based on data from the 2015-15 season, using the same prior parameters as in the handball data set. The league points are shown in parenthesis for comparison.

The ratings generally agree with the points, except in a few instances, where a team or two have switched places. Another interesting thing to notice is that the width of the intervals seem to be related to the magnitude of the rating. I am not exactly sure why that is, but I suspect its due to the fact that the ratings are in a sense binomial probabilities, and these are known to have greater variance the closer they are to 0.5.

The point estimate for $$\nu$$ is 0.85 for this data set. Compared to the 0.15 for the handball data, it is clear that this reflects the higher overall probability of draws in football. In the handball data set only 6 games ended in a draw, while in the football data set about 20% of the games was a draw.

The comparison graph

In my last post on Elo ratings I used a graph to illustrate why it is hard to compare the strengths of teams that play in different leagues.

This is based on data from two half-seasons of English Premier League and the Championship. Each team is represented by a node or vertex, which is drawn as a circle. Each edge between the nodes is drawn as line between them and indicates that the two teams have against played each other at least once. I didn’t add team names to the graph, but the orange nodes are the teams that played in the Premier League in both seasons, while the blue nodes are teams that played in the Championship or got promoted or relegated. The graph shows both why comparisons between teams in different leagues can be difficult, and why including data from the Championship can improve the prediction of Premier league matches. In this post I will go more into details about what I call the comparison graph and how it can be used.

We can easily recognize a few patterns in the graph above. The most obvious one is the cluster of several teams where everyone (or nearly everyone) has played against each other. In a fully played season every team has played each other and the graph is said to be fully connected. If the graph is fully connected we should be able to have a good idea about the relative strengths between every team. Here is an example of a fully connected graph representing a fully played season with 10 teams.

Another important pattern is the lack of edges between two teams. If two teams hasn’t played each other, but both has played a third team, they are indirectly comparable. Here we see that both Team B and Team C has played team A, but they have not played each other.

If you are going to predict the outcome of a match between Team B and Team C this graph shows you that you should be careful. The information we have of the relative strengths between them is only indirect. This can in some situations where you have very limited data be almost the same as having no data at all. Suppose both Team C and Team B won huge victories over Team A. This would perhaps indicate that Team A is crap, but we would have very little indication which of Team B and Team C is better. If on the other hand Team A beat Team C, and Team B beat Team A, we would have had a strict ordering, so it does not automatically mean that we can’t make anything out of the data.

Another important pattern in a graph is whether there are any disconnected subgraphs. Here we have two or more groups of teams that has played only against other teams within their own group, but not against the teams in the other groups. In the first few rounds of a season we can see patterns like this.

There are a lot of interesting things you can do with the comparison graph, but that will make for a future post.

The underdispersed Conway-Maxwell Poisson distribution and goal differences

I have unfortunately not had the time to look more closely at the performance for the underdispersed count distributions that I in my last post found to be useful for predicting football results. Here I am taking a quick look into how the Conway-Maxwell distribution (COM) influences the predicted goal differences compared to the Poisson distribution.

Using data from the 2010-11 Premier League season I fitted the both the Poisson model and the COM model. The estimated dispersion parameter for the COM model indicated that there was less variability in the actual goals scored than implied by the Poisson distrubtion. I used the code I posted here to compute the probability distributions for the goal-differences for five matches in the season.

Let’s first look at the goal difference distribution for Arsenal playing at home against Manchester City. Both teams were in the top of the final table, and the actual result for this game in the 2010-11 season was 0-0. Comparing the distributions from the Poisson and COM models we see that they are pretty much identical.

For Aston Villa vs. Sunderland, which placed 9th and 10th on the table in 2010-11, we also see that there is not much difference between the two models. Although there is a slight increase in the probability for the actual result in that game, I don’t think it is of much importance.

Let’s compare the models using two teams from the bottom of the table, Wigan vs. Wolverhampton. Again, not much difference. Also note that there is basically no change in the probability for the actual result.

OK, so far the comparisons have been based on teams of similar strengths. But take Blackburn (15th on the table) vs. Liverpool (6th). Now we see that the COM model and Poisson model differ a bit. Here the COM model does a worse prediction of the actual result compared to the Poisson model. But only considering the league positions, the skewing of the distribution in favor of Liverpool in this case may not be totally unreasonable.

The last plot is compares the distributions between Chelsea (2nd on the table) vs. Birmingham (18th). Here we clearly see that there is a substantial difference in the prediction between the two models. The COM model favors Chelsea much more than the Poisson model does, which in this case give a much higher probability for the correct result.

From just looking at these few plots, I think we can conclude that the (underdispersed) COM model differs from the Poisson model where there is a greater difference in strength between the two sides.

Statistics I don’t believe in

Oslo is one of the candidate cities for hosting the 2022 winter Olympics, and of course there is a lot of discussion going on whether it costs too much etc. Opinion polls are conducted regularly, and the current status is that the majority of people don’t want Oslo to host the games. A couple of weeks ago the office dealing with the application put up the data from all polls conducted this year. What is weird is that they choose to illustrate the article detailing the data with the results from the question Do you wish that Norway should continue to participate in the winter Olympics and Paralympics? where only 73% of the respondents answered yes and 19% answered no.

First of all I thought the question was silly. Of course no one want Norway not to participate, but then the results show that almost one in five don’t want Norway to participate. At first i thought that it might be just a fluke, but then I downloaded the data and saw that this question had been asked on three occasions, and the results were pretty similar:

Even after seeing the consistent trend, I still don’t believe the numbers are correct. I think this shows one of the potential pitfalls in opinion polls. Since it is really hard to get the nuances of peoples beliefs in questions with just a few possible answers, it may be tempting to ask more detailed questions. The danger is that the questions become more confusing, or they even become irrelevant. I think both these things are can explain what is going on here.

A short summer update

Hi everybody, I haven’t had written much lately as I have been busy with other things, so this is just a short update on what is going on with my life.

This May I handed in my M.Sc. thesis in bioinformatics at the Norwegian University of Life Sciences. In it I investigated the prevalence of cis-regulatory elements (CREs) in plant promoters across 25 species. CREs are short DNA sites (typically around 12 base pairs long) that plays a role in gene regulation. They typically work by having high binding affinity for proteins that somehow regulates RNA transcription. I used genomic data (sequences and annotations) from the PLAZA server. I then predicted CREs by using motifs from the JASPAR, AtcisDB and PLACE databases. I then investigated whether predicted CREs that was found to be conserved in many gene families were more co-expressed than CREs that were not conserved and conserved only in a few families. I found that in general, this was not the case.

Right after my thesis was done I started my new job as a statistician in the Department of Health Services Research at Akershus University Hospital right outside Oslo. I am very exited about this as I get to work on a lot of interesting and meaningful research projects with a bunch of really great people.

I also spent a week backpacking in Belgium this summer with my girlfriend. The atmosphere was great in Brussels when Belgium beat the US in the World Cup, the weather was nice when we hiked in the Ardennes near the French border and everything was excellent when we rode our bikes in Bruges. Antwerp had the best food and the most beautiful train station I have ever been to. The most bizarre thing we experienced was perhaps the Bruggian love for embroidered cats dressed as humans:

Until next time!