Fitting Bradley-Terry models using Stan

I have recently played around with Stan, which is an excellent software to fit Bayesian models. It is similar to JAGS, which I have used before to fit some regression models for predicting football results. Stan differs from JAGS in a number of ways. Although there is some resemblance between the two, the model specification languages are not compatible with each other. Stan, for instance, uses static typing. On the algorithmic side, JAGS uses the Gibbs sampling technique to sample from the posterior; Stan does not do Gibbs sampling, but has two other sampling algorithms. In Stan you can also get point estimates by using built-in optimization routines that search for the maximum of the posterior distribution.

In this post I will implement the popular Bradley-Terry machine learning model in Stan and test it on some sports data (handball, to be specific).

The Bradley-Terry model is used for making predictions based on paired comparisons. A paired comparison in this context means that two things are compared, and one of them is deemed preferable or better than the other. This can for example occur when studying consumer preferences or ranking sport teams.

The Bradley-Terry model is really simple. Suppose two teams are playing against each other, then the probability that team i beats team j is

\( p(i > j) = \frac{r_i}{r_i + r_j} \)

where \(r_i\) and \(r_j\) are the ratings for the two teams, and should be positive numbers. It is these ratings we want to estimate.

A problem with the model above is that the ratings are not uniquely determined. To overcome this problem the parameters need to be constrained. The most common constraint is to add a sum-to-one constraint

\( \sum_k r_k = 1 \)

I will explore a different constraint below.

Sine we are in a Bayesian setting we need to set a prior distribution for the rating parameters. Given the constraints that the parameters should be positive and sum-to-one the Dirichlet distribution is a natural choice of prior distribution.

\( r_1, r_2, …, r_p \sim Dir(\alpha_1, \alpha_2, …, \alpha_p) \)

where the hyperparameters \(\alpha\) are positive real numbers. I will explore different choices of these below.

Here is the Stan code for the Bradley-Terry model:

data {
	int<lower=0> N; // N games
	int<lower=0> P; // P teams
	
	// Each team is referred to by an integer that acts as an index for the ratings vector. 
	int team1[N]; // Indicator arrays for team 1
	int team2[N]; // Indicator arrays for team 1
	int results[N]; // Results. 1 if team 1 won, 0 if team 2 won.
	
	vector[P] alpha; // Parameters for Dirichlet prior.
}

parameters {
	// Vector of ratings for each team.
	// The simplex constrains the ratings to sum to 1 
	simplex[P] ratings;
}

model {
	real p1_win[N]; // Win probabilities for player 1
	
	ratings ~ dirichlet(alpha); // Dirichlet prior.
	
	for (i in 1:N){
		p1_win[i] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]]);
		results[i] ~ bernoulli(p1_win[i]);
	}
}

The way I implemented the model you need to supply the hyperparameters for the Dirichlet prior via R (or whatever you use to run Stan). The match outcomes should be coded as 1 if team 1 won, 0 if team 2 won. The two variables team1 and team2 are vectors of integers that are used to reference the corresponding parameters in the ratings parameter vector.

Before we fit the model to some data we need to consider what values we should give to the hyperparameters. Each of the parameters of the Dirichlet distribution corresponds to the rating of a specific team. Both the absolute magnitude and the relative magnitudes are important to consider. A simple case is when all hyperparameters have the same value. Setting all hyperparameters to be equal to each other, with a value greater or equal to 1, implies a prior belief that all the ratings are the same. If they are between 0 and 1, the prior belief is that the ratings are really really different. The magnitude also plays a role here. The greater the magnitudes are, the stronger the prior belief that the ratings are the same.

Let’s fit some data to the model. Below are the results from fitting the results from 104 games from the 2016-17 season of the Norwegian women’s handball league, with 11 participating teams. I had to exclude six games that ended in a tie, since that kind of result is not supported by the Bradley-Terry model. Extension exists that handle this, but that will be for another time.

Below are the results of fitting the model with different sets of priors, together with the league points for comparison. For this purpose I didn’t do any MCMC sampling, I only found the MAP estimates using the optimization procedures in Stan.

res_dir

For all the priors the resulting ratings give the same ranking. This ranking also corresponds well with the ranking given by the league points, except for Gjerpen and Stabæk which have switched place. We also clearly see the effect of the magnitude of the hyperparameters. When all the \(\alpha\)‘s are 1 the ratings varies from almost 0 to about 0.6. When they are all set to 100 the ratings are almost all the same. If these ratings were used to predict the result of future matches the magnitudes of the hyperparameters could be tuned using cross validation to find the value that give best predictions.

What if we used a different hyperparameter for each team? Below are the results when I set all \(\alpha\)‘s to 10, except for the one corresponding to the rating for Glassverket, which I set to 25. We clearly see the impact. Glassverket is now considered to be the best team. This is nice since it demonstrates how actual prior information, if available, can be used to estimate the ratings.

res_dir2

I also want to mention another way to fit the Bradley-Terry model, but without the sum-to-one constraint. The way to do this is by using a technique that the Stan manual calls soft centering. Instead of having a Dirichlet prior which enforces the constraint, we use a normal distribution prior. This prior will not give strict bounds on the parameter values, but will essentially provide a range of probable values they can take. For the model I chose a prior with mean 20 and standard deviation 6.

\( r_1, r_2, …, r_p \sim N(\mu = 20, \sigma = 6) \)

The mean hyperprior here is arbitrary, but the standard deviation required some more considerations. I reasoned that the best team would probably be in the top 99 percentile of the distribution, approximately three standard deviations above the mean. In this case this would imply a rating of 20 + 3*6 = 38. Similarly, the worst team would probably be rated three standard deviations below the mean, giving a rating of 2. This implies that the best team has a 95% chance of winning against the worst team.

Here is the Stan code:

data {
	int<lower=0> N; 
	int<lower=0> P;

	int team1[N]; 
	int team2[N]; 
	int results[N];

	real<lower=0> prior_mean; // sets the (arbitrary) location of the ratings.
	real<lower=0> prior_sd; // sets the (arbitrary) scale of the ratings.

}

parameters {
	real<lower=0> ratings[P];
}

model {
	real p1_win[N];

	// soft centering (see stan manual 8.7)
	ratings ~ normal(prior_mean, prior_sd); 

	for (i in 1:N){
		p1_win[i] = ratings[team1[i]] / (ratings[team1[i]] + ratings[team2[i]]);
		results[i] ~ bernoulli(p1_win[i]);
	}
}

And here are the ratings for the handball teams. The ratings are now on a different scale than before and largely matches the prior distribution. The ranking given by this model agrees with the model with the Dirichlet prior, with Gjerpen and Stabek switched relative to the league ranking.

res_norm

Two Bayesian regression models for football results

Last fall I took a short introduction course in Bayesian modeling, and as part of the course we were going to analyze a data set of our own. I of course wanted to model football results. The inspiration came from a paper by Gianluca Baio and Marta A. Blangiardo Bayesian hierarchical model for the prediction of football results (link).

I used data from Premier League from 2012 and wanted to test the predictions on the last half of the 2012-23 season. With this data I fitted two models: One where the number of goals scored where modeled using th Poisson distribution, and one where I modeled the outcome directly (as home win, away win or draw) using an ordinal probit model. As predictors I used the teams as categorical predictors, meaning each team will be associated with two parameters.

The Poisson model was pretty much the same as the first and simplest model described in Baio and Blangiardo paper, but with slightly more informed priors. What makes this model interesting and different from the independent Poisson model I have written about before, apart from being estimated using Bayesian techniques, is that each match is not considered as two independent events when the parameters are estimated. Instead a correlation is implicitly modeled by specifying the priors in a smart way (see figure 1 in the paper, or here), thereby modeling the number of goals scored like a sort-of-bivariate Poisson.

Although I haven’t had time to look much into it yet, I should also mention that Baio and Blangiardo extended their model and used it this summer to model the World Cup. You can read more at Baio’s blog.

The ordinal probit model exploits the fact that the outcomes for a match can be thought to be on an ordinal scale, with a draw (D) considered to be ‘between’ a home win (H) and an away win (A). An ordinal probit model is in essence an ordinary linear regression model with a continuous response mu, that is coupled with a set of threshold parameters. For any value of mu the probabilities for any category is determined by the cumulative normal distribution and the threshold values. This is perhaps best explained with help from a figure:

probit_treshold_example

Here we see an example where the predicted outcome is 0.9, and the threshold parameters has been estimated to 0 and 1.1. The area under the curve is then the probability of the different outcomes.

To model the match outcomes I use a model inspired by the structure in the predictors as the Poisson model above. Since the outcomes are given as Away, Draw and Home, the home field advantage is not needed as a separate term. This is instead implicit in the coefficients for each team. This gives the coefficients a different interpretation from the above model. The two coefficients here can be interpreted as the ability when playing at home and the ability when playing away.

To get this model to work I had to set the constrains that the threshold separating Away and Draw were below the Draw-Home threshold. This implies that a good team would be expected to have a negative Away coefficient and a positive Home coefficient. Also, the intercept parameter had to be fixed to an arbitrary value (I used 2).

To estimate the parameters and make predictions I used JAGS trough the rjags package.

For both models, I used the most credible match outcome as the prediction. How well were the last half of the 2012-13 season predictions? The results are shown in the confusion table below.

Confusion matrix for Poisson model

actual/predicted A D H
A 4 37 11
D 1 35 14
H 0 38 42

Confusion matrix for ordinal probit model

actual/predicted A D H
A 19 0 33
D 13 0 37
H 10 0 70

The Poisson got the result right in 44.5% of the matches while the ordinal probit got right in 48.9%. This was better than the Poisson model, but it completely failed to even consider draw as an outcome. Ordinal probit, however, does seem to be able to predict away wins, which the Poisson model was poor at.

Here is the JAGS model specification for the ordinal probit model.

model {

  for( i in 1:Nmatches ) {

    pr[i, 1] <- phi( thetaAD - mu[i]  )
    pr[i, 2] <- max( 0 ,  phi( (thetaDH - mu[i]) ) - phi( (thetaAD - mu[i]) ) )
    pr[i, 3] <- 1 - phi( (thetaDH - mu[i]) )

    y[i] ~ dcat(pr[i, 1:3])

    mu[i] <- b0 + homePerf[teamh[i]] + awayPerf[teama[i]]
  }

  for (j in 1:Nteams){
    homePerf.p[j] ~ dnorm(muH, tauH)
    awayPerf.p[j] ~ dnorm(muA, tauA)

    #sum to zero constraint
    homePerf[j] <- homePerf.p[j] - mean(homePerf.p[])
    awayPerf[j] <- awayPerf.p[j] - mean(awayPerf.p[])
  }

  thetaAD ~ dnorm( 1.5 , 0.1 )
  thetaDH ~ dnorm( 2.5 , 0.1 )

  muH ~ dnorm(0, 0.01)
  tauH ~ dgamma(0.1, 0.1)

  muA ~ dnorm(0, 0.01)
  tauA ~ dgamma(0.1, 0.1)

  #predicting missing values
  predictions <- y[392:573]
}

And here is the R code I used to run the above model in JAGS.

library('rjags')
library('coda')

#load the data
dta <- read.csv('PL_1213.csv')

#Remove the match outcomes that should be predicted
to.predict <- 392:573 #this is row numbers
observed.results <- dta[to.predict, 'FTR']
dta[to.predict, 'FTR'] <- NA

#list that is given to JAGS
data.list <- list(
  teamh = as.numeric(dta[,'HomeTeam']),
  teama = as.numeric(dta[,'AwayTeam']),
  y = as.numeric(dta[, 'FTR']),
  Nmatches = dim(dta)[1],
  Nteams = length(unique(c(dta[,'HomeTeam'], dta[,'AwayTeam']))),
  b0 = 2 #fixed
)

#MCMC settings
parameters <- c('homePerf', 'awayPerf', 'thetaDH', 'thetaAD', 'predictions')
adapt <- 1000
burnin <- 1000
nchains <- 1
steps <- 15000
thinsteps <- 5

#Fit the model
#script name is a string with the file name where the JAGS script is.
jagsmodel <- jags.model(script.name, data=data.list, n.chains=nchains, n.adapt=adapt)
update(jagsmodel, n.iter=burnin)

samples <- coda.samples(jagsmodel, variable.names=parameters, 
                        n.chains=nchains, thin=thinsteps,
                        n.iter=steps)

#Save the samples
save(samples, file='bayesProbit_20131030.RData')

#print summary
summary(samples)