Predicting football results with Poisson regression pt. 1

I have been meaning to write about my take on using Poisson regression to predict football results for a while, so here we go. Poisson regression is one of the earliest statistical methods used for predicting football results. The goal here is to use available data to say something about how many goals a team is expected to score, and from that calculate the probabilities of different match outcomes.

The Poisson distribution
The Poisson distribution is a probability distribution that can be used to model data that can be counted (i.e. something that can happen 0, 1, 2, 3, … times). If we know the number of times something is expected to happen, we can find the probability that it happens any given number of times. For example, if we know something is expected to happen 4 times, we can calculate the probabilities that it happens 0, 1, 2, … times.
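As a small illustration of the example above, R's built-in dpois() and ppois() functions give these probabilities directly. This is only a sketch; the cut-off at 8 occurrences is an arbitrary choice made to keep the output short.

```r
# Probability of 0, 1, ..., 8 occurrences when 4 are expected
counts <- 0:8
probs <- dpois(counts, lambda = 4)
names(probs) <- counts
print(round(probs, 3))

# Probability of more than 8 occurrences (the part of the distribution cut off above)
tail.prob <- ppois(8, lambda = 4, lower.tail = FALSE)
```

Together, probs and tail.prob add up to 1, since they cover every possible count.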

It turns out that the number of goals a team scores in a football match is approximately Poisson distributed. This means we have a method for assigning probabilities to the number of goals in a match, and from this we can find probabilities for different match results. Note that I write that goals are approximately Poisson. The Poisson distribution does not always describe the number of goals in a match perfectly. It sometimes overestimates or underestimates the number of goals, and some football leagues seem to fit the Poisson distribution better than others. Still, the Poisson distribution seems to be an OK approximation.
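To make the step from goal probabilities to match result probabilities concrete, here is a minimal sketch. It assumes that the two teams' goals are independent Poisson variables, which is a simplification, and the function name outcome.probabilities(), the expected goal values and the cut-off maxGoals are all made up for the illustration.

```r
outcome.probabilities <- function(lambdaHome, lambdaAway, maxGoals = 10){
  goals <- 0:maxGoals
  # scoreMatrix[i, j] is the probability that the home team scores i-1 goals
  # and the away team scores j-1 goals, assuming independence
  scoreMatrix <- outer(dpois(goals, lambdaHome), dpois(goals, lambdaAway))
  c(H = sum(scoreMatrix[lower.tri(scoreMatrix)]),  # home goals > away goals
    D = sum(diag(scoreMatrix)),                    # equal number of goals
    A = sum(scoreMatrix[upper.tri(scoreMatrix)]))  # away goals > home goals
}

# Hypothetical expected goals: 1.8 for the home team, 1.1 for the away team
probs <- outcome.probabilities(1.8, 1.1)
print(round(probs, 3))
```

Scores above maxGoals are ignored, so the three probabilities sum to slightly less than 1.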

The regression model
To be able to find the probabilities for different numbers of goals we need to find the expected number of goals L (it is customary to denote the expectation in a Poisson distribution by the Greek letter lambda, but WordPress seems to have problems with Greek letters, so I call it L instead). This is where the regression method comes in. With regression we can estimate L conditioned on certain variables. The most obvious variable to look at is which team is playing. Manchester United obviously scores more goals than Wigan. The second thing we want to take into account is who the opponent is. Some teams are expected to concede fewer goals, while others are expected to let in more. The third thing we want to take into account is home field advantage.

Written in the language of regression models this becomes

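log(L) = mu + home + team_i + opponent_j

Here mu is the intercept, which sets the overall level of goals (on the log scale). The home term is the effect that playing at home has on the number of goals a team scores. team_i is the effect of team number i (the team scoring the goals), and opponent_j is the effect of team number j (the team conceding them).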

(Note: Some descriptions of the Poisson regression model on football data use the terms offensive and defensive strength to describe what I have called team and opponent. I prefer the terms I use here because they make things a bit easier to understand later, when we look at the data set.)

The logarithm on the left hand side is called the link function. I will not dwell much on what a link function is, but the short story is that it ensures that the parameter we try to estimate does not fall outside its domain. In this case it ensures that we never get a negative expected number of goals.
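A quick numerical sketch of what the log link does: the sum of the terms can be any number, positive or negative, but the expected number of goals is obtained by exponentiating it and is therefore always positive. The coefficient values below are hypothetical and chosen only for illustration, they are not estimates from any data.

```r
# Hypothetical coefficient values, for illustration only
mu <- 0.1
home <- 0.3
team <- 0.4
opponent <- -0.2

# Inverting the log link gives the expected number of goals
L <- exp(mu + home + team + opponent)

# Even a strongly negative sum gives a positive expectation
L.low <- exp(-3)
```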

In my example I will use data from What data you want to use is up to you. Typically you could use data from the last year or the last season, but that is entirely your decision.

Each of the terms on the right hand side of the equation (except for mu) corresponds to a column in a table, so we need to fix our data a bit before we proceed with fitting the model. Each match is essentially two observations: one for how many goals the home team scores, the other for how many the away team scores. In other words, each match needs two rows in our data set, not just one.

Doing the fix is easy in Excel or LibreOffice Calc. We take the data rows (i.e. the matches) we want to use and duplicate them. In the duplicated rows we swap the home and away columns, so that the away team and its goals end up in the same columns as the home team and its goals in the original rows. We also need a column that indicates whether the team played at home. Here is an example of how it will look:
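The same duplication can also be done directly in R. The sketch below uses a made-up set of six matches between three teams called A, B and C; the column names HomeTeam, AwayTeam, FTHG and FTAG are the ones used in the data elsewhere on this blog, while the names long, team, opponent, goals and home are my own choices for the illustration. The last line shows roughly how the resulting table maps onto the regression equation when passed to glm().

```r
# Made-up match data, one row per match
matches <- data.frame(
  HomeTeam = c("A", "B", "A", "C", "B", "C"),
  AwayTeam = c("B", "A", "C", "A", "C", "B"),
  FTHG = c(2, 1, 3, 0, 1, 2),
  FTAG = c(0, 1, 1, 2, 0, 2),
  stringsAsFactors = FALSE
)

# Two rows per match: first all the home sides, then all the away sides
long <- data.frame(
  team     = c(matches$HomeTeam, matches$AwayTeam),
  opponent = c(matches$AwayTeam, matches$HomeTeam),
  goals    = c(matches$FTHG, matches$FTAG),
  home     = rep(c(1, 0), each = nrow(matches))
)
print(long)

# log(L) = mu + home + team + opponent, fitted as a Poisson regression
fit <- glm(goals ~ home + team + opponent, family = poisson(link = "log"), data = long)
```

With only six matches the estimates are of course meaningless. The point is just the shape of the data and of the model call.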

In the next part I will fit the actual model, calculate probabilities and describe how we can make predictions using R.

Is goal difference the best way to rank and rate football teams?

In my previous post I compared the least squares rating of football teams to the ordinary three points for a win rating. In this post I will look closer at how these two systems rank teams differently. I briefly touched upon the subject in the last post, where we saw that the two systems generally ranked the teams in the same order, with a few exceptions. We saw that Sunderland and Newcastle were the two teams in the 2011-2012 Premier League season whose ranking differed most between the two systems. The reason is that the least squares approach is based on goal difference, while the points system is based only on match outcome. This means that teams who win a match by many goals benefit more in the least squares ranking than in the points system. For example, a 3-0 win counts more than a 2-1 win when we use goal difference, but the two give the same number of points based on match outcome. This also holds if we look at the losing team; a 2-1 loss is better than a 3-0 loss.

It seems more intuitive to rank teams with a system based on goal difference (using least squares or some other method) than with the three points for a win system, especially when we remind ourselves that the points system lacks any theoretical justification. Awarding three points for a win instead of two was not used before the 1980s and was not used in the World Cup until 1994. The reason for introducing the three points system was to give the teams more incentive to win. Also, as far as I know, even the two points for a win system lacks a theoretical basis as a way to measure a team's strength. But even if the points system lacks an underlying mathematical theory, it could still be a better system than one based on goal difference for deciding the true strength of a team. A paper titled Fitness, chance, and myths: an objective view on soccer results by the two German physicists A. Heuer and O. Rubner compares the two systems using data from the German Bundesliga. They looked at each team in each season from the late 1980s and calculated how much the team's goal difference and points correlated between the first and second half of a season. A higher correlation means that there is less chance involved in how the measure reflects a team's real strength. What they found was that goal difference was more correlated between the half-seasons than points under either the 3 or 2 points for a win system.
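The half-season comparison can be sketched in R roughly as follows. This is my own simplified illustration for a single season, not the procedure from the paper, which pooled many teams and seasons. The function names goal.difference() and half.season.correlation(), the logical vector firstHalf and the six toy matches are all made up for the example.

```r
goal.difference <- function(teams, homeTeam, awayTeam, homeGoals, awayGoals){
  # Goal difference for each team over the matches given
  sapply(teams, function(tm){
    sum(homeGoals[homeTeam == tm] - awayGoals[homeTeam == tm]) +
      sum(awayGoals[awayTeam == tm] - homeGoals[awayTeam == tm])
  })
}

half.season.correlation <- function(homeTeam, awayTeam, homeGoals, awayGoals, firstHalf){
  # firstHalf is TRUE for matches in the first half of the season
  teams <- sort(unique(c(as.character(homeTeam), as.character(awayTeam))))
  gd1 <- goal.difference(teams, homeTeam[firstHalf], awayTeam[firstHalf],
                         homeGoals[firstHalf], awayGoals[firstHalf])
  gd2 <- goal.difference(teams, homeTeam[!firstHalf], awayTeam[!firstHalf],
                         homeGoals[!firstHalf], awayGoals[!firstHalf])
  cor(gd1, gd2)
}

# Toy season with three teams, three matches in each half
r <- half.season.correlation(
  homeTeam  = c("A", "A", "B", "B", "C", "C"),
  awayTeam  = c("B", "C", "C", "A", "A", "B"),
  homeGoals = c(2, 3, 1, 1, 0, 2),
  awayGoals = c(0, 1, 0, 1, 2, 2),
  firstHalf = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)
```

The same comparison could be made for points by swapping goal.difference() for a function that computes points per team.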

However, this does not mean that goal difference is the best way to measure team strength. I would like to see if there are other measures that correlate better between season halves. What first comes to mind is ball possession or shots on target.

As a last note, even if goal difference has a better theoretical foundation as a measure of “who is the best”, I do not think that leagues and tournaments should quit the points system. It may very well be that the points system makes a football competition more interesting since it adds more chance to it.

Least squares rating of football teams

The Wikipedia article Statistical association football predictions mentions a method for least squares rating of football teams. The article does not give any source for this, but I found what I think may be the origin of this method. It appears to be from an undergrad thesis titled Statistical Models Applied to the Rating of Sports Teams by Kenneth Massey. It is not on football in particular, but on sports in general where two teams compete for points. A link to the thesis can be found here.

The basic method as described in Massey’s paper and the Wikipedia article is to use an n*k design matrix A where each of the k columns represents one team, and each of the n rows represents a match. In each match (or row) the home team is indicated by 1, and the away team by -1. Then we have a vector y with the goal difference in each match, with respect to the home team (i.e. positive values for home wins, negative for away wins). Then the least squares solution to the system Ax = y is found, with the x vector now containing the rating values for each team.
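A minimal sketch of this computation in R could look like the following. One detail needs care: every row of A sums to zero, so the ratings are only determined up to an additive constant. The sketch handles this by replacing the last of the normal equations with the constraint that the ratings sum to zero, which is one common fix. The function name least.squares.rating() and the toy matches are my own and only meant as an illustration.

```r
least.squares.rating <- function(homeTeam, awayTeam, homeGoals, awayGoals){
  homeTeam <- as.character(homeTeam)
  awayTeam <- as.character(awayTeam)
  teams <- sort(unique(c(homeTeam, awayTeam)))
  n <- length(homeTeam)

  # Design matrix: 1 for the home team, -1 for the away team
  A <- matrix(0, nrow = n, ncol = length(teams), dimnames = list(NULL, teams))
  A[cbind(seq_len(n), match(homeTeam, teams))] <- 1
  A[cbind(seq_len(n), match(awayTeam, teams))] <- -1
  y <- homeGoals - awayGoals

  # Normal equations, with the last one replaced by sum(x) = 0
  M <- crossprod(A)
  p <- crossprod(A, y)
  M[nrow(M), ] <- 1
  p[nrow(p)] <- 0

  x <- solve(M, p)
  setNames(as.vector(x), teams)
}

# Toy double round-robin between three teams
ratings <- least.squares.rating(
  homeTeam  = c("A", "B", "A", "C", "B", "C"),
  awayTeam  = c("B", "A", "C", "A", "C", "B"),
  homeGoals = c(2, 1, 3, 0, 1, 2),
  awayGoals = c(0, 1, 1, 2, 0, 2)
)
print(ratings)
```

In a complete double round-robin like this one, each rating is simply the team's total goal difference divided by twice the number of teams, which is a handy check on the result.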

When it comes to interpretation, the difference between the least squares ratings of two teams can be seen as the expected goal difference between the teams in a game. The individual rating can be seen as the goal difference a team is expected to get against an average team.

Massey’s paper also discusses some extensions to this simple model that are not mentioned in the Wikipedia article. The most obvious is the incorporation of home field advantage, but there is also a section on splitting the teams’ performances into offensive and defensive components. I am not going to go into these extensions here; you can read more about them in Massey’s paper, along with some other rating systems that are also discussed. What I will do is take a closer look at the simple least squares rating and compare it to the ordinary three points for a win rating used to determine the league winner.

I used the function I made earlier to compute the points for the 2011-2012 Premier League season, then I computed the least squares rating. Here you can see the result:

            PTS    LSR LSRrank RankDiff
Man City     89  1.600       1        0
Man United   89  1.400       2        0
Arsenal      70  0.625       3        0
Tottenham    69  0.625       4        0
Newcastle    65  0.125       8        3
Chelsea      64  0.475       5       -1
Everton      56  0.250       6       -1
Liverpool    52  0.175       7       -1
Fulham       52 -0.075      10        1
West Brom    47 -0.175      12        2
Swansea      47 -0.175      11        0
Norwich      47 -0.350      13        1
Sunderland   45 -0.025       9       -4
Stoke        45 -0.425      15        1
Wigan        43 -0.500      16        1
Aston Villa  38 -0.400      14       -2
QPR          37 -0.575      17        0
Bolton       36 -0.775      19        1
Blackburn    31 -0.750      18       -1
Wolves       25 -1.050      20        0

It looks like the least squares approach gives results similar to the standard points system. It differentiates between the two top teams, Manchester City and Manchester United, even though they have the same number of points. This is perhaps not so surprising, since City won the league because of a greater goal difference than United, and goal difference is what the least squares rating is based on. Another, perhaps more surprising, thing is how low Newcastle's least squares rating is compared to the other teams with approximately the same number of points. If ranked according to the least squares rating, Newcastle should have been below Liverpool; instead they are three places above. This hints at Newcastle being better at winning, but by few goals, and Liverpool winning fewer times, but by more goals when they do win. We can also see that Sunderland comes out poorly in the least squares rating, dropping four places.

If we now plot the number of points against the least squares rating, we see that the two methods generally give similar results. This is perhaps not so surprising, and despite some disparities like the ones I pointed out, there are no obvious outliers. I also calculated the correlation coefficient, 0.978, and I was actually a bit surprised by how high it was.
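The correlation can be recomputed from the PTS and LSR columns of the table above. The values below are typed in by hand from that table, in the same team order, and the variable names pts and lsr are my own.

```r
# Points and least squares ratings, copied from the table (Man City first, Wolves last)
pts <- c(89, 89, 70, 69, 65, 64, 56, 52, 52, 47,
         47, 47, 45, 45, 43, 38, 37, 36, 31, 25)
lsr <- c(1.600, 1.400, 0.625, 0.625, 0.125, 0.475, 0.250, 0.175, -0.075, -0.175,
         -0.175, -0.350, -0.025, -0.425, -0.500, -0.400, -0.575, -0.775, -0.750, -1.050)

# Pearson correlation between the two ratings
r <- cor(pts, lsr)
print(round(r, 3))

# The plot described in the text can be drawn with: plot(pts, lsr)
```

The result should agree with the figure quoted above, up to rounding.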

R functions for soccer league tables and result matrix

Here are three R functions I wrote to calculate ranking tables in soccer leagues based on the results of played matches. The functions are made for ordinary leagues where each team plays every other team twice, once at home and once at the opposing team's home field, but the match.results() and league.table() functions can be used on more general data.

The first function, match.results(), just computes the outcome of a match (Home, Draw or Away, i.e. "H", "D" or "A") based on the number of goals scored, and is used by the other two functions.

> res <- match.results(c(1,2,1,2,3,1,0,5), c(0,1,2,0,3,0,4,0))
> res
[1] "H" "H" "A" "H" "D" "H" "A" "H"

The league.table() function returns a data.frame with some statistics for each team, such as the number of wins, draws and losses (for both home and away games), goals, goal difference etc. As input it takes vectors with the name of the home team, the name of the away team, goals scored by the home team and goals scored by the away team. Three points are given for a win, one point for a draw, and zero points for a loss, as is used in most leagues. If you want to compute an alternative table with a different point scheme you can just change the three variables at the top of the function body. The teams are ranked by the number of points awarded, but if two or more teams have the same number of points, they are ranked by goal difference. If the goal difference is also equal, the number of goals scored is used.

#load data from
matchdata <- read.csv("premierLeague2011-11.csv")
league.table(matchdata$HomeTeam, matchdata$AwayTeam, matchdata$FTHG, matchdata$FTAG)

            PLD HW HD HL AW AD AL GF GA  GD PTS
Man United   38 18  1  0  5 10  4 78 37  41  80
Chelsea      38 14  3  2  7  5  7 69 33  36  71
Man City     38 13  4  2  8  4  7 60 33  27  71
Arsenal      38 11  4  4  8  7  4 72 43  29  68
Tottenham    38  9  9  1  7  5  7 55 46   9  62
Liverpool    38 12  4  3  5  3 11 59 44  15  58
Everton      38  9  7  3  4  8  7 51 45   6  54
Fulham       38  8  7  4  3  9  7 49 43   6  49
Aston Villa  38  8  7  4  4  5 10 48 59 -11  48
Sunderland   38  7  5  7  5  6  8 45 56 -11  47
West Brom    38  8  6  5  4  5 10 56 71 -15  47
Newcastle    38  6  8  5  5  5  9 56 57  -1  46
Stoke        38 10  4  5  3  3 13 46 48  -2  46
Bolton       38 10  5  4  2  5 12 52 56  -4  46
Blackburn    38  7  7  5  4  3 12 46 59 -13  43
Wigan        38  5  8  6  4  7  8 40 61 -21  42
Wolves       38  8  4  7  3  3 13 46 66 -20  40
Birmingham   38  6  8  5  2  7 10 37 58 -21  39
Blackpool    38  5  5  9  5  4 10 55 78 -23  39
West Ham     38  5  5  9  2  7 10 43 70 -27  33

The last function is result.matrix(), which returns a matrix with the match results, with home teams on the rows and away teams on the columns. The cell contents can be formatted in three different ways using the format argument. By default this is set to "score", which gives output like "2 - 1". "HDA" gives either "H", "D" or "A", and "difference" gives the goal difference from the home team's point of view. The diagonal consists of NAs.

#only the five first rows and columns to save space
result.matrix(matchdata$HomeTeam, matchdata$AwayTeam, matchdata$FTHG, matchdata$FTAG, format="score")[1:5,1:5]

            Arsenal Aston Villa Birmingham Blackburn Blackpool
Arsenal     NA      "1 - 2"     "2 - 1"    "0 - 0"   "6 - 0"  
Aston Villa "2 - 4" NA          "0 - 0"    "4 - 1"   "3 - 2"  
Birmingham  "0 - 3" "1 - 1"     NA         "2 - 1"   "2 - 0"  
Blackburn   "1 - 2" "2 - 0"     "1 - 1"    NA        "2 - 2"  
Blackpool   "1 - 3" "1 - 1"     "1 - 2"    "1 - 2"   NA       

And here is the code for the three functions.

match.results <- function(homeGoals, awayGoals){
  #Determines the match outcome (H, D or A) based on goals scored by home and away teams.
  home <- homeGoals > awayGoals
  away <- awayGoals > homeGoals
  draws <- homeGoals == awayGoals
  results <- character(length(homeGoals))
  results[draws] <- "D"
  results[home] <- "H"
  results[away] <- "A"
  return(results)
}


league.table <- function(homeTeam, awayTeam, homeGoals, awayGoals){
  #points awarded for a match outcome  
  winPts <- 3
  drawPts <- 1
  loosePts <- 0
  if (length(unique(sapply(list(homeTeam, awayTeam, homeGoals, awayGoals), length))) != 1 ){
    warning("input vectors not of same length.")
  }
  numMatches <- length(homeTeam)
  teams <- levels(factor(c(as.character(homeTeam), as.character(awayTeam))))
  numTeams <- length(teams)
  #vector with outcome of a match (H, D or A)
  results <- match.results(homeGoals, awayGoals)
  #for output
  homeWins <- numeric(numTeams)
  homeDraws <- numeric(numTeams)
  homeLoss <- numeric(numTeams)
  awayWins <- numeric(numTeams)
  awayDraws <- numeric(numTeams)
  awayLoss <- numeric(numTeams)
  goalsFor <- numeric(numTeams)
  goalsAgainst <- numeric(numTeams)
  goalsDifference <- numeric(numTeams)
  playedMatches <- numeric(numTeams)
  pts <- numeric(numTeams)

  for (t in 1:numTeams) {
    #match results for a given team
    homeResults <- results[homeTeam == teams[t]]
    awayResults <- results[awayTeam == teams[t]]

    playedMatches[t] <- length(homeResults) + length(awayResults)
    goalsForH <- sum(homeGoals[homeTeam == teams[t]])
    goalsForA <- sum(awayGoals[awayTeam == teams[t]])
    goalsFor[t] <- goalsForA + goalsForH
    goalsAgainstH <- sum(awayGoals[homeTeam == teams[t]])
    goalsAgainstA <- sum(homeGoals[awayTeam == teams[t]])
    goalsAgainst[t] <- goalsAgainstA + goalsAgainstH
    goalsDifference[t] <- goalsFor[t] - goalsAgainst[t]
    homeWins[t] <- sum(homeResults == "H")
    homeDraws[t] <- sum(homeResults == "D")
    homeLoss[t] <- sum(homeResults == "A")
    awayWins[t] <- sum(awayResults == "A")
    awayDraws[t] <- sum(awayResults == "D")
    awayLoss[t] <- sum(awayResults == "H")
    totWins <- homeWins[t] + awayWins[t]
    totDraws <- homeDraws[t] + awayDraws[t]
    totLoss <- homeLoss[t] + awayLoss[t]
    pts[t] <- (winPts * totWins) + (drawPts * totDraws) + (loosePts * totLoss)
  }

  #one row per team, with the team names as row names
  table <- data.frame(cbind(playedMatches, homeWins, homeDraws, 
                            homeLoss, awayWins, awayDraws, awayLoss, 
                            goalsFor, goalsAgainst, goalsDifference, pts),
                      row.names=teams)

  names(table) <- c("PLD", "HW", "HD", "HL", "AW", "AD", "AL", "GF", "GA", "GD", "PTS")
  #rank by points, then goal difference, then goals scored
  ord <- order(-table$PTS, -table$GD, -table$GF)
  table <- table[ord, ]
  return(table)
}


result.matrix <- function(homeTeam, awayTeam, homeGoals, awayGoals, format="score"){
  if (length(unique(sapply(list(homeTeam, awayTeam, homeGoals, awayGoals), length))) != 1 ){
    warning("input vectors not of same length.")
  }
  #use character names so that cells are indexed by team name, not by factor code
  homeTeam <- as.character(homeTeam)
  awayTeam <- as.character(awayTeam)
  teams <- levels(factor(c(homeTeam, awayTeam)))
  numTeams <- length(teams)
  numMatches <- length(homeTeam)
  if (format == "HDA"){
    results <- match.results(homeGoals, awayGoals)
  }
  #matrix of NAs with home teams on the rows and away teams on the columns
  resultMatrix <- matrix(nrow=numTeams, ncol=numTeams, dimnames=list(teams, teams))
  for (m in 1:numMatches){
    if (format == "score"){
      cell <- paste(homeGoals[m], "-", awayGoals[m])
    } else if (format == "HDA"){
      cell <- results[m]
    } else if (format == "difference"){
      cell <- homeGoals[m] - awayGoals[m]
    }
    resultMatrix[homeTeam[m], awayTeam[m]] <- cell
  }
  return(resultMatrix)
}