Predicting football results with Poisson regression pt. 1

Posted on February 26, 2013 by opisthokonta

I have been meaning to write about my take on using Poisson regression to predict football results for a while, so here we go. Poisson regression is one of the earliest statistical methods used for predicting football results. The goal here is to use available data to say something about how many goals a team is expected to score, and from that calculate the probabilities for different match outcomes.

The Poisson distribution
The Poisson distribution is a probability distribution that can be used to model data that can be counted (i.e. something that can happen 0, 1, 2, 3, … times). If we know the number of times something is expected to happen, we can find the probability that it happens any given number of times. For example, if we know something is expected to happen 4 times, we can calculate the probabilities that it happens 0, 1, 2, … times.
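
As a small illustration of the example above, here is a minimal sketch in Python that evaluates the Poisson probability formula P(k) = exp(-L) * L^k / k! for an expected count of 4. The function name poisson_pmf is just a name chosen for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

expected = 4.0
# Probabilities of the event happening 0, 1, ..., 10 times
probs = [poisson_pmf(k, expected) for k in range(11)]

for k, p in enumerate(probs):
    print(f"P({k} times) = {p:.4f}")
```

The probabilities for all possible counts sum to 1; the list above stops at 10, so it leaves out a small remaining tail.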

It turns out that the number of goals a team scores in a football match is approximately Poisson distributed. This means we have a method of assigning probabilities to the number of goals in a match, and from this we can find probabilities for different match results. Note that I write that goals are approximately Poisson. The Poisson distribution does not always perfectly describe the number of goals in a match. It sometimes overestimates or underestimates the number of goals, and some football leagues seem to fit the Poisson distribution better than others. Anyway, the Poisson distribution seems to be an OK approximation.
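
To make the step from goal probabilities to match results concrete, here is a rough sketch in Python. It assumes the two teams' goal counts are independent Poisson variables, and the expected goals (1.5 and 1.1) are made-up numbers, not estimates from any data. The function names and the cut-off at 10 goals are choices made for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def outcome_probabilities(lam_home, lam_away, max_goals=10):
    """Sum the probabilities of all scorelines up to max_goals per team."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            # Independence assumption: joint probability is the product
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Made-up expected goals for the home and away team
home_win, draw, away_win = outcome_probabilities(1.5, 1.1)
print(f"home win {home_win:.3f}, draw {draw:.3f}, away win {away_win:.3f}")
```

Scorelines with more than 10 goals for one team are ignored here, so the three probabilities sum to very slightly less than 1.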

The regression model
To be able to find the probabilities for different numbers of goals we need to find the expected number of goals L (it is customary to denote the expectation in a Poisson distribution by the Greek letter lambda, but WordPress seems to have problems with Greek letters, so I call it L instead). This is where the regression method comes in. With regression we can estimate lambda conditioned on certain variables. The most obvious variable to look at is which team is playing. Manchester United obviously scores more goals than Wigan. The second thing we want to take into account is who the opponent is. Some teams are expected to concede fewer goals, while others are expected to let in more goals. The third thing we want to take into account is home field advantage.

Written in the language of regression models this becomes

log(L) = mu + home + team_i + opponent_j

The mu is the overall mean number of goals. The home term is the effect that playing at home has on the number of goals a team scores. team_i is the effect of team number i, and opponent_j is the effect of having team j as the opponent.

(Note: Some descriptions of the Poisson regression model on football data use the terms offensive and defensive strength to describe what I have called team and opponent. I prefer the terms I use here because they make it a bit easier to understand later when we look at the data set.)

The logarithm on the left hand side is called the link function. I will not dwell much on what a link function is, but the short story is that it ensures that the parameter we try to estimate doesn’t fall outside its domain. In this case it ensures that we never get a negative expected number of goals.
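
A minimal sketch in Python of how the equation turns parameters into expected goals. All the parameter values and team names below are hypothetical, made up for illustration; in practice they would be estimated from data. Because the terms are added on the log scale and then exponentiated, the result is always positive.

```python
import math

# Hypothetical parameter values on the log scale (not estimated from data)
mu = 0.1
home = 0.3
team = {"Team A": 0.4, "Team B": -0.2}
opponent = {"Team A": -0.3, "Team B": 0.2}

def expected_goals(scoring_team, conceding_team, at_home):
    """Evaluate log(L) = mu + home + team_i + opponent_j, then undo the log."""
    log_l = mu + (home if at_home else 0.0) + team[scoring_team] + opponent[conceding_team]
    return math.exp(log_l)

# Team A at home against Team B
l_home = expected_goals("Team A", "Team B", at_home=True)
l_away = expected_goals("Team B", "Team A", at_home=False)
print(f"expected goals: Team A {l_home:.2f}, Team B {l_away:.2f}")
```

Since the home term is added on the log scale, it acts as a multiplier, exp(home), on the expected number of goals.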

Data
In my example I will use data from football-data.co.uk. Which data to use is up to you. Typically you could use data from the last year or the last season.

Each of the terms on the right hand side of the equation (except for mu) corresponds to a column in a table, so we need to fix our data a bit before we proceed with fitting the model. Each match is essentially two observations: one for how many goals the home team scores, the other for how many the away team scores. Basically, each match needs two rows in our data set, not just one.

Doing the fix is easy in Excel or LibreOffice Calc. We take the data rows (i.e. the matches) we want to use and duplicate them. In the duplicated rows we then move the away team and the away goals into the same columns as the home team and the home goals, with the home team becoming the opponent. We also need a column to indicate whether the team played at home. Here is an example of how it will look:
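
As a rough illustration of the layout, here is a minimal sketch of the same reshaping in Python. The two matches, the scores and the column names (team, opponent, goals, home) are made up for the example and are not taken from football-data.co.uk.

```python
# Made-up matches in the usual one-row-per-match layout
matches = [
    {"home_team": "Team A", "away_team": "Team B", "home_goals": 2, "away_goals": 1},
    {"home_team": "Team C", "away_team": "Team A", "home_goals": 0, "away_goals": 0},
]

def to_long_format(matches):
    """Turn each match into two rows: one per team, with a 0/1 home indicator."""
    rows = []
    for m in matches:
        # Row for the goals scored by the home team
        rows.append({"team": m["home_team"], "opponent": m["away_team"],
                     "goals": m["home_goals"], "home": 1})
        # Row for the goals scored by the away team
        rows.append({"team": m["away_team"], "opponent": m["home_team"],
                     "goals": m["away_goals"], "home": 0})
    return rows

rows = to_long_format(matches)
print(f'{"team":8} {"opponent":8} {"goals":>5} {"home":>4}')
for r in rows:
    print(f'{r["team"]:8} {r["opponent"]:8} {r["goals"]:5} {r["home"]:4}')
```

Each match now contributes two observations of a single goals variable, which is the shape the regression model needs.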

In the next part I will fit the actual model, calculate probabilities and describe how we can make predictions using R.

33 thoughts on “Predicting football results with Poisson regression pt. 1”

Marco on October 14, 2014 at 12:20 pm said:

Hi man,
My name’s Marco and I’m Spanish. I’ve been looking for Poisson formulas for a long time, and I have some spreadsheets I made myself, but the search still continues. A few days ago I found your page, and it sheds a bit of light on my studies, but since I don’t understand R, I work in Excel. Maybe it is rudimentary, but it works for me!

Now, I hope you can help me with some points of the log(L) formula here:

MU: Is the overall mean of goals. Okay, Goals for? Goals against? Goals of the team I am studying? Goals of the league / championship?

HOME: The home is the effect on number of goals a team has by playing at home. Does it mean: goals for at home / goals for away?

TEAM & OPPONENT: This I think is clear. As you said, it is attacking strength and defensive strength.

So, as I understand, for example:

City – Tottenham (Next Saturday 18 October 2014)

70 matches, 196 goals in Premier League. Goals per match: 2.80
*** MU = 2.80

HOME LEAGUE GOALS: 1.54
AWAY LEAGUE GOALS: 1.26
*** Home advantage (effect): 1.54/1.26 = 1.22

*** TEAM (attacking strength of M.City): 0.864
*** OPPONENT (Defensive strength of Tottenham): 0.649

Then:

Log(L) = 2.80 + 1.22 + 0.864 + 0.649 = 0.7429

So, 0.7429 is the logarithm. Must I then take the exponent, e^0.7429?
And is 0.7429 (or e^0.7429) the number of goals expected for Manchester City?

Thanks a lot, I hope you can understand it all despite my English…
If you want to answer via e-mail: *******
Or if you want here, I’ll be so grateful wherever you answer!

- opisthokonta on October 22, 2014 at 6:50 pm said:
  
  While it is convenient to think about the MU as the overall mean of goals scored, it is not entirely right. Remember that these parameters are estimated in terms of the logarithm of the number of goals. It is perhaps more accurate to describe it as the geometric mean. Therefore you can’t just estimate the MU and home parameters like you did. Also, ignoring logarithms, I am not sure about the way you estimate the home field advantage (HFA). In the context of linear models (models where you add things together) it is more intuitive to think of HFA as a difference instead of a ratio.
  
  And yes, you have to exponentiate the end result to get the correct estimate.
  
arzu on November 9, 2014 at 12:32 pm said:

Hi. Thanks for your site. Very nice article. Can you give an example with real market odds?

Marco on November 26, 2014 at 11:25 pm said:

Thanks for the answer, what type of Log do you use? log10? LN?

- opisthokonta on November 29, 2014 at 10:10 am said:
  
  R uses the natural log by default, so that is the one used here.
  
kingsley on January 13, 2015 at 9:41 am said:

How did you calculate the overall goals scored in R?

- opisthokonta on January 13, 2015 at 7:37 pm said:
  
  I explain how you can do predictions in part 2:
  http://opisthokonta.net/?p=296
  
kingsley on January 13, 2015 at 9:45 am said:

For the expected goals scored by the two teams, how would I go about it?
Also, I want to know how to predict the probability given the fixed odds of the teams and also the market odds.

- opisthokonta on January 13, 2015 at 7:41 pm said:
  
  I am not an expert on betting and odds markets etc. but you could try to just add the odds into the model, instead of using the teams. I did use the odds as predictors in my post on using decision trees with adaptive boosting:
  
  http://opisthokonta.net/?p=809
  
Vladimir on February 15, 2016 at 3:48 pm said:

I love that! But is it possible to avoid writing each match in Excel (for example), and just write the number of goals for a team (home and away)? Here in your data you write all the matches, and if we want to do a lot of leagues it takes so long!

Can you explain in more detail how you downloaded the R package to get the boxplot (part 2)? Thank you for your answer! =)

- opisthokonta on February 20, 2016 at 1:54 pm said:
  
  This method will not work with just having the total number of goals scored for each team. What makes this regression model interesting is that it takes into account both the team and opposition in each game. Fortunately, you don’t have to punch in the data yourself, you can take a look at this page for some links to downloadable data sets.
  
  There are no boxplots in part 2, but all plots are made using the built-in plotting commands in R.
  
Zhannat on March 13, 2016 at 6:02 am said:

How big should the data be? Do you use only the past 2 matches or collect data over several years?

- opisthokonta on March 20, 2016 at 6:57 pm said:
  
  You should use as much data as possible. You could use the Dixon-Coles weighting method which I have described here. See also this post where I show that you can improve prediction by also including data outside of the league you are going to predict.
  
neha on March 10, 2017 at 7:11 pm said:

How do you find the Home effect column?

- opisthokonta on March 13, 2017 at 5:19 pm said:
  
  You have to create it yourself.
  
  - opisthokonta on March 13, 2017 at 5:26 pm said:
    
    If you are using R, you can take a look at the first block of code in this post:
    http://opisthokonta.net/?p=927
    
    - Lexi on July 7, 2017 at 4:59 am said:
      
      Hi,
      
      I still don’t understand. I mean, from the data there are 2 variables: goals scored by the home and away teams. From here, do you sum the goals scored by the home team and divide it by the goals scored against to get the home advantage?
      
      - opisthokonta on July 7, 2017 at 1:12 pm said:
        
        No. The trick here is that you only look at one variable, goals scored. So you have to stack the two variables into a single variable. Then to get the home team advantage you create a new variable that indicates (with 0’s and 1’s) whether the goals were scored by the home team or not. The implication is that each game can be thought of as two independent observations, one being the number of goals by the home team and the other the number of goals by the away team.
mel on September 26, 2017 at 8:23 am said:

Please, how do I calculate home advantage?

- opisthokonta on September 26, 2017 at 9:39 am said:
  
  Take a look at part 2 for how you estimate the parameters, including the home advantage.
  
Lichao on December 23, 2017 at 7:47 am said:

Hi, the database seems to consider only the goals a team scores at home and away, not the goals against at home or away. Do you think it is necessary to take goals against into account?

- opisthokonta on December 27, 2017 at 6:20 pm said:
  
  As you can see in part 2, the goals scored are modeled as a function of who plays and who the opponent is. So the opponent factor for each team takes the goals against into account.
  
Brahma on June 14, 2018 at 12:46 pm said:

Hello, my name is Brahma and I am Nepali. How can we calculate predictions for the 2018 World Cup in Russia with this method? Can you please explain? I get confused when I remember that in the World Cup there are no home or away games.

- opisthokonta on June 14, 2018 at 1:47 pm said:
  
  You need to use data from international games to fit the model and make predictions. You are right that in international games there is sometimes no home team. In the World Cup only Russia would be the home team. If there is no home team, I suggest not having any home advantage for that game. In other words, you set the “Home” variable to 0 for both teams.
  
Chachi on June 28, 2018 at 5:02 pm said:

Hello,

I came across this post as you are calculating the geometric mean for soccer results.

I wonder if you can shed some light on how you handled zero values for the geometric mean?

I initially replaced zero values with 0.001 however I feel that this isn’t a good method to handle them.

I would greatly appreciate your response.

Regards
Chachi

- opisthokonta on June 29, 2018 at 10:04 am said:
  
  I don’t calculate the geometric mean in this post. The model calculates the logarithm of expected (average) goals.
  
  - Chachi on July 2, 2018 at 12:24 pm said:
    
    Ah yes sorry,
    
    I was referring to your reply to Marco regarding MU:
    
    “It is perhaps more accurate to describe it as the geometric mean.”
    
    Thanks Again
    
John on January 11, 2019 at 7:31 pm said:

Hi, good afternoon. Is this the model described by Lee? Lee AJ (1997). Modeling scores in the Premier League: is Manchester United really the best? Chance, 10, 15–19.

Second question: Does the model take into account the opposing team?
for example:
Real Madrid 5 x 0 Levante;
Barcelona 4 x 0 Atlético de Madrid
Atlético de Madrid 3 x 0 Levante
Barcelona ? x ? Real Madrid

In my opinion, Barcelona would be the better attack, because Atlético is stronger than Levante.

John on January 11, 2019 at 7:40 pm said:

John again.
I forgot a doubt.

If this model is the one described by Lee, how can I add the constraints that the sum of the parameters is equal to 1 or 0?

\sum^n_{i =1} \alpha_i = 0
\sum^n_{i =1} \beta_i = 0
\sum^n_{i =1} \eta_i = 0

where alpha = attack, beta = defense and eta = home advantage.

It’s to ensure identifiability of the model.

It’s like you did on Dixon model.

hugs from Brazil.

- opisthokonta on January 13, 2019 at 11:09 am said:
  
  Yes it is the model from Lee AJ (1997). It takes the opposition into account. The GLM function in R automatically adds the constraints needed to make the parameters identifiable. By default it sets one of the parameters to 0, but you can also specify a sum-to-zero constraint. If you use the GLM function, and make predictions with the built-in predict function, you don’t need to worry about what constraints you use.
  
  - John on January 21, 2019 at 7:24 pm said:
    
    thanks and great work ;
    
Ivaylo Vasilev on April 9, 2020 at 3:25 pm said:

Hi, great work!
Does the Poisson distribution lambda value HAVE to be a recorded frequency?
If not, can you explain why the lambda value calculated above works?

- opisthokonta on April 10, 2020 at 3:48 pm said:
  
  I am not sure what you mean. The lambda is the expected value of the Poisson distribution. In the type of regression modelling used here, the lambda is a function of the attack and defense ratings for the different teams (and home field advantage). In other words, the lambda parameter is not estimated as a separate parameter.
  

