I have been meaning to write about my take on using Poisson regression to predict football results for a while, so here we go. Poisson regression is one of the earliest statistical methods used for predicting football results. The goal here is to use available data to say something about how many goals a team is expected to score, and from that calculate the probabilities for different match outcomes.
The Poisson distribution
The Poisson distribution is a probability distribution that can be used to model data that can be counted (i.e. something that can happen 0, 1, 2, 3, … times). If we know the number of times something is expected to happen, we can find the probability that it happens any given number of times. For example, if we know something is expected to happen 4 times, we can calculate the probabilities that it happens 0, 1, 2, … times.
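To make the example with 4 expected events concrete, here is a minimal Python sketch of the Poisson probability formula, P(k) = exp(-L) * L^k / k!. The function name poisson_pmf is just something I made up for the illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

expected = 4.0  # the "expected to happen 4 times" example from the text
for k in range(9):
    print(k, round(poisson_pmf(k, expected), 4))
```

The probabilities over all possible counts add up to 1, and the most likely counts are the ones close to the expected value.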
It turns out that the number of goals a team scores in a football match is approximately Poisson distributed. This means we have a method of assigning probabilities to the number of goals in a match, and from this we can find probabilities for different match results. Note that I write that goals are approximately Poisson. The Poisson distribution does not always perfectly describe the number of goals in a match. It sometimes over- or underestimates the number of goals, and some football leagues seem to fit the Poisson distribution better than others. Anyway, the Poisson distribution seems to be an OK approximation.
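As a rough sketch of how goal probabilities turn into match result probabilities: if we assume the two teams' goals are independent Poisson counts, we can multiply the probabilities for each scoreline and add them up by outcome. The expected goals of 1.5 and 1.1 below are made-up numbers, and the cut-off at 10 goals per team is an assumption that leaves out only a tiny amount of probability.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when lam goals are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def match_probabilities(lam_home, lam_away, max_goals=10):
    """Sum scoreline probabilities into home win, draw and away win."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            # Independence assumption: joint probability is the product.
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Made-up expected goals, for illustration only.
home_win, draw, away_win = match_probabilities(1.5, 1.1)
print(round(home_win, 3), round(draw, 3), round(away_win, 3))
```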
The regression model
To be able to find the probabilities for different numbers of goals we need to find the expected number of goals L (it is customary to denote the expectation in a Poisson distribution by the Greek letter lambda, but WordPress seems to have problems with Greek letters, so I call it L instead). This is where the regression method comes in. With regression we can estimate lambda conditioned on certain variables. The most obvious variable to look at is which team is playing. Manchester United obviously scores more goals than Wigan. The second thing we want to take into account is who the opponent is. Some teams are expected to concede fewer goals, while others are expected to let in more goals. The third thing we want to take into account is home field advantage.
Written in the language of regression models this becomes
log(L) = mu + home + team_i + opponent_j
The mu term is the overall mean number of goals. The home term is the effect that playing at home has on the number of goals a team scores. The team_i term is the effect of team number i, and opponent_j is the effect of the opposing team, team number j.
(Note: Some descriptions of the Poisson regression model on football data use the terms offensive and defensive strength to describe what I have called team and opponent. I prefer the terms I use here because they make it a bit easier to understand the data set when we look at it later.)
The logarithm on the left hand side is called the link function. I will not dwell much on what a link function is, but the short story is that it ensures that the parameter we try to estimate doesn’t fall outside its domain. In this case it ensures that we never get a negative expected number of goals.
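A small sketch of why the log link keeps the expected number of goals positive: to get from the right hand side of the equation back to L we exponentiate, and the exponential of any number, negative or positive, is greater than zero. The function name inverse_log_link and the input values are made up for the illustration.

```python
import math

def inverse_log_link(linear_predictor):
    """Turn a value on the log scale back into an expected number of goals."""
    return math.exp(linear_predictor)

# Arbitrary values of the right hand side, including negative ones.
for value in (-3.0, -0.5, 0.0, 0.4, 2.0):
    print(value, round(inverse_log_link(value), 4))
```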
Data
In my example I will use data from football-data.co.uk. Which data you want to use is up to you. Typically you could use data from the last year or the last season.
Each of the terms on the right hand side of the equation (except for mu) corresponds to a column in a table, so we need to fix our data a bit before we proceed with fitting the model. Each match is essentially two observations: one for how many goals the home team scores, and one for how many the away team scores. Basically, each match needs two rows in our data set, not just one.
Doing the fix is easy in Excel or LibreOffice Calc. We take the data rows (i.e. the matches) we want to use and duplicate them. In the duplicated rows we then move the away team and the away goals into the same columns as the home team and home goals, so that every row holds one team, its opponent and the goals that team scored. We also need a column to indicate whether the team played at home. Here is an example of how it will look:
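If you would rather not do this by hand in a spreadsheet, the same reshaping can be sketched in a few lines of Python. The two matches below are made up, and I am assuming the column names HomeTeam, AwayTeam, FTHG (full time home goals) and FTAG (full time away goals) as they appear in the football-data.co.uk files. The output column names team, opponent, goals and home are my own choice.

```python
# Made-up matches, using the column names assumed from football-data.co.uk.
matches = [
    {"HomeTeam": "Team A", "AwayTeam": "Team B", "FTHG": 2, "FTAG": 1},
    {"HomeTeam": "Team C", "AwayTeam": "Team A", "FTHG": 0, "FTAG": 0},
]

def stack_matches(matches):
    """Turn each match into two rows: one per team, with a home indicator."""
    rows = []
    for m in matches:
        # Row for the home team's goals.
        rows.append({"team": m["HomeTeam"], "opponent": m["AwayTeam"],
                     "goals": m["FTHG"], "home": 1})
        # Row for the away team's goals, with team and opponent switched.
        rows.append({"team": m["AwayTeam"], "opponent": m["HomeTeam"],
                     "goals": m["FTAG"], "home": 0})
    return rows

rows = stack_matches(matches)
for row in rows:
    print(row)
```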
In the next part I will fit the actual model, calculate probabilities and describe how we can make predictions using R.
Hi man,
My name’s Marco and I’m Spanish. I’ve been looking for Poisson formulas for a long time, and I’ve made some spreadsheets myself, but the search still continues. A few days ago I found your page, and it sheds a bit of light on my studies, but since I don’t understand R, I work in Excel. Maybe it is rudimentary, but it works for me!
Now, I hope you can help me with some points of the log(L) formula here:
MU: Is the overall mean of goals. Okay, Goals for? Goals against? Goals of the team I am studying? Goals of the league / championship?
HOME: The home is the effect on number of goals a team has by playing at home. Does it mean: goals for at home / goals for away?
TEAM & OPPONENT: This I think is clear. As you said, it is attacking strength and defense strength.
So, as I understand, for example:
City – Tottenham (Next Saturday 18 October 2014)
70 matches, 196 goals in Premier League. Goals per match: 2.80
*** MU = 2.80
HOME LEAGUE GOALS: 1.54
AWAY LEAGUE GOALS: 1.26
*** Home advantage (effect): 1.54/1.26 = 1.22
*** TEAM (attacking strength of M.City): 0.864
*** OPPONENT (Defense strength of Tottenham): 0.649
Then:
Log(L) = 2.80 + 1.22 + 0.864 + 0.649 = 0.7429
So, 0.7429 is the logarithm. Must I then take the exponent? e^0.7429?
And is 0.7429 (or e^0.7429) the number of goals expected for Manchester City?
Thanks a lot, I hope you understand it at all even though my english…
If you want to answer via e-mail: *******
Or if you want here, I’ll be so grateful wherever you answer!
While it is convenient to think about MU as the overall mean of goals scored, it is not entirely right. Remember that these parameters are estimated in terms of the logarithm of the number of goals. It is perhaps more accurate to describe it as the geometric mean. Therefore you can’t just estimate the MU and home parameters like you did. Also, ignoring logarithms, I am not sure about the way you estimate the home field advantage (HFA). In the context of linear models (models where you add things together) it is more intuitive to think of HFA as a difference instead of a ratio.
And yes, you have to exponentiate the end result to get the correct estimate.
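To illustrate the exponentiation step, here is a small Python sketch. The parameter values are made up and are not estimates from any real data. The point is only that the parameters are added together on the log scale, and that exponentiating the sum is the same as multiplying the exponentiated parameters.

```python
import math

# Made-up parameter values on the log scale, for illustration only.
mu, home, team, opponent = 0.10, 0.25, 0.30, -0.15

log_lambda = mu + home + team + opponent
expected_goals = math.exp(log_lambda)  # natural log, so we use exp

# Adding on the log scale is multiplying on the goal scale.
as_product = math.exp(mu) * math.exp(home) * math.exp(team) * math.exp(opponent)

print(round(log_lambda, 4), round(expected_goals, 4), round(as_product, 4))
```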
Hi. Thanks for your site. Very nice article. Can you give an example with real market odds?
Thanks for the answer, what type of Log do you use? log10? LN?
R uses the natural log by default, so that is the one used here.
How did you calculate the overall goals scored in R?
I explain how you can do predictions in part 2:
http://opisthokonta.net/?p=296
For the expected goals scored by the two teams, how would I go about it?
I also want to know how to predict the probability given the fixed odds of the teams and also the market odds.
I am not an expert on betting and odds markets, but you could try adding the odds to the model instead of using the teams. I did use the odds as predictors in my post on using decision trees with adaptive boosting:
http://opisthokonta.net/?p=809
I love this! But is it possible to avoid entering every match in Excel (for example) and just enter the number of goals for each team (home and away)? In your data you list all the matches, and if we want to do many leagues that takes a long time!
Can you explain in more detail how you downloaded the R package for the boxplot (part 2)? Thank you for your answer! =)
This method will not work with just having the total number of goals scored for each team. What makes this regression model interesting is that it takes into account both the team and opposition in each game. Fortunately, you don’t have to punch in the data yourself, you can take a look at this page for some links to downloadable data sets.
There are no boxplots in part 2, but all plots are made using the built-in plotting commands in R.
How big should the data set be? Do you use only the last 2 matches, or do you collect data over several years?
You should use as much data as possible. You could use the Dixon-Coles weighting method which I have described here. See also this post where I show that you can improve prediction by also including data outside of the league you are going to predict.
How do you find the Home effect column?
You have to create it yourself.
If you are using R, you can take a look at the first block of code in this post:
http://opisthokonta.net/?p=927
Hi,
I still don’t understand. In the data there are 2 variables: goals scored by the home and away teams. From here, do you sum the goals scored by the home team and divide by the goals scored against to get the home advantage?
No. The trick here is that you only look at one variable, goals scored. So you have to stack the two variables into a single variable. Then to get the home team advantage you create a new variable that indicates (with 0’s and 1’s) whether the goals were scored by the home team or not. The implication is that each game can be thought of as two independent observations, one being the number of goals by the home team and the other the number of goals by the away team.
Please, how do I calculate the home advantage?
Take a look at part 2 for how you estimate the parameters, including the home advantage.
Hi, the data set seems to only consider the goals a team scores at home and away, not the goals against at home or away. Do you think it is necessary to take goals against into account?
As you can see in part 2, the goals scored are modeled as a function of who plays and who the opponent is. So the opponent factor for each team takes the goals against into account.
Hello, my name is Brahma and I am Nepali. How can we predict the 2018 World Cup in Russia with this method? Can you please explain? I get confused when I remember that in the World Cup there are no home or away games.
You need to use data from international games to fit the model and make predictions. You are right that in international games there is sometimes no home team. In the World Cup only Russia would be the home team. If there is no home team, I suggest not having any home advantage for that game. In other words, you set the “Home” variable to 0 for both teams.
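As a sketch of what setting the “Home” variable to 0 for both teams means in practice, here is a small Python example. All parameter values and the team names X and Y are made up, and the function name expected_goals is my own.

```python
import math

def expected_goals(mu, home_effect, team_effect, opponent_effect, is_home):
    """Expected goals for one team; is_home is 1 at home and 0 otherwise."""
    return math.exp(mu + home_effect * is_home + team_effect + opponent_effect)

# Made-up parameters on the log scale, for illustration only.
mu, home_effect = 0.10, 0.25
team = {"X": 0.30, "Y": 0.05}
opponent = {"X": -0.20, "Y": 0.10}

# Neutral venue: the home indicator is 0 for both teams.
neutral_x = expected_goals(mu, home_effect, team["X"], opponent["Y"], is_home=0)
neutral_y = expected_goals(mu, home_effect, team["Y"], opponent["X"], is_home=0)

# For comparison: the same game with X as the home team.
hosted_x = expected_goals(mu, home_effect, team["X"], opponent["Y"], is_home=1)

print(round(neutral_x, 3), round(neutral_y, 3), round(hosted_x, 3))
```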
Hello,
I came across this post as you are calculating the geometric mean for soccer results.
I wonder if you can shed some light on how you handled zero values for the geometric mean?
I initially replaced zero values with 0.001 however I feel that this isn’t a good method to handle them.
I would greatly appreciate your response.
Regards
Chachi
I don’t calculate the geometric mean in this post. The model calculates the logarithm of expected (average) goals.
Ah yes sorry,
I was referring to your reply to Marco regarding MU:
“It is perhaps more accurate to describe it as the geometric mean.”
Thanks Again
Hi, good afternoon. Is this the model described by Lee? Lee AJ (1997). Modeling scores in the Premier League: is Manchester United really the best? Chance, 10, 15–19.
Second question: Does the model take into account the opposing team?
for example:
Real Madrid 5 x 0 Levante;
Barcelona 4 x 0 Atlético de Madrid
Atlético de Madrid 3 x 0 Levante
Barcelona ? x ? Real Madrid
In my opinion, Barcelona would be the better attack, because Atlético is stronger than Levante.
John again.
I forgot one question.
If this model is the one described by Lee, how can I add the constraint that the sum of the parameters is equal to 1 or 0?
\sum^n_{i =1} \alpha_i = 0
\sum^n_{i =1} \beta_i = 0
\sum^n_{i =1} \eta_i = 0
where alpha = attack, beta = defense and eta = home advantage.
It’s to ensure identifiability of model.
It’s like you did on Dixon model.
hugs from Brazil.
Yes it is the model from Lee AJ (1997). It takes the opposition into account. The GLM function in R automatically adds the constraints needed to make the parameters identifiable. By default it sets one of the parameters to 0, but you can also specify a sum-to-zero constraint. If you use the GLM function, and make predictions with the built-in predict function, you don’t need to worry about what constraints you use.
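To show why the choice of constraint does not matter for predictions, here is a small Python sketch with made-up parameter values. It is not how the GLM function works internally. It just takes a set of parameters where the first team is fixed at 0, shifts them so the team and opponent effects each sum to zero, moves the shift into mu, and checks that the expected goals are unchanged.

```python
import math

# Made-up parameters with team "A" fixed at 0 (illustration only).
mu, home = 0.20, 0.30
team = {"A": 0.0, "B": 0.4, "C": -0.1}
opponent = {"A": 0.0, "B": -0.2, "C": 0.3}

def centre(effects):
    """Shift effects so they sum to zero; also return the mean removed."""
    mean = sum(effects.values()) / len(effects)
    return {name: value - mean for name, value in effects.items()}, mean

def expected_goals(mu, home, team, opponent, i, j, is_home):
    """Expected goals for team i against team j."""
    return math.exp(mu + home * is_home + team[i] + opponent[j])

team_sz, team_mean = centre(team)
opponent_sz, opponent_mean = centre(opponent)
mu_sz = mu + team_mean + opponent_mean  # the removed means end up in mu

for i in team:
    for j in team:
        if i != j:
            a = expected_goals(mu, home, team, opponent, i, j, 1)
            b = expected_goals(mu_sz, home, team_sz, opponent_sz, i, j, 1)
            print(i, j, round(a, 4), round(b, 4))
```

The individual parameter values change with the constraint, which is why they should not be compared across different parameterizations, but the expected goals stay the same.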
thanks and great work ;
Hi, great work!
Does the Poisson distribution lambda value HAVE to be a recorded frequency?
If not can you explain why this lambda value calculated above works
I am not sure what you mean. The lambda is the expected value for the Poisson distribution. In the type of regression modelling used here the Lambda is a function of the attack and defense ratings for the different teams (and home field advantage). In other words, the lambda parameter is not estimated as a separate parameter.