Calculate the ranked probability score in R

I was asked in the comments for the R code for the ranked probability score, so rather than posting it deep down in the comments I thought I’d write a proper blog post. The ranked probability score (RPS) is a measure of how similar two probability distributions are and is used as a way to evaluate the quality of a probabilistic prediction. It is an example of a proper scoring rule.

The RPS was brought to my attention in the paper “Solving the problem of inadequate scoring rules for assessing probabilistic football forecasting models” by Constantinou and Fenton. In that paper they argue that the RPS is the best measure of the quality of football predictions when the predictions are probabilities for each outcome (home win, draw or away win). The thing about the RPS is that it takes the ordering of the outcomes into account: an away win is in a sense closer to a draw than to a home win. That means that if the actual result is an away win, a higher predicted probability for a draw is considered better than a higher predicted probability for a home win.

You can also find some more details at the pena.lt/y blog.

The following R function takes two arguments. The first argument (predictions) is a matrix with the predictions. Each row is one prediction, with the outcome categories in their proper order, where each element is a probability and each row sums to 1. The second argument (observed) is a numeric vector that indicates which outcome was actually observed, given as the column number of that outcome in the predictions matrix.

For assessing football predictions the predictions matrix would have three columns, with the probabilities for the match ordered as home, draw and away (or in the opposite order).

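rankProbScore <- function(predictions, observed){
  ncat <- ncol(predictions)   # number of ordered outcome categories
  npred <- nrow(predictions)  # number of predictions to score

  rps <- numeric(npred)

  for (rr in seq_len(npred)){
    # Indicator vector with a 1 in the position of the observed outcome.
    obsvec <- rep(0, ncat)
    obsvec[observed[rr]] <- 1
    # Sum the squared differences between the two cumulative distributions.
    cumulative <- 0
    for (i in seq_len(ncat)){
      cumulative <- cumulative + (sum(predictions[rr,1:i]) - sum(obsvec[1:i]))^2
    }
    rps[rr] <- (1/(ncat-1))*cumulative
  }
  return(rps)
}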
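As a quick check of the calculation, here is a minimal vectorised sketch of the same score together with a small example. The name rps_vectorised and the example probabilities are made up for illustration, and like rankProbScore it assumes that observed holds the column number of the observed outcome.

```r
# Vectorised sketch of the ranked probability score (hypothetical helper name).
rps_vectorised <- function(predictions, observed) {
  ncat <- ncol(predictions)
  # Cumulative predicted probabilities, one row per prediction.
  cum_pred <- t(apply(predictions, 1, cumsum))
  # Cumulative observed indicator: 1 from the observed category onwards.
  cum_obs <- outer(observed, seq_len(ncat), function(o, k) as.numeric(k >= o))
  rowSums((cum_pred - cum_obs)^2) / (ncat - 1)
}

# Two made-up predictions (home, draw, away) for a match that ended in an away win.
preds <- rbind(c(0.2, 0.6, 0.2),   # most weight on the draw
               c(0.6, 0.2, 0.2))   # most weight on the home win
obs <- c(3, 3)

rps_vectorised(preds, obs)  # 0.34 0.50
```

Both predictions give the away win the same probability, but the one that leans towards the draw gets the lower (better) score, which is the ordering property described above.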

The Norwegian election survey: Voting patterns across generations.

Predicting election outcomes has in recent years been a popular activity among data analysts. I guess you all know how Nate Silver became known for his predictions in the United States elections. Next year there is a parliamentary election in Norway, and I have been thinking about maybe making an attempt at predicting the results when that time comes. There are already some people in Norway doing this, like the pollofpolls.no website and the Norwegian Computing Center.

In the meantime I decided to take a look at some historical data. After each election in Norway a large survey (1500-2000+ respondents) is carried out in an attempt to figure out why people voted the way they did. This has been going on since the 1950’s and includes both local and national elections. The data from the surveys are available online from the Norwegian Center For Research Data. Only raw data from the oldest surveys are available for immediate download, but the online analytics tool at the website can be used to create simple tabulations of all the variables in the raw data, and the results can be downloaded as spreadsheets.

The obvious thing to look at in data like these is whether there are any correlations between voting patterns and demographic variables. Gender, income and geography are obvious ones, but they are pretty boring, so I didn’t want to look at those. Instead I decided to look at what the relationship between birth year and party preference was.

I used the online tool and tabulated year of birth (or age, if that was the only variable available) against which party each respondent voted for, and downloaded the raw numbers. I did this for each survey from all the available national elections, and the local elections in this millennium. This gave data on 17 elections from 1957 to 2013. I then cleaned the data a bit, threw out the category of parties termed “others” (usually less than 2% of the votes), calculated the birth year from age where necessary, and took care of a bunch of other small details. With 1500+ respondents, about 70 birth years in each election and about 7 parties, this gives about 3 to 4 respondents in each cell, on average. Some parties have much lower support, so these tend to have even lower counts. It was therefore necessary to aggregate the birth years into groups. After some experimenting, I ended up grouping them in 7-year bins.
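The binning step can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data frame survey, its column names and the party labels are hypothetical, and the values are simulated rather than taken from the election surveys.

```r
# Simulated respondent-level data, for illustration only (not the survey data).
set.seed(1)
n <- 1500
survey <- data.frame(
  birth_year = sample(1920:1989, n, replace = TRUE),
  party = sample(paste0("party_", 1:7), n, replace = TRUE)
)

# Average cell count before binning: respondents / (birth years * parties).
avg_cell <- n / (70 * 7)

# Group the birth years into 7-year bins (left-closed intervals).
breaks <- seq(1920, 1990, by = 7)
survey$cohort <- cut(survey$birth_year, breaks = breaks, right = FALSE, dig.lab = 4)

# Party shares within each cohort; every row sums to 1.
counts <- table(survey$cohort, survey$party)
shares <- prop.table(counts, margin = 1)
```

With 1500 respondents the average cell count before binning is about 3, and grouping 70 birth years into ten 7-year cohorts multiplies that by seven.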

What makes birth year more interesting to look at than age is that it gives a window back in time. By looking at age only you get a range of ages from 18 to about 90, but when you look at this data from the birth cohort view you can see 150+ years back in time. The oldest respondent in the data set was born in 1865.

Okay, on to some plots. We can start out with the support for the Labour Party, which has been the most popular party in the time after WWII.

cohort_dna

Each line in this plot is one election. The colors go from black (the 1957 election) to red (the 2013 local election). We see that the general trend is that the Labour Party has most support among voters born before 1950, and that there is a decline among younger generations. We also see a trend where they are not as popular as they used to be in the 1960’s and 70’s, which is also seen in the generations born in the pre-1950’s cohorts. The dark red line at the bottom is the 2001 election, where they had their worst election result since the 1920’s.

So let’s take a look at the support for the Conservative Party, the second most popular party.

cohort_h

Unlike the Labour Party, there does not seem to be any generational trend at all. The Conservatives have usually received between 15% and 25% of the votes, except for a period in the 1980’s, when they received 30%.

The next party up is the Progress Party, which is currently in a coalition cabinet with the Conservative Party. The first election they participated in was the 1973 election, so the birth year series doesn’t go as far back as for the other parties.

cohort_frp

I think this plot is very interesting. It looks like the Progress Party is popular among people born in the 1930’s but also among the young voters. Notice how the rightmost part of each line tends to point upwards. The 1930’s birth trend does not, however, seem to be present in the earliest elections (those with the darkest lines), but the popularity among the youngest part of the election cohort is there.

The support for the Christian Democratic Party also shows some interesting trends. In the plot below we clearly see that they get a sizable portion of their votes from people born before the 1940’s. Also noticeable are the two elections in the 1990’s where they did particularly well, where a lot of younger voters also voted for them. Does there also seem to be a small bump in popularity among voters born in the 1980’s? It could be just a coincidence, so it will be interesting to see if this appears in the next election as well.

cohort_krf

The last plot I want to show is for the Socialist Left Party. What this plot clearly shows is that the Socialist Party is more popular among the younger generations than the older. This does not mean we can extrapolate this into future elections and predict an increased popularity. On the contrary, we also see that their decreasing popularity since their peak in 2001 also applies to the younger generations. One could speculate that some of the younger voters have left the Labour Party in favor of the Socialist Party, and that will be the topic in a future blog post.

cohort_sv

My predictions for the rest of the Premier League season

A couple of weeks ago Constantinos Chappas asked on Twitter for predictions of the results for the remainder of the English Premier League season.

I had been thinking about posting some predictions about the Premier League around New Year, since this season is really exciting and it would be a great opportunity to see how well my models would cope with everything that is currently going on. I have never posted any predictions before, so this will surely be an interesting experience. And I thought Chappas’ initiative was really interesting, so that gave me a nice reason to come through.

Today Chappas posted the combined results from all 15 participants so I thought I could share some of the details behind my contribution.

I originally wanted to use the Conway-Maxwell model I have written about recently, but I had some problems with the estimation procedure, so I instead used a classic Poisson model. I used data on Premier League and Championship results going back to the 2011-12 season. By including data from the Championship I hope to get better predictions, as I have demonstrated before. Since I used data from a long time back I used the Dixon-Coles weighting scheme, which makes more recent games have a greater impact on the predictions. The weighting parameter \(\xi\) was set to 0.0019, which puts a bit more weight on recent games than the 0.0018 I found to be optimal earlier.
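To give a feel for what the weighting parameter does, here is a minimal sketch of the Dixon-Coles exponential weights. It assumes that time is measured in days before the prediction date, and the function name dc_weights and the example dates are made up for illustration.

```r
# Exponential down-weighting of older matches: weight = exp(-xi * days ago).
# Assumption: time is measured in days; matches after current_date get weight 0.
dc_weights <- function(match_dates, current_date, xi = 0.0019) {
  days_ago <- as.numeric(as.Date(current_date) - as.Date(match_dates))
  w <- exp(-xi * days_ago)
  w[days_ago < 0] <- 0
  w
}

# Example dates: today, one year ago, and the start of the 2011-12 season.
w <- dc_weights(c("2016-01-01", "2015-01-01", "2011-08-13"), "2016-01-01")

# Number of days until a match counts half as much as a match played today.
half_life <- log(2) / 0.0019
```

With \(\xi = 0.0019\) the half-life is about 365 days, so under this assumption a match played a year ago counts roughly half as much as one played today.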

I fitted the model and calculated the probabilities for the remaining games of the season. From these probabilities I simulated the rest of the season ten thousand times. From these simulations we can get the probabilities and expectations for the end of season results.
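The simulation step can be sketched roughly as follows. This is a toy version under stated assumptions: the three teams, their current points and the match probabilities are invented, and ties at the top are broken at random instead of by goal difference.

```r
set.seed(42)

# Hypothetical inputs: current points and outcome probabilities for remaining fixtures.
points_now <- c(A = 40, B = 38, C = 30)
fixtures <- data.frame(
  home = c("A", "B", "C"),
  away = c("B", "C", "A"),
  p_home = c(0.45, 0.50, 0.30),
  p_draw = c(0.28, 0.27, 0.28),
  p_away = c(0.27, 0.23, 0.42),
  stringsAsFactors = FALSE
)
probs <- as.matrix(fixtures[, c("p_home", "p_draw", "p_away")])

# Simulate the remaining fixtures once and return the final points.
simulate_season <- function(points_now, fixtures, probs) {
  pts <- points_now
  for (m in seq_len(nrow(fixtures))) {
    h <- fixtures$home[m]
    a <- fixtures$away[m]
    outcome <- sample(3, 1, prob = probs[m, ])  # 1 = home win, 2 = draw, 3 = away win
    if (outcome == 1) {
      pts[h] <- pts[h] + 3
    } else if (outcome == 2) {
      pts[c(h, a)] <- pts[c(h, a)] + 1
    } else {
      pts[a] <- pts[a] + 3
    }
  }
  pts
}

nsim <- 10000
sims <- replicate(nsim, simulate_season(points_now, fixtures, probs))  # teams x simulations

# Expected end-of-season points and probability of finishing first.
expected_points <- rowMeans(sims)
winner <- apply(sims, 2, function(p) {
  top <- rownames(sims)[p == max(p)]
  top[sample(length(top), 1)]  # random tie-break, a simplification
})
p_title <- table(factor(winner, levels = names(points_now))) / nsim
```

The expected points are simply the averages over the simulated seasons, which is why they are not whole numbers.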

So what do I predict the league table will look like at the end of the season?

Team Points
Manchester City 75.7
Arsenal 75.2
Tottenham 65.6
Leicester City 64.8
Manchester United 64.3
Liverpool 58.2
West Ham 56.1
Chelsea 54.7
Everton 53.7
Crystal Palace 53.7
Stoke City 52.9
Watford 51.9
Southampton 50.6
West Bromwich Albion 45.8
Norwich City 43.7
Bournemouth 42.9
Swansea City 40.9
Newcastle 34.5
Sunderland 31.5
Aston Villa 23.1

Although I predict 0.5 points more for Manchester City than Arsenal, each of them has a 47.0% probability of winning the league. I also give Tottenham a 2.3% chance, Leicester 2.1% and Manchester United 1.5%. Finally, Liverpool have a 0.1% chance. The other teams have a chance less than 0.04%.

I will come back with an update with my entire table with probabilities for all positions for all teams.

The underdispersed Conway-Maxwell Poisson distribution and goal differences

I have unfortunately not had the time to look more closely at the performance of the underdispersed count distributions that I found to be useful for predicting football results in my last post. Here I take a quick look at how the Conway-Maxwell distribution (COM) influences the predicted goal differences compared to the Poisson distribution.

Using data from the 2010-11 Premier League season I fitted both the Poisson model and the COM model. The estimated dispersion parameter for the COM model indicated that there was less variability in the actual goals scored than implied by the Poisson distribution. I used the code I posted here to compute the probability distributions for the goal differences for five matches in the season.
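The goal-difference distribution itself is easy to compute once you have the probability distribution for the number of goals for each team. The sketch below is not the code referred to above; it is a minimal illustration that assumes the two teams score independently, and the function name goal_diff_dist and the expected goals are made up. It uses Poisson probabilities, but the probabilities from any other count distribution could be passed in the same way.

```r
# Distribution of (home goals - away goals), assuming independent goal counts.
# p_home[i] and p_away[j] are the probabilities of i - 1 and j - 1 goals.
goal_diff_dist <- function(p_home, p_away) {
  joint <- outer(p_home, p_away)  # P(home = i - 1, away = j - 1)
  diff <- outer(seq_along(p_home) - 1, seq_along(p_away) - 1, "-")
  out <- tapply(as.vector(joint), as.vector(diff), sum)
  data.frame(goal_diff = as.integer(names(out)), prob = as.vector(out))
}

# Hypothetical expected goals for the home and away team.
max_goals <- 15
lambda_home <- 1.6
lambda_away <- 1.1

gd <- goal_diff_dist(dpois(0:max_goals, lambda_home),
                     dpois(0:max_goals, lambda_away))

# Probabilities of a home win, a draw and an away win.
p_outcomes <- c(home = sum(gd$prob[gd$goal_diff > 0]),
                draw = sum(gd$prob[gd$goal_diff == 0]),
                away = sum(gd$prob[gd$goal_diff < 0]))
```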

Let’s first look at the goal difference distribution for Arsenal playing at home against Manchester City. Both teams were in the top of the final table, and the actual result for this game in the 2010-11 season was 0-0. Comparing the distributions from the Poisson and COM models we see that they are pretty much identical.

com_arsman2011

For Aston Villa vs. Sunderland, which placed 9th and 10th on the table in 2010-11, we also see that there is not much difference between the two models. Although there is a slight increase in the probability for the actual result in that game, I don’t think it is of much importance.

com_astsun2011

Let’s compare the models using two teams from the bottom of the table, Wigan vs. Wolverhampton. Again, not much difference. Also note that there is basically no change in the probability for the actual result.

com_wigwol2011

OK, so far the comparisons have been based on teams of similar strengths. But take Blackburn (15th on the table) vs. Liverpool (6th). Now we see that the COM model and Poisson model differ a bit. Here the COM model gives a worse prediction of the actual result than the Poisson model. But considering only the league positions, the skewing of the distribution in favor of Liverpool in this case may not be totally unreasonable.

com_blaliv2011

The last plot compares the distributions for Chelsea (2nd on the table) vs. Birmingham (18th). Here we clearly see that there is a substantial difference in the prediction between the two models. The COM model favors Chelsea much more than the Poisson model does, which in this case gives a much higher probability for the correct result.

com_chebir2011

From just looking at these few plots, I think we can conclude that the (underdispersed) COM model differs from the Poisson model where there is a greater difference in strength between the two sides.

Underdispersed Poisson alternatives seem to be better at predicting football results

In the previous post I discussed some Poisson-like probability distributions that offer more flexibility than the Poisson distribution. They typically have an extra parameter that controls the variance, or dispersion. The reason I looked into these distributions was of course to see if they could be useful for modeling and predicting football results. I hoped in particular that the distributions that can be underdispersed would be most useful. If the underdispersed distributions describe the data well then the model should predict the outcome of a match better than the ordinary Poisson model.

The model I use is basically the same as the independent Poisson regression model, except that the part with the Poisson distribution is replaced by one of the alternative distributions. Let \(Y_{ij}\) be the number of goals scored in game i by team j:


\( Y_{ij} \sim f(\mu_{ij}, \sigma) \)
\( \log(\mu_{ij}) = \gamma + \alpha_j + \beta_k \)

where \(\alpha_j\) is the attack parameter for team j, \(\beta_k\) is the defense parameter for opposing team k, and \(\gamma\) is the home field advantage parameter that is applied only if team j plays at home. \(f(\mu_{ij}, \sigma)\) is one of the probability distributions discussed in the last post, parameterized by the location parameter \(\mu\) and dispersion parameter \(\sigma\).
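For the ordinary Poisson case this model can be fitted with the glm function in base R. The sketch below is only meant to show the layout of the data and the model formula; the teams and results are simulated, all object names are hypothetical, and unlike the formula above glm also adds an intercept and uses the first team as the reference level.

```r
set.seed(123)

# Simulated results for four hypothetical teams, each pairing played five times.
teams <- c("A", "B", "C", "D")
fixtures <- expand.grid(home = teams, away = teams, stringsAsFactors = FALSE)
fixtures <- fixtures[fixtures$home != fixtures$away, ]
fixtures <- fixtures[rep(seq_len(nrow(fixtures)), 5), ]
fixtures$hgoal <- rpois(nrow(fixtures), 1.5)
fixtures$agoal <- rpois(nrow(fixtures), 1.1)

# Long format: one row per team per game, so each game contributes two rows.
long <- data.frame(
  goals = c(fixtures$hgoal, fixtures$agoal),
  attack = factor(c(fixtures$home, fixtures$away), levels = teams),
  defence = factor(c(fixtures$away, fixtures$home), levels = teams),
  home = rep(c(1, 0), each = nrow(fixtures))
)

# Independent Poisson model with a log link.
fit <- glm(goals ~ home + attack + defence, family = poisson(link = "log"), data = long)

aic_poisson <- AIC(fit)               # for comparison with other fitted models
home_effect <- exp(coef(fit)["home"]) # multiplicative home field advantage
```

Replacing the Poisson with one of the alternative distributions keeps the same data layout and linear predictor, but needs a fitting function that supports that distribution.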

To these models I fitted data from English Premier League from the 2010-11 season to the 2014-15 season. I also used Bundesliga data from the same seasons. The models were fitted separately for each season and compared to each other with AIC. I consider this only a preliminary analysis and I have therefore not done a full scale testing of the accuracy of predictions where I refit the model before each match day and use Dixon-Coles weighting.

The five probability distributions I used in the above model were the Poisson (PO), negative binomial (NBI), double Poisson (DPO), Conway-Maxwell Poisson (COM) and the Delaporte (DEL), which I did not mention in the last post. All of these, except the Conway-Maxwell Poisson, were easy to fit using the gamlss R package. I also tried two other gamlss-supported models, the Poisson inverse Gaussian and Waring distributions, but the fitting algorithm did not work properly. To fit the Conway-Maxwell Poisson model I used the CompGLM package. For good measure I also fitted the data to the Dixon-Coles bivariate Poisson model (DC). This model is a bit different from the rest of the models, but since I have written about it before and never really tested it I thought this was a nice opportunity to do just that.

The AIC calculated from each model fitted to the data is listed in the following table. A lower AIC indicates that the model is better. I have indicated the best model for each data set in red.

Pois_alt_aic

The first thing to notice is that the two models that only account for overdispersion, the Negative Binomial and Delaporte, are never better than the ordinary Poisson model. The other and more interesting thing to note, is that the Conway-Maxwell and Double Poisson models are almost always better than the ordinary Poisson model. The Dixon-Coles model is also the best model for three of the data sets.

It is of course necessary to take a look at the estimates of the parameters that extend the three models from the Poisson model: the \(\sigma\) parameter for the Conway-Maxwell and double Poisson, and the \(\rho\) for the Dixon-Coles model. Remember that for the Conway-Maxwell a \(\sigma\) greater than 1 indicates underdispersion, while for the Double Poisson model a \(\sigma\) less than 1 indicates underdispersion. For the Dixon-Coles model a \(\rho\) less than 0 indicates an excess of 0-0 and 1-1 scores and fewer 0-1 and 1-0 scores, while it is the opposite for \(\rho\) greater than 0.

pois_alt_params

It is interesting to see that the estimated dispersion parameters indicate underdispersion for all the data sets. It is also interesting to see that the data sets where the parameter estimates are most indicative of equidispersion are the ones where the Poisson model is best according to AIC (Premier League 2013-14 and Bundesliga 2010-11 and 2014-15).

The parameter estimates for the Dixon-Coles model do not give a very consistent picture. The sign seems to change a lot from season to season for the Premier League data, and for the data sets where the Dixon-Coles model was found to be best, the signs were in the opposite direction of the motivation described in the original 1997 paper. Although it does not look so bad for the Bundesliga data, this makes me suspect that the Dixon-Coles model is prone to overfitting. Compared to the Conway-Maxwell and double Poisson models, which can capture more general patterns in all of the data, the Dixon-Coles model extends the Poisson model to just parts of the data, the low scoring outcomes.

It would be interesting to do fuller tests of the prediction accuracy of these three models compared to the ordinary Poisson model.

Some alternatives to the Poisson distribution

One important characteristic of the Poisson distribution is that both its expectation and its variance equal the parameter \(\lambda\). A consequence of this is that when we use the Poisson distribution, for example in a Poisson regression, we have to assume that the variance equals the expected value.

The equality assumption may of course not hold in practice and there are two ways in which this assumption can be wrong. Either the variance is less than the expectation or it is greater than the expectation. This is called under- and overdispersion, respectively. When the equality assumption holds, it is called equidispersion.
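The three situations are easy to illustrate by simulation. In the sketch below the ratio of the variance to the mean is compared for counts drawn from a Poisson, a negative binomial and a binomial distribution, all with an expected value of 2. The choice of distributions and parameter values is mine and only meant as an illustration.

```r
set.seed(1)
n <- 1e5

equi  <- rpois(n, lambda = 2)               # variance = mean = 2
over  <- rnbinom(n, mu = 2, size = 4)       # variance = mu + mu^2 / size = 3
under <- rbinom(n, size = 10, prob = 0.2)   # variance = 10 * 0.2 * 0.8 = 1.6

# Variance-to-mean ratio: about 1 for equidispersion, above 1 for
# overdispersion and below 1 for underdispersion.
dispersion_index <- function(x) var(x) / mean(x)
di <- sapply(list(equi = equi, over = over, under = under), dispersion_index)
```

The theoretical ratios are 1, 1.5 and 0.8, and the simulated values should land close to these.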

There are two main consequences if the assumption does not hold: The first is that standard errors of the parameter estimates, which are based on the Poisson, are wrong. This could lead to wrong conclusions when doing inference. The other consequence happens when you use the Poisson to make predictions, for example how many goals a football team will score. The probabilities assigned to each number of goals to be scored will be inaccurate.

(Under- and overdispersion should not be confused with heteroscedasticity in ordinary linear regression. Poisson regression models are naturally heteroscedastic because of the variance-expectation equality. Dispersion refers to what relationship there is between the variance and the expected value, in other words what form the heteroscedasticity takes.)

When it comes to modeling and predicting football results using the Poisson, a good thing would be if the data were actually underdispersed. That would mean that the probabilities for the predicted number of goals scored would be higher around the expectation, and it would be possible to make more precise predictions. The increase in precision would be greatest for the best teams. Even if the data were really overdispersed, we would still get probabilities that more accurately reflect the observed number of goals, although the predictions would be less precise.

This is the reason why I have looked into alternatives to the Poisson model that are suitable to model count data and that are capable of being over- and underdispersed. Except for the negative binomial model there seems to have been little focus on more flexible Poisson-like models in the literature, although there are a handful of papers from the last 15 years with some applied examples.

I should already mention the gamlss package, which is an extremely useful package that can fit a large number of regression type models in R. I like to think of it as the glm function on steroids. It can be used to create regression models for a large number of distributions (50+), with different kinds of model terms (for example random effects and splines), and it can do regression on distribution parameters other than the usual expectation parameter.

The models that I have considered usually have two parameters. The two parameters are often not easy to interpret, but the distributions can be re-parameterized (which is done in the gamlss package) so that the parameters describe the location (denoted \(\mu\), often the same as the expectation) and shape (denoted \(\sigma\), often a dispersion parameter that modifies the association between the expectation and variance). Another typical property is that they equal the Poisson for certain values of the shape parameter.

As I have already mentioned, the kind of model that is most often put forward as an alternative to the Poisson is the Negative binomial distribution (NBI). The advantages of the negative binomial are that it is well studied and that good software packages exist for using it. The shape parameter \(\sigma > 0\) determines the overdispersion (relative to the Poisson) so that the closer it is to 0, the more it resembles the Poisson. This is a disadvantage as it can not be used to model underdispersion (or equidispersion, although in practice it can come arbitrarily close to it). Another similar, but less studied, model is the Poisson-inverse Gaussian (PIG). It too has a parameter \(\sigma > 0\) that determines the overdispersion.
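The role of the shape parameter can be checked with base R alone. The sketch below assumes the parameterization where the variance is \(\mu + \sigma\mu^2\), which corresponds to setting size = 1/sigma in dnbinom. The helper name dnbi is hypothetical and is not a function from the gamlss package.

```r
# Negative binomial probabilities in a (mu, sigma) parameterization,
# assuming variance = mu + sigma * mu^2, i.e. size = 1 / sigma in dnbinom.
dnbi <- function(x, mu, sigma) dnbinom(x, mu = mu, size = 1 / sigma)

x <- 0:200
mu <- 1.5

# Variance for a clearly overdispersed case, computed from the probabilities.
p <- dnbi(x, mu, sigma = 0.5)
nbi_mean <- sum(x * p)
nbi_var <- sum((x - nbi_mean)^2 * p)  # close to 1.5 + 0.5 * 1.5^2 = 2.625

# As sigma approaches 0 the probabilities approach those of the Poisson.
max_diff <- max(abs(dnbi(0:10, mu, sigma = 1e-6) - dpois(0:10, mu)))
```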

NBI_PIG

A large class of distributions, called Weighted Poisson distributions, is capable of being both over- and underdispersed. (The term Weighted in the name comes from a technique used to derive the distribution formulas, not from the data being weighted.) A paper describing this class can be found here. The general form of the probability distribution is

\(P(x;\theta,\alpha)=\frac{e^{\theta x+\alpha t(x)}}{x!C(\theta,\alpha)}\)

where \(t(x)\) is one of a large number of possible functions, and \(C(\theta,\alpha)\) is a normalizing constant which makes sure all probabilities in the distribution sum to 1. Note that I have denoted the two parameters using \(\theta\) and \(\alpha\) and not \(\mu\) and \(\sigma\) to indicate that these are not necessarily location and shape parameters. I think this is an interesting class of distributions that I want to look more into, but since they are not generally implemented in any R package that I know of I will not consider them further now.

Another model that is capable of being over- and underdispersed is the Conway–Maxwell–Poisson distribution (COM), which incidentally is a special case of the class of Weighted Poisson distributions mentioned above (see this paper). The Poisson distribution is a special case of the COM when \(\sigma = 1\), and is underdispersed when \(\sigma > 1\) and overdispersed when \(\sigma\) is between 0 and 1. One drawback with the COM model is that the expected value depends on both parameters \(\mu\) and \(\sigma\), although it is dominated by \(\mu\). This makes the interpretation a bit difficult, but it may not be a problem when making predictions.
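To see how the dispersion behaves, the COM probabilities can be computed directly. The sketch below is hand-rolled and does not use any of the COM packages. It uses the standard rate and shape parameterization with probabilities proportional to \(\lambda^x / (x!)^\nu\), where \(\nu\) plays the role of \(\sigma\) above. Treating \(\lambda\) as the counterpart of \(\mu\) is an assumption on my part, and the normalizing constant is approximated by truncating the sum.

```r
# Conway-Maxwell-Poisson probabilities, proportional to lambda^x / (x!)^nu.
# The normalizing constant is approximated by summing over 0:max_count.
dcmp <- function(x, lambda, nu, max_count = 100) {
  support <- 0:max_count
  logw <- support * log(lambda) - nu * lgamma(support + 1)
  w <- exp(logw - max(logw))  # subtract the maximum for numerical stability
  (w / sum(w))[x + 1]
}

# Mean and variance computed from the probabilities.
cmp_moments <- function(lambda, nu, max_count = 100) {
  x <- 0:max_count
  p <- dcmp(x, lambda, nu, max_count)
  m <- sum(x * p)
  c(mean = m, variance = sum((x - m)^2 * p))
}

# Same lambda, three different shape values.
moments <- rbind(
  over  = cmp_moments(lambda = 2, nu = 0.7),
  equi  = cmp_moments(lambda = 2, nu = 1),
  under = cmp_moments(lambda = 2, nu = 1.5)
)
```

The rows of moments show both points made above: the variance falls below the mean when the shape parameter is greater than 1, and the mean changes with the shape parameter even though \(\lambda\) is held fixed.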

Unfortunately, the COM model is not supported by the gamlss package, but there are some other R packages that implement it. I have tried a few of them and the only one that I got to work is CompGLM, which for some reason does not use the location (\(\mu\)) and shape (\(\sigma\)) parameterization.

COM

The Double Poisson (DP) is another interesting distribution which also equals the Poisson distribution when \(\sigma = 1\), but is overdispersed when \(\sigma > 1\) and underdispersed when \(\sigma\) is between 0 and 1. The expectation does not depend on the shape parameter \(\sigma\), and it is approximately equal to the location parameter \(\mu\). Another interesting thing about the Double Poisson is that it belongs to a larger group of distributions called double exponential families, which also lets you derive a binomial-like distribution with an extra dispersion parameter that can be useful in a logistic regression setting (see this paper, or this preprint).
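The same kind of check can be done for the Double Poisson. The sketch below is hand-rolled from Efron's form of the density, with \(\theta = 1/\sigma\) and numerical normalization over a truncated range. The function name ddoublepois is hypothetical, and the mapping between \(\sigma\) and \(\theta\) is my assumption about the parameterization used here.

```r
# Double Poisson probabilities based on Efron's density with theta = 1 / sigma,
# normalized numerically over 0:max_count (assumed parameterization).
ddoublepois <- function(x, mu, sigma, max_count = 200) {
  support <- 0:max_count
  theta <- 1 / sigma
  ylogy <- ifelse(support == 0, 0, support * log(support))  # y * log(y), with 0 * log(0) = 0
  logw <- -theta * mu - support + ylogy - lgamma(support + 1) +
    theta * (support * (1 + log(mu)) - ylogy)
  w <- exp(logw - max(logw))
  (w / sum(w))[x + 1]
}

# Mean and variance for an underdispersed case with mu = 2 and sigma = 0.5.
x <- 0:200
p <- ddoublepois(x, mu = 2, sigma = 0.5)
dp_mean <- sum(x * p)
dp_var <- sum((x - dp_mean)^2 * p)
```

In this example the mean stays close to \(\mu\) while the variance is roughly half the mean, in line with the description above.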

DP

In a follow up post I will try to use these distributions in regression models similar to the independent Poisson model.

A hectic schedule has some effect on the outcome of a football match

It may be that a football team that has had a hectic period with a lot of games will perform worse because of lack of training and recovery. The Wikipedia page for the FA Cup mentions Manchester United’s absence from the cup as a reason for why they won the Premier League by 18 points in the 1999-2000 season. If this is indeed the case, then this is something we could try to exploit in a prediction model.

I used basically the same data and model as I have used before. I used data from the English Championship and the Premier League, and predicted the Premier League games from January 2007 until January 2015 using the independent Poisson model with the Dixon & Coles weighting method (more details on the setup here and here). In addition I constructed a new variable, the number of matches each team has played in the last x days, where we can try different values of x. As a pretentious shorthand I will call this the Match Schedule Intensity Index (MSII). Matches from the FA Cup, Europa Cup and Champions League were also included in the calculations.
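The MSII is straightforward to compute from a list of match dates. Here is a minimal sketch with a made-up schedule. The function name count_recent and the data are hypothetical, and I have assumed that the window excludes the match day itself but includes the day exactly x days earlier.

```r
# Hypothetical schedule, which would include cup matches as well as league matches.
matches <- data.frame(
  date = as.Date(c("2014-12-01", "2014-12-06", "2014-12-13", "2014-12-20", "2014-12-26")),
  home = c("A", "B", "A", "C", "A"),
  away = c("B", "C", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Number of matches a team has played in the `days` days before `date`.
count_recent <- function(team, date, matches, days = 28) {
  played <- matches$home == team | matches$away == team
  in_window <- matches$date < date & matches$date >= date - days
  sum(played & in_window)
}

# MSII for the home and away team in each match.
idx <- seq_len(nrow(matches))
matches$msii_home <- sapply(idx, function(i) count_recent(matches$home[i], matches$date[i], matches))
matches$msii_away <- sapply(idx, function(i) count_recent(matches$away[i], matches$date[i], matches))
```

The two new columns can then be added as ordinary covariates in the Poisson regression.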

As usual the ranked probability score (RPS) is used to assess the prediction accuracy.

I tried four different numbers of days backwards in time (21, 25, 28 and 31 days) and also varied the time weighting parameter \(\xi\) a bit to see how these things varied together.

Plotting the RPS, number of days back in time and the different values of \(\xi\) against each other gives the following:

xiRPS

We see that looking back 28 days, or four weeks, gives the lowest RPS and thus the most accurate predictions of the four alternatives. 25 days is almost as good as 28 days, while 21 and 31 days perform worse than not having the MSII in the model at all. I am not sure how important the drop in RPS is, as the changes are around the 4th and 5th decimal place. It is probably not that much, but on the other hand, this is an average over 3000 matches, and the number of days backward in time seems to be a more important parameter than the small changes in \(\xi\) that I tried.

It is also interesting to see what effect the MSII has on the number of goals scored. I plotted the estimated multiplicative effect for each additional match for all the fitted models from 2007 to 2015 using the best model with 28 days and \(\xi=0.0020\).

effectTime

I expected the effect of additional matches to be negative, meaning the more games the team has recently played, the fewer goals they will be expected to score. This seems to be at least halfway true, except for a few dips over on the positive side around 2010 and 2013-2014, and a rather large positive effect from the start in 2007 until 2008. This was a bit surprising, and I don’t know why. It would be interesting to redo the analysis with data going further back in time to see how far back the positive effect goes.

Is the effect large? Not really. The most extreme values of the multiplicative effect for the MSII are around 0.97 and 1.04. These values mean that for each additional match a team has played in the last four weeks, it is expected to score around 3% fewer or 4% more goals. Compounded over four matches in four weeks, which is a typical mid-season schedule, this amounts to roughly 11% fewer or 17% more goals at these extremes. This is not that big of a deal for individual matches, but it seems to improve the predictions in the long run. I also think it is necessary to keep in mind that the effect seems to be mostly absent in some periods.
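Since the effect is multiplicative, the effect of several matches is found by raising the per-match effect to the number of matches played. A small calculation using the two extreme values mentioned above:

```r
# Per-match multiplicative effects at the two extremes.
effect_per_match <- c(low = 0.97, high = 1.04)
n_matches <- 1:4

# Rows are the two extremes, columns are 1 to 4 matches in the window.
compound <- outer(effect_per_match, n_matches, "^")
round(compound, 3)
```

For four matches this gives about 0.885 and 1.170.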

Better prediction, not just for promoted teams

Ian posted an interesting question that had a lot to do with the post I posted last week:

I have implemented the model to make predictions with two different approaches. The first approach is the standard where I use all matches played in a league to predict a match between Team A and Team B. The second approach is to use just matches played by Team A and Team B to predict the outcome of when they both play each other.

Now would you say that the second approach should be more accurate? As surely the only results which matter for predicting the match between Team A and B is of those two teams?

My answer was that regression models use all the data to estimate the parameters, and that the parameter estimates for Team A and Team B will probably be more precise when matches where neither team is playing are included. The intuition for this is that both teams play against a whole bunch of other teams during the season, and the more accurate parameter estimates we can get for these other teams, the more information we get from the matches involving either Team A or Team B. One possible way of getting more accurate parameter estimates for all the other teams is to include data from more matches, if available. Finally, more precise parameter estimates should hopefully provide better predictions.

This is not exactly what I demonstrated in the last post. There I just demonstrated that more data, especially related to promoted teams, will give better predictions on average across the whole Premier League. I did not investigate exactly where these improved predictions occur. It could be that all that gain was just related to the improved parameter estimates of the promoted teams.

That is why, prompted by Ian’s comment, I took a closer look at the predictions. Using the model fitted with data from the Premier League and the Championship, with separate home field advantage for the two divisions, I decided to look at how good the predictions were for some Premier League teams. Recall that this was the model that made the best predictions in the previous post. I decided to look at only the matches between Manchester United, Arsenal, Aston Villa, Chelsea, Liverpool, Everton and Tottenham, since these teams have played in the Premier League for a long time.

When only looking at these teams, and using Premier League data only, the RPS was 0.24462. When the Championship was included in the data, the RPS was a bit smaller, 0.24436. So this means that including more data, not directly related to this group of teams, improved predictions within that group.

I also tried the model without a separate home field advantage parameter for the two divisions, and the predictions got worse for this group of teams. This was not the case when looking at the predictions for all Premier League matches, where it got better on average. This demonstrates an important point that I did not mention in my reasoning above: More data is not necessarily a good thing if your model can’t properly handle it.

Better prediction of Premier League matches using data from other competitions

In most of my football posts on this blog I have used data from the English Premier League to fit statistical models and make predictions. Only occasionally have I looked at other leagues, but always in isolation. That is, I have never combined data from different leagues and competitions into the same model. Using a league by itself works mostly fine, but I have experienced some issues. Model fitting and prediction making often simply do not work at the beginning of the season. The reason for this has mostly to do with newly promoted teams.

If only data from the Premier League is used to fit a model, then no data on the new teams is available at the beginning of the season. This makes predicting the outcome of the first matches of the new teams impossible. In subsequent matches the information available is also very limited compared to the other teams, for which we can rely on data from the previous seasons. This uncertainty about the new teams also propagates into the estimates and predictions for the other teams.

This problem can be remedied by using data from outside the Premier League to help estimate the parameters for the promoted teams. The most obvious place to look for data related to the promoted teams is in the Championship, where the teams played before they were promoted. The FA Cup, where teams from the Championship and Premier League are automatically qualified, should also be a good place to use data from.

To test how much the extra data helps make predictions in the Premier League, I did something similar to what I did in my post on the Dixon-Coles time weighting scheme. I used the independent Poisson model to make predictions for all the Premier League matches from 1st of January 2007 to 15th of January 2015. The predictions were made using a model fitted only with data from previous matches (going back to August 2005), thus emulating a realistic real-time prediction scenario. I weighted the data using the Dixon-Coles approach, with \(\xi=0.0018\). This makes the scenario a bit unrealistic, since I estimated this parameter using the same Premier League matches I am going to predict here. I also experimented with using different home field advantage for each of the competitions.
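As a rough illustration of the time weighting, here is a minimal sketch of the Dixon-Coles weight function \(\phi(t) = e^{-\xi t}\). The function name dc_weights is made up for this example. It assumes that t is measured in days before the prediction date, and that matches played on or after the prediction date get weight 0.

```r
# Dixon-Coles time weights: phi(t) = exp(-xi * t).
# Assumption: t is the number of days between the match and the prediction date.
dc_weights <- function(match_dates, prediction_date, xi = 0.0018) {
  days_ago <- as.numeric(as.Date(prediction_date) - as.Date(match_dates))
  w <- exp(-xi * days_ago)
  # Assumption: only matches played before the prediction date are used.
  w[days_ago <= 0] <- 0
  w
}

# One match a year before the prediction date, one the day before,
# and one on the prediction date itself.
w <- dc_weights(c("2014-01-15", "2015-01-14", "2015-01-15"), "2015-01-15")
```

The weights can then be passed to the model fitting function, so that older matches contribute less to the parameter estimates.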

To measure prediction quality I used the Ranked Probability Score (RPS), which goes from 0 to 1, with 0 being perfect prediction. RPS is calculated for each match, and the RPS I report here is the average RPS of all predictions made. Since this is over 3600 matches, I am going to report the RPS with quite a lot of decimal places.

Although the RPS goes from 0 to 1, using RPS = 1 to represent the worst possible predictive ability is unrealistic. To get a more realistic baseline to compare against, I calculated the RPS using the raw proportions of home wins, draws and away wins in my data as the probabilities. In statistical jargon this is often called the null model. The probabilities were 0.47, 0.25 and 0.28, respectively, and gave an RPS of 0.2249.
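The null model calculation can be sketched like this. The helper rps_single and the short outcomes vector are made up for illustration. The real calculation would use the observed results of all the matches.

```r
# RPS for a single prediction. probs must be ordered (e.g. home, draw, away)
# and outcome is the index of the result that was observed.
rps_single <- function(probs, outcome) {
  obs <- replace(numeric(length(probs)), outcome, 1)
  sum((cumsum(probs) - cumsum(obs))^2) / (length(probs) - 1)
}

# Hypothetical outcomes (1 = home, 2 = draw, 3 = away), for illustration only.
outcomes <- c(1, 1, 3, 2, 1, 3, 1, 2)

# Null model: every match gets the raw outcome proportions as its prediction.
null_probs <- as.numeric(table(factor(outcomes, levels = 1:3))) / length(outcomes)
null_rps <- mean(sapply(outcomes, function(o) rps_single(null_probs, o)))
```

With the proportions reported above, a home win would score rps_single(c(0.47, 0.25, 0.28), 1) = 0.17965, while an away win would score a much poorer 0.36965.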

Using only Premier League data, skipping predictions for the first matches in a season involving newly promoted teams, gave an RPS of 0.19558.

Including data from the Championship in the model fitting, and assuming the home field advantage was the same in both divisions, gave an RPS of 0.19298. Adding a separate parameter for the home field advantage in the Championship gave an even better RPS of 0.19292.

Including data from the FA Cup (in addition to data from the Championship) was challenging. When data from the earliest rounds were included, the model fitting sometimes failed. I am not 100% sure of this, but I believe the reason is that some teams, or groups of teams, are mostly isolated from the rest of the teams. By that I mean that some groups of teams have only played each other, but not any other team in the data. While this is not actually the case (it cannot be), I nevertheless think the time weights make this approximately true. Matches played a few years before the matches that predictions are made for will have weights that are almost 0. It seems reasonable that this, coupled with the incomplete design of the knockout format, is where the trouble comes from.

Anyway, I got it to work by excluding matches played by a team not in the Championship or Premier League in the respective season. An additional parameter for home field advantage in the Cup was included in the model as well. Interestingly, this gave somewhat poorer predictions than using additional data from the Championship only, with an RPS of 0.192972, but still better than using Premier League data only. With the same overall home field advantage for all the competitions, the predictions were unsurprisingly poorer, with RPS = 0.1931.

I originally wanted to include data from the Champions League and Europa League, as well as data from other European leagues, but the problems and results with the FA Cup made me dismiss the idea.

I am not sure why including the FA Cup didn’t give better predictions, but I have some theories. One is that a separate FA Cup home field advantage is unrealistic. Perhaps it would be better to assume that the home field advantage is the same as in the division the two opponents play in, if they play in the same division. If they played in different divisions, perhaps an overall average home field advantage could be used instead.

Another theory has to do with the time weighting scheme. The time weighting parameter I used was found by using data from the Premier League only. Since this gives uncertain estimates for the newly promoted teams, it will perhaps give more recent matches more weight to try to compensate. With more informative data from the previous season available, that data should probably have more influence. Perhaps the time weighting could be further refined with different weighting parameters for each division.

Rain does not influence football results

I have often seen the weather mentioned as something that could influence football results, but I have yet to see anyone looking more into it. There are various ways in which the game could be influenced by the weather, and here I am going to look into the effects of precipitation (i.e. rain and snow). I have two hypotheses about what rain could do to the end result of a game.

The first is that rain makes the grass wet, which makes the ball bounce less and makes running harder. This, I can imagine, should make scoring goals harder, and thus we should see fewer goals scored in matches where it rains. Also, if it rains during the match the players get wet, which of course is a burden that should influence the game.

My second hypothesis sort of follows from the first, and that is that rain should make draws more likely.

The obvious hindrance to testing the two hypotheses is lack of data. It turns out that getting good historical weather data for a given location is not that simple. The Norwegian Meteorological Institute provides free data from Norwegian weather stations, but (for now at least) I didn’t want to test the hypotheses on Norwegian football results. Instead, I wanted to test them on data from England. What I ended up doing was scraping data from English weather stations from WeatherOnline. That site provides precipitation data from British weather stations in 6-hour intervals, in a window around 14:00.

Luckily, WeatherOnline provided the coordinates of the weather stations, and I used these together with the coordinates I have compiled in my football stadiums data set to figure out which weather station was nearest to each stadium. Data from the weather station closest to the place where a match was played should hopefully serve as an adequate proxy for the conditions on the field.
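Finding the nearest station from two sets of coordinates could look something like the sketch below. It uses the haversine great-circle distance, which is my assumption for this example and not necessarily the distance calculation used for the analysis. The station names and coordinates are hypothetical.

```r
# Great-circle (haversine) distance in kilometers between points given in
# decimal degrees, using a mean Earth radius of 6371 km.
haversine_km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical coordinates, for illustration only.
stations <- data.frame(
  name = c("A", "B", "C"),
  lat = c(51.5, 53.4, 54.9),
  lon = c(-0.1, -2.3, -1.6)
)
stadium <- c(lat = 53.46, lon = -2.29)

# Distance from the stadium to every station, and the closest one.
d <- haversine_km(stadium[["lat"]], stadium[["lon"]], stations$lat, stations$lon)
nearest <- as.character(stations$name[which.min(d)])
```

Keeping the distance to the nearest station (min(d)) alongside the match data makes it possible to weight the matches by distance later.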

As part of the work on this analysis I also updated the stadium data with some additional stadiums that I needed for this project.

Unfortunately, weather data were not available for all match dates, but all in all I ended up with precipitation data for 4826 matches from the Championship and 2702 matches from the Premier League, going back to 2002.

How well can we expect the numbers from the weather stations to reflect the conditions at the stadiums where the matches are played? After I had coupled the precipitation and match data I made a histogram of the distances from the stadium to the weather station. It reveals that some of the weather stations can be quite far away, some more than 300 kilometers.

[Figure: histogram of the distances from the stadiums to the nearest weather station]

This of course is a problem. The closer the station is to where the match is played, the more accurate the data is going to be. The usual way to deal with data points that are less accurate than others is to weight them accordingly. That way they have less influence on the parameter estimation.

But how should we decide on how to weight the different matches? What we need is a way to relate the distance to accuracy. For this we need the precipitation levels at a specific location, and the precipitation at weather stations nearby. To do this, we can use the weather stations themselves, and see how well the precipitation at one station correlates with the precipitation at other stations.

I calculated the correlations between all pairs of weather stations, and plotted them against the distance between them:

[Figure: correlation between pairs of weather stations plotted against the distance between them, with a fitted curve in red]

Some of the weather stations are much farther away from each other than the farthest of the ones I have coupled to the matches. We see that there is a clear trend of diminishing correlation the farther away the stations are. Since the correlations are mostly positive (between 0 and 1), they can be used as weights.

The red line in the plot is an attempt to fit a function to the correlations that can be used to compute the weights for a given distance. I fitted (using least squares) the function

\( \lambda_0 e^{-\lambda d} \)

where d is the distance in kilometers, \(\lambda_0\) is the value when d is 0, and \(\lambda\) is the rate at which the function decreases. The estimated values of \(\lambda_0\) and \(\lambda\) that best describe the trend were found to be 0.75 and 0.0047, respectively. Judging from the line in the plot above, it reflects the trend quite well, although there is quite a lot of variability around it.
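A least squares fit of this kind can be done with nls in R. The sketch below is only an illustration: the data are simulated around the reported estimates so that the code can be run, and the names pairs, lambda0, lambda and distance_weight are made up for this example.

```r
set.seed(42)

# Synthetic distance/correlation pairs, generated only to make the sketch runnable.
pairs <- data.frame(d = runif(500, 0, 1000))
pairs$corr <- 0.75 * exp(-0.0047 * pairs$d) + rnorm(500, sd = 0.05)

# Non-linear least squares fit of lambda0 * exp(-lambda * d).
fit <- nls(corr ~ lambda0 * exp(-lambda * d),
           data = pairs,
           start = list(lambda0 = 0.5, lambda = 0.005))
est <- coef(fit)

# Turn a stadium-to-station distance (in km) into a weight using the fitted curve.
distance_weight <- function(dist_km) {
  est[["lambda0"]] * exp(-est[["lambda"]] * dist_km)
}
```

With the estimates reported above, a match with the weather station right at the stadium would get a weight of 0.75, and the weight would be roughly halved for a station about 150 kilometers away.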

To test the hypothesis of fewer overall goals scored I fitted a Poisson regression model of the total number of goals scored as response. As predictors I added an indicator for matches played in the Championship, and the amount of rain in millimeters.

Each millimeter of rain is associated with 0.16% more goals, which is not significantly different from 0% (p = 0.856).
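A Poisson regression like the one described above can be sketched with glm. The data below are simulated with no true rain effect, purely to make the code runnable, and all the column names are hypothetical. The sketch also assumes that the distance-based weights are passed as regression weights.

```r
set.seed(1)
n <- 1000

# Synthetic match data with no true rain effect, for illustration only.
matches <- data.frame(
  championship = rbinom(n, 1, 0.6),   # 1 if the match was in the Championship
  rain_mm = rexp(n, rate = 1),        # precipitation in millimeters
  station_km = runif(n, 0, 300)       # distance to the nearest weather station
)
matches$total_goals <- rpois(n, lambda = exp(1 - 0.05 * matches$championship))

# Assumed weighting: the fitted distance curve from the text.
matches$w <- 0.75 * exp(-0.0047 * matches$station_km)

fit <- glm(total_goals ~ championship + rain_mm,
           family = poisson, data = matches, weights = w)

# Percent change in the expected number of goals per millimeter of rain.
pct_per_mm <- (exp(coef(fit)[["rain_mm"]]) - 1) * 100
```

Since the Poisson model uses a log link, the coefficient has to be exponentiated to be read as a relative change in the expected number of goals.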

To test whether rain makes draws more likely, I used the same predictors as in the Poisson model in a logistic regression model. The odds ratio associated with each millimeter of rain was 0.952, not significantly different from 1 (p = 0.165).
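The logistic regression can be sketched in the same way. Again the data are simulated and the column names are hypothetical. I use the quasibinomial family here, which is my choice for this sketch: it gives the same point estimates as the binomial family, but does not warn about non-integer weights.

```r
set.seed(2)
n <- 1000

# Synthetic match data with no true rain effect, for illustration only.
matches <- data.frame(
  championship = rbinom(n, 1, 0.6),
  rain_mm = rexp(n, rate = 1),
  station_km = runif(n, 0, 300)
)
matches$draw <- rbinom(n, 1, 0.27)   # 1 if the match ended in a draw

# Assumed weighting: the fitted distance curve from the text.
matches$w <- 0.75 * exp(-0.0047 * matches$station_km)

fit <- glm(draw ~ championship + rain_mm,
           family = quasibinomial, data = matches, weights = w)

# Odds ratio for a draw per millimeter of rain.
odds_ratio <- exp(coef(fit)[["rain_mm"]])
```

An odds ratio below 1 means that more rain is associated with lower odds of a draw, which is the direction of the estimate reported above.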

To summarize: I found no evidence for either of my two hypotheses. Neither estimate was significantly different from the null hypothesis of no effect of rain on the number of goals or the probability of a draw. The point estimates were actually both in the opposite direction of what I had expected: rain was associated with more goals and fewer draws, but not more than we would expect to see if it was all due to chance.