Better prediction, not just for promoted teams

Ian posted an interesting question that had a lot to do with the post I posted last week:

I have implemented the model to make predictions with two different approaches. The first approach is the standard where I use all matches played in a league to predict a match between Team A and Team B. The second approach is to use just matches played by Team A and Team B to predict the outcome of when they both play each other.

Now would you say that the second approach should be more accurate? As surely the only results which matter for predicting the match between Team A and B is of those two teams?

My answer was that regression models use all the data to estimate the parameters, and that the parameter estimates for Team A and Team B probably will be more precise by including matches where neither team is playing. The intuition for this is that both teams play against a whole bunch of other teams during the season, and the more accurate parameter estimates we can get for these other teams, the more information are we going to get from the matches involving either Team A or Team B. One possible way of getting more accurate parameter estimates for all the other teams is to include data from more matches, if available. And at last, more precise parameter estimates should hopefully provide better predictions.

This is not exactly what I demonstrated in the last post. There I just demonstrated that more data, especially related to promoted teams, will give better predictions on average across the whole Premier League. I did not investigate exactly where these improved predictions occur. It could be that all that gain was just related to the improved parameter estimates of the promoted teams.

That is why, prompted by Ian’s comment, I took a closer look at the predictions. Using the model fitted with data from the Premier League and the Championship, with separate home field advantage for the two divisions, I decided to look at how well the predictions were for some Premier League Teams. Recall that this was the model that made the best predictions in the previous post. I decided to look at only the matches between Manchester United, Arsenal, Aston Villa, Chelsea, Liverpool, Everton and Tottenham since these teams have played in Premier League for a long time.

When only looking at these teams, and using Premier League data only, the RPS was 0.24462. When the Championship were included in the data, RPS were a bit smaller, 0.24436. So this means that including more data, not directly related to this group of teams, improved predictions within that group.

I also tried the model without separate home field advantage parameter for the two divisions, and the predictions got worse for this group of teams. This was not the case when looking at the predictions for all Premier League matches, were it got better on average. This demonstrates an important point that I did not mention in my reasoning above: More data is not necessarily a good thing if your model can’t properly handle it.

Better prediction of Premier League matches using data from other competitions

In most of my football posts on this blog I have used data from the English Premier league to fit statistical models and make predictions. Only occasionally have I looked at other leagues, but always in isolation. That is, I have never combined data from different leagues and competitions into the same model. Using a league by itself works mostly fine, but I have experienced some issues. Model fitting and prediction making often simply does not work at the beginning of the season. The reason for this has mostly to do with newly promoted teams.

If only data from Premier League is used to fit a model, then no data on the new teams is available at the beginning of the season. This makes predicting the outcome of the first matches of the new teams impossible. In subsequent matches the information available is also very limited compared to the other teams, for which we can rely on data from the previous seasons. This uncertainty in the new teams also propagates into the estimates and predictions for the other teams.

This problem can be remedied by using data from outside the Premier League to help estimate the parameters for the promoted teams. The most obvious place to look for data related to the promoted teams is in the Championship, where the teams played before they were promoted. The FA Cup, where teams from the Championship and Premier League are automatically qualified, should also be a good place to use data from.

To test how much the extra data helps make predictions in the Premier League, I did something similar as I did in my post on the Dixon-Coles time weighting scheme. I used the independent Poisson model to make predictions for all the Premier League matches from 1st of January 2007 to 15th of January 2015. The predictions were made using a model fitted only with data from previous matches (going back to august 2005), thus emulating a realistic real-time prediction scenario. I weighted the data using the Dixon-Coles approach, with \(\xi=0.0018\). This makes the scenario a bit unrealistic, since I estimated this parameter using the same Premier League matches I am going to predict here. I also experimented with using different home field advantage for each of the competitions.

To measure prediction quality I used the Ranked Probability Score (RPS), which goes from 0 to 1, with 0 being perfect prediction. RPS is calculated for each match, and the RPS I report here is the average RPS of all predictions made. Since this is over 3600 matches, I am going to report the RPS with quite a lot of decimal places.

Although the RPS goes from 0 to 1, using a RPS = 1 to mean worst possible prediction ability is unrealistic. To get a more realistic RPS to compare against I calculated the RPS using the probabilities of home, draw and away using the raw proportions of the outcome in my data. In statistical jargon this is often called the null model. The probabilities were 0.47, 0.25 and 0.28, respectively, and gave a RPS = 0.2249.

Using only Premier League data, skipping predictions for the first matches in a season involving newly promoted teams, gave a RPS of 0.19558.

Including data from the Championship in the model fitting, and assuming the home field advantage in both divisions were the same, gave a RPS of 0.19298. Adding a separate parameter for the home field advantage in the Championship gave an even better RPS of 0.19292.

Including data from the FA Cup (in addition to data from the Championship) were challenging. When data from the earliest round were included, the model fitting sometimes failed. I am not 100% sure of this, but I believe the reason for this is that some teams, or groups of teams, are mostly isolated from the rest of the teams. By that I mean that some group of teams have only played each other, but not any other team in the data. While this is not actually the case (it can not be) I nevertheless think the time weights makes this approximately true. Matches played a few years before the mathces that predictions are made for will have weights that are almost 0. It seems reasonable that this coupled with the incomplete design of the knockout format is where the trouble comes from.

Anyway, I got it to work by excluding matches played by a team not in the Championship or Premier League in the respective season. An additional parameter for home field advantage in the Cup were included in the model as well. Interestingly, this gave a somewhat poorer prediction ability that using additional data from the Championship only, with a RPS of 0.192972, but still better that using Premier League data only. With the same home overall field advantage for all the competitions, the prediction were unsurprisingly poorer with RPS = 0.1931.

I originally wanted to include data from Champions League and Europa League, as well as data from other European leagues, but the problems and results with the FA Cup made me dismiss the idea.

I am not sure why including the FA Cup didn’t give better predictions, but I have some theories. One is that a separate FA Cup home field advantage is unrealistic. Perhaps it would be better to assume that the home field advantage is the same as in the division the two opponents play in, if they play in the same division. If they played in different divisions, perhaps an overall average home field advantage could be used instead.

Another theory has to do with the time weighting scheme. The time weighting parameter I used was found by using data from the Premier League only. Since this gives uncertain estimates for the newly promoted teams, it will perhaps give more recent matches more weight to try to compensate. With more informative data from the previous season, this should probably be more influential. Perhaps the time weighting could be further refined with different weighting parameters for each division.

Rain does not influence football results

I have often seen the weather mentioned as something that could influence football results, but I have yet to see anyone looking more into it. There are various ways in which the game could be influenced by the weather, and here I am going to look into the effects of precipitation (i.e. rain and snow). I have two hypotheses about what rain could do to the end result of a game.

The first is that rain makes the grass wet, which makes the the ball bounce less and makes running harder. This, I can imagine, should give make scoring goals harder, and thus we should see fewer goals scored in matches where it rains. Also, if it rains during the match the players also get wet, which of course is a burden that should influence the game.

My second hypothesis sort of follows from the first, and that is that rain should make draws more likely.

The obvious hindrance to test the two hypotheses is lack of data. It turns out that getting good historical weather data for a given location is not that simple. The Norwegian Meteorological Institute provides free data from Norwegian weather stations, but (for now at least) I didn’t want to test the hypotheses on Norwegian football results. Instead, I wanted to test it on data from England. What I ended up doing was scraping data from English weather stations from WeatherOnline. That site provides precipitation data from British weather stations in 6-hour intervals, in a window around 1400 o’clock.

Luckily, WeatherOnline provided the coordinates to the weather stations, and I used this together with the coordinates I have compiled in my football stadiums data set to figure out which weather station were nearest. Data from the weather station closest to the place where a match was played should hopefully serve as an adequate proxy for the conditions on the field.

As part of the work on this analysis I also updated the stadium data with some additional stadiums that I needed for this project.

Unfortunately, weather data from all match dates were not available, but all in all i ended up with precipitation data for 4826 matches from the Championship and 2702 matches from Premier League, going back to 2002.

How well can we expect the numbers from the weather stations to reflect the conditions on the stadiums where the mathces are played? After I had coupled the precipitation and match data I made a histogram of the distances from the stadium to the weather station. It reveals that some of the weather stations can be quite far away, some more than 300 kilometers.

distance_histogram

This of course is a problem. The closer the station is to where the match is played, the more accurate is the data going to be. The usual way to deal with data points that are less accurate than others, is to weight them accordingly. That way they have less influence on the parameter estimation.

But how should we decide on how to weight the different matches? What we need is a way to relate the distance to accuracy. For this we need the precipitation levels at a specific location, and the precipitation at weather stations nearby. To do this, we can use the weather station themselves, and see how well the weather stations correlate with other weather stations.

I calculated the correlations between all pairs of weather stations, and plotted them against the distance between them:

weatherstation_correlation

Some of the weather stations are much farther away from each other than the farthest of the ones I have coupled to the matches. We see that there is a clear trend of diminishing correlation the farther away the stations are. Since the correlations are mostly positive (between 0 and 1), they can be used as weights.

The red line in the plot is an attempt to fit a function to the correlations that can be used to compute the weights for a given distance. I fitted (using least squares) the function

\( \lambda_0 e^{-\lambda d} \)

where d is the distance in kilometers, \(\lambda_0\) is the value when d is 0, and \(\lambda\) is the rate in which the function decreases. The estimated values of \(\lambda_0\) and \(\lambda\) that best describes the trend were found to be 0.75 and 0.0047, respectively. Judging from the line in the plot above, it reflects the trend quite well, although there are quite some variability around it.

To test the hypothesis of fewer overall goals scored I fitted a Poisson regression model of the total number of goals scored as response. As predictors I added an indicator for matches played in the Championship, and the amount of rain in millimeters.

Each millimeter rain is associated with 0.16% more goals, which is insignificantly different from 0% (p = 0.856).

To test whether rain makes draws more likely, I used the same predictors as in the Poisson model in a logistic regression model. The odds ratio associated with each millimeter rain were 0.952, insignificantly different from 1 (p=0.165).

To summarize: I found no evidence for any of my two hypotheses. Both were insignificantly different from the null hypotheses of no effect of rain on the number of goals and the probability of draws. The point estimates of the effects were both actually in the opposite direction of what I had thought. Rain was associated with more goals and fewer draws, but not more than we would expect to see if it all was due to chance.