Ian posted an interesting question that is closely related to my post from last week:
I have implemented the model to make predictions with two different approaches. The first approach is the standard where I use all matches played in a league to predict a match between Team A and Team B. The second approach is to use just matches played by Team A and Team B to predict the outcome of when they both play each other.
Now would you say that the second approach should be more accurate? As surely the only results which matter for predicting the match between Team A and B is of those two teams?
My answer was that regression models use all the data to estimate the parameters, and that the parameter estimates for Team A and Team B will probably be more precise when matches where neither team is playing are included. The intuition for this is that both teams play against a whole bunch of other teams during the season, and the more accurate the parameter estimates we can get for these other teams, the more information we are going to get from the matches involving either Team A or Team B. One possible way of getting more accurate parameter estimates for all the other teams is to include data from more matches, if available. Finally, more precise parameter estimates should hopefully provide better predictions.
This is not exactly what I demonstrated in the last post. There I just demonstrated that more data, especially related to promoted teams, will give better predictions on average across the whole Premier League. I did not investigate exactly where these improved predictions occur. It could be that all that gain was just related to the improved parameter estimates of the promoted teams.
That is why, prompted by Ian’s comment, I took a closer look at the predictions. Using the model fitted with data from the Premier League and the Championship, with separate home field advantage for the two divisions, I decided to look at how good the predictions were for some Premier League teams. Recall that this was the model that made the best predictions in the previous post. I decided to look only at the matches between Manchester United, Arsenal, Aston Villa, Chelsea, Liverpool, Everton and Tottenham, since these teams have played in the Premier League for a long time.
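To make the selection concrete, here is a minimal sketch of picking out only the matches played between teams in this group. The match representation (a list of dicts with "home" and "away" keys) and the function name are assumptions made for illustration; they are not taken from the code used for the actual analysis.

```python
# The seven long-standing Premier League teams named in the post.
GROUP = {
    "Manchester United", "Arsenal", "Aston Villa", "Chelsea",
    "Liverpool", "Everton", "Tottenham",
}


def matches_within_group(matches, group=GROUP):
    """Keep only matches where BOTH the home and the away team are in the group.

    `matches` is assumed to be a list of dicts with "home" and "away" keys
    (a hypothetical layout chosen for this sketch).
    """
    return [m for m in matches if m["home"] in group and m["away"] in group]


# Made-up fixture list, only to show the filter in action.
example_matches = [
    {"home": "Arsenal", "away": "Chelsea"},
    {"home": "Arsenal", "away": "Stoke"},
    {"home": "Fulham", "away": "Wigan"},
    {"home": "Everton", "away": "Liverpool"},
]
selected = matches_within_group(example_matches)
```

Note that the filter is applied only when evaluating the predictions. The model itself is still fitted on all the matches.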
When looking only at matches between these teams, and using Premier League data only, the RPS was 0.24462. When the Championship was included in the data, the RPS was a bit smaller, 0.24436 (a lower RPS means better predictions). So this means that including more data, not directly related to this group of teams, improved predictions within that group.
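For readers unfamiliar with the RPS (ranked probability score), here is a minimal sketch of how it can be computed for a match with ordered outcomes (home win, draw, away win) and then averaged over a set of matches. The function names and the input layout are assumptions made for illustration, and the numbers in the example are made up; this is not the code that produced the figures above.

```python
def rps(probs, outcome_index):
    """Ranked probability score for a single match; lower is better.

    probs         -- predicted probabilities for the ordered outcomes,
                     e.g. [P(home win), P(draw), P(away win)]
    outcome_index -- index of the outcome that actually happened
    """
    r = len(probs)
    cum_pred = 0.0  # cumulative predicted probability
    cum_obs = 0.0   # cumulative observed outcome (0 before the result, 1 from it on)
    total = 0.0
    # The last cumulative term is always 1 - 1 = 0, so it is skipped.
    for i in range(r - 1):
        cum_pred += probs[i]
        cum_obs += 1.0 if i == outcome_index else 0.0
        total += (cum_pred - cum_obs) ** 2
    return total / (r - 1)


def mean_rps(predictions, outcomes):
    """Average RPS over several matches."""
    scores = [rps(p, o) for p, o in zip(predictions, outcomes)]
    return sum(scores) / len(scores)


# Made-up example: two matches, the first a home win, the second a draw.
example_score = mean_rps([[0.5, 0.3, 0.2], [1 / 3, 1 / 3, 1 / 3]], [0, 1])
```

Because the score works on cumulative probabilities, a prediction that puts its weight on an outcome close to the actual result (say, a draw when the home team won) is penalized less than one that puts it on the opposite result.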
I also tried the model without separate home field advantage parameters for the two divisions, and the predictions got worse for this group of teams. This was not the case when looking at the predictions for all Premier League matches, where they got better on average. This demonstrates an important point that I did not mention in my reasoning above: more data is not necessarily a good thing if your model can’t properly handle it.