The comparison graph

In my last post on Elo ratings I used a graph to illustrate why it is hard to compare the strengths of teams that play in different leagues.

compgraph2

This is based on data from two half-seasons of English Premier League and the Championship. Each team is represented by a node or vertex, which is drawn as a circle. Each edge between the nodes is drawn as line between them and indicates that the two teams have against played each other at least once. I didn’t add team names to the graph, but the orange nodes are the teams that played in the Premier League in both seasons, while the blue nodes are teams that played in the Championship or got promoted or relegated. The graph shows both why comparisons between teams in different leagues can be difficult, and why including data from the Championship can improve the prediction of Premier league matches. In this post I will go more into details about what I call the comparison graph and how it can be used.

We can easily recognize a few patterns in the graph above. The most obvious one is the cluster of several teams where everyone (or nearly everyone) has played against each other. In a fully played season every team has played each other and the graph is said to be fully connected. If the graph is fully connected we should be able to have a good idea about the relative strengths between every team. Here is an example of a fully connected graph representing a fully played season with 10 teams.

graph_example

Another important pattern is the lack of edges between two teams. If two teams hasn’t played each other, but both has played a third team, they are indirectly comparable. Here we see that both Team B and Team C has played team A, but they have not played each other.

graph_example1

If you are going to predict the outcome of a match between Team B and Team C this graph shows you that you should be careful. The information we have of the relative strengths between them is only indirect. This can in some situations where you have very limited data be almost the same as having no data at all. Suppose both Team C and Team B won huge victories over Team A. This would perhaps indicate that Team A is crap, but we would have very little indication which of Team B and Team C is better. If on the other hand Team A beat Team C, and Team B beat Team A, we would have had a strict ordering, so it does not automatically mean that we can’t make anything out of the data.

Another important pattern in a graph is whether there are any disconnected subgraphs. Here we have two or more groups of teams that has played only against other teams within their own group, but not against the teams in the other groups. In the first few rounds of a season we can see patterns like this.

There are a lot of interesting things you can do with the comparison graph, but that will make for a future post.

The underdispersed Conway-Maxwell Poisson distribution and goal differences

I have unfortunately not had the time to look more closely at the performance for the underdispersed count distributions that I in my last post found to be useful for predicting football results. Here I am taking a quick look into how the Conway-Maxwell distribution (COM) influences the predicted goal differences compared to the Poisson distribution.

Using data from the 2010-11 Premier League season I fitted the both the Poisson model and the COM model. The estimated dispersion parameter for the COM model indicated that there was less variability in the actual goals scored than implied by the Poisson distrubtion. I used the code I posted here to compute the probability distributions for the goal-differences for five matches in the season.

Let’s first look at the goal difference distribution for Arsenal playing at home against Manchester City. Both teams were in the top of the final table, and the actual result for this game in the 2010-11 season was 0-0. Comparing the distributions from the Poisson and COM models we see that they are pretty much identical.

com_arsman2011

For Aston Villa vs. Sunderland, which placed 9th and 10th on the table in 2010-11, we also see that there is not much difference between the two models. Although there is a slight increase in the probability for the actual result in that game, I don’t think it is of much importance.

com_astsun2011

Let’s compare the models using two teams from the bottom of the table, Wigan vs. Wolverhampton. Again, not much difference. Also note that there is basically no change in the probability for the actual result.

com_wigwol2011

OK, so far the comparisons have been based on teams of similar strengths. But take Blackburn (15th on the table) vs. Liverpool (6th). Now we see that the COM model and Poisson model differ a bit. Here the COM model does a worse prediction of the actual result compared to the Poisson model. But only considering the league positions, the skewing of the distribution in favor of Liverpool in this case may not be totally unreasonable.

com_blaliv2011

The last plot is compares the distributions between Chelsea (2nd on the table) vs. Birmingham (18th). Here we clearly see that there is a substantial difference in the prediction between the two models. The COM model favors Chelsea much more than the Poisson model does, which in this case give a much higher probability for the correct result.

com_chebir2011

From just looking at these few plots, I think we can conclude that the (underdispersed) COM model differs from the Poisson model where there is a greater difference in strength between the two sides.

Statistics I don’t believe in

Oslo is one of the candidate cities for hosting the 2022 winter Olympics, and of course there is a lot of discussion going on whether it costs too much etc. Opinion polls are conducted regularly, and the current status is that the majority of people don’t want Oslo to host the games. A couple of weeks ago the office dealing with the application put up the data from all polls conducted this year. What is weird is that they choose to illustrate the article detailing the data with the results from the question Do you wish that Norway should continue to participate in the winter Olympics and Paralympics? where only 73% of the respondents answered yes and 19% answered no.

First of all I thought the question was silly. Of course no one want Norway not to participate, but then the results show that almost one in five don’t want Norway to participate. At first i thought that it might be just a fluke, but then I downloaded the data and saw that this question had been asked on three occasions, and the results were pretty similar:

oslo2022

Even after seeing the consistent trend, I still don’t believe the numbers are correct. I think this shows one of the potential pitfalls in opinion polls. Since it is really hard to get the nuances of peoples beliefs in questions with just a few possible answers, it may be tempting to ask more detailed questions. The danger is that the questions become more confusing, or they even become irrelevant. I think both these things are can explain what is going on here.

A short summer update

Hi everybody, I haven’t had written much lately as I have been busy with other things, so this is just a short update on what is going on with my life.

This May I handed in my M.Sc. thesis in bioinformatics at the Norwegian University of Life Sciences. In it I investigated the prevalence of cis-regulatory elements (CREs) in plant promoters across 25 species. CREs are short DNA sites (typically around 12 base pairs long) that plays a role in gene regulation. They typically work by having high binding affinity for proteins that somehow regulates RNA transcription. I used genomic data (sequences and annotations) from the PLAZA server. I then predicted CREs by using motifs from the JASPAR, AtcisDB and PLACE databases. I then investigated whether predicted CREs that was found to be conserved in many gene families were more co-expressed than CREs that were not conserved and conserved only in a few families. I found that in general, this was not the case.

Right after my thesis was done I started my new job as a statistician in the Department of Health Services Research at Akershus University Hospital right outside Oslo. I am very exited about this as I get to work on a lot of interesting and meaningful research projects with a bunch of really great people.

I also spent a week backpacking in Belgium this summer with my girlfriend. The atmosphere was great in Brussels when Belgium beat the US in the World Cup, the weather was nice when we hiked in the Ardennes near the French border and everything was excellent when we rode our bikes in Bruges. Antwerp had the best food and the most beautiful train station I have ever been to. The most bizarre thing we experienced was perhaps the Bruggian love for embroidered cats dressed as humans:

catsinbrugge

Until next time!