Back in August I downloaded all album reviews from pitchfork.com, a hip music website mainly dealing with genres such as rock, electronica, experimental music and jazz. In addition to a written review, each album is given a score by the reviewer from 0.0 to 10.0, to one decimal place. In other words, a reviewed album is graded on a 101-point scale. But does it make sense to have such a fine-grained grading scale? Is there really any substantial difference between two records whose scores differ by 0.1? Listening to music is a qualitative experience, and no matter how professional the reviewer is, a record review is always a subjective analysis influenced by the reviewer's taste, mood and preconceptions. Quantifying musical quality on a single scale is therefore a hard, if not impossible, feat.

Still, new music releases are routinely reviewed and graded in the media, but I don't know of anyone else with a grading system as precise as Pitchfork's. Usually there is a 0 to 5 or 0 to 10 scale, perhaps in steps of one half. Sites like Metacritic and Rotten Tomatoes (for film reviews) have scores of similar precision, but both are based on reviews collected from many sources. In the case of Pitchfork, there is usually just one reviewer (with a few reviews credited to two or more people). As far as I know, Pitchfork has no published guidelines on how to interpret the score or what criteria are used to set it, so it may just be up to the individual reviewer to decide what number to put down.
Anyway, I extracted the information from the reviews I downloaded and put it into a .csv file. This gave me data on 13330 reviews, which I then loaded into R for some plotting with ggplot2. Let's look at some graphs to see how the scores are distributed and try to find something interesting. First we have a regular histogram:
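The plot itself came out of ggplot2. If you would rather rebuild a rough version of it without R, here is a minimal Python sketch using only the standard library. It is not the code behind the plots in this post: the file name `pitchfork_reviews.csv` and the column name `score` are assumptions made for illustration, and the handful of scores in the demo at the bottom are made up.

```python
import csv
from collections import Counter


def load_scores(path, column="score"):
    """Read one score per row from a CSV file (the file layout is assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [float(row[column]) for row in csv.DictReader(f)]


def histogram(scores, bin_width=1.0):
    """Count scores per bin on the 0.0-10.0 scale.

    A perfect 10.0 is placed in the top bin instead of a bin of its own.
    """
    tenths_per_bin = round(bin_width * 10)
    n_bins = round(10 / bin_width)
    counts = Counter()
    for s in scores:
        # Work in whole tenths to avoid floating-point trouble at bin edges.
        tenths = round(s * 10)
        counts[min(tenths // tenths_per_bin, n_bins - 1)] += 1
    return [counts[i] for i in range(n_bins)]


def print_histogram(bins, bin_width=1.0):
    """Print a crude text histogram, one '#' per review."""
    for i, count in enumerate(bins):
        low = i * bin_width
        print(f"{low:4.1f}-{low + bin_width:4.1f} | {'#' * count}")


if __name__ == "__main__":
    # Made-up sample scores, NOT the real data set.
    sample = [0.0, 6.9, 7.0, 7.5, 7.8, 10.0]
    print_histogram(histogram(sample))
    # With a real file (hypothetical name):
    # print_histogram(histogram(load_scores("pitchfork_reviews.csv")))
```

With the full 13330 reviews you would want to scale the bars down, for example one `#` per 50 reviews, but the binning logic stays the same.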
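When I first saw it, I did not expect the distribution to sit so far to the right. I expected the peak to be around maybe 5 or 6. Instead the bulk of the scores is high up the scale with a long tail toward the low scores, so strictly speaking the distribution is left-skewed. I calculated the mean and median, which are 6.96 and 7.2, respectively; a mean below the median fits that shape. Let's look at a bar plot, where each bar corresponds to a specific score.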
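The mean, the median and the per-score counts behind a bar plot like this take only a few lines to compute. Again, this is a Python sketch under assumptions and not the R code used for this post; the function names `summarize` and `top_scores` are made up here, and the sample scores in the demo are invented.

```python
from collections import Counter
from statistics import mean, median


def summarize(scores):
    """Return the mean, the median and the number of reviews per exact score."""
    # Tally in whole tenths so that 7.5 and 7.500000001 land in the same bar.
    per_tenth = Counter(round(s * 10) for s in scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "per_score": {t / 10: n for t, n in sorted(per_tenth.items())},
    }


def top_scores(scores, n=4):
    """Return the n most frequently given scores as (score, count) pairs."""
    per_tenth = Counter(round(s * 10) for s in scores)
    return [(t / 10, count) for t, count in per_tenth.most_common(n)]


if __name__ == "__main__":
    # Made-up sample scores, NOT the real data set.
    sample = [6.0, 7.0, 7.0, 7.5, 8.0]
    stats = summarize(sample)
    print("mean:", stats["mean"], "median:", stats["median"])
    for score, count in stats["per_score"].items():
        print(f"{score:4.1f} | {'#' * count}")
```

Counting in whole tenths rather than on the raw floats is a deliberate choice: it gives one bar for each of the 101 possible scores and avoids two bars that differ only by rounding noise.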
Now this is interesting. We can clearly see four spikes around the top: some scores are clearly more popular than others. ggplot2 clutters the ticks on the x-axis, so it is difficult to see exactly which scores they are (this seems to be a regular problem with ggplot2; even the examples in the official documentation suffer from it). Anyway, I found out that the most popular scores are 7.5 (620 records), 7.0 (614 records), 7.8 (611 records) and 8.0 (594 records). Together, 18.3% of the reviewed records have been given one of these four scores. Three of the four are round or ‘half round’ numbers, with 7.8 as the odd one out, so there seems to be some sort of bias towards such numbers. I guess we humans have some sort of subconscious preference for them. If we now look closer at the right end of the plot, we see the same phenomenon:
The 10.0 ‘perfect’ score is used far more often than the scores just below it. So it appears to be harder to make a ‘near perfect’ album than a perfect one, which is kind of strange. If I were to draw a conclusion from these charts, it would be that a 101-point scale is too fine-grained to be useful for distinguishing between albums whose scores differ only slightly. I also wonder if this phenomenon can be found in other situations where people are asked to grade something on a scale of similar precision.
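As a footnote for anyone who wants to check the numbers: the 18.3% figure follows directly from the four counts and the 13330 reviews, and a spike can be quantified by comparing a score's count with the average of its immediate neighbours. The sketch below is Python, not the R used for this post. The `spike_ratio` helper is my own illustrative measure and not an established statistic, and the counts near 10.0 in the demo are made up.

```python
def share_of_total(counts, total):
    """Fraction of all reviews covered by the given counts."""
    return sum(counts) / total


def spike_ratio(per_tenth, score):
    """Count at `score` divided by the mean count of its 0.1 neighbours.

    `per_tenth` maps scores in whole tenths (75 means 7.5) to review counts.
    Returns None when no neighbour has any reviews to compare against.
    """
    tenths = round(score * 10)
    neighbours = [per_tenth[t] for t in (tenths - 1, tenths + 1) if t in per_tenth]
    if not neighbours or sum(neighbours) == 0:
        return None
    return per_tenth.get(tenths, 0) / (sum(neighbours) / len(neighbours))


if __name__ == "__main__":
    # The four counts and the total are the ones reported in this post.
    share = share_of_total([620, 614, 611, 594], 13330)
    print(f"top four scores: {share:.1%} of all reviews")

    # Made-up counts near the top of the scale, NOT the real data.
    made_up = {98: 2, 99: 1, 100: 6}
    print("spike ratio at 10.0:", spike_ratio(made_up, 10.0))
```

A ratio well above 1 means the score is given much more often than the scores right next to it, which is exactly what the bars at 7.0, 7.5, 8.0 and 10.0 look like in the plots.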