The probabilities implied by bookmaker odds: Introducing the ‘implied’ package

My package for converting bookmaker odds into probabilities is now available from CRAN. The package contains several different conversion algorithms, which are all accessible via the implied_probabilities() function. I have written an introduction on how you can use the package here, together with a description of all the methods and with references to papers. But I also want to give some background to some of the methods here on the blog.

In statistics, odds are usually defined as the ratio p/(1-p), while in the betting world several different odds formats exist. As usual, Wikipedia has a nice overview of the different formats. In the implied package, only inverse probability odds, that is 1/p, are allowed as inputs. In betting these are called decimal odds.
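Since only decimal odds are accepted, odds in other formats have to be converted first. Here is a minimal sketch in Python of the two most common conversions. The function names are my own for illustration and are not part of the implied package.

```python
from fractions import Fraction


def fractional_to_decimal(numerator, denominator):
    """Fractional odds such as 5/2 pay 5 units of profit per 2 units staked.

    Decimal odds include the returned stake, hence the + 1.
    """
    return float(Fraction(numerator, denominator) + 1)


def american_to_decimal(moneyline):
    """American (moneyline) odds.

    A positive value is the profit on a stake of 100.
    A negative value is the stake needed to make a profit of 100.
    """
    if moneyline == 0:
        raise ValueError("moneyline odds cannot be zero")
    if moneyline > 0:
        return 1 + moneyline / 100
    return 1 + 100 / abs(moneyline)


print(fractional_to_decimal(5, 2))  # 3.5
print(american_to_decimal(250))     # 3.5
print(american_to_decimal(-200))    # 1.5
```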

Now you might think that converting decimal odds to probabilities should be easy: you can just use the definition above and take the inverse of the odds to recover the probability. But it is not that simple, since in practice this formula gives you improper probabilities. They will not sum to 1, as they should, but to something slightly larger. This excess gives the bookmakers an edge, so the inverse odds (which aren’t real probabilities) cannot be considered fair, and different methods for correcting this exist.
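To make the problem concrete, here is a small Python sketch with made-up odds for a home win, draw and away win. It shows the inverse odds summing to more than 1, and then the simplest possible correction: dividing each inverse odds by their sum. This is only an illustration of the idea, not the package's code.

```python
def inverse_odds(odds):
    """Naive 'probabilities': just 1 / decimal odds."""
    return [1 / o for o in odds]


def normalise(odds):
    """Simplest correction: rescale the inverse odds so they sum to 1."""
    raw = inverse_odds(odds)
    total = sum(raw)
    return [r / total for r in raw]


# Made-up decimal odds for home win, draw, away win.
odds = [1.9, 3.5, 4.2]

raw = inverse_odds(odds)
excess = sum(raw) - 1      # the bookmaker's margin (overround)
fair = normalise(odds)

print(f"sum of inverse odds: {sum(raw):.4f}")
print(f"margin: {excess:.4f}")
print("normalised:", [round(p, 4) for p in fair])
```

Rescaling treats all outcomes the same way, which is exactly the assumption the other methods relax.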

Some methods use different types of regression modelling combined with historical data to estimate the biases in the different outcomes. This is for example the case in the paper On determining probability forecasts from betting odds by Erik Štrumbelj. The implied package does not include these kinds of methods. The reason I wanted to mention this paper is that it was where I first read about Shin’s method.

All the methods in the package are what I call one-shot methods. The conversion of a set of odds for a game relies only on the odds themselves, and not on any other data. This is a deliberate choice: I didn’t want to make a modelling package, since that would be much more complicated.

Many of the methods in the package are described in the Wisdom of the Crowd document by Joseph Buchdahl, and in a review paper by Clarke et al. (Adjusting Bookmaker’s Odds to Allow for Overround).

Many of the methods in the package can be described as ad hoc methods. They basically use a simple mathematical formula that relates the true underlying probabilities to the improper probabilities given by the bookmaker’s odds. This formula is then solved for the true probabilities, so that they are proper (sum to 1) while still reproducing the improper bookmaker probabilities.
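As an illustration of what such an ad hoc method looks like, here is a minimal Python sketch of a power-type relation: it assumes each inverse odds equals the true probability raised to the power 1/k, so the true probabilities are the inverse odds raised to the power k, with k chosen so that they sum to 1. The function name and the bisection solver are my own choices for this sketch, and it is not the package's implementation.

```python
def power_method(odds, tol=1e-12):
    """Find k such that sum((1/odds)**k) == 1 and return (probabilities, k)."""
    if any(o <= 1 for o in odds):
        raise ValueError("decimal odds must be greater than 1")
    raw = [1 / o for o in odds]
    if sum(raw) <= 1:
        raise ValueError("inverse odds must sum to more than 1")

    def excess(k):
        # Decreasing in k, because every inverse odds is below 1.
        return sum(r ** k for r in raw) - 1

    # Bracket the root: excess(1) > 0, so grow the upper end until it is negative.
    lo, hi = 1.0, 2.0
    while excess(hi) > 0:
        hi *= 2

    # Plain bisection.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid

    k = (lo + hi) / 2
    return [r ** k for r in raw], k


probs, k = power_method([1.9, 3.5, 4.2])
print([round(p, 4) for p in probs], round(k, 4))
```

Unlike plain rescaling, the exponent shrinks small inverse odds proportionally more than large ones, so longshots lose relatively more than favourites.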

A few other methods in the package are more theory based, like Shin’s method, and I find these methods really interesting. Shin’s method imagines that there are two types of bettors. The first type is the typical bettor, and the sum of bets by this type follows the “wisdom of the crowd” pattern, which should reflect the true uncertainty of the outcome given the publicly available information. Then there is a second type of bettor, who has inside information and always bets on the winning outcome. However, the bookmaker doesn’t know which type each individual bettor is, and only observes the mixture of the two types. Here is the interesting part: By assuming the bookmakers know that there are two types of bettors, and that the bookmakers seek to maximize their profits, Shin was able to derive some complicated formulas that relate the true underlying “wisdom of the crowd” probabilities to the bookmaker’s odds. These formulas can be used in the same way as the ad hoc methods to find the underlying probabilities.
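To give a feel for how this works in practice, here is a minimal Python sketch. It assumes the closed-form relation commonly cited in the literature, where z is the proportion of insider bettors, r_i is the inverse odds for outcome i and R is the sum of the inverse odds: p_i = (sqrt(z^2 + 4(1 - z) r_i^2 / R) - z) / (2(1 - z)). The value of z is found numerically so that the probabilities sum to 1. The function name and the bisection solver are my own choices, and this is not the package's implementation.

```python
from math import sqrt


def shin_probabilities(odds, tol=1e-12):
    """Return (probabilities, z) under the assumed Shin relation."""
    if any(o <= 1 for o in odds):
        raise ValueError("decimal odds must be greater than 1")
    raw = [1 / o for o in odds]
    total = sum(raw)
    if total <= 1:
        raise ValueError("inverse odds must sum to more than 1")

    def probs(z):
        return [
            (sqrt(z * z + 4 * (1 - z) * r * r / total) - z) / (2 * (1 - z))
            for r in raw
        ]

    def excess(z):
        return sum(probs(z)) - 1

    # At z = 0 the probabilities sum to sqrt(total) > 1.
    # Move the upper end towards 1 until the sum drops below 1.
    lo, hi = 0.0, 0.5
    for _ in range(50):
        if excess(hi) < 0:
            break
        hi = (hi + 1) / 2
    else:
        raise RuntimeError("could not bracket z")

    # Plain bisection on z.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid

    z = (lo + hi) / 2
    return probs(z), z


probs, z = shin_probabilities([1.9, 3.5, 4.2])
print([round(p, 4) for p in probs], round(z, 4))
```

For these made-up odds the favourite ends up with a slightly higher probability than under plain rescaling, because more of the margin is taken from the longshots.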

A natural question is which method gives the most realistic probabilities. There is no definite answer to this, and different methods will be best in different markets and settings. You need to figure this out for yourself.

I am currently working on some new methods inspired by Shin’s framework, which I hope to write about later. Shin’s work was mostly done in the context of horse racing, where it is realistic that some bettors have inside information. I hope to develop a method that is more relevant for football.

23 thoughts on “The probabilities implied by bookmaker odds: Introducing the ‘implied’ package”

  1. really nice, it’s the one I have been looking for lately. Building logistic regression model in r, and was wondering why sometimes after adding up prob I get value >1. So will use it for sure

  2. Thank you ! Really good stuff
    Let me ask you a question please. Do you have an idea (or maybe an URL source) of how we can predict next match goals in football using decision trees or random forest ?

    • The point with these methods is to NOT use regression modelling, but to only use the information inherent in the odds for a single match. The paper I linked to also seems to support that these methods might be better than regression models.

  3. Hi, really nice job! I’m trying to perform some analysis in R using the implied package. It seems it returns an error for some combinations of odds (maybe it does not converge to the solution!).

    Particularly, it returns:

    Error in if (any(problematic)) { :
    missing value where TRUE/FALSE needed

    It happens probably because some model condition about the odds is not fulfilled and the code returns missing data.

    Do you know which conditions about the odds have to be satisfied to run the implied_probabilities function?

    • Sorry, I’ll try to be more precise. The method used is “shin”, and besides the error message it returns a warning message too.

      The message is:
      In implied_probabilities(odds, method="shin"):
      Could not find z: Did not converge in 12 instances. Some results may be unreliable. See the “problematic” vector in the output.

      The problem is that it does not return the output data frame, so I cannot check anything!

      Anyway, thanks for the package! It is really really useful!

      • A possible workaround to get the non-problematic probabilities and identify the problematic ones is to write a loop over all sets of odds, and feed them one by one to the implied_probabilities() function. I am also working on fixing the bug so that it at least raises a warning and returns the OK results.
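The loop workaround follows a general pattern: convert each set of odds separately, catch the failures, and keep going. Here is a minimal Python sketch of that pattern. The convert_each() helper and the stand-in normalise() converter are hypothetical and only there to make the example self-contained. In R, the equivalent would be to wrap each call to implied_probabilities() in tryCatch().

```python
def convert_each(odds_rows, converter):
    """Apply converter to each row of odds, collecting failures instead of stopping."""
    results, failed = [], []
    for i, row in enumerate(odds_rows):
        try:
            results.append(converter(row))
        except (ValueError, ArithmeticError) as err:
            results.append(None)           # placeholder keeps the row order intact
            failed.append((i, str(err)))   # remember which row failed and why
    return results, failed


def normalise(odds):
    """Stand-in converter (hypothetical): plain rescaling of the inverse odds."""
    if any(o <= 1 for o in odds):
        raise ValueError("decimal odds must be greater than 1")
    raw = [1 / o for o in odds]
    total = sum(raw)
    return [r / total for r in raw]


# Made-up data; the second row is deliberately invalid.
rows = [[1.9, 3.5, 4.2], [0.9, 3.0, 3.0], [2.5, 3.2, 2.9]]
results, failed = convert_each(rows, normalise)

print(failed)
```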

        • Hello, I get the error “f() values at end points not of opposite sign” with odds 1.19, 7.0, 14.0 and method='jsd'.

    • If you have never used R before it might be a bit tricky to access the functionality of this package. I suggest you find a tutorial on R, and learn to read in and manipulate data, install packages and so on. There are countless tutorials out there, just google it.

      • Ok. Thanks. Do I need to type in all teams and odds manually? What I want to say is: that is so much work. How are the results? Some profit?

        • You don’t have to provide the team names, only the odds. But check out the link to the Wisdom of the Crowd webpage. Odds from numerous providers are posted regularly (except for now that sports are cancelled) as easy-to-use CSV and Excel files, and I think there’s even an Excel sheet with some similar calculators.

  4. Thank you for the package.

    “Shin’s method imagine that there are two types of bettors”.

    Perhaps this was true several years ago, when there was extremely little information… Today, I think, this is the wrong direction. Now bookmakers already know more than individual players. Or at least, bookmakers have a lot of statistical and other information that allows them to confuse us by setting the wrong odds.

    Even logically: if bookmakers ALWAYS set fair odds, they would quickly go bankrupt. For me, odds are misinformation.

    Why do you focus so much on odds?

    • Shin’s model was developed in the early 90s for horse race betting, so yes, it might not be the best method for football odds.

  5. You probably won’t see this since it’s an old post, but I have a question: I’m still a newbie in R, but I’m trying to apply the things I learn to a hobby project of mine: resimulating the history of a particular league, season by season. Using the Premier League as an example, I would go back to 1889 and use the real data of matches and goals scored to calculate expected goals, then randomly generate a Poisson-based number for each team to get match results, and then build a table.
    I used to do this in Excel, but your goalmodel package is doing wonders for me right now. My question is about international football, like World Cups. Is there a way to do something similar? Goalmodel is working just fine, but it requires matches and scores to calculate the attack/defense strength and then the scoreline probabilities. Is there a way to apply this to international football? The teams don’t play each other that much, and all I have back in 1930 are Elo ratings. There’s a package called “Elo”, but it only offers the outcome probability, not the result.

    Thanks in advance and sorry for the book I just wrote.

    • Yes you would need the actual scores for the goalmodel package to be useful. I don’t know any easy to use sources for that kind of data, so you would probably need to scrape the data yourself.

      It is a problem that international games are “sparse”: most countries don’t play each other, so most comparisons between teams rely on indirect data. I explored this a bit in this old post. As long as the graph is connected, the goalmodel package should work. I think it gives a warning or error if the data contains two or more separate components.

      • Thanks for the reply! I already have all the scores needed; I managed to scrape them from the international-football website. I’m using matches between World Cups as a parameter (1930 uses data from 1927 to 1930). I don’t know if it will do, but I’ll test. There’s one problem though: yes, it’s giving me this warning about clusters that are not comparable. Probably some group of islands kept playing each other but no one else. Is there a way to identify these clusters and remove them?

        • You can use the igraph package. I don’t quite remember the details, but you create a graph (i.e. a network) of all teams using the graph_from_data_frame function, using just the two vectors of teams, and then you use some function to find the “connected components”, I don’t remember the name.
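The idea behind finding the clusters can be sketched without any graph library. Below is a minimal, self-contained Python version that builds the network from pairs of teams and finds the connected components with a breadth-first search. The function name and the match list are made up for illustration. In igraph, I believe the corresponding functions are graph_from_data_frame() and components().

```python
from collections import defaultdict, deque


def connected_components(matches):
    """Group teams into clusters that are linked by at least one chain of matches."""
    neighbours = defaultdict(set)
    for home, away in matches:
        neighbours[home].add(away)
        neighbours[away].add(home)

    seen = set()
    components = []
    for team in list(neighbours):
        if team in seen:
            continue
        # Breadth-first search from a team we have not visited yet.
        component = set()
        queue = deque([team])
        seen.add(team)
        while queue:
            current = queue.popleft()
            component.add(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)

    # Largest cluster first.
    return sorted(components, key=len, reverse=True)


# Made-up fixture list: one main cluster and one isolated pair.
matches = [
    ("Uruguay", "Argentina"),
    ("Argentina", "Brazil"),
    ("Fiji", "Tahiti"),
]
components = connected_components(matches)
main_cluster = components[0]

# Keep only the matches where both teams belong to the largest cluster.
kept = [m for m in matches if m[0] in main_cluster and m[1] in main_cluster]
print(components)
print(kept)
```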
