Introducing the goalmodel R package

I have written a lot about different models for forecasting football results, and provided a lot of R code along the way. Especially popular are my posts about the Dixon-Coles model, where people still post comments four years after I first wrote them. Because of the interest in them, and in some of the other models I have written about, I decided to tidy up my code and functions a bit and make an R package out of it. The result is the goalmodel R package. The package lets you fit the ordinary Poisson model, which was one of the first models I wrote about, the Dixon-Coles model, and the negative binomial model, and you can also use the adjustment I wrote about in my previous update.

The package contains a function to fit the different models, and you can even combine aspects of the different models into the same model. You can, for instance, use the Dixon-Coles adjustment together with a negative binomial model. There is also a range of methods for making predictions of different kinds, such as expected goals and over/under.

The package can be downloaded from GitHub. It is still just the initial version, so there are probably some bugs and other things to be sorted out, but go and try it out and let me know what you think!

20 thoughts on “Introducing the goalmodel R package”

  1. Thanks for making this work public. I’m new to both sports betting and R, but I did science in uni back in the day, so I will knock the rust off and try my hand at modeling. 🙂

    Would the package be useful for predicting other sports like hockey and basket?

  2. Playing around with the goalmodel package for the English Championship. Can we make a weighted Rue-Salvesen model with weights from the weights_dc function, like this?

    #
    # Weighted Rue-Salvesen model (rsw)
    #
    my_weights <- weights_dc(championship$Date, xi=0.0019)
    length(my_weights)
    plot(championship$Date, my_weights)

    gm_res_rsw <- goalmodel(goals1 = championship$hgoal, goals2 = championship$vgoal,
                            team1 = championship$home, team2 = championship$visitor,
                            rs = TRUE, weights = my_weights)

    summary(gm_res_rsw)

    • Yes, you can use weights for all types of models. However, read my blog post about the Rue-Salvesen adjustment, and how it might not be a good idea to actually fit the model with the adjustment. Also take a look at the github page about how you can set the RS adjustment parameter after you have fit the default (or Dixon-Coles) model.

    • Thanks! I have now read up on parameter setting and tested it.

      Forgive me for being an R newb, but how would you filter for matches between two dates?

      I take it that we load up multiple seasons like this (let’s do 2011-2015):

      # Load data from English Premier League, 2011/2012 to
      # 2015/2016 season.
      library(dplyr)
      library(engsoccerdata)  # assumed source of the england data set

      england %>%
        filter(Season %in% 2011:2015, tier == 1) %>%
        mutate(Date = as.Date(Date)) -> england_2011_2015

      Let’s say we want to predict a match within this dataset, using all matches played up to that point. How do we specify a sub-dataset between two dates? Specifically, between the 1st game of 2011/2012 season and some date before the end of the 2015/2016 season?

      Thanks in advance and sorry for the nag. 😉
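
One way to answer the date question above, as a minimal sketch in base R. The data frame `matches` and the two cut-off dates are made up for illustration; with the real data you would use the Date column created by `as.Date` in the same way.

```r
# Made-up match data standing in for the real data set.
matches <- data.frame(
  Date = as.Date(c("2011-08-13", "2013-01-01", "2015-12-26", "2016-05-15")),
  home = c("A", "B", "C", "D"),
  visitor = c("B", "C", "D", "A")
)

# Hypothetical start and end dates for the training window.
start_date <- as.Date("2011-08-13")
end_date <- as.Date("2015-12-31")

# Keep only matches played on or between the two dates.
in_window <- matches$Date >= start_date & matches$Date <= end_date
matches_subset <- matches[in_window, ]
matches_subset
```

With dplyr, the same condition can go inside `filter`, for example `filter(Date >= start_date, Date <= end_date)`.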

  3. Amazing package! Why do you give it for free? And how much money did you make already?
    One question: Why don’t you use the derivatives in optimization?

    • The optim function uses finite differences to compute derivatives, unless a function that computes them is provided. Deriving the derivatives for all the models in the package is boring and difficult, so I haven’t done it.
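
To illustrate the difference, here is a small sketch with a one-parameter Poisson likelihood. This is not code from the package: the data are simulated and the function names `negloglik` and `negloglik_grad` are made up for the example.

```r
# Toy data: simulated Poisson counts (made up for illustration).
set.seed(1)
y <- rpois(200, lambda = 1.4)

# Negative Poisson log-likelihood, parameterised by log(lambda).
negloglik <- function(par, y) {
  -sum(dpois(y, lambda = exp(par), log = TRUE))
}

# Analytical derivative of negloglik with respect to log(lambda).
negloglik_grad <- function(par, y) {
  -sum(y - exp(par))
}

# Without gr, optim approximates the derivative by finite differences.
fit_fd <- optim(par = 0, fn = negloglik, y = y, method = "BFGS")

# With gr, optim uses the supplied derivative instead.
fit_gr <- optim(par = 0, fn = negloglik, gr = negloglik_grad,
                y = y, method = "BFGS")

# Both should end up close to log(mean(y)), the maximum likelihood estimate.
c(finite_diff = fit_fd$par, analytical = fit_gr$par, exact = log(mean(y)))
```

For a single parameter the derivative is easy to write down. With attack and defence parameters for every team, plus the Dixon-Coles or negative binomial extras, it gets a lot more involved.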

  4. This looks amazing and exactly what I’ve been looking for, thanks!

    Sorry for being a pain though but do you know of any step by step guides for using R? I’m a complete novice and have figured the odd thing out through trial and error / Google but am struggling to get this to work. Apologies again and keep up the good work

  5. Great post, thanks. Have you covered how you apply time weighting in the simple Poisson model?

    Do you apply it to the log-likelihood function, as Dixon and Coles do?
    If so, you don’t use the built-in glm function in R, but build the log-likelihood and minimize it?

    thanks,

    • I have blogged about the time-weighting, using ordinary Poisson regression, here: http://opisthokonta.net/?p=1013

      If you use the built-in glm function, you just use the weights argument. The goalmodel package doesn’t use the glm function, but has its own implementation of the Poisson log-likelihood, which can be weighted in the same manner as the DC likelihood.
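
The two approaches can be compared with a minimal sketch like the one below. It is not the package's actual implementation: the data, the weights and the function `weighted_negloglik` are made up for illustration.

```r
# Simulated data: a hypothetical home indicator and Poisson goal counts.
set.seed(42)
n <- 300
home <- rep(c(1, 0), length.out = n)
goals <- rpois(n, lambda = exp(0.1 + 0.3 * home))

# Made-up weights where later rows count more, mimicking time weighting.
w <- exp(-0.002 * (n:1))

# Route 1: built-in Poisson regression with the weights argument.
fit_glm <- glm(goals ~ home, family = poisson(), weights = w)

# Route 2: hand-written weighted negative log-likelihood, minimised with optim.
weighted_negloglik <- function(beta, y, x, w) {
  lambda <- exp(beta[1] + beta[2] * x)
  -sum(w * dpois(y, lambda = lambda, log = TRUE))
}
fit_optim <- optim(par = c(0, 0), fn = weighted_negloglik,
                   y = goals, x = home, w = w, method = "BFGS")

# The two sets of coefficients should agree up to optimiser tolerance.
rbind(glm = coef(fit_glm), optim = fit_optim$par)
```

In both cases each match contributes its log-likelihood multiplied by its weight, so the estimates should agree closely.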

  6. Hi, I’ve never used R before so wouldn’t even know where to start. I’ve been modelling football games from many leagues using expected goals that I scrape, and then use Poisson to get the probabilities for the upcoming games. Would this tool help me make it quicker/better? Is it possible to implement expected goals data into it, etc.? Sorry for all these questions, but I’m new to the programming language.

  7. Hi!

    I’m having some problems with a Gaussian model. It gives very big numbers for the attack/defence parameters. This usually happens for the first team in the list. So if I rename “Arsenal” to “Barsenal”, for example, the problem would move to Aston Villa here.

    Model Gaussian

    Log Likelihood -1699.37
    AIC 3588.74
    R-squared NA
    Parameters (estimated) 95
    Parameters (fixed) 0

    Team          Attack   Defense
    Arsenal       -13.21      0.10
    Aston Villa     0.40     -0.11
    Barnsley        0.03     -0.10
    Birmingham     -0.00     -0.08
    ……………………………………………………..

    Maybe a bug in goalmodel? I tried to locate it but I’m more into C# than R.

    But anyway, thanks for a great package! It’s really helpful!

  8. Hey, was just wondering: when looking at the parameter estimates at the end of the PL 2011 season using the 2011 season’s data (time weighted with xi=0.0018 and BFGS optimization, as in your blog post on the Dixon-Coles model), I get seemingly decent parameter estimates, but when I try to return the log-likelihood for the estimates (which has just been minimized) it returns Inf. I assume this is because an adj value is negative, but is there any way of including that as a constraint in the optimization, so that rho cannot go outside the two values outlined in the original paper?

    Also is there a way you could return the log-likelihood still, just excusing those values that return a negative dc_adj? I played with the idea of just returning an all-1 matrix instead so that it doesn’t affect the prob matrix/optimization otherwise? Worried that these inf values may be affecting the estimates? Thanks again!

    • I already put in an edit that if dc_adj is negative it makes that value of dc_adj for that game equal 1 (so when logged it makes no difference to the expected goals). Can then calculate the total log-likelihood for those games that have an ‘acceptable’ dc_adj.

      I was just wondering if you thought that might affect the estimates/why it would be occurring that it gives values that fall outside the required values in the tau function?
      Thanks!

      • Sorry for the dearth of comments but I’ve just checked further and separately calculated the dc_adj and not found any negative values – is there some issue with the coding of
        if (any(dc_adj <= 0)){
        return(Inf)
        }
        It doesn't appear wrong but when I edit that I can get a log-likelihood value out?
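
For reference, the adjustment discussed in this thread can be sketched as below, following the tau function and the bounds on rho given in the original Dixon-Coles paper. The function names `tau_low_scores` and `rho_bounds` and the example values are made up for illustration, and this is not the package's internal code.

```r
# Dixon-Coles adjustment for the four low-scoring results of one match.
# lambda and mu are the expected goals for the home and away team.
# The adjustment is 1 for every other scoreline.
tau_low_scores <- function(lambda, mu, rho) {
  c("0-0" = 1 - lambda * mu * rho,
    "0-1" = 1 + lambda * rho,
    "1-0" = 1 + mu * rho,
    "1-1" = 1 - rho)
}

# Range of rho for which all four adjustments stay non-negative.
rho_bounds <- function(lambda, mu) {
  c(lower = max(-1 / lambda, -1 / mu),
    upper = min(1 / (lambda * mu), 1))
}

# Hypothetical expected goals for a single match.
lambda <- 1.6
mu <- 1.1

rho_bounds(lambda, mu)

# rho inside the bounds: all four adjustments are positive.
tau_low_scores(lambda, mu, rho = -0.1)

# rho above the upper bound: the 0-0 adjustment becomes negative.
tau_low_scores(lambda, mu, rho = 0.7)
```

Because the bounds depend on lambda and mu, the allowed range for rho differs from match to match.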
