Introducing the goalmodel R package

I have written a lot about different models for forecasting football results, and provided a lot of R code along the way. Especially popular are my posts about the Dixon-Coles model, where people still post comments, four years since I first wrote them. Because of the interest in them, and the interest in some of the other models I have written about, I decided to tidy up my code and functions a bit, and make an R package out of it. The result is the goalmodel R package. The package let you fit the ordinary Poisson model, which was one of the first models I wrote about, the Dixon-Coles model, The Negative-Binomial model, and you can also use the adjustment I wrote about in my previous update.

The package contains a function to fit the different models, and you can even combine different aspects of the different models into the same model. You can for instance use the Dixon-Coles adjustment together with a negative binomial model. There is also a range of different methods for making prediction of different kinds, such as expected goals and over/under.

The package can be downloaded from github. It is still just the initial version, so there are probably some bugs and stuff to be sorted out, but go and try it out and let me know what you think!

43 thoughts on “Introducing the goalmodel R package

  1. Thanks for making this work public. I’m new to both sports betting and R, but I did science in uni back in the day, so I will knock the rust off and try my hand at modeling. 🙂

    Would the package be useful for predicting other sports like hockey and basket?

  2. Playing around with the goalmodel package for English championship. Can we make a weighted Rue Salvesen model with weights from the weights_dc function, like this?

    #
    # Weighted Rue-Salvesen model (rsw)
    #
    my_weights <- weights_dc(championship$Date, xi=0.0019)
    length(my_weights)
    plot(championship$Date, my_weights)

    gm_res_rsw <- goalmodel(goals1 = championship$hgoal, goals2 = championship$vgoal,
    team1 = championship$home, team2=championship$visitor,
    rs=TRUE, weights = my_weights)

    summary(gm_res_rsw)

    • Yes, you can use weights for all types of models. However, read my blog post about the Rue-Salvesen adjustment, and how it might not be a good idea to actually fit the model with the adjustment. Also take a look at the github page about how you can set the RS adjustment parameter after you have fit the default (or Dixon-Coles) model.

    • Thanks! I have now read up on parameter setting and tested it.

      Forgive me for being an R newb, but how would you filter for matches between two dates?

      I take it that we load up multiple seasons like this (let’s do 2011-2015):

      # Load data from English Premier League, 2011/2012 to
      # 2015/2016 season.

      england %>%
      filter(Season %in% c(2011):c(2015),
      tier==c(1)) %>%
      mutate(Date = as.Date(Date)) -> england_2011_2015

      Let’s say we want to predict a match within this dataset, using all matches played up to that point. How do we specify a sub-dataset between two dates? Specifically, between the 1st game of 2011/2012 season and some date before the end of the 2015/2016 season?

      Thanks in advance and sorry for the nag. 😉

  3. Amazing package! Why do you give it for free? And how much money did you make already?
    One question: Why don’t you use the derivatives in optimization?

    • The optim function use the finite differences method to compute derivatives, unless a function that computes these is provided. Deriving the derivatives for all the models in the package is boring and difficult, so I haven’t done it.

  4. This looks amazing and exactly what I’ve been looking for, thanks!

    Sorry for being a pain though but do you know of any step by step guides for using R? I’m a complete novice and have figured the odd thing out through trial and error / Google but am struggling to get this to work. Apologies again and keep up the good work

  5. Great post thanks. HAve you covered how you apply time weight in simple poisson distribution?

    Do you apply it to log-likelihood function as do Dixon Coles?
    If so, you don;y use built in glm function in R, but you build log-likelihood and minimize it?

    thanks,

    • I have blogged about the time-weighting, using ordinary Poisson regression, here: http://opisthokonta.net/?p=1013

      If you use the built-in glm function, you just use the weight argument. The goalmodel package don’t use the glm function, but has its own implementation of the poisson log-likelihood, which can be weighted in the same manner as the DC likelihood.

  6. Hi, I’ve never used R before so wouldn’t even know where to start. I’ve been moedlling football games from many leagues using expected goals that I scrape and then use Poisson to get the probabilities for the upcoming games. Would this tool help me make it quicker/better ? Is it possible to implement expected goals data into etc. ? Sorry for all this questions but I’m new to the programing language.

  7. Hi!

    I’m having some problems with a gaussian model. It gives very big numbers for attack/defence parameters.. This happens usually for the first team in the list. So if i rename “Arsenal” to “Barsenal” for example, the problem would move to Aston Villa here..

    Model Gaussian

    Log Likelihood -1699.37
    AIC 3588.74
    R-squared NA
    Parameters (estimated) 95
    Parameters (fixed) 0

    Team Attack Defense
    Arsenal -13.21 0.10
    Aston Villa 0.40 -0.11
    Barnsley 0.03 -0.10
    Birmingham -0.00 -0.08
    ……………………………………………………..

    Maybe a bug in goalmodel? I tried to locate it but i’m more into C# than R.

    But anyway thanks for a great package! Its really helpful!

  8. Hey was just wondering – when looking at the parameter estimates at the end of the pl 2011 season using the 2011 season’s data (time weighted with xi=0.0018 and BFGS optimization as in your blog in the Dixon Coles model) I get seemingly decent parameter estimates, but when you try and return the log-likelihood for the estimates (which has just been minimized) it returns Inf. I assume this is because an adj value is negative, but is there any way of including that as a constraint in the optimization? So rho cannot go between the two values outlined in the original paper?

    Also is there a way you could return the log-likelihood still, just excusing those values that return a negative dc_adj? I played with the idea of just returning an all-1 matrix instead so that it doesn’t affect the prob matrix/optimization otherwise? Worried that these inf values may be affecting the estimates? Thanks again!

    • I already put in an edit that if dc_adj is negative it makes that value of dc_adj for that game equal 1 (so when logged it makes no difference to the expected goals). Can then calculate the total log-likelihood for those games that have an ‘acceptable’ dc_adj.

      I was just wondering if you thought that might affect the estimates/why it would be occurring that it gives values that fall outside the required values in the tau function?
      Thanks!

      • Sorry for the dearth of comments but I’ve just checked further and separately calculated the dc_adj and not found any negative values – is there some issue with the coding of
        if (any(dc_adj <= 0)){
        return(Inf)
        }
        It doesn't appear wrong but when I edit that I can get a log-likelihood value out?

    • I couldn’r reproduce your error. What is the value returned from optim (optim_res$value)? I get something like 871 with the same data and xi as you. If you get a value from this, but not when you evaluate the (weighted) loglikelihood function itself, it seems like the problem is not with the dc_adj being negative (as you pointed out), but somewhere else. Did you mess up the estimated parameters somehow? or the weights?

      It would be hard to implement the constraints for rho in other ways, since they depends on lambda and mu, which them self are functions of all the attack and defense parameters. The constraints are actually there to ensure that tau (dc_adj) is not negative. Returning inf is a slightly hacky way of adding constraints to the optimization, but it usually works. Making dc_adj be 1 instead will probably affect the estimation, but not sure in what way.

  9. Many thanks for this package.

    I am looking at the predict_goals function. Does this predict score to nil, for example 0-0, 0-1, 1-0 etc? Perhaps I am not interpreting the data-frame correctly?

    [,1] [,2] [,3] [,4] [,5]
    [1,] 4.019% 3.265% 2.346% 0.928% 0.275%
    [2,] 6.691% 9.445% 5.196% 2.056% 0.610%

    That appears to be predicting 1-1, 1-2, 1-3 … and 2-1, 2-2, 2-3 …

    Thank you.

    • Yes it predicts the 0 score. The 1’s you see are the index of the matrix, which in R counts from 1. So element [1,1] corresponds to score 0-0, element [2,1] to 1-0, and so on.

  10. Hello,

    When it comes to predict_goals, is there a way to get a plot of this, or the data as a table? I’m good with the basics of R but I’m struggling to think how to do this. Or a way to modify it to output the most likely score?

    Best,

    • Here is how I use to do it using ggplot:

      # Matrix of probabilities. 
      # probmat should be the output from predict_goals()
      # but here i make one just to have a self contained example.
      probmat <- dpois(x = 0:5, lambda = 1) %*% t(dpois(x = 0:6, lambda = 3))
      
      # Make long data frame.
      df <- expand.grid(X = 0:(nrow(probmat)-1), 
                       Y = 0:(ncol(probmat)-1))
      
      # Add probabilities from probmat.
      # Using the c() function on a matrix flattens it to a vector.
      df$PROB <- c(probmat)
      
      # Make labels with 3 decimals for plotting.
      df$PROB_LABELS <- sprintf('%.3f', df$PROB)
      
      # Make plot.
      ggplot(df, aes(x=X, y=Y, fill=PROB, label=PROB_LABELS)) +
        geom_tile() +
        geom_text() +
        scale_fill_continuous(low='white', high = 'orange') +
        scale_x_continuous(breaks=0:(nrow(probmat)-1) ) +
        scale_y_continuous(breaks=0:(ncol(probmat)-1) ) +
        theme_bw() +
        # Remove gridlines.
        theme(panel.grid = element_blank(),
              axis.ticks = element_blank())
      
  11. Hi,

    I am also working on football models and have interesting dataset that inclides live stats plus bookmaker odds per each match minute. It covers all games for last 4 years.
    I am trying to extract as much profitable strats from it as possible and looking for colaboration with someone with good football analysis background. There are so much can be done with this data.
    Can we have a discussion somewhere? (failed to find your contacts)

    My telegram: @jurikmazurik
    Also be free to email me.

    Thanks.

    • That sounds like an interesting data set. Unfortunately i dont have much time right now to do collaborations. I will leave this comment here in case anyone else is interested.

  12. Great package.
    Please can I ask what the parameter ‘Intercept’ is used for when using the Dixon-Coles option in the goalmodel package ?

  13. Hi, Is it possible to fit a model without home advantage parameter? I was thinking testing this for world cup scores, but can’t figure out out to remove it calculating home field advantage.

  14. Hi, thanks for your package. It’s great.
    I have created two Poisson models with goalmodel. One is based on goals, the other on xG. Now I want to combine them together and weight them (30 and 70 % respectively). What is the best way to do this? Is it a good idea, to change the values in the list gm_res_dc$parameters$attack/defense?

    • I assume you have read this article by Ben Torvaney, where he finds the 30/70 weights:

      https://www.statsandsnakeoil.com/2021/06/09/does-xg-really-tell-all/

      He combines the individual parameters, and discusses other options. I would think that the best approach would be to use each model to make their own predictions, and then combine the predictions. This is much more flexible since you can have different types of models that models things other than attack and defence strengts, such as correlations in the number of goals, overdispersion, and so on. The combining should work for all types of prediction types, such as 1×2, over/under, expected goals, etc, and perhaps different weights would be better for certain types of predictions. It should also be possible to combine more than two models as well.

      There is a whole litterature on model averaging, or ensemble models and so on. I think a neat approach that would be suitable here is what is called “model stacking”. Basically, you do a leave-one-out cross validation for all your models, and you optimize a weighted combination of the LOO predictions to fit the data (least squares). You can use backtesting instead of LOO, and depending on the predictions you want to make you can use other losses than square loss. Page 3 in this pdf gives a really short, but good, descriprion of model stacking:

      http://www.stat.columbia.edu/~gelman/research/published/stacking_paper_discussion_rejoinder.pdf

  15. Hi,

    Do you think using confusion matrix as a metric is appropriate? For example, making prediction from the model, we get 50%-10%-40% as the probability of home team winning, draw match and away team winning respectively. Then taking the largest probability as the final result which is home team will win (50%) the this example. And we make a comparison of our predictions (home win, tie or away win) to the actual results.

    (It will end up with a 3×3 matrix).

    Also is it reasonable to measure our model in this way: take the mean absolute value of the difference between predicted score and actual score (MAE)?

    • The confusion matrix is not a metric, but a table of the number of observed outcomes for each predicted outcome. From the confusion matrix you can calculate a large number of metrics. Wikipedia lists many of them:
      https://en.wikipedia.org/wiki/Confusion_matrix

      That said, what you should do to evaluate your predictions is to take the probability of the observed outcome (50% if the home team wins in your example, or 10% of draw, etc). Then take the logarithm of the probability, and then sum together all the log-probabilities. This metric you can compare between models (provided you make predictions on the same data). See this paper for more information:
      https://arxiv.org/abs/1908.08980

      In the goalmodel package there is a function called “score_predictions” that can help you with this.

  16. You mentioned on Github

    The goalmodel package models the goal scoring intensity so that it is a function of the two opposing teams attack and defense ratings. Let ?1 and ?2 be the goal scoring intensities for the two sides. The default model in the goalmodel package then models these as

    log(?1) = ? + hfa + attack1 ? defense2

    log(?2) = ? + attack2 ? defense1

    where ? is the intercept (approximately the average number of goals scored) and hfa is the home field advantage given to team 1. By default the number of goals scored, Y1 and Y2 are distributed as

    Y1 ? Poisson(?1)

    Y2 ? Poisson(?2)

    Can you please explain further or provide an article, video tutorial, searching keyword on how to solve it? I mean given Home team, Away team, Home score and Away score, how can we get ?, hfa, attack1, attack2, defense1 and defense2?

    • The entire purpose of the goalmodel package is to solve (or estimate) the hfa, attack and defense parameters. The same page you copied the text from is a tutorial on how to do it.

Leave a Reply to Mark Cancel reply

Your email address will not be published. Required fields are marked *