# Introducing the goalmodel R package

I have written a lot about different models for forecasting football results, and provided a lot of R code along the way. Especially popular are my posts about the Dixon-Coles model, where people still post comments, four years since I first wrote them. Because of the interest in them, and the interest in some of the other models I have written about, I decided to tidy up my code and functions a bit, and make an R package out of it. The result is the goalmodel R package. The package let you fit the ordinary Poisson model, which was one of the first models I wrote about, the Dixon-Coles model, The Negative-Binomial model, and you can also use the adjustment I wrote about in my previous update.

The package contains a function to fit the different models, and you can even combine different aspects of the different models into the same model. You can for instance use the Dixon-Coles adjustment together with a negative binomial model. There is also a range of different methods for making prediction of different kinds, such as expected goals and over/under.

The package can be downloaded from github. It is still just the initial version, so there are probably some bugs and stuff to be sorted out, but go and try it out and let me know what you think!

## 29 thoughts on “Introducing the goalmodel R package”

1. Thanks for making this work public. I’m new to both sports betting and R, but I did science in uni back in the day, so I will knock the rust off and try my hand at modeling. ðŸ™‚

Would the package be useful for predicting other sports like hockey and basket?

2. Playing around with the goalmodel package for English championship. Can we make a weighted Rue Salvesen model with weights from the weights_dc function, like this?

#
# Weighted Rue-Salvesen model (rsw)
#
my_weights <- weights_dc(championship\$Date, xi=0.0019)
length(my_weights)
plot(championship\$Date, my_weights)

gm_res_rsw <- goalmodel(goals1 = championship\$hgoal, goals2 = championship\$vgoal,
team1 = championship\$home, team2=championship\$visitor,
rs=TRUE, weights = my_weights)

summary(gm_res_rsw)

• Yes, you can use weights for all types of models. However, read my blog post about the Rue-Salvesen adjustment, and how it might not be a good idea to actually fit the model with the adjustment. Also take a look at the github page about how you can set the RS adjustment parameter after you have fit the default (or Dixon-Coles) model.

• Thanks! I have now read up on parameter setting and tested it.

Forgive me for being an R newb, but how would you filter for matches between two dates?

I take it that we load up multiple seasons like this (let’s do 2011-2015):

# Load data from English Premier League, 2011/2012 to
# 2015/2016 season.

england %>%
filter(Season %in% c(2011):c(2015),
tier==c(1)) %>%
mutate(Date = as.Date(Date)) -> england_2011_2015

Let’s say we want to predict a match within this dataset, using all matches played up to that point. How do we specify a sub-dataset between two dates? Specifically, between the 1st game of 2011/2012 season and some date before the end of the 2015/2016 season?

Thanks in advance and sorry for the nag. ðŸ˜‰

3. Amazing package! Why do you give it for free? And how much money did you make already?
One question: Why don’t you use the derivatives in optimization?

• The optim function use the finite differences method to compute derivatives, unless a function that computes these is provided. Deriving the derivatives for all the models in the package is boring and difficult, so I haven’t done it.

4. This looks amazing and exactly what I’ve been looking for, thanks!

Sorry for being a pain though but do you know of any step by step guides for using R? I’m a complete novice and have figured the odd thing out through trial and error / Google but am struggling to get this to work. Apologies again and keep up the good work

5. Great post thanks. HAve you covered how you apply time weight in simple poisson distribution?

Do you apply it to log-likelihood function as do Dixon Coles?
If so, you don;y use built in glm function in R, but you build log-likelihood and minimize it?

thanks,

• I have blogged about the time-weighting, using ordinary Poisson regression, here: http://opisthokonta.net/?p=1013

If you use the built-in glm function, you just use the weight argument. The goalmodel package don’t use the glm function, but has its own implementation of the poisson log-likelihood, which can be weighted in the same manner as the DC likelihood.

6. Hi, I’ve never used R before so wouldn’t even know where to start. I’ve been moedlling football games from many leagues using expected goals that I scrape and then use Poisson to get the probabilities for the upcoming games. Would this tool help me make it quicker/better ? Is it possible to implement expected goals data into etc. ? Sorry for all this questions but I’m new to the programing language.

7. great content keep up the great work!! Why dont you make a youtube video about your github model and how to use it?

8. Hi!

I’m having some problems with a gaussian model. It gives very big numbers for attack/defence parameters.. This happens usually for the first team in the list. So if i rename “Arsenal” to “Barsenal” for example, the problem would move to Aston Villa here..

Model Gaussian

Log Likelihood -1699.37
AIC 3588.74
R-squared NA
Parameters (estimated) 95
Parameters (fixed) 0

Team Attack Defense
Arsenal -13.21 0.10
Aston Villa 0.40 -0.11
Barnsley 0.03 -0.10
Birmingham -0.00 -0.08
……………………………………………………..

Maybe a bug in goalmodel? I tried to locate it but i’m more into C# than R.

But anyway thanks for a great package! Its really helpful!

• You have a lot of parameters it seems. Could there be that some teams have very few games in your data? Also, have you used the new version 0.2? It should be more stable.

• I updated to 2.0, cannot get gaussian model to work at all anymore. Even with simple data. It always gives very high intercept.

9. Hey was just wondering – when looking at the parameter estimates at the end of the pl 2011 season using the 2011 season’s data (time weighted with xi=0.0018 and BFGS optimization as in your blog in the Dixon Coles model) I get seemingly decent parameter estimates, but when you try and return the log-likelihood for the estimates (which has just been minimized) it returns Inf. I assume this is because an adj value is negative, but is there any way of including that as a constraint in the optimization? So rho cannot go between the two values outlined in the original paper?

Also is there a way you could return the log-likelihood still, just excusing those values that return a negative dc_adj? I played with the idea of just returning an all-1 matrix instead so that it doesn’t affect the prob matrix/optimization otherwise? Worried that these inf values may be affecting the estimates? Thanks again!

• I already put in an edit that if dc_adj is negative it makes that value of dc_adj for that game equal 1 (so when logged it makes no difference to the expected goals). Can then calculate the total log-likelihood for those games that have an ‘acceptable’ dc_adj.

I was just wondering if you thought that might affect the estimates/why it would be occurring that it gives values that fall outside the required values in the tau function?
Thanks!

• Sorry for the dearth of comments but I’ve just checked further and separately calculated the dc_adj and not found any negative values – is there some issue with the coding of
return(Inf)
}
It doesn't appear wrong but when I edit that I can get a log-likelihood value out?

• I couldn’r reproduce your error. What is the value returned from optim (optim_res\$value)? I get something like 871 with the same data and xi as you. If you get a value from this, but not when you evaluate the (weighted) loglikelihood function itself, it seems like the problem is not with the dc_adj being negative (as you pointed out), but somewhere else. Did you mess up the estimated parameters somehow? or the weights?

It would be hard to implement the constraints for rho in other ways, since they depends on lambda and mu, which them self are functions of all the attack and defense parameters. The constraints are actually there to ensure that tau (dc_adj) is not negative. Returning inf is a slightly hacky way of adding constraints to the optimization, but it usually works. Making dc_adj be 1 instead will probably affect the estimation, but not sure in what way.

10. Many thanks for this package.

I am looking at the predict_goals function. Does this predict score to nil, for example 0-0, 0-1, 1-0 etc? Perhaps I am not interpreting the data-frame correctly?

[,1] [,2] [,3] [,4] [,5]
[1,] 4.019% 3.265% 2.346% 0.928% 0.275%
[2,] 6.691% 9.445% 5.196% 2.056% 0.610%

That appears to be predicting 1-1, 1-2, 1-3 … and 2-1, 2-2, 2-3 …

Thank you.

• Yes it predicts the 0 score. The 1’s you see are the index of the matrix, which in R counts from 1. So element [1,1] corresponds to score 0-0, element [2,1] to 1-0, and so on.

11. Hello,

When it comes to predict_goals, is there a way to get a plot of this, or the data as a table? I’m good with the basics of R but I’m struggling to think how to do this. Or a way to modify it to output the most likely score?

Best,

• Here is how I use to do it using ggplot:

```# Matrix of probabilities.
# probmat should be the output from predict_goals()
# but here i make one just to have a self contained example.
probmat <- dpois(x = 0:5, lambda = 1) %*% t(dpois(x = 0:6, lambda = 3))

# Make long data frame.
df <- expand.grid(X = 0:(nrow(probmat)-1),
Y = 0:(ncol(probmat)-1))

# Using the c() function on a matrix flattens it to a vector.
df\$PROB <- c(probmat)

# Make labels with 3 decimals for plotting.
df\$PROB_LABELS <- sprintf('%.3f', df\$PROB)

# Make plot.
ggplot(df, aes(x=X, y=Y, fill=PROB, label=PROB_LABELS)) +
geom_tile() +
geom_text() +
scale_fill_continuous(low='white', high = 'orange') +
scale_x_continuous(breaks=0:(nrow(probmat)-1) ) +
scale_y_continuous(breaks=0:(ncol(probmat)-1) ) +
theme_bw() +
# Remove gridlines.
theme(panel.grid = element_blank(),
axis.ticks = element_blank())
```
12. Hi,

I am also working on football models and have interesting dataset that inclides live stats plus bookmaker odds per each match minute. It covers all games for last 4 years.
I am trying to extract as much profitable strats from it as possible and looking for colaboration with someone with good football analysis background. There are so much can be done with this data.
Can we have a discussion somewhere? (failed to find your contacts)

My telegram: @jurikmazurik
Also be free to email me.

Thanks.

• That sounds like an interesting data set. Unfortunately i dont have much time right now to do collaborations. I will leave this comment here in case anyone else is interested.

13. Great package.
Please can I ask what the parameter ‘Intercept’ is used for when using the Dixon-Coles option in the goalmodel package ?