In the previous blog posts about predicting football results using Poisson regression I have mostly ignored the fact that the data points (ie matches) used to fit the models are gathered (played) at different time points.

In the 1997 Dixon and Coles paper where they described the bivariate adjustment for low scores they also discussed using weighted maximum likelihood to better make the parameter estimates reflect the current abilities of the teams, rather than as an average over the whole period you have data from.

Dixon and Coles propose to weight the games using a function so that games are down weighted exponentially by how long time it is since they were played. The function to determine the weight for a match played is

where t is the time since the match was played, and \(\xi\) is a positive parameter that determines how much down-weighting should occur.

I have implemented this function in R, but I have done a slight modification from the one from the paper. Dixon and Coles uses “half weeks” as their time unit, but they do not describe in more detail what exactly they mean. They probably used Wednesday or Thursday as the day a new half-week starts, but I won’t bother implementing something like that. Instead I am just going to use days as the unit of time.

This function takes a vector of the match dates (data type *Date*) and computes the weights according to the current date and a value of \(\xi\). The *currentDate* argument lets you set the date to count from, with all dates after this will be given weight 0.

DCweights <- function(dates, currentDate=Sys.Date(), xi=0){ datediffs <- dates - as.Date(currentDate) datediffs <- as.numeric(datediffs *-1) w <- exp(-1*xi*datediffs) w[datediffs <= 0] <- 0 #Future dates should have zero weights return(w) }

We can use this function to plot how the much weight the games in the past is given for different values of \(\xi\). Here we see that \(\xi = 0\) gives the same weight to all the matches. I have also set the *currentDate* as a day in April 2013 to illustrate how future dates are weighted 0.

To figure out the optimal value for \(\xi\) Dixon and Coles emulated a situation where they predicted the match results using only the match data prior to the match in question, and then optimizing for prediction ability. They don’t explain how exactly they did the optimizing, but I am not going to use the *optim* function to do this. Instead I am going to just try a lot of different values of \(\xi\), and go for the one that is best. The reason for this is that doing the prediction emulation takes some time, and using an optimizing algorithm will take an unpredictable amount of time.

I am not going to use the Dixon-Coles model here. Instead I am going for the independent Poisson model. Again, the reason is that I don’t want to use too much time on this.

To measure prediction ability, Dixon & Coles used the predictive log-likelihood (PLL). This is just the logarithm of the probabilities calculated by the model for the outcome that actually occurred, added together for all matches. This means that a greater PLL indicates that the actual outcomes was more probable according to the model, which is what we want.

I want to use an additional measure of prediction ability to complement the PLL: The ranked probability score (RPS). This is a measure of prediction error, and takes on values between 0 and 1, with 0 meaning perfect prediction. RPS measure takes into account the entire probability distribution for the three outcomes, not just the probability of the observed outcome. That means a high probability of a draw is considered less of an error that a high probability of away win, if the actual outcome was home win. This measure was popularized in football analytics by Anthony Constantinou in his paper *Solving the Problem of Inadequate Scoring Rules for Assessing Probabilistic Football Forecast Models*. You can find a link to a draft version of that paper on Constantinou’s website.

I used data from the 2005-06 season and onwards, and did predictions from January 2007 and up until the end of 2014. I also skipped the ten first match days at the beginning of each season to avoid problems with lack of data for the promoted teams. I did this for the top leagues in England, Germany, Netherlands and France. Here are optimal values of \(\xi\) according to the two prediction measurements:

PLL | RPS | |
---|---|---|

England | 0.0018 | 0.0018 |

Germany | 0.0023 | 0.0023 |

Netherlands | 0.0019 | 0.0020 |

France | 0.0019 | 0.0020 |

The RPS and PLL mostly agree, and where they disagree, it is only by one in the last decimal place.

Dixon & Coles found an optimal value of 0.0065 in their data (which were from the 1990s and included data from the top four English leagues and the FA cup), but they used half weeks instead of days as their time unit. Incidentally, if we divide their value by the number of days in a half week (3.5 days) we get 0.00186, about the same I got. The German league has the greatest optimum value, meaning historical data is of less importance when making predictions.

An interesting thing to do is to plot the predictive ability against different values of \(\xi\). Here are the PLL (adjusted to be between 0 and 1, with the optimum at 1) for England and Germany compared, with their optima indicated by the dashed vertical lines:

I am not sure I want to interpret this plot too much, but it does seems like predictions for the German league are more robust to values of \(\xi\) greater than the optimum, as indicated by the slower decline in the graph, than the English league.

So here I have presented the values of \(\xi\) for the independent Poisson regression model, but will these values be the optimal for the Dixon & Coles model? Probably not, but I suspect there will be less variability between the two models than between the same model fitted for different leagues.

Could you show how you could include this into the existing Dixon and Coles model you have produced please? I’m not sure how you add it in and I’m interested to see the results.

It is really simple actually, just multiply the weights with the log likelihood for each match, before you sum them together. You can replace the original log likelihood function with this:

It should be obvious, but anyway: Give your weights (calculated by the

DCweightsfunction, for example) to theweightsargument. If you don’t give it any weights it will give equal weight to all matches,i.e.the same results as before.Hi, there. I have read many of your article. But I don’t understand this one. So, do you mean I should type “DCweights” instead of “NULL” ?

I did it. I put the function DCweights right after the function tau, and changed the DCloglik to the function DCloglikWeighted and also the NULL to DCweights.

It gave me this:

> HomeWinProbability

[1] NA

> DrawProbability

[1] NA

> AwayWinProbability

[1] NA

please help.

it gave me this:

Error in loglik * weights: non-numeric argument of a binary operator

You should compute the weights first, for example like this:

myweights <- DCweights(dates, xi=0.0018) This creates a vector with the weights that you can pass to the DClogLikWeighted function: DClogLikWeighted(...., weights=myweights)

However, how to input dates from csv file? Should we make a matrix like attack.p in DCoptimFn?

I did it that way but it doesn’t work.

DCmodelData <- function(df){

hm <- model.matrix(~ HomeTeam – 1, data=df, contrasts.arg=list(HomeTeam='contr.treatment'))

am <- model.matrix(~ AwayTeam -1, data=df)

team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))

ax = model.matrix(~ Date -1,data=df)

return(list(

homeTeamDM=hm,

awayTeamDM=am,

homeGoals=df$FTHG,

awayGoals=df$FTAG,

dates=ax,

teams=team.names

))

}

DCoptimFn <- function(params, DCm){

home.p <- params[1]

rho.p <- params[2]

nteams <- length(DCm$teams)

ndates <- length(DCm$dates)

attack.p <- matrix(params[3:(nteams+2)], ncol=1)

defence.p <- matrix(params[(nteams+3):(nteams+2+nteams)], ncol=1)

da <- matrix(params[(nteams+nteams+3):length(DCm$dates)],ncol=1)

lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)

mu <- exp(DCm$awayTeamDM %*% attack.p + DCm$homeTeamDM %*% defence.p)

dates <- (DCm$dates %*% da)

xi <-0.0018

datediffs <- as.Date(dates,"%d%b%Y") – as.Date("2013-12-30")

datediffs <- as.numeric(datediffs *-1)

w <- exp(-1*xi*datediffs)

w[datediffs <= 0] <- 0 #Future dates should have zero weights

return(

DClogLikWeighted(y1=DCm$homeGoals, y2=DCm$awayGoals, lambda, mu, rho.p, w) * -1

)

}

You don’t really have to do anything special with the dates, just pass the column to the as.Date function, with the appropriate format argument. This is what is already done in the command that creates the datediff variable. You don’t have to use the model.matrix argument or do any matrix multiplication. So your da and (DCm$dates %*% da) stuff is not really needed. In the way you have set it up you can just do like this in the DCmodelData function:

ax <- df$Date

I think I am able to do it. Thank you so much.

I have problems to compute THIS:

myweights <- DCweights(dates, xi=0.0018)

It gives me this error:

Error in DCweights(dates, xi = 0.0018) : 'dates' object not found

It says the ‘dates’ object not found, which means you have not defined a variable called ‘dates’.

Would a Pareto decay model not be more representative of the recent value of the team strengths? Exponential decay places a lot of emphasis on form, which doesn’t take into account the varying strengths of opposition?

Have you thought about modelling adjustments for strength of schedule ?

I have been thinking a bit about other possible weighting schemes, and I will definitely look into the Pareto function. Thanks for the tip!

One thing I tried was incorporating seasons into the weights. The idea was that the current season should count more than the previous ones, since I thought data from 5 months back would less informative if that was in the previous season than in the current one. What I did was just multiplying the weights in the previous seasons by a number less than 1 (and adjusting the xi parameter a bit), but at least for the Premier League it didn’t improve the predictions.

I think these regression models incorporate strength of schedule pretty good already, since the parameter estimates relies on all the data. I am not even sure the model could

notincorporate strength of schedule, since if there were two groups of teams where no team in one group had played any team in the other group, I think the model would be unidentifiable. I have been thinking about looking into the “connectedness” between groups of teams (i.e. from different leagues) some time in the future.Did you find that adding the time weightings yielded more accurate predictions vs the independent Poisson regression model with the Dixon and Coles low scoring adjustment?

I haven’t done any extensive testing of the different models, so I am not quite sure. I hope to get around to it sometime in the near future.

I get 0 as a weight for every single past match using your code. Is that expected?

Ignore the comment! I have sorted it out! Working as expected.

Just out of curiosity (I know you didn’t intend to maximise xi within the function because it takes time), but how would you go about it?

I just tried every value of xi between 0 and 0.0030, incremented by 0.0001 (R command:

seq(0,0.0030, 0.0001)). Then I calculated the RPS for all 31 values of xi and found which value of xi gave the smallest one.I also wondered how you implemented that. If I understood correctly, you have to refit the model each matchday and compute the probability of each outcome. Then you have to do the whole process for different values of xi and compare which xi yields the highest PPL. But there is no chance for me to see an easy implementation of that <.<

Yes that’s correct. You have to do this to get a realistic prediction. You should fit the model including the most recent data, and not having any future data. The implementation of that requires two loop constructs. The inner loop iterates over all match dates and does the model fitting and predictions, the outer loop goes over all values of xi.

Hi,

Great posts! Would you be able to show code for calculating the ranked probability score please? I am currently using the Brier scores but it means calculating for win, loss and draw.

Thanks! I just posted the code as a proper blog post. Read it here:

http://opisthokonta.net/?p=1333

Hi,

first of all: great articles, very clear and straight to the point!

I tried to implement this last step of the model but I think I’m missing something. I calculate the weights outside the optimization function, and pass them as parameters inside par.inits. I think that by doing so I implicitly force the weights to be optimized too…

DCoptimFnWeighted <- function(params, DCm){

home.p <- params[1]

rho.p <- params[2]

nteams <- length(DCm$teams)

attack.p <- matrix(params[3:(nteams+2)], ncol=1)

defence.p <- matrix(params[(nteams+3):(2*nteams+2)], ncol=1)

weights <- matrix(params[(2*nteams+3):length(params)], ncol=1)

lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)

mu <- exp(DCm$awayTeamDM %*% attack.p + DCm$homeTeamDM %*% defence.p)

return(

DClogLikWeighted(y1=DCm$homeGoals, y2=DCm$awayGoals, lambda, mu, rho.p, weights) * -1

)

}

the function DClogLikWeighted is the one you suggested.

Can you please tell me where I am wrong? How should I pass the weights to the LogLikWeighted function?

thank you

Thank you!

From the code you posted it seems like you try to optimize all the weights. In the model estimation the weights are treated as given, ie they are not to be estimated. If you try to minimize the negative weighted log-likelihood functions the weights will all be estimated to -Inf or something. That is why you should pass the weights you want to use (those calculated by the DCweights function, for example) directly to the likelihood function, and not treat them as parameters to be estimated. To find the optimal weights from the DCweights function I did what Dixon and Coles did in their paper: Emulate a real prediction situation where you try to predict the next match using the data available up to that point in time, and evaluate the prediction accuracy for different values of the dumping factor xi.

What would I pass into dates in the DCWeights function? I am new to R.

You should pass a vector of the Date class. You would typically have a character vector with dates, and you can convert this to a proper date vector with the as.Date() function.

Sorry, could you show how to do this in R? I am not able to do so.

Here is a quick example:

# Character vector with dates

my_dates <- c('2005-08-13', '2013-09-02') as.Date(my_dates)

Hi, I tried that but it is giving me the following error: Error in `-.Date`(dates, as.Date(currentDate)) :

can only subtract from “Date” objects

> Below is my current code.

my_dates <- c('2016-08-20', '2016-06-11')

as.Date(my_dates)

myweights <- DCweights(my_dates, xi = 0.0018)

return(

DClogLikWeighted(y1 = DCm$homeGoals, y2 = DCm$awayGoals, lambda, mu, rho.p, myweights) * -1

)

You need to pass the output from the as.Date function to a variable, or inside the function call argument. It could be like this:

my_dates <- c('2016-08-20', '2016-06-11') myweights <- DCweights(as.Date(my_dates), xi = 0.0018)

Hello,

First of all, I apologise for the late comment. Secondly, your posts have helped me a lot and I have made a similar model in Excel (I know Excel isn’t the best tool for this kind of stuff … however, I’m doing my A-Levels and it is what was easiest at the time since I have no coding experience/knowledge).

Could you clarify how exactly you calculate ? please? My spreadsheet is set up with a few seasons of results with corresponding dates but I am not too sure how I can find the value for ?.

Sorry if I sound stupid here, I only have basic knowledge compared to you guys as I start university later this year.

Thanks

Could

Im sorry, but it looks like some of your comment is missing. What is it that you don’t know how to calculate? The xi? The ranked probability score?

Hello,

I realised I added in a random word at the bottom somehow. Sorry for the confusion there!

The part I am struggling with is determining a value for ? in the following formula:

?(t)=exp(??t)

For my t I’m using says since last match. But as I’m working in Excel I can’t figure out how to get a value for ?.

Hope this clears it up, and thank you for your response.

The way I do it is for each match day, I fit the model for all available previous data, and predict the results on the games on that day. I do this for a couple of seasons, say. I do thins for several values of xi, and select the one that gives best predictions. This can take a few hours when I do it in R.

Hello,

Sorry for my late reply.

Thank you very much – that’s cleared everything up for me!

You’re welcome!

Hi, sorry but i don’t understand how you calculate the best value for xi. I try with doing a cicle where for every value of xi from 0.0001 to 0.02 (increasing by 0.0001), i get the estimates from the model, then i calculate the value of the PLL, but the PLL is always decreasing. Any suggestion? there is the code:

#function to create the thetea Home, draw, away

genThetadata$awayGoals]=1

if(t==”Draw”)

theta[data$homeGoals==data$awayGoals]=1

if(t==”Away”)

theta[data$homeGoals<data$awayGoals]=1

return(theta)

}

#function for estimate the probabilties

#p_match result return a vector with the pWin, pDraw, pLoss

genP<-function(data,mle)

{

k=length(data$homeGoals)

p=matrix(rep(1,(k*3)),nrow=k)

for(i in 1:k){

gen.p=p_MatchResult(mle,data,data$homeTeams[i],data$awayTeams[i])

p[i,1]=gen.p[1]

p[i,2]=gen.p[2]

p[i,3]=gen.p[3]

}

return(p)

}

S<-function(thetaHome,thetaDraw,thetaAway,p)

{

p.H=p[,1]

index=which(p.H==0)

p.H[index]=1

p.D=p[,2]

index=which(p.D==0)

p.D[index]=1

p.A=p[,3]

index=which(p.A==0)

p.A[index]=1

return(sum(thetaHome*log(p.H)+thetaDraw*log(p.D)+thetaAway*log(p.A)))

}

phi<-function(eta,data,ref.date=Sys.Date())

{

t=(as.numeric(ref.date-data$Date))

weight=exp(-1*eta*t)

weight[t<0]=0

return(weight)

}

….

if(weight)

{

s.val=NULL

xi=0

e=0.00001

tH=genTheta(data,"Home")

tD=genTheta(data,"Draw")

tA=genTheta(data,"Away")

while(es.val))

{

xi=e

s.val=s.val2

}

e=e+0.0001

}

optim=nlminb(par,logLik,data=data,phi=phi(xi,data))

hess=hessian(func=logLik,x=par,data=data,phi=phi(xi,data))

}

….

It will be very helpfull for me, i’m a student and i’m writing about this topic on my final review. Hope for reply, thank you.

Sorry there were an error in the code i posted before…

this is the main part: i maximezed the loglik function (it is sum(phi* logLikDixonColes) ) and the i calculate the S value (the PPL)

….

if(weight)

{

s.val=NULL

eta=0

e=0.00001

tH=genTheta(data,”Home”)

tD=genTheta(data,”Draw”)

tA=genTheta(data,”Away”)

while(es.val))

{

eta=e

s.val=s.val2

}

e=e+0.0001

}

optim=nlminb(par,logLik,data=data,phi=phi(eta,data))

hess=hessian(func=logLik,x=par,data=data,phi=phi(eta,data))

}

….

Sorry again when i post the comment it cut the main part of the code….

If the PPL is always decreasing it is probably because the optimal xi is greater than the greatest value you are trying.

Hi, first of all thanks for your work!

Can you show how you calculate the xi optim value?

I tried wit doing 2 loop: One that just change the value of xi and in the inner loop I optimezd the model for each match day and calculate the probability of outcome, then with all the probability I calculate the value of the PPL.

But the series of PPL that I get is not regular, it is first increasing then decresing and again increasing and decreasing and so for all the range of value of xi I tried (about 0.0001 to 0.02).

Am I missing something?

They way you describe your procedure seems right, so my guess is that you have a introduced a bug somewhere, where you perhaps overwrite some results, or you index something wrong or something.

I checked my code but i’m still blocked. Is there any possibility you can show your code, please? it would be very helpfull

I don’t think that would be very helpful. There are so many possible way of doing this, the data could be coded in different ways, and so on, so code for doing this would not be very informative, I think.

Hey, was just wondering if the inclusion of the time weighting would influence the parameter calculations of if it is just in use for the predictions?

The time weights are only used in the parameter estimation. They are not used when calculating the predictions.

When I change the value of xi it doesn’t change the value of my parameter estimations? I’ve used your code

tau <- Vectorize(function(x, y, lambda, mu, rho){

if (x == 0 & y == 0){return(1 – (lambda*mu*rho))

} else if (x == 0 & y == 1){return(1 + (lambda*rho))

} else if (x == 1 & y == 0){return(1 + (mu*rho))

} else if (x == 1 & y == 1){return(1 – rho)

} else {return(1)}

})

DCweights <- function(prem15, currentDate, xi){

datediffs<-difftime(strptime(prem15$Date, format ="%d/%m/%y"), strptime(currentDate, format="%d/%m/%y"), units="days")

datediffs<-as.numeric(datediffs*-1)

w <- exp(-1*xi*datediffs)

w[datediffs <= 0] <- 0 #Future dates should have zero weights

return(w)

}

DClogLikWeighted <- function(x, y, lambda, mu, rho=0.06, weights=weights){

loglik<-sum(log(tau(x, y, lambda, mu, rho)) + log(dpois(x, lambda)) + log(dpois(y, mu)))

if(is.null(weights)){

return(sum(loglik))

}else{

return(sum(loglik*weights))

}

}

DCoptimFn <- function(params, DCm){

home.p <- params[1]

rho.p <- params[2]

nteams <- length(DCm$teams)

attack.p <- matrix(params[3:(nteams+2)], ncol=1)

defence.p <- matrix(params[(nteams+3):length(params)], ncol=1)

lambda <- exp(DCm$homeTeamDM %*% attack.p + DCm$awayTeamDM %*% defence.p + home.p)

mu <- exp(DCm$awayTeamDM %*% attack.p + DCm$homeTeamDM %*% defence.p)

return(

DClogLikWeighted(x=DCm$homeGoals, y=DCm$awayGoals, lambda, mu, rho.p, weights=weights) * -1

)

}

DCattackConstr <- function(params, DCm, …){

nteams <- length(DCm$teams)

attack.p <- matrix(params[3:(nteams+2)], ncol=1)

return((sum(attack.p) / nteams) – 1)

}

date<-"25/05/15"

weights<-DCweights(prem15, date, xi=0.5)

prem15DC<-DCmodelData(prem15)

#initial parameter estimates – make sure the length is the same prem thing

attack.p <- rep(0.01, length(unique(prem15DC$teams)))

defence.p <- rep(-0.08, length(unique(prem15DC$teams)))

home.p <- 0.06

rho.init <- 0.06

par.inits <- c(home.param, rho.init, attack.p, defence.p)

names(par.inits) <- c('HOME', 'RHO', paste('Attack', prem15DC$teams, sep='.'), paste('Defence', prem15DC$teams, sep='.'))

test<-auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=prem15DC)

Where prem15 is the 2014/15 season data, if I change the value of xi the parameters are the same?

What I mean is that when you multiply by the weights in the DCLogLikWeighted function, yes the resulting value will be lower in the Optim Function, but surely the same values for the paremeters will minimise this too?

The parameters should change when you change xi. In your code you have a large value of xi (0.5), so small changes is that might not change the parameter estimates much when you have data from only one season.

Apologies, the xi in there is incorrect to what I used, I have tried a loop from 0.01 to 0.06 using the code

date<-"25/05/15"

for(i in seq(0.001,0.04, 0.001))

{

weights15<-DCweights(prem15, date, xi=i)

test<-auglag(par=par.inits, fn=DCoptimFn, heq=DCattackConstr, DCm=prem15DC)

ActualResultsProbs(prem15) #This produces the value for the pll

}

and there are no changes to the parameters, they all produce for example 1.40 for Arsenal using the full 2014/15 season, simply outputting smaller fvals of the output of the auglag.

Any idea why it doesn't change?

It seems like you don’t actually pass weights15 to DCoptimfun via auglag.

I don’t understand why the output of the optim function would change as the weights change though? For example, when I use xi=0.001 I get an output once maximised of 216100, but when I use xi=0.005 I get 241200. However these use the same values for the parameters.

Im not sure i follow you. Do you think it is strange that a weighted sum changes as the weights change?