# The Dixon-Coles model, part 4: A trick to speed up estimation

In the previous installments in this series on implementing the Dixon-Coles model I complained a bit about the time it took to estimate the parameters. In the original implementation in part 1 it took about 40 seconds. Now, 40 seconds is not much to complain about; there are a whole lot of other models and algorithms that take much, much longer to fit (for my master’s I had some computations that took several months). Still, I wanted to make a few improvements.

The approach I described in part 3 is quite acceptable, I think, especially since it takes less than a second to fit the model. But still, I wanted to make some improvements to my original implementation.

There are several reasons for the estimation procedure being slow. I used a general purpose optimizer instead of a tailor-made algorithm, and I didn’t provide the optimizer with the derivative of the model likelihood function, nor of the function defining the constraint. This means that the optimizer has to estimate the derivatives by doing a lot of evaluations of the two functions with slight changes in the parameters. The most important speed bump, however, is probably due to how I implemented the constraint that the average of the attack parameters should equal 1.

The alabama package I used relied on a technique called Lagrange multipliers, which is a very general method for constrained optimization. Instead of relying on general constrained optimization procedures, there is a trick commonly used in linear models with sum-to-zero constrained categorical parameters that we also can use here.

There has been some discussion and confusion in the comments about how categorical variables are coded and how R presents the results of the glm function. A thorough discussion of this is best left for another time, but let me explain how the sum-to-zero constraint is implemented in linear models. We will fit the model with this constraint and then make some adjustments later on to get the correct average-is-one constraint.

The sum-to-zero constraint basically says that the sum of all the parameters for a categorical variable must equal zero:

$$\sum_{i=1} \theta_i = 0$$

If we for example have three levels, we can write out the equation like this:

$$\theta_1 + \theta_2 + \theta_3 = 0$$

If we subtract $$\theta_3$$ from both sides and then multiply both sides of the equation by minus 1 we get

$$-\theta_1 - \theta_2 = \theta_3$$

Notice how we can write one of the parameters as a simple linear function of the other parameters. We can use this result to construct the design matrix for the categorical variable, incorporating the sum-to-zero constraint (exactly which parameter or level we choose to be a function of the others doesn’t matter, the end result does not differ). Suppose we have the following observations of a three-level categorical variable:

$$\begin{bmatrix} A & A & B & B & C & C \end{bmatrix}^T$$

We can then construct the following design matrix:

$$\begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ -1 & -1 \\ -1 & -1 \end{bmatrix}$$

Notice that we only need two columns (i.e. two variables) to encode the three levels. Since the last parameter is a function of the two others it is redundant. Also notice how the observations in the two last rows, corresponding to the $$C$$ observations, will influence the estimation of all the other parameters for this variable. When the two parameters are estimated, the last parameter can be obtained using the result from above relating the last parameter to all the others.
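As a small illustration (my own sketch, not code from the original implementation), the design matrix above can be built by hand in R and compared with what R's built-in `contr.sum` contrasts produce. The variable names `x`, `dm` and `cs` are just made up for this example.

```r
# Observations of a three-level categorical variable, as in the text.
x <- factor(c("A", "A", "B", "B", "C", "C"))

# Start from the ordinary indicator matrix, one column per level.
dm <- model.matrix(~ x - 1)

# Rows belonging to the last level get -1 in every column ...
dm[x == "C", ] <- -1

# ... and the column for the last level is dropped.
dm <- dm[, 1:2]
print(dm)

# R's sum-to-zero contrasts give the same two columns
# (column 1 of cs is the intercept, so it is skipped here).
cs <- model.matrix(~ x, contrasts.arg = list(x = "contr.sum"))
same <- all(dm == cs[, 2:3])
print(same)
```

Building the matrix by hand, as the rest of this post does, makes it easy to give the attack and defence parameters different encodings.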

In the Dixon-Coles paper they used the constraint that the average of the attack parameters should be 1. This is not quite the same as the sum-to-zero constraint, but for prediction, it does not matter exactly which constraint we use. Anyway, I will explain later how we can fix this.
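One way to convince yourself that the choice of constraint does not matter for prediction is a quick numerical check. This is only a sketch with made-up parameter values: it assumes, as in the rest of this series, that the expected number of home goals is the exponential of attack plus defence plus home advantage, and it shifts the attack parameters up by one and the defence parameters down by one.

```r
# Hypothetical parameter values for three teams, for illustration only.
attack  <- c(0.3, -0.1, -0.2)   # sums to zero
defence <- c(-0.9, -0.7, -1.1)
home    <- 0.25

# Expected home goals for team 1 at home against team 2.
lambda.zero <- exp(attack[1] + defence[2] + home)

# Shift to a parameterisation where the attack parameters average one.
attack.one  <- attack + 1
defence.one <- defence - 1
lambda.one  <- exp(attack.one[1] + defence.one[2] + home)

# The shift cancels inside the exponent, so the prediction is unchanged.
print(all.equal(lambda.zero, lambda.one))
print(mean(attack.one))
```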

To use this trick in the Dixon-Coles implementation we need to make the following changes to our code from part 1. Obviously the first thing we need to change is how the design matrices in the DCmodelData function are computed. We need four matrices now, since the number of parameters estimated directly is different for the attack and defense parameters. Notice how I chose the team that appears last in the team.names vector. The teams get sorted alphabetically, so for the 2011-12 Premier League data this is Wolves.

DCmodelData <- function(df){

team.names <- unique(c(levels(df$HomeTeam), levels(df$AwayTeam)))

# attack, with sum-to-zero constraint
## home
hm.a <- model.matrix(~ HomeTeam - 1, data=df)
hm.a[df$HomeTeam == team.names[length(team.names)], ] <- -1
hm.a <- hm.a[,1:(length(team.names)-1)]

## away
am.a <- model.matrix(~ AwayTeam -1, data=df)
am.a[df$AwayTeam == team.names[length(team.names)], ] <- -1
am.a <- am.a[,1:(length(team.names)-1)]

# defence, same as before
hm.d <- model.matrix(~ HomeTeam - 1, data=df)
am.d <- model.matrix(~ AwayTeam -1, data=df)

return(list(homeTeamDMa=hm.a, homeTeamDMd=hm.d,
awayTeamDMa=am.a, awayTeamDMd=am.d,
homeGoals=df$FTHG, awayGoals=df$FTAG,
teams=team.names))
}


Some changes to the DCoptimFn function are also needed, so it properly handles the changes we made to the design matrices.

# I don't bother showing the rest of the function
nteams <- length(DCm$teams)
attack.p <- matrix(params[3:(nteams+1)], ncol=1) #one column less
defence.p <- matrix(params[(nteams+2):length(params)], ncol=1)

# need to multiply with the correct matrices
lambda <- exp(DCm$homeTeamDMa %*% attack.p + DCm$awayTeamDMd %*% defence.p + home.p)
mu <- exp(DCm$awayTeamDMa %*% attack.p + DCm$homeTeamDMd %*% defence.p)

We also need to make the appropriate adjustments to the vectors with the initial parameter values, so that they have the correct lengths.

dcm <- DCmodelData(data)
nteams <- length(dcm$teams)

#initial parameter estimates
attack.params <- rep(.1, times=nteams-1) # one less parameter
defence.params <- rep(-0.8, times=nteams)
home.param <- 0.06
rho.init <- 0.03
par.inits <- c(home.param, rho.init, attack.params, defence.params)

#informative names
#skip the last team
names(par.inits) <- c('HOME', 'RHO',
paste('Attack', dcm$teams[1:(nteams-1)], sep='.'), paste('Defence', dcm$teams, sep='.'))


With these changes we can simply use the built-in optim function in R. There is no need for the DCattackConstr function anymore, or a third party package, since we built the constraint right into the design matrices.

res <- optim(par=par.inits, fn=DCoptimFn, DCm=dcm, method='BFGS')
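To show the whole idea on a problem small enough to check by hand, here is a sketch that is not part of the Dixon-Coles code. It fits a simple model with one overall mean and one sum-to-zero effect per group by least squares, using plain `optim` and a design matrix with the constraint built in. The data and the names `g`, `y`, `sse` and `fit` are made up for the example.

```r
# Toy data: two observations for each of three groups (hypothetical values).
g <- factor(c("A", "A", "B", "B", "C", "C"))
y <- c(1, 3, 4, 6, 8, 8)

# Design matrix with the sum-to-zero constraint built in:
# rows for the last level are set to -1 and the last column is dropped.
dm <- model.matrix(~ g - 1)
dm[g == "C", ] <- -1
dm <- dm[, 1:2]

# Sum of squared errors for the model y = mu + theta_group.
sse <- function(params, dm, y){
  mu <- params[1]
  theta <- matrix(params[2:3], ncol=1)
  sum((y - (mu + dm %*% theta))^2)
}

# Unconstrained optimisation; no constraint function is needed.
fit <- optim(par=c(0, 0, 0), fn=sse, dm=dm, y=y, method='BFGS')

# The effect for the last group is minus the sum of the others.
theta <- c(fit$par[2:3], -sum(fit$par[2:3]))
names(theta) <- levels(g)
print(round(fit$par[1], 3))
print(round(theta, 3))
```

With balanced groups like these, the estimates should land close to the grand mean (5) and the group means minus the grand mean (-3, 0 and 3).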


This takes about 6-7 seconds on my laptop, a decent improvement over the 40 seconds it took before. If you take a look at the resulting parameter estimates in res$par you will see that the attack parameter for Wolves is missing. As I explained earlier, this parameter is easy to find. It is also easy to correct all the parameter estimates so that they become the same as if we had a mean-is-one constraint on the attack parameters. This is done by increasing the attack parameters by one, and decreasing the defense parameters by one. The reason it is that simple is that the sum-to-zero constraint is equivalent to a mean-is-zero constraint.

parameters <- res$par

#compute Wolves attack parameter
missing.attack <- sum(parameters[3:(nteams+1)]) * -1

#put it in the parameters vector
parameters <- c(parameters[1:(nteams+1)], missing.attack, parameters[(nteams+2):length(parameters)])
names(parameters)[nteams+2] <- paste('Attack.', dcm$teams[nteams], sep='')

#increase attack by one
parameters[3:(nteams+2)] <- parameters[3:(nteams+2)] + 1

#decrease defence by one
parameters[(nteams+3):length(parameters)] <- parameters[(nteams+3):length(parameters)] - 1

# The R code for the home field advantage and traveling distance analysis.

I was asked in the comments on my post Does traveling distance influence home field advantage? to provide the R code I used, because Klemens of the rationalsoccer blog wanted to do the analysis on some of his own data. I have refactored it a bit to make it easier to use.

First load the data with the coordinates I posted last year.

dta.stadiums <- read.csv('stadiums.csv')

I also assume you have data formatted like the data from football-data.co.uk in a data frame called dta.matches.

First we need a way to calculate the distance (in kilometers) between two coordinates. This is a function that does that.

coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
  #Calculates the distance between two WGS84 coordinates.
  #
  #http://en.wikipedia.org/wiki/Haversine_formula
  #http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
  dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
  dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
  h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
  #cap sqrt(h) at 1 before taking asin, to guard against rounding errors
  c <- 2 * asin(pmin(1, sqrt(h)))
  d <- radius * c
  return(d)
}

Next, we need to find the coordinates where each match is played, and the coordinates for where the visiting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.

coord.home <- dta.stadiums[match(dta.matches$HomeTeam, dta.stadiums$FDCOUK), c('Latitude', 'Longitude')]
coord.away <- dta.stadiums[match(dta.matches$AwayTeam, dta.stadiums$FDCOUK), c('Latitude', 'Longitude')]

dta.matches$Distance <- coordinate.distance(coord.home$Latitude, coord.home$Longitude,
coord.away$Latitude, coord.away$Longitude)
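A quick way to sanity check a distance function like this is to try it on coordinates where the answer is known. The sketch below uses a compact, self-contained restatement of the haversine formula under the hypothetical name `haversine.km`, so it can be run on its own; the coordinates are made up for the test.

```r
# Hypothetical helper: a compact version of the haversine formula,
# with the same default Earth radius (in km) as used above.
haversine.km <- function(lat1, long1, lat2, long2, radius=6371){
  to.rad <- pi / 180
  dlat <- (lat2 - lat1) * to.rad
  dlong <- (long2 - long1) * to.rad
  h <- sin(dlat / 2)^2 + cos(lat1 * to.rad) * cos(lat2 * to.rad) * sin(dlong / 2)^2
  2 * radius * asin(pmin(1, sqrt(h)))
}

# A point has zero distance to itself.
d.same <- haversine.km(53.4, -2.2, 53.4, -2.2)

# Two points on the equator, 90 degrees of longitude apart, are a quarter
# of the circumference from each other.
d.quarter <- haversine.km(0, 0, 0, 90)

# The function is vectorised, so whole columns can be passed at once.
d.vec <- haversine.km(c(0, 0), c(0, 0), c(0, 0), c(90, 180))

print(d.same)
print(d.quarter)
print(6371 * pi / 2)
```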


Here are two functions that are needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the home field advantage per match.

avgerage.gd <- function(dta){
#Calculates the average goal difference for each team.

all.teams <- unique(c(levels(dta$HomeTeam), levels(dta$AwayTeam)))
average.goal.diff <- numeric(length(all.teams))
names(average.goal.diff) <- all.teams
for (t in all.teams){
idxh <- which(dta$HomeTeam == t)
goals.for.home <- dta[idxh, 'FTHG']
goals.against.home <- dta[idxh, 'FTAG']

idxa <- which(dta$AwayTeam == t)
goals.for.away <- dta[idxa, 'FTAG']
goals.against.away <- dta[idxa, 'FTHG']

n.matches <- length(idxh) + length(idxa)
total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) - sum(goals.against.home) - sum(goals.against.away)

average.goal.diff[t] <- total.goal.difference / n.matches
}
return(average.goal.diff)
}

matchwise.hfa <- function(dta, avg.goaldiff){
#Calculates the matchwise home field advantage based on the average goal
#difference for each team.

n.matches <- dim(dta)[1]
hfa <- numeric(n.matches)
for (idx in 1:n.matches){
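#look up by team name; as.character avoids indexing by the factor's integer codes
hometeam.avg <- avg.goaldiff[as.character(dta[idx,'HomeTeam'])]
awayteam.avg <- avg.goaldiff[as.character(dta[idx,'AwayTeam'])]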
expected.goal.diff <- hometeam.avg - awayteam.avg
observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG']
hfa[idx] <- observed.goal.diff - expected.goal.diff
}
return(hfa)
}
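To make the calculation concrete, here is a small self-contained sketch of the same idea on made-up data. It does not call the two functions above; it is a vectorised restatement using a hypothetical three-team league where each pair meets home and away, so the numbers are easy to verify by hand.

```r
# Toy data: three hypothetical teams, each pair meeting home and away.
toy <- data.frame(
  HomeTeam = c("A", "B", "A", "C", "B", "C"),
  AwayTeam = c("B", "A", "C", "A", "C", "B"),
  FTHG = c(2, 1, 3, 0, 1, 2),
  FTAG = c(0, 1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

teams <- sort(unique(c(toy$HomeTeam, toy$AwayTeam)))

# Average goal difference per team over all its matches.
avg.gd <- sapply(teams, function(tm){
  gd.home <- toy$FTHG[toy$HomeTeam == tm] - toy$FTAG[toy$HomeTeam == tm]
  gd.away <- toy$FTAG[toy$AwayTeam == tm] - toy$FTHG[toy$AwayTeam == tm]
  mean(c(gd.home, gd.away))
})

# Home field advantage per match: observed minus expected goal difference.
expected <- avg.gd[toy$HomeTeam] - avg.gd[toy$AwayTeam]
observed <- toy$FTHG - toy$FTAG
hfa <- unname(observed - expected)

print(avg.gd)
print(hfa)
```

In this toy league team A averages +1.5 goals per match and team B -0.5, so A is expected to beat B by two goals. A's 2-0 home win therefore gives a matchwise home field advantage of zero, while B's 1-1 home draw against A gives +2.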


In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (puh!). The home field advantage is put into the new column HFA.

dta.matches$HFA <- numeric(dim(dta.matches)[1])
seasons <- levels(dta.matches$Season)

for (i in 1:length(seasons)){
  season.l <- dta.matches$Season == seasons[i]
  h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
  dta.matches$HFA[season.l] <- h
}


At last we can do the linear regression and make a nice little plot.

m <- lm(HFA ~ Distance, data=dta.matches)
summary(m)

plot(dta.matches$Distance, dta.matches$HFA, xlab='Distance (km)', ylab='Difference from expected goals', main='Home field advantage vs traveling distance')
abline(m, col='red')


# ‘Synonymous’ factor levels in R

When I work with data from different sources, they are often inconsistent in the way they specify categorical variables. One example is country names. There are many ways the name of a country can be specified, and even if there are international standards, different organizations like to do it their way. North Korea, for example, may sometimes be written as just ‘North Korea’, but other sources may call it ‘Korea DPR’.

This of course leads to complications when we want to combine data from different sources. What could be a trivial lookup in two different dataframes in R becomes a real hassle. One solution I have come up with is to make a .csv file with different names from different sources, and then load it into R and use it to ‘translate’ the factor levels from one source to the way the levels are represented in the other. Based on a method for renaming levels with regular expressions from Winston Chang’s Cookbook for R, I made a function for renaming several levels in a dataframe at once. The part about using a .csv file is not the important thing here, it is just a more convenient way of storing the information needed.

The function takes four arguments. dat is a dataframe that contains the factors that are to be renamed. vars is the variables to rename. from and to specify what to rename from and what to rename to. The function returns a dataframe.

renameLevels <- function(dat, vars, from, to){
  for (v in vars){
    ptrns <- paste("^", from, "$", sep="")
    for (lvl in 1:length(ptrns)){
      levels(dat[, v]) <- sub(ptrns[lvl], to[lvl], levels(dat[, v]))
    }
  }
  return(dat)
}

A small example:

#data to be translated
var <- factor(c("b", "a", "c", "a", "d", "a", "e", "b"))
var2 <- factor(c("b", "b", "b", "b", "b", "a", "e", "b"))
data <- data.frame(var, var2)

#> data
#  var var2
#1   b    b
#2   a    b
#3   c    b
#4   a    b
#5   d    b
#6   a    a
#7   e    e
#8   b    b

#translate from roman to greek letters
roman <- c("a", "b", "c", "d", "e")
greek <- c("alpha", "beta", "gamma", "delta", "epsilon")

data2 <- renameLevels(data, c("var", "var2"), roman, greek)

#> data2
#      var    var2
#1    beta    beta
#2   alpha    beta
#3   gamma    beta
#4   alpha    beta
#5   delta    beta
#6   alpha   alpha
#7 epsilon epsilon
#8    beta    beta

# Least squares rating of football teams

The Wikipedia article Statistical association football predictions mentions a method for least squares rating of football teams. The article does not give any source for this, but I found what I think may be the origin of this method. It appears to be from an undergrad thesis titled Statistical Models Applied to the Rating of Sports Teams by Kenneth Massey. It is not on football in particular, but on sports in general where two teams compete for points. A link to the thesis can be found here.

The basic method as described in Massey’s paper and the Wikipedia article is to use an n*k design matrix A where each of the k columns represents one team, and each of the n rows represents a match. In each match (or row) the home team is indicated by 1, and the away team by -1. Then we have a vector y indicating goal differences in each match, with respect to the home team (i.e. positive values for home wins, negative for away wins). Then the least squares solution to the system Ax = y is found, with the x vector now containing the rating values for each team.
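The construction can be sketched in a few lines of R. This is my own minimal illustration on made-up data, not the code used for the results in this post. One assumption to flag: the columns of A sum to zero, so the system has no unique solution on its own, and here I pin the ratings down by appending a row of ones that forces them to sum to zero. The names `toy`, `A.c` and `y.c` are hypothetical.

```r
# Toy data: three hypothetical teams, each pair meeting home and away.
toy <- data.frame(
  HomeTeam = c("A", "B", "A", "C", "B", "C"),
  AwayTeam = c("B", "A", "C", "A", "C", "B"),
  FTHG = c(2, 1, 3, 0, 1, 2),
  FTAG = c(0, 1, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

teams <- sort(unique(c(toy$HomeTeam, toy$AwayTeam)))
n <- nrow(toy)
k <- length(teams)

# Design matrix: 1 for the home team and -1 for the away team in each match.
A <- matrix(0, nrow=n, ncol=k, dimnames=list(NULL, teams))
A[cbind(1:n, match(toy$HomeTeam, teams))] <- 1
A[cbind(1:n, match(toy$AwayTeam, teams))] <- -1

# Goal difference with respect to the home team.
y <- toy$FTHG - toy$FTAG

# Extra row of ones with target 0, so that the ratings sum to zero
# (an assumption made here to get a unique solution).
A.c <- rbind(A, rep(1, k))
y.c <- c(y, 0)

# Least squares solution of the overdetermined system.
ratings <- qr.solve(A.c, y.c)
names(ratings) <- teams
print(round(ratings, 3))
```

In a balanced schedule like this one the rating works out to each team's total goal difference divided by a constant, so team A, with a total goal difference of +6 over its four matches, gets a rating of 1.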
When it comes to interpretation, the difference in least squares estimate for the rating of two teams can be seen as the expected goal difference between the teams in a game. The individual rating can be seen as how many goals a team scores compared to the overall average.

Massey’s paper also discusses some extensions to this simple model that are not mentioned in the Wikipedia article. The most obvious is incorporation of home field advantage, but there is also a section on splitting the teams’ performances into offensive and defensive components. I am not going to go into these extensions here, you can read more about them in Massey’s paper, along with some other rating systems that are also discussed. What I will do, is to take a closer look at the simple least squares rating and compare it to the ordinary three points for a win rating used to determine the league winner.

I used the function I made earlier to compute the points for the 2011-2012 Premier League season, then I computed the least squares rating. Here you can see the result:

            PTS    LSR LSRrank RankDiff
Man City     89  1.600       1        0
Man United   89  1.400       2        0
Arsenal      70  0.625       3        0
Tottenham    69  0.625       4        0
Newcastle    65  0.125       8        3
Chelsea      64  0.475       5       -1
Everton      56  0.250       6       -1
Liverpool    52  0.175       7       -1
Fulham       52 -0.075      10        1
West Brom    47 -0.175      12        2
Swansea      47 -0.175      11        0
Norwich      47 -0.350      13        1
Sunderland   45 -0.025       9       -4
Stoke        45 -0.425      15        1
Wigan        43 -0.500      16        1
Aston Villa  38 -0.400      14       -2
QPR          37 -0.575      17        0
Bolton       36 -0.775      19        1
Blackburn    31 -0.750      18       -1
Wolves       25 -1.050      20        0

It looks like the least squares approach gives similar results to the standard points system. It differentiates between the two top teams, Manchester City and Manchester United, even if they have the same number of points. This is perhaps not so surprising since City won the league because of greater goal difference than United, and this is what the least squares rating is based on.
Another, perhaps more surprising thing is how relatively low a least squares rating Newcastle has, compared to the other teams with approximately the same number of points. If ranked according to the least squares rating, Newcastle should have been below Liverpool, instead they are three places above. This hints at Newcastle being better at winning, but with few goals, and Liverpool winning fewer times, but when they win, they win with more goals. We can also see that Sunderland stands out, with a difference of four places between the two rankings: they are 13th on points but 9th by least squares rating.

If we now plot the number of points against the least squares rating we see that the two methods generally give similar results. This is perhaps not so surprising, and despite some disparities like the ones I pointed out, there are no obvious outliers. I also calculated the correlation coefficient, 0.978, and I was actually a bit surprised by how high it was.

# Looking at monthly distribution of births in Norway

A news story earlier this week reported an increased number of births during the summer months in Norway. According to the story the peak in births used to be in the spring months, nine months after summer vacation, but is now during the summer. The midwives think this change is because of the rules for granting a place in preschool day care. Children born before September 1st are legally entitled to a place in day care.

Anyway, I decided to try to visualize this. I found some data at the Statistics Norway website, loaded it into R, cleaned it, restructured it etc. and made this animation with ggplot2 showing the monthly distribution of births from year 2000 to 2011. I decided to include data for the years before 2005 since that is when the current left wing coalition took office and they had a program for universal access to day care.

It is hard to spot a definite trend, but the graph for 2011 shows a clear peak in the summer months. It will be interesting to see if this becomes clearer the next couple of years.
Also, if this becomes a continuing trend, it would be interesting to look at surveys on family planning and see if there has been more of it the last couple of years.

The birthIndex on the y-axis is not the precise number of births for a given month, but is corrected for the number of days in the month. This makes the different months comparable.

# R functions for soccer league tables and result matrix

Here are three R functions I wrote to calculate ranking tables in soccer leagues based on the results of played matches. The functions are made for ordinary leagues where each team plays every other team twice, one time at the home field, the other at the opposing team's home field, but the match.results() and league.table() functions can be used on more general data.

The first function, match.results(), just computes the outcome of a match (Home, Draw or Away, i.e. “H”, “D” or “A”) based on the number of goals scored, and is used by the other two functions.

> res <- match.results(c(1,2,1,2,3,1,0,5), c(0,1,2,0,3,0,4,0))
> res
[1] "H" "H" "A" "H" "D" "H" "A" "H"

The league.table() function returns a data.frame with some statistics for each team, such as number of wins, draws and losses (for both home and away games), goals, goal difference etc. As input it takes vectors with the name of the home team, away team, goals scored by the home team and goals scored by the away team.

Three points are given for a win, one point for a draw, and zero points for a loss, as is used in most leagues. If you want to compute an alternative table with a different point scheme you can just change the first three variables in the function body. The teams are ranked by the number of points awarded, but if two or more teams have the same number of points, they are ranked by goal difference. If the goal difference is also equal, number of goals scored is used.
#load data from football-data.co.uk
matchdata <- read.csv("premierLeague2011-11.csv")
attach(matchdata)

league.table(HomeTeam, AwayTeam, FTHG, FTAG)

            PLD HW HD HL AW AD AL GF GA  GD PTS
Man United   38 18  1  0  5 10  4 78 37  41  80
Chelsea      38 14  3  2  7  5  7 69 33  36  71
Man City     38 13  4  2  8  4  7 60 33  27  71
Arsenal      38 11  4  4  8  7  4 72 43  29  68
Tottenham    38  9  9  1  7  5  7 55 46   9  62
Liverpool    38 12  4  3  5  3 11 59 44  15  58
Everton      38  9  7  3  4  8  7 51 45   6  54
Fulham       38  8  7  4  3  9  7 49 43   6  49
Aston Villa  38  8  7  4  4  5 10 48 59 -11  48
Sunderland   38  7  5  7  5  6  8 45 56 -11  47
West Brom    38  8  6  5  4  5 10 56 71 -15  47
Newcastle    38  6  8  5  5  5  9 56 57  -1  46
Stoke        38 10  4  5  3  3 13 46 48  -2  46
Bolton       38 10  5  4  2  5 12 52 56  -4  46
Blackburn    38  7  7  5  4  3 12 46 59 -13  43
Wigan        38  5  8  6  4  7  8 40 61 -21  42
Wolves       38  8  4  7  3  3 13 46 66 -20  40
Birmingham   38  6  8  5  2  7 10 37 58 -21  39
Blackpool    38  5  5  9  5  4 10 55 78 -23  39
West Ham     38  5  5  9  2  7 10 43 70 -27  33

The last function is result.matrix(), which returns a matrix with the match results, with home teams on the rows, and away teams on the columns. The cell contents can be formatted in three different ways using the format argument. By default this is set to “score” which gives output like “2 - 1”. “HDA” gives either “A”, “D” or “H”. “difference” gives the goal difference. The diagonal consists of “NA”s.

#only the five first rows and columns to save space
result.matrix(matchdata$HomeTeam, matchdata$AwayTeam, matchdata$FTHG, matchdata$FTAG, format="score")[1:5,1:5]

            Arsenal Aston Villa Birmingham Blackburn Blackpool
Arsenal     NA      "1 - 2"     "2 - 1"    "0 - 0"   "6 - 0"
Aston Villa "2 - 4" NA          "0 - 0"    "4 - 1"   "3 - 2"
Birmingham  "0 - 3" "1 - 1"     NA         "2 - 1"   "2 - 0"
Blackburn   "1 - 2" "2 - 0"     "1 - 1"    NA        "2 - 2"
Blackpool   "1 - 3" "1 - 1"     "1 - 2"    "1 - 2"   NA

And here is the code for the three functions.

match.results <- function(homeGoals, awayGoals){
  #Determines the match outcome (H, D or A) based on goals scored by home and away teams.
  home <- homeGoals > awayGoals
  away <- awayGoals > homeGoals
  draws <- homeGoals == awayGoals

  results <- character(length(homeGoals))
  results[draws] <- "D"
  results[home] <- "H"
  results[away] <- "A"

  return(results)
}

league.table <- function(homeTeam, awayTeam, homeGoals, awayGoals){

  #points awarded for a match outcome
  winPts <- 3
  drawPts <- 1
  loosePts <- 0

  if (length(unique(sapply(list(homeTeam, awayTeam, homeGoals, awayGoals), length))) != 1 ){
    warning("input vectors not of same length.")
  }

  numMatches <- length(homeTeam)

  teams <- levels(factor(c(as.character(homeTeam), as.character(awayTeam))))
  numTeams <- length(teams)

  #vector with outcome of a match (H, D or A)
  results <- match.results(homeGoals, awayGoals)

  #for output
  homeWins <- numeric(numTeams)
  homeDraws <- numeric(numTeams)
  homeLoss <- numeric(numTeams)
  awayWins <- numeric(numTeams)
  awayDraws <- numeric(numTeams)
  awayLoss <- numeric(numTeams)
  goalsFor <- numeric(numTeams)
  goalsAgainst <- numeric(numTeams)
  goalsDifference <- numeric(numTeams)
  playedMatches <- numeric(numTeams)
  pts <- numeric(numTeams)

  for (t in 1:numTeams) {
    #match results for a given team
    homeResults <- results[homeTeam == teams[t]]
    awayResults <- results[awayTeam == teams[t]]

    playedMatches[t] <- length(homeResults) + length(awayResults)

    goalsForH <- sum(homeGoals[homeTeam == teams[t]])
    goalsForA <- sum(awayGoals[awayTeam == teams[t]])
    goalsFor[t] <- goalsForA + goalsForH
    goalsAgainstH <- sum(awayGoals[homeTeam == teams[t]])
    goalsAgainstA <- sum(homeGoals[awayTeam == teams[t]])
    goalsAgainst[t] <- goalsAgainstA + goalsAgainstH
    goalsDifference[t] <- goalsFor[t] - goalsAgainst[t]

    homeWins[t] <- sum(homeResults == "H")
    homeDraws[t] <- sum(homeResults == "D")
    homeLoss[t] <- sum(homeResults == "A")
    awayWins[t] <- sum(awayResults == "A")
    awayDraws[t] <- sum(awayResults == "D")
    awayLoss[t] <- sum(awayResults == "H")

    totWins <- homeWins[t] + awayWins[t]
    totDraws <- homeDraws[t] + awayDraws[t]
    totLoss <- homeLoss[t] + awayLoss[t]

    pts[t] <- (winPts * totWins) + (drawPts * totDraws) + (loosePts * totLoss)
  }

  table <- data.frame(cbind(playedMatches, homeWins, homeDraws, homeLoss, awayWins, awayDraws, awayLoss, goalsFor, goalsAgainst, goalsDifference, pts), row.names=teams)

  names(table) <- c("PLD", "HW", "HD", "HL", "AW", "AD", "AL", "GF", "GA", "GD", "PTS")
  ord <- order(-table$PTS, -table$GD, -table$GF)
  table <- table[ord, ]

  return(table)

}

result.matrix <- function(homeTeam, awayTeam, homeGoals, awayGoals, format="score"){

if (length(unique(sapply(list(homeTeam, awayTeam, homeGoals, awayGoals), length))) != 1 ){
warning("input vectors not of same length.")
}

teams <- levels(factor(c(as.character(homeTeam), as.character(awayTeam))))
numTeams <- length(teams)
numMatches <- length(homeTeam)

if (format == "HDA"){
results <- match.results(homeGoals, awayGoals)
}

resultMatrix <- matrix(nrow=numTeams, ncol=numTeams, dimnames=list(teams, teams))

for (m in 1:numMatches){

if (format == "score"){
cell <- paste(homeGoals[m], "-", awayGoals[m])
}
else if (format == "HDA"){
cell <- results[m]
}
else if (format == "difference"){
cell <- homeGoals[m] - awayGoals[m]
}

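#index by team name, so factors with differently ordered levels also work
resultMatrix[as.character(homeTeam[m]), as.character(awayTeam[m])] <- cell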
}

return(resultMatrix)

}