In the comments on my post "Does traveling distance influence home field advantage?" I was asked to provide the R code I used, because Klemens of the rationalsoccer blog wanted to run the analysis on some of his own data. I have refactored it a bit to make it easier to use.
First load the data with the coordinates I posted last year.
dta.stadiums <- read.csv('stadiums.csv')
I also assume you have data formatted like the data from football-data.co.uk in a data frame called dta.matches.
First we need a way to calculate the distance (in kilometers) between two coordinates. This function does that.
coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
  #Calculates the distance between two WGS84 coordinates.
  #
  #http://en.wikipedia.org/wiki/Haversine_formula
  #http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
  dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
  dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
  h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
  #Cap sqrt(h) at 1 so rounding errors cannot push asin() outside its domain.
  c <- 2 * asin(pmin(1, sqrt(h)))
  d <- radius * c
  return(d)
}
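A quick way to convince yourself that a haversine function behaves sensibly is to compare it against a case where the answer can be worked out by hand. The sketch below uses a stand-alone helper I have called haversine.km (a hypothetical name, written only for this check) and compares one degree of latitude along a meridian with the arc length radius * pi/180.

```r
# Hypothetical stand-alone copy of the haversine calculation, used only for this check.
haversine.km <- function(lat1, long1, lat2, long2, radius = 6371) {
  to.rad <- pi / 180
  dlat <- (lat2 - lat1) * to.rad
  dlong <- (long2 - long1) * to.rad
  h <- sin(dlat / 2)^2 + cos(lat1 * to.rad) * cos(lat2 * to.rad) * sin(dlong / 2)^2
  2 * radius * asin(pmin(1, sqrt(h)))
}

# One degree of latitude along a meridian.
one.degree <- haversine.km(0, 0, 1, 0)
# The same arc length computed directly: radius times the angle in radians.
expected <- 6371 * pi / 180

print(c(haversine = one.degree, direct = expected))

# Two identical points should be zero kilometers apart.
same.point <- haversine.km(59.9, 10.7, 59.9, 10.7)
print(same.point)
```

Both numbers in the first print should agree (roughly 111 km), and the second should be zero.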
Next, we need to find the coordinates where each match is played and the coordinates where the visiting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.
coord.home <- dta.stadiums[match(dta.matches$HomeTeam, dta.stadiums$FDCOUK), c('Latitude', 'Longitude')]
coord.away <- dta.stadiums[match(dta.matches$AwayTeam, dta.stadiums$FDCOUK), c('Latitude', 'Longitude')]
dta.matches$Distance <- coordinate.distance(coord.home$Latitude, coord.home$Longitude, coord.away$Latitude, coord.away$Longitude)
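One thing to watch out for with match() is that it returns NA for any team name it cannot find in the stadium table, which silently gives NA distances. Here is a small sketch of how you might check for that before going further. The toy data frames and team names are made up for illustration; only the FDCOUK, HomeTeam and AwayTeam column names come from the code above.

```r
# Toy data for illustration only; the team names are made up.
toy.stadiums <- data.frame(
  FDCOUK = c("Alpha", "Beta"),
  Latitude = c(51.5, 53.4),
  Longitude = c(-0.1, -2.2),
  stringsAsFactors = FALSE
)
toy.matches <- data.frame(
  HomeTeam = c("Alpha", "Beta", "Gamma"),
  AwayTeam = c("Beta", "Alpha", "Alpha"),
  stringsAsFactors = FALSE
)

# Team names that appear in the matches but not in the stadium table.
teams.in.matches <- unique(c(toy.matches$HomeTeam, toy.matches$AwayTeam))
missing.teams <- setdiff(teams.in.matches, toy.stadiums$FDCOUK)
print(missing.teams)

# match() returns NA for those teams, so their rows would get NA coordinates.
idx <- match(toy.matches$HomeTeam, toy.stadiums$FDCOUK)
print(is.na(idx))
```

If missing.teams is not empty, the spelling of those teams probably differs between the two data sets.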
Here are the two functions that are needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the home field advantage per match.
avgerage.gd <- function(dta){
  #Calculates the average goal difference for each team.
  #as.character() makes this work for both factor and character columns.
  all.teams <- unique(c(as.character(dta$HomeTeam), as.character(dta$AwayTeam)))
  average.goal.diff <- numeric(length(all.teams))
  names(average.goal.diff) <- all.teams
  for (team in all.teams){
    idxh <- which(dta$HomeTeam == team)
    goals.for.home <- dta[idxh, 'FTHG']
    goals.against.home <- dta[idxh, 'FTAG']
    idxa <- which(dta$AwayTeam == team)
    goals.for.away <- dta[idxa, 'FTAG']
    goals.against.away <- dta[idxa, 'FTHG']
    n.matches <- length(idxh) + length(idxa)
    total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) - sum(goals.against.home) - sum(goals.against.away)
    average.goal.diff[team] <- total.goal.difference / n.matches
  }
  return(average.goal.diff)
}

matchwise.hfa <- function(dta, avg.goaldiff){
  #Calculates the matchwise home field advantage based on the average goal
  #difference for each team.
  n.matches <- dim(dta)[1]
  hfa <- numeric(n.matches)
  for (idx in seq_len(n.matches)){
    #Look up by team name; indexing with a factor would use its integer codes.
    hometeam.avg <- avg.goaldiff[as.character(dta[idx,'HomeTeam'])]
    awayteam.avg <- avg.goaldiff[as.character(dta[idx,'AwayTeam'])]
    expected.goal.diff <- hometeam.avg - awayteam.avg
    observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG']
    hfa[idx] <- observed.goal.diff - expected.goal.diff
  }
  return(hfa)
}
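To make the definition concrete, here is the arithmetic for a single match worked through by hand. The two teams and their average goal differences are made up for illustration.

```r
# Made-up season averages for two hypothetical teams.
avg.goaldiff <- c(Alpha = 1.0, Beta = -0.5)

# Alpha hosts Beta and wins 3-1.
# Expected goal difference from the season averages: 1.0 - (-0.5) = 1.5
expected.goal.diff <- avg.goaldiff[["Alpha"]] - avg.goaldiff[["Beta"]]
# Observed goal difference in the match: 3 - 1 = 2
observed.goal.diff <- 3 - 1

# Home field advantage for this match: 2 - 1.5 = 0.5
hfa <- observed.goal.diff - expected.goal.diff
print(hfa)
```

A positive value means the home team did better than the season averages alone would suggest, and a negative value means it did worse.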
In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (phew!). The home field advantage is put into the new column HFA.
dta.matches$HFA <- numeric(dim(dta.matches)[1])
seasons <- levels(dta.matches$Season)
for (i in seq_along(seasons)){
  season.l <- dta.matches$Season == seasons[i]
  h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
  dta.matches$HFA[season.l] <- h
}
Finally, we can run the linear regression and make a nice little plot.
m <- lm(HFA ~ Distance, data=dta.matches)
summary(m)
plot(dta.matches$Distance, dta.matches$HFA, xlab='Distance (km)', ylab='Difference from expected goals', main='Home field advantage vs traveling distance')
abline(m, col='red')
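If you want the size of the effect as a number rather than reading it off the summary printout, you can pull the slope and its confidence interval out of the fitted model. This is only a sketch: the data below is simulated as a stand-in and has nothing to do with real match results, and the names sim and m.sim are my own.

```r
set.seed(1)
# Simulated stand-in data, not real match results.
sim <- data.frame(Distance = runif(200, 0, 500))
sim$HFA <- 0.4 + 0.001 * sim$Distance + rnorm(200, sd = 1.5)

m.sim <- lm(HFA ~ Distance, data = sim)

# Estimated slope: change in home field advantage per extra kilometer traveled.
slope <- coef(m.sim)[["Distance"]]
# 95% confidence interval for the slope.
slope.ci <- confint(m.sim, "Distance", level = 0.95)

print(slope)
print(slope.ci)
```

With your own data you would use m in place of m.sim. If the interval comfortably includes zero, the data does not give much support for an effect of distance.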