# Identifying gender bias in candidate lists in proportional representation elections

The Norwegian parliamentary elections use a system of proportional representation. Each county has a number of seats in parliament (based on number of inhabitants and area), and the number of seats given to each party is almost proportional to the number of votes the party receives in that county. Since each party can win more than one seat, the parties have to prepare a ranked list of people to be elected, where the top name is given the first seat, the second name is given the second seat, etc.

Proportional representation systems like the Norwegian one have been shown to be associated with greater gender balance in parliaments than other systems (see table 1 in this paper). The proportion of women in the Norwegian Storting has also increased over the last 30 years:

Data source: Statistics Norway, table 08219.

At the 1981 election, 26% of the elected representatives were women. At the 2013 election, the proportion was almost 40%. One mechanism that can explain this persistent female underrepresentation is that men are overrepresented at the top of the electoral lists. Inspired by a bioinformatics method called Gene Set Enrichment Analysis (GSEA), I am going to put this hypothesis to the test.

The method is rather simple. In general terms, this is how it works: first you calculate a score which represents the degree of overrepresentation of a category near the top of the list. Going down the list, each time you encounter an instance belonging to the category you're testing you increase the score, otherwise you decrease it. For the score to measure overrepresentation at the top of the list, the increases and decreases must be weighted so that positions near the top count more. The maximum value of this ‘running sum’ is the test statistic. As the weight I have chosen the function $$\frac{1}{\sqrt{i}}$$ where i is the candidate's position on the list (number 1 is the top candidate).

To calculate the p-value, the same calculation is repeated many times with different random permutations of the list. The p-value is then the proportion of permutations whose score is greater than or equal to the observed score.
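The procedure described above can be sketched in R roughly as follows. This is a minimal sketch, not the code used for the results in this post: it assumes that decreases use the same 1/sqrt(i) weight as increases, and the function names and the example list are made up for illustration.

```r
# Running-sum score: a positive step for each member of the tested category,
# a negative step otherwise, each weighted by 1/sqrt(position).
# The statistic is the maximum of the running sum.
enrichment.score <- function(in.category){
  weights <- 1 / sqrt(seq_along(in.category))
  steps <- ifelse(in.category, weights, -weights)
  max(cumsum(steps))
}

# Permutation p-value: proportion of shuffled lists whose score is
# greater than or equal to the observed score.
permutation.p.value <- function(in.category, n.perm=10000){
  observed <- enrichment.score(in.category)
  permuted <- replicate(n.perm, enrichment.score(sample(in.category)))
  mean(permuted >= observed)
}

# Hypothetical list, not real election data: TRUE marks a man.
example.list <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
set.seed(1)
score <- enrichment.score(example.list)
p.value <- permutation.p.value(example.list, n.perm=2000)
```

Because the list is only shuffled, every permutation keeps the same number of men and women, so the test looks at ordering alone.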

I am going to use this method on the election lists from Hordaland county from the 1981 and 2013 elections. Hordaland had 15 seats in 1981, and 16 seats in 2013. 3 (20 %) women were elected in 1981 and 5 (31.3 %) in 2013. The election lists are available from the Norwegian Social Science Data Services and the National Library of Norway.

Here are the results for each party at the two elections:

| Party | 2013 | 1981 |
|-------|------|------|
| Ap | 1 (0.43) | 3.58 (0.49) |
| Frp | 3.28 (0.195) | 3.56 (0.49) |
| H | 1.018 (0.66) | 3.17 (0.35) |
| Krf | 1.24 (0.43) | 2.32 (0.138) |
| Sp | 2.86 (0.49) | 2.86 (0.48) |
| Sv | 1 (0.24) | 0.29 (0.72) |
| V | 1.49 (0.59) | 1.37 (0.29) |

The number shown is the score, with the p-value in parentheses. A higher score means a greater overrepresentation of men at the top of the list.

Even if we ignore problems with multiple testing, none of the parties have a significant overrepresentation of men at the top if the traditional significance threshold of $$p \le 0.05$$ is used. This is perhaps unexpected, as at least the gender balance among the candidates elected in 1981 is significantly biased (p = 0.018, one-sided exact binomial test).
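The binomial test mentioned above can be reproduced with R's built-in binom.test. The sketch assumes the test compared the 3 women among the 15 representatives elected from Hordaland in 1981 against a null proportion of 0.5, with the alternative that the true proportion of women is lower.

```r
# 3 women among 15 elected representatives (Hordaland, 1981).
# Null hypothesis: each seat is equally likely to go to a woman or a man
# (p = 0.5 is an assumption about how the test was set up).
test <- binom.test(x=3, n=15, p=0.5, alternative='less')
p.value <- test$p.value
print(round(p.value, 3))
```

Under these assumptions the p-value rounds to 0.018, which agrees with the value quoted above.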

This really tells us that the method is not powerful enough to make inferences about this kind of data. One possible improvement would be to somehow score all lists in combination to find an overall gender bias. One could also try a different null model. The one I have used here randomly shuffles the list in question, maintaining the bias in gender ratio (if any). Instead, the observed score could be compared to random samplings where each gender is sampled with equal probability.

My final thought is that this whole significance testing approach is inappropriate. Even if the bias is statistically insignificant, it is still there to influence the gender ratio of the elected members of parliament. From looking at some of the lists and their scores, I would say that all scores greater than 1 at least indicate a positive bias towards having more men at the top.
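The alternative null model suggested here could look something like the sketch below. It is only an illustration of the idea and has not been applied to the election lists: the score function assumes symmetric 1/sqrt(i) weights, and the function names and example list are hypothetical.

```r
# Running-sum score with 1/sqrt(position) weights: positive steps for men
# (TRUE), negative steps for women (FALSE). The statistic is the maximum.
enrichment.score <- function(is.man){
  weights <- 1 / sqrt(seq_along(is.man))
  max(cumsum(ifelse(is.man, weights, -weights)))
}

# Null model: every position on the list is filled by a man or a woman
# with probability 0.5, independently of the observed gender ratio.
equal.probability.p.value <- function(is.man, n.sim=10000){
  observed <- enrichment.score(is.man)
  simulated <- replicate(n.sim, {
    random.list <- sample(c(TRUE, FALSE), length(is.man), replace=TRUE)
    enrichment.score(random.list)
  })
  mean(simulated >= observed)
}

# Hypothetical list, not real election data.
example.list <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
set.seed(1)
p.equal <- equal.probability.p.value(example.list, n.sim=2000)
```

Unlike the permutation approach, this null model lets an overall surplus of men on the list contribute to a low p-value, not only their placement near the top.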

# The R code for the home field advantage and traveling distance analysis.

I was asked in the comments on my post Does traveling distance influence home field advantage? to provide the R code I used, because Klemens of the rationalsoccer blog wanted to run the analysis on some of his own data. I have refactored it a bit to make it easier to use.

First load the data with the coordinates I posted last year.

dta.stadiums <- read.csv('stadiums.csv')


I also assume you have data formatted like the data from football-data.co.uk in a data frame called dta.matches.

First we need a way to calculate the distance (in kilometers) between two coordinates. This function does that.

coordinate.distance <- function(lat1, long1, lat2, long2, radius=6371){
  #Calculates the distance between two WGS84 coordinates.
  #
  #http://en.wikipedia.org/wiki/Haversine_formula
  #http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
  dlat <- (lat2 * (pi/180)) - (lat1 * (pi/180))
  dlong <- (long2 * (pi/180)) - (long1 * (pi/180))
  h <- (sin((dlat)/2))^2 + cos((lat1 * (pi/180)))*cos((lat2 * (pi/180))) * ((sin((dlong)/2))^2)
  #pmin guards against rounding errors pushing sqrt(h) above 1.
  c <- 2 * asin(pmin(1, sqrt(h)))
  #Great-circle distance in the units of radius (kilometers by default).
  d <- radius * c
  return(d)
}
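For reference, the haversine formula that this function is based on can be written as follows, with latitudes $$\varphi_1, \varphi_2$$ and longitudes $$\lambda_1, \lambda_2$$ in radians and $$r$$ the radius of the sphere. The symbols are chosen here for illustration and do not appear in the code.

$$h = \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad d = 2r \arcsin\left(\min\left(1, \sqrt{h}\right)\right)$$

Two easy sanity checks follow from it: identical points give $$d = 0$$, and two points on the equator 90 degrees of longitude apart give $$d = \frac{\pi r}{2}$$, a quarter of the circumference.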


Next, we need to find the coordinates where each match is played, and the coordinates for where the visiting team comes from. Then the traveling distance for each match is calculated and put into the Distance column of dta.matches.

coord.home <- dta.stadiums[match(dta.matches$HomeTeam, dta.stadiums$FDCOUK),
c('Latitude', 'Longitude')]
coord.away <- dta.stadiums[match(dta.matches$AwayTeam, dta.stadiums$FDCOUK),
c('Latitude', 'Longitude')]

dta.matches$Distance <- coordinate.distance(coord.home$Latitude, coord.home$Longitude,
coord.away$Latitude, coord.away$Longitude)

Here are two functions that are needed to calculate the home field advantage per match. The avgerage.gd function takes a data frame as an argument and computes the average goal difference for each team. The result should be passed to the matchwise.hfa function to calculate the home field advantage per match.

avgerage.gd <- function(dta){
  #Calculates the average goal difference for each team.
  #as.character makes this work whether the team columns are factors or strings.
  all.teams <- unique(c(as.character(dta$HomeTeam), as.character(dta$AwayTeam)))
  average.goal.diff <- numeric(length(all.teams))
  names(average.goal.diff) <- all.teams
  for (t in all.teams){
    idxh <- which(dta$HomeTeam == t)
    goals.for.home <- dta[idxh, 'FTHG']
    goals.against.home <- dta[idxh, 'FTAG']

    idxa <- which(dta$AwayTeam == t)
    goals.for.away <- dta[idxa, 'FTAG']
    goals.against.away <- dta[idxa, 'FTHG']

    n.matches <- length(idxh) + length(idxa)
    total.goal.difference <- sum(goals.for.home) + sum(goals.for.away) - sum(goals.against.home) - sum(goals.against.away)
    average.goal.diff[t] <- total.goal.difference / n.matches
  }
  return(average.goal.diff)
}

matchwise.hfa <- function(dta, avg.goaldiff){
  #Calculates the matchwise home field advantage based on the average goal
  #difference for each team.
  n.matches <- dim(dta)[1]
  hfa <- numeric(n.matches)
  for (idx in seq_len(n.matches)){
    #Look up by team name; indexing with a factor would use its integer codes.
    hometeam.avg <- avg.goaldiff[as.character(dta[idx,'HomeTeam'])]
    awayteam.avg <- avg.goaldiff[as.character(dta[idx,'AwayTeam'])]
    expected.goal.diff <- hometeam.avg - awayteam.avg
    observed.goal.diff <- dta[idx,'FTHG'] - dta[idx,'FTAG']
    hfa[idx] <- observed.goal.diff - expected.goal.diff
  }
  return(hfa)
}

In my analysis I used data from several seasons, and the average goal difference for each team was calculated per season. Assuming you have added a Season column to dta.matches that is a factor indicating which season the match is from, this piece of code calculates the home field advantage per match based on the seasonwise average goal differences for each team (phew!). The home field advantage is put into the new column HFA.

dta.matches$HFA <- numeric(dim(dta.matches)[1])
seasons <- levels(dta.matches$Season)
for (i in seq_along(seasons)){
  season.l <- dta.matches$Season == seasons[i]
  h <- matchwise.hfa(dta.matches[season.l,], avgerage.gd(dta.matches[season.l,]))
  dta.matches$HFA[season.l] <- h
}

At last we can do the linear regression and make a nice little plot.

m <- lm(HFA ~ Distance, data=dta.matches)
summary(m)
plot(dta.matches$Distance, dta.matches$HFA, xlab='Distance (km)', ylab='Difference from expected goals', main='Home field advantage vs traveling distance')
abline(m, col='red')
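To make the per-match home field advantage calculation easier to follow, here is a small self-contained sketch on made-up results for three teams. The data is hypothetical (only the column names follow football-data.co.uk), and the sketch uses a vectorized formulation instead of the loops above, so treat it as an illustration of the idea and not as a replacement for the functions in this post.

```r
# Made-up results for three teams, single season.
toy <- data.frame(
  HomeTeam = c('A', 'B', 'C', 'B', 'C', 'A'),
  AwayTeam = c('B', 'C', 'A', 'A', 'B', 'C'),
  FTHG = c(2, 1, 0, 1, 2, 3),
  FTAG = c(0, 1, 1, 1, 2, 0),
  stringsAsFactors = FALSE
)

# Goal difference seen from each side of every match, stacked in one vector.
team <- c(toy$HomeTeam, toy$AwayTeam)
goal.diff <- c(toy$FTHG - toy$FTAG, toy$FTAG - toy$FTHG)

# Average goal difference per team, as a named numeric vector.
avg.gd <- sapply(split(goal.diff, team), mean)

# Home field advantage per match: observed minus expected goal difference,
# where the expected difference is home average minus away average.
expected <- avg.gd[toy$HomeTeam] - avg.gd[toy$AwayTeam]
toy$HFA <- unname((toy$FTHG - toy$FTAG) - expected)
```

In this toy data team A averages 1.5 goals per match in its favor and team B averages -0.5, so a 2-0 home win for A over B is exactly what is expected and gives a home field advantage of 0 for that match.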