‘Synonymous’ factor levels in R

When I work with data from different sources, they are often inconsistent in ways they specify categorical variables. One example is country names. There are many ways the name of a country can be specified, and even if there are international standards, different organizations like to do it their way. North Korea, for example, may sometimes be written as just as ‘North Korea’, but other sources may call it ‘Korea DPR’.

This of course leads to complications when we want to combine data from different sources. What could be a trivial lookup in two different dataframes in R becomes a real hassle. One solution I have come up with is to make a .csv file with different names from different sources, and then load it into R and use it to ‘translate’ the factor levels from one source to the way the levels are represented in the other. Based on a method for renaming levels with regular expressions from Winston Chang’s Cookbook for R, I made a function for renaming several levels in a dataframe at once. The part about using a .csv file is not the important thing here, it is just a more convenient way of storing the information needed.

The function takes four arguments. dat is a dataframe that contains the factors that is to be renamed. vars is the variables to rename. from and to specifies what to rename from and what to rename to. The function returns a dataframe.

renameLevels <- function(dat, vars, from, to){
  for (v in vars){
    ptrns <- paste("^", from, "$", sep="") 
    for (lvl in 1:length(ptrns)){
      levels(dat[, v]) <- sub(ptrns[lvl], to[lvl], levels(dat[, v]))

A small example:

#data to be translated
var <- factor(c("b", "a", "c", "a", "d", "a", "e", "b"))
var2 <- factor(c("b", "b", "b", "b", "b", "a", "e", "b"))
data <- data.frame(var, var2)
#> data
#  var var2
#1   b    b
#2   a    b
#3   c    b
#4   a    b
#5   d    b
#6   a    a
#7   e    e
#8   b    b

#translate from roman to greek letters
roman <- c("a", "b", "c", "d", "e")
greek <- c("alpha", "beta", "gamma", "delta", "epsilon")

data2 <- renameLevels(data, c("var", "var2"), roman, greek)
#> data2
#      var    var2
#1    beta    beta
#2   alpha    beta
#3   gamma    beta
#4   alpha    beta
#5   delta    beta
#6   alpha   alpha
#7 epsilon epsilon
#8    beta    beta