When I work with data from different sources, they are often inconsistent in the way they specify categorical variables. One example is country names. There are many ways the name of a country can be written, and even though there are international standards, different organizations like to do it their own way. North Korea, for example, may sometimes be written simply as ‘North Korea’, while other sources call it ‘Korea DPR’.
This of course leads to complications when we want to combine data from different sources. What should be a trivial lookup between two dataframes in R becomes a real hassle. One solution I have come up with is to make a .csv file with the different names from the different sources, load it into R, and use it to ‘translate’ the factor levels from one source into the way the levels are represented in the other. Based on a method for renaming levels with regular expressions from Winston Chang’s Cookbook for R, I made a function for renaming several levels in a dataframe at once. The part about using a .csv file is not the important thing here; it is just a more convenient way of storing the information needed.
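The function takes four arguments. `dat` is a dataframe that contains the factors that are to be renamed. `vars` is a vector with the names of the variables to rename. `from` and `to` specify what to rename from and what to rename to; they must have the same length, and the level in position i of `from` is renamed to the value in position i of `to`. The function returns a dataframe.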
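```r
renameLevels <- function(dat, vars, from, to){
  for (v in vars){
    # anchor each name so that only whole levels match
    ptrns <- paste("^", from, "$", sep="")
    # seq_along() also behaves when 'from' is empty, unlike 1:length()
    for (lvl in seq_along(ptrns)){
      levels(dat[, v]) <- sub(ptrns[lvl], to[lvl], levels(dat[, v]))
    }
  }
  return(dat)
}
```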
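Because the names in `from` are treated as regular expressions and applied one after another, two things can surprise you: a name containing regex metacharacters, such as ‘Korea (DPR)’, will not match itself, and a level that has just been renamed can be renamed again by a later pattern. If that matters for your data, one possible alternative is to match the levels literally in a single pass. The sketch below is only an illustration of that idea; the name `renameLevelsExact` and the example data are made up for this purpose and are not part of the Cookbook method.

```r
# Hypothetical variant: literal, one-pass renaming with match() instead of sub()
renameLevelsExact <- function(dat, vars, from, to) {
  stopifnot(length(from) == length(to))
  to <- as.character(to)
  for (v in vars) {
    stopifnot(is.factor(dat[[v]]))
    old <- levels(dat[[v]])
    idx <- match(old, from)                  # position in 'from', NA if not listed
    levels(dat[[v]]) <- ifelse(is.na(idx), old, to[idx])
  }
  dat
}

# Made-up example: a name with parentheses, and a mapping where 'to' overlaps 'from'
x <- data.frame(country = factor(c("Korea (DPR)", "a", "b")))
x2 <- renameLevelsExact(x, "country",
                        from = c("Korea (DPR)", "a", "b"),
                        to   = c("North Korea", "b", "c"))
as.character(x2$country)
```

Levels that are not listed in `from` are left untouched, just as in the original function.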
A small example:
```r
#data to be translated
var <- factor(c("b", "a", "c", "a", "d", "a", "e", "b"))
var2 <- factor(c("b", "b", "b", "b", "b", "a", "e", "b"))
data <- data.frame(var, var2)

#> data
#  var var2
#1   b    b
#2   a    b
#3   c    b
#4   a    b
#5   d    b
#6   a    a
#7   e    e
#8   b    b

#translate from roman to greek letters
roman <- c("a", "b", "c", "d", "e")
greek <- c("alpha", "beta", "gamma", "delta", "epsilon")

data2 <- renameLevels(data, c("var", "var2"), roman, greek)

#> data2
#      var    var2
#1    beta    beta
#2   alpha    beta
#3   gamma    beta
#4   alpha    beta
#5   delta    beta
#6   alpha   alpha
#7 epsilon epsilon
#8    beta    beta
```
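To tie this back to the country names, here is a rough sketch of how the .csv translation table could be used. Everything specific in it is an assumption made for illustration: the column names `source_a` and `source_b`, the rows in the table, and the `scores` dataframe. The table is written to a temporary file only so that the example runs on its own; in practice it would be a file you maintain by hand. A copy of `renameLevels()` is included for the same reason.

```r
# Copy of renameLevels() from the post, repeated so the sketch is self-contained
renameLevels <- function(dat, vars, from, to){
  for (v in vars){
    ptrns <- paste("^", from, "$", sep="")
    for (lvl in seq_along(ptrns)){
      levels(dat[, v]) <- sub(ptrns[lvl], to[lvl], levels(dat[, v]))
    }
  }
  return(dat)
}

# Hypothetical translation table: one column per source, one row per country
csv_path <- tempfile(fileext = ".csv")
writeLines(c("source_a,source_b",
             "Korea DPR,North Korea",
             "Korea Republic,South Korea"),
           csv_path)

# Read the names as plain character strings, not factors
lookup <- read.csv(csv_path, stringsAsFactors = FALSE)

# Made-up data that uses the naming of source A
scores <- data.frame(country = factor(c("Korea DPR", "Korea Republic", "Norway")),
                     score = c(1, 2, 3))

# Translate the levels to the naming of source B
scores2 <- renameLevels(scores, "country", lookup$source_a, lookup$source_b)
as.character(scores2$country)
```

Countries that are missing from the table, like Norway here, simply keep their original name.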