首页 > 解决方案 > Automating the process of recoding numeric variables to meaningful factor variables

问题描述

I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown. However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.

data <- data.frame("ID" = c(1:5),
                   "Var1" = c(2,2,8,1,8),
                   "Var2" = c(5,8,4,88,10))

For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like

> Var1_values
[1] 8 2 1

with a corresponding vector containing the categories:

> Var1_categories
[1] "unknown" "yes" "no"

But I cannot figure out a process for how to bring this information together in order to automate the recoding process towards an expected result like

| ID | Var1    | Var2              |
|----|---------|-------------------|
| 1  | yes     | condition E       |
| 2  | yes     | condition H       |
| 3  | unknown | condition D       |
| 4  | no      | unknown condition |
| 5  | unknown | condition H       |

where each column is a meaningful factor variable.

As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.

// edit:

I forgot to mention that I have already made some kind of a mapping-dataframe but I couldn't really put it to use, yet. It looks like this:

mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
                      "Value1" = c(2,2,2,7),
                      "Word1" = c("yes","yes","yes","condition A"),
                      "Value2" = c(1,1,1,6),
                      "Word2" = c("no","no","no","Condition B"),
                      "Value3" = c(8,8,8,5),
                      "Word3" = c("unk","unk","unk", "Condition C"),
                      "Value4" = c(NA,NA,NA,4),
                      "Word4" = c(NA,NA,NA,"Condition B")
                      )

I would like to "long"-transform it so I can use it with @r2evan 's solution.

标签: rrecodeforcats

解决方案


这是一个想法,尽管它需要重新塑造(两次)数据。

mapping <- data.frame(
  Var = c(rep("Var1", 3), rep("Var2", 5)),
  Val = c(1, 2, 8, 4, 5, 8, 10, 88),
  Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
#    Var Val Words
# 1 Var1   1    no
# 2 Var1   2   yes
# 3 Var1   8   unk
# 4 Var2   4     D
# 5 Var2   5     E
# 6 Var2   8     H
# 7 Var2  10     H
# 8 Var2  88   unk

library(dplyr)
library(tidyr) # pivot_*
data %>%
  pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
  left_join(mapping, by = c("Var", "Val")) %>%
  pivot_wider(ID, names_from = "Var", values_from = "Words")
# # A tibble: 5 x 3
#      ID Var1  Var2 
#   <int> <chr> <chr>
# 1     1 yes   E    
# 2     2 yes   H    
# 3     3 unk   D    
# 4     4 no    unk  
# 5     5 unk   H    

使用此方法,您可以控制每个变量的数字到单词的映射。


另一种选择是使用地图列表,类似于上面,但它不需要双重整形。

maplist <- list(
  Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
  Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
#     1     2     8 
#  "no" "yes" "unk" 
# $Var2
#     4     5     8    10    88 
#   "D"   "E"   "H"   "H" "unk" 
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)], 
                  data[nms], maplist[nms])
data
#   ID Var1 Var2
# 1  1  yes    E
# 2  2  yes    H
# 3  3  unk    D
# 4  4   no  unk
# 5  5  unk    H

在两者之间,我认为我更喜欢第一个,如果你的数据不会因为你重塑它而受到惩罚(很多事情可能会让这不那么吸引人)。它之所以好的一个原因是mapping维护 CSV 可以像维护 CSV 一样简单(这可以在您最喜欢的电子表格工具中完成,例如 Excel 或 Calc)。


推荐阅读