r - Automating the process of recoding numeric variables to meaningful factor variables
Problem description
I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown. However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.
data <- data.frame("ID" = c(1:5),
"Var1" = c(2,2,8,1,8),
"Var2" = c(5,8,4,88,10))
For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like
> Var1_values
[1] 8 2 1
with a corresponding vector containing the categories:
> Var1_categories
[1] "unknown" "yes" "no"
But I cannot figure out a process for how to bring this information together in order to automate the recoding process towards an expected result like
| ID | Var1 | Var2 |
|----|---------|-------------------|
| 1 | yes | condition E |
| 2 | yes | condition H |
| 3 | unknown | condition D |
| 4 | no | unknown condition |
| 5 | unknown | condition H |
where each column is a meaningful factor variable.
As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.
// edit:
I forgot to mention that I have already made some kind of a mapping-dataframe but I couldn't really put it to use, yet. It looks like this:
mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
"Value1" = c(2,2,2,7),
"Word1" = c("yes","yes","yes","condition A"),
"Value2" = c(1,1,1,6),
"Word2" = c("no","no","no","Condition B"),
"Value3" = c(8,8,8,5),
"Word3" = c("unk","unk","unk", "Condition C"),
"Value4" = c(NA,NA,NA,4),
"Word4" = c(NA,NA,NA,"Condition B")
)
I would like to transform it to "long" format so I can use it with @r2evan's solution.
Solution
Here is one idea, though it requires reshaping the data (twice).
mapping <- data.frame(
Var = c(rep("Var1", 3), rep("Var2", 5)),
Val = c(1, 2, 8, 4, 5, 8, 10, 88),
Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
# Var Val Words
# 1 Var1 1 no
# 2 Var1 2 yes
# 3 Var1 8 unk
# 4 Var2 4 D
# 5 Var2 5 E
# 6 Var2 8 H
# 7 Var2 10 H
# 8 Var2 88 unk
library(dplyr)
library(tidyr) # pivot_*
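data %>%
  # one row per ID/variable pair, so every code can be joined against the mapping
  pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
  left_join(mapping, by = c("Var", "Val")) %>%
  # id_cols is passed by name; recent tidyr versions no longer accept it by position
  pivot_wider(id_cols = ID, names_from = "Var", values_from = "Words")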
# # A tibble: 5 x 3
# ID Var1 Var2
# <int> <chr> <chr>
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
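With this method, you control the number-to-word mapping for each variable individually.
Another option is to use a list of lookup vectors, similar to the mapping above, which does not require the double reshape.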
maplist <- list(
Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
# 1 2 8
# "no" "yes" "unk"
# $Var2
# 4 5 8 10 88
# "D" "E" "H" "H" "unk"
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)],
data[nms], maplist[nms])
data
# ID Var1 Var2
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
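Both approaches above return character columns, while the question asks for factors. A minimal sketch of how the lookup-list variant might be extended to return factors, assuming the same `data` and `maplist` as above; the choice to take factor levels from the order of each lookup vector is an assumption, not something the original answer specifies:

```r
# Same example data and lookup list as above, repeated so the sketch runs on its own.
data <- data.frame(ID = 1:5,
                   Var1 = c(2, 2, 8, 1, 8),
                   Var2 = c(5, 8, 4, 88, 10))

maplist <- list(
  Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
  Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)

# Only recode variables that exist in both the data and the lookup list.
nms <- intersect(names(maplist), names(data))

data[nms] <- Map(
  function(val, lookup) {
    labels <- unname(lookup[as.character(val)])
    # unique() because several codes may share one label (8 and 10 both map to "H")
    factor(labels, levels = unique(unname(lookup)))
  },
  data[nms], maplist[nms]
)

str(data)
```

Codes that have no entry in a lookup vector become `NA`, so it may be worth checking for unexpected `NA`s after recoding.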
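Between the two, I think I prefer the first, provided your data does not punish you for reshaping it (many things can make reshaping less attractive). One reason it is nice is that maintaining `mapping` can be as simple as maintaining a CSV file, which can be done in your favorite spreadsheet tool such as Excel or Calc.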
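The question's edit stores the mapping wide, as `Value1`/`Word1`, `Value2`/`Word2`, and so on. Here is a rough base-R sketch of how it might be stacked into the long `Var`/`Val`/`Words` layout used by the join approach. It assumes the columns really follow the `Value<n>`/`Word<n>` naming pattern and that unused slots are `NA`; the names `mapping_wide` and `mapping_long` are only for illustration:

```r
# Wide mapping as shown in the question's edit.
mapping_wide <- data.frame(
  Variable = c("Var1", "Var2", "Var3", "Var4"),
  Value1 = c(2, 2, 2, 7),
  Word1  = c("yes", "yes", "yes", "condition A"),
  Value2 = c(1, 1, 1, 6),
  Word2  = c("no", "no", "no", "Condition B"),
  Value3 = c(8, 8, 8, 5),
  Word3  = c("unk", "unk", "unk", "Condition C"),
  Value4 = c(NA, NA, NA, 4),
  Word4  = c(NA, NA, NA, "Condition B"),
  stringsAsFactors = FALSE
)

# Pair each Value<n> column with its Word<n> column (assumed naming pattern).
value_cols <- grep("^Value[0-9]+$", names(mapping_wide), value = TRUE)
word_cols  <- sub("^Value", "Word", value_cols)

# unlist() stacks the columns one after another, so the variable names
# are repeated once per Value/Word pair to stay aligned.
mapping_long <- data.frame(
  Var   = rep(mapping_wide$Variable, times = length(value_cols)),
  Val   = unlist(mapping_wide[value_cols], use.names = FALSE),
  Words = unlist(mapping_wide[word_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the unused slots and sort by variable for readability.
mapping_long <- mapping_long[!is.na(mapping_long$Val), ]
mapping_long <- mapping_long[order(mapping_long$Var), ]
rownames(mapping_long) <- NULL

mapping_long
```

The result has one row per variable/code pair and can be passed to `left_join()` in place of the hand-written `mapping`.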