r - 将数据文件和标签文件组合在一起,在 R 中拥有一个带有标签的数据框
问题描述
我有两个数据框,一个是调查数据(data.csv),另一个是标签数据(label.csv)。这是示例数据(我的原始数据有大约 150 个变量)
#sample data
df <- tibble::tribble(
~id, ~House_member, ~dob, ~age_quota, ~work, ~sex, ~pss,
1L, 4L, 1983L, 2L, 2L, 1, 1,
2L, 1L, 1940L, 7L, 2L, 1, 2,
3L, 2L, 1951L, 5L, 6L, 1, 1,
4L, 4L, 1965L, 2L, 2L, 1, 4,
5L, 3L, 1965L, 2L, 3L, 1, 1,
6L, 1L, 1951L, 3L, 1L, 1, 3,
7L, 1L, 1955L, 1L, 1L, 1, 3,
8L, 4L, 1982L, 2L, 2L, 2, 5,
9L, 2L, 1990L, 2L, 4L, 2, 3,
10L, 2L, 1953L, 3L, 2L, 2, 4
)
#sample label data
label <- tibble::tribble(
~variable, ~value, ~label,
"House_member", NA, "How many people live with you?",
"House_member", 1L, "1 person",
"House_member", 2L, "2 persons",
"House_member", 3L, "3 persons",
"House_member", 4L, "4 persons",
"House_member", 5L, "5 persons",
"House_member", 6L, "6 persons",
"House_member", 7L, "7 persons",
"House_member", 8L, "8 persons",
"House_member", 9L, "9 persons",
"House_member", 10L, "10 or more",
"dob", NA, "date of brith",
"age_quota", NA, "age_quota",
"age_quota", 1L, "10-14",
"age_quota", 2L, "15-19",
"age_quota", 3L, "20-29",
"age_quota", 4L, "30-39",
"age_quota", 5L, "40-49",
"age_quota", 6L, "50-70",
"age_quota", 7L, "70 +",
"work", NA, "what is your occupation?",
"work", 1L, "full time",
"work", 2L, "part time",
"work", 3L, "retired",
"work", 4L, "student",
"work", 5L, "housewife",
"work", 6L, "unemployed",
"work", 7L, "other",
"work", 8L, "kid under 15",
"sex", NA, "gender?",
"sex", 1L, "Man",
"sex", 2L, "Woman",
"pss", NA, "How often do you use PS?",
"pss", 1L, "Daily",
"pss", 2L, "several times per week",
"pss", 3L, "once per week",
"pss", 4L, "several time per month",
"pss", 5L, "Rarly"
)
我想知道有什么方法可以将这些文件组合在一起,形成一个带有标签的数据框,例如SPSS
's 样式格式(dbl+lbl 格式)。我知道labelled
可以将值标签添加到未标记向量的包,例如以下示例:
v <- labelled::labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, maybe = 2, no = 3))
我希望有一种比一个一个地为每个变量添加标签更好/更快的方法。
解决方案
另一种imap_dfc
解决方案:
library(tidyverse)
df %>% imap_dfc(~{
label[label$variable==.y,c('label','value')] %>%
deframe() %>% # to named vector
haven::labelled(.x,.)
})
# A tibble: 10 x 7
id House_member dob age_quota work sex pss
<int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl> <dbl+lbl> <dbl+lbl>
1 1 4 [4 persons] 1983 2 [15-19] 2 [part time] 1 [Man] 1 [Daily]
2 2 1 [1 person] 1940 7 [70 +] 2 [part time] 1 [Man] 2 [several times per week]
3 3 2 [2 persons] 1951 5 [40-49] 6 [unemployed] 1 [Man] 1 [Daily]
4 4 4 [4 persons] 1965 2 [15-19] 2 [part time] 1 [Man] 4 [several time per month]
5 5 3 [3 persons] 1965 2 [15-19] 3 [retired] 1 [Man] 1 [Daily]
6 6 1 [1 person] 1951 3 [20-29] 1 [full time] 1 [Man] 3 [once per week]
7 7 1 [1 person] 1955 1 [10-14] 1 [full time] 1 [Man] 3 [once per week]
8 8 4 [4 persons] 1982 2 [15-19] 2 [part time] 2 [Woman] 5 [Rarly]
9 9 2 [2 persons] 1990 2 [15-19] 4 [student] 2 [Woman] 3 [once per week]
10 10 2 [2 persons] 1953 3 [20-29] 2 [part time] 2 [Woman] 4 [several time per month]
已使用tibble::deframe
且haven::labelled
包含在tidyverse
替换后的速度比较filter
/select
通过直接访问label
:
Waldi <- function() {
df %>% imap_dfc(~{
label[label$variable==.y,c('label','value')] %>%
deframe() %>% # to named vector
haven::labelled(.x,.)})}
Waldi_old <- function() {
df %>% imap_dfc(~{
label %>% filter(variable==.y) %>%
select(label, value) %>%
deframe() %>% # to named vector
haven::labelled(.x,.)
})}
#EDIT : Included TIC33() for-loop solution
microbenchmark::microbenchmark(TIC3(),Waldi(),Anil(),TIC1(),Waldi_old(),Sinh())
Unit: microseconds
expr min lq mean median uq max neval cld
TIC3() 688.0 871.80 982.280 920.95 1005.55 1801.6 100 a
Waldi() 1345.5 1543.60 1804.758 1635.45 1893.75 4306.8 100 b
Anil() 4006.8 4476.65 5188.519 4862.95 5439.10 10163.6 100 c
TIC1() 3898.2 4278.80 5009.927 4774.95 5277.05 12916.2 100 c
Waldi_old() 18712.3 20091.75 21756.140 20609.35 22169.75 33359.8 100 d
Sinh() 22730.9 24093.45 25931.412 24946.00 26614.00 38735.3 100 e
推荐阅读
- regex - 使用 powershell 和 regex 查找和替换文本文件中的多个字符串
- go - 方法中第三个返回语句的原因
- linux - Bitbake 不更新目标 rootfs 的 /etc/passwd 和 /etc/group 中的用户和组条目
- networking - 在端口转发我的本地反应应用程序时遇到问题
- c# - 如何在 Unity 中碰撞两个对象而不移动任何一个对象
- javascript - 使用角度的 webpack 联合加载了错误的模块
- awk - 使用正则表达式提取多个字符串
- typescript - 等待可观察的分辨率
- javascript - 某些操作后 SetInterval 无法正常工作
- reactjs - React - 下拉菜单不适用于三元和 .map