How to select exact matches for a list of variables in order to append datasets

Problem description

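I have a separate dataset for each survey wave. Each wave has its own data file and its own variable-name prefix. I am trying to import and append all the data files, keeping only the subset of variables I need. So far I am doing the following: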

 var_list <- c("pidp", "jbsat", "jbhrs", "jbnssec8_dv", "panssec8_dv", "manssec8_dv", "paedqf", "maedqf", "qfhigh", "age_dv",
          "sex_dv", "psu", "strata", "employ", "jbhas", "jboff", "jbsem", "jbstat", "jbterm1", "jbterm2", "pjbptft", "fimnet_dv",
          "fimngrs_dv", "fimnlabnet_dv", "seearnnet_dv", "fimnmisc_dv", "fimnprben_dv", "fimninvent_dv", "fimnpen_dv", "fimnsben_dv", 
          "hhtype_dv", "livesp_dv", "nch14resp", "nmpsp_dv", "tenure_dv", "urban_dv", "jbsat", "health", "sf1", "scghqa",
          "scghqb", "scghqc", "scghqd", "scgqhe", "scgqhf", "scghqg", "scghqi", "scghqj", "scghqh", "scghql", "sclsat1", 
          "sclsat2", "sclsat3", "sclsat4", "indscus_lw", "indscub_xw")

I then import the data for the first wave, select these variables and remove the wave prefix:

 longfile <- read_dta(file=paste0(dir, "ukhls_w1/a_indresp.dta")) %>% 
 select(matches(var_list)) %>% 
 rename_at(vars(starts_with("a_")), ~str_replace(.,"a_", "")) %>% #remove the wave prefix
 mutate(wave = 1) 

At this point I would simply use the following loop:

 for (wn in 2:10) {
   wl <- paste0(letters[wn], "_")
   wave_data <- read_dta(paste0(dir, "ukhls_w", wn, "/", wl, "indresp.dta")) %>%
     select(matches(var_list)) %>%
     rename_at(vars(starts_with(wl)), ~str_replace(., wl, "")) %>% # remove the wave prefix
     mutate(wave = wn)
   longfile <- rbind(longfile, wave_data)
 }

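However, the problem is that some variable names match more than one column in the files for later waves. matches() treats each entry as a regular expression and accepts a hit anywhere in the column name. For example, wave 2 contains a variable named "nxtjbhrs", so it is picked up when matching "jbhrs". This produces an error in rbind, because the number of columns then differs between waves.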
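The behaviour can be reproduced on a small toy data frame. This is only an illustrative sketch: the column names come from the question, the values are made up, and any_of() is shown as one possible way to compare whole names rather than patterns.

```r
library(dplyr)

# Toy data: column names from the question, values invented for illustration.
wave2 <- tibble(pidp = 1:2, b_jbhrs = c(35, 40), b_nxtjbhrs = c(30, 38))

# matches() interprets the string as a regular expression, so "jbhrs"
# also hits "b_nxtjbhrs".
partial <- wave2 %>% select(matches("jbhrs")) %>% names()

# any_of() compares complete column names, so only the exact column is kept.
exact <- wave2 %>% select(any_of("b_jbhrs")) %>% names()

print(partial)
print(exact)
```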

How can I select exact matches in this case? Or, alternatively, force the datasets to be appended?

Thanks for your help!

Tags: r, select, dplyr

Solution


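Compare complete column names instead of using matches(). Note that setdiff(names(.), var_list) would return the columns that are not in the list, so intersect() is what is needed here. Because the columns still carry the wave prefix at this point, the prefixed names have to be included in the comparison:

 select(intersect(names(.), c(var_list, paste0(wl, var_list))))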
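Put together, the whole import loop might look like the sketch below. It is not a definitive implementation: select_wave is a hypothetical helper introduced here, the variable list is shortened, and two made-up tibbles stand in for the read_dta() results so that the example runs on its own. It uses any_of(), which matches whole names and skips names that are absent, and bind_rows(), which matches columns by name and fills missing ones with NA. The latter also covers the "force append" part of the question.

```r
library(dplyr)

# Shortened list for illustration; the full var_list from the question
# would be used in the same way.
var_list <- c("pidp", "jbhrs", "jbsat")

# Hypothetical helper: keep exact matches only, strip the wave prefix,
# and record the wave number.
select_wave <- function(data, wn) {
  wl <- paste0(letters[wn], "_")
  data %>%
    # Allow both unprefixed names (e.g. "pidp") and prefixed ones (e.g. "b_jbhrs").
    select(any_of(c(var_list, paste0(wl, var_list)))) %>%
    # Remove the prefix only at the start of the name.
    rename_with(~ sub(paste0("^", wl), "", .x), starts_with(wl)) %>%
    mutate(wave = wn)
}

# Made-up stand-ins for the files read with read_dta() in the question.
waves <- list(
  tibble(pidp = 1:2, a_jbhrs = c(35, 40), a_jbsat = c(5, 6)),
  tibble(pidp = 1:2, b_jbhrs = c(36, 41), b_nxtjbhrs = c(30, 38))
)

# bind_rows() aligns columns by name; variables missing in a wave become NA.
longfile <- bind_rows(lapply(seq_along(waves), function(wn) select_wave(waves[[wn]], wn)))
print(longfile)
```

With real data, each element of waves would instead be the result of read_dta() for the corresponding wave, as in the loop from the question.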
