Merging multiple Standard Eurobarometer datasets

Problem description

I am new to R and am trying to find a way to merge eight Standard Eurobarometer datasets (cross-sectional): ZA5932_v3-0-0 (November 2014), ZA6643_v4-0-0 (November 2015), ZA6788_v2-0-0 (November 2016), ZA6863_v1-0-0 (May 2017), ZA6928_v1-0-0 (November 2017), ZA6963_v1-0-0 (March 2018), ZA7489_v1-0-0 (November 2018), and finally ZA7576_v1-0-0 (June to July 2019).

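I want to include all variables from every dataset in the final data frame, which will certainly make the merged dataset large. Afterwards I will rename the variables that are common to all datasets (age, education, trust, etc.).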

After reading all the datasets (.sav) into R, I tried the following code to merge the data frames:

full_dplyr  <- full_join(Nov_2014, Nov_2015, Nov_2016, 
by = c("uniqid", "studyno1", "studyno2","doi", "version", "edition", "isocntry"), all.x = TRUE, all.y = TRUE) 

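This code performs a merge, but it only takes the first two datasets into account. I then tried putting all the data frames in a list: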

ls_df <- list(data.frame(Nov_2014),
              data.frame(Nov_2015),
              data.frame(Nov_2016),
              data.frame(May_2017),
              data.frame(Nov_2017),
              data.frame(Mar_2018),
              data.frame(Nov_2018),
              data.frame(Jun_Jul_2019))

ls_tmp <- list(data.frame(Nov_2014),
               data.frame(Nov_2015),
               data.frame(Nov_2016))


merge(ls_df, by = c("uniqid", "studyno1", "studyno2", "doi", "version", "edition",
                    "isocntry"), all = TRUE)

Then I tried the purrr package and the reduce command:

multi_full <- Reduce(function(Nov_2014, Nov_2015, Nov_2016,
                              May_2017, Nov_2017, Mar_2018, 
                              Nov_2018, Jun_Jul_2019) merge(Nov_2014, Nov_2015, all = TRUE), ls_df)

#or

multi_full <- Reduce(full_join, 
by = c("uniqid", "studyno1", "studyno2", "doi", "version", "edition", 
        "survey", "caseid", "split", "tnscntry", "country", "isocntry"), all = TRUE, ls_tmp)

#or

list(Nov_2014, Nov_2015, Nov_2016, May_2017, Nov_2017,
     Mar_2018, Nov_2018, Jun_Jul_2019) %>%
reduce(full_join, by = c("uniqid", "studyno1", "studyno2", "doi", "version", "edition", 
                         "survey", "caseid", "split", "tnscntry", "country", "isocntry"), all = TRUE)


None of these worked the way I hoped. As a last resort I even tried installing the eurobarometer and retroharmonize packages from GitHub:

devtools::install_github("antaldaniel/retroharmonize", force = TRUE)
devtools::install_github("antaldaniel/eurobarometer", force = TRUE)

library(Eurobarometer)

sav_to_rds("ZA5932_v3-0-0.sav", export = "Nov_2014")
sav_to_rds("ZA6643_v4-0-0.sav", export = "Nov_2015")


import_file_names <- c('Nov_2014','Nov_2015')

my_survey_list <- read_surveys (import_file_names, .f = 'read_rds')

my_metadata <- gesis_metadata_create(my_survey_list)
names(my_metadata)

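However, this code only produces a metadata data frame containing selected variables from the two datasets. So far, the only code that has worked for me (and only for a few data frames, not all of them) is the following: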


total_3 <- merge(Nov_2014, Nov_2015, by = c("uniqid", "studyno1", "studyno2", "doi", "version", "edition",     #This way all obsv. from both datasets are kept and the obsv. are sorted according to the variables included in the code.
                                            "survey", "caseid", "split", "tnscntry", "country",
                                            "isocntry"), all.x = TRUE, all.y = TRUE)
total_4 <- merge(total_3, Nov_2016, by = c("uniqid", "studyno1", "studyno2", "doi", "version", "edition",     #This way all obsv. from both datasets are kept and the obsv. are sorted according to the variables included in the code.
                                           "survey", "caseid", "split", "tnscntry", "country",
                                           "isocntry"), all.x = TRUE, all.y = TRUE)


total_5 <- merge(total_4, May_2017, by = c("uniqid", "studyno1", "studyno2", "doi", "version", "edition",     #This way all obsv. from both datasets are kept and the obsv. are sorted according to the variables included in the code.
                                           "survey", "caseid", "split", "tnscntry", "country",
                                           "isocntry"), all.x = TRUE, all.y = TRUE)

However, for total_5 I get the error:

Error: cannot allocate vector of size 518 Kb

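I worked around this by raising memory.limit, but that only worked once. Does anyone know how to merge all of these large data frames without losing variables or recoding them before the merge? Has anyone worked with several EB datasets together?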

Tags: r, dataframe, merge

Solution
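One thing worth checking before joining: the question describes the eight files as separate cross-sections. If that means each wave interviewed different respondents, then a uniqid in one wave does not identify the same person in another, and a key-based full join mostly pairs up unrelated rows while multiplying columns. Under that assumption, stacking the waves row-wise may be closer to what is wanted. The sketch below uses base R only; the toy data frames and the stack_waves helper are hypothetical illustrations, not real Eurobarometer values or part of any package.

```r
# Toy stand-ins for two survey waves (hypothetical data, not real values).
# Each wave has its own respondents; only some variables are shared.
nov_2014 <- data.frame(uniqid = 1:3, isocntry = c("DE", "FR", "IT"),
                       qa1 = c(1, 2, 1), stringsAsFactors = FALSE)
nov_2015 <- data.frame(uniqid = 1:2, isocntry = c("DE", "ES"),
                       qb7 = c(4, 5), stringsAsFactors = FALSE)

# Hypothetical helper: stack a named list of data frames row-wise.
# Variables that a wave does not contain are filled with NA, and a
# "wave" column records which data frame each row came from.
stack_waves <- function(waves) {
  all_vars <- unique(unlist(lapply(waves, names)))
  padded <- Map(function(df, wave_name) {
    for (v in setdiff(all_vars, names(df))) {
      df[[v]] <- NA                     # add the absent variable as NA
    }
    df$wave <- wave_name                # keep track of the source wave
    df[c("wave", all_vars)]             # same column order in every wave
  }, waves, names(waves))
  out <- do.call(rbind, unname(padded))
  rownames(out) <- NULL
  out
}

stacked <- stack_waves(list(Nov_2014 = nov_2014, Nov_2015 = nov_2015))
print(stacked)
```

With the real data the list would hold all eight data frames. Because no rows are matched against each other, this should need far less memory than repeated calls to merge. If dplyr is already loaded, dplyr::bind_rows() on a named list with .id = "wave" does much the same thing, including filling missing columns with NA.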

