r - 如何在允许变量之间存在时间滞后的同时合并 2 个数据集?
问题描述
我无法正确合并两个数据集。理想情况下,对于每个国家/地区,year.y 变量应该比 year.x 变量晚 3 年,以解释滞后效应。但是,当我尝试合并时,year.y 似乎是随机分配的。此外,该代码还会导致行数不必要地增加,因为该特定国家/地区的 year.x 与 every year.y 配对,而它只需要与单个年份 (year.x +3) 配对。
> dput(head(corruption))
#Corruption dataset
structure(list(Jurisdiction_c = c("Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"),
year = c("X2001_c", "X2002_c", "X2003_c", "X2004_c", "X2005_c",
"X2006_c"), cpi = c(NA, NA, NA, NA, "2.5", NA)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
> dput(head(resource_wealth))
#resource wealth dataset
structure(list(Country.Name_r = c("Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"),
year = c("X1998_r", "X1999_r", "X2000_r", "X2001_r", "X2002_r",
"X2003_r"), resource_percentage = c(NA, NA, NA, NA, 1.11398250245278,
0.719357369114903)), row.names = c(NA, 6L), class = "data.frame")
这是我用来合并的行:
dataset_final <- merge(x = resource_wealth, y = corruption, by.x = "Country.Name_r", by.y = "Jurisdiction_c", all.x = TRUE)
> dput(head(dataset_final))
structure(list(Country.Name_r = c("Afghanistan", "Afghanistan",
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"),
year.x = c("X2005_r", "X2005_r", "X2005_r", "X2005_r", "X2005_r",
"X2005_r"), resource_percentage = c(0.38440433910658, 0.38440433910658,
0.38440433910658, 0.38440433910658, 0.38440433910658, 0.38440433910658
), year.y = c("X2012_c", "X2015_c", "X2007_c", "X2003_c",
"X2001_c", "X2002_c"), cpi = c("8", "11", "1.8", NA, NA,
NA)), row.names = c(NA, 6L), class = "data.frame")
这是期望的结果:
1 Afghanistan X1998_r <NA> X2001_c <NA>
2 Afghanistan X1999_r <NA> X2002_c <NA>
3 Afghanistan X2000_r <NA> X2003_c <NA>
4 Afghanistan X2001_r <NA> X2004_c <NA>
5 Afghanistan X2002_r 1.113983e+00 X2005_c 2.5
6 Afghanistan X2003_r 7.193574e-01 X2006_c <NA>
解决方案
由于您只剩下一个键加入,它将创建具有所有年份值的所有匹配观察。year_y
实现预期结果的一种方法是从列中提取年份值并仅保留is的那些年份year_x + 3
。
library(dplyr)
left_join(resource_wealth, corruption,
by = c('Country.Name_r' = 'Jurisdiction_c')) %>%
mutate(year_x = as.numeric(gsub('\\D', '', year.x)),
year_y = as.numeric(gsub('\\D', '', year.y))) %>%
filter(year_y == (year_x + 3))
# Country.Name_r year.x resource_percentage year.y cpi year_x year_y
#1 Afghanistan X1998_r NA X2001_c <NA> 1998 2001
#2 Afghanistan X1999_r NA X2002_c <NA> 1999 2002
#3 Afghanistan X2000_r NA X2003_c <NA> 2000 2003
#4 Afghanistan X2001_r NA X2004_c <NA> 2001 2004
#5 Afghanistan X2002_r 1.1139825 X2005_c 2.5 2002 2005
#6 Afghanistan X2003_r 0.7193574 X2006_c <NA> 2003 2006
然后,您可以使用 仅选择那些需要的列select
。
推荐阅读
- optimization - 使用 NEON 内在函数的代码性能缓慢
- python - Twilio python烧瓶尝试获取状态回调事件,但它返回“无”
- typo3 - Typo3 widget.paginate 为下一页创建了一个损坏的 url
- google-cloud-platform - GCP Dataproc:创建集群并激活堆栈驱动程序
- amazon-web-services - 使用 Pyspark 读取分区数据时输入文件名为空
- python - Python 使用驱动程序 selenium 将焦点从 Web 浏览器切换回终端
- r - R - 使用累积部分计算(不是累计)填充数据帧的函数或脚本
- node.js - 将套接字服务器与 API 服务器分开是一种好习惯吗?
- godot - 检测 godot 内 3d 对象上的点击/触摸屏输入
- assembly - x86 在执行指令和读/写数据时如何处理字节寻址和字寻址?