R: Combine data frames by multiple IDs, aligning on the second ID where it matches and inserting NA otherwise

Question

I have two data frames of strings that look like this:

df1
firm_id    user_id_source1
"1234"     "abc"
"1234"     "bcd"
"1234"     "cde"
"1234"     "def"
"1234"     "fgh"
"2345"     "zyx"
"2345"     "yxw"
"2345"     "xwv"
"2345"     "wvu"
"2345"     "vut"

df2
firm_id    user_id_source2
"1234"     "abc"
"1234"     "bcd"
"1234"     "ghi"
"1234"     "hij"
"2345"     "zyx"
"2345"     "yxw"
"2345"     "utr"

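There are many firms, and each firm has many user IDs. I want to combine the two data frames so that matching user IDs line up in the same row, followed by the remaining unmatched IDs, without losing any data, because the source of each user ID matters. Like this: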

result
firm_id    user_id_source1    user_id_source2
"1234"     "abc"               "abc"
"1234"     "bcd"               "bcd"
"1234"     "cde"               NA
"1234"     "def"               NA
"1234"     "fgh"               NA
"1234"     NA                  "ghi"
"1234"     NA                  "hij"
"2345"     "zyx"               "zyx"
"2345"     "yxw"               "yxw"
"2345"     "xwv"               NA
"2345"     "wvu"               NA
"2345"     "vut"               NA
"2345"     NA                  "utr"

I have tried every type of join and merge but can't figure it out. Any help is much appreciated. Thank you.

Edit: r-schifini's answer below solved the problem. Thank you. I then added

library(dplyr)
df <- df %>% arrange(firm_id, is.na(user_id_source1), is.na(user_id_source2))

to sort the rows into the desired order.

Edit 2: I was already using dplyr, so this wasn't an issue for me, but someone may want to stay in base R. Here is how to do the same ordering in base R:


df_ordered <- df[order(df$firm_id, is.na(df$user_id_source1), is.na(df$user_id_source2)),]
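Both orderings rely on the fact that order() sorts FALSE before TRUE, so using is.na(...) as a sort key pushes the rows containing NA to the end of each firm's group. A minimal sketch on a toy vector (x is made up for illustration and is not part of the original data):

```r
# Toy vector, invented for illustration only
x <- c(NA, "b", NA, "a")

# is.na(x) is TRUE for missing values; FALSE sorts before TRUE,
# so the non-missing entries come first and ties keep their original order
non_na_first <- x[order(is.na(x))]
print(non_na_first)
```

Note that this only moves the NA entries to the end; it does not sort the non-missing values alphabetically.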

Tags: r, join, merge

Answer


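You need to join the data frames on two columns: firm_id and user_id (1 and 2). Because the user_id columns have different names and you want to keep both of them, do the following: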

df1 = data.frame(firm_id = c('1234','1234','5678','5678'),user_id_1 = c('abc','bcd','abc','xyz'))
df2 = data.frame(firm_id = c('1234','3456','5678'),user_id_2 = c('abc','bcd','xyz'))

# Create a copy of the user_id columns with a common name: 'user'
df1$user = df1$user_id_1
df2$user = df2$user_id_2

# do a full merge (they will be merged by 'firm_id' and 'user')
df = merge(df1, df2, all = TRUE)

# remove the 'user' column
df$user = NULL

Result:

> df1
  firm_id user_id_1 user
1    1234       abc  abc
2    1234       bcd  bcd
3    5678       abc  abc
4    5678       xyz  xyz

> df2
  firm_id user_id_2 user
1    1234       abc  abc
2    3456       bcd  bcd
3    5678       xyz  xyz

> df
  firm_id user_id_1 user_id_2
1    1234       abc       abc
2    1234       bcd      <NA>
3    5678       abc      <NA>
4    5678       xyz       xyz
5    3456      <NA>       bcd
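Applied to the data frames from the question, the same approach might look like the sketch below. It assumes the column names user_id_source1 and user_id_source2 from the question, spells out the by argument explicitly (the answer relies on merge() picking the common columns by default), and adds the base R ordering step from the question's second edit.

```r
# Data from the question
df1 <- data.frame(
  firm_id = rep(c("1234", "2345"), each = 5),
  user_id_source1 = c("abc", "bcd", "cde", "def", "fgh",
                      "zyx", "yxw", "xwv", "wvu", "vut"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  firm_id = c(rep("1234", 4), rep("2345", 3)),
  user_id_source2 = c("abc", "bcd", "ghi", "hij", "zyx", "yxw", "utr"),
  stringsAsFactors = FALSE
)

# Temporary join key with a common name, so both original columns survive
df1$user <- df1$user_id_source1
df2$user <- df2$user_id_source2

# Full outer join on firm_id and the temporary key
result <- merge(df1, df2, by = c("firm_id", "user"), all = TRUE)
result$user <- NULL

# Per firm: matched rows first, then source1-only rows, then source2-only rows
result <- result[order(result$firm_id,
                       is.na(result$user_id_source1),
                       is.na(result$user_id_source2)), ]
rownames(result) <- NULL
print(result)
```

The order of rows inside each group (for example, the two matched rows of a firm) follows whatever order merge() returned, so it may differ from the hand-written table in the question.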
