首页 > 解决方案 > 使用 R 删除基于每个组的重复数据

问题描述

我有一个数据集,其中包含员工 ID、姓名和他们的银行账户信息。其中一些员工具有相同的员工 ID 或具有相同员工姓名的不同员工 ID 的重复名称。这些员工中很少有同名的银行账户信息相同,而有些员工的同名银行账户信息不同。目的是找到那些姓名相同但银行帐号不同的员工。以下是数据示例:

| Emp_id |   Name  | Bank Account |
|--------|:-------:|-------------:|
| 123    |   Joan  |         6758 |
| 134    |  Karyn  |         1244 |
| 143    | Larry   | 4900         |
| 143    | Larry   | 5201         |
| 235    | Larry   | 5201         |
| 433    | Larry   | 5201         |
| 231    | Larry   | 5201         |
| 120    | Amy     | 7890         |
| 135    | Amy     | 7890         |
| 150    |  Chris  | 1280         |
| 150    | Chris   | 6565         |
| 900    | Cassy   | 1280         |
| 900    | Cassy   | 9873         |

我必须根据他们的名字找到重复的员工,我可以成功地做到这一点。完成后,我必须识别具有相同姓名但不同银行帐号的员工。现在的问题是它没有根据姓名对员工进行分组并搜索不同的银行账户。相反,它正在查找不同个人的帐号,如果发现相同,则删除其中一个重复值。例如,Chris 和 Cassy 具有相同的银行帐号“1280”,因此它会将其识别为相同并自动删除 Chris 的记录之一(输出中的银行帐号编号为 1280)。我得到的输出如下所示:

| Emp_id |  Name | Bank Account |
|--------|:-----:|-------------:|
| 120    |  Amy  |         7890 |
| 900    | Cassy |         1280 |
| 900    | Cassy | 9873         |
| 150    | Chris | 6565         |
| 143    | Larry | 4900         |
| 143    | Larry | 5201         |

这是我遵循的代码:

sample=data.frame(Id=c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
Name=c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
Bank_Account=c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873"))
n_occur <- data.frame(table(sample$Name))
n_occur=n_occur[n_occur$Freq > 1,]
Duplicates=sample[sample$Name %in% n_occur$Var1[n_occur$Freq > 1],]
Duplicates=Duplicates %>% arrange(Duplicates$Name, Duplicates$Name)
Duplicates=Duplicates[!duplicated(Duplicates$Bank_Account),]

然而,实际输出应该考虑每个名称(同名)中的银行账户号。输出应如下所示:

| Emp_id |   Name  | Bank Account |
|--------|:-------:|-------------:|
| 900    |  Cassy  |1280          |
| 900    |  Cassy  |9873          |
| 150    |  Chris  | 1280         |
| 150    | Chris   | 6565         |
| 143    | Larry   | 4900         |
| 143    | Larry   | 5201         |

有人可以指导我找到正确的代码吗?

标签: rgroup-byduplicates

解决方案


我们可以n_distinct使用filter

library(dplyr)
sample %>% 
    group_by(Name) %>%
    filter(n() > 1) %>%
    group_by(Id, add = TRUE) %>% 
    filter(n_distinct(Bank_Account) > 1) %>%
    arrange(desc(Id))
# A tibble: 6 x 3
# Groups:   Name, Id [3]
#  Id    Name  Bank_Account
#  <fct> <fct> <fct>       
#1 900   Cassy 1280        
#2 900   Cassy 9873        
#3 150   Chris 1280        
#4 150   Chris 6565        
#5 143   Larry 4900        
#6 143   Larry 5201      

推荐阅读