首页 > 解决方案 > dplyr:如果多行在列中具有相同的值,则按条件选择行

问题描述

在我的数据中,有几行具有相同id但不同的id2日期。我想使用最新且最大的行id2

例如,在下面的数据集中,第 13 到 15 行是最新的,并且在id2具有相同日期的那些中最大。

可能有多个人可以满足这个条件,这就是为什么我想使用从 13 到 15 的行,而不是其中一个。

id <- c("id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590",
"id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590", "id7590",
"id7590")
id2 <- c("n0960999", "n0960999", "n0960999", "n0961001", "n0961001", "n0961001", "n0961002", "n0961002",
"n0961002", "n0961003", "n0961003", "n0961003", "n0961004", "n0961004", "n0961004", "n0961183",
"n0961183", "n0961183", "n0961184", "n0961184", "n0961184")
date <- c("1980-06-24", "1980-06-24", "1980-06-24", "1980-06-25", "1980-06-25", "1980-06-25", "1980-06-25",
"1980-06-25", "1980-06-25", "1980-06-25", "1980-06-25", "1980-06-25", "1980-06-25", "1980-06-25",
"1980-06-25", "1980-09-24", "1980-09-24", "1980-09-24", "1980-09-24", "1980-09-24", "1980-09-24")
people <- c("14029", "3078", "7333", "14029", "7333", "3078", "7333", "14029", "3078", "7333", "14029", "3078",
"3078", "14029", "7333", "7333", "14029", "3078", "14029", "3078", "7333")
tibble(id=id, id2=id2, people=people, date=date)
       id      id2 people       date
1  id7590 n0960999  14029 1980-06-24
2  id7590 n0960999   3078 1980-06-24
3  id7590 n0960999   7333 1980-06-24
4  id7590 n0961001  14029 1980-06-25
5  id7590 n0961001   7333 1980-06-25
6  id7590 n0961001   3078 1980-06-25
7  id7590 n0961002   7333 1980-06-25
8  id7590 n0961002  14029 1980-06-25
9  id7590 n0961002   3078 1980-06-25
10 id7590 n0961003   7333 1980-06-25
11 id7590 n0961003  14029 1980-06-25
12 id7590 n0961003   3078 1980-06-25
13 id7590 n0961004   3078 1980-06-25
14 id7590 n0961004  14029 1980-06-25
15 id7590 n0961004   7333 1980-06-25
16 id7590 n0961183   7333 1980-09-24
17 id7590 n0961183  14029 1980-09-24
18 id7590 n0961183   3078 1980-09-24
19 id7590 n0961184  14029 1980-09-24
20 id7590 n0961184   3078 1980-09-24
21 id7590 n0961184   7333 1980-09-24

我发现了一个类似的 SQL 问题,但我想知道如何处理dplyr.

标签: rdplyr

解决方案


您可以使用group_byandtop_n来解决这个问题:

mydf <- tibble(id = id, id2 = id2, people = people, date = date) 

mydf %>% 
  group_by(id, date) %>% 
  top_n(1, id2)

# A tibble: 9 x 4
# Groups:   id, date [3]
#   id     id2      people date      
#   <chr>  <chr>    <chr>  <chr>     
# 1 id7590 n0960999 14029  1980-06-24
# 2 id7590 n0960999 3078   1980-06-24
# 3 id7590 n0960999 7333   1980-06-24
# 4 id7590 n0961004 3078   1980-06-25
# 5 id7590 n0961004 14029  1980-06-25
# 6 id7590 n0961004 7333   1980-06-25
# 7 id7590 n0961184 14029  1980-09-24
# 8 id7590 n0961184 3078   1980-09-24
# 9 id7590 n0961184 7333   1980-09-24

推荐阅读