Remove duplicates using datetime and numeric columns

Problem description

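I am trying to get the first transaction for each person (the first transaction must not be NA unless that person has no transactions at all). So far I can only get the row with the earliest date, which is not what I want. How can I get the desired output?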

Current data frame

  Name                Date Transaction
1 John 2020/09/27 10:25:00         100
2 John 2020/09/27 11:30:00          NA
3 John 2020/09/27 12:30:00         250
4 Adam 2020/07/21 14:00:00          NA
5 Adam 2020/07/21 14:25:00         400
6 Adam 2020/07/21 14:45:00         200
7  Tom 2020/07/21 11:00:00          NA
8  Tom 2020/07/21 14:30:00          NA
9  Tom 2020/07/21 14:30:00          NA

Desired output

  Name                Date Transaction
1 John 2020/09/27 10:25:00         100
2 Adam 2020/07/21 14:25:00         400
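3  Tom 2020/07/21 11:00:00          NA

Reproducible data (output of dput(); note that the third column is named FirstTransaction here, while the tables above call it Transaction)
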
structure(list(Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 3L, 
3L, 3L), .Label = c("Adam", "John", "Tom"), class = "factor"), 
    Date = structure(c(6L, 7L, 8L, 2L, 3L, 5L, 1L, 4L, 4L), .Label = c("2020/07/21 11:00:00", 
    "2020/07/21 14:00:00", "2020/07/21 14:25:00", "2020/07/21 14:30:00", 
    "2020/07/21 14:45:00", "2020/09/27 10:25:00", "2020/09/27 11:30:00", 
    "2020/09/27 12:30:00"), class = "factor"), FirstTransaction = c(100, 
    NA, 250, NA, 400, 200, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-9L))

Tags: r, datetime

Solution


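You can try the following approach using dplyr.

Arrange the data by Name and Date. Then, for each Name, return the first row if all values of Transaction are NA; otherwise return the first non-NA row.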

library(dplyr)

df %>%
  mutate(Date = lubridate::ymd_hms(Date)) %>%
  arrange(Name, Date) %>%
  group_by(Name) %>%
  slice(if(all(is.na(Transaction))) 1L else which(!is.na(Transaction))[1])
  

#  Name  Date                Transaction
#  <chr> <dttm>                    <int>
#1 Adam  2020-07-21 14:25:00         400
#2 John  2020-09-27 10:25:00         100
#3 Tom   2020-07-21 11:00:00          NA
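
The index expression passed to slice() can be checked on its own with plain vectors. This is only a small sketch: first_row_index is a hypothetical helper name introduced here for illustration, and the input vectors mimic the Transaction values of Adam, Tom and John after sorting by Date.

```r
# Hypothetical helper wrapping the expression used inside slice() above.
# Returns 1 when every value is NA, otherwise the position of the first
# non-NA value.
first_row_index <- function(x) {
  if (all(is.na(x))) 1L else which(!is.na(x))[1]
}

adam_idx <- first_row_index(c(NA, 400, 200))   # first non-NA is in position 2
tom_idx  <- first_row_index(c(NA, NA, NA))     # all NA, falls back to row 1
john_idx <- first_row_index(c(100, NA, 250))   # first value is already non-NA

c(Adam = adam_idx, Tom = tom_idx, John = john_idx)
```

Because the rows were arranged by Date beforehand, the position returned for each group points at the earliest row that has a transaction, or at the earliest row overall when there is none.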

A shorter and simpler way, suggested by @Paul, is:

df %>%
  mutate(Date = lubridate::ymd_hms(Date)) %>%
  arrange(Name, is.na(Transaction), Date) %>%
  group_by(Name) %>%
  slice(1L)
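
The shorter version works because is.na(Transaction) is FALSE for real values and TRUE for NA, and FALSE sorts before TRUE, so within each Name the rows with a transaction come first. The same idea can be sketched in base R without dplyr. This is an illustration under assumptions, not part of the original answer: the data frame is rebuilt by hand with the column named Transaction (the dput() output in the question calls it FirstTransaction), and the UTC time zone is chosen arbitrarily.

```r
# Rebuild the example data by hand; column name Transaction is an assumption
# matching the printed tables rather than the dput() output.
df <- data.frame(
  Name = c("John", "John", "John", "Adam", "Adam", "Adam", "Tom", "Tom", "Tom"),
  Date = c("2020/09/27 10:25:00", "2020/09/27 11:30:00", "2020/09/27 12:30:00",
           "2020/07/21 14:00:00", "2020/07/21 14:25:00", "2020/07/21 14:45:00",
           "2020/07/21 11:00:00", "2020/07/21 14:30:00", "2020/07/21 14:30:00"),
  Transaction = c(100, NA, 250, NA, 400, 200, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Parse the timestamps (time zone chosen arbitrarily for the example).
df$Date <- as.POSIXct(df$Date, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Sort by Name, then non-NA transactions before NA ones, then by Date.
ord <- order(df$Name, is.na(df$Transaction), df$Date)
sorted <- df[ord, ]

# Keep only the first row of each Name.
first_tx <- sorted[!duplicated(sorted$Name), ]
rownames(first_tx) <- NULL
first_tx
```

With this data, first_tx holds one row per person: Adam at 14:25 with 400, John at 10:25 with 100, and Tom at 11:00 with NA.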
