Removing particular rows in a dataframe with pre-defined conditions

Problem description

I have a data frame with columns

    shipment_id     created_at    picked_at   packed_at   shipped_at
    CSDJKH231BN     2019-02-03    2019-02-03    
    CSDJKH231BN     2019-02-03    2019-02-03  2019-02-04  2019-02-05
    CSDJKH2KFJ3     2019-02-01    2019-02-04  2019-02-07  

The database is uploaded to rServer via Google Drive and is constantly being updated.

    u1 <- "https://docs.google.com/spreadsheets/d/e/"link""
    tc1 <- getURL(u1, ssl.verifypeer=FALSE)
    x <- read.csv(textConnection(tc1))

In the first update, shipment_id CSDJKH231BN was filled in only up to picked_at; in a second update from Google Drive we get CSDJKH231BN filled in up to shipped_at. How do I keep only the row for CSDJKH231BN that goes up to shipped_at, while also keeping shipment_ids like CSDJKH2KFJ3 that are still being processed and have not been shipped yet?

Basically I just want to delete the duplicate entries, but this code is not working for me:

    df <- df[!duplicated(df), ]

Any help would be appreciated.
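(To see why the code above leaves both rows, here is a minimal reproduction using the sample data from the question: `duplicated()` compares entire rows, so rows that differ in any column are not flagged as duplicates.)

```r
# Sample data as shown in the question; blank strings stand in for
# missing timestamps.
df <- data.frame(
  shipment_id = c("CSDJKH231BN", "CSDJKH231BN", "CSDJKH2KFJ3"),
  created_at  = c("2019-02-03", "2019-02-03", "2019-02-01"),
  picked_at   = c("2019-02-03", "2019-02-03", "2019-02-04"),
  packed_at   = c("", "2019-02-04", "2019-02-07"),
  shipped_at  = c("", "2019-02-05", ""),
  stringsAsFactors = FALSE
)

# The two CSDJKH231BN rows differ in packed_at/shipped_at, so neither
# is an exact duplicate of the other and nothing is removed:
df[!duplicated(df), ]   # still 3 rows
```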

Tags: r, dataframe, dplyr

Solution


I think you just need to specify that you're looking for duplicates in shipment_id. However, that alone would keep the first version of each row, which might have nothing in the shipped_at column. So you first need to sort the data frame by the shipped_at and packed_at columns (in decreasing order, so that blank values sort to the bottom). Does this work?

    # Sort so rows with the most progress come first, then keep only the
    # first row for each shipment_id
    df <- df[order(df[, 'shipped_at'], df[, 'packed_at'], decreasing = TRUE), ]
    df <- df[!duplicated(df$shipment_id), ]
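Since the question is tagged dplyr, the same sort-then-deduplicate idea can be sketched with dplyr (a sketch, assuming blank strings mark missing timestamps, as in the sample data):

```r
library(dplyr)

# Rows whose shipped_at (then packed_at) is non-blank sort first within
# each shipment_id; distinct() then keeps the first row per id.
df_latest <- df %>%
  arrange(shipment_id, desc(shipped_at != ""), desc(packed_at != "")) %>%
  distinct(shipment_id, .keep_all = TRUE)
```

`distinct(..., .keep_all = TRUE)` retains all columns of the first row seen for each shipment_id, which after the `arrange()` is the most complete one.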
