首页 > 解决方案 > 如何对来自多个数据框的数据进行分组

问题描述

我有 4 个数据框。

数据框:取货地址

ID    x1     x2      x3       x4         x5    x6.....x1090
1001  Place1 Place2  NA       NA         NA    NA
1002  Place1 Place2  Place 3  Place4     NA    NA
1003  Place5 Place3  Place 2  Place 2    NA    NA
1004  Place6 Place7  NA       NA         NA    NA 

数据框:丢弃地址

ID    x1     x2      x3       x4         x5    x6.....x1090
1001  Place2 Place1  NA       NA         NA    NA
1002  Place2 Place1  Place4   Place3     NA    NA
1003  Place3 Place5  Place6   Place7     NA    NA
1004  Place7 Place6  NA       NA         NA    NA 

数据框:日期

ID    x1         x2         x3        x4         x5    x6.....x1090
1001  4/1/2020   4/1/2020   NA        NA         NA    NA
1002  4/1/2020   4/1/2020   4/3/2020  4/3/2020   NA    NA
1003  4/3/2020   4/3/2020   4/4/2020  4/5/2020   NA    NA
1004  4/5/2020   4/5/2020   NA        NA         NA    NA 

数据框:时间

    ID    x1           x2           x3            x4           x5    x6.....x1090
    1001  8:00:00 AM   4:00:00 PM   NA            NA           NA    NA
    1002  9:00:05 AM  12:35:05 PM   11:00:00 AM   4:00:00 AM   NA    NA
    1003  3:00:00 PM   6:00:00 PM   7:00:00 AM    3:00:00 PM   NA    NA
    1004  7:00:00 AM   3:00:00 PM   NA            NA           NA    NA 

我想按日期对这些旅行进行分组并找到旅行链。输出将是这样的:

ID    Date      Tripchain                      Time                 
1001  4/1/2020  place 1- place 2- place 1      Morning- Afternoon
1002  4/1/2020  place 1- place 2- place 1      Morning -Afternoon
1002  4/3/2020  place 3- place 4- place 3      Morning -Afternoon
1003  4/3/2020  place 5- place 3- place 5      Afternoon-Afternoon
1003  4/4/2020  place 2- place 7               Morning
1003  4/5/2020  place 2- place 7               Afternoon
1004  4/5/2020  place 6- place 7- place 6      Morning-Afternoon

标签: rgrouping

解决方案


首先,我认为您place1-place2-...在字符串中折叠的格式会严重扩展。如果它仅用于报告(从不使用该格式的数据),那很好,但否则我建议您将其分开。

这里要做的第一件事是将单个帧转换为组合的长格式。我会更进一步,将日期/时间转换为 R-nativePOSIXt对象:

dat <- list(
  pivot_longer(pickup, -ID, names_to = "x", values_to = "pickup"),
  pivot_longer(dropoff, -ID, names_to = "x", values_to = "dropoff"),
  pivot_longer(dates, -ID, names_to = "x", values_to = "date"),
  pivot_longer(times, -ID, names_to = "x", values_to = "time")
) %>%
  Reduce(function(a, b) full_join(a, b, by = c("ID", "x")), .) %>%
  filter(complete.cases(.)) %>%
  mutate(
    timestamp = as.POSIXct(paste(date, time), format = "%m/%d/%Y %I:%M:%S_%p"),
    date = as.Date(timestamp)
  ) %>%
  select(-x, -time)
dat
# # A tibble: 12 x 5
#       ID pickup dropoff date       timestamp              
#    <int> <chr>  <chr>   <date>     <dttm>                 
#  1  1001 Place1 Place2  2020-04-01 2020-04-01 08:00:00.000
#  2  1001 Place2 Place1  2020-04-01 2020-04-01 16:00:00.000
#  3  1002 Place1 Place2  2020-04-01 2020-04-01 09:00:05.000
#  4  1002 Place2 Place1  2020-04-01 2020-04-01 12:35:05.000
#  5  1002 Place3 Place4  2020-04-03 2020-04-03 11:00:00.000
#  6  1002 Place4 Place3  2020-04-03 2020-04-03 04:00:00.000
#  7  1003 Place5 Place3  2020-04-03 2020-04-03 15:00:00.000
#  8  1003 Place3 Place5  2020-04-04 2020-04-03 18:00:00.000
#  9  1003 Place2 Place6  2020-04-04 2020-04-04 07:00:00.000
# 10  1003 Place2 Place7  2020-04-05 2020-04-05 15:00:00.000
# 11  1004 Place6 Place7  2020-04-05 2020-04-05 07:00:00.000
# 12  1004 Place7 Place6  2020-04-05 2020-04-05 15:00:00.000

从这里到你想去的地方,我想我们可以对它们进行排序(按时间戳),对它们进行分组(按 id),然后组合。哦,并转换为人类可读的“时间”。

time2human <- function(x) {
  as.character(
    cut(as.numeric(format(x, format = "%H")), c(-1, 4, 12, 17, 20, 25),
        labels = c("Night", "Morning", "Afternoon", "Evening", "Night"))
  )
}

dat %>%
  arrange(timestamp) %>%
  group_by(ID, date) %>%
  summarize(
    Tripchain = paste(c(pickup, last(dropoff)), collapse = " - "),
    Timeframe = paste(time2human(min(timestamp)), time2human(max(timestamp)),
                      sep = " - "),
    Time1 = first(timestamp), Time2 = last(timestamp)
  ) %>%
  ungroup()
# # A tibble: 7 x 6
#      ID date       Tripchain                Timeframe             Time1                   Time2                  
#   <int> <date>     <chr>                    <chr>                 <dttm>                  <dttm>                 
# 1  1001 2020-04-01 Place1 - Place2 - Place1 Morning - Afternoon   2020-04-01 08:00:00.000 2020-04-01 16:00:00.000
# 2  1002 2020-04-01 Place1 - Place2 - Place1 Morning - Morning     2020-04-01 09:00:05.000 2020-04-01 12:35:05.000
# 3  1002 2020-04-03 Place4 - Place3 - Place4 Night - Morning       2020-04-03 04:00:00.000 2020-04-03 11:00:00.000
# 4  1003 2020-04-03 Place5 - Place3          Afternoon - Afternoon 2020-04-03 15:00:00.000 2020-04-03 15:00:00.000
# 5  1003 2020-04-04 Place3 - Place2 - Place6 Evening - Morning     2020-04-03 18:00:00.000 2020-04-04 07:00:00.000
# 6  1003 2020-04-05 Place2 - Place7          Afternoon - Afternoon 2020-04-05 15:00:00.000 2020-04-05 15:00:00.000
# 7  1004 2020-04-05 Place6 - Place7 - Place6 Morning - Afternoon   2020-04-05 07:00:00.000 2020-04-05 15:00:00.000

我保留了Time1Time2以防您仍然需要完整的时间戳。您将需要更新该time2human功能,以便切割更符合您的喜好。(我有 -1 和 +25 的“小时”,以确保我完全限制所有可能的时间;我可以更精确和小心,但这已经足够安全了。)

最后,如果您更喜欢以您开始使用的格式设置日期和时间,请使用它format(..., format=...)来获得您想要的;我更喜欢(并建议)尽可能长时间地保持它们的格式,因为它们本质上是数字的,因此很容易比较/排序DatePOSIXt一旦它们转换为字符串,这样的操作就变得不那么清晰了(例如尝试在4/1/2019之前排序3/1/2020)。


推荐阅读