r - 如何对来自多个数据框的数据进行分组
问题描述
我有 4 个数据框。
数据框:取货地址
ID x1 x2 x3 x4 x5 x6.....x1090
1001 Place1 Place2 NA NA NA NA
1002 Place1 Place2 Place 3 Place4 NA NA
1003 Place5 Place3 Place 2 Place 2 NA NA
1004 Place6 Place7 NA NA NA NA
数据框:丢弃地址
ID x1 x2 x3 x4 x5 x6.....x1090
1001 Place2 Place1 NA NA NA NA
1002 Place2 Place1 Place4 Place3 NA NA
1003 Place3 Place5 Place6 Place7 NA NA
1004 Place7 Place6 NA NA NA NA
数据框:日期
ID x1 x2 x3 x4 x5 x6.....x1090
1001 4/1/2020 4/1/2020 NA NA NA NA
1002 4/1/2020 4/1/2020 4/3/2020 4/3/2020 NA NA
1003 4/3/2020 4/3/2020 4/4/2020 4/5/2020 NA NA
1004 4/5/2020 4/5/2020 NA NA NA NA
数据框:时间
ID x1 x2 x3 x4 x5 x6.....x1090
1001 8:00:00 AM 4:00:00 PM NA NA NA NA
1002 9:00:05 AM 12:35:05 PM 11:00:00 AM 4:00:00 AM NA NA
1003 3:00:00 PM 6:00:00 PM 7:00:00 AM 3:00:00 PM NA NA
1004 7:00:00 AM 3:00:00 PM NA NA NA NA
我想按日期对这些旅行进行分组并找到旅行链。输出将是这样的:
ID Date Tripchain Time
1001 4/1/2020 place 1- place 2- place 1 Morning- Afternoon
1002 4/1/2020 place 1- place 2- place 1 Morning -Afternoon
1002 4/3/2020 place 3- place 4- place 3 Morning -Afternoon
1003 4/3/2020 place 5- place 3- place 5 Afternoon-Afternoon
1003 4/4/2020 place 2- place 7 Morning
1003 4/5/2020 place 2- place 7 Afternoon
1004 4/5/2020 place 6- place 7- place 6 Morning-Afternoon
解决方案
首先,我认为您place1-place2-...
在字符串中折叠的格式会严重扩展。如果它仅用于报告(从不使用该格式的数据),那很好,但否则我建议您将其分开。
这里要做的第一件事是将单个帧转换为组合的长格式。我会更进一步,将日期/时间转换为 R-nativePOSIXt
对象:
dat <- list(
pivot_longer(pickup, -ID, names_to = "x", values_to = "pickup"),
pivot_longer(dropoff, -ID, names_to = "x", values_to = "dropoff"),
pivot_longer(dates, -ID, names_to = "x", values_to = "date"),
pivot_longer(times, -ID, names_to = "x", values_to = "time")
) %>%
Reduce(function(a, b) full_join(a, b, by = c("ID", "x")), .) %>%
filter(complete.cases(.)) %>%
mutate(
timestamp = as.POSIXct(paste(date, time), format = "%m/%d/%Y %I:%M:%S_%p"),
date = as.Date(timestamp)
) %>%
select(-x, -time)
dat
# # A tibble: 12 x 5
# ID pickup dropoff date timestamp
# <int> <chr> <chr> <date> <dttm>
# 1 1001 Place1 Place2 2020-04-01 2020-04-01 08:00:00.000
# 2 1001 Place2 Place1 2020-04-01 2020-04-01 16:00:00.000
# 3 1002 Place1 Place2 2020-04-01 2020-04-01 09:00:05.000
# 4 1002 Place2 Place1 2020-04-01 2020-04-01 12:35:05.000
# 5 1002 Place3 Place4 2020-04-03 2020-04-03 11:00:00.000
# 6 1002 Place4 Place3 2020-04-03 2020-04-03 04:00:00.000
# 7 1003 Place5 Place3 2020-04-03 2020-04-03 15:00:00.000
# 8 1003 Place3 Place5 2020-04-04 2020-04-03 18:00:00.000
# 9 1003 Place2 Place6 2020-04-04 2020-04-04 07:00:00.000
# 10 1003 Place2 Place7 2020-04-05 2020-04-05 15:00:00.000
# 11 1004 Place6 Place7 2020-04-05 2020-04-05 07:00:00.000
# 12 1004 Place7 Place6 2020-04-05 2020-04-05 15:00:00.000
从这里到你想去的地方,我想我们可以对它们进行排序(按时间戳),对它们进行分组(按 id),然后组合。哦,并转换为人类可读的“时间”。
time2human <- function(x) {
as.character(
cut(as.numeric(format(x, format = "%H")), c(-1, 4, 12, 17, 20, 25),
labels = c("Night", "Morning", "Afternoon", "Evening", "Night"))
)
}
dat %>%
arrange(timestamp) %>%
group_by(ID, date) %>%
summarize(
Tripchain = paste(c(pickup, last(dropoff)), collapse = " - "),
Timeframe = paste(time2human(min(timestamp)), time2human(max(timestamp)),
sep = " - "),
Time1 = first(timestamp), Time2 = last(timestamp)
) %>%
ungroup()
# # A tibble: 7 x 6
# ID date Tripchain Timeframe Time1 Time2
# <int> <date> <chr> <chr> <dttm> <dttm>
# 1 1001 2020-04-01 Place1 - Place2 - Place1 Morning - Afternoon 2020-04-01 08:00:00.000 2020-04-01 16:00:00.000
# 2 1002 2020-04-01 Place1 - Place2 - Place1 Morning - Morning 2020-04-01 09:00:05.000 2020-04-01 12:35:05.000
# 3 1002 2020-04-03 Place4 - Place3 - Place4 Night - Morning 2020-04-03 04:00:00.000 2020-04-03 11:00:00.000
# 4 1003 2020-04-03 Place5 - Place3 Afternoon - Afternoon 2020-04-03 15:00:00.000 2020-04-03 15:00:00.000
# 5 1003 2020-04-04 Place3 - Place2 - Place6 Evening - Morning 2020-04-03 18:00:00.000 2020-04-04 07:00:00.000
# 6 1003 2020-04-05 Place2 - Place7 Afternoon - Afternoon 2020-04-05 15:00:00.000 2020-04-05 15:00:00.000
# 7 1004 2020-04-05 Place6 - Place7 - Place6 Morning - Afternoon 2020-04-05 07:00:00.000 2020-04-05 15:00:00.000
我保留了Time1
,Time2
以防您仍然需要完整的时间戳。您将需要更新该time2human
功能,以便切割更符合您的喜好。(我有 -1 和 +25 的“小时”,以确保我完全限制所有可能的时间;我可以更精确和小心,但这已经足够安全了。)
最后,如果您更喜欢以您开始使用的格式设置日期和时间,请使用它format(..., format=...)
来获得您想要的;我更喜欢(并建议)尽可能长时间地保持它们的格式,因为它们本质上是数字的,因此很容易比较/排序Date
。POSIXt
一旦它们转换为字符串,这样的操作就变得不那么清晰了(例如尝试在4/1/2019
之前排序3/1/2020
)。
推荐阅读
- php - 在 CodeIgniter 中更改 URI 的扩展名
- c++ - 前向声明类时成员函数的行为
- awesome-wm - 在 rc.lua 中调用 beautiful.at_screen_connect() 错误
- javascript - 带有回调函数的Javascript异步方法问题
- html - 为什么固定位置在移动屏幕上不起作用?
- video-streaming - 垂直方向的 H.264 流显示损坏的视频
- android - Gradle 密度 abi APK
- java - 正多边形代码中心
- ruby-on-rails - Rails 脚手架无法正常工作
- r - fread 读取数据结构的引号错误