首页 > 解决方案 > 如何使用 R 中的可能 spread() 和 gather() 函数将给定的预定旅行数据集重建为所需的链接旅行数据集?

问题描述

我有一个预定的旅行数据集如下:

bktrips <- data.frame(
  userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"), 
  mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"), 
  Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"), 
  Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ), 
  depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"), 
  Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"), 
  Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"), 
  Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"), 
  Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)

现在我需要重组这个预订的旅行数据集来准备一个链接的旅行数据集。

例如,将公共汽车和火车旅行(通过相同的用户 ID,P001)链接到一个链接的旅行中,并重新定义此旅程的起点和目的地(分别为 O1 和 D2)。

我们需要使用规则来链接行程(同一用户ID的行程,上一次行程的目的地靠近下一次行程的起点(1公里以内),上一次行程结束与下一次出发的时间间隔小于60分钟)。

在预订行程数据集中,变量为:

任何人都可以在这里帮助我吗?我是 R 的新用户。非常感谢。

标签: rtidyversetidyrsfspread

解决方案


这是一种使用dplyrandgeosphere来计算距离的方法。我lubridate用来修复你的日期栏。

首先,我们修复列的类。接下来,我们依赖于旅行必须按时间顺序发生的事实。因此,我们使用 from 和 from 计算与前一个目的地的lag距离dplyr以及distHaversinegeosphere上一次出发以来的时间。

library(dplyr)
library(geosphere)
library(lubridate)
bktrips %>%
  mutate(depart_dt = ymd_hms(depart_dt)) %>%
  mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
  group_by(userID) %>% 
  arrange(depart_dt,.by_group = TRUE) %>%
  mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
         TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
  dplyr::select(-depart_dt,-contains(c("lat","lon")))
  userID mode  Origin Destination DistPrevDest TimePrevDep
  <fct>  <fct> <fct>  <fct>              <dbl> <drtn>     
1 P001   bus   O1     D1                   NA    NA mins  
2 P001   train O2     D2                  801.   30 mins  
3 P001   taxi  O3     D3                10434.  150 mins  
4 P001   bus   O4     D4                  547.   40 mins  
5 P001   train O5     D5                  130. 1220 mins  
6 P002   taxi  O6     D6                   NA    NA mins  
7 P002   bus   O7     D7                 3105. 1370 mins  
8 P002   train O8     D8                 3188. 1440 mins  
9 P002   taxi  O9     D9                  879.   50 mins  

现在我们可以添加一个TripID使用一些逻辑和cumsum

然后我们按 分组TripID,并使用summarize重新定义所有列。

bktrips %>%
  mutate(depart_dt = ymd_hms(depart_dt)) %>%
  bktrips %>%
  mutate(depart_dt = ymd_hms(depart_dt)) %>%
  mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
  group_by(userID) %>% 
  arrange(depart_dt,.by_group = TRUE) %>%
  mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
         TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
  mutate(TripID = cumsum(!((is.na(DistPrevDest) | DistPrevDest < 1000) & (is.na(TimePrevDep) |TimePrevDep < 60)))) %>%
  group_by(userID,TripID) %>%
  summarize(mode = paste(mode,collapse = ","),
            Origin = first(Origin),
            Destination = last(Destination),
            depart_dt = paste(depart_dt,collapse = ","),
            Olat = first(Olat),
            Olon = first(Olon),
            Dlat = last(Dlat),
            Dlon = last(Dlon))
  userID TripID mode       Origin Destination depart_dt                                Olat  Olon  Dlat  Dlon
  <fct>   <int> <chr>      <fct>  <fct>       <chr>                                   <dbl> <dbl> <dbl> <dbl>
1 P001        0 bus,train  O1     D2          2019-11-05 08:00:00,2019-11-05 08:30:00 -33.9  151. -33.9  151.
2 P001        1 taxi,bus   O3     D4          2019-11-05 11:00:00,2019-11-05 11:40:00 -33.8  151. -33.9  151.
3 P001        2 train      O5     D5          2019-11-06 08:00:00                     -33.9  151. -33.9  151.
4 P002        0 taxi       O6     D6          2019-11-06 09:10:00                     -33.9  151. -33.9  151.
5 P002        1 bus        O7     D7          2019-11-07 08:00:00                     -33.9  151. -33.9  151.
6 P002        2 train,taxi O8     D9          2019-11-08 08:00:00,2019-11-08 08:50:00 -33.9  151. -33.9  151.

我建议您在数据中也包含到达时间,而是计算出发时间和上次到达时间之间的差异。

编辑:错过了一个cumsum()。现在修好了。而且,也不需要rleid了。


推荐阅读