首页 > 解决方案 > 如何在 R 中使用 Dataframes 根据日期时间对事件进行计数并按条件分组

问题描述

我正面临着通过对两个不同的数据集进行一些数据整理来构建表格的挑战。

DataSet A:有服装店的购买信息,变量有:客户姓名、购买日期、代理、t时间段内购买的产品。

df1 <- tibble::tribble(
  ~NAME, ~PRODUCT, ~AGENT,        ~DATE_PURCHASE,
  "Karen",   "M_14",  "X_1", "8-25-20021 18:21:28",
  "Jean",   "M_78",  "X_3", "8-26-20021 18:11:06",
  "Jean",   "M_71",  "X_4", "8-26-20021 18:21:01",
  "Jean",   "M_64",  "X_4", "8-27-20021 20:21:59",
  "Keith",   "M_57",  "X_4", "8-27-20021 20:21:02",
  "Alba",   "M_50",  "X_1", "8-28-20021 20:21:03",
  "Alba",   "M_43",  "X_3", "8-29-20021 20:21:04",
  "Alex",   "M_36",  "X_2", "8-25-20021 20:21:05"
)

数据集B:有t时间段内拨打公司CX SERVICE专线的客户信息,存储客户姓名、通话日期、通话类型等变量。

df2 <- tibble::tribble(
  ~NAME,        ~TYPE,         ~DATE_OF_CALL,       
  "Karen",   "COMPLAIN", "8-26-20021 18:21:28", 
  "Jean", "CX_SERVICE", "8-27-20021 18:11:06",
  "Jean",   "COMPLAIN", "8-28-20021 18:21:01", 
  "Jean", "CX_SERVICE", "8-29-20021 20:21:59",
  "Keith", "CX_SERVICE", "8-29-20021 20:21:02", 
  "Alba",   "COMPLAIN", "8-30-20021 20:21:03", 
  "Alex", "CX_SERVICE", "8-25-20021 21:21:05", 
)

我必须使用以下内容构建一个数据集:我必须创建一个名为“x尝试”的新变量,它会让我知道这是否是在线客户的第一次、第二次、第三次等调用,以及每个客户将在收到的最后一个电话之前购​​买的最后一个产品带回桌面,包括电话的类型。我知道这可能听起来令人困惑,所以这里是所需表的示例:

NAME   | x attempt | product | TYPE      | DATE_CALL           | DATE_PURCHASE       |
Jean|  |     3     |  M_64   |CX_SERVICE | 8-29-20021 20:21:59 | 8-27-20021 20:21:59 |

这个结果是正确的,因为记录在案......这将是 Jean 的第三次电话,最后一次电话的类型是 CX_SERVICE 于 8-29-20021 20:21:59 和她购买的最后一个产品是 8-27-20021 20:21:59 上的 M_64。

标签: rdataframedatetimegroup-bydata-wrangling

解决方案


我相信您可能在年份上弄错了,所以我删除了年份(2021 年)中多余的零。我看到您正在使用 tibbles,因此我将提供一种tidyverse解决此问题的方法。

提供的代码的想法是首先单独处理小标题,然后通过公分母将它们连接起来NAME。这应该这样做:

library(dplyr)

df1 <- tibble::tribble(
  ~NAME, ~PRODUCT, ~AGENT,        ~DATE_PURCHASE,
  "Karen",   "M_14",  "X_1", "8-25-2021 18:21:28",
  "Jean",   "M_78",  "X_3", "8-26-2021 18:11:06",
  "Jean",   "M_71",  "X_4", "8-26-2021 18:21:01",
  "Jean",   "M_64",  "X_4", "8-27-2021 20:21:59",
  "Keith",   "M_57",  "X_4", "8-27-2021 20:21:02",
  "Alba",   "M_50",  "X_1", "8-28-2021 20:21:03",
  "Alba",   "M_43",  "X_3", "8-29-2021 20:21:04",
  "Alex",   "M_36",  "X_2", "8-25-2021 20:21:05"
)

df2 <- tibble::tribble(
  ~NAME,        ~TYPE,         ~DATE_OF_CALL,       
  "Karen",   "COMPLAIN", "8-26-2021 18:21:28", 
  "Jean", "CX_SERVICE", "8-27-2021 18:11:06",
  "Jean",   "COMPLAIN", "8-28-2021 18:21:01", 
  "Jean", "CX_SERVICE", "8-29-2021 20:21:59",
  "Keith", "CX_SERVICE", "8-29-2021 20:21:02", 
  "Alba",   "COMPLAIN", "8-30-2021 20:21:03", 
  "Alex", "CX_SERVICE", "8-25-2021 21:21:05", 
)

(df1_mod <- df1 %>%
  mutate(DATE_PURCHASE = as.POSIXct(DATE_PURCHASE, format = "%m-%d-%Y %H:%M:%S")) %>%
  group_by(NAME) %>%
  summarise(product = PRODUCT[DATE_PURCHASE == max(DATE_PURCHASE)], # retrieve product with the most recent date_purchase
            DATE_PURCHASE = max(DATE_PURCHASE), # retrieve most recent date_purchase
            .groups = "drop"))
#> # A tibble: 5 x 3
#>   NAME  product DATE_PURCHASE      
#>   <chr> <chr>   <dttm>             
#> 1 Alba  M_43    2021-08-29 20:21:04
#> 2 Alex  M_36    2021-08-25 20:21:05
#> 3 Jean  M_64    2021-08-27 20:21:59
#> 4 Karen M_14    2021-08-25 18:21:28
#> 5 Keith M_57    2021-08-27 20:21:02

(df2_mod <- df2 %>%
  mutate(DATE_OF_CALL = as.POSIXct(DATE_OF_CALL, format = "%m-%d-%Y %H:%M:%S")) %>%
  group_by(NAME) %>%
  summarise(`x attempt` = n(), # retrieve amount of calls, which is n() (the amount of rows in the group)
            TYPE = TYPE[DATE_OF_CALL == max(DATE_OF_CALL)], # retrieve type of call from most recent call
            DATE_OF_CALL = max(DATE_OF_CALL), # retrieve most recent date_of_call
            .groups = "drop"))
#> # A tibble: 5 x 4
#>   NAME  `x attempt` TYPE       DATE_OF_CALL       
#>   <chr>       <int> <chr>      <dttm>             
#> 1 Alba            1 COMPLAIN   2021-08-30 20:21:03
#> 2 Alex            1 CX_SERVICE 2021-08-25 21:21:05
#> 3 Jean            3 CX_SERVICE 2021-08-29 20:21:59
#> 4 Karen           1 COMPLAIN   2021-08-26 18:21:28
#> 5 Keith           1 CX_SERVICE 2021-08-29 20:21:02
  
left_join(df1_mod, df2_mod, by = "NAME")
#> # A tibble: 5 x 6
#>   NAME  product DATE_PURCHASE       `x attempt` TYPE       DATE_OF_CALL       
#>   <chr> <chr>   <dttm>                    <int> <chr>      <dttm>             
#> 1 Alba  M_43    2021-08-29 20:21:04           1 COMPLAIN   2021-08-30 20:21:03
#> 2 Alex  M_36    2021-08-25 20:21:05           1 CX_SERVICE 2021-08-25 21:21:05
#> 3 Jean  M_64    2021-08-27 20:21:59           3 CX_SERVICE 2021-08-29 20:21:59
#> 4 Karen M_14    2021-08-25 18:21:28           1 COMPLAIN   2021-08-26 18:21:28
#> 5 Keith M_57    2021-08-27 20:21:02           1 CX_SERVICE 2021-08-29 20:21:02
Created on 2021-04-10 by the reprex package (v0.3.0)

推荐阅读