首页 > 解决方案 > 如何在交付/接收较早的订单之前知道下一个订单的客户?在 R 中

问题描述

我有一个有两个日期的大型数据库。例如,获取超市数据(http://www.tableau.com/sites/default/files/training/global_superstore.zip)“订单”表。

一个日期是订单日期,另一个是发货/交货日期(假设它是交货日期)。我想知道那些下一个订单的客户的所有订单的详细信息,而无需等待他们之前任何一个订单的发货/交付。

例如,ID 为“ZC-21910”的客户于 2014 年 6 月 12 日下订单,ID 为 CA-2014-133928,该订单于 2014 年 6 月 18 日发货。然而,同一客户下一个 ID 为“IT-2014-3511710”的订单2014 年 6 月 13 日,即 2014 年 6 月 18 日之前(早期订单之一的发货日期)。

最好在单独的数据框/表中过滤掉所有此类订单(订单 ID)。

我怎样才能在 R 中做到这一点?

示例数据集

> dput(df)
structure(list(customer_id = c("A", "A", "A", "B", "B", "C", 
"C"), order_id = structure(1:7, .Label = c("1", "2", "3", "4", 
"5", "6", "7"), class = "factor"), order_date = structure(c(17897, 
17901, 17912, 17901, 17902, 17903, 17905), class = "Date"), ship_date = structure(c(17926, 
17906, 17914, 17904, 17904, 17904, 17906), class = "Date")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

标签: rdataframedplyriterationrolling-computation

解决方案


以下是我在 R 中构建此工作流的方式,注意:在 Tableau 中复制功能将非常困难。

# Install pacakges if they are not already installed: necessary_packages => vector
necessary_packages <- c("readxl")

# Create a vector containing the names of any packages needing installation:
# new_pacakges => vector
new_packages <- necessary_packages[!(necessary_packages %in%
                                       installed.packages()[, "Package"])]

# If the vector has more than 0 values, install the new pacakges
# (and their) associated dependencies:
if(length(new_packages) > 0){install.packages(new_packages, dependencies = TRUE)}

# Initialise the packages in the session:
lapply(necessary_packages, require, character.only = TRUE)

# Store a scalar of the link to the data: durl => character scalar
durl <- "http://www.tableau.com/sites/default/files/training/global_superstore.zip"

# Store the path to the temporary directory: tmpdir_path => character scalar
tmpdir_path <- tempdir()

# Store a character scalar denoting the link to the zipped directory
# that is to be created: zip_path => character scalar
zip_path <- paste0(tmpdir_path, "/tableau.zip")

# Store a character scalar denoting the link to the unzipped directory
# that is to be created: unzip_path => character scalar
unzip_path <- paste0(tmpdir_path, "/global_superstore")

# Download the zip file: global_superstore.zip => stdout (zip_path)
download.file(durl, zip_path)

# Unzip the file into the unzip directory: tableau.zip => stdout (global_superstore)
unzip(zipfile = zip_path, exdir = unzip_path)

# Read in the excel file: df => data.frame
df <- read_xls(normalizePath(list.files(unzip_path, full.names = TRUE)))

# Regex the vector names to fit with R convention: names(df) => character vector 
names(df) <- gsub("\\W+", "_", tolower(trimws(names(df), "both")))

# Allocate some memory by creating an empty list the same size as the number of 
# customers: df_list => list
df_list <- vector("list", length(unique(df$customer_id)))

# Split the data.frame into the list by the customer_id: df_list => lis
df_list <- with(df, split(df, customer_id))      

# Sort the data (by date) and test whether or not each customer waited for their 
# order before ordering again: orders_prior_to_delivery => data.frame
orders_prior_to_delivery <- data.frame(do.call("rbind", Map(function(x){
  # Order the data.frame: y => data.frame
  y <- x[order(x$order_date),]
  # Return only the observations where the customer didn't wait: 
  # data.frame => GlobalEnv()
  with(y, y[c(FALSE, 
    apply(data.frame(sapply(order_date[-1], `<`, ship_date[-nrow(y)])), 2, any)),])
}, 
df_list)), row.names = NULL, stringsAsFactors = FALSE)

# Unique customers and orders that were ordered prior to shipping the 
# previous order: cust_orders_prior_to_delivery => data.frame
cust_orders_prior_to_delivery <- 
  unique(orders_prior_to_delivery[,c("order_id", "customer_id")])

推荐阅读