Filtering transaction level data

Problem description

I am working with a data frame of transaction-level data. It contains two fields, bill_id and product.

The data represents products purchased at a bill level, and a particular bill_id gets repeated as many times as the number of products purchased in that bill. For example, if 5 items have been purchased in bill_id 12345, the data for this bill will be like this:

bill_id product
  12345       A
  12345       B
  12345       C
  12345       D
  12345       E

My objective is to extract the data for all bills that contain a certain product, i.e. every row of each such bill, not only the rows for that product.

Following is an example of how I am performing this task currently:

library(dplyr)
set.seed(1)

# Sample data
dat <- data.frame(bill_id = sample(1:500, size = 1000, replace = TRUE),
                  product = sample(LETTERS, size = 1000, replace = TRUE),
                  stringsAsFactors = FALSE) %>% 
       arrange(bill_id, product)

# vector of bill_ids of product A
bills_productA <- dat %>% 
                  filter(product == "A") %>% 
                  pull(bill_id) %>% 
                  unique()

# data for bill_ids in vector bills_productA
dat_subset <- dat %>%
              filter(bill_id %in% bills_productA)

This leads to the creation of an intermediary vector of bill_ids (bills_productA) and a two-step filtering process (first find ids of bills containing the product, and then find all transactions of these bills).

Is there a more efficient way of performing this task?

Tags: r, dplyr

Solution


You can do this in a single filter() call by subsetting bill_id directly inside the condition, with no intermediate vector:

library(dplyr)
dat_subset1 <- dat %>% filter(bill_id %in% unique(bill_id[product == "A"]))

identical(dat_subset, dat_subset1) 
#[1] TRUE

This also works without unique(), because %in% only tests membership, but deduplicating keeps the lookup vector short.
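As a small check of the one-step filter, here is a sketch on a toy data frame. The data and the names toy, with_unique, without_unique and grouped are made up for illustration and are not from the original post. It compares the filter with and without unique(), and also shows a grouped variant using any() as one possible alternative, not necessarily a faster one.

```r
library(dplyr)

# Hypothetical toy data: bills 1 and 3 contain product "A", bill 2 does not
toy <- data.frame(bill_id = c(1, 1, 2, 3, 3, 3),
                  product = c("A", "B", "C", "A", "A", "D"),
                  stringsAsFactors = FALSE)

# One-step filter, with and without unique()
with_unique    <- toy %>% filter(bill_id %in% unique(bill_id[product == "A"]))
without_unique <- toy %>% filter(bill_id %in% bill_id[product == "A"])

# Grouped alternative: keep every bill in which any row is product "A"
grouped <- toy %>%
  group_by(bill_id) %>%
  filter(any(product == "A")) %>%
  ungroup() %>%
  as.data.frame()

# Both one-step variants should return the same rows
identical(with_unique, without_unique)
```

The grouped version reads closer to the stated goal ("bills where any product is A"), but it evaluates the condition once per bill, so on data with many distinct bill_id values the vectorised %in% filter is likely the cheaper choice.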

