r - 我想按一列汇总,然后取一列的总和和另一列的平均值
问题描述
下面是我目前用来总结我的数据的代码,它正在工作。我的问题是我想实际取“CE100”列的平均值与总和。我怎样才能操纵下面的代码来做到这一点?
library(data.table, warn.conflicts = FALSE)
library(magrittr) ### MODIFIED
# library(lubridate, warn.conflicts = FALSE) ### MODIFIED
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/NAME/Documents/Raw Data/"
# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
"MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
"VA-WA", "WB-ZZ")
# Set location for output
out_directory <- "C:/Users/NAME/Documents/YTD Master/"
out_filename <- "OUTPUT.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- c("2018-01-01", "2018-06-30") ### MODIFIED
# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
# date_filter <- TRUE ### MODIFIED
##########
## CODE ##
##########
starttime <- Sys.time()
# create vector of filenames to be processed
in_filenames <- list.files(
file.path(in_directory, in_subfolders),
pattern = "\\.txt$",
full.names = TRUE,
recursive = TRUE)
# filter filenames, only
selected_in_filenames <-
seq(as.Date(date_range[1]),
as.Date(date_range[2]), by = "1 month") %>%
format("%Y-%m") %>%
lapply(function(x) stringr::str_subset(in_filenames, x)) %>%
unlist()
# read and aggregate each file separetely
mastertable <- rbindlist(
lapply(selected_in_filenames, function(fn) {
message("Processing file: ", fn)
temptable <- fread(fn,
colClasses = c(CUSTOMER_TIER = "character"),
na.strings = "")
{ # Add columns
print(paste0("Adding columns - ", subfolder, " (", j," of ", length(in_subfolders), ")"))
print(Sys.time()-starttime)
temptable[, ':='(CustPart = paste0(CUST_ID, INV_ITEM_ID))]}
# aggregate file but filtered for date_range
temptable[INVOICE_DT %between% date_range,
lapply(.SD, sum), by = .(CustPart, QTR = quarter(INVOICE_DT), YEAR = year(INVOICE_DT)),
.SDcols = c("Ext Sale", "CE100")]
})
)[
# second aggregation overall
, lapply(.SD, sum), by = .(CustPart, QTR, YEAR), .SDcols = c("Ext Sale", "CE100")]
# Save Final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED
print(Sys.time()-starttime)
mastertable
我已经包含了我所有的代码来展示我是如何读取我的数据的。如果需要任何其他细节,比如一些需要使用的示例数据,请告诉我。
解决方案
OP 方法的关键点是交错聚合(在 Date Filter 中使用多个月份时,请参阅相关问题行不合并 R 中的重复项)。
OP 希望跨多个文件聚合数据,这些文件显然太大而无法完全加载并组合成一个大的 data.table。
相反,每个文件都被单独读入和聚合。小计被合并到一个 data.table 中,在第二个聚合步骤中计算总和。
现在,OP 希望在聚合步骤中包含总和以及平均值。交错聚合适用于总和和计数,但不适用于平均值,例如,mean(1:5)
3 与小计的平均值不同,mean(1:2)
并且mean(3:5)
:mean(c(mean(1:2), mean(3:5)))
为 2.75。
因此,下面的方法仅计算第一个和第二个聚合步骤的总和和计数,并分别计算所选列的平均值。数据取自OP 的另一个问题。此外,by =
为演示简化了参数,并data.range
已根据示例数据进行了调整。
library(data.table, warn.conflicts = FALSE)
library(magrittr) ### MODIFIED
# library(lubridate, warn.conflicts = FALSE) ### MODIFIED
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "Raw Data"
# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
"MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
"VA-WA", "WB-ZZ")
# Set location for output
out_directory <- "YTD Master"
out_filename <- "OUTPUT.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- c("2017-01-01", "2017-06-30") ### MODIFIED
# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
# date_filter <- TRUE ### MODIFIED
##########
## CODE ##
##########
starttime <- Sys.time()
# create vector of filenames to be processed
in_filenames <- list.files(
file.path(in_directory, in_subfolders),
pattern = "\\.txt$",
full.names = TRUE,
recursive = TRUE)
# filter filenames
selected_in_filenames <-
seq(as.Date(date_range[1]),
as.Date(date_range[2]), by = "1 month") %>%
format("%Y-%m") %>%
lapply(function(x) stringr::str_subset(in_filenames, x)) %>%
unlist()
# read and aggregate each file separetely
mastertable <- rbindlist(
lapply(selected_in_filenames, function(fn) {
message("Processing file: ", fn)
temptable <- fread(fn,
colClasses = c(CUSTOMER_TIER = "character"),
na.strings = "")
# aggregate file but filtered for date_range
temptable[INVOICE_DT %between% date_range,
c(.(N = .N), lapply(.SD, sum)),
by = .(CUST_ID,
QTR = quarter(INVOICE_DT), YEAR = year(INVOICE_DT)),
.SDcols = c("Ext Sale", "CE100")]
})
)[
# second aggregation overall
, lapply(.SD, sum),
by = .(CUST_ID, QTR, YEAR),
.SDcols = c("N", "Ext Sale", "CE100")]
# update mastertable with averages of selected columns
cols_avg <- c("CE100")
mastertable[, (cols_avg) := lapply(.SD, function(x) x/N),
.SDcols = cols_avg]
# Save Final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable) ### MODIFIED
print(Sys.time()-starttime)
mastertable
CUST_ID QTR YEAR N Ext Sale CE100 1: AK0010001 1 2017 4 427.803 29.4119358 2: CO0020001 1 2017 2 1540.300 NA 3: CO0010001 1 2017 2 -179.765 0.0084625
缺失值包含在聚合中。需要在业务方面决定如何处理缺失值。如果要从聚合中排除缺失值,则平均值的交错计算可能会变得更加复杂。
推荐阅读
- haproxy - HAProxy:防止对备份服务器的粘性
- docker - VueJS 和 Webpack:构建项目无法访问 ENV var
- angularjs - 带有多项选择的 JQuery Select2 自动完成下拉菜单
- postgresql - 气流 PostgresOperator 设置 ON_ERROR_STOP
- java - 如何从 AetherRepository 迁移到 MavenBndRepository?
- html - 使用 ng-repeat 从 JSon 获取数据并显示
- ios - 如何调试“线程 1:致命错误:Swift 中的索引超出范围”
- backstop.js - BackstopJS - 为所有场景设置通用选择器
- reactjs - Redux - 道具更改后不呈现“地图功能”但呈现“输入值”
- javascript - Javascript (NodeJS) 中的自定义 RSA 实现