r - Breaking down percentile data by date using read.table in R
Problem description
I have the following toy dataset:
# Timestamps are single-quoted so read.table keeps each one in a single field
dt <- read.table(text = "
Date Model Color Value Samples
'1/29/2020 6:51:19 AM' Gold Blue 0.5 500
'1/29/2020 7:57:47 AM' Gold Red 0.0 449
'1/29/2020 3:39:04 PM' Silver Blue 0.75 1320
'1/29/2020 5:04:32 PM' Silver Blue 1.5 103
'1/29/2020 10:32:39 AM' Gold Red 0.7 891
'1/30/2020 1:02:12 AM' Gold Blue 0.41 18103
'1/30/2020 4:30:00 AM' Copper Blue 0.83 564
'1/30/2020 9:09:45 AM' Silver Pink 1.17 173
'1/30/2020 2:19:30 PM' Platinum Brown 0.43 793
'1/30/2020 4:43:32 PM' Platinum Red 0.71 1763
'1/30/2020 7:19:00 PM' Gold Orange 1.92 503",
header = TRUE, stringsAsFactors = FALSE)
I then take this data.table and generate some percentile data, as follows:
library(data.table)
setDT(dt)  # read.table returns a data.frame; convert it to a data.table in place
qs = dt[Value > 0, .(Samples = sum(Samples),
'50th' = quantile(Value, probs = c(0.50)),
'75th' = quantile(Value, probs = c(0.75)),
'90th' = quantile(Value, probs = c(0.90)),
'99th' = quantile(Value, probs = c(0.99))),
by = .(Model, Color)]
setkey(qs, 'Model')
Finally, I write the results to a .csv file:
#outputs to csv file
write.csv(qs, file = "outfile.csv")
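Question: how would I write the results so that:
a) the results are broken down by date (i.e. only the date is used, e.g. January 30, 2020 and January 31, 2020, without the time)
b) the date is written on each row
For example (note: the values below are toy data, not the real calculation; I just want to show how the "Date" column should look):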
# Model Color Samples 50th 99th 99.9th 99.99th Date
# 1: Copper Blue 564 0.830 0.8300 0.83000 0.830000 01/29/2020
# 2: Gold Blue 18603 0.455 0.4991 0.49991 0.499991 01/29/2020
# 3: Gold Red 891 0.700 0.7000 0.70000 0.700000 01/29/2020
# 4: Gold Orange 503 1.920 1.9200 1.92000 1.920000 01/29/2020
# 5: Platinum Brown 793 0.430 0.4300 0.43000 0.430000 01/29/2020
# 6: Platinum Red 1763 0.710 0.7100 0.71000 0.710000 01/29/2020
# 7: Silver Blue 1423 1.125 1.4925 1.49925 1.499925 01/29/2020
# 8: Silver Pink 173 1.170 1.1700 1.17000 1.170000 01/29/2020
# 9: Copper Blue 564 0.830 0.8300 0.83000 0.830000 01/30/2020
#10: Gold Blue 18603 0.455 0.4991 0.49991 0.499991 01/30/2020
#11: Gold Red 891 0.700 0.7000 0.70000 0.700000 01/30/2020
#12: Gold Orange 503 1.920 1.9200 1.92000 1.920000 01/30/2020
#13: Platinum Brown 793 0.430 0.4300 0.43000 0.430000 01/30/2020
#14: Platinum Red 1763 0.710 0.7100 0.71000 0.710000 01/30/2020
#15: Silver Blue 1423 1.125 1.4925 1.49925 1.499925 01/30/2020
#16: Silver Pink 173 1.170 1.1700 1.17000 1.170000 01/30/2020
Thanks!
Solution
If we need to create the columns in the original dataset, use :=
library(data.table)
library(lubridate)
setDT(dt)[Value > 0, c("Samples", '50th', '75th', '90th', '99th') :=
c(list(sum(Samples)), as.list(quantile(Value,
probs = c(0.50, 0.75, 0.90, 0.99)))),
.(Model, Color, DateNoTime = as.Date(mdy_hms(Date)) )]
dt
# Date Model Color Value Samples 50th 75th 90th 99th
# 1: 1/29/2020 6:51:19 AM Gold Blue 0.50 500 0.500 0.5000 0.500 0.5000
# 2: 1/29/2020 7:57:47 AM Gold Red 0.00 449 NA NA NA NA
# 3: 1/29/2020 3:39:04 PM Silver Blue 0.75 1423 1.125 1.3125 1.425 1.4925
# 4: 1/29/2020 5:04:32 PM Silver Blue 1.50 1423 1.125 1.3125 1.425 1.4925
# 5: 1/29/2020 10:32:39 AM Gold Red 0.70 891 0.700 0.7000 0.700 0.7000
# 6: 1/30/2020 1:02:12 AM Gold Blue 0.41 18103 0.410 0.4100 0.410 0.4100
# 7: 1/30/2020 4:30:00 AM Copper Blue 0.83 564 0.830 0.8300 0.830 0.8300
# 8: 1/30/2020 9:09:45 AM Silver Pink 1.17 173 1.170 1.1700 1.170 1.1700
# 9: 1/30/2020 2:19:30 PM Platinum Brown 0.43 793 0.430 0.4300 0.430 0.4300
#10: 1/30/2020 4:43:32 PM Platinum Red 0.71 1763 0.710 0.7100 0.710 0.7100
#11: 1/30/2020 7:19:00 PM Gold Orange 1.92 503 1.920 1.9200 1.920 1.9200
This also fills those new columns with NA for the rows where Value <= 0.
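As a small illustration of why those rows end up as NA, here is a minimal sketch on a hypothetical three-row table (the names toy, g, v and med are made up for the example). With := the filter in i only selects which rows get assigned, so rows that fail it are left untouched:

```r
library(data.table)

# Hypothetical table, used only to show the mechanics of := with a filter.
toy <- data.table(g = c("a", "a", "b"), v = c(1, 3, 0))

# Assign the group median only on rows where v > 0.
# The row with v == 0 is never assigned, so med stays NA there.
toy[v > 0, med := median(v), by = g]
toy
```

The same thing happens above for the Gold/Red row with Value 0.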
However, if the intention is to fill all rows with the summarised values, create 'qs' with the 'date' part included in by, and then do a join:
qs <- setDT(dt)[Value > 0, .(Samples = sum(Samples),
'50th' = quantile(Value, probs = c(0.50)),
'75th' = quantile(Value, probs = c(0.75)),
'90th' = quantile(Value, probs = c(0.90)),
'99th' = quantile(Value, probs = c(0.99))),
by = .(Model, Color,
DateNoTime = format(as.Date(mdy_hms(Date)), "%m/%d/%Y") )]
qs[dt, on = .(Model, Color)]
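One thing that may be worth checking with the join above: qs has one row per Model, Color and date, so matching on Model and Color alone can pair a row of dt with summaries from more than one date. A possible variant, sketched here on four rows taken from the question's data, adds the date-only key to dt first and joins on it as well. Base R as.Date is used instead of mdy_hms only to keep the sketch free of extra packages; that swap, the name by_date and the reduced set of percentiles are my assumptions, not part of the original answer.

```r
library(data.table)

# Four rows taken from the question's toy data.
dt <- data.table(
  Date    = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM",
              "1/29/2020 10:32:39 AM", "1/30/2020 1:02:12 AM"),
  Model   = c("Gold", "Gold", "Gold", "Gold"),
  Color   = c("Blue", "Red", "Red", "Blue"),
  Value   = c(0.5, 0, 0.7, 0.41),
  Samples = c(500L, 449L, 891L, 18103L)
)

# Date-only key; as.Date ignores the trailing time once the format is matched.
dt[, DateNoTime := format(as.Date(Date, format = "%m/%d/%Y"), "%m/%d/%Y")]

# Summary per Model, Color and day, using only rows with Value > 0.
qs <- dt[Value > 0,
         .(Samples = sum(Samples),
           `50th` = quantile(Value, probs = 0.50),
           `99th` = quantile(Value, probs = 0.99)),
         by = .(Model, Color, DateNoTime)]

# Join on all three keys so each row of dt picks up only its own day's summary.
by_date <- qs[dt, on = .(Model, Color, DateNoTime)]
by_date
```

In the result, Samples comes from qs and i.Samples is the per-row value from dt.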
If we don't want to include the 'date' in by and only need it in the output:
setDT(dt)[, DateNoTime := as.Date(mdy_hms(Date))
][Value > 0, c("Samples", '50th', '75th', '90th', '99th') :=
c(list(sum(Samples)), as.list(quantile(Value,
probs = c(0.50, 0.75, 0.90, 0.99)))),
.(Model, Color)]
dt
# Date Model Color Value Samples DateNoTime 50th 75th 90th 99th
# 1: 1/29/2020 6:51:19 AM Gold Blue 0.50 18603 2020-01-29 0.455 0.4775 0.491 0.4991
# 2: 1/29/2020 7:57:47 AM Gold Red 0.00 449 2020-01-29 NA NA NA NA
# 3: 1/29/2020 3:39:04 PM Silver Blue 0.75 1423 2020-01-29 1.125 1.3125 1.425 1.4925
# 4: 1/29/2020 5:04:32 PM Silver Blue 1.50 1423 2020-01-29 1.125 1.3125 1.425 1.4925
# 5: 1/29/2020 10:32:39 AM Gold Red 0.70 891 2020-01-29 0.700 0.7000 0.700 0.7000
# 6: 1/30/2020 1:02:12 AM Gold Blue 0.41 18603 2020-01-30 0.455 0.4775 0.491 0.4991
# 7: 1/30/2020 4:30:00 AM Copper Blue 0.83 564 2020-01-30 0.830 0.8300 0.830 0.8300
# 8: 1/30/2020 9:09:45 AM Silver Pink 1.17 173 2020-01-30 1.170 1.1700 1.170 1.1700
# 9: 1/30/2020 2:19:30 PM Platinum Brown 0.43 793 2020-01-30 0.430 0.4300 0.430 0.4300
#10: 1/30/2020 4:43:32 PM Platinum Red 0.71 1763 2020-01-30 0.710 0.7100 0.710 0.7100
#11: 1/30/2020 7:19:00 PM Gold Orange 1.92 503 2020-01-30 1.920 1.9200 1.920 1.9200
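To get the layout asked for in the question (one row per Model, Color and date, with the date as the last column), something along these lines might work. It is a sketch under a few assumptions: the date is taken with base R as.Date rather than mdy_hms, the grouped column is named Date, and the file is written to a temporary directory under the name outfile.csv.

```r
library(data.table)

# Toy data from the question.
dt <- data.table(
  Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM", "1/29/2020 3:39:04 PM",
           "1/29/2020 5:04:32 PM", "1/29/2020 10:32:39 AM", "1/30/2020 1:02:12 AM",
           "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM", "1/30/2020 2:19:30 PM",
           "1/30/2020 4:43:32 PM", "1/30/2020 7:19:00 PM"),
  Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
            "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
            "Blue", "Pink", "Brown", "Red", "Orange"),
  Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L)
)

# One summary row per Model, Color and calendar day (time of day dropped).
qs <- dt[Value > 0,
         .(Samples = sum(Samples),
           `50th` = quantile(Value, probs = 0.50),
           `75th` = quantile(Value, probs = 0.75),
           `90th` = quantile(Value, probs = 0.90),
           `99th` = quantile(Value, probs = 0.99)),
         by = .(Model, Color,
                Date = format(as.Date(Date, format = "%m/%d/%Y"), "%m/%d/%Y"))]

# Move Date to the last column and sort so each day's rows sit together.
setcolorder(qs, c(setdiff(names(qs), "Date"), "Date"))
setorder(qs, Date, Model, Color)

# Assumed output location; replace with your own path.
out_file <- file.path(tempdir(), "outfile.csv")
write.csv(qs, file = out_file, row.names = FALSE)
```

row.names = FALSE keeps the row index out of the file, so the CSV contains only the summary columns followed by Date.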
Data
dt <- structure(list(Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM",
"1/29/2020 3:39:04 PM", "1/29/2020 5:04:32 PM", "1/29/2020 10:32:39 AM",
"1/30/2020 1:02:12 AM", "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM",
"1/30/2020 2:19:30 PM", "1/30/2020 4:43:32 PM", "1/30/2020 7:19:00 PM"
), Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
"Copper", "Silver", "Platinum", "Platinum", "Gold"), Color = c("Blue",
"Red", "Blue", "Blue", "Red", "Blue", "Blue", "Pink", "Brown",
"Red", "Orange"), Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83,
1.17, 0.43, 0.71, 1.92), Samples = c(500L, 449L, 1320L, 103L,
891L, 18103L, 564L, 173L, 793L, 1763L, 503L)),
class = "data.frame", row.names = c(NA,
-11L))