Breaking down percentile data by date using read.table in R

Problem description

I have the following toy dataset:

dt <- read.table(text = "
Date                    Model      Color    Value   Samples
1/29/2020 6:51:19 AM    Gold       Blue     0.5     500
1/29/2020 7:57:47 AM    Gold       Red      0.0     449
1/29/2020 3:39:04 PM    Silver     Blue     0.75    1320
1/29/2020 5:04:32 PM    Silver     Blue     1.5     103
1/29/2020 10:32:39 AM   Gold       Red      0.7     891
1/30/2020 1:02:12 AM    Gold       Blue     0.41    18103
1/30/2020 4:30:00 AM    Copper     Blue     0.83    564
1/30/2020 9:09:45 AM    Silver     Pink     1.17    173
1/30/2020 2:19:30 PM    Platinum   Brown    0.43    793
1/30/2020 4:43:32 PM    Platinum   Red      0.71    1763
1/30/2020 7:19:00 PM    Gold       Orange   1.92    503",
                 header = TRUE, stringsAsFactors = FALSE)

I then take this data.table and generate some percentile data, as follows:

qs = dt[Value > 0, .(Samples = sum(Samples),
                     '50th'    = quantile(Value, probs = c(0.50)),
                     '75th'    = quantile(Value, probs = c(0.75)),
                     '90th'    = quantile(Value, probs = c(0.90)), 
                     '99th'    = quantile(Value, probs = c(0.99))),
        by = .(Model, Color)]
setkey(qs, 'Model')

Finally, I write the results to a .csv file:

#outputs to csv file

write.csv(qs, file = "outfile.csv")

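Question: how would I write the results so that:

a) the results are broken down by date (i.e. only the date part, such as 1/30/2020 and 1/31/2020, without the time)
b) the dates are written as rows

For example (note: the values below are toy data, not real calculations; they are only meant to show how the "Date" column should be represented):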

#       Model  Color Samples  50th   99th  99.9th  99.99th Date
# 1:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000 01/29/2020
# 2:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991 01/29/2020
# 3:     Gold    Red     891 0.700 0.7000 0.70000 0.700000 01/29/2020
# 4:     Gold Orange     503 1.920 1.9200 1.92000 1.920000 01/29/2020
# 5: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000 01/29/2020
# 6: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000 01/29/2020
# 7:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925 01/29/2020
# 8:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000 01/29/2020
# 9:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000 01/30/2020
#10:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991 01/30/2020
#11:     Gold    Red     891 0.700 0.7000 0.70000 0.700000 01/30/2020
#12:     Gold Orange     503 1.920 1.9200 1.92000 1.920000 01/30/2020
#13: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000 01/30/2020
#14: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000 01/30/2020
#15:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925 01/30/2020
#16:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000 01/30/2020

Thanks!

Tags: r, dataframe, data.table

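Solution

If we need to create the columns in the original dataset, use := to assign by reference, grouping by the date part as well: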

library(data.table)  # provides setDT() and :=
library(lubridate)
setDT(dt)[Value > 0,  c("Samples", '50th', '75th', '90th', '99th') := 
    c(list(sum(Samples)), as.list(quantile(Value,
      probs = c(0.50, 0.75, 0.90, 0.99)))),
        .(Model, Color, DateNoTime = as.Date(mdy_hms(Date)) )]
dt
#                     Date    Model  Color Value Samples  50th   75th  90th   99th
# 1:  1/29/2020 6:51:19 AM     Gold   Blue  0.50     500 0.500 0.5000 0.500 0.5000
# 2:  1/29/2020 7:57:47 AM     Gold    Red  0.00     449    NA     NA    NA     NA
# 3:  1/29/2020 3:39:04 PM   Silver   Blue  0.75    1423 1.125 1.3125 1.425 1.4925
# 4:  1/29/2020 5:04:32 PM   Silver   Blue  1.50    1423 1.125 1.3125 1.425 1.4925
# 5: 1/29/2020 10:32:39 AM     Gold    Red  0.70     891 0.700 0.7000 0.700 0.7000
# 6:  1/30/2020 1:02:12 AM     Gold   Blue  0.41   18103 0.410 0.4100 0.410 0.4100
# 7:  1/30/2020 4:30:00 AM   Copper   Blue  0.83     564 0.830 0.8300 0.830 0.8300
# 8:  1/30/2020 9:09:45 AM   Silver   Pink  1.17     173 1.170 1.1700 1.170 1.1700
# 9:  1/30/2020 2:19:30 PM Platinum  Brown  0.43     793 0.430 0.4300 0.430 0.4300
#10:  1/30/2020 4:43:32 PM Platinum    Red  0.71    1763 0.710 0.7100 0.710 0.7100
#11:  1/30/2020 7:19:00 PM     Gold Orange  1.92     503 1.920 1.9200 1.920 1.9200

This also fills those new columns with NA for the rows where Value <= 0.
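The NA-filling behaviour can be seen in isolation with a minimal sketch. The table toy and its columns g and med are hypothetical names used only for illustration; they are not part of the question's data.

```r
library(data.table)

# Hypothetical three-row table, only to show the NA-filling behaviour.
toy <- data.table(g = c("a", "a", "b"), Value = c(0, 2, 4))

# Only rows matching Value > 0 are assigned; the new column stays NA
# for every row that the filter in i excludes.
toy[Value > 0, med := median(Value), by = g]
print(toy)
```

Because := works by reference, no rows are dropped: the filtered-out rows are kept and simply left unassigned.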


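However, if the intent is to fill every row with the summarised values, create 'qs' with the date part included in by, and then join it back to the original data: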

qs <- setDT(dt)[Value > 0, .(Samples = sum(Samples),
                     '50th'    = quantile(Value, probs = c(0.50)),
                     '75th'    = quantile(Value, probs = c(0.75)),
                     '90th'    = quantile(Value, probs = c(0.90)), 
                     '99th'    = quantile(Value, probs = c(0.99))),
        by = .(Model, Color,
          DateNoTime = format(as.Date(mdy_hms(Date)), "%m/%d/%Y") )]



qs[dt, on = .(Model, Color)]
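For the original goal (one summary row per Model, Color and date, written to a .csv file), the grouped result can be written out directly. The following is a sketch under a few assumptions: it uses base R's as.Date() with an explicit format instead of mdy_hms(), relying on the fact that trailing characters after the matched format are ignored, and the names qs_by_date, joined, probs and the output file name are illustrative. It also joins on the date as well as Model and Color, since joining on Model and Color alone would match every date of a group to each row.

```r
library(data.table)

# Toy data copied from the question.
dt <- data.table(
  Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM",
           "1/29/2020 3:39:04 PM", "1/29/2020 5:04:32 PM",
           "1/29/2020 10:32:39 AM", "1/30/2020 1:02:12 AM",
           "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM",
           "1/30/2020 2:19:30 PM", "1/30/2020 4:43:32 PM",
           "1/30/2020 7:19:00 PM"),
  Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
            "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color = c("Blue", "Red", "Blue", "Blue", "Red", "Blue", "Blue",
            "Pink", "Brown", "Red", "Orange"),
  Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L,
              793L, 1763L, 503L)
)

# Keep only the date part, formatted as mm/dd/YYYY like the desired output.
dt[, DateNoTime := format(as.Date(Date, format = "%m/%d/%Y"), "%m/%d/%Y")]

# One summary row per Model, Color and date.
probs <- c(0.50, 0.75, 0.90, 0.99)
qs_by_date <- dt[Value > 0,
                 c(list(Samples = sum(Samples)),
                   setNames(as.list(quantile(Value, probs = probs)),
                            c("50th", "75th", "90th", "99th"))),
                 by = .(Model, Color, DateNoTime)]

# Write the by-date summary; the file name is an illustrative choice.
out_file <- file.path(tempdir(), "outfile_by_date.csv")
write.csv(qs_by_date, file = out_file, row.names = FALSE)

# Optional: attach the summaries to every original row, matching on date too.
joined <- qs_by_date[dt, on = .(Model, Color, DateNoTime)]
```

On the toy data this gives nine summary rows (three groups on 1/29 and six on 1/30), and the join keeps all eleven original rows.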

If we don't want to include the date in by and only need it in the output:

setDT(dt)[, DateNoTime := as.Date(mdy_hms(Date))
     ][Value > 0,  c("Samples", '50th', '75th', '90th', '99th') := 
    c(list(sum(Samples)), as.list(quantile(Value,
      probs = c(0.50, 0.75, 0.90, 0.99)))),
        .(Model, Color)]
dt
#                     Date    Model  Color Value Samples DateNoTime  50th   75th  90th   99th
# 1:  1/29/2020 6:51:19 AM     Gold   Blue  0.50   18603 2020-01-29 0.455 0.4775 0.491 0.4991
# 2:  1/29/2020 7:57:47 AM     Gold    Red  0.00     449 2020-01-29    NA     NA    NA     NA
# 3:  1/29/2020 3:39:04 PM   Silver   Blue  0.75    1423 2020-01-29 1.125 1.3125 1.425 1.4925
# 4:  1/29/2020 5:04:32 PM   Silver   Blue  1.50    1423 2020-01-29 1.125 1.3125 1.425 1.4925
# 5: 1/29/2020 10:32:39 AM     Gold    Red  0.70     891 2020-01-29 0.700 0.7000 0.700 0.7000
# 6:  1/30/2020 1:02:12 AM     Gold   Blue  0.41   18603 2020-01-30 0.455 0.4775 0.491 0.4991
# 7:  1/30/2020 4:30:00 AM   Copper   Blue  0.83     564 2020-01-30 0.830 0.8300 0.830 0.8300
# 8:  1/30/2020 9:09:45 AM   Silver   Pink  1.17     173 2020-01-30 1.170 1.1700 1.170 1.1700
# 9:  1/30/2020 2:19:30 PM Platinum  Brown  0.43     793 2020-01-30 0.430 0.4300 0.430 0.4300
#10:  1/30/2020 4:43:32 PM Platinum    Red  0.71    1763 2020-01-30 0.710 0.7100 0.710 0.7100
#11:  1/30/2020 7:19:00 PM     Gold Orange  1.92     503 2020-01-30 1.920 1.9200 1.920 1.9200
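The Gold/Blue figures in the output above can be checked by hand. With the date left out of by, that group pools the two values 0.50 and 0.41, and quantile() by default (type 7) interpolates linearly between them. A small base R sketch, where vals and q are illustrative names:

```r
# The two Gold/Blue values that are pooled when the date is not in by.
vals <- c(0.50, 0.41)

# Default quantile() (type 7): with two points this is
# min + p * (max - min), i.e. 0.41 + p * 0.09.
q <- quantile(vals, probs = c(0.50, 0.75, 0.90, 0.99))
print(q)
```

This reproduces 0.455, 0.4775, 0.491 and 0.4991, the values shown for both Gold/Blue rows.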

Data

dt <- structure(list(Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM", 
"1/29/2020 3:39:04 PM", "1/29/2020 5:04:32 PM", "1/29/2020 10:32:39 AM", 
"1/30/2020 1:02:12 AM", "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM", 
"1/30/2020 2:19:30 PM", "1/30/2020 4:43:32 PM", "1/30/2020 7:19:00 PM"
), Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", 
"Copper", "Silver", "Platinum", "Platinum", "Gold"), Color = c("Blue", 
"Red", "Blue", "Blue", "Red", "Blue", "Blue", "Pink", "Brown", 
"Red", "Orange"), Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 
1.17, 0.43, 0.71, 1.92), Samples = c(500L, 449L, 1320L, 103L, 
891L, 18103L, 564L, 173L, 793L, 1763L, 503L)), 
class = "data.frame", row.names = c(NA, 
-11L))
