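r - R: Keeping only the 5 largest rows in a table
Problem description
I am using the R programming language. I created some random data and then wrote the following program, which loops over a series of data manipulation steps: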
#load library
library(dplyr)
library(data.table)
set.seed(123)
# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)
####
results_table <- data.frame()
for (i in 1:10) {
  # generate random numbers
  random_1 = runif(1, 80, 120)
  random_2 = runif(1, random_1, 120)
  random_3 = runif(1, 85, 120)
  random_4 = runif(1, random_3, 120)
  # bin data according to random criteria
  train_data <- train_data %>%
    mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a",
                        ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
  train_data$cat = as.factor(train_data$cat)
  # new splits
  a_table = train_data %>%
    filter(cat == "a") %>%
    select(a1, b1, c1, cat)
  b_table = train_data %>%
    filter(cat == "b") %>%
    select(a1, b1, c1, cat)
  c_table = train_data %>%
    filter(cat == "c") %>%
    select(a1, b1, c1, cat)
  split_1 = runif(1, 0, 1)
  split_2 = runif(1, 0, 1)
  split_3 = runif(1, 0, 1)
  # calculate a quantile of c1 ("quant") for each bin, at the random probabilities split_1, split_2, split_3
  table_a = data.frame(a_table %>% group_by(cat) %>%
    mutate(quant = quantile(c1, probs = split_1)))
  table_b = data.frame(b_table %>% group_by(cat) %>%
    mutate(quant = quantile(c1, probs = split_2)))
  table_c = data.frame(c_table %>% group_by(cat) %>%
    mutate(quant = quantile(c1, probs = split_3)))
  # create a new variable ("diff") that flags whether the quantile is bigger than the value of "c1"
  table_a$diff = ifelse(table_a$quant > table_a$c1, 1, 0)
  table_b$diff = ifelse(table_b$quant > table_b$c1, 1, 0)
  table_c$diff = ifelse(table_c$quant > table_c$c1, 1, 0)
  # group all tables
  final_table = rbind(table_a, table_b, table_c)
  # create a table: for each bin, calculate the average of "diff"
  final_table_2 = data.frame(final_table %>%
    group_by(cat) %>%
    summarize(mean = mean(diff)))
  # add "total mean" to this table
  final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))
  # format this table: add the random criteria to this table for reference
  final_table_2$random_1 = random_1
  final_table_2$random_2 = random_2
  final_table_2$random_3 = random_3
  final_table_2$random_4 = random_4
  final_table_2$split_1 = split_1
  final_table_2$split_2 = split_2
  final_table_2$split_3 = split_3
  final_table_2$iteration_number = i
  results_table <- rbind(results_table, final_table_2)
  final_results = dcast(setDT(results_table), iteration_number + random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
}
After running this loop 10 times, the result ("final_results") looks like this:
final_results
iteration_number random_1 random_2 random_3 random_4 split_1 split_2 split_3 a b c total
1: 1 95.67371 111.81329 94.00313 102.05692 0.84045638 0.6882731 0.7749321 0.82051282 0.6870229 0.7734554 0.730
2: 2 92.31360 110.07617 106.46871 109.53428 0.24615922 0.8777580 0.7847697 0.24731183 0.8777429 0.7840909 0.744
3: 3 81.02645 110.46446 116.42006 119.61718 0.11943576 0.9762721 0.9100522 0.14285714 0.9758162 0.9103448 0.943
4: 4 90.35986 116.70888 114.15588 116.72312 0.07675141 0.8661540 0.3236617 0.08139535 0.8658065 0.3207547 0.702
5: 5 89.28374 114.71034 119.70448 119.77249 0.08881443 0.6351936 0.8565509 0.09027778 0.6349614 0.8461538 0.573
6: 6 87.35767 103.85755 97.44462 116.04144 0.48372890 0.2319129 0.2701634 0.47368421 0.2326333 0.2711370 0.255
7: 7 112.91974 113.10267 99.20739 111.60051 0.52873965 0.6825709 0.5078129 0.52849741 0.6830709 0.5094340 0.605
8: 8 102.17487 117.17008 95.93786 96.80284 0.81599406 0.7785768 0.8593795 0.81300813 0.7795276 0.8586667 0.843
9: 9 82.62877 82.95787 105.70883 118.13665 0.44629189 0.0375750 0.4102906 0.44117647 0.1666667 0.4083333 0.408
10: 10 94.60865 106.70978 89.67872 104.21645 0.26431269 0.4899329 0.9060612 0.40000000 0.4897959 0.8992629 0.656
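I am trying to modify the loop so that, at any point during the iterations, the table keeps only the 5 largest results (based on the value of "final_results$total"). This is to prevent the final table ("final_results") from becoming too large.
Once the loop has finished, I know how to "trim" the "final_results" table so that it keeps only the 5 largest rows (by "final_results$total"):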
#sort the final table according to the desired criteria
sorted_table = final_results[order(final_results$total, decreasing = TRUE),]
#extract 5 biggest rows
sorted_table = sorted_table[1:5,]
#view the results
head(sorted_table)
iteration_number random_1 random_2 random_3 random_4 split_1 split_2 split_3 a b c total
1: 3 81.02645 110.4645 116.42006 119.61718 0.11943576 0.9762721 0.9100522 0.14285714 0.9758162 0.9103448 0.943
2: 8 102.17487 117.1701 95.93786 96.80284 0.81599406 0.7785768 0.8593795 0.81300813 0.7795276 0.8586667 0.843
3: 2 92.31360 110.0762 106.46871 109.53428 0.24615922 0.8777580 0.7847697 0.24731183 0.8777429 0.7840909 0.744
4: 1 95.67371 111.8133 94.00313 102.05692 0.84045638 0.6882731 0.7749321 0.82051282 0.6870229 0.7734554 0.730
5: 4 90.35986 116.7089 114.15588 116.72312 0.07675141 0.8661540 0.3236617 0.08139535 0.8658065 0.3207547 0.702
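My question: is it possible to rewrite the loop so that the table never contains more than 5 rows at any point? If I ran this loop 1,000,000 times, the table would become very large, and I would like to trim it as I go.
For example:
- The loop runs 5 times.
- At the 6th iteration, check whether its "total" is smaller than all 5 "total" values kept so far.
- If yes, discard the results of this iteration and move on to the 7th iteration.
- If no, keep the results of this iteration, discard the row with the smallest "total", and move on to the 7th iteration.
- Repeat this check at every later iteration until the loop has run 1,000,000 times.
Is it possible to add this step to the loop and trim the table while it is being built? Or can the table only be trimmed after the whole loop has finished?
Thanks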
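The steps listed above can be sketched in isolation, away from the full pipeline. This is only a minimal illustration in base R: `keep_top_k` is a hypothetical helper that does not appear in the original post, and the `total` values are random placeholders standing in for the per-iteration results.

```r
# Hypothetical helper (not from the original post): keep only the k rows
# with the largest `total` seen so far.
keep_top_k <- function(kept, new_row, k = 5) {
  # While fewer than k rows are stored, always accept the new row.
  if (nrow(kept) < k) {
    return(rbind(kept, new_row))
  }
  # Otherwise compare the new row against the smallest stored total.
  worst <- which.min(kept$total)
  if (new_row$total > kept$total[worst]) {
    kept[worst, ] <- new_row  # replace the row with the smallest total
  }
  kept
}

set.seed(1)
totals <- runif(1000)  # placeholder values for "total", one per iteration
kept <- data.frame(iteration_number = integer(0), total = numeric(0))
for (i in seq_along(totals)) {
  kept <- keep_top_k(kept, data.frame(iteration_number = i, total = totals[i]))
}
# Sort once at the end for display; the stored table never exceeded 5 rows.
kept <- kept[order(kept$total, decreasing = TRUE), ]
```

With this pattern the stored table stays at 5 rows no matter how many iterations run, and each iteration only compares against the current minimum.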
Solution
We can add the line
final_results <- head(final_results[order(-total)], 5)
at the end of the loop so that only the top 5 rows by "total" are kept:
for (i in 1:10) {
  # generate random numbers
  random_1 = runif(1, 80, 120)
  random_2 = runif(1, random_1, 120)
  random_3 = runif(1, 85, 120)
  random_4 = runif(1, random_3, 120)
  # bin data according to random criteria
  train_data <- train_data %>%
    mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a",
                        ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
  train_data$cat = as.factor(train_data$cat)
  # new splits
  a_table = train_data %>%
    filter(cat == "a") %>%
    select(a1, b1, c1, cat)
  b_table = train_data %>%
    filter(cat == "b") %>%
    select(a1, b1, c1, cat)
  c_table = train_data %>%
    filter(cat == "c") %>%
    select(a1, b1, c1, cat)
  split_1 = runif(1, 0, 1)
  split_2 = runif(1, 0, 1)
  split_3 = runif(1, 0, 1)
  # calculate a quantile of c1 ("quant") for each bin, at the random probabilities split_1, split_2, split_3
  table_a = data.frame(a_table %>% group_by(cat) %>%
    mutate(quant = quantile(c1, probs = split_1)))
  table_b = data.frame(b_table %>% group_by(cat) %>%
    mutate(quant = quantile(c1, probs = split_2)))
  table_c = data.frame(c_table %>% group_by(cat) %>%
    mutate(quant = quantile(c1, probs = split_3)))
  # create a new variable ("diff") that flags whether the quantile is bigger than the value of "c1"
  table_a$diff = ifelse(table_a$quant > table_a$c1, 1, 0)
  table_b$diff = ifelse(table_b$quant > table_b$c1, 1, 0)
  table_c$diff = ifelse(table_c$quant > table_c$c1, 1, 0)
  # group all tables
  final_table = rbind(table_a, table_b, table_c)
  # create a table: for each bin, calculate the average of "diff"
  final_table_2 = data.frame(final_table %>%
    group_by(cat) %>%
    summarize(mean = mean(diff)))
  # add "total mean" to this table
  final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))
  # format this table: add the random criteria to this table for reference
  final_table_2$random_1 = random_1
  final_table_2$random_2 = random_2
  final_table_2$random_3 = random_3
  final_table_2$random_4 = random_4
  final_table_2$split_1 = split_1
  final_table_2$split_2 = split_2
  final_table_2$split_3 = split_3
  final_table_2$iteration_number = i
  results_table <- rbind(results_table, final_table_2)
  final_results = dcast(setDT(results_table), iteration_number + random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
  # keep only the 5 rows with the largest "total"
  final_results <- head(final_results[order(-total)], 5)
}
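One thing to watch in the loop above: "results_table" still receives every iteration's rows through rbind, so only "final_results" is bounded at 5 rows. If memory is the concern, one possible variation (a sketch, not part of the original answer) is to reshape only the current iteration and append that single row to "final_results" before trimming. The per-iteration summary below is a random stand-in for "final_table_2", and `current` is a name introduced here for illustration.

```r
library(data.table)

set.seed(123)
final_results <- data.table()

for (i in 1:100) {
  # Stand-in for the per-iteration summary (`final_table_2` in the loop above);
  # the means are random placeholders, not the real computation.
  final_table_2 <- data.frame(
    cat = c("a", "b", "c", "total"),
    mean = runif(4),
    iteration_number = i
  )

  # Reshape only this iteration's rows to wide format (one row per iteration).
  current <- dcast(setDT(final_table_2), iteration_number ~ cat, value.var = "mean")

  # Append the new row, then keep only the 5 rows with the largest `total`.
  final_results <- rbindlist(list(final_results, current))
  final_results <- head(final_results[order(-total)], 5)
}

final_results
```

Because nothing other than the 5 best rows is carried from one iteration to the next, memory use stays flat, and dcast only ever processes the handful of rows from the current iteration.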