r - 使用 R 中另一个数据处理大型 DataTable
问题描述
我需要处理一个大的data.table,在一个列中设置一个值,其中两个其他列和另一个单独的(小)搜索词data.table 之间存在匹配。该列的值取自找到匹配项的搜索词。
具体在以下位置找到匹配项:
- 主 DT 中的字段(示例代码中的状态)与搜索词 DT (searchStatus) 中的 str_detect 模式匹配
- 如果为上述 str_detect 模式指定,则来自主 DT 的日期必须在搜索词指定的日期范围内(从和到)
我在下面的代码中简化了我的用例,它使用了默认的风暴数据集(10k 记录)。主 DT 有 650 万条记录,搜索词 DT 不到 200 条。
我的两个问题是:
有没有办法提高效率(并减少处理大型数据表中的 650 万条记录所需的估计 90 分钟?
如何获得 data.table 结果而不是已转置的矩阵(列作为行,行作为列)?
非常感谢。
library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>%
mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())
# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01")
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to)
setDT(searchTerms)
# Function to determine if any search terms apply to the given row
# Typically only one is expected, although not guaranteed, so the first is taken
# A replacedStatus field is added containing either
# - the parameterised replacementStatus where a matching search term has been identified
# - the original status where no match has been identified
recodeValues <- function(row) {
date = row["date"]
status = row["status"]
recodeMatch <- head(
searchTerms[str_detect(status, searchTerms$searchStatus) &
(searchTerms$from == "" | date >= as.Date(searchTerms$from)) &
(searchTerms$to == "" | date <= as.Date(searchTerms$to))
,]
,1)
# I would use mult = "first" in the selection above but it seem to have no impact, so use head() instead
row["replacedStatus"] <- if_else(nrow(recodeMatch) > 0, recodeMatch[1]$replacementStatus, status)
return(row)
}
cat("Starting Recoding", "\n")
processorTime <- proc.time()
result <- apply(storms, 1, recodeValues)
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
View(result)
解决方案
如果我正确理解你想要什么,那么迭代小的“searchTerms”可能data.table
比迭代大的“storms”更有意义。
然后你可以做这样的事情,它可以更好地利用以下功能data.table
:
library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>%
mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())
# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01")
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to)
setDT(searchTerms)
cat("Starting Recoding", "\n")
#> Starting Recoding
processorTime <- proc.time()
for(i in seq_len(dim(searchTerms)[1])){
x <- as.list(searchTerms[i])
if(x$from == "") {
storms[grep(x$searchStatus, status),
status:= x$replacementStatus]
} else {
storms[grep(x$searchStatus, status) &
between(date, as.Date(x$from), as.Date(x$to)),
status:= x$replacementStatus]
}
}
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
#> Recoding time (Elapsed): 0.034 seconds
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
#> Estimated Recoding time (Elapsed) for 6.5m records: 0.3787879 minutes
tail(storms[])
#> name year month day hour lat long status category wind
#> 1: Kate 2015 11 10 12 29.5 -75.4 tropical storm 0 60
#> 2: Kate 2015 11 10 18 31.2 -74.0 tropical storm 0 60
#> 3: Kate 2015 11 11 0 33.1 -71.3 hurricane any time 1 65
#> 4: Kate 2015 11 11 6 35.2 -67.6 hurricane any time 1 70
#> 5: Kate 2015 11 11 12 36.2 -62.5 hurricane any time 1 75
#> 6: Kate 2015 11 11 18 37.6 -58.2 hurricane any time 1 65
#> pressure ts_diameter hu_diameter date
#> 1: 998 103.5702 0.0000 2015-11-10
#> 2: 993 103.5702 0.0000 2015-11-10
#> 3: 990 161.1092 23.0156 2015-11-11
#> 4: 985 207.1404 23.0156 2015-11-11
#> 5: 980 345.2340 34.5234 2015-11-11
#> 6: 980 379.7574 46.0312 2015-11-11
推荐阅读
- python - Python selenium 中的 ElementClickInterceptedException
- java - 有没有办法在不使用任何第三方应用程序的情况下以编程方式控制安卓中的充电?
- javascript - 如何检查 react.js 中的 startTime 是否小于 endTime
- python - 查找特定数量的python pandas的计数
- node.js - 当我尝试安装 @nuxtjs/firebase 时出现 NPM 错误
- opengl-es - 在 OpenGL Android ARCore 中绘制具有独特颜色和宽度的线条
- javascript - iframe中的body滚动事件是根据iframe的body标签还是父页面的body标签发生的?
- javascript - 如何使用 JavaScript 创建一个 div?
- python - 拆分不平衡的数据集
- database - 第一次将数据从其他来源插入应用程序表的最佳方法是什么?