首页 > 解决方案 > 使用 R 中另一个数据处理大型 DataTable

问题描述

我需要处理一个大的data.table,在一个列中设置一个值,其中两个其他列和另一个单独的(小)搜索词data.table 之间存在匹配。该列的值取自找到匹配项的搜索词。

具体在以下位置找到匹配项:

我在下面的代码中简化了我的用例,它使用了默认的风暴数据集(10k 记录)。主 DT 有 650 万条记录,搜索词 DT 不到 200 条。

我的两个问题是:

  1. 有没有办法提高效率(并减少处理大型数据表中的 650 万条记录所需的估计 90 分钟?

  2. 如何获得 data.table 结果而不是已转置的矩阵(列作为行,行作为列)?

非常感谢。

library(tidyverse)
library(lubridate)
library(data.table)

data("storms")
setDT(storms)
storms <- storms %>% 
  mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())

# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01") 
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to) 
setDT(searchTerms)

# Function to determine if any search terms apply to the given row
# Typically only one is expected, although not guaranteed, so the first is taken
# A replacedStatus field is added containing either
# - the parameterised replacementStatus where a matching search term has been identified
# - the original status where no match has been identified 
recodeValues <- function(row) { 
  date = row["date"]
  status = row["status"]
  recodeMatch <- head(
    searchTerms[str_detect(status, searchTerms$searchStatus) &
                  (searchTerms$from == "" | date >= as.Date(searchTerms$from)) &
                  (searchTerms$to == ""   | date <= as.Date(searchTerms$to))
                  ,]
      ,1)
  # I would use mult = "first" in the selection above but it seem to have no impact, so use head() instead
  row["replacedStatus"]  <- if_else(nrow(recodeMatch) > 0, recodeMatch[1]$replacementStatus, status)
  return(row)
}

cat("Starting Recoding", "\n")
processorTime <- proc.time()
result <- apply(storms, 1, recodeValues)
cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
View(result)

标签: rdata.tabletidyverse

解决方案


如果我正确理解你想要什么,那么迭代小的“searchTerms”可能data.table比迭代大的“storms”更有意义。

然后你可以做这样的事情,它可以更好地利用以下功能data.table

library(tidyverse)
library(lubridate)
library(data.table)
data("storms")
setDT(storms)
storms <- storms %>% 
    mutate(date = paste(year, month, day, sep="-") %>% ymd() %>% as.Date())

# Create simple Search Term Data Table
searchStatus = c("hurricane", "orm", "sion")
replacementStatus = c("hurricane any time", "76 storm", "81 depression")
from = c("", "1976-01-01", "1976-01-01") 
to = c("", "1981-01-01", "1981-12-31")
searchTerms = data.frame(searchStatus, replacementStatus, from, to) 
setDT(searchTerms)

cat("Starting Recoding", "\n")
#> Starting Recoding
processorTime <- proc.time()
for(i in seq_len(dim(searchTerms)[1])){
    x <- as.list(searchTerms[i])
    if(x$from == "") {
        storms[grep(x$searchStatus, status),
               status:= x$replacementStatus]  
    } else {
        storms[grep(x$searchStatus, status) &
                   between(date, as.Date(x$from), as.Date(x$to)),
               status:= x$replacementStatus]
    }
}

cat("Recoding time (Elapsed):", proc.time()[3] - processorTime[3], " seconds \n")
#> Recoding time (Elapsed): 0.034  seconds
cat("Estimated Recoding time (Elapsed) for 6.5m records:", (proc.time()[3] - processorTime[3]) * 6500000 / nrow(storms) / 60, " minutes \n")
#> Estimated Recoding time (Elapsed) for 6.5m records: 0.3787879  minutes
tail(storms[])
#>    name year month day hour  lat  long             status category wind
#> 1: Kate 2015    11  10   12 29.5 -75.4     tropical storm        0   60
#> 2: Kate 2015    11  10   18 31.2 -74.0     tropical storm        0   60
#> 3: Kate 2015    11  11    0 33.1 -71.3 hurricane any time        1   65
#> 4: Kate 2015    11  11    6 35.2 -67.6 hurricane any time        1   70
#> 5: Kate 2015    11  11   12 36.2 -62.5 hurricane any time        1   75
#> 6: Kate 2015    11  11   18 37.6 -58.2 hurricane any time        1   65
#>    pressure ts_diameter hu_diameter       date
#> 1:      998    103.5702      0.0000 2015-11-10
#> 2:      993    103.5702      0.0000 2015-11-10
#> 3:      990    161.1092     23.0156 2015-11-11
#> 4:      985    207.1404     23.0156 2015-11-11
#> 5:      980    345.2340     34.5234 2015-11-11
#> 6:      980    379.7574     46.0312 2015-11-11

推荐阅读