How can I speed up the grepl function?

Problem description

I am trying to apply the following approach to a large number of words and texts:

# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada", 
             "continuous improvement is an unrealistic goal", 
             "phrase with no match")

# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns the words that were searched
names(df) <- words

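With a large word list and large input texts, this takes a long time to run.

Is there any way to change it to make the process faster?

Tags: r

Solution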


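One possibility is to use grepl() with fixed = TRUE, which matches each word as a literal string instead of treating it as a regular expression: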

lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE)))
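Keep in mind that fixed = TRUE changes the matching semantics as well as the speed: regex metacharacters in the pattern lose their special meaning. The sketch below uses made-up strings (not from the question) to show the difference. For plain words such as those in the question, both modes should give the same result.

```r
# Hypothetical inputs, chosen only to illustrate the difference
texts <- c("abc", "a.c")

# Regex mode: "." matches any single character
regex_match <- grepl("a.c", texts)

# Fixed mode: "." is a literal dot
fixed_match <- grepl("a.c", texts, fixed = TRUE)

print(regex_match)
print(fixed_match)
```

If any of the search words rely on regex features (anchors, alternation, character classes), fixed matching is not a drop-in replacement for them.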

Alternatively, you can use stri_detect_fixed() from the stringi package:

library(stringi)
lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word)))

A small benchmark, with the phrases replicated 100,000 times:

phrases <- rep(phrases, 100000)

library(microbenchmark)
microbenchmark(grepl = lapply(words, function(word) as.numeric(grepl(word, phrases))),
               grepl_fixed = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
               stri_detect_fixed = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
               times = 50)

Unit: milliseconds
              expr      min       lq      mean   median       uq       max neval
             grepl 857.5839 918.3976 1007.4775 957.3126 986.9762 1631.5336    50
       grepl_fixed 116.8073 130.1615  146.6852 139.1170 152.0428  278.1512    50
 stri_detect_fixed 105.2338 116.9041  128.8941 126.7353 135.7818  199.4968    50
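To tie this back to the question, here is a minimal sketch of how the indicator data frame could be built with the fixed-matching variant. It reuses the sample data from the question; the use of vapply() and the intermediate name `hits` are illustrative choices, not part of the original answer.

```r
# Sample data from the question
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

# vapply returns a numeric matrix: one row per phrase, one column per word
hits <- vapply(words,
               function(word) as.numeric(grepl(word, phrases, fixed = TRUE)),
               numeric(length(phrases)))

# Convert to a data frame and name the columns after the searched words
df <- as.data.frame(hits)
names(df) <- words
print(df)
```

vapply() is used here because it checks that every result has the expected type and length, which can catch mistakes early on large inputs.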

As @akrun suggested, a small further improvement may be possible by replacing as.numeric() with the unary + operator:

microbenchmark(grepl_plus = lapply(words, function(word) +grepl(word, phrases)),
               grepl_fixed_plus = lapply(words, function(word) +grepl(word, phrases, fixed = TRUE)),
               stri_detect_fixed_plus = lapply(words, function(word) +stri_detect_fixed(phrases, word)),
               grepl_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases))),
               grepl_fixed_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
               stri_detect_fixed_as_numeric = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
               times = 50)

Unit: milliseconds
                         expr      min       lq      mean   median        uq       max neval
                   grepl_plus 839.2060 889.8748 1008.0753 926.4712 1022.6071 2063.8296    50
             grepl_fixed_plus 117.0043 126.4407  141.5917 136.5732  146.2262  318.7412    50
       stri_detect_fixed_plus 104.4772 110.3147  126.3931 115.9223  124.4952  423.4654    50
             grepl_as_numeric 851.4198 893.6703  957.4348 935.0965 1010.3131 1375.0810    50
       grepl_fixed_as_numeric 121.8952 128.6741  142.4962 136.3370  145.5004  235.6042    50
 stri_detect_fixed_as_numeric 106.0639 114.6759  128.0724 121.9647  135.4791  191.1315    50
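One detail worth knowing when swapping as.numeric() for +: the two do not return the same storage type. A minimal sketch with a made-up logical vector (the variable names are illustrative only):

```r
# Hypothetical logical vector standing in for a grepl() result
flags <- c(TRUE, FALSE, TRUE)

plus_result <- +flags               # unary plus coerces logical to integer
numeric_result <- as.numeric(flags) # as.numeric coerces logical to double

print(typeof(plus_result))
print(typeof(numeric_result))

# The values compare equal even though the storage types differ
print(all(plus_result == numeric_result))
```

For 0/1 indicator columns this difference rarely matters, but it can show up in strict comparisons such as identical().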
