r - How can I speed up the grepl function?
Problem description
I am trying to apply the following approach to a large number of words and texts:
# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
"continuous improvement is an unrealistic goal",
"phrase with no match")
# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns the words that were searched
names(df) <- words
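With a large list of words and a large amount of input text, this takes a long time to run.
Is there any way to change it so that it runs faster?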
Solution
One possibility is to use grepl() with fixed = TRUE:
lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE)))
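Note that fixed = TRUE changes what is matched as well as how fast: the pattern is treated as a literal string instead of a regular expression. A minimal sketch of the difference, using made-up strings that are not part of the question's data:

```r
# Illustrative input only (not from the question)
x <- c("abc", "a.c")

# As a regular expression, "." matches any single character
regex_hits <- grepl("a.c", x)

# With fixed = TRUE, "." only matches a literal dot
fixed_hits <- grepl("a.c", x, fixed = TRUE)

print(regex_hits)
print(fixed_hits)
```

For plain words such as those in the question the two calls should agree, so the switch is safe there. It stops being a drop-in replacement once the search terms rely on regex syntax.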
Alternatively, you can use stri_detect_fixed() from the stringi package:
library(stringi)
lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word)))
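One detail that is easy to miss is that the argument order differs between the two functions: grepl() takes the pattern first, while stri_detect_fixed() takes the strings first. The sketch below checks that both give the same result on the question's sample data. The requireNamespace() guard is an addition here so the snippet still runs when stringi is not installed.

```r
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

if (requireNamespace("stringi", quietly = TRUE)) {
  # grepl(pattern, x) versus stri_detect_fixed(str, pattern)
  base_hits <- lapply(words, function(word) grepl(word, phrases, fixed = TRUE))
  stri_hits <- lapply(words, function(word) stringi::stri_detect_fixed(phrases, word))
  same <- identical(base_hits, stri_hits)
} else {
  same <- NA  # stringi not available, nothing to compare
}
print(same)
```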
A small benchmark:
library(stringi)
library(microbenchmark)
# Repeat the sample phrases to get a larger input (400,000 strings)
phrases <- rep(phrases, 100000)
microbenchmark(
  grepl = lapply(words, function(word) as.numeric(grepl(word, phrases))),
  grepl_fixed = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
  stri_detect_fixed = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
  times = 50)
Unit: milliseconds
              expr      min       lq      mean   median       uq       max neval
             grepl 857.5839 918.3976 1007.4775 957.3126 986.9762 1631.5336    50
       grepl_fixed 116.8073 130.1615  146.6852 139.1170 152.0428  278.1512    50
 stri_detect_fixed 105.2338 116.9041  128.8941 126.7353 135.7818  199.4968    50
As @akrun suggested, a small further improvement may be possible by replacing as.numeric() with the unary +:
microbenchmark(
  grepl_plus = lapply(words, function(word) +grepl(word, phrases)),
  grepl_fixed_plus = lapply(words, function(word) +grepl(word, phrases, fixed = TRUE)),
  stri_detect_fixed_plus = lapply(words, function(word) +stri_detect_fixed(phrases, word)),
  grepl_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases))),
  grepl_fixed_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
  stri_detect_fixed_as_numeric = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
  times = 50)
Unit: milliseconds
                         expr      min       lq      mean   median        uq       max neval
                   grepl_plus 839.2060 889.8748 1008.0753 926.4712 1022.6071 2063.8296    50
             grepl_fixed_plus 117.0043 126.4407  141.5917 136.5732  146.2262  318.7412    50
       stri_detect_fixed_plus 104.4772 110.3147  126.3931 115.9223  124.4952  423.4654    50
             grepl_as_numeric 851.4198 893.6703  957.4348 935.0965 1010.3131 1375.0810    50
       grepl_fixed_as_numeric 121.8952 128.6741  142.4962 136.3370  145.5004  235.6042    50
 stri_detect_fixed_as_numeric 106.0639 114.6759  128.0724 121.9647  135.4791  191.1315    50
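Putting the pieces together for the data frame in the question, here is a minimal sketch that assumes the search words are plain strings with no regular-expression syntax intended. It follows the same construction as the original code and only swaps in fixed = TRUE and the unary +:

```r
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

# One 0/1 column per word; unary + turns the logical result into integers
hits <- lapply(words, function(word) +grepl(word, phrases, fixed = TRUE))
df <- data.frame(hits)

# Name the columns after the words that were searched
names(df) <- words
print(df)
```

The columns come out as integer rather than double, which is the only visible difference from the as.numeric() version. If double columns are required downstream, as.numeric() can be kept at little cost according to the timings above.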