How to count the substrings that match at least one element of a list, only when they are not preceded or followed by a negation?

Question

I have a text and a list of patterns:

text="By Gregory Crawford HONG KONG, Jan 1 (Reuter) - Lower interest rates should\\ boost loan growth for Hong Kong banks in 1996, but the sluggish\\ economy will limit profit next year, analysts said.\\  \"Overall profit growth for the sector next year will not be\\ fantastic,\\\\\\\\\\\" said Alan Hutcheson at Deutsche Morgan Grenfell.\\     \\\\\\\\\\\"On the property side, we're not expecting to see any major\\ resurgence in terms of demand for mortgages,\\\\\\\\\\\" he said."
patterns=c("boost","growth","fantastic")

I then collapse the patterns into a single regular expression:

patterns.col="\\bboost\\b|\\bgrowth\\b|\\bfantastic\\b"

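I want to count how many times the words in patterns occur in the text, excluding any occurrence that has a negation ("no", "not", "don't" or "won't") before or after it, within the previous or next 5 words.

In this case, my expected output would be: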

#3

That is, "boost" once and "growth" twice; "fantastic" is not counted because it is preceded by "not".

How can I do that?

Currently, I do a simple match as follows:

count=str_count(text,patterns.col)

Thanks!

Tags: r, regex, string, text

Answer


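negatives = c("no", "not", "don't", "won't")

# Clean up the text: strip backslashes, commas, double quotes and periods
# (the character class also matches a literal "|"), then split on whitespace
x = gsub("[\\\\|,|\"|.]", "", text)
x = gsub("\\s+", " ", x)
x = unlist(strsplit(x, " "))

# Token positions of the negations and of the pattern words
ind1 = which(x %in% negatives)
ind2 = which(x %in% patterns)

# near[i, j] is TRUE when pattern word i lies within 5 tokens of negation j;
# outer() still returns a matrix when there is zero or one match, which
# nested sapply() calls do not
near = outer(ind2, ind1, function(p, n) abs(p - n) <= 5)
remove = sum(rowSums(near) > 0)
add = length(ind2)

ans = add - remove
ans
#[1] 3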
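
The approach above compares tokens case-sensitively and hard-codes the 5-word window. If that matters for your data, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: `count_unnegated`, its `window` argument and the shortened `example` sentence are hypothetical names introduced here for illustration, and the tokenizer simply lower-cases the text and keeps ASCII letters, digits and apostrophes.

```r
# Hypothetical helper (not part of the original answer): count pattern words
# that have no negation within `window` tokens on either side.
count_unnegated <- function(text, patterns,
                            negatives = c("no", "not", "don't", "won't"),
                            window = 5) {
  # Lower-case, then turn everything except letters, digits,
  # apostrophes and spaces into a space
  clean <- gsub("[^a-z0-9' ]", " ", tolower(text))
  tokens <- strsplit(trimws(clean), " +")[[1]]

  # Token positions of the pattern words and of the negations
  pat_idx <- which(tokens %in% tolower(patterns))
  neg_idx <- which(tokens %in% tolower(negatives))

  # near[i, j] is TRUE when pattern i is within `window` tokens of negation j
  near <- outer(pat_idx, neg_idx, function(p, n) abs(p - n) <= window)

  # Keep the pattern hits whose row contains no nearby negation
  sum(rowSums(near) == 0)
}

# Shortened, made-up sentence modelled on the text in the question
example <- "Rates should boost loan growth, but profit growth for the sector next year will not be fantastic."
count_unnegated(example, c("boost", "growth", "fantastic"))
#[1] 3
```

Because the window is measured in tokens after cleaning, the result depends on how the text is split; a different tokenizer may move a negation into or out of the window.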
