Search for words in text passages and flag them in R

Question

I have a text dataset in which I want to search for various words and flag each one where it is found. Here is sample data:

library(data.table)

df <- data.table(
  id = 1:3,
  report = c("Travel opens our eyes to art, history, and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.",
             "We quickly observed that no one in Sicily cooks with recipes (just with the heart), so we now do the same.",
             "We quickly observed that no one in Sicily cooks with recipes so we now do the same."),
  summary = c("On our first trip to Sicily to discover our family roots,",
              "If you’re not a gardener, an Internet search for where to find zucchini flowers results.",
              "add some fresh cream to make the mixture a bit more liquid,"))

So far I have been using SQL for this, but it becomes hard to manage when there are many word lists to look for.

library(sqldf)

dfOne <- sqldf("select id
      , case when lower(report) like '%opens%' then 1 else 0 end as opens
      , case when lower(report) like '%cooks%' then 1 else 0 end as cooks
      , case when lower(report) like '%internet%' then 1 else 0 end as internet
      , case when lower(report) like '%zucchini%' then 1 else 0 end as zucchini
      , case when lower(report) like '%fresh%' then 1 else 0 end as fresh
      from df")

I am looking for ideas on how to do this more efficiently. With a long list of target terms, this code becomes unnecessarily long.

Thanks,

SM.

Tags: r, text-mining

Answer


1) sqldf

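Define the vector of words and then convert it to SQL. Note that the case when construct is not needed, because like already yields a 0/1 result. Prefixing sqldf with fn$ allows the R character string like to be substituted into the SQL statement via $like. Pass the verbose = TRUE argument to sqldf to see the generated SQL statement. No matter how long words is, this takes only two lines of code.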

words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")

like <- toString(sprintf("\nlower(report) like '%%%s%%' as '%s'", words, words))
fn$sqldf("select id, $like from df", verbose = TRUE)

giving:

  id opens cooks internet zucchini fresh test me
1  1     1     0        0        0     0       0
2  2     0     1        0        0     0       0
3  3     0     1        0        0     0       0
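
To make the substitution step concrete, here is a minimal sketch in base R (no sqldf needed) that builds the same like string for a shortened words vector and pastes it into the query by hand. The paste0 call is only a stand-in for what $like substitution achieves; it is an assumption made for illustration, not a description of sqldf's internals.

```r
# Shortened word list, for illustration only
words <- c("opens", "cooks")

# In sprintf, "%%" emits a literal "%" and "%s" is replaced by each word
like <- toString(sprintf("\nlower(report) like '%%%s%%' as '%s'", words, words))

# Hand-rolled stand-in for the $like substitution (assumption, not sqldf internals)
sql <- paste0("select id, ", like, " from df")

# Print the SQL text that would be sent to the database
cat(sql, "\n")
```

Each word contributes one "lower(report) like ... as ..." clause, so the query grows with words while the R code stays the same length.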

2) outer

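Using words from above, we can use outer as follows. Note that the function passed to outer (its third argument) must be vectorized; grepl can be made vectorized with Vectorize, as shown. Omit check.names = FALSE if you do not mind words containing spaces or punctuation being converted into syntactic R variable names. This gives the same output as (1).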

with(df, data.frame(
    id, 
    +t(outer(setNames(words, words), report, Vectorize(grepl))), 
    check.names = FALSE
))
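
The need for Vectorize can be seen in a small self-contained sketch. outer calls its function once, with both inputs expanded to every word/report pair, whereas grepl on its own only uses the first element of a pattern vector. The toy words and report vectors below are made up for illustration and are not the data from the question.

```r
# Toy inputs, hypothetical and much shorter than the question's data
words  <- c("opens", "cooks")
report <- c("Travel opens our eyes",
            "no one cooks with recipes",
            "cooks again")

# Vectorize() wraps grepl in mapply, so each pattern is paired with one string
vgrepl <- Vectorize(grepl)

# outer() yields a words x reports logical matrix; t() puts reports in rows,
# and the unary + converts TRUE/FALSE to 1/0
flags <- +t(outer(setNames(words, words), report, vgrepl))
print(flags)
```

Naming the words with setNames is what gives the resulting columns their labels.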

3) sapply

Using sapply we get a slightly shorter solution along the same lines as (2). The output is the same as in (1) and (2).

with(df, data.frame(id, +sapply(words, grepl, report), check.names = FALSE))
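
One difference from the SQL version is that grepl is case-sensitive and treats each word as a regular expression, whereas the query lowercases report before matching. If case-insensitive, literal matching is what you want (an assumption about the intent, not something stated in the answer), a sketch along these lines may help. The toy id, words and report values are made up for illustration.

```r
# Toy inputs, hypothetical
id     <- 1:3
words  <- c("opens", "internet")
report <- c("Travel OPENS our eyes",
            "An Internet search",
            "nothing here")

# Lowercase the text first (mirrors lower(report) in the SQL), then match each
# word literally with fixed = TRUE so regex metacharacters are not interpreted
flags <- sapply(words, grepl, tolower(report), fixed = TRUE)

# Unary + turns the logical matrix into 0/1 flags, one column per word
res <- data.frame(id, +flags, check.names = FALSE)
print(res)
```

The words themselves are assumed to be lowercase already; otherwise apply tolower to words as well.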
