R: Using a for loop to find one specific string next to another string

Question

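I have the text of a novel in a single vector, novel.vector.words, which has been split by word, and I am looking for all instances of the string "blood of". However, because the vector is split by word, each word is its own string, and I don't know how to search for adjacent strings in the vector.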

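I have a basic understanding of what for loops do, and by following some instructions from a textbook I can use this for loop to locate every position of "blood" and its surrounding context, producing a tab-delimited KWIC (Key Word in Context) display.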

node.positions <- grep("blood", novel.vector.words)

output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header

#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match

for (i in seq_along(node.positions)){ # access each match (seq_along skips the loop if there are no matches)
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
  # concatenate and print the results
  cat(left.context,"\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)
}

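However, I'm not sure how to go about it, for example whether to use something like an if statement to capture only the instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for each instance of "blood" it finds, I want to check whether the word immediately after it is "of". I want the loop to find all of these instances and tell me how many there are in the vector.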

Tags: r, for-loop, corpus

Answer


You can build an index with dplyr::lead to match 'blood' immediately followed by 'of':

library(dplyr)

novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")

which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))

[1] 1 5
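
Note that grepl() matches substrings, so the index above would also pick up pairs such as "bloody" followed by "often". If exact word matches are wanted, and the goal is a count as the question asks, a base-R sketch along the following lines may be closer. It assumes the words are already lower-cased and stripped of punctuation; the names n, is.blood.of, and blood.of.positions are illustrative and do not come from the original post.

```r
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")

n <- length(novel.vector.words)

# Compare each word with its successor. The last word has no successor,
# so it is dropped on the left-hand side; the first word is dropped on the right.
is.blood.of <- novel.vector.words[-n] == "blood" & novel.vector.words[-1] == "of"

# Positions of "blood" in each exact "blood of" pair
blood.of.positions <- which(is.blood.of)
blood.of.positions          # 1 5

# Number of "blood of" pairs in the vector
length(blood.of.positions)  # 2
```

The resulting positions could be used in place of node.positions in the KWIC loop from the question.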

In response to the question in the comments:

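This can certainly be done with a loop-based approach, but there is little point in reinventing the wheel when well-designed, optimized packages already do the heavy lifting for text-mining tasks.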
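
For reference, the loop-based approach mentioned above might look roughly like the following. This is only a sketch on a made-up toy vector: the counter blood.of.count, the helpers left.start and right.end, and the small window size are assumptions for illustration, and the words are assumed to be lower-cased already.

```r
novel.vector.words <- c("the", "blood", "of", "the", "lamb", "and",
                        "blood", "red", "wine", "blood", "of")
context <- 2  # small window, chosen only for the toy example
n.words <- length(novel.vector.words)

blood.of.count <- 0
for (i in seq_along(novel.vector.words)) {
  # Keep only "blood" immediately followed by "of";
  # i < n.words guards against looking past the end of the vector.
  if (i < n.words &&
      novel.vector.words[i] == "blood" &&
      novel.vector.words[i + 1] == "of") {
    blood.of.count <- blood.of.count + 1

    # Clamp the window so it never runs off either end of the vector
    left.start <- max(1, i - context)
    right.end  <- min(n.words, i + 1 + context)
    left.context  <- if (i > 1) novel.vector.words[left.start:(i - 1)] else character(0)
    right.context <- if (i + 1 < n.words) novel.vector.words[(i + 2):right.end] else character(0)

    # Print one tab-delimited KWIC line per match
    cat(left.context, "\t", "blood of", "\t", right.context, "\n")
  }
}

blood.of.count  # 2
```

The if statement plays the role the question asks about: no extra loop variable is needed beyond a counter, because the next word can be reached directly as novel.vector.words[i + 1].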

Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.

library(tidytext)
library(dplyr)
library(stringr)

## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
  mutate(line = row_number())

## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5

## Create ngrams using skip_ngrams token
blood_of <- fulltext %>% 
  unnest_tokens(output = ngram, input = text,  token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
  filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b"))) 

## Return count
blood_of %>%
  nrow

[1] 54

## Inspect first six line number indices
head(blood_of$line)

[1]  999 1279 1309 2192 3844 4135
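
The skip-gram search above counts pairs within five words of each other. For the strictly adjacent case asked about in the question, ordinary bigrams may be enough. Below is a minimal sketch on a two-row toy data frame; toy_text and adjacent are hypothetical names, and the sample sentences are made up rather than taken from Dracula.

```r
library(tidytext)
library(dplyr)

# Toy stand-in for the fulltext data frame, with explicit line numbers
toy_text <- data.frame(
  text = c("The blood of the lamb", "Blood red and blood of kings"),
  stringsAsFactors = FALSE
) %>%
  mutate(line = row_number())

# Tokenize into adjacent word pairs (lower-cased by default)
# and keep only the exact bigram "blood of"
adjacent <- toy_text %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) %>%
  filter(bigram == "blood of")

nrow(adjacent)  # 2
```

Depending on the tidytext version and its collapse setting, rows may be tokenized separately, in which case a pair split across a line break would not be counted.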
