Error when removing escape sequences with a regex, splitting text into paragraphs, and then applying ifelse in R

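Problem description

I am struggling to remove escape sequences with a regular expression, split text into paragraphs, and then apply ifelse to a data frame. I look forward to your help. Thank you.

For every text in the data frame, I want to search for words in the first paragraph only. I have a set of search terms. If a term is present, the result should be 1, otherwise 0.

Here is the data.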

data<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = c(NA, 
-15L), class = "data.frame")

For each entry in the Text column, I am searching for the following words:

library(stringr)
library(stringi)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(dplyr)
words<-c("field", "ocean", "glamor showcases")

I tried the following:

Removing the unwanted escape sequences.

When I try to remove "\t" and "\n", I get the following warning:

data1<-data %>% mutate(Text=gsub("\\t",Text,""))

Warning message: In gsub("\\t", Text, "") : argument 'replacement' has length > 1 and only the first element will be used

Splitting by paragraph

data1<-data %>% mutate(Text2=Text) %>% unnest_tokens("Text3",Text2,token="paragraphs")

Setting 1 if the word is present and 0 otherwise. This is the final table I expect:

finaldata<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor"), field = structure(c(2L, 3L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), country = structure(c(3L, 2L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), glamor.showcases = structure(c(2L, 
    3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor")), .Names = c("ID", "Text", "field", 
"country", "glamor.showcases"), row.names = c(NA, -15L), class = "data.frame")

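Any help would be greatly appreciated. Thank you.

I have looked at the following resources:

  1. Count word occurrences in R

  2. How to find whether a word/words in one column are present in another column containing a sentence [duplicate]

  3. Split by paragraph in R

  4. Split a text file into paragraph files in R

Tags: r, dplyr, tidyr, tidyverse, tidytext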

Solution


You can try this, assuming that a new paragraph in df$Text starts with \n\n.

# Check whether the first paragraph of df$Text contains each string in the 'words' vector.
# gsub() drops everything from the first "\n\n" onward, leaving only the first paragraph.
words_df <- do.call(cbind, lapply(words, function(x) 
  as.numeric(grepl(x, gsub("\n\n.*$", "", df$Text), ignore.case = TRUE))))
colnames(words_df) <- words

# Combine the flags with the original data frame to get the final result
final_df <- cbind(df, words_df)

This gives:

> final_df[, -(1:2)]
  field country glamor showcases
1     0       1                0
2     1       0                1
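
On the warning from the question: gsub() expects its arguments in the order (pattern, replacement, x), so gsub("\\t", Text, "") passes the whole Text column as the replacement and the empty string as the input. In addition, the sample data appears to hold literal backslash-plus-letter sequences (written as \\t and \\n in the dput output) rather than real tab or newline characters. The following is a minimal sketch, assuming those two-character sequences are what should be removed; the toy string and the variable names are made up for illustration.

```r
# Toy input (an assumption): literal backslash-t / backslash-n sequences, as in
# the dput() output, plus one real blank-line paragraph break ("\n\n").
text <- factor("\\n\\t\\tPublication Date\\n\\t First paragraph.\n\nSecond paragraph.")

# The column is a factor in the question, so convert it to character first.
text_chr <- as.character(text)

# Argument order is gsub(pattern, replacement, x).
# fixed = TRUE matches backslash + t and backslash + n literally instead of
# interpreting them as regex escapes for tab and newline.
no_tabs <- gsub("\\t", "", text_chr, fixed = TRUE)
cleaned <- gsub("\\n", "", no_tabs, fixed = TRUE)

# Collapse leftover runs of spaces; the real newlines are left untouched,
# so the paragraph break survives for the later split.
cleaned <- trimws(gsub(" {2,}", " ", cleaned))
```

Inside a dplyr pipeline the same call would read mutate(Text = gsub("\\t", "", Text, fixed = TRUE)). If the real data contains actual tab characters instead, the regex form gsub("\\t", "", Text) without fixed = TRUE would be the one to use.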


Sample data:

df <- structure(list(ID = structure(2:3, .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(2:3, .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\\n\\t\\t\\t\\t\\t \\n \\n   The soccer world cup is entralling. \\nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = 1:2, class = "data.frame")

words<-c("field", "country", "glamor showcases")
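
As a variation on the same idea, the first paragraph can be extracted with strsplit() and the search terms wrapped in word boundaries, so that "field" does not also match "fields" or "infield". This is only a sketch under stated assumptions: the toy data frame below is invented for illustration, paragraphs are assumed to be separated by "\n\n", and the search terms are assumed to contain no regex metacharacters.

```r
# Toy data (an assumption): two short texts, paragraphs separated by a blank line.
df <- data.frame(
  ID = c("1", "2"),
  Text = c(
    "Wonders exist in any country.\n\nLook at the ocean and the field.",
    "Everyone meets on the field. The glamor showcases shine.\n\nNo country is named here."
  ),
  stringsAsFactors = FALSE
)
words <- c("field", "country", "glamor showcases")

# Keep only the text before the first blank line.
first_para <- vapply(
  strsplit(df$Text, "\n\n", fixed = TRUE),
  function(p) p[1],
  character(1)
)

# One column per search term; \\b restricts matches to whole words.
# With more than one row, sapply() returns a matrix whose column names are the terms.
flags <- sapply(words, function(w) {
  as.integer(grepl(paste0("\\b", w, "\\b"), first_para, ignore.case = TRUE))
})

# Attach the 0/1 flags to the original data frame.
final_df <- cbind(df, flags)
```

Because only the first paragraph is searched, "field" in the second paragraph of the first toy text is ignored, which mirrors the behaviour of the gsub("\n\n.*$", "", ...) approach in the answer.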
