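r - Error when removing characters with a regex, splitting text into paragraphs, then applying ifelse in R
Problem description
I am struggling to remove unwanted characters with a regular expression, split text into paragraphs, and then apply ifelse to a data frame. I look forward to your help. Thank you.
For each text in the data frame, I want to search for words in the first paragraph only. I have a list of search terms. If a word is present, enter 1; otherwise enter 0.
Below is the data.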
data<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"),
Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017 he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.",
"\\n\\t\\t\\t\\t\\t \\n \\n The soccer world cup is entralling. \\nEveryone acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
), class = "factor")), .Names = c("ID", "Text"), row.names = c(NA,
-15L), class = "data.frame")
For each entry in the Text column, I am searching for the following words:
library(stringr)
library(stringi)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(dplyr)
words<-c("field", "ocean", "glamor showcases")
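I tried the following:
Removing the unwanted characters with a regular expression.
When I try to remove "\t" and "\n", I get the following warning: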
data1<-data %>% mutate(Text=gsub("\\t",Text,""))
Warning message: In gsub("\t", Text, "") : argument 'replacement' has length > 1 and only the first element will be used
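The warning points at the argument order: gsub() takes (pattern, replacement, x), so in the call above Text is passed as the replacement and "" as the input. A second issue is that the dput output shows most of the tabs and newlines as "\\t" and "\\n", that is, a literal backslash followed by a letter, which a pattern for a real tab character will not match. Below is a minimal sketch of one way to handle both; clean_text is a hypothetical helper name, and keeping real newlines (so that "\n\n" paragraph breaks survive) is an assumption about what the later steps need.

```r
# Hypothetical helper: strips literal "\t" / "\n" sequences (backslash + letter)
# and real tab characters, but keeps real newlines so paragraph breaks survive.
clean_text <- function(x) {
  x <- as.character(x)                   # the Text column is a factor
  x <- gsub("\\\\[tn]", "", x)           # regex \\[tn]: a backslash followed by t or n
  x <- gsub("\t", "", x, fixed = TRUE)   # real tab characters
  trimws(x)                              # drop leading/trailing whitespace
}

sample_text <- "\\n\\t\\t\tPublication Date: October 31, 2017\n\nSecond paragraph."
cleaned <- clean_text(sample_text)
```

Inside a pipeline this could be used as data %>% mutate(Text = clean_text(Text)).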
Splitting by paragraph
data1<-data %>% mutate(Text2=Text) %>% unnest_tokens("Text3",Text2,token="paragraphs")
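Since only the first paragraph is needed, one alternative to producing one row per paragraph is to extract the first paragraph directly in base R. The sketch below assumes paragraphs are separated by a blank line ("\n\n"); first_paragraph is a hypothetical helper name, and the input is converted with as.character() because the Text column is a factor.

```r
# Hypothetical helper: returns the first paragraph of each text,
# assuming paragraphs are separated by a blank line ("\n\n").
first_paragraph <- function(x) {
  parts <- strsplit(as.character(x), "\n\n", fixed = TRUE)
  # Empty inputs may split into zero pieces, so fall back to "".
  vapply(parts, function(p) if (length(p) > 0) p[1] else "", character(1))
}

texts <- c("Intro about the ocean.\n\nLater text about the field.",
           "Single paragraph only.",
           "")
first_par <- first_paragraph(texts)
```

This keeps one element per original text, so the result can be added as a new column without any regrouping by ID.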
Enter 1 if the word is present and 0 otherwise. The final table should look like this:
finaldata<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"),
Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017 he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.",
"\\n\\t\\t\\t\\t\\t \\n \\n The soccer world cup is entralling. \\nEveryone acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
), class = "factor"), field = structure(c(2L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"0", "1"), class = "factor"), country = structure(c(3L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"0", "1"), class = "factor"), glamor.showcases = structure(c(2L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"0", "1"), class = "factor")), .Names = c("ID", "Text", "field",
"country", "glamor.showcases"), row.names = c(NA, -15L), class = "data.frame")
Any help would be greatly appreciated. Thank you.
I have looked at the following resources -
Solution
You can try this, assuming that a new paragraph in df$Text starts with \n\n.
# Check whether the first paragraph of df$Text contains each string in the
# 'words' vector: everything from the first "\n\n" onward is dropped before matching.
words_df <- do.call(cbind, lapply(words, function(x)
  as.numeric(grepl(x, gsub("\n\n.*$", "", df$Text), ignore.case = TRUE))))
colnames(words_df) <- words
# Combine the 0/1 columns with the original data frame to get the final result.
final_df <- cbind(df, words_df)
This gives
> final_df[, -(1:2)]
  field country glamor showcases
1     0       1                0
2     1       0                1
Sample data:
df <- structure(list(ID = structure(2:3, .Label = c("", "1", "2"), class = "factor"),
Text = structure(2:3, .Label = c("", "\\n\\t\\t\\t\\t \\n\\t\\t\\t\\t\\tPublication Date: October 31, 2017\\n\\t\\t\\t\\t October 31, 2017 he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.",
"\\n\\t\\t\\t\\t\\t \\n \\n The soccer world cup is entralling. \\nEveryone acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
), class = "factor")), .Names = c("ID", "Text"), row.names = 1:2, class = "data.frame")
words<-c("field", "country", "glamor showcases")
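One thing to keep in mind with grepl() is that each search term is treated as a regular expression, so a term containing characters such as "." or "(" would be interpreted as a pattern. The sketch below is a variant of the same idea that matches the terms literally and case-insensitively by lowercasing both sides; flag_words is a hypothetical helper name, and the "\n\n" paragraph break is the same assumption as in the answer above.

```r
# Hypothetical helper: one 0/1 column per search term, matched literally and
# case-insensitively against the first paragraph only ("\n\n" assumed as break).
flag_words <- function(text, words) {
  x <- as.character(text)
  pos <- as.integer(regexpr("\n\n", x, fixed = TRUE))  # -1 when there is no break
  has_break <- !is.na(pos) & pos > 0
  x[has_break] <- substr(x[has_break], 1, pos[has_break] - 1)
  flags <- lapply(words, function(w) {
    as.integer(grepl(tolower(w), tolower(x), fixed = TRUE))
  })
  out <- do.call(cbind, flags)   # integer matrix: one row per text
  colnames(out) <- words
  out
}

toy <- c("One country, many wonders.\n\nThe ocean is elsewhere.",
         "Players meet on the Field. Glamor showcases follow.")
res <- flag_words(toy, c("field", "country", "glamor showcases", "ocean"))
```

Here "ocean" is flagged 0 for the first text because it only appears after the paragraph break. The result is a matrix, so it can be attached with cbind(df, flag_words(df$Text, words)) in the same way as words_df.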