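r - Regex in R to match sentences with adjacent and non-adjacent word repetition
Problem description
I have a data frame with sentences; in some sentences, words are used more than once: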
df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
"it 's like being in a play-group , in n it ?",
"oh is that that steak i got the other night ?",
"well where have the middle sized soda stream bottle gone ?",
"this is a half day , right ? needs a full day",
"yourself , everybody 'd be changing your hair in n it ?",
"cos he finishes at four o'clock on that day anyway .",
"no no no i 'm dave and you 're alan .",
"yeah , i mean the the film was quite long though",
"it had steve martin in it , it 's a comedy",
"oh it is a dreary old day in n it ?",
"no it 's not mother theresa , it 's saint theresa .",
"oh have you seen that face lift job he wants ?",
"yeah bolshoi 's right so which one is it then ?"))
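I'd like to match those sentences in which a word, any word, is repeated once or more.
EDIT 1:
The repeated words can be adjacent, but they need not be. That is why a regex for consecutive duplicate words does not answer my question.
I've been modestly successful with this code: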
df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?
[2] it 's like being in a play-group , in n it ?
[3] oh is that that steak i got the other night ?
[4] this is a half day , right ? needs a full day
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .
[7] yeah , i mean the the film was quite long though
[8] it had steve martin in it , it 's a comedy
[9] oh it is a dreary old day in n it ?
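The success is only moderate, however, because some sentences are matched that should not be, e.g., "yourself , everybody 'd be changing your hair in n it ?", while others that should be matched are not, e.g., "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?
Expected result: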
df
Turn
2 it 's like being in a play-group , in n it ?
3 oh is that that steak i got the other night ?
5 this is a half day , right ? needs a full day
8 no no no i 'm dave and you 're alan .
9 yeah , i mean the the film was quite long though
10 it had steve martin in it , it 's a comedy
11 oh it is a dreary old day in n it ?
12 no it 's not mother theresa , it 's saint theresa .
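EDIT 2:
Another question is how to define the exact number of repeated words. The imperfect regex above matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for three occurrences of a word, I get this code and this result: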
df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
But again, the match is imperfect, as the expected result would be:
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy # "it" occurs 3 times
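Any help is much appreciated!
Solution
Options for defining the exact number of repeated words.
Extract sentences in which the same word occurs 3 times
Change the regex.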
(\s?\b\w+\b\s)(.*\1){2}
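Group 1 captures (\s?\b\w+\b\s)
- \s? : a whitespace character, zero or one time.
- \b\w+\b : a whole word (one or more word characters between word boundaries).
- \s : a whitespace character, exactly once.
Group 2 captures (.*\1)
- (.*\1) : any characters, zero or more times, followed by another match of the text captured by group 1.
- (.*\1){2} : group 2 matches twice, so the text of group 1 appears three times in the match.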
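To see what the groups capture in practice, a single match can be inspected. The following is a minimal sketch using base R's `regexec` and `regmatches` with `perl = TRUE` (assuming a reasonably recent R version in which `regexec` accepts that argument); the variable names are illustrative and not part of the original answer.

```r
# One sentence from the question's data, used only for illustration
x <- "it had steve martin in it , it 's a comedy"
pattern <- "(\\s?\\b\\w+\\b\\s)(.*\\1){2}"

# regexec() reports the position of the full match and of each group;
# regmatches() turns those positions into substrings
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

m[1]  # full match
m[2]  # group 1: the repeated word plus its trailing space
m[3]  # group 2: text of its last repetition
```

Here group 1 is the word `it` followed by a space, which is the text the backreference has to find two more times.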
Code
df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
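Two caveats with this pattern may be worth keeping in mind. The backreference `\1` is not itself anchored by `\b`, so a later occurrence could in principle match at the end of a longer word (for example `it ` inside `bit `), and `{2}` means "at least three occurrences" rather than exactly three. A possible variant, sketched here under those assumptions and not taken from the original answer, anchors every occurrence with `\b` and builds the quantifier from a parameter; `has_word_n_times` is a hypothetical helper name.

```r
# Hypothetical helper: TRUE where some word occurs at least `n` times,
# adjacent or not. Every occurrence must be a whole word.
has_word_n_times <- function(x, n) {
  stopifnot(n >= 2)
  # The first occurrence is captured; (n - 1) further whole-word copies must follow
  pattern <- sprintf("\\b(\\w+)\\b(?:.*\\b\\1\\b){%d}", as.integer(n) - 1L)
  grepl(pattern, x, perl = TRUE)
}

turns <- c("no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "yeah , i mean the the film was quite long though",
           "yourself , everybody 'd be changing your hair in n it ?")

has_word_n_times(turns, 2)  # TRUE TRUE TRUE FALSE
has_word_n_times(turns, 3)  # TRUE TRUE FALSE FALSE
```

With `n = 2` this should also cover the first part of the question (a word repeated at least once), selecting rows 2, 3, 5, 8, 9, 10, 11 and 12 of the question's data. Note that `\w` does not include the apostrophe, so tokens such as `'s` or `n't` are compared by their letter runs (`s`, `n`, `t`).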
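- Use strsplit(split="\\s") to split each sentence into words.
- Use sapply and table to count the occurrences of each word in each list element, then select the sentences that meet the requirement.
Code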
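library(magrittr)
# Make sure Turn is character (it is a factor in R versions before 4.0)
df$Turn %<>% as.character()
# For each sentence, tabulate its words and check whether any occurs exactly 3 times
s <- strsplit(df$Turn, "\\s") %>% sapply(function(i) any(table(i) == 3))
# Keep only the sentences that pass the check
df$Turn[s]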
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
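The same idea can be written in base R without `magrittr`, with the count as a parameter. This is a minimal sketch rather than the original answerer's code; `count_filter` is a hypothetical name, and dropping tokens that contain no letters or digits (so that `,` or `?` are not counted as words) is an added assumption.

```r
# Hypothetical helper: TRUE where some token occurs exactly `n` times
# (or at least `n` times when exact = FALSE)
count_filter <- function(x, n, exact = TRUE) {
  tokens <- strsplit(as.character(x), "\\s+")
  vapply(tokens, function(w) {
    w <- w[grepl("[[:alnum:]]", w)]  # assumption: ignore pure punctuation
    counts <- table(w)
    if (exact) any(counts == n) else any(counts >= n)
  }, logical(1))
}

turns <- c("no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "oh is that that steak i got the other night ?",
           "cos he finishes at four o'clock on that day anyway .")

turns[count_filter(turns, 3)]                 # the first two sentences
turns[count_filter(turns, 2, exact = FALSE)]  # the first three sentences
```

Unlike the regex, this approach compares whole whitespace-separated tokens, so `'s` and `n't` are counted as they appear in the text.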
Hope this helps :)