r - How to extract series of words separated by commas and start and end words?
Problem description
Given this sort of text,
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
I need to extract "this guy, this other guy, that guy, that other guy, something else"
So, I need to tell regex to match any sequence of words occurring between any one of the following:
two commas
a "particular phrase" and a comma
a comma and an "or"
an "or" and a space
I'd be content with a solution that includes a few undesired words, if that is the most that can be asked of regex.
I'd imagine the code would look something like this (which doesn't run because I am a total regex noob):
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
this_pattern <- "^.*\\b(particular phrase|,|or)\\W(\\w+\\W+)+\\W(,|or).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = T)
EDIT:
I am getting closer with this (which does run):
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
this_pattern <- "^.*\\b(particular phrase)\\W+(.*)\\W+(,|or).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = T)
#[1] "this guy, this other guy, that guy, that other guy,"
But how to include the last item "something else"?
Solution
This is the closest match to the current requirements:
(?:\bparticular phrase\b|\bor\b|,)\s*\b(?!or\b)(\w+(?:[^,.\w]+\w+)*?)(?=\s*(?:,|\bor\b))
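See the regex demo.
Details
(?:\bparticular phrase\b|\bor\b|,) - the whole word or, the phrase particular phrase, or a comma
\s* - 0+ whitespace characters
\b - a word boundary
(?!or\b) - a negative lookahead: the next word cannot be or
(\w+(?:[^,.\w]+\w+)*?) - Group 1:
  \w+ - 1+ word characters
  (?:[^,.\w]+\w+)*? - 0+ repetitions, as few as possible, of:
    [^,.\w]+ - 1+ characters other than a comma, a dot, or a word character
    \w+ - 1+ word characters
(?=\s*(?:,|\bor\b)) - a positive lookahead that requires 0+ whitespace characters followed by a comma or the whole word or immediately after the current position.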
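The capture-group form of the pattern can also be used from R as written. The following is only a minimal sketch: it relies on the "capture.start" and "capture.length" attributes that gregexpr attaches when perl = TRUE, and the variable names m, starts, lens and group1 are illustrative, not part of the original answer.

```r
# Capture-group form of the pattern (backslashes doubled for an R string).
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)(\\w+(?:[^,.\\w]+\\w+)*?)(?=\\s*(?:,|\\bor\\b))"
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

# perl = TRUE is needed for the lookarounds; it also records capture positions.
m <- gregexpr(pattern, this_txt, perl = TRUE, ignore.case = TRUE)[[1]]

# Column 1 of each matrix corresponds to Group 1, one row per match.
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Cut Group 1 out of the input, leaving the delimiter prefix behind.
group1 <- substring(this_txt, starts, starts + lens - 1)
print(group1)
```

This avoids \K at the cost of some bookkeeping. Calling regmatches on the same match data would return the whole matches, including the leading phrase, comma or "or".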
R demo:
# \K discards the prefix matched so far, so each match is only the item itself.
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)\\K\\w+(?:[^,.\\w]+\\w+)*(?=\\s*(?:,|\\bor\\b))"
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
# perl = TRUE is required for \K and the lookarounds.
regmatches(this_txt, gregexpr(pattern, this_txt, perl = TRUE, ignore.case = TRUE))[[1]]
Output:
[1] "this guy" "this other guy"
[3] "that guy" "that other guy"
[5] "something else blah blah blah"
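The question asks for a single comma-separated string rather than a vector, so the matches can be joined afterwards. This is a minimal sketch: it reuses the \K form of the pattern, with the lookahead grouped as (?:,|\bor\b), and the names items and one_string are illustrative.

```r
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)\\K\\w+(?:[^,.\\w]+\\w+)*(?=\\s*(?:,|\\bor\\b))"
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

# Extract every item as a separate element of a character vector.
items <- regmatches(this_txt, gregexpr(pattern, this_txt, perl = TRUE, ignore.case = TRUE))[[1]]

# Join the items back into one string, separated by ", ".
one_string <- paste(items, collapse = ", ")
print(one_string)
```

The last item still carries the trailing "blah blah blah", because nothing in the text marks where "something else" ends before the next comma. This is the kind of undesired extra the question says is acceptable.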