How to extract series of words separated by commas and start and end words?

Problem description

Given this sort of text,

this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

I need to extract "this guy, this other guy, that guy, that other guy, something else"

So, I need to tell regex to match any sequence of words occurring between any one of the following:

two commas

a "particular phrase" and a comma

a comma and an "or"

an "or" and a space

I'd be content with a solution that includes a few undesired words, if that is the most that can be asked of regex.

I'd imagine the code would look something like this (which doesn't run because I am a total regex noob):

this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
this_pattern <- "^.*\\b(particular phrase|,|or)\\W(\\w+\\W+)+\\W(,|or).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = T)

EDIT:

I am getting closer with this (which does run):

  this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
  this_pattern <- "^.*\\b(particular phrase)\\W+(.*)\\W+(,|or).*$"
  gsub(this_pattern, "\\2", this_txt, ignore.case = T)
#[1] "this guy, this other guy, that guy, that other guy,"

But how to include the last item "something else"?

Tags: r, regex, gsub

Solution

This is the closest match to the current requirements:

(?:\bparticular phrase\b|\bor\b|,)\s*\b(?!or\b)(\w+(?:[^,.\w]+\w+)*?)(?=\s*(?:,|\bor\b))

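Details

  • (?:\bparticular phrase\b|\bor\b|,) - the whole phrase "particular phrase", the whole word "or", or a comma
  • \s* - 0+ whitespace characters
  • \b - a word boundary
  • (?!or\b) - the next word cannot be "or"
  • (\w+(?:[^,.\w]+\w+)*?) - Group 1:
    • \w+ - 1+ word characters
    • (?:[^,.\w]+\w+)*? - 0+ repetitions, as few as possible, of
      • [^,.\w]+ - 1+ characters other than a comma, a dot, or a word character
      • \w+ - 1+ word characters
  • (?=\s*(?:,|\bor\b)) - a positive lookahead that requires 0+ whitespace characters followed by a comma or the whole word "or" immediately to the right of the current position.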
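The breakdown above can be tried directly in base R. The following is a minimal sketch, not the only way to do it: it assumes the capturing-group form of the pattern and reads Group 1 from the "capture.start" and "capture.length" attributes that gregexpr() attaches when perl = TRUE. The variable names (m, starts, lens, items) are illustrative only.

```r
# Sample text from the question.
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

# Capturing-group form of the pattern (backslashes doubled for an R string).
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)(\\w+(?:[^,.\\w]+\\w+)*?)(?=\\s*(?:,|\\bor\\b))"

# gregexpr() returns a list with one element per input string.
m <- gregexpr(pattern, this_txt, perl = TRUE, ignore.case = TRUE)[[1]]

# With perl = TRUE, each match records where its capture groups start and how long they are.
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Cut Group 1 out of the text for every match.
items <- substring(this_txt, starts, starts + lens - 1)
items
```

Reading the group positions this way keeps the delimiter (the phrase, "or", or the comma) out of the result without any post-processing.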

R demo

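# \K discards the matched prefix, so each match is just the item and no capture group is needed.
# The lookahead is grouped as (?:,|\bor\b) so that optional whitespace may precede either a comma or "or".
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)\\K\\w+(?:[^,.\\w]+\\w+)*?(?=\\s*(?:,|\\bor\\b))"
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
regmatches(this_txt, gregexpr(pattern, this_txt, perl=TRUE, ignore.case=TRUE))[[1]]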

Output:

[1] "this guy"                      "this other guy"               
[3] "that guy"                      "that other guy"               
[5] "something else blah blah blah"
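The question asks for a single comma-separated string rather than a vector. As a small sketch, assuming the five items shown in the output above are stored in a vector (called items here purely for illustration), paste() with collapse joins them. The trailing "blah blah blah" on the last item is the kind of extra text the question said it could tolerate.

```r
# Items copied from the output above.
items <- c("this guy", "this other guy", "that guy", "that other guy",
           "something else blah blah blah")

# Join the items into one comma-separated string.
result <- paste(items, collapse = ", ")
result
```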
