Regex wildcard that is greedy only up to a stop word

Problem description

I am trying to build a "simple" regular expression (in Java) that matches sentences like the following:

I want to cook something
I want to cook something with chicken and cheese
I want to cook something with chicken but without onions
I want to cook something without onions but with chicken and cheese
I want to cook something with candy but without nuts within 30 minutes

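Ideally, it should also match: I want to cook something with candy and without nuts within 30 minutes

In these examples I want to capture the ingredients to include, the ingredients to exclude, and the maximum duration of the cooking process. As you can see, each of these three capture groups is optional in the pattern, each begins with a specific keyword (with, (but )?without, within), and each group should match with a wildcard until the next of these keywords is found. The ingredients can also consist of several words, so in the second/third example "chicken and cheese" should be matched by the named capture group "included".

Ideally, I would like to write a pattern similar to this one: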

I want to cook something ((with (?<include>.+))|((but )?without (?<exclude>.+))|(within (?<duration>.+) minutes))*

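Obviously this does not work, because the wildcards can also match the keywords: once the first keyword has matched, everything that follows (including the other keywords) is consumed by the greedy wildcard of the corresponding named capture group.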
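The effect described above can be reproduced with a small Java sketch. The pattern here is a simplified variant of the one in the question (an assumption made for brevity: only the "with" and "without" branches are kept, and the leading space is moved into each branch); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyWildcardDemo {
    // Simplified variant of the question's pattern (assumption: "within" branch omitted).
    static final Pattern NAIVE = Pattern.compile(
        "I want to cook something"
        + "(?: with (?<include>.+)| (?:but )?without (?<exclude>.+))*");

    // Returns the text of the named group, or null if the sentence does not
    // match or the group did not take part in the match.
    static String capture(String sentence, String group) {
        Matcher m = NAIVE.matcher(sentence);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String s = "I want to cook something with chicken but without onions";
        // The greedy .+ after "with" runs to the end of the sentence,
        // so "but without onions" never reaches the exclude branch.
        System.out.println("include = " + capture(s, "include")); // chicken but without onions
        System.out.println("exclude = " + capture(s, "exclude")); // null
    }
}
```

The sentence still matches as a whole, which is why the problem is easy to miss: only the contents of the groups are wrong.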

I tried using a lookahead, for example:

something ((with (?<IncludedIngredients>.*(?=but)))|(but )?without (?<ExcludedIngredients>.+))+

This regex recognizes "something with chicken but without onions" but does not match "something with chicken".
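A short Java check of this attempt, as a sketch with hypothetical class and method names. The lookahead (?=but) requires the word "but" to follow the included ingredients, so a sentence that simply ends after them cannot satisfy the "with" branch.

```java
import java.util.regex.Pattern;

class LookaheadAttemptDemo {
    // The lookahead attempt from the question, unchanged.
    static final Pattern ATTEMPT = Pattern.compile(
        "something ((with (?<IncludedIngredients>.*(?=but)))"
        + "|(but )?without (?<ExcludedIngredients>.+))+");

    // True if the attempt finds a match anywhere in the sentence.
    static boolean found(String sentence) {
        return ATTEMPT.matcher(sentence).find();
    }

    public static void main(String[] args) {
        // "but" follows the ingredients, so the lookahead succeeds.
        System.out.println(found("something with chicken but without onions")); // true
        // No "but" anywhere after "with", so the lookahead can never succeed.
        System.out.println(found("something with chicken")); // false
    }
}
```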

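Is there a simple way to do this with a regular expression?

P.S. A "simple" solution means one where I do not have to spell out every possible combination of these keywords in a sentence and order the combinations by the number of keywords each one uses.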

Tags: java, regex, regex-lookarounds, capturing-group

Solution


It can probably be boiled down to the construct below.

(?m)^I[ ]want[ ]to[ ]cook[ ]something(?=[ ]|$)(?<Order>(?:(?<with>\b(?:but[ ])?with[ ](?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)*)|(?<without>\b(?:but[ ])?without[ ](?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)*)|(?<time>\bwithin[ ](?<duration>.+)[ ]minutes[ ]?)|(?<unknown>(?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)+))*)$

https://regex101.com/r/RHfGnb/1
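As a rough sketch of how this construct might be used from Java: the pattern below is assembled from string pieces so that the repeated "stop at the next keyword" lookahead appears only once, but it is intended to be the same regex as above. The class name, the helper method, and the decision to trim the captured text are assumptions for illustration. Note that the with and without groups capture the keyword itself along with the ingredients.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CookingOrderParser {
    // Negative lookahead: fails where a keyword (with, without, within,
    // optionally preceded by "but ") starts.
    static final String STOP = "(?!(?:\\b(?:but[ ])?with(?:in|out)?\\b))";
    // One character that does not start a keyword.
    static final String BODY = "(?:" + STOP + ".)";

    static final Pattern ORDER = Pattern.compile(
        "(?m)^I[ ]want[ ]to[ ]cook[ ]something(?=[ ]|$)"
        + "(?<Order>(?:"
        + "(?<with>\\b(?:but[ ])?with[ ]" + BODY + "*)"
        + "|(?<without>\\b(?:but[ ])?without[ ]" + BODY + "*)"
        + "|(?<time>\\bwithin[ ](?<duration>.+)[ ]minutes[ ]?)"
        + "|(?<unknown>" + BODY + "+)"
        + ")*)$");

    // Returns the trimmed text of the named group, or null if the sentence
    // does not match or the group did not take part in the match.
    static String group(String sentence, String name) {
        Matcher m = ORDER.matcher(sentence);
        if (!m.find() || m.group(name) == null) {
            return null;
        }
        return m.group(name).trim();
    }

    public static void main(String[] args) {
        String s = "I want to cook something with candy but without nuts within 30 minutes";
        System.out.println(group(s, "with"));     // with candy
        System.out.println(group(s, "without"));  // but without nuts
        System.out.println(group(s, "duration")); // 30
    }
}
```

Because each named group sits inside a repeated group, Java keeps only the last text captured for that name, so a sentence with two separate "with" clauses would report only one of them.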

Expanded

 (?m)
 ^ I [ ] want [ ] to [ ] cook [ ] something
 (?= [ ] | $ )
 (?<Order>                      # (1 start)
      (?:
           (?<with>                      # (2 start)
                \b
                (?: but [ ] )?
                with [ ]
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )*
           )                             # (2 end)
        |  (?<without>                   # (3 start)
                \b
                (?: but [ ] )?
                without [ ]
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )*
           )                             # (3 end)
        |  (?<time>                      # (4 start)
                \b within [ ]
                (?<duration> .+ )             # (5)
                [ ] minutes [ ]? 
           )                             # (4 end)
        |  (?<unknown>                   # (6 start)
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )+
           )                             # (6 end)
      )*
 )                             # (1 end)
 $
