How do I avoid matching a longer string in favor of a shorter substring?

Problem description

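I have the following regular expression, which is meant to extract the number of people attending an event from social media data:

([0-9]+)?(,)?[0-9]+(\s*(\.|,)\s*[0-9])?\s*(k|K)?\s*(P|p).*e\s*(G|g).*g

I'm new to regular expressions, but I tried using {} to limit the number of characters matched.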

The problem is that it matches not only "60 people going" but also "184 people interested 20 people going".

In the first case it gives me the value I want (60), but in the second case I get 184 instead of 20.
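The behavior described above can be reproduced with a short Python sketch. It assumes the pattern is applied with `re.search`, since the question does not show the calling code; the variable names are illustrative only.

```python
import re

# The pattern exactly as posted in the question.
PATTERN = re.compile(
    r"([0-9]+)?(,)?[0-9]+(\s*(\.|,)\s*[0-9])?\s*(k|K)?\s*(P|p).*e\s*(G|g).*g"
)

short_text = "60 people going"
long_text = "184 people interested 20 people going"

short_match = PATTERN.search(short_text)
long_match = PATTERN.search(long_text)

# The short text matches as intended.
print(short_match.group(0))  # 60 people going

# In the long text the match begins at "184": the greedy ".*" after the
# first "p" runs across "interested 20" before it reaches "e going".
print(long_match.group(0))   # 184 people interested 20 people going
```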


Example 1:

"United Muslims of America shared their event. \nSponsored B \nIf you also think that there should only be peace, come with us on Juney 3 \nand let's make it happen. \nStop warl Stop killing the innocent! \nsrop \nKiLLiNG \nTHE iNNOCENT \nJUN \nLike \nMake peacei not war! \nSat PM EDT The White House Washington, \n184 people interested 20 people going \nComment \nInterested \n"

Example 2:

"BM shared their event. \nSponsored \nWe're proud to announce an initiative focused on providing free legal \neducation to empower our people and strengthen our community. \nWe believe that having these legal workshops on a monthly basis will prove \nto be beneficial in a tangible way for our community \nMeet you at \nLEGAL \nNIGHT A \nCharlotte, NC \nFREE LEGAL INFO FOR COMMUNITY \nJANUARY, 28, 5 PM \nJAN \n28 \nLegal Night at \nSat 5 PM \n95 people interested 18 people going \nCharlotte \n* Interested \n19 Reactions \nLike Comment \n"

Tags: python, regex

Solution


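If you want to match a number immediately followed by "people going", you can leave out the optional parts and the .* in the middle of the pattern, because .* matches too much: it lets the match run from "184 people" through "interested 20" to "going".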

Some parts of your pattern can be simplified, assuming you don't use the captured groups separately in your code and only need the match:

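  • (P|p) can be written as [pP] using a character class
  • ([0-9]+)? can be written as [0-9]*
  • (G|g).*g matches G or g and then everything up to the last occurrence of g on the line. You could change it to [Gg]\S*g, using \S to match only non-whitespace characters.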
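To illustrate the points in the list, here is a small Python sketch. The sentence in `text` is made up for the demonstration and is not taken from the question's data.

```python
import re

# Made-up sentence that contains more than one "g" after "going".
text = "20 people going to the gig"

# ".*" is greedy: it runs to the last "g" on the line.
greedy = re.search(r"(G|g).*g", text).group(0)

# "\S*" cannot cross whitespace, so the match stays inside one word.
bounded = re.search(r"[Gg]\S*g", text).group(0)

print(greedy)   # going to the gig
print(bounded)  # going

# An alternation of single characters and a character class find the same letters.
same_letters = re.findall(r"(?:P|p)", "People people") == re.findall(r"[pP]", "People people")
print(same_letters)  # True
```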

For example

 \b[0-9]+ [Pp]eople [Gg]oing\b
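A minimal sketch of how this pattern might be used from Python. The capture group around the digits and the helper name `people_going` are additions made here for illustration and are not part of the answer's pattern; the sample strings are shortened excerpts of the two examples in the question.

```python
import re

# The suggested pattern, with a capture group added around the digits
# (an assumption, so the number can be read directly from the match).
GOING = re.compile(r"\b([0-9]+) [Pp]eople [Gg]oing\b")

def people_going(text):
    """Return the 'people going' count as an int, or None if absent."""
    match = GOING.search(text)
    return int(match.group(1)) if match else None

samples = [
    "The White House Washington, \n184 people interested 20 people going \nComment \n",
    "Sat 5 PM \n95 people interested 18 people going \nCharlotte \n",
    "60 people going",
]

counts = [people_going(sample) for sample in samples]
print(counts)  # [20, 18, 60]
```

This simplified pattern only handles plain integers. Counts written with a separator or a K suffix, which the original pattern appears to target, would need extra optional parts between the digits and the space.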


