Regex - Word boundary not working even with raw-string

Question

I'm writing a set of regexes to match dates in text using Python. One of them is designed to match dates in the MM/YYYY format only. The regex is the following:

r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'

It looks like the word boundary is not working, because the regex matches parts of dates like 12/02/2020 (it should not match this date format at all).

In the attached image, only the second pattern should have been recognized. No part of the first one should have matched.

Keep in mind that the regex has to look for the MM/YYYY pattern in strings like:

"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."

Can you help me find the error in my pattern so that it matches only the intended format?


Tags: python, regex

Solution


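The problem is that the match of \b\d{2}/\d{4}\b in the string 01/02/2000 is 02/2000, because the first forward slash is a word break: the position between / and 0 counts as a word boundary. The solution is to identify the characters that must not appear immediately before and after the match, and to use negative lookarounds in place of the word breaks. Here you could use the regular expression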

r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'

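The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or a forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or a forward slash.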
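As a quick check of the behaviour described above, here is a minimal sketch that compares the word-boundary version with the lookaround version. The sample sentence and the variable names (`with_boundaries`, `with_lookarounds`, `text`) are made up for illustration and are not from the original post.

```python
import re

# Word-boundary version: the slash inside a full date counts as a boundary.
with_boundaries = re.compile(r'\b\d{2}/\d{4}\b')

# Lookaround version from the answer: no digit or slash may touch the match.
with_lookarounds = re.compile(r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])')

# Illustrative sample text (an assumption, not taken from the question).
text = "The range goes from 21/02/2020 to 21/03/2020, billed in 03/2020."

# The \b pattern picks up the MM/YYYY tail of each DD/MM/YYYY date.
print(with_boundaries.findall(text))   # ['02/2020', '03/2020', '03/2020']

# The lookaround pattern only accepts the standalone MM/YYYY date.
print(with_lookarounds.findall(text))  # ['03/2020']
```

Because the pattern has no capturing groups, `findall` returns the full matched text. Add groups around the month and year if you need them separately.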


If 6/2000 should be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
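A short sketch of that variant, assuming the rest of the pattern stays unchanged. The name `optional_zero` and the sample sentence are hypothetical.

```python
import re

# Same lookarounds as before, but the leading zero of the month is optional.
optional_zero = re.compile(r'(?<![\d/])(?:0?[1-9]|1[0-2])/\d{4}(?![\d/])')

# Illustrative sample text (an assumption, not taken from the question).
sample = "Due 6/2000, 06/2000 and 12/2000, but not 16/2000 or 1/6/2000."

# 16/2000 is rejected because the 6 is preceded by a digit;
# 1/6/2000 is rejected because the 6 is preceded by a slash.
print(optional_zero.findall(sample))  # ['6/2000', '06/2000', '12/2000']
```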

