Skipping a regex pattern that is contained within the pattern I'm looking for

Question

I am parsing Pandoc-markdown files that contain footnotes starting with ^[ and ending with ], some of which contain embedded [] pairs. For example:

...
to explain how the feature came to be as it is, so you can use generics more
effectively.^[Angelika Langer's [Java Generics FAQ](
www.angelikalanger.com/GenericsFAQ/JavaGenericsFAQ.html) as well as her other
writings (together with Klaus Kreft) were invaluable during the preparation of
this chapter.]
...

The naive approach (in Python):

re.compile(r"\^\[.+?\]", flags=re.DOTALL)

stops at the first ], so it does not capture the entire footnote. Is there a way to skip over nested [] pairs?
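To make the failure concrete, here is a minimal sketch that runs the naive pattern on a shortened footnote. The sample string and its URL are made up for illustration and are not taken from the original file.

```python
import re

# Shortened, hypothetical sample: a footnote containing a nested [link](url).
sample = "effectively.^[Langer's [Java Generics FAQ](example.com/faq) was invaluable.] More text."

naive = re.compile(r"\^\[.+?\]", flags=re.DOTALL)
match = naive.search(sample)

# The lazy .+? stops at the first ], which closes the inner
# [Java Generics FAQ] link text rather than the footnote itself.
truncated = match.group(0)
print(truncated)  # ^[Langer's [Java Generics FAQ]
```

Making the quantifier greedy does not help either: it would run to the last ] in the whole text, so the pattern needs some way to count bracket pairs.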

Tags: regex

Answer


You can do this with a subroutine call in the PyPI regex module; you only need to be careful about where you set the group boundaries:

import regex
text = r"""...
to explain how the feature came to be as it is, so you can use generics more
effectively.^[Angelika Langer's [Java Generics FAQ](
www.angelikalanger.com/GenericsFAQ/JavaGenericsFAQ.html) as well as her other
writings (together with Klaus Kreft) were invaluable during the preparation of
this chapter.]
..."""
print( [x.group(1) for x in regex.finditer(r'\^(\[(?:[^][]++|(?1))*])', text)] )

Output:

["[Angelika Langer's [Java Generics FAQ](\nwww.angelikalanger.com/GenericsFAQ/JavaGenericsFAQ.html) as well as her other\nwritings (together with Klaus Kreft) were invaluable during the preparation of\nthis chapter.]"]

Details:

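  • \^ - a ^ character
  • (\[(?:[^][]++|(?1))*]) - Group 1:
    • \[ - a [ character
    • (?:[^][]++|(?1))* - zero or more occurrences of:
      • [^][]++ - one or more characters other than ] and [ (possessive, so no backtracking into the run)
      • | - or
      • (?1) - the Group 1 pattern, recursed as a subroutine
    • ] - a ] character.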
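
If installing the third-party regex module is not an option, the same balanced-bracket idea can be approximated with a plain depth counter using only the standard library. This is a sketch and not part of the answer above: find_footnotes is a hypothetical helper name, and, like the regex, it assumes brackets are never backslash-escaped.

```python
def find_footnotes(text):
    """Return each ^[...] footnote, brackets included, allowing nested []."""
    footnotes = []
    i = 0
    while True:
        start = text.find("^[", i)
        if start == -1:
            break
        depth = 0
        end = None
        # Scan from the opening [ and track the nesting depth.
        for j in range(start + 1, len(text)):
            ch = text[j]
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
                if depth == 0:
                    end = j
                    break
        if end is None:
            # Unbalanced footnote: skip this ^[ and keep looking.
            i = start + 2
            continue
        footnotes.append(text[start + 1 : end + 1])
        i = end + 1
    return footnotes


# Hypothetical sample with one nested and one simple footnote.
sample = "a^[one [x](y) two] b^[three] c"
result = find_footnotes(sample)
print(result)  # ['[one [x](y) two]', '[three]']
```

Each returned string corresponds to what Group 1 captures in the regex version: the outer brackets plus everything between them.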
