What regular expression matches this pattern?

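Problem description

I want to match the text below. The pattern is an item that starts with a number (for example 2.1), followed by one or more further such items. Some items, like 2.1, span multiple lines. I want to match a whole block of such items.

The pattern would be:

(a new line starting with a number such as 2.1, possibly followed by one or more lines that do not start with such a number), followed by one or more repetitions of the same pattern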

2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.
3.1 [viii] Articles of Incorporation of Company, as amended.
3.2 [viii] Bylaws of Company.
10.1 [I] Preferred Stock Purchase Agreement dated September 29, 1983,
together with amendments thereto dated February 28, 1984 and
10.2 [I] Form of Indemnification Agreement between Company and its
officers, directors and certain other key employees.
10.3 [I] Amendment to form of Indemnification Agreement.
10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991.
10.5 [vi] 1988 Employee Stock Purchase Plan, as amended October 1992.
10.6 [v] Amended and Restated 1992 Stock Option Plan.

Here is my regular expression:

pattern = r"(?:\n\d{1,2}\.\d{1,2}.{1,200}){2,}\n"

text = re.sub(pattern,"", text, re.S)

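It doesn't work yet. DOTALL didn't help. Thanks!

As an intermediate step, how do I match lines that do not start with \d{1,2}\.\d{1,2}? Negative lookbehind does not work with variable-length patterns.

Here is more sample text: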

2.01 Acquisition Agreement dated as of March 26, 1997 by and between
registrant and ISAR-Vermogensverwaltung Gbr mbH ("ISAR")(1)

3.01 Registrant's Amended and Restated Articles of Incorporation, as
amended(2)

3.02 Registrant's Certificate of Amendment of Articles of
Incorporation filed prior to the closing of registrant's initial
public offering(2)

3.03 Registrant's Amended and Restated Articles of Incorporation
filed following the closing of registrant's initial public
offering(2)

3.04 Registrant's Bylaws(2)
3.05 Registrant's Amended and Restated Bylaws adopted prior to the
closing of registrant's initial public offering(2)
3.06 Certificate of Amendment of Amended and Restated Articles of
Versant Object Technology Corporation(7)

3.07 Registrant's Certificate of Determination dated July 12, 1999,
incorporated by reference to the Company's current report on
Form 8-K (Exhibit 3.01) filed July 12, 1999.

4.01 [intentionally omitted]
4.02 Preferred Stock Purchase Agreement, dated as of April 27, 1994,
as amended(2)

10.01 Registrant's 1989 Stock Option Plan, as amended, and related
documents(2)**

10.02 Registrant's 1996 Equity Incentive Plan, as amended, and related
documents(3)**

10.03 Registrant's 1996 Directors Stock Option Plan, as amended, and
related documents(4)**

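The salient features are: (1) the items start with numbers such as 2.01 and 10.03; (2) many of them (at least 2) are clustered together. The irregularities are: (1) some items span multiple lines, like 2.01, and some fit on one line, like 3.04; (2) there may or may not be a blank line between items: there is one between 2.01 and 3.01, but none between 3.04 and 3.05.

I want to match the complete block of such text and delete it. The rest of the text is regular sentences. Some of them may start with a number (for example a heading numbered 2.1), but they are not clustered together like the text above.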

Tags: python, regex

Solution


If you just want each paragraph as a separate item, I suggest the following:

import re
text = """ 2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.
3.1 [viii] Articles of Incorporation of Company, as amended.
3.2 [viii] Bylaws of Company.
10.1 [I] Preferred Stock Purchase Agreement dated September 29, 1983,
together with amendments thereto dated February 28, 1984 and
10.2 [I] Form of Indemnification Agreement between Company and its
officers, directors and certain other key employees.
10.3 [I] Amendment to form of Indemnification Agreement.
10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991.
10.5 [vi] 1988 Employee Stock Purchase Plan, as amended October 1992.
10.6 [v] Amended and Restated 1992 Stock Option Plan."""

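# Each match starts at an item number and lazily extends up to (but not
# including) the next item number, or to the end of the string.
paragraphs = re.findall(r"\d{1,2}\.\d+.*?(?=\d{1,2}\.\d+|$)", text, re.S)

for paragraph in paragraphs:
    print(paragraph)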

This produces:

2.1 [ii] Agreement and Plan of Reorganization, by and among the Company,
Force Acq. Corp. and Force Computers, Inc. as amended.

3.1 [viii] Articles of Incorporation of Company, as amended.

3.2 [viii] Bylaws of Company.

10.1 [I] Preferred Stock Purchase Agreement dated September 29, 1983,
together with amendments thereto dated February 28, 1984 and

10.2 [I] Form of Indemnification Agreement between Company and its
officers, directors and certain other key employees.

10.3 [I] Amendment to form of Indemnification Agreement.

10.4 [iv] 1983 Incentive Stock Option Plan, as amended August 13, 1991.

10.5 [vi] 1988 Employee Stock Purchase Plan, as amended October 1992.

10.6 [v] Amended and Restated 1992 Stock Option Plan.

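The key is the ? after .*, which makes the quantifier lazy. The regex then matches as much as it has to, but no more. If you leave out the ?, the first match swallows the rest of the string.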
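The difference between the lazy and the greedy form can be seen on a small input. This is a minimal sketch; the three-line `sample` string is made up for illustration and is not part of the question's data.

```python
import re

# Made-up sample with three single-line items.
sample = "2.1 first item\n3.1 second item\n3.2 third item"

# Lazy: .*? grows one character at a time and stops as soon as the
# lookahead sees the next item number (or the end of the string).
lazy = re.findall(r"\d{1,2}\.\d+.*?(?=\d{1,2}\.\d+|$)", sample, re.S)

# Greedy: .* first consumes everything up to the end of the string, where
# the `$` alternative of the lookahead already succeeds, so the whole
# string comes back as a single match.
greedy = re.findall(r"\d{1,2}\.\d+.*(?=\d{1,2}\.\d+|$)", sample, re.S)

print(lazy)
print(greedy)
```

With the lazy form the list has one entry per item; with the greedy form it has a single entry containing the whole string.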

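(?=...) is a lookahead: it checks that the enclosed pattern follows without including it in the match, so each match stops just before the next item number. I hope this helps.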
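For the question's actual goal, deleting a whole cluster of numbered items, the lookahead idea might be extended as follows. This is only a sketch under assumptions that are mine, not the asker's: an item number is followed by a space or tab, continuation lines are non-blank and do not start with an item number, and a cluster has at least two items. The names `NUMBER`, `ITEM`, `BLOCK` and the `document` string are hypothetical. A negative lookahead, `(?!...)`, stands in for the variable-length negative lookbehind the question asks about.

```python
import re

# Hypothetical building blocks for this sketch.
# An item number at the start of a line, e.g. "2.1 " or "10.03 ".
NUMBER = r"[ \t]*\d{1,2}\.\d{1,2}[ \t]"

ITEM = (
    # First line of an item: begins with an item number.
    r"^" + NUMBER + r".*\n?"
    # Continuation lines: non-blank and, by negative lookahead,
    # NOT starting with an item number.
    + r"(?:^(?!" + NUMBER + r")[ \t]*\S.*\n?)*"
)

# A block is two or more items, each optionally followed by blank lines.
# re.M makes ^ match at the start of every line. Compiling with the flag
# also avoids passing flags positionally to re.sub, whose fourth
# positional parameter is `count`, not `flags`.
BLOCK = re.compile(r"(?:" + ITEM + r"(?:^[ \t]*\n)*){2,}", re.M)

document = (
    "Some regular sentence before the list.\n"
    "\n"
    "2.01 Acquisition Agreement dated as of March 26, 1997 by and between\n"
    "registrant and ISAR\n"
    "\n"
    "3.04 Registrant's Bylaws(2)\n"
    "3.05 Registrant's Amended and Restated Bylaws adopted prior to the\n"
    "closing of registrant's initial public offering(2)\n"
    "\n"
    "A regular sentence after the list.\n"
    "\n"
    "2.1 A lone heading that is not part of a cluster\n"
    "\n"
    "Body text under the heading.\n"
)

cleaned = BLOCK.sub("", document)
print(cleaned)
```

Anchoring the number at the start of a line also keeps a match from stopping at numbers inside a line, such as "Exhibit 3.01" in the second sample. One known limitation of this sketch: a regular paragraph that directly follows the last item without a blank line would be treated as a continuation line and removed as well.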

