Regex similar to a Hearst Pattern in Python

Problem description

I'm trying to come up with a regex similar to the ones listed here for Hearst Patterns, to get the results described below from sentences like these:

NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).
NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).

Running re.search(regex, sentence) on each of these sentences, I want to capture these 2 groups: NP_The_Eleventh_Air_Force and NP_a_Numbered_Air_Force

This is my attempt but it doesn't get any matches:

(NP_\\w+ (, )?is (NP_\\w+ ?))

Tags: python, regex, python-3.x

Solution


Neither sentence contains the ", " that (, )? optionally matches, but the second sentence has a part in parentheses before is, so you could make that part optional instead.

Also move the closing parenthesis from the final )) to directly after the first NP_\w+, giving (NP_\w+), so that it forms the first capture group.

The pattern including the optional comma and space could be:

(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)
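As a small sketch of how this pattern behaves, the snippet below applies it to the second sentence from the question. The variable names are only illustrative. repr() is used so that the optional trailing space captured by group 2 is visible.

```python
import re

# Pattern from the answer, keeping the optional ", " and the optional trailing space.
pattern = re.compile(r"(NP_\w+)(?: \([^()]+\))? (?:, )?is (NP_\w+ ?)")

# Second sentence from the question: it has "(NP_11_AF)" before "is".
sentence = (
    "NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force "
    "of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
)

match = pattern.search(sentence)
if match:
    # repr() shows the quotes, so the trailing space in group 2 can be seen.
    print(repr(match.group(1)))
    print(repr(match.group(2)))
```

Here group 2 comes back with a trailing space because of the " ?" inside the second group.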


If you don't need the space at the end and the comma followed by a space is not present, your pattern could be:

(NP_\w+)(?: \([^()]+\))? is (NP_\w+)
  • (NP_\w+) Capture group 1: match NP_ followed by 1+ word characters
  • (?: \([^()]+\))? Optionally match a space followed by a part in parentheses
  • " is " Match literally, including the space on each side
  • (NP_\w+) Capture group 2: match NP_ followed by 1+ word characters


For example

import re

regex = r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)"
test_str = "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF)."
matches = re.search(regex, test_str)

if matches:
    print(matches.group(1))
    print(matches.group(2))

Output

NP_The_Eleventh_Air_Force
NP_a_Numbered_Air_Force
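To check that the same pattern also handles the second sentence from the question, the one with (NP_11_AF) before is, here is a minimal sketch that runs it over both sentences. The names IS_A_PATTERN and extract_pair are hypothetical helpers introduced only for this illustration.

```python
import re

# Same pattern as in the example above; the name is only illustrative.
IS_A_PATTERN = re.compile(r"(NP_\w+)(?: \([^()]+\))? is (NP_\w+)")


def extract_pair(sentence):
    """Return (group 1, group 2) of the first match, or None if nothing matches."""
    match = IS_A_PATTERN.search(sentence)
    return match.groups() if match else None


# The two sentences from the question.
sentences = [
    "NP_The_Eleventh_Air_Force is NP_a_Numbered_Air_Force "
    "of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
    "NP_The_Eleventh_Air_Force (NP_11_AF) is NP_a_Numbered_Air_Force "
    "of NP_the_United_States_Air_Force_Pacific_Air_Forces (NP_PACAF).",
]

for sentence in sentences:
    print(extract_pair(sentence))
```

Both sentences should give the same pair, since the parenthesized part is skipped by the optional non-capturing group.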
