Python regex: get everything up to a specific string

Question

I have the following string:

This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test

I need to extract everything up to this combination of lines:

From: *
Sent: *
To: *
Subject: *

Here * acts as a wildcard.

So my result should be:

This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

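I would like to filter this with a regular expression, but I can't figure it out. Any pointers?

This is the regex pattern I tried on regex101, but for some reason it does not work in my Python script: r"([\w\W\n]+?)\n((?:from:[^\n]+)\n+((?:\s*sent:[^\n]+)\n+(?:\s*to:[^\n]+)\n*(?:\s*cc:[^\n]+)*\n*(?:\s*bcc:[^\n]+)*\n*(?:\s*subject:[^\n]+)*))"

Thanks!

Tags: python, regex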

Answer


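You can try re.findall with a positive lookahead. The approach is to match everything from the start of the string up to, but not including, the block of text where matching should stop.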

import re

inp = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

stop_text = """From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""
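# re.escape keeps characters such as "." in stop_text from acting as regex metacharacters
matches = re.findall(r'^.*?(?=' + re.escape(stop_text) + ')', inp, flags=re.DOTALL)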
print(matches)

This prints:

['This is the most recent email of this thread\n\nMore text\n\nFrom: a@a.com\nDate: 13 August, 2018\n\nMore text...\n\n']
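
The snippet above stops at one fixed header block, whereas the question asks for wildcards after each field name. A minimal sketch of that variant follows. It assumes the four header lines appear consecutively and each start at the beginning of a line. The names HEADER and text_before_header are illustrative and not part of the original answer.

```python
import re

text = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

# Four consecutive header lines; [^\n]* plays the role of the "*" wildcard.
# re.MULTILINE lets ^ match at the start of any line, not only the string start.
HEADER = re.compile(
    r"^From:[^\n]*\n"
    r"Sent:[^\n]*\n"
    r"To:[^\n]*\n"
    r"Subject:[^\n]*",
    flags=re.MULTILINE,
)

def text_before_header(s):
    """Return everything before the first From/Sent/To/Subject block."""
    m = HEADER.search(s)
    # If no such block exists, return the input unchanged.
    return s if m is None else s[:m.start()]

result = text_before_header(text)
print(result)
```

The first "From:" line is skipped because it is followed by "Date:" rather than "Sent:", so the cut happens only at the full four-line block. Slicing at m.start() avoids the lookahead altogether, which may be easier to read than a lazy match.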

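As for why the pattern from the question behaves differently in Python: one plausible cause, stated here as an assumption since the script itself is not shown, is that regex101 had the case-insensitive flag enabled. The pattern spells the field names in lowercase ("from:", "sent:") while the text uses "From:" and "Sent:". The sketch below shows the same pattern failing without re.IGNORECASE and matching with it.

```python
import re

text = """This is the most recent email of this thread

More text

From: a@a.com
Date: 13 August, 2018

More text...

From: a@a.com
Sent: Tuesday 23 July
To: b@b.com, c@c.com
Subject: Test"""

# The pattern from the question, unchanged.
pattern = (
    r"([\w\W\n]+?)\n((?:from:[^\n]+)\n+((?:\s*sent:[^\n]+)\n+"
    r"(?:\s*to:[^\n]+)\n*(?:\s*cc:[^\n]+)*\n*(?:\s*bcc:[^\n]+)*\n*"
    r"(?:\s*subject:[^\n]+)*))"
)

# Case-sensitive search: lowercase "from:" never matches "From:".
no_flag = re.search(pattern, text)

# Case-insensitive search: group 1 holds the text before the header block.
with_flag = re.search(pattern, text, flags=re.IGNORECASE)

print(no_flag)
print(repr(with_flag.group(1)))
```

If this is indeed the cause, passing flags=re.IGNORECASE (or prefixing the pattern with (?i)) should bring the script in line with what regex101 showed.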