How can I use a regular expression to extract the actual quote and the author from a quotation?

Question

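I am scraping quotations from Twitter, and I want to separate the actual quote from its author in each one.

How can I do this when the tweets are not formatted consistently?

I am new to regular expressions, but this is my best attempt on regex101: https://regex101.com/r/m3WtmX/5

Below is my code. I expect every loop iteration to print an sre.SRE_Match object, but the last one prints None.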

import re

QUOTE_PATTERN = re.compile(r'^(?P<actual_quote>.*)\s+?-\s*(?P<author>.*)$')

# actual_quote is separated from author by space and dash
format_1 = "Any form of exercise, if pursued continuously, will help train us in perseverance -Mao Tse-Tung"

# separated by one space, dash and another space
format_2 = "Any form of exercise, if pursued continuously, will help train us in perseverance - Mao Tse-Tung"

# actual_quote is surrounded with double quotes character and
# is separated from author by space, dash and another space
format_3 = '"Any form of exercise, if pursued continuously, will help train us in perseverance" - Mao Tse-Tung'

# separated only with dash (no space)
format_4 = "Any form of exercise, if pursued continuously, will help train us in perseverance-Mao Tse-Tung"

for format in [format_1, format_2, format_3, format_4]:
    print(QUOTE_PATTERN.match(format))

Tags: regex, python-3.x

Answer


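This is genuinely tricky because the data has no regular structure.

Capturing every character before the dash into the first group, non-greedily, works with the quotations you provided.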

^(?P<actual_quote>.*?)-(?P<author>.*)$

https://regex101.com/r/rcGzzK/2

If you do not want the groups to include the extra whitespace around the dash:

^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$

https://regex101.com/r/rcGzzK/3
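
As a rough check, the pattern above can be run against the four sample formats from the question. This is only a sketch: the helper name split_quote is made up for illustration, and stripping the surrounding double quotes is an extra step that the pattern itself does not perform.

```python
import re

# Pattern from the answer: lazy quote group, optional whitespace around the first dash.
QUOTE_PATTERN = re.compile(r'^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$')


def split_quote(tweet):
    """Return (quote, author), or None when the tweet contains no dash.

    split_quote is a hypothetical helper, not part of the original answer.
    """
    match = QUOTE_PATTERN.match(tweet)
    if match is None:
        return None
    # Assumption: surrounding double quotes are not wanted in the result.
    quote = match.group('actual_quote').strip('"')
    return quote, match.group('author')


samples = [
    "Any form of exercise, if pursued continuously, will help train us in perseverance -Mao Tse-Tung",
    "Any form of exercise, if pursued continuously, will help train us in perseverance - Mao Tse-Tung",
    '"Any form of exercise, if pursued continuously, will help train us in perseverance" - Mao Tse-Tung',
    "Any form of exercise, if pursued continuously, will help train us in perseverance-Mao Tse-Tung",
]

for sample in samples:
    print(split_quote(sample))
```

All four samples should split into the same quote and the author Mao Tse-Tung, because the lazy group stops at the first dash and the hyphen in "Tse-Tung" comes after it.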

Unfortunately, if the quote itself contains any dashes, the regexes above will not work, because they split at the first dash.
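
One possible way to soften this limitation, offered only as a sketch and not as part of the original answer, is to first try a stricter pattern that splits at the last dash preceded by whitespace or by a closing double quote, and to fall back to the first-dash pattern only when that fails. The names STRICT_PATTERN, LOOSE_PATTERN, and split_quote are hypothetical, and the approach assumes the author separator is usually written with a space or a closing quote before the dash.

```python
import re

# Hypothetical stricter pattern: the greedy quote group makes the match split at the
# LAST dash that follows whitespace or a closing double quote.
STRICT_PATTERN = re.compile(r'^(?P<actual_quote>.+)(?:\s+|(?<="))-\s*(?P<author>.+)$')

# Fallback: the answer's pattern, which splits at the first dash.
LOOSE_PATTERN = re.compile(r'^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$')


def split_quote(tweet):
    """Return (quote, author), or None when the tweet contains no dash."""
    match = STRICT_PATTERN.match(tweet) or LOOSE_PATTERN.match(tweet)
    if match is None:
        return None
    # Trim leftover whitespace, then any surrounding double quotes.
    quote = match.group('actual_quote').strip().strip('"')
    return quote, match.group('author').strip()


print(split_quote("A well-known saying - Someone"))
print(split_quote('"Self-made" -Jane Doe'))
print(split_quote("A short quote-Mao Tse-Tung"))
```

This still fails when a hyphenated quote is joined to the author with no space at all (for example "well-known saying-Someone"), since nothing in the text then distinguishes the separator from an ordinary hyphen.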

