regex - How can I use a regular expression to extract the actual quote and the author from a quotation?
Problem description
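I am scraping quotations from Twitter, and I want to separate the actual quote from its author in each one.
How can I do that when the tweets are not formatted consistently?
I am new to regular expressions, but here is my best attempt on regex101: https://regex101.com/r/m3WtmX/5
My code is below. I expected every loop iteration to print an sre.SRE_Match object, but the last one prints None.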
import re
QUOTE_PATTERN = re.compile(r'^(?P<actual_quote>.*)\s+?-\s*(?P<author>.*)$')
# actual_quote is separated from author by space and dash
format_1 = "Any form of exercise, if pursued continuously, will help train us in perseverance -Mao Tse-Tung"
# separated by one space, dash and another space
format_2 = "Any form of exercise, if pursued continuously, will help train us in perseverance - Mao Tse-Tung"
# actual_quote is surrounded with double quotes character and
# is separated from author by space, dash and another space
format_3 = '"Any form of exercise, if pursued continuously, will help train us in perseverance" - Mao Tse-Tung'
# separated only with dash (no space)
format_4 = "Any form of exercise, if pursued continuously, will help train us in perseverance-Mao Tse-Tung"
for format in [format_1, format_2, format_3, format_4]:
    print(QUOTE_PATTERN.match(format))
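Solution
This is genuinely tricky because the data has no regular structure.
Your pattern requires at least one whitespace character before the dash (\s+?), which format_4 does not have, so the last match returns None.
Capturing everything before the first dash non-greedily in the first group works for the quotes you provided.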
^(?P<actual_quote>.*?)-(?P<author>.*)$
https://regex101.com/r/rcGzzK/2
If you don't want the captured groups to include the extra whitespace around the dash:
^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$
https://regex101.com/r/rcGzzK/3
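As a rough sketch, the second pattern can be applied to the four sample formats from the question like this. The helper name split_quote is hypothetical, and stripping the surrounding double quotes is an extra step that is not part of the regex above.

```python
import re

# Second pattern from the answer: lazy quote group, optional whitespace
# around the first dash.
QUOTE_PATTERN = re.compile(r'^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$')

def split_quote(tweet):
    """Return (quote, author), or None when the tweet contains no dash.

    split_quote is a hypothetical helper name used for illustration.
    """
    match = QUOTE_PATTERN.match(tweet)
    if match is None:
        return None
    # Assumption: surrounding double quotes are not wanted in the result.
    quote = match.group('actual_quote').strip('"')
    return quote, match.group('author')

samples = [
    "Any form of exercise, if pursued continuously, will help train us in perseverance -Mao Tse-Tung",
    "Any form of exercise, if pursued continuously, will help train us in perseverance - Mao Tse-Tung",
    '"Any form of exercise, if pursued continuously, will help train us in perseverance" - Mao Tse-Tung',
    "Any form of exercise, if pursued continuously, will help train us in perseverance-Mao Tse-Tung",
]

for sample in samples:
    print(split_quote(sample))
```

Because the quote group is lazy, the split happens at the first dash, so the dash inside "Tse-Tung" stays in the author group for all four samples.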
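Unfortunately, if the quote itself contains a dash, the regexes above will not work, because the non-greedy group stops at the first dash it finds.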
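To illustrate the limitation, here is a minimal sketch using a made-up sample tweet with a hyphenated word in the quote. It also shows one possible heuristic, which is an assumption and not part of the answer: try the asker's original pattern first (the separator dash must be preceded by whitespace), and fall back to the first-dash pattern only when that fails. The name split_quote_with_fallback is hypothetical.

```python
import re

# The asker's original pattern: greedy quote, dash must follow whitespace.
SPACED_DASH = re.compile(r'^(?P<actual_quote>.*)\s+-\s*(?P<author>.*)$')
# The answer's pattern: lazy quote, split at the first dash.
FIRST_DASH = re.compile(r'^(?P<actual_quote>.*?)\s*-\s*(?P<author>.*)$')

def split_quote_with_fallback(tweet):
    """Heuristic (an assumption): prefer a whitespace-preceded dash."""
    match = SPACED_DASH.match(tweet) or FIRST_DASH.match(tweet)
    if match is None:
        return None
    return match.group('actual_quote').strip(), match.group('author').strip()

# Made-up sample, not a real tweet.
hyphenated = "A well-known saying - Some Author"

# The first-dash pattern splits too early, inside "well-known".
print(FIRST_DASH.match(hyphenated).group('actual_quote'))

# The fallback heuristic splits at the whitespace-preceded dash instead.
print(split_quote_with_fallback(hyphenated))
```

This heuristic still fails when the quote contains a dash and the separator has no whitespace before it, so it reduces the problem without removing it.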