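python - Regex to match the company name in copyright notices across several cases
Problem description
I'm pressed for time and need to come up with a Python regex that matches the company name in many possibly different copyright notices, for example: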
Copyright © 2019 Apple Inc. All rights reserved.
© 2019 Quid, Inc. All Rights Reserved.
© 2009 Database Designs
© 2019 Rediker Software, All Rights Reserved
©2019 EVOSUS, INC. ALL RIGHTS RESERVED
© 2019 Walmart. All Rights Reserved.
© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.
Copyright © 1978-2019 Berkshire Hathaway Inc.
© 2019 McKesson Corporation
© 2019 UnitedHealth Group. All rights reserved.
© Copyright 1999 - 2019 CVS Health
Copyright 2019 General Motors. All Rights Reserved.
© 2019 Ford Motor Company
©2019 AT&T Intellectual Property. All rights reserved.
© 2019 GENERAL ELECTRIC
Copyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.
© 2019 Verizon
© 2019 Fannie Mae
Copyright © 2018 Jonas Construction Software Inc. All rights reserved.
All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved
© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121
© 2019 JPMorgan Chase & Co.
Copyright © 1995 - 2018 Boeing. All Rights Reserved.
© 2019 Bank of America Corporation. All rights reserved.
© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801
©2019 Cardinal Health. All rights reserved.
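I only know very basic regex, which is not enough right now to come up with a good solution quickly.
As I see it, at least for these examples, the requirements for capturing the company name correctly are: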
- If there's a '©' or 'Copyright' in the sentence:
  - After '©' or 'Copyright', look for a year, e.g. '2019', or a year range, e.g. '1995 - 2018' or '2003-2019' (the optional spaces around the hyphen must be handled as well):
    - If there's a dot somewhere after this year/year range, capture the text until the dot. E.g. in 'Copyright © 1978-2019 Berkshire Hathaway Inc.' capture 'Berkshire Hathaway Inc'
    - If there's no dot but there's the sentence 'All rights reserved', capture from the year/year range until there and also ignore any non-alphanumeric characters that precede it, such as spaces and commas. E.g. from '© 2019 Rediker Software, All Rights Reserved' capture 'Rediker Software'
    - If there's neither a dot nor the sentence 'All rights reserved', capture from the year/year range until the end. E.g. from '© 2019 Verizon' capture 'Verizon'
Any suggestions for a good regex for this?
Solution
You may consider using the regex
(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)
The match must be case-insensitive: either keep the inline (?i) modifier shown above, or drop it and pass the re.I flag instead.
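As a quick check of the case-insensitivity note, the sketch below runs the pattern above on one of the question's sample lines, once with the inline (?i) modifier and once with re.I passed as a flag. The variable names are only illustrative.

```python
import re

# The pattern from the answer, including the inline (?i) modifier.
PATTERN_INLINE = (
    r"(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)"
)
# The same pattern with the inline modifier removed; re.I is passed instead.
PATTERN_PLAIN = PATTERN_INLINE[len("(?i)"):]

line = "©2019 EVOSUS, INC. ALL RIGHTS RESERVED"

name_inline = re.search(PATTERN_INLINE, line).group(1)
name_flag = re.search(PATTERN_PLAIN, line, re.I).group(1)

print(name_inline)  # EVOSUS, INC
print(name_flag)    # EVOSUS, INC
```

Both forms should behave the same; which one to use is a matter of taste.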
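Details
- (?:©(?:\s*Copyright)?|Copyright(?:\s*©)?) - either of:
  - ©(?:\s*Copyright)? - a © char followed by an optional substring of 0+ whitespaces and then Copyright
  - | - or
  - Copyright(?:\s*©)? - Copyright followed by an optional substring of 0+ whitespaces and a © char
- \s* - 0+ whitespaces
- \d+ - 1+ digits (use \d{4} if the year always contains 4 digits)
- (?:\s*-\s*\d+)? - an optional sequence: a - enclosed in 0+ whitespaces, then 1+ digits (use \d{4} if the year always contains 4 digits)
- \s* - 0+ whitespaces
- (.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*) - capturing group 1, any of the alternatives:
  - .*?(?=\W*All\s+rights\s+reserved) - any 0+ chars other than line break chars, as few as possible, up to 0+ non-word chars followed by the All rights reserved string
  - [^.]*(?=\.) - any 0+ chars other than ., as many as possible, up to but excluding a .
  - .* - the rest of the line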
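The breakdown above can also be written directly into the pattern with re.VERBOSE, which lets whitespace and comments sit inside the regex. This is only a sketch of the same expression: the name COPYRIGHT_RX is made up here, and it uses [^.\n] in the second alternative (as the test code does) so that it cannot run across lines.

```python
import re

COPYRIGHT_RX = re.compile(
    r"""
    (?: © (?:\s*Copyright)?        # a © char, optionally followed by Copyright
      | Copyright (?:\s*©)?        # or Copyright, optionally followed by ©
    )
    \s* \d+                        # the year (use \d{4} for 4-digit years only)
    (?: \s*-\s* \d+ )?             # optional "- year" part of a year range
    \s*
    (                              # group 1: the company name
        .*? (?= \W* All \s+ rights \s+ reserved )   # up to "All rights reserved"
      | [^.\n]* (?= \. )                            # or up to the first dot
      | .*                                          # or the rest of the line
    )
    """,
    re.I | re.X,
)

for line in (
    "© Copyright 1999 - 2019 CVS Health",
    "Copyright © 1978-2019 Berkshire Hathaway Inc.",
    "© 2019 Rediker Software, All Rights Reserved",
):
    print(COPYRIGHT_RX.search(line).group(1))
```

The three sample lines exercise the rest-of-line, up-to-the-dot, and "All rights reserved" alternatives respectively.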
import re
s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved."
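# [^.\n] (rather than [^.]) keeps the "up to the dot" alternative on one line,
# since s holds many notices separated by line breaks.
rx = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)"
# findall returns the contents of capturing group 1 for every match
for m in re.findall(rx, s, re.I):
    print(m)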
Output:
Apple Inc
Quid, Inc
Database Designs
Rediker Software
EVOSUS, INC
Walmart
Exxon Mobil Corporation
Berkshire Hathaway Inc
McKesson Corporation
UnitedHealth Group
CVS Health
General Motors
Ford Motor Company
AT&T Intellectual Property
GENERAL ELECTRIC
AmerisourceBergen Corporation
Verizon
Fannie Mae
Jonas Construction Software Inc
Kroger | The Kroger Co
Express Scripts Holding Company
JPMorgan Chase & Co
Boeing
Bank of America Corporation
Wells Fargo
Cardinal Health
Quid, Inc
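Two details of this output are easy to miss. The last test line, '© 2019 Quid, Inc All Rights Reserved.', shows why the "All rights reserved" alternative is tried first: the up-to-the-dot alternative alone would capture 'Quid, Inc All Rights Reserved'. Also, names captured by the final .* alternative can keep trailing whitespace (including a \r when lines end in \r\n), because . excludes only \n by default. A minimal per-line wrapper, assuming a hypothetical helper named company_name, might look like this:

```python
import re

RX = re.compile(
    r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)",
    re.I,
)

def company_name(line):
    """Return the captured company name, or None if no notice is found."""
    m = RX.search(line)
    # strip() removes trailing spaces or a stray \r left by the .* alternative
    return m.group(1).strip() if m else None

samples = [
    "© 2009 Database Designs \r",
    "© 2019 Quid, Inc All Rights Reserved.",
    "© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801",
    "Terms of use",  # made-up line without a copyright notice
]
for line in samples:
    print(company_name(line))
```

Returning None for lines without a notice is a design choice for this sketch; re.findall over the whole text, as in the code above, silently skips such lines instead.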