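python - Regular expression for detecting company names in Python
Problem description
I want to detect company names with a regular expression in Python.
These are my assumptions:
- The company name should contain 1 to 3 words
- The first word of the company name should be capitalized
- A word in the company name can contain .com or .co (Amazon.com Inc)
- The last word of the company name (the fourth word) should be Inc., Ltd, GmbH, AG, Group, Holding, etc.
- Between the last word of the name and Inc., Ltd, GmbH, AG there can sometimes be ',' or ', '
I have tried something like this, but it does not work: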
import re
address_1 = 'I work in Amazon.com Inc.'
address_2 = 'Company named Swiss Medic Holding invested in vaccine'
address_3 = 'what do you think about Abercrombie & Fitch Co. ?'
address_4 = 'do you work in Delta Group?'
address_5 = 'I have worked in CocaCola Gmbh'
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'
found = re.search(regex_company, address_1)
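I want to print the detected company names. I used the same regex logic to find street addresses and it works well there, but not for company names. This is the regex I used:
regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"
Regex logic: number + 1-3 words + street/st/road/rd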
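To see where the company-name attempt falls short, the check below runs that pattern against each of the five sample sentences. It is a minimal sketch, assuming each sentence is searched on its own with re.search; the names `sentences`, `original` and `results` are illustrative only.

```python
import re

sentences = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

# The pattern from the question, unchanged apart from the raw-string prefix.
original = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'

results = []
for sentence in sentences:
    match = re.search(original, sentence)
    # group(0) is the whole match; None marks a sentence with no match.
    results.append(match.group(0) if match else None)

print(results)  # [None, None, None, 'Delta Group', None]
```

Only 'Delta Group' is found. In 'Amazon.com', the '.com' part sits between the word and the required space or hyphen; '&' is not a capitalized word; and Holding, Co and Gmbh are not in the suffix list.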
Solution
Use
\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b
See the regex proof.
Explanation
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
co 'co'
--------------------------------------------------------------------------------
m? 'm' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture (between 0 and 2
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
[ -]+ any character of: ' ', '-' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
& '&'
--------------------------------------------------------------------------------
[ -]+ any character of: ' ', '-' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
co 'co'
--------------------------------------------------------------------------------
m? 'm' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
){0,2} end of grouping
--------------------------------------------------------------------------------
[,\s]+ any character of: ',', whitespace (\n, \r,
\t, \f, and " ") (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?i: group, but do not capture (case-
insensitive) (with ^ and $ matching
normally) (with . not matching \n)
(matching whitespace and # normally):
--------------------------------------------------------------------------------
ltd 'ltd'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
llc 'llc'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
inc 'inc'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
plc 'plc'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
co 'co'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
rp 'rp'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
group 'group'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
holding 'holding'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
gmbh 'gmbh'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
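The same breakdown can be kept next to the pattern itself by writing it in verbose mode. This is a sketch of the pattern above split across commented lines with re.VERBOSE; the names `COMPANY_RE` and `sample` are assumptions introduced for illustration. In verbose mode whitespace is ignored outside character classes, so the space inside [ -] still counts.

```python
import re

# Same pattern as above, split into commented parts with re.VERBOSE.
COMPANY_RE = re.compile(
    r"""
    \b[A-Z]\w+(?:\.com?)?        # first word: capitalized, optional .co or .com
    (?:                          # up to two further words
        [ -]+                    #   separated by spaces or hyphens
        (?:&[ -]+)?              #   optional ampersand before the word
        [A-Z]\w+(?:\.com?)?      #   capitalized word, optional .co or .com
    ){0,2}
    [,\s]+                       # comma and/or whitespace before the suffix
    (?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)   # suffix, case-insensitive
    \b
    """,
    re.VERBOSE,
)

sample = "what do you think about Abercrombie & Fitch Co. ?"
print(COMPANY_RE.findall(sample))  # ['Abercrombie & Fitch Co']
```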
Python code:
import re
regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"
test_str = ("I work in Amazon.com Inc.\n"
"Company named Swiss Medic Holding invested in vaccine\n"
"what do you think about Abercrombie & Fitch Co. ?\n"
"do you work in Delta Group?\n"
"I have worked in CocaCola Gmbh")
print(re.findall(regex, test_str))
Results: ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
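The suffix list in the pattern is fixed, and some forms named in the question, such as AG, are not in it. One possible way to extend it is to build the same pattern from a list. The helper `build_company_regex` and the `SUFFIXES` list below are hypothetical, not part of the answer above, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical suffix list; extend it with whatever legal forms are needed.
SUFFIXES = ["ltd", "llc", "inc", "plc", "co", "corp",
            "group", "holding", "gmbh", "ag"]

def build_company_regex(suffixes):
    """Build the answer's pattern with a configurable suffix alternation."""
    word = r"[A-Z]\w+(?:\.com?)?"  # capitalized word, optional .co or .com
    alternation = "|".join(re.escape(s) for s in suffixes)
    # {{0,2}} is an escaped brace pair in the f-string and becomes {0,2}.
    return re.compile(
        rf"\b{word}(?:[ -]+(?:&[ -]+)?{word}){{0,2}}[,\s]+(?i:{alternation})\b"
    )

company_re = build_company_regex(SUFFIXES)
found = company_re.findall("Siemens AG and Acme, Inc. merged")
print(found)  # ['Siemens AG', 'Acme, Inc']
```

The second match also shows the comma case from the question: [,\s]+ accepts ', ' between the name and the suffix.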