Regular expression for detecting company names in Python


I want to detect company names with a regular expression in Python.


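  1. A company name should contain 1 to 3 words
  2. The first word of the company name should be capitalized
  3. One of the words in the name may contain .com or .co (Amazon.com Inc)
  4. The last word of the company name (the fourth word) should be a suffix such as Inc., Ltd, GmbH, AG, Group, Holding, etc.
  5. Between the last word of the name and Inc., Ltd, GmbH, AG there may sometimes be ',' or ', '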


import re

address_1 = 'I work in Amazon.com Inc.'
address_2 = 'Company named Swiss Medic Holding invested in vaccine'
address_3 = 'what do you think about Abercrombie & Fitch Co. ?'
address_4 = 'do you work in Delta Group?'
address_5 = 'I have worked in CocaCola Gmbh'

# Raw string, so that \w reaches the regex engine unchanged
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'
found = re.search(regex_company, address_1)  # try each address_N in turn
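To see where this attempt falls short, here is a minimal sketch that runs the pattern from the question over all five example sentences. The list name `addresses` and the loop are only for illustration and are not part of the original post.

```python
import re

# The pattern from the question, written as a raw string.
regex_company = r'([A-Z][\w]+[ -]+){1,3}(Ltd|ltd|LTD|llc|LLC|Inc|inc|INC|plc|Corp|Group)'

addresses = [
    'I work in Amazon.com Inc.',
    'Company named Swiss Medic Holding invested in vaccine',
    'what do you think about Abercrombie & Fitch Co. ?',
    'do you work in Delta Group?',
    'I have worked in CocaCola Gmbh',
]

results = []
for text in addresses:
    m = re.search(regex_company, text)
    # Store the matched text, or None when nothing was found.
    results.append(m.group(0) if m else None)

print(results)
# [None, None, None, 'Delta Group', None]
```

Only 'Delta Group' is found: the pattern does not allow '.com' or '&' inside the name, and 'Holding', 'Co' and 'Gmbh' are missing from the list of suffixes.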


regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"

Regex logic: number + 1-3 words + street/st/road/rd
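As a quick check of the number + words + suffix logic, the street pattern can be tried on a couple of sentences. The sample sentences below are made up for illustration and do not come from the original post.

```python
import re

regex_street = r"(\d{0,6})(?:\w)\s([A-Z][\w]+[ -]+){1,3}(Street|St|Road|Rd)"

# Hypothetical sample sentences.
samples = [
    "I live at 221B Baker Street",
    "Send it to 10 Downing Street please",
]

street_hits = []
for text in samples:
    m = re.search(regex_street, text)
    # group(0) is the whole matched substring.
    street_hits.append(m.group(0) if m else None)

print(street_hits)
# ['221B Baker Street', '10 Downing Street']
```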

Tags: python, regex



\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b



  \b                       the boundary between a word char (\w) and
                           something that is not a word char
  [A-Z]                    any character of: 'A' to 'Z'
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
    \.                       '.'
    co                       'co'
    m?                       'm' (optional (matching the most amount
                             possible))
  )?                       end of grouping
  (?:                      group, but do not capture (between 0 and 2
                           times (matching the most amount
                           possible)):
    [ -]+                    any character of: ' ', '-' (1 or more
                             times (matching the most amount
                             possible))
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
      &                        '&'
      [ -]+                    any character of: ' ', '-' (1 or more
                               times (matching the most amount
                               possible))
    )?                       end of grouping
    [A-Z]                    any character of: 'A' to 'Z'
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
      \.                       '.'
      co                       'co'
      m?                       'm' (optional (matching the most
                               amount possible))
    )?                       end of grouping
  ){0,2}                   end of grouping
  [,\s]+                   any character of: ',', whitespace (\n, \r,
                           \t, \f, and " ") (1 or more times
                           (matching the most amount possible))
  (?i:                     group, but do not capture (case-
                           insensitive) (with ^ and $ matching
                           normally) (with . not matching \n)
                           (matching whitespace and # normally):
    ltd                      'ltd'
   |                        OR
    llc                      'llc'
   |                        OR
    inc                      'inc'
   |                        OR
    plc                      'plc'
   |                        OR
    co                       'co'
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
      rp                       'rp'
    )?                       end of grouping
   |                        OR
    group                    'group'
   |                        OR
    holding                  'holding'
   |                        OR
    gmbh                     'gmbh'
  )                        end of grouping
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
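The same breakdown can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace outside character classes and allows `#` comments. This is just one possible way to write it; the name `COMPANY_RE` is chosen here for illustration.

```python
import re

COMPANY_RE = re.compile(r"""
    \b[A-Z]\w+ (?:\.com?)?        # first word: capitalized, optional .com or .co
    (?:                           # up to two more words
        [ -]+ (?:&[ -]+)?         #   separator (spaces or hyphens), optional ampersand
        [A-Z]\w+ (?:\.com?)?      #   another capitalized word
    ){0,2}
    [,\s]+                        # commas and/or whitespace before the suffix
    (?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)   # suffix, case-insensitive
    \b
""", re.VERBOSE)

print(COMPANY_RE.findall("what do you think about Abercrombie & Fitch Co. ?"))
# ['Abercrombie & Fitch Co']
```

The space inside `[ -]` is kept because whitespace inside a character class is not stripped in verbose mode.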


import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

test_str = ("I work in Amazon.com Inc.\n"
    "Company named Swiss Medic Holding invested in vaccine\n"
    "what do you think about Abercrombie & Fitch Co. ?\n"
    "do you work in Delta Group?\n"
    "I have worked in CocaCola Gmbh")

print(re.findall(regex, test_str))

Result: ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
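Two more checks may be useful, sketched here with made-up sentences that are not part of the original post: one for the comma case from requirement 5, and one showing that any capitalized word directly before the name is treated as part of it.

```python
import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

# Requirement 5: a comma between the name and the suffix.
with_comma = re.findall(regex, "We hired Acme Widgets, Inc. last year")
print(with_comma)
# ['Acme Widgets, Inc']

# Caveat: a capitalized sentence-initial word is absorbed into the name.
with_leading_word = re.findall(regex, "Yesterday Delta Group announced results")
print(with_leading_word)
# ['Yesterday Delta Group']
```

If the second behaviour is a problem, the candidates probably need filtering after matching, since the pattern alone cannot tell a capitalized ordinary word from part of a company name.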
