Extracting sections and subsections from a list using regular expressions

Problem description

I have a list that I extracted from a text file with pdfplumber:

data=['1.1  SUMMARY \n', ' \n', 'A.  Furnish and install: \n', '1.  Soffit support framing. \n', '2.  Universal grid system. \n', '3.  Steel mesh infill. \n', ' \n', 'B.  Perform all drilling and cutting in miscellaneous metal items required for the \n', 'attachment of other items. \n', ' \n', 'C.  Perform all shop painting for all surfaces of exposed to view galvanized and non-\n', 'galvanized metals, and post-erection touch-up of shop prime coat, using the same \n', 'material as shop- prime coating. \n', ' \n', 'D.  Perform application of liquid zinc touch-up to all welds of galvanized steel items \n', 'furnished hereunder. \n', ' \n']

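I want to extract A, B, C and D (the sections) from the list, then the corresponding subsections (if any, such as 1, 2, 3), along with some mapping that tells me which sections have subsections.

The logic is that whenever ' \n' appears as a list element, the next element is always a section. For subsections there is no fixed pattern, but they most likely start with a number, as you can see. I want the output without the newline \n.

For example, a list of sections: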

['A.  Furnish and install:','B.  Perform all drilling and cutting in miscellaneous metal items required for the attachment of other items.','C.  Perform all shop painting for all surfaces of exposed to view galvanized and non-galvanized metals, and post-erection touch-up of shop prime coat, using the same material as shop- prime coating. ','D.  Perform application of liquid zinc touch-up to all welds of galvanized steel items furnished hereunder. ']

And a list of subsections:

['1.  Soffit support framing.', '2.  Universal grid system.', '3.  Steel mesh infill.']

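Any kind of mapping that tells me which subsection belongs to which section will do. (In this case, only the first section has subsections.)

So far I have tried splitting with re.split("\n[\s]+\n",data), which gives me this result: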

['1.1  SUMMARY ', ' A.  Furnish and install: \n 1.  Soffit support framing. \n 2.  Universal grid system. \n 3.  Steel mesh infill. ', ' B.  Perform all drilling and cutting in miscellaneous metal items required for the \n attachment of other items. ', ' C.  Perform all shop painting for all surfaces of exposed to view galvanized and non-\n galvanized metals, and post-erection touch-up of shop prime coat, using the same \n material as shop- prime coating. ', ' D.  Perform application of liquid zinc touch-up to all welds of galvanized steel items \n furnished hereunder. ', '']

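But this has two drawbacks. The first is that \n is present in every section, including those without any subsections; if we start removing \n, we can no longer tell whether a section has subsections.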

Tags: python, regex, list, string-matching

Solution

Use:

import re
data=['1.1  SUMMARY \n', ' \n', 'A.  Furnish and install: \n', '1.  Soffit support framing. \n', '2.  Universal grid system. \n', '3.  Steel mesh infill. \n', ' \n', 'B.  Perform all drilling and cutting in miscellaneous metal items required for the \n', 'attachment of other items. \n', ' \n', 'C.  Perform all shop painting for all surfaces of exposed to view galvanized and non-\n', 'galvanized metals, and post-erection touch-up of shop prime coat, using the same \n', 'material as shop- prime coating. \n', ' \n', 'D.  Perform application of liquid zinc touch-up to all welds of galvanized steel items \n', 'furnished hereunder. \n', ' \n']
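# Join the items, then collapse every whitespace run containing a newline into a single "\n"
text_from_data = re.sub(r"\s*\n\s*", r"\n", "\n".join(data))
# Section: a capital-letter label such as "A. " plus any continuation lines
# Subsections: the numbered lines ("1. ...") that immediately follow the section
regex = r"(?m)(?P<Section>[A-Z]+\. .*(?:\n(?!\d|[A-Z]+\.).*)*)(?P<Subsections>(?:\n\d+\..*)*)"
matches = re.finditer(regex, text_from_data)
for match in matches:
    # Section text; a section wrapped over several lines still contains "\n" here
    print(match.group("Section").strip())
    # List of subsection lines; an empty list when the section has none
    print(match.group("Subsections").strip().splitlines())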

See the Python proof.
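The loop above only prints each match, and a section that wraps across lines keeps its inner "\n". One possible way to turn the same matches into the mapping the question asks for is sketched below. The helper names unwrap and build_mapping and the shortened sample list are hypothetical, not part of the original answer. Gluing a line-ending hyphen directly to the next line (so that "non-\ngalvanized" becomes "non-galvanized") is an assumption about how the PDF text was wrapped.

```python
import re

# Hypothetical shortened sample in the same shape as the question's data.
sample = [
    "1.1  SUMMARY \n", " \n",
    "A.  Furnish and install: \n",
    "1.  Soffit support framing. \n",
    "2.  Universal grid system. \n",
    " \n",
    "B.  Perform all shop painting for galvanized and non-\n",
    "galvanized metals. \n",
    " \n",
]

# Same pattern as in the answer.
REGEX = r"(?m)(?P<Section>[A-Z]+\. .*(?:\n(?!\d|[A-Z]+\.).*)*)(?P<Subsections>(?:\n\d+\..*)*)"


def unwrap(text):
    """Join wrapped lines: glue a line-ending hyphen, otherwise use one space."""
    text = re.sub(r"-\n", "-", text.strip())  # assumption: "-\n" marks a split word
    return text.replace("\n", " ")


def build_mapping(items):
    """Return {section text: [subsection, ...]} in document order."""
    text = re.sub(r"\s*\n\s*", "\n", "\n".join(items))
    return {
        unwrap(m.group("Section")): m.group("Subsections").strip().splitlines()
        for m in re.finditer(REGEX, text)
    }


mapping = build_mapping(sample)
for section, subsections in mapping.items():
    print(section, "->", subsections)
```

A section with an empty list has no subsections, which answers the "does this section have subsections" question without relying on leftover "\n" characters. Keying the dict by section text assumes no two sections share identical text.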

re.sub(r"\s*\n\s*", r"\n", "\n".join(data)) joins the text into a single string, with a single newline between the data items.
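To see what this normalization does, here is a small sketch on a hypothetical three-item sample (not the full data). Every run of whitespace that contains a newline, including the blank ' \n' items, collapses to a single "\n".

```python
import re

# Hypothetical three-item sample in the same shape as the question's data.
sample = ["A.  Furnish and install: \n", " \n", "B.  Perform all drilling \n"]

joined = "\n".join(sample)                      # items separated by "\n"
normalized = re.sub(r"\s*\n\s*", "\n", joined)  # squeeze whitespace around newlines

print(repr(joined))
print(repr(normalized))
```

Here normalized is 'A.  Furnish and install:\nB.  Perform all drilling\n'. The blank separator items disappear in this step, so section boundaries are afterwards recognized by the "A. " style labels rather than by the blank lines.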

The pattern (?m)(?P<Section>[A-Z]+\. .*(?:\n(?!\d|[A-Z]+\.).*)*)(?P<Subsections>(?:\n\d+\..*)*) does the actual matching. Explanation:

--------------------------------------------------------------------------------
  (?m)                     set flags for this block (with ^ and $
                           matching start and end of line) (case-
                           sensitive) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  (?P<Section>               group and capture to "Section" group:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
                             ' '
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        \d                       digits (0-9)
--------------------------------------------------------------------------------
       |                        OR
--------------------------------------------------------------------------------
        [A-Z]+                   any character of: 'A' to 'Z' (1 or
                                 more times (matching the most amount
                                 possible))
--------------------------------------------------------------------------------
        \.                       '.'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of "Section" group
--------------------------------------------------------------------------------
  (?P<Subsections>         group and capture to "Subsections" group:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \n                       '\n' (newline)
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of "Subsections" group
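
The breakdown above can also be kept next to the code by writing the pattern with re.VERBOSE. This is a sketch, not part of the original answer, and the names SECTION_RE, text and results are hypothetical. In verbose mode unescaped whitespace is ignored, so the literal space after the dot is written as [ ]. The (?m) flag is left out here because the pattern contains no ^ or $ for it to affect.

```python
import re

SECTION_RE = re.compile(
    r"""
    (?P<Section>
        [A-Z]+ \. [ ] .*          # label such as "A. " and the rest of its line
        (?:
            \n                    # a following line ...
            (?! \d | [A-Z]+ \. )  # ... that starts neither a subsection nor a section
            .*
        )*
    )
    (?P<Subsections>
        (?: \n \d+ \. .* )*       # zero or more numbered lines such as "1. ..."
    )
    """,
    re.VERBOSE,
)

# Hypothetical, already normalized text (one "\n" between lines).
text = (
    "A.  Furnish and install:\n"
    "1.  Soffit support framing.\n"
    "2.  Universal grid system.\n"
    "B.  Perform all drilling and\n"
    "cutting."
)

results = [
    (m.group("Section"), m.group("Subsections").strip().splitlines())
    for m in SECTION_RE.finditer(text)
]
for section, subsections in results:
    print(repr(section), subsections)
```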
