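python - Extracting sections and subsections from a list using regular expressions
Problem description
I have a list that I extracted from a text file via pdfplumber: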
data=['1.1 SUMMARY \n', ' \n', 'A. Furnish and install: \n', '1. Soffit support framing. \n', '2. Universal grid system. \n', '3. Steel mesh infill. \n', ' \n', 'B. Perform all drilling and cutting in miscellaneous metal items required for the \n', 'attachment of other items. \n', ' \n', 'C. Perform all shop painting for all surfaces of exposed to view galvanized and non-\n', 'galvanized metals, and post-erection touch-up of shop prime coat, using the same \n', 'material as shop- prime coating. \n', ' \n', 'D. Perform application of liquid zinc touch-up to all welds of galvanized steel items \n', 'furnished hereunder. \n', ' \n']
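I want to extract A, B, C, and D (the sections) from the list, followed by their corresponding subsections (if any, such as 1, 2, 3) and some mapping that tells me which sections have subsections.
The logic is that whenever ' \n' appears as a list element, the next element is always a section. For subsections there is no fixed pattern, but they will most likely start with a number, as you can see. I want the output without the newline \n.
For example, one list for the sections: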
['A. Furnish and install:','B. Perform all drilling and cutting in miscellaneous metal items required for the attachment of other items.','C. Perform all shop painting for all surfaces of exposed to view galvanized and non-galvanized metals, and post-erection touch-up of shop prime coat, using the same material as shop- prime coating. ','D. Perform application of liquid zinc touch-up to all welds of galvanized steel items furnished hereunder. ']
and one for the subsections:
['1. Soffit support framing.', '2. Universal grid system.', '3. Steel mesh infill.']
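Any kind of mapping that lets me know which subsection belongs to which section would do. (In this case, only the first section has subsections.)
So far I have tried splitting with re.split("\n[\s]+\n",data), which gives me this result: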
['1.1 SUMMARY ', ' A. Furnish and install: \n 1. Soffit support framing. \n 2. Universal grid system. \n 3. Steel mesh infill. ', ' B. Perform all drilling and cutting in miscellaneous metal items required for the \n attachment of other items. ', ' C. Perform all shop painting for all surfaces of exposed to view galvanized and non-\n galvanized metals, and post-erection touch-up of shop prime coat, using the same \n material as shop- prime coating. ', ' D. Perform application of liquid zinc touch-up to all welds of galvanized steel items \n furnished hereunder. ', '']
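But this has two drawbacks. The first is that \n is present in all the sections, including those without any subsections, and if we start removing \n, we will no longer know whether a section has any subsections.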
Solution
Use
import re
data=['1.1 SUMMARY \n', ' \n', 'A. Furnish and install: \n', '1. Soffit support framing. \n', '2. Universal grid system. \n', '3. Steel mesh infill. \n', ' \n', 'B. Perform all drilling and cutting in miscellaneous metal items required for the \n', 'attachment of other items. \n', ' \n', 'C. Perform all shop painting for all surfaces of exposed to view galvanized and non-\n', 'galvanized metals, and post-erection touch-up of shop prime coat, using the same \n', 'material as shop- prime coating. \n', ' \n', 'D. Perform application of liquid zinc touch-up to all welds of galvanized steel items \n', 'furnished hereunder. \n', ' \n']
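# Join the items, then collapse every run of whitespace that contains a newline into one "\n"
text_from_data = re.sub(r"\s*\n\s*", r"\n", "\n".join(data))
regex = r"(?m)(?P<Section>[A-Z]+\. .*(?:\n(?!\d|[A-Z]+\.).*)*)(?P<Subsections>(?:\n\d+\..*)*)"
matches = re.finditer(regex, text_from_data)
for match in matches:
    # Section text (it can still contain "\n" where the source line was wrapped)
    print(match.group("Section").strip())
    # List of subsection lines; empty when the section has none
    print(match.group("Subsections").strip().splitlines())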
See the Python proof.
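The loop above only prints the two groups. To get the mapping the question asks for, the same matches could be collected into a dictionary. This is a minimal sketch, not part of the original answer: `build_mapping` and `unwrap` are hypothetical helper names, and re-joining wrapped lines with a space (or with nothing after a trailing hyphen) is an assumption about how the PDF text was wrapped.

```python
import re

REGEX = r"(?m)(?P<Section>[A-Z]+\. .*(?:\n(?!\d|[A-Z]+\.).*)*)(?P<Subsections>(?:\n\d+\..*)*)"


def unwrap(block):
    """Join wrapped lines (assumption: no separator after a trailing hyphen, a space otherwise)."""
    out = ""
    for line in block.strip().splitlines():
        if out and not out.endswith("-"):
            out += " "
        out += line
    return out


def build_mapping(data):
    """Return {section text: [subsection texts]} (hypothetical helper)."""
    text = re.sub(r"\s*\n\s*", "\n", "\n".join(data))
    mapping = {}
    for m in re.finditer(REGEX, text):
        # An empty list means the section has no subsections
        mapping[unwrap(m.group("Section"))] = m.group("Subsections").strip().splitlines()
    return mapping


# Shortened stand-in for the list in the question
sample = ['1.1 SUMMARY \n', ' \n', 'A. Furnish and install: \n', '1. Soffit support framing. \n',
          '2. Universal grid system. \n', ' \n', 'B. Perform all drilling required for the \n',
          'attachment of other items. \n', ' \n', 'C. Paint galvanized and non-\n',
          'galvanized metals. \n', ' \n']

for section, subsections in build_mapping(sample).items():
    print(section, subsections)
```

A dictionary keeps the sections in the order they were found and makes "does this section have subsections?" a simple truthiness check on the value.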
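With re.sub(r"\s*\n\s*", r"\n", "\n".join(data)), the items are joined into a single string, and every run of whitespace around a newline is collapsed so that exactly one newline separates the data items.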
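To see what that normalization step does on its own, here is a small sketch with a shortened stand-in for the list from the question (the `items` values are illustrative, not the real data):

```python
import re

# Every item ends in " \n"; blank separator items are just " \n"
items = ['A. Furnish and install: \n', '1. Soffit support framing. \n', ' \n',
         'B. Perform drilling for the \n', 'attachment of other items. \n', ' \n']

joined = "\n".join(items)                        # items are now separated by extra newlines
normalized = re.sub(r"\s*\n\s*", "\n", joined)   # each whitespace run containing "\n" becomes one "\n"

print(repr(normalized))
```

Note that the blank separator items disappear in this step, so the "blank element comes before a section" cue from the question is gone afterwards; the pattern relies on the "A." and "1." prefixes instead.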
The pattern
(?m)(?P<Section>[A-Z]+\. .*(?:\n(?!\d|[A-Z]+\.).*)*)(?P<Subsections>(?:\n\d+\..*)*)
does the magic:
--------------------------------------------------------------------------------
  (?m)                     set flags for this block (with ^ and $
                           matching start and end of line)
                           (case-sensitive) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  (?P<Section>             group and capture to "Section" group:
--------------------------------------------------------------------------------
    [A-Z]+                 any character of: 'A' to 'Z' (1 or more
                           times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
    \.                     '.'
--------------------------------------------------------------------------------
                           ' '
--------------------------------------------------------------------------------
    .*                     any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                    group, but do not capture (0 or more
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
      \n                   '\n' (newline)
--------------------------------------------------------------------------------
      (?!                  look ahead to see if there is not:
--------------------------------------------------------------------------------
        \d                 digits (0-9)
--------------------------------------------------------------------------------
       |                   OR
--------------------------------------------------------------------------------
        [A-Z]+             any character of: 'A' to 'Z' (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
        \.                 '.'
--------------------------------------------------------------------------------
      )                    end of look-ahead
--------------------------------------------------------------------------------
      .*                   any character except \n (0 or more
                           times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
    )*                     end of grouping
--------------------------------------------------------------------------------
  )                        end of "Section" group
--------------------------------------------------------------------------------
  (?P<Subsections>         group and capture to "Subsections" group:
--------------------------------------------------------------------------------
    (?:                    group, but do not capture (0 or more
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
      \n                   '\n' (newline)
--------------------------------------------------------------------------------
      \d+                  digits (0-9) (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
      \.                   '.'
--------------------------------------------------------------------------------
      .*                   any character except \n (0 or more
                           times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
    )*                     end of grouping
--------------------------------------------------------------------------------
  )                        end of "Subsections" group
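The same breakdown can also live inside the code by writing the pattern with re.VERBOSE. This is only a sketch of an equivalent spelling, not the answer's original code: the literal space is written as `[ ]` because verbose mode ignores unescaped whitespace, and `(?m)` is left out because the pattern contains no `^` or `$`, so the flag has no effect on it. `PATTERN` and `text` are illustrative names.

```python
import re

PATTERN = re.compile(
    r"""
    (?P<Section>
        [A-Z]+\.[ ] .*                  # header line such as "A. ..."
        (?: \n (?!\d|[A-Z]+\.) .* )*    # continuation lines: must not start with a digit or "X."
    )
    (?P<Subsections>
        (?: \n \d+\. .* )*              # zero or more lines such as "1. ..."
    )
    """,
    re.VERBOSE,
)

# Already-normalized text: one "\n" between lines
text = ("A. Furnish and install:\n"
        "1. Soffit support framing.\n"
        "2. Universal grid system.\n"
        "B. Perform drilling for the\n"
        "attachment of other items.")

for m in PATTERN.finditer(text):
    print(repr(m.group("Section")), m.group("Subsections").strip().splitlines())
```

The negative lookahead is what keeps "attachment of other items." attached to section B while stopping section A before "1. Soffit support framing.".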