python - How do I split data out of a string and put it into predefined categories?
Problem description
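I am trying to split the following sentences, which are stored in an array, into categories. The categories are line number, stations, closure type, and dates, because all of the subway closure announcements I am scraping follow this format.
"Line 1: Finch to Sheppard-Yonge nightly early closures March 23 to 26 - CANCELLED"
"Line 1: Lawrence to St Clair weekend closure Sunday, March 29 - REVISED"
"Line 1: Sheppard-Yonge to St Clair nightly early closures March 30 to April 2 - REVISED"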
For example:
Line = {0:"Line 1", 1:"Line 1", 2:"Line 1"}
Stations = {0: "Finch to Sheppard-Yonge", 1:"Lawrence to St Clair", 2:"Sheppard-Yonge to St Clair"}
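I wrote some very convoluted for loops to do this, but they have a lot of bugs, and each category needs different code logic. Below is an example of how I extract the closure type from the sentences above, assuming there are 3 closure types stored in the closure_types array: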
closure_types = ["nightly early closures","single day closure","weekend closure"]
closure_types_split = []
for closure_type in closure_types:
    split_closure_type_a = closure_type.split()
    closure_types_split.append(split_closure_type_a)
closure_type_categorized = []
for i in range(len(split_closures)):
    for ins in range(len(closure_types_split)):
        try:
            first_word_in_closure_types_split = closure_types_split[ins][0]
            first_word = split_closures[i].index(str(first_word_in_closure_types_split))
            if split_closures[i][first_word] == 'nightly':
                last_word = first_word + 3
                closure_type_categorized.append(split_closures[i][first_word:last_word])
            elif split_closures[i][first_word] == 'single':
                last_word = first_word + 2
                closure_type_categorized.append(split_closures[i][first_word:last_word])
            elif split_closures[i][first_word] == 'weekend':
                last_word = first_word + 2
                closure_type_categorized.append(split_closures[i][first_word:last_word])
        except:
            pass
My question is: is there a simpler way to do what I am trying to do? Or is there a Python library designed for this?
Solution
This can be handled with a regular expression:
import re

# note: spaces in the names must use `\s` (see St Clair),
# because the re pattern uses verbose mode.
stations = '|'.join(line.strip() for line in
    r"""
    Finch
    Lawrence
    Sheppard-Yonge
    St\sClair
    """.strip().splitlines())

# The re pattern is a raw f-string so the {stations} can be inserted.
pattern = rf"""(?ix)
    Line\s+(?P<line>\d+):
    \s*
    (?P<where>(?:{stations})(?:\s*to\s*(?:{stations}))*)  # one or more stations separated by 'to'
    \s*
    (?P<what>(?:\w*\s+)*?closures?)                       # phrase ending with closure or closures
    \s*
    (?P<when>[^-]*)                                       # everything up to a '-'
    \s*
    (?:-\s* (?P<note>.*))?                                # if there is a '-' everything after it
"""

template = re.compile(pattern)
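The comment about `\s` can be checked with a small sketch. It shows that verbose mode ignores unescaped whitespace in a pattern, and that `re.escape` (which also escapes spaces) is one possible way to build the alternation from plain station names. The names `plain`, `escaped`, `names`, and `stations_alt` are made up for this illustration and are not part of the answer above.

```python
import re

# In verbose mode (?x), unescaped whitespace inside the pattern is ignored.
plain = re.compile(r"(?x)St Clair")      # effectively the pattern "StClair"
escaped = re.compile(r"(?x)St\sClair")   # \s survives verbose mode

print(plain.search("St Clair") is None)        # the space in the text is not matched
print(plain.search("StClair") is not None)
print(escaped.search("St Clair") is not None)

# Assumption: station names are kept as plain strings and escaped on the fly.
# re.escape() backslash-escapes spaces and hyphens, so the result is safe
# to drop into a verbose-mode pattern.
names = ["Finch", "Lawrence", "Sheppard-Yonge", "St Clair"]
stations_alt = "|".join(re.escape(name) for name in names)
alternation = re.compile(rf"(?x)(?:{stations_alt})")
print(alternation.fullmatch("St Clair") is not None)
```

With this variant the station list needs no hand-written `\s`, at the cost of one extra call per name.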
Using the compiled pattern on the test cases:
testcases = [
    "Line 1: Finch to Sheppard-Yonge nightly early closures March 23 to 26 - CANCELLED",
    "Line 1: Lawrence to St Clair weekend closure Sunday, March 29 - REVISED",
    "Line 1: Sheppard-Yonge to St Clair nightly early closures March 30 to April 2 - REVISED",
]

for test in testcases:
    mo = template.search(test)
    print(mo.groupdict())
Prints:
{'line': '1', 'where': 'Finch to Sheppard-Yonge', 'what': 'nightly early closures', 'when': 'March 23 to 26 ', 'note': 'CANCELLED'}
{'line': '1', 'where': 'Lawrence to St Clair', 'what': 'weekend closure', 'when': 'Sunday, March 29 ', 'note': 'REVISED'}
{'line': '1', 'where': 'Sheppard-Yonge to St Clair', 'what': 'nightly early closures', 'when': 'March 30 to April 2 ', 'note': 'REVISED'}
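To get from these match dictionaries to the per-category layout shown in the question (`Line = {0: "Line 1", ...}`), something like the following sketch could work. It rebuilds the same pattern so it runs on its own. The names `STATIONS`, `TEMPLATE`, `categorize`, and the category keys are assumptions made for illustration. It also strips the trailing space that the `when` group picks up before the `-`.

```python
import re

# Same pattern as in the answer, repeated so this sketch is self-contained.
STATIONS = r"Finch|Lawrence|Sheppard-Yonge|St\sClair"
TEMPLATE = re.compile(rf"""(?ix)
    Line\s+(?P<line>\d+):
    \s*
    (?P<where>(?:{STATIONS})(?:\s*to\s*(?:{STATIONS}))*)
    \s*
    (?P<what>(?:\w*\s+)*?closures?)
    \s*
    (?P<when>[^-]*)
    \s*
    (?:-\s* (?P<note>.*))?
""")


def categorize(sentences):
    """Collect each field into a dict keyed by the sentence's index."""
    categories = {"Line": {}, "Stations": {}, "Closure type": {}, "Dates": {}}
    for i, sentence in enumerate(sentences):
        mo = TEMPLATE.search(sentence)
        if mo is None:
            continue  # skip sentences that do not fit the expected format
        categories["Line"][i] = "Line " + mo["line"]
        categories["Stations"][i] = mo["where"]
        categories["Closure type"][i] = mo["what"]
        categories["Dates"][i] = mo["when"].strip()  # drop trailing space before '-'
    return categories


closures = [
    "Line 1: Finch to Sheppard-Yonge nightly early closures March 23 to 26 - CANCELLED",
    "Line 1: Lawrence to St Clair weekend closure Sunday, March 29 - REVISED",
    "Line 1: Sheppard-Yonge to St Clair nightly early closures March 30 to April 2 - REVISED",
]
result = categorize(closures)
print(result["Line"])
print(result["Stations"])
```

Sentences that do not match are skipped here, so their index is simply missing from each category. Whether to skip, log, or raise in that case is a design choice that depends on how uniform the scraped announcements really are.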
For more complex parsing problems, I like the TatSu library.