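python - How to return a list of dictionaries from regex.findall matches?
Problem description
I am processing hundreds of documents, and I am writing a function that finds specific words and their values and returns a list of dictionaries.
I am looking for one particular piece of information ("City" and the number that refers to it). However, some documents contain a single city, while others may contain twenty or even a hundred, so I need something very general.
An example of the text (the parentheses really are run together like this):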
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
or
text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
Using a regular expression, I found the string I am looking for:
p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)
It returns the whole matched text as a one-element list:
[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
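This is where I am stuck, and I don't know how to continue. Should I use regex.findall or regex.finditer?
Since the number of cities varies from document to document, I would like to get back a list of dictionaries. If I run it on text2, I would get: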
d = [{'cities': 'Eger', 'population': '32,352'}]
If I run it on the first text:
d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
I would really appreciate any help, guys!
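Solution
You can use re.finditer with a regex that has named capturing groups (named after your keys) on the matched text, and call x.groupdict() on each match to get the resulting dictionary: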
import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    # Run the per-city pattern only on the text captured between "City:" and "However"
    print([x.groupdict() for x in p2.finditer(m.group(1))])
# => [{'population': '1,590,316', 'city': 'Budapest'}, {'population': '115,399', 'city': 'Debrecen'}, {'population': '104,867', 'city': 'Szeged'}, {'population': '109,841', 'city': 'Miskolc'}]
See the Python 3 demo online.
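Since the question is about running this over hundreds of documents, the two steps can be wrapped in a function. The following is only a minimal sketch: the function name extract_cities and the constant names are hypothetical, the first group is named cities so that the keys match the output requested in the question, and it assumes (as in the sample texts) that the list starts with "City:" and ends before "However".

```python
import re

# Step 1: isolate the span between "City:" and "However" (assumed delimiters).
SECTION = re.compile(r'City:\s*(.*?)However')
# Step 2: one "Name (... 1,234" entry; the group names become the dictionary keys.
ENTRY = re.compile(r'(?P<cities>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

def extract_cities(text):
    """Return a list of {'cities': ..., 'population': ...} dicts, or [] if no section is found."""
    section = SECTION.search(text)
    if section is None:
        return []
    return [m.groupdict() for m in ENTRY.finditer(section.group(1))]

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
result = extract_cities(text2)
print(result)
# => [{'cities': 'Eger', 'population': '32,352'}]

empty = extract_cities('no cities here')
print(empty)
# => []
```

Note that \w+ captures a single word, so a multi-word city name would only yield its last word with this pattern.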
The second regex, p2, is
(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)
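See the regex demo.
Here,
(?P<city>\w+) - Group "city": one or more word characters
\s*\( - zero or more whitespace characters and a (
[^()\d]* - zero or more characters other than (, ) and digits
(?P<population>\d[\d,]*) - Group "population": a digit followed by zero or more digits and/or commas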
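The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a sketch of an equivalent spelling; the name p2_verbose and the short test string are made up for illustration.

```python
import re

# Same pattern as p2, spelled out with re.VERBOSE so each part carries its own comment.
p2_verbose = re.compile(r'''
    (?P<city>\w+)                # group "city": one or more word characters
    \s*\(                        # optional whitespace, then a literal "("
    [^()\d]*                     # anything except parentheses and digits
    (?P<population>\d[\d,]*)     # group "population": a digit, then digits or commas
''', re.VERBOSE)

m = p2_verbose.search('Eger (population was: 32,352)')
print(m.groupdict())
# => {'city': 'Eger', 'population': '32,352'}
```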
You could try running the p2 regex on the whole original string (see demo), but it may overmatch.
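To illustrate the overmatching risk, here is a small sketch using a made-up sentence (not one of the original documents) that has an unrelated parenthesized number before the city list. Running p2 on the whole string picks up that number as a "city", while the two-step approach does not.

```python
import re

p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

# Made-up text with an unrelated parenthesized number before the city list.
sample = 'Founded (in 1873) City: Eger (population was: 32,352). However etc etc'

# p2 on the whole string: "Founded (in 1873" also fits the pattern.
whole = [x.groupdict() for x in p2.finditer(sample)]
print(whole)
# => [{'city': 'Founded', 'population': '1873'}, {'city': 'Eger', 'population': '32,352'}]

# p2 only on the text captured between "City:" and "However".
m = p.search(sample)
scoped = [x.groupdict() for x in p2.finditer(m.group(1))] if m else []
print(scoped)
# => [{'city': 'Eger', 'population': '32,352'}]
```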