How do I return a list of dictionaries from regex.findall matches?

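Question

I am processing hundreds of documents and writing a function that looks for specific words and their values and returns a list of dictionaries.

I am looking for one specific piece of information (a "city" and the number that refers to it). However, some documents contain a single city while others may contain twenty or even a hundred, so I need something very generic.

A sample text (the parentheses really are run together like this):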

text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'

or

text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'

Using a regex, I found the string I am looking for:

import regex  # third-party module (pip install regex); its API mirrors the standard re module

p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)

It returns the whole stretch between "City" and "However" as a one-element list:

[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']

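This is where I am stuck, and I don't know how to proceed. Should I use regex.findall or regex.finditer?

Since the number of "cities" varies from document to document, I would like to return a list of dictionaries. If I run it on text2, I would get: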

d = [{'cities': 'Eger', 'population': '32,352'}] 

If I run it on the first text:

d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]

I would really appreciate any help, guys!

Tags: python, regex, string, list, backend

Answer


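You can run re.finditer on the matched text with a regex that uses named capturing groups (named after your keys), and call x.groupdict() on each match to get a dictionary of results: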

import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
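# Step 1: isolate the stretch between "City:" and "However"
p = re.compile(r'City:\s*(.*?)However')
# Step 2: pull each city name and its population out of that stretch
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    print([x.groupdict() for x in p2.finditer(m.group(1))])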

# => [{'city': 'Budapest', 'population': '1,590,316'}, {'city': 'Debrecen', 'population': '115,399'}, {'city': 'Szeged', 'population': '104,867'}, {'city': 'Miskolc', 'population': '109,841'}]

See the online Python 3 demo.
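
When many documents have to be processed, the same two steps can be wrapped in a function. The following is a minimal sketch rather than part of the original answer: the helper name extract_cities and the constant names SECTION and ENTRY are made up for illustration, and the city group is renamed to cities so that the keys match the output asked for in the question. It returns an empty list when a document has no "City: ... However" stretch.

```python
import re

# Same two patterns as above: SECTION isolates the text between "City:" and
# "However"; ENTRY matches one "Name (... number" pair inside that stretch.
SECTION = re.compile(r'City:\s*(.*?)However')
ENTRY = re.compile(r'(?P<cities>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')


def extract_cities(text):
    """Return a list of {'cities': ..., 'population': ...} dicts (hypothetical helper)."""
    section = SECTION.search(text)
    if section is None:
        return []  # no "City: ... However" stretch in this document
    return [m.groupdict() for m in ENTRY.finditer(section.group(1))]


text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
print(extract_cities(text2))
# => [{'cities': 'Eger', 'population': '32,352'}]
```

Note that . does not match a newline by default, so if the city list can span several lines, SECTION would need the re.DOTALL flag.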

The second regex, p2, is

(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)

See the regex demo.

Here,

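  • (?P<city>\w+) - group "city": one or more word characters
  • \s*\( - zero or more whitespace characters followed by a literal (
  • [^()\d]* - zero or more characters other than (, ) and digits
  • (?P<population>\d[\d,]*) - group "population": a digit followed by zero or more digits or commas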
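
The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of an alternative spelling; the name p2_verbose is made up here, and the pattern is meant to behave exactly like p2.

```python
import re

# The same pattern as p2, spelled out with re.VERBOSE so each part carries a comment.
p2_verbose = re.compile(r'''
    (?P<city>\w+)                 # group "city": one or more word characters
    \s*\(                         # optional whitespace, then a literal "("
    [^()\d]*                      # anything except "(", ")" and digits
    (?P<population>\d[\d,]*)      # group "population": a digit, then digits or commas
''', re.VERBOSE)

m = p2_verbose.search('Szeged (population was: 104,867)')
print(m.groupdict())
# => {'city': 'Szeged', 'population': '104,867'}
```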

You could try running the p2 regex on the whole original string (see demo), but it may overmatch.
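
To illustrate the overmatching risk, here is a small sketch on a made-up sample string (not taken from the question) in which a parenthesised number appears before the "City:" stretch. Running p2 over the whole string picks up that unrelated entry, while the two-step approach does not.

```python
import re

p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

# Made-up sample: a parenthesised number appears before the "City:" stretch.
sample = ('Lake Balaton (area: 592 km2) is popular. '
          'City: Eger (population was: 32,352). However etc etc')

# p2 over the whole string versus p2 over the isolated stretch only
whole = [m.groupdict() for m in p2.finditer(sample)]
section = p.search(sample)
scoped = [m.groupdict() for m in p2.finditer(section.group(1))]

print(whole)   # includes a spurious entry for "Balaton"
print(scoped)  # only the entry for "Eger"
```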

