python - Extracting data with regular expressions

Problem description

text='''

        Consumer Price Index:
        +0.3% in Aug 2020

        Unemployment Rate:
        +2.4% in Aug 2020
'''

Use a regular expression to extract the data into a list of tuples and return that list, for example:

[('Consumer Price Index', '+0.3%', 'Aug 2020'), ...]

I have tried a few variations of

re.findall( , text)

Does anyone have a good idea?

Tags: python, regex, nlp

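I would first split the string on '\n\n' to break it into separate sections (to avoid confusion), then run the regular expression on each section to extract the groups.

See this example: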

import re

text = '''

        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020
        '''


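# Split on blank lines so that each section holds a single indicator.
sections = text.split('\n\n')

# Compile once, outside the loop. [\w ]+ cannot cross a newline, so the
# last group stops at the end of the date and picks up no trailing whitespace.
pattern = re.compile(r'\s*([\w ]+):\n\s*(\+[\d.]+%) in ([\w ]+)')

results = []

for section in sections:
    # match() anchors at the start of the section; empty sections give None.
    match = pattern.match(section)

    if match:
        results.append(match.groups())

print(results)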

Output:

[
   ('Consumer Price Index', '+0.2%', 'Sep 2020'),
   ('Unemployment Rate', '+7.9%', 'Sep 2020')
]
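
The split-then-match idea can also be wrapped in a small reusable function. The following is only a sketch under some assumptions: the name extract_indicators, the tolerance for whitespace on the blank separator lines, and the optional minus sign are illustrative additions, and the negative value in the sample is made up to show that case.

```python
import re

# Hypothetical pattern: "Name:" on one line, "+X% in Mon YYYY" on the next.
SECTION_PATTERN = re.compile(
    r'\s*([\w ]+):[ \t]*\n[ \t]*([+-]?[\d.]+%) in ([\w ]+)'
)


def extract_indicators(text):
    """Return (name, change, period) tuples, one per blank-line-separated section."""
    results = []
    # Split on blank lines, tolerating spaces or tabs on the "empty" line.
    for section in re.split(r'\n[ \t]*\n', text):
        match = SECTION_PATTERN.match(section)
        if match:
            results.append(match.groups())
    return results


# Made-up sample; the negative value is invented for illustration.
sample = '''
        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        -7.9% in Sep 2020
'''

print(extract_indicators(sample))
```

Sections that do not look like an indicator (including empty ones) are skipped rather than raising an error.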

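Update:

Here is a solution using re.findall, but as I said, depending on how text is structured, the results may be inconsistent. To be safe, you should divide and conquer.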

import re

text = '''

        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020
        '''


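# findall() scans the whole text in one pass, so no splitting is needed here.
# [\w ]+ cannot cross a newline, which keeps trailing whitespace out of the date group.
pattern = re.compile(r'\s*([\w ]+):\n\s*(\+[\d.]+%) in ([\w ]+)')

results = pattern.findall(text)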

print(results)
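
If a single pass is preferred, one way to reduce the sensitivity to stray whitespace is to anchor each line with re.MULTILINE and use named groups. This is a sketch, not the only option; the group names name, change, and period are illustrative, and the pattern assumes that every "Name:" line is immediately followed by its value line.

```python
import re

text = '''

        Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020
        '''

# Assumed layout: a "Name:" line directly followed by a "+X% in Mon YYYY" line.
pattern = re.compile(
    # Indicator name line, e.g. "Consumer Price Index:"
    r'^[ \t]*(?P<name>[\w ]+):[ \t]*\n'
    # Value line, e.g. "+0.2% in Sep 2020"
    r'[ \t]*(?P<change>[+-]?\d+(?:\.\d+)?%) in (?P<period>[A-Za-z]+ \d{4})[ \t]*$',
    re.MULTILINE,
)

# group() with several names returns a tuple in that order.
results = [m.group('name', 'change', 'period') for m in pattern.finditer(text)]

print(results)
# [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020')]
```

Because ^ and $ match at line boundaries here, no group can run across a newline, so the extracted values need no extra stripping.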
