Including newlines and any other characters in a regular expression - Python

Problem description

I am using Jupyter Notebook and regular expressions in Python to build a dictionary of words and definitions from a dictionary file in txt format.

Sample data from the text file: ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\nbacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\nABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n

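The pattern I am trying to build should capture the word, which is written entirely in capital letters, and then skip all text up to the definition.

Desired output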

{'word': 'ABACINATE', 'definition': 'To blind by a red-hot metal plate held before the eyes.'}
{'word': 'ABACINATION', 'definition': 'The act of abacinating.'}

The pattern I have tried so far is

pattern="""
(?P<word>[A-Z*]{3,}) #retrieve capital letter word
(\n.*\n\n\Defn:) #ignore all text up until Defn:
(?P<definition>\w*) #retrieve any worded character after Defn:
(.\ ) #end at the full stop and space
"""
for item in re.finditer(pattern,all_words,re.VERBOSE):
    print(item.groupdict())

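I am struggling with the newlines here. I am trying to isolate the capital-letter word, then start right at the following newline and ignore every character up to the two newlines before 'Defn:', and finally retrieve the definition, which ends at the full stop.

Is there a way to handle newlines like this?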

Tags: python, python-3.x, regex, jupyter-notebook

Solution


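You mostly have it; you are only missing a non-greedy match and a wider set of characters for the definition. The dot does not match a newline by default, so .* cannot cross the multi-line etymology, and \w* does not match spaces, so it cannot capture a definition made of several words.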
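As a rough illustration of those two points, here is a small sketch. The sample string is made up for the demo and is not taken from the dictionary file; it only shows how the dot, [\s\S], and the lazy quantifier behave across newlines.

```python
import re

# Hypothetical sample text, invented for this demo.
sample = "WORD\nfiller line\n\nDefn: first.\n\nOTHER\n\nDefn: second.\n"

# Without re.DOTALL the dot stops at a newline, so ".*" cannot cross lines.
no_dotall = re.search(r"WORD.*Defn:", sample)

# "[\s\S]" matches any character, newlines included.
# The greedy "*" runs on to the LAST "Defn:" in the text.
greedy = re.search(r"WORD[\s\S]*Defn:", sample).group()

# The lazy "*?" stops at the FIRST "Defn:" instead.
lazy = re.search(r"WORD[\s\S]*?Defn:", sample).group()

# re.DOTALL makes the dot match newlines too, so this is equivalent to the lazy form.
dotall_lazy = re.search(r"WORD.*?Defn:", sample, re.DOTALL).group()

print(no_dotall)
print(repr(greedy))
print(repr(lazy))
```

Either [\s\S]*? or .*? with re.DOTALL should work here; the lazy quantifier is what keeps each match inside a single dictionary entry.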

import re
all_words = """ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\nbacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\nABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n"""

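pattern = r"""
(?P<word>[A-Z*]{3,})          # headword: three or more capitals (or asterisks)
([\s\S]*?Defn:)               # lazily skip any characters, newlines included, up to Defn:
(?P<definition>[a-zA-Z -]*)   # definition: letters, spaces and hyphens only
"""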
for item in re.finditer(pattern,all_words,re.VERBOSE):
    print(item.groupdict())

{'word': 'ABACINATE', 'definition': ' To blind by a red-hot metal plate held before the eyes'}
{'word': 'ABACINATION', 'definition': ' The act of abacinating'}
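The pattern above keeps the space that follows Defn: and stops just before the full stop, so its result differs slightly from the desired output in the question. Below is one possible variant that trims the leading space and keeps the full stop. It is a sketch under assumptions that go beyond the original answer: headwords are taken to sit alone on a line and to consist only of capitals, and a definition is taken to end at its first full stop, so a definition containing an abbreviation such as "e.g." would be cut short.

```python
import re

all_words = (
    'ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\n'
    'bacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\n'
    'ABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n'
)

pattern = re.compile(
    r"""
    ^(?P<word>[A-Z]{3,})$      # headword: a whole line of capitals (assumption)
    [\s\S]*?                   # lazily skip everything, newlines included
    Defn:\s*                   # the marker plus the whitespace after it
    (?P<definition>[^.]*\.)    # text up to and including the first full stop (assumption)
    """,
    re.VERBOSE | re.MULTILINE,  # MULTILINE lets ^ and $ match at line boundaries
)

entries = [m.groupdict() for m in pattern.finditer(all_words)]
for entry in entries:
    print(entry)
```

On the sample data this yields the two dictionaries listed under the desired output. Real dictionary files may contain multi-word or hyphenated headwords, so the headword class would likely need widening for them.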

