Regex find-and-replace with an inner function not working

Problem description

I'm new to Stack Overflow, and I hope someone can help me with the following code.

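I am trying to adapt a piece of code from the Python Cookbook by Ascher, Ravenscroft and Martelli. I want to use the key:value pairs of a dictionary to replace every word in text that contains a long s (ſ) with the equivalent word spelled with a modern lowercase s (all text is UTF-8). I can build the dictionary from an existing tab-delimited file without any problem (the code below uses a short sample dictionary so that it is easy to edit), but I want to make all the changes in a single pass for speed and efficiency. I removed the map and escape parts of the code because I don't think the long s needs escaping (though I may be wrong!).

The first part works fine, but the inner function one_xlat does not seem to do anything: text is not returned or printed at the end, and there is no error message. I have run the code from the command line and in IDLE, with the same result. I have run it with and without map and escape, and I have renamed the variables to be sure, but I can't quite get it to work. Can anyone help? Apologies if I have missed something obvious, and thanks in advance.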

The original code from Ascher, Ravenscroft and Martelli:

import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

The adapted version:

import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

def word_replace(text, adictCR):
    regex_dict = re.compile('|'.join(adictCR))
    print(regex_dict)
    def one_xlat(match):
        return adictCR[match.group(0)]
    return regex_dict.sub(one_xlat, text)
    print(text)

word_replace(text, adictCR)

Tags: python, regex, python-3.x, dictionary

Solution


I would rewrite your code like this:

# -*- coding: utf-8 -*-
import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

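new_s = []
# Split the text into alternating runs of word and non-word characters
for g in (m.group(0) for m in re.finditer(r'\w+|\W+', text)):
    # Swap in the modern spelling when the token is a dictionary key
    if g in adictCR:
        g = adictCR[g]
    new_s.append(g)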

You can then get your new string with ''.join(new_s).
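As a minimal sketch, the loop and the final join can be wrapped in one function. The name replace_words and the short sample data are hypothetical, chosen only for illustration, and the sketch assumes Python 3, where \w matches non-ASCII letters such as ſ in str patterns.

```python
import re

def replace_words(text, mapping):
    """Replace each whole-word token of text that is a key of mapping.

    Hypothetical helper: the name and signature are not from the original answer.
    """
    # r'\w+|\W+' yields alternating runs of word and non-word characters,
    # so joining the tokens back together loses nothing.
    tokens = (m.group(0) for m in re.finditer(r'\w+|\W+', text))
    # Tokens without a dictionary entry are passed through unchanged.
    return ''.join(mapping.get(tok, tok) for tok in tokens)

sample_dict = {"proſpect": "prospect", "oſ": "of"}
sample_text = "an extensive proſpect oſ the town"

# The function only returns the new string; the caller has to print or store it.
result = replace_words(sample_text, sample_dict)
print(result)  # an extensive prospect of the town
```

Two things are worth noting. In the code from the question, the value returned by word_replace is never printed or assigned, and print(text) sits after the return statement, so it never runs; that would explain why nothing appears to happen. Also, because tokens are split at every non-word character, a hyphenated key such as "ſea-side" can never match a single token with this approach.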

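Note: the pattern '\w+|\W+' only handles non-ASCII text on recent versions of Python (3.1+), where \w matches Unicode word characters by default in str patterns. You could also use re.split(r'(\W)', text) as an alternative, but I don't think that works with UTF-8 text in Python 2.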
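The re.split alternative could look roughly like the following sketch, again assuming Python 3. The function name replace_words_split is hypothetical. The capturing group in the pattern matters: it makes re.split keep the separators in the returned list, so the text can be rebuilt exactly.

```python
import re

def replace_words_split(text, mapping):
    """Variant of the token approach built on re.split (hypothetical helper)."""
    # Because (\W) is a capturing group, each separator character is kept
    # as its own list item; adjacent separators produce empty strings,
    # which are harmless when the parts are joined again.
    parts = re.split(r'(\W)', text)
    return ''.join(mapping.get(part, part) for part in parts)

sample_dict = {"proſpect": "prospect", "oſ": "of"}
print(replace_words_split("proſpect, oſ the town", sample_dict))
# prospect, of the town
```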

