python - Remove repeated punctuation from a string

Problem description

I am cleaning up some text that looks like this:

Great talking with you. ? See you, the other guys and Mr. Jack Daniels next  week, I hope-- ? Bobette ? ? Bobette  Riner???????????????????????????????   Senior Power Markets Analyst??????   TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell:  832/428-7008 bobette.riner@ipgdirect.com http://www.tradersnewspower.com ? ?  - cinhrly020101.doc

It contains runs of multiple spaces and question marks. To clean it up I am using regular expressions:

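import re

def remove_duplicate_characters(text):
    # Collapse each run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Collapse each run of "?" (with any whitespace before it) into one "?".
    text = re.sub(r"\s*\?+", "?", text)
    # Second pass: merges the "??" left behind by runs such as "? ?".
    text = re.sub(r"\s*\?+", "?", text)
    return text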


remove_duplicate_characters(msg)

This gives me the following result:

'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner@ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'

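It works for this particular case, but it does not look like the best approach if I want to add more characters to remove. Is there a better way to solve this?

Tags: python, regex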

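Solution

To replace every run of the same consecutive punctuation character with a single occurrence of that character, you can use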

re.sub(r"([^\w\s]|_)\1+", r"\1", text)

If leading whitespace must be removed as well, use the r"\s*([^\w\s]|_)\1+" regex.

See the regex demo online.
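As a quick check of the two patterns above, here is a minimal sketch. The helper names collapse_repeated_punctuation and collapse_and_strip_leading_space are made up for illustration; only the regexes come from the answer. Note that the pattern only collapses runs of the same character with nothing in between, so "--" becomes "-", and a sequence like "? ?" (separated by a space) is left alone.

```python
import re

# Pattern from the answer: a punctuation character (anything that is not a
# word character or whitespace, plus "_") followed by one or more copies of itself.
REPEATED_PUNCT = re.compile(r"([^\w\s]|_)\1+")

# Variant from the answer that also swallows whitespace before the run.
REPEATED_PUNCT_LEADING_WS = re.compile(r"\s*([^\w\s]|_)\1+")


def collapse_repeated_punctuation(text):
    """Replace each run of one repeated punctuation character with a single copy."""
    return REPEATED_PUNCT.sub(r"\1", text)


def collapse_and_strip_leading_space(text):
    """Same as above, but also drop whitespace directly before the run."""
    return REPEATED_PUNCT_LEADING_WS.sub(r"\1", text)


print(collapse_repeated_punctuation("Riner??????? Senior"))          # Riner? Senior
print(collapse_repeated_punctuation("I hope--"))                      # I hope-
print(collapse_and_strip_leading_space("Analyst ?????? TradersNews")) # Analyst? TradersNews
```

Adding more characters needs no change to the pattern, since the character class already covers every non-word, non-whitespace character.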

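If you want to introduce exceptions to this generic regex, you can add an alternative on the left that captures every context in which consecutive punctuation should be kept: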

re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)

See this regex demo.

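The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures into Group 1 either a ... (with no other dots on either side) or a :// string (common in URLs). The rest is the original regex with an adjusted backreference (there are now two capturing groups, so \1 becomes \2).

The \1\2 in the replacement pattern puts the captured values back into the result string.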
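A minimal sketch of the exception pattern in use follows. The function name collapse_with_exceptions and the sample string (including the example.com URL) are made up for illustration. It assumes Python 3.5 or later, where a group that did not participate in the match is replaced with an empty string in the replacement; older versions raise an error for the unmatched group.

```python
import re

# Group 1 protects "..." (exactly three dots) and "://";
# Group 2 is the generic "repeated punctuation character" alternative.
PATTERN = re.compile(r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+")


def collapse_with_exceptions(text):
    """Collapse repeated punctuation, but keep "..." and "://" untouched."""
    # Only one of the two groups matches at a time; the other one
    # contributes an empty string to the replacement (Python 3.5+).
    return PATTERN.sub(r"\1\2", text)


sample = "Wait... see http://example.com now!!!"
print(collapse_with_exceptions(sample))  # Wait... see http://example.com now!
```

Further exceptions can be added as extra alternatives inside the first group.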
