Fixing sentences: add a space after punctuation, but not after decimal points or abbreviations

Problem description

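I'm working with very messy text in which sentences are not capitalized and punctuation is not properly spaced. I need to add a space after the punctuation marks [.,:;)!?], but not inside decimal numbers or abbreviations.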

Here is an example:

mystring = 'this is my first sentence with (brackets)in it. this is the second?What about this sentence with D.D.T. in it?or this with 4.5?'

This is what I have so far:

import re

def fix_punctuation(text):
    def sentence_case(text):
        # Split into sentences. Therefore, find all text that ends
        # with punctuation followed by white space or end of string.
        sentences = re.findall(r'[^.!?]+[.!?](?:\s|\Z)', text)

        # Capitalize the first letter of each sentence
        sentences = [x[0].upper() + x[1:] for x in sentences]

        # Combine sentences
        return ''.join(sentences)
    
    # Add a space after every punctuation mark
    text = re.sub(r'([.,;:!?)])', r'\1 ', text)
    # Capitalize sentences
    text = sentence_case(text)
    
    return text

This gives me the following output:

'This is my first sentence with (brackets) in it.  this is the second? What about this sentence with D. D. T.  in it? Or this with 4. 5? '

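I tried the approaches suggested in two other posts, but they don't work in my case. Regular expressions make my brain hurt, so I'd really appreciate your help.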

Tags: python, python-3.x, regex, re

Solution


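You can use a lookahead to check that the character after the period is not a digit, and that this character is not itself followed by another period (as in an abbreviation). Apply this check only to the period, and handle the other punctuation marks separately. You also should not insert a space between consecutive marks such as !?.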

text = re.sub(r"(\.)(?=[^\d\s.][^.])|([,;:!?)])(?=\w)", r"\1\2 ", text)
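To make the pattern easier to read, here is a minimal sketch that spells out the same regular expression with re.VERBOSE and runs it on the example string from the question. The names PUNCT_RE and add_spaces are made up for this illustration. It assumes Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string in the replacement template. It only adds spaces; capitalizing sentences would remain a separate step.

```python
import re

# Hypothetical name; the pattern itself is the one from the answer above.
PUNCT_RE = re.compile(r"""
    (\.)                # group 1: a period ...
    (?=[^\d\s.][^.])    # ... followed by a non-digit, non-space, non-period
                        #     character and then by a non-period character
  |
    ([,;:!?)])          # group 2: any other punctuation mark ...
    (?=\w)              # ... directly followed by a word character
""", re.VERBOSE)


def add_spaces(text):
    """Insert a space after punctuation that is glued to the next word."""
    # Only one of the two groups matches at a time; the other one is
    # substituted as an empty string (Python 3.5+).
    return PUNCT_RE.sub(r"\1\2 ", text)


mystring = ('this is my first sentence with (brackets)in it. this is the '
            'second?What about this sentence with D.D.T. in it?or this with 4.5?')

fixed = add_spaces(mystring)
print(fixed)
# this is my first sentence with (brackets) in it. this is the second? What about this sentence with D.D.T. in it? or this with 4.5?
```

The abbreviation D.D.T. and the decimal 4.5 are left untouched, and no double spaces are produced where a space was already present.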

The more scenarios you want to cover, the more complex the pattern becomes.
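As a rough illustration of that trade-off, the sketch below runs the same pattern on a few inputs that were not part of the question. These inputs and the helper name add_spaces are assumptions chosen for this example. They show cases the pattern does not distinguish from ordinary sentence punctuation.

```python
import re

# Same pattern as in the solution above.
pattern = r"(\.)(?=[^\d\s.][^.])|([,;:!?)])(?=\w)"


def add_spaces(text):
    # Hypothetical helper wrapping the substitution from the solution.
    return re.sub(pattern, r"\1\2 ", text)


# Periods inside a domain name look like sentence ends to the pattern.
print(add_spaces("see www.example.com now"))   # see www. example. com now

# A comma used as a thousands separator is followed by a word character.
print(add_spaces("1,000"))                     # 1, 000

# The period lookahead needs two characters, so this input is unchanged.
print(add_spaces("Hi.A"))                      # Hi.A
```

Handling such cases would need additional alternatives or lookarounds, which is where the extra complexity comes from.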

