Python - Difference between sentence-ending full stops and other full stops

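Question

I am trying to clean up a text, in this case an article. Because the whole text sits on one long line, I wanted to put each sentence on its own line, so I simply did this: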

content.replace(".", ".\n")

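Well, it didn't work. The article contains things like "e.g.", "Dr. Taylor" and "Train Nr. 11512", so obviously my result looks silly.

Does anyone have an idea what I can use to reliably tell these "non-sentence-ending" full stops apart from actual full stops? In this case, I guess I could check whether the string in front of the full stop is an actual word by checking that it contains both a vowel and a consonant. But in general, I have no idea what I can do here.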
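
As a rough illustration of the vowel-and-consonant idea, here is a minimal sketch. The helper names `looks_like_word` and `split_sentences` are hypothetical, and counting "y" as both a vowel and a consonant is an assumption made so that words like "by" and "you" pass the check. It is a heuristic only: abbreviations such as "Prof." or "etc." would still be treated as sentence endings, and a sentence that ends in a number would not be split.

```python
import re


def looks_like_word(token):
    """Heuristic: a real word has at least one vowel and one consonant."""
    letters = [c for c in token.lower() if c.isalpha()]
    # Assumption: "y" counts as a vowel and as a consonant.
    has_vowel = any(c in "aeiouy" for c in letters)
    has_consonant = any(c not in "aeiou" for c in letters)
    return has_vowel and has_consonant


def split_sentences(content):
    """Put a newline after full stops that follow a word-like token."""
    def replace(match):
        # Only the part after the last inner full stop counts: "e.g" -> "g".
        last_part = match.group(1).split(".")[-1]
        if looks_like_word(last_part):
            return match.group(1) + ".\n"
        return match.group(0)  # leave abbreviations and numbers untouched

    # A run of non-whitespace, a full stop, then whitespace.
    return re.sub(r"(\S+)\.\s+", replace, content)


sample = ("He met Dr. Taylor on Train Nr. 11512 today. "
          "It was late, e.g. by an hour. Fine.")
print(split_sentences(sample))
```

On this sample, the full stops after "Dr", "Nr" and "e.g" are left alone, while those after "today" and "hour" are followed by a newline.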

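Tags: python, string

Answer


I know this doesn't really answer your question, but if you just want to "clean up" the text so that it prints nicely, you could insert a newline after a set number of characters instead of at the end of each sentence: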

text = """Does anyone have an idea what i can use to reliably filter out these "non-sentence ending" full stops from actual full stops? In this case, i could just check if the string in front of the full stop is an actual word, by checking if it contains a vowel and a consonant i guess. But in general, i have no idea what i can do here."""

words = text.split(' ')
lines = []        # finished lines
current = []      # words collected for the line being built
line_length = 0   # length of the current line, including one trailing space

for word in words:
    # Start a new line if adding this word would exceed 70 characters.
    # (Collecting words in new lists avoids modifying a list while iterating over it.)
    if current and line_length + len(word) > 70:
        lines.append(' '.join(current))
        current = []
        line_length = 0
    current.append(word)
    line_length += len(word) + 1

lines.append(' '.join(current))
print('\n'.join(lines))

The output will be:

Does anyone have an idea what i can use to reliably filter out these
"non-sentence ending" full stops from actual full stops? In this case,
i could just check if the string in front of the full stop is an
actual word, by checking if it contains a vowel and a consonant i
guess. But in general, i have no idea what i can do here.
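
For comparison, Python's standard library also provides width-based wrapping through the `textwrap` module. The following is a minimal sketch, not part of the original answer; the `sample` text is made up for illustration.

```python
import textwrap

# Hypothetical sample text, standing in for the article on one long line.
sample = (
    "Dr. Taylor took Train Nr. 11512 this morning. The train was late, "
    "e.g. because of a signal fault, and the article goes on for a "
    "while without any line breaks at all."
)

# width=70 caps each output line at 70 characters; break_on_hyphens=False
# keeps hyphenated words such as "non-sentence" in one piece.
wrapped = textwrap.fill(sample, width=70, break_on_hyphens=False)
print(wrapped)
```

Like the loop-based approach, this only breaks at whitespace and ignores sentence boundaries entirely, so abbreviations such as "Dr." cause no trouble.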

