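python - Splitting text further while preserving newline characters

Problem description

I am using the following to split the text para while preserving the newline characters \n: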

from nltk import SpaceTokenizer
para="\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"
sent=SpaceTokenizer().tokenize(para)

This gives me the following for print(sent):

['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

My goal is to get the following output:

['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

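That is, I would like to split 'comma,' into 'comma' and ',', split 'period.' into 'period' and '.', and split 'question?' into 'question' and '?', while preserving the \n.

I have tried word_tokenize. It does split 'comma' from ',' and so on, but it does not preserve \n.

What can I do to further split sent as shown above while preserving \n?

Tags: python, string, split, nltk, tokenize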

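https://docs.python.org/3/library/re.html#re.split is probably what you want.

However, judging from your desired output, you will need to process the string a bit more than just applying a single function to it.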
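One way this could look, as a minimal sketch rather than a definitive solution: split each space-separated token again with re.split, using a capturing group so the punctuation is kept as its own token. The helper name split_punct and the punctuation set ",.?" are assumptions chosen to match the desired output in the question, and para.split(" ") stands in for SpaceTokenizer().tokenize(para), which also splits on single spaces.

```python
import re

para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"

# Splitting on a single space mirrors what SpaceTokenizer().tokenize(para) returns.
sent = para.split(" ")


def split_punct(tokens, punct=",.?"):
    """Split the given punctuation characters off each token (hypothetical helper)."""
    # The capturing group makes re.split keep the punctuation in its result.
    pattern = re.compile("([" + re.escape(punct) + "])")
    out = []
    for tok in tokens:
        if tok == "":
            # Keep the empty strings produced by consecutive spaces.
            out.append(tok)
            continue
        # Drop only the empty pieces that re.split itself creates.
        out.extend(piece for piece in pattern.split(tok) if piece)
    return out


result = split_punct(sent)
print(result)
```

Because each token is processed on its own, the \n characters attached to tokens such as '\n[STUFF]\n' are never touched, and tokens without any of the listed punctuation (for example 'char*') pass through unchanged.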

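I would start by replacing every \n with a string such as new_line_goes_here before splitting the string, and then replace new_line_goes_here with \n again once it has all been split.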
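The placeholder idea can be sketched roughly as follows. The function names are hypothetical, and simple_tokenize is only a regex-based stand-in for a tokenizer that discards whitespace. Whether a particular tokenizer (word_tokenize, for instance) keeps new_line_goes_here as a single token is an assumption you would need to check.

```python
import re

PLACEHOLDER = "new_line_goes_here"


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption): words and single punctuation marks, whitespace dropped."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_keeping_newlines(text, tokenize=simple_tokenize):
    """Protect newlines with a placeholder, tokenize, then restore them."""
    # Surround the placeholder with spaces so it becomes a token of its own.
    marked = text.replace("\n", " " + PLACEHOLDER + " ")
    tokens = tokenize(marked)
    # Swap the placeholder back for a real newline.
    return ["\n" if tok == PLACEHOLDER else tok for tok in tokens]


tokens = tokenize_keeping_newlines("comma, with\nperiod.")
print(tokens)  # ['comma', ',', 'with', '\n', 'period', '.']
```

Note that in this sketch each newline comes back as a separate '\n' token instead of staying attached to a neighbouring word, so the result does not have exactly the shape requested in the question. The placeholder must also be a string that cannot occur in the real text.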
