Python: Tokenize text by phrase length

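Question

I want to tokenize text into phrases of a fixed length.

For example, build a function process_text("Some text to be tokenized please", n = 3), where n is the phrase length, and the result should look like ["Some text to", "be tokenized please"].

How can I implement this?

Thanks!

Edit:

OK, maybe I've come up with something useful: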

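from nltk import ngrams

def process_text(text, n=1):
    # Note: ngrams() yields overlapping (sliding-window) n-grams, so n=3
    # gives "Some text to", "text to be", ... rather than disjoint chunks.
    grams = list(ngrams(text.split(), n))
    tokenised = [" ".join(gram) for gram in grams]
    return tokenised

process_text("Some text to be tokenized please", n=3)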

Tags: python, nltk, tokenize

Answer

Here is another way, using a list comprehension:

def tokenize(text):
    words = text.split(" ")
    return [' '.join(words[i:i+3]) for i in range(0, len(words), 3)]

print(tokenize("Some text to be tokenized please"))
# ['Some text to', 'be tokenized please']

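However, this isn't perfect. Because it splits on a single space character, leading, trailing, or repeated spaces produce empty or misaligned tokens: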

>>> tokenize("Some text to be tokenized please")
['Some text to', 'be tokenized please']
>>> tokenize("Some text to be tokenized please ")
['Some text to', 'be tokenized please', '']
>>> tokenize(" Some text to be tokenized please ")
[' Some text', 'to be tokenized', 'please ']
>>> tokenize(" Some text  to be tokenized please ")
[' Some text', ' to be', 'tokenized please ']
>>> tokenize(" Some text  to be   tokenized please ")
[' Some text', ' to be', '  tokenized', 'please ']

But you can adapt it to your use case.
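
One possible adjustment, as a minimal sketch: calling split() with no argument splits on any run of whitespace and drops leading and trailing whitespace, which avoids the empty tokens shown above. The function name chunk_words and the parameter n are hypothetical names chosen for illustration; they are not from the original answer.

```python
def chunk_words(text, n=3):
    """Group the words of `text` into non-overlapping phrases of up to n words.

    Hypothetical helper for illustration; assumes words are separated by whitespace.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # split() with no argument collapses runs of whitespace and
    # ignores leading/trailing whitespace, so no empty strings appear.
    words = text.split()
    # Step through the word list n words at a time; the last phrase
    # may be shorter if the word count is not a multiple of n.
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


print(chunk_words(" Some text  to be   tokenized please ", n=3))
# ['Some text to', 'be tokenized please']
```

Whether the final, shorter phrase should be kept, padded, or dropped depends on the use case; this sketch keeps it.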

