How to perform sentence tokenization

Problem description

Here is the code I use for sent_tokenize:

import nltk
from nltk.tokenize import sent_tokenize

# comments1 holds the comment text from the dataset shown below
sent_tokenize(comments1)

[Screenshot of the dataset]

I tried to pull the sentences out one by one with an array, but it didn't work:

Arr=sent_tokenize(comments1)
Arr
Arr[0]

When I use Arr[1] I get this error:

IndexError                                Traceback (most recent call last)
<ipython-input-27-c15dd30f2746> in <module>
----> 1 Arr[1]

IndexError: list index out of range
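
In other words, sent_tokenize returned a single-element list, so only Arr[0] exists. A minimal sketch that reproduces this (the sample string is an assumption; the point is that text with no sentence-final punctuation is kept as one sentence):

from nltk.tokenize import sent_tokenize

# Hypothetical stand-in for comments1: lines with no '.', '!' or '?'.
text = "great product\nfast shipping\nwould buy again"
arr = sent_tokenize(text)
print(len(arr))  # 1 -- the whole string counts as a single sentence
print(arr[0])    # 'great product\nfast shipping\nwould buy again'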

Tags: python, nltk

Solution


Read the comments in the NLTK source below.

# Standard sentence tokenizer (from nltk/tokenize/__init__.py;
# `load` below is nltk.data.load).
def sent_tokenize(text, language='english'):
    """
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
    return tokenizer.tokenize(text)


# PunktSentenceTokenizer.tokenize (from nltk/tokenize/punkt.py):
def tokenize(self, text, realign_boundaries=True):
    """
    Given a text, returns a list of the sentences in that text.
    """
    return list(self.sentences_from_text(text, realign_boundaries))
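
Putting the two pieces together: sent_tokenize is a thin wrapper that loads the Punkt model for the given language and calls its tokenize() method. A minimal sketch of the equivalent direct usage (assumes the punkt model has been downloaded, e.g. via nltk.download('punkt'); the sample string is an assumption):

from nltk.data import load

# Load the English Punkt model directly, as sent_tokenize does.
tokenizer = load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize("First sentence. Second sentence!"))
# ['First sentence.', 'Second sentence!']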

Since language='english', only !, ?, and . are treated as sentence endings. You can therefore add comments1 = comments1.replace('\n', '. ') before sent_tokenize(comments1), as sketched below.
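
A minimal sketch of that fix (the sample value of comments1 is an assumption):

from nltk.tokenize import sent_tokenize

comments1 = "great product\nfast shipping\nwould buy again"  # assumed sample
comments1 = comments1.replace('\n', '. ')  # turn line breaks into sentence ends
Arr = sent_tokenize(comments1)
print(Arr)     # ['great product.', 'fast shipping.', 'would buy again']
print(Arr[1])  # 'fast shipping.' -- no more IndexError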

Your case may be a duplicate of "nltk sentence tokenizer, consider new lines as sentence boundary".
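
An alternative along the lines of that question, which leaves the text itself untouched: split on newlines first, then run sent_tokenize on each line (a sketch, not the only way; the sample string is an assumption):

from nltk.tokenize import sent_tokenize

comments1 = "great product\nfast shipping. arrived early\nwould buy again"  # assumed sample
sentences = []
for line in comments1.splitlines():
    line = line.strip()
    if line:  # skip empty lines
        sentences.extend(sent_tokenize(line))
print(sentences)
# ['great product', 'fast shipping.', 'arrived early', 'would buy again']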

