TypeError: expected string or bytes-like object when using NLTK word_tokenize

Problem description

I am trying to import a CSV file and then analyze the text with NLTK. The CSV file contains several columns, but for now I only want to analyze one column of this file.

(A sample of the CSV file was shown here as an image in the original post.)

The code for reading the CSV file and calling word_tokenize is as follows:

import pandas as pd
import nltk
#nltk.download('all')

data=pd.read_csv("Output-analysis.csv")
print (data.SAT_COMMENTS)

from nltk.tokenize import word_tokenize
tokenize_word=word_tokenize(data.SAT_COMMENTS)
print(tokenize_word)

It seems I can read and print the SAT_COMMENTS column just fine, but when I try to use word_tokenize, it prints some rows from the CSV file and then raises TypeError: expected string or bytes-like object.

Error details:

Traceback (most recent call last):
  File "C:\Users\Rachel\Desktop\SAT analysis\Attempts.py", line 22, in <module>
    tokenize_word=word_tokenize(data.SAT_COMMENTS)
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 1272, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 1326, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 1326, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in span_tokenize
    for sl in slices:
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 1357, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 314, in _pair_iter
    prev = next(it)
  File "C:\Users\Rachel\AppData\Local\Programs\Python\Python38\lib\site-packages\nltk\tokenize\punkt.py", line 1330, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

Any suggestions? I know word_tokenize processes one string at a time, but I don't know what is wrong here. Thanks.

Tags: pandas, csv

Solution


Make sure there are no NaN values in the column:

data.SAT_COMMENTS = data.SAT_COMMENTS.fillna('')
