How to convert a pandas DataFrame into a string or bytes-like object for use with NLTK

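Problem description

One column of my pandas DataFrame contains text, and I want to combine it into a single piece of text for further processing with NLTK.

For example: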

    book    lines
0   dracula The Project Gutenberg EBook of Dracula, by Br...
1   dracula \n
2   dracula This eBook is for the use of anyone anywhere a...
3   dracula almost no restrictions whatsoever. You may co...
4   dracula re-use it under the terms of the Project Guten...

Here is my code:

list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['lines']) if i.lower() not in stop_words and i.isalpha()]

and it produced this error:

Traceback (most recent call last):

File "<ipython-input-267-3bb703816dc6>", line 1, in <module>
list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['Injury_desc']) if i.lower() not in stop_words and i.isalpha()]

File "C:\Users\LIUX\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\regexp.py", line 131, in tokenize
return self._regexp.findall(text)

TypeError: expected string or bytes-like object

Tags: python, pandas, nltk

Solution


The error occurs because you are passing a DataFrame object to wordpunct_tokenize, which only accepts a string or bytes-like object.
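The failure can be reproduced in isolation. The following is a minimal sketch, assuming a small hypothetical Series named `lines` in place of the question's data: passing the whole Series raises the same TypeError, while passing a single string element works.

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Hypothetical stand-in for the question's 'lines' column.
lines = pd.Series(["The Project Gutenberg EBook of Dracula", "\n"])

# Passing the Series itself fails: the tokenizer hands its argument
# straight to a compiled regex, which only accepts str or bytes.
try:
    wordpunct_tokenize(lines)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Passing one string element works as expected.
tokens = wordpunct_tokenize(lines.iloc[0])
print(tokens)
```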

You need to iterate over the rows and pass each line to wordpunct_tokenize one at a time.

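from nltk.tokenize import wordpunct_tokenize

# Tokenize each row's text separately and collect the filtered tokens.
list_of_words = []
for line in data['lines']:
    list_of_words.extend([i.lower() for i in wordpunct_tokenize(line) if i.lower() not in stop_words and i.isalpha()])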
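If the goal is one continuous piece of text, as the question describes, another option is to join the column into a single string first and tokenize once. This is only a sketch under stated assumptions: the sample DataFrame and the tiny `stop_words` set are hypothetical stand-ins (the question does not show how `stop_words` was built), and missing values are replaced with empty strings so that every element is a real string.

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample data mirroring the layout in the question.
data = pd.DataFrame({
    "book": ["dracula", "dracula", "dracula"],
    "lines": ["The Project Gutenberg EBook of Dracula", "\n", None],
})

# Illustrative stop-word set (an assumption, not the question's list).
stop_words = {"the", "of"}

# Replace missing values, coerce to str, and join into one text.
text = " ".join(data["lines"].fillna("").astype(str))

# Tokenize once, keeping lowercase alphabetic non-stop-word tokens.
list_of_words = [
    tok.lower()
    for tok in wordpunct_tokenize(text)
    if tok.isalpha() and tok.lower() not in stop_words
]
print(list_of_words)  # ['project', 'gutenberg', 'ebook', 'dracula']
```

Joining first also yields a plain string that can be passed to other NLTK functions expecting a single text, whereas the row-by-row loop only builds the token list.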

Hope this helps.

