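python - How do I use sklearn's CountVectorizer?
Question
I tried to use sklearn's CountVectorizer, but it does not work. I am using a corpus with 2 sample lines (I plan to import Wikipedia data later, but for now I just need to get the system working):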
__label__1 Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a "worst book" contest.
__label__2 Glorious story: I loved Whisper of the wicked saints. The story was amazing and I was pleasantly surprised at the changes in the book. I am not normaly someone who is into romance novels, but the world was raving about this book and so I bought it. I loved it !
Here is my code for creating the word-level vectorizer:
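import pandas
from sklearn import model_selection, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# load the dataset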
data = open('corpus.txt').read()
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(content[1:])
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
I get the following error:
AttributeError: 'list' object has no attribute 'lower'
Even if I convert trainDF['text'] to a string, I get another error:
ValueError: Iterable over raw text documents expected, string object received.
What should I do?
Answer
Set lowercase=False as an argument to CountVectorizer().
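Both errors in the question seem to come from the shape of the input rather than from the vectorizer settings. CountVectorizer expects an iterable of strings, one per document. texts.append(content[1:]) stores each document as a list of tokens, which explains the 'list' object has no attribute 'lower' error, and converting the whole column to a single string gives it one string instead of an iterable of documents. Note that lowercase=False only skips the .lower() call; as far as I can tell, the default word analyzer still applies token_pattern to each document, so list-valued documents are likely to fail at that step too. Below is a minimal sketch of two ways around this, assuming two shortened sample lines held in memory instead of corpus.txt. The names raw_lines and identity_analyzer are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two shortened sample lines in the same "__label__N text" format as the corpus
# (an in-memory stand-in for corpus.txt, assumed for this sketch).
raw_lines = [
    "__label__1 Buyer beware: This is a self-published book",
    "__label__2 Glorious story: I loved Whisper of the wicked saints",
]

# Option 1: keep every document as a single string.
labels, texts = [], []
for line in raw_lines:
    # Split off only the label; the rest of the line stays one string.
    label, text = line.split(maxsplit=1)
    labels.append(label)
    texts.append(text)

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
counts = count_vect.fit_transform(texts)  # texts is a list of str

# Option 2: keep the token lists, but hand CountVectorizer a callable analyzer
# so it skips its own preprocessing and tokenization.
def identity_analyzer(tokens):
    """Return the already-tokenized document unchanged (hypothetical helper)."""
    return tokens

token_lists = [text.split() for text in texts]
pretokenized_vect = CountVectorizer(analyzer=identity_analyzer)
pretokenized_counts = pretokenized_vect.fit_transform(token_lists)
```

With option 1 the default lowercasing and token_pattern still apply, so the vocabulary holds entries such as beware and glorious. With option 2 the tokens are used exactly as given, so case and punctuation are kept (for example beware: with its trailing colon), and any cleanup has to happen before vectorizing.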