How do I use sklearn's CountVectorizer?

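Problem description

I tried to use sklearn's CountVectorizer, but it did not work. I am using a corpus with 2 sample lines (I plan to import Wikipedia data later, but for now I need to get the pipeline working):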

__label__1 Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a "worst book" contest.
__label__2 Glorious story: I loved Whisper of the wicked saints. The story was amazing and I was pleasantly surprised at the changes in the book. I am not normaly someone who is into romance novels, but the world was raving about this book and so I bought it. I loved it !

Here is my code to create the word-level count vectorizer:

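import pandas
from sklearn import model_selection, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# load the dataset
data = open('corpus.txt').read()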
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(content[1:])

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)

I get the following error:

AttributeError: 'list' object has no attribute 'lower'

Even if I convert trainDF['text'] to a string, I get a different error:

ValueError: Iterable over raw text documents expected, string object received.

What should I do?

Tags: python, scikit-learn, countvectorizer

Solution


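Pass lowercase=False as an argument to CountVectorizer(). This skips the .lower() call that raises the AttributeError.

The underlying cause is that texts.append(content[1:]) stores each document as a list of tokens, while CountVectorizer expects each document to be a single string. The default tokenizer will therefore likely still reject the lists unless the tokens are joined back into one string, for example with ' '.join(content[1:]).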
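
A minimal sketch of one way to address the root cause, assuming the corpus lines follow the "__label__N text" layout shown in the question. The sample lines are shortened, and the names raw_lines and x_count are illustrative rather than taken from the original code. Each document is kept as a single string, so the default lowercasing and tokenizing steps receive the type they expect.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two shortened sample lines in the same "__label__N text" layout as the corpus.
raw_lines = [
    "__label__1 Buyer beware: This is a self-published book",
    "__label__2 Glorious story: I loved Whisper of the wicked saints",
]

labels, texts = [], []
for line in raw_lines:
    # Split only once: the label comes off the front, the rest stays one string.
    label, text = line.split(maxsplit=1)
    labels.append(label)
    texts.append(text)

# The default lowercase=True is fine here because every document is a str.
count_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
x_count = count_vect.fit_transform(texts)

print(x_count.shape)
print(sorted(count_vect.vocabulary_)[:5])
```

With string documents, the rest of the pipeline in the question (train_test_split, transform) should be usable unchanged.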

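The ValueError quoted in the question appears when fit or transform receives one plain string instead of an iterable of documents. The sketch below reproduces that case and then shows the accepted form. The sample documents and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Buyer beware: This is a self-published book",
    "Glorious story: I loved it",
]

count_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")

# Passing one plain string is rejected with a ValueError.
try:
    count_vect.fit(" ".join(docs))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# An iterable of strings is accepted, even when it holds a single document.
count_vect.fit(docs)
single_doc_counts = count_vect.transform(["a glorious book"])
print(single_doc_counts.toarray().sum())
```

To vectorize a single document, wrap it in a list rather than passing the bare string.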

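If the documents have to stay as token lists, an alternative is to pass a callable as the analyzer, which makes CountVectorizer use the tokens as given and skip its own preprocessing and tokenizing. This is a sketch under that assumption. The helper name identity_analyzer and the token lists are hypothetical, and no lowercasing takes place in this mode.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity_analyzer(tokens):
    """Return the pre-tokenized document unchanged (hypothetical helper)."""
    return tokens


# Each document is already a list of tokens, as produced by line.split()[1:].
tokenized_docs = [
    ["Buyer", "beware"],
    ["Glorious", "story", "Buyer"],
]

# A callable analyzer replaces the built-in preprocessing and tokenization.
token_vect = CountVectorizer(analyzer=identity_analyzer)
token_counts = token_vect.fit_transform(tokenized_docs)

print(token_counts.shape)
print(sorted(token_vect.vocabulary_))
```

Because the tokens are used verbatim, punctuation attached to a word (such as "beware:" in the original corpus) would remain part of the token unless it is cleaned beforehand.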