ValueError: Found input variables with inconsistent numbers of samples: [2, 515738]

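Problem description

First of all, this is my first attempt at this. Second, I'm not sure whether I'm posting this in the right forum; apologies if not.

I am trying to use Naive Bayes on my data; the dataset was linked in the original post.

Here is my code so far: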

import pandas as pd
import scipy.sparse
from nltk.corpus import stopwords
from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

data = pd.read_json('/Users/rokayadarai/Desktop/Coding/DataSets/Hotel_Reviews.json')
data.head()

# stop words are not useful (a, and, the)
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)

y = data['Reviewer_Score']
X = scipy.sparse.hstack([vectorizer.fit_transform(data['Negative_Review']),
                        vectorizer.fit_transform(data['Positive_Review'])]
                       )

# 515738 observations and 106514 unique words
print(y.shape)
print(X.shape)

# split the data: test_size=0.2 holds out 20% for testing; random_state=123 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# train naive bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

When I try to run it, I get this error:

ValueError                                Traceback (most recent call last)
~/traintestfile.py in <module>
     33 #train naive bayes classifier
     34 clf = naive_bayes.MultinomialNB()
---> 35 clf.fit(X_train, y_train)
     36 
     37 

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    618 
    619         labelbin = LabelBinarizer()
--> 620         Y = labelbin.fit_transform(y)
    621         self.classes_ = labelbin.classes_
    622         if Y.shape[1] == 1:

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_label.py in fit_transform(self, y)
    458             Shape will be [n_samples, 1] for binary problems.
    459         """
--> 460         return self.fit(y).transform(y)
    461 
    462     def transform(self, y):

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_label.py in fit(self, y)
    435 
    436         self.sparse_input_ = sp.issparse(y)
--> 437         self.classes_ = unique_labels(y)
    438         return self
    439 

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in unique_labels(*ys)
     95     _unique_labels = _FN_UNIQUE_LABELS.get(label_type, None)
     96     if not _unique_labels:
---> 97         raise ValueError("Unknown label type: %s" % repr(ys))
     98 
     99     ys_labels = set(chain.from_iterable(_unique_labels(y) for y in ys))

ValueError: Unknown label type: (array([ 7.5,  9.2,  9.2, ...,  5.8, 10. ,  9.6]),)

Can anyone help me? I'm stuck. I know I'm doing something wrong, but I don't know what, and I can't seem to find anything online that helps.

Tags: python, machine-learning, scikit-learn

Solution


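You can only vectorize one column at a time, and there appears to be a typo in vectorizer.fit_transform(['Negative_Review', 'Positive_Review']): it passes a list of two column names instead of columns taken from the data frame. The vectorizer treats those two strings as two documents, so X ends up with 2 rows while y has 515738, which is the [2, 515738] in the error message.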
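
To make the row-count mismatch concrete, here is a minimal sketch on a made-up three-row frame. The toy data and the names toy, wrong and right are assumptions for illustration, not part of the hotel dataset. It only shows that a list of column names yields 2 rows regardless of how many rows the frame has.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the hotel reviews (made-up data, for illustration only).
toy = pd.DataFrame({
    "Negative_Review": ["room was small", "noisy street", "cold breakfast"],
    "Positive_Review": ["friendly staff", "great location", "clean room"],
    "Reviewer_Score": [7.5, 9.2, 5.8],
})

vectorizer = TfidfVectorizer()

# Typo: a list of column *names* is just two short strings, i.e. two documents.
wrong = vectorizer.fit_transform(["Negative_Review", "Positive_Review"])

# Intended: the column itself, one document per row of the frame.
right = vectorizer.fit_transform(toy["Negative_Review"])

print(wrong.shape[0], right.shape[0], len(toy))  # prints: 2 3 3
```

Only the second matrix has one row per review, so only it lines up with the score column.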

The following should work; the vectorizer is applied to the 2 columns separately and the results are then concatenated:

from nltk.corpus import stopwords
import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

data = pd.read_csv('Hotel_Reviews.csv.zip')
stopset = set(stopwords.words('english'))
# pass a list: recent scikit-learn releases reject a set for stop_words
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))

# the score column is the target
y = data["Reviewer_Score"]
# vectorize each text column separately, then stack the two matrices side by side
x = scipy.sparse.hstack([vectorizer.fit_transform(data['Negative_Review']),
                        vectorizer.fit_transform(data['Positive_Review'])]
                       )

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)
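
The traceback in the question ends in "Unknown label type", which scikit-learn raises when a classifier is given continuous targets such as 7.5 or 9.2. One possible way past that, sketched here on made-up data, is to turn the score into classes before fitting. The 8.0 threshold, the toy reviews, and the names neg_vec, pos_vec and y_class are all assumptions for illustration. The sketch also uses one vectorizer per column, so that each keeps the vocabulary it was fitted on and can transform new text later.

```python
import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the hotel reviews (illustration only).
toy = pd.DataFrame({
    "Negative_Review": ["small room", "noisy street", "cold breakfast",
                        "nothing", "slow wifi", "nothing"],
    "Positive_Review": ["friendly staff", "nothing", "nothing",
                        "great location", "good bar", "clean room"],
    "Reviewer_Score": [7.5, 5.8, 4.6, 9.6, 8.3, 10.0],
})

# One vectorizer per column, so each keeps the vocabulary it was fitted on.
neg_vec = TfidfVectorizer()
pos_vec = TfidfVectorizer()
X = scipy.sparse.hstack([
    neg_vec.fit_transform(toy["Negative_Review"]),
    pos_vec.fit_transform(toy["Positive_Review"]),
]).tocsr()

# Assumed rule: scores of 8.0 or more become class 1, the rest class 0.
y_class = (toy["Reviewer_Score"] >= 8.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_class, test_size=0.5, random_state=123)

clf = MultinomialNB()
clf.fit(X_train, y_train)  # integer classes instead of continuous scores
predictions = clf.predict(X_test)
print(predictions.shape)
```

If the score needs to stay numeric rather than be binned, a regression model would be the alternative to a Naive Bayes classifier.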
