python - TfidfVectorizer - TypeError: expected string or bytes-like object
Problem description
I'm trying to fit a TfidfVectorizer object to a list of video-game reviews, but for some reason I'm getting an error message.
Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features = 50000, use_idf = True, ngram_range=(1,3),
preprocessor = data_preprocessor.preprocess_tokenized_review)
print(train_set_x[0])
%time tfidf_matrix = tfidf_vectorizer.fit_transform(train_set_x)
Here is the error message:
I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<timed exec> in <module>()
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1379 Tf-idf-weighted document-term matrix.
1380 """
-> 1381 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1382 self._tfidf.fit(X)
1383 # X is already a transformed view of raw_documents so
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
790 for doc in raw_documents:
791 feature_counter = {}
--> 792 for feature in analyze(doc):
793 try:
794 feature_idx = vocabulary[feature]
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
264
265 return lambda doc: self._word_ngrams(
--> 266 tokenize(preprocess(self.decode(doc))), stop_words)
267
268 else:
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
239 return self.tokenizer
240 token_pattern = re.compile(self.token_pattern)
--> 241 return lambda doc: token_pattern.findall(doc)
242
243 def get_stop_words(self):
TypeError: expected string or bytes-like object
Note that the first part of the output is one of the reviews from my video-game dataset. If anyone knows what's going on, I'd greatly appreciate it. Thanks in advance!
Solution
I believe the problem is caused by the data_preprocessor.preprocess_tokenized_review function (which you haven't shared).
Proof (using the default preprocessor=None):
In [19]: from sklearn.feature_extraction.text import TfidfVectorizer
In [20]: X = ["I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access."]
In [21]: tfidf_vectorizer = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1,3))
In [22]: r = tfidf_vectorizer.fit_transform(X)
In [25]: r
Out[25]:
<1x84 sparse matrix of type '<class 'numpy.float64'>'
with 84 stored elements in Compressed Sparse Row format>
So it works just fine when we don't pass any value for the preprocessor parameter.
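Since the traceback ends in token_pattern.findall(doc), the most likely culprit is that your preprocessor returns something other than a string (the name preprocess_tokenized_review suggests it may return a list of tokens). TfidfVectorizer expects the preprocessor to return a single string, which is then passed to the tokenizer. A minimal sketch reproducing the error with a hypothetical preprocessor (bad_preprocess and good_preprocess are illustrative, not your actual function):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the multiplayer is solid and pretty fun"]

# Hypothetical broken preprocessor: returns a list of tokens, not a string.
# The analyzer then calls token_pattern.findall() on a list -> TypeError.
def bad_preprocess(doc):
    return doc.lower().split()  # list, not str

# Correct version: a preprocessor must return a single string.
def good_preprocess(doc):
    return " ".join(doc.lower().split())

try:
    TfidfVectorizer(preprocessor=bad_preprocess).fit_transform(docs)
except TypeError as e:
    print("TypeError:", e)  # expected string or bytes-like object

# With a string-returning preprocessor, fitting succeeds.
X = TfidfVectorizer(preprocessor=good_preprocess).fit_transform(docs)
print(X.shape)
```

If your function really does return a token list, either join the tokens back into a string before returning, or pass it as tokenizer= instead of preprocessor= (the tokenizer is the hook that is allowed to return a list).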