python - AttributeError: lower not found; removing infrequent features from Sklearn CountVectorizer?
Question
Building the corpus and the vocabulary
K = 10
XYtr['description'] = XYtr['description'].fillna("nan")
Xte['description'] = Xte['description'].fillna("nan")
corpus = list(XYtr['description'])+list(Xte['description'])
vectorizer = CountVectorizer()
corpus = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components = K)
lda.fit(corpus)
#There are no problems until here
# Create a list of (term, frequency) tuples sorted by their frequency
sum_words = corpus.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1])
# Keep only the terms in a list
vocabulary, _ = zip(*words_freq[:int(total_features * 0.2)])
vocabulary = list(vocabulary)
#Finally, we use the vocabulary to limit the model to the less frequent terms.
bottom_vect = CountVectorizer(vocabulary=vocabulary)
topics = bottom_vect.fit_transform(corpus)
The last line of the code raises "AttributeError: lower not found", so I cannot get `topics`.
Any suggestions would be greatly appreciated.
Here are a few rows of my dataset:
XYtr:
Xte:
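Solution
You get that error because you overwrite corpus with the output of CountVectorizer(), so the last line passes a sparse count matrix, not raw text, to fit_transform. Using an example: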
corpus = ['This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']
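To make the cause concrete, here is a minimal sketch that reproduces the failure on a toy input. The names `docs`, `matrix`, and `error_message` are illustrative and not from the question, and the exact wording of the exception message may differ between scipy versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for this reproduction
docs = ['This is the first document.', 'And this is the third one.']

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)  # a scipy sparse matrix, no longer text

error_message = None
try:
    # A second vectorizer expects raw strings and lowercases each document,
    # so handing it rows of a sparse matrix fails on the .lower() call.
    CountVectorizer(vocabulary=['first']).fit_transform(matrix)
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

This is the same situation as in the question, where `corpus` has been rebound to the matrix before the second `fit_transform` call.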
Assign the result of CountVectorizer() to a different object, X:
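from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # corpus still holds the raw strings
lda = LatentDirichletAllocation(n_components=2)
lda.fit(X)
# Sum each column to get the total count of every term
sum_words = X.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
# Sort ascending so the least frequent terms come first
words_freq = sorted(words_freq, key=lambda x: x[1])
total_features = len(words_freq)
# Keep the 20% least frequent terms
vocabulary, _ = zip(*words_freq[:int(total_features * 0.2)])
vocabulary = list(vocabulary)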
Then rerun your CountVectorizer on the raw text:
bottom_vect = CountVectorizer(vocabulary=vocabulary)
topics = bottom_vect.fit_transform(corpus)
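As a cross-check, the following self-contained sketch runs the same steps end to end and also shows an alternative that is not part of the answer above: since the count matrix already exists, the rare-term columns can be sliced out of it directly instead of vectorizing the text a second time. The names `docs`, `term_counts`, `order`, `rare_terms`, and `sliced` are assumptions made for illustration, and `max(1, ...)` is an added guard so that a tiny vocabulary still keeps at least one term.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # docs keeps the raw strings

# Terms in column order, and the total count of each column
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
term_counts = np.asarray(X.sum(axis=0)).ravel()

# Column indices of the 20% least frequent terms (at least one)
n_keep = max(1, int(len(terms) * 0.2))
order = np.argsort(term_counts, kind="stable")[:n_keep]
rare_terms = [terms[i] for i in order]

# Option 1: re-vectorize the raw text with the restricted vocabulary
bottom_vect = CountVectorizer(vocabulary=rare_terms)
topics = bottom_vect.fit_transform(docs)

# Option 2: reuse the existing matrix and keep only the rare-term columns
sliced = X[:, order]

print(rare_terms, topics.shape, sliced.shape)
```

Both options should give the same counts as long as the two vectorizers share the same tokenization settings; slicing simply avoids a second pass over the text.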