How to fix "ValueError: The provided data has 1 dimensions while the model was trained with feature size 17721" when predicting with an sklearn LDA model

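Problem description

I trained an LDA model on a large dataset (about 16k different articles) and saved it with pickle for later use. When I try to use it on other data (I picked one article and want to see which topic it is assigned), I get the error "ValueError: The provided data has 1 dimensions while the model was trained with feature size 17721."

This is the code I used to train the model and save it.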

import string
from nltk.stem import PorterStemmer

# all_article_text, stop_words and all_documents are assumed to be defined earlier in the script.
prepared_data = []
prepared_string = ""
for article_text in all_article_text:
    paragraph = article_text.get('text')
    if paragraph is not None and len(paragraph) > 20:
        split_words = paragraph.split()
        # Strip punctuation, lowercase, and keep purely alphabetic tokens
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in split_words]
        lower_words = [word.lower() for word in stripped]
        words_with_stops_words = [word for word in lower_words if word.isalpha()]
        # Drop stop words, then stem what is left
        words = [w1 for w1 in words_with_stops_words if w1 not in stop_words]
        stemmed_words = [PorterStemmer().stem(word) for word in words]
        for i in stemmed_words:
            if len(i) > 2:
                prepared_data.append(i)
                prepared_string += i + " "

    # all_documents.append(prepared_data)
    if len(prepared_string) > 20:
        all_documents.append(prepared_string)

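import pickle
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_df=1
                             )

data_vectorized = vectorizer.fit_transform(all_documents)

# n_topics is the parameter name in older scikit-learn releases; later releases call it n_components
model = LatentDirichletAllocation(n_topics=10,               # Number of topics
                                  max_iter=3000,             # Max learning iterations
                                  learning_method='online',
                                  random_state=100,          # Random state
                                  batch_size=128,            # n docs in each learning iter
                                  evaluate_every=-1,         # compute perplexity every n iters, default: Don't
                                  learning_decay=.5
                                  )

model.fit(data_vectorized)

lda_output = model.transform(data_vectorized)

# Only the LDA model is saved here; the fitted vectorizer is not
filename = 'LDA_model3000_iter.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)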

## Loading model
file_name = 'LDA_model3000_iter.sav'
loaded_model = pickle.load(open(file_name,'rb'))

vectorizer = CountVectorizer(analyzer='word',
                             min_df=1,
                             vocabulary=all_documents
                             )

vectorizer._validate_vocabulary()

new_vect = vectorizer.transform(all_documents)

loaded_model.transform(new_vect)

But what I get is "ValueError: The provided data has 1 dimensions while the model was trained with feature size 17721."

Tags: python, scikit-learn, prediction, lda

Solution
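
The message says that the matrix passed to `loaded_model.transform` has 1 column, while the model was fitted on 17721 columns (one per vocabulary term). A likely cause is the loading code: `CountVectorizer(vocabulary=...)` expects the vocabulary terms, not documents, so passing a list that holds a single article string yields a one-entry vocabulary and therefore a one-column matrix. The usual way around this is to keep the vectorizer that was fitted at training time and call only `transform` on new text. The sketch below illustrates that idea under some assumptions: the toy corpus, the `lda_bundle.pkl` file name, and the dictionary layout of the pickle are all hypothetical, and `n_components` is the parameter name in current scikit-learn releases (the question uses the older `n_topics`).

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed training articles (hypothetical data).
train_docs = [
    "stock market price trade investor",
    "football match goal player team",
    "market investor bank price share",
    "team player coach match season",
]

# Training time: fit the vectorizer and the LDA model on the same matrix.
vectorizer = CountVectorizer(analyzer="word", min_df=1)
train_counts = vectorizer.fit_transform(train_docs)
lda = LatentDirichletAllocation(
    n_components=2, learning_method="online", random_state=100
)
lda.fit(train_counts)

# Persist BOTH objects: the vectorizer holds the vocabulary that the
# model's feature columns refer to.
model_path = Path(tempfile.mkdtemp()) / "lda_bundle.pkl"  # hypothetical file name
with open(model_path, "wb") as fh:
    pickle.dump({"vectorizer": vectorizer, "lda": lda}, fh)

# Prediction time: load the bundle and call transform (not fit_transform)
# so the new text is mapped onto the training vocabulary.
with open(model_path, "rb") as fh:
    bundle = pickle.load(fh)

new_docs = ["investor trade share price"]
new_counts = bundle["vectorizer"].transform(new_docs)
topic_dist = bundle["lda"].transform(new_counts)

print(new_counts.shape[1] == train_counts.shape[1])  # same number of features
print(topic_dist.shape)  # one row per new document, one column per topic
```

The new article should go through the same cleaning steps (punctuation stripping, stop-word removal, stemming) as the training data before it is vectorized. If only the model pickle exists and the fitted vectorizer was not saved, refitting a `CountVectorizer` with the same settings on the original training documents should reproduce the same vocabulary, provided the documents and preprocessing are unchanged.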

