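python - Loading a pickle raises NotFittedError: CountVectorizer - Vocabulary wasn't fitted
Question
I am trying to classify spam using scikit-learn. After dumping the vectorizer and the classifier into their own .pkl files and loading them in temp.py to make predictions, I get this error: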
raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: CountVectorizer - Vocabulary wasn't fitted
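I build a model and save it under the names (my_model.pkl) and (vectorizer.pkl), then restart my kernel. But when I load the saved model (sample.pkl) to predict on a sample text, I get the same "Vocabulary wasn't fitted" error.
app.py: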
import pickle
import pandas as pd
df = pd.read_csv('spam.csv', encoding="latin-1")
#Drop the columns not needed
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
#Create a new column label which has the same values as v1 then set the ham and spam values to 0 and 1 which is the standard format for our prediction
df['label'] = df['v1'].map({'ham': 0, 'spam': 1})
#Create a new column having the same values as v2 column
df['message'] = df['v2']
#Now drop the v1 and v2
df.drop(['v1', 'v2'], axis=1, inplace=True)
#print(df.head(10))
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer().fit_transform(df['message'])
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
#Split the data
X_train, X_test, y_train, y_test = train_test_split(bow_transformer, df['label'], test_size=0.33, random_state=42)
#Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
pickle.dump(bow_transformer, open("vector.pkl", "wb"))
pickle.dump(clf, open("my_model.pkl", "wb"))
temp.py (this is the file where I make predictions):
import pickle
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
vectorizer = pickle.load(open("my_model.pkl", "rb"))
selector = pickle.load(open("vector.pkl", "rb"))
test_set=["heloo how are u"]
new_test=cv.transform(test_set)
Solution
In your app.py you are pickling the document-term matrix rather than the vectorizer:
pickle.dump(bow_transformer, open("vector.pkl", "wb"))
where bow_transformer is
bow_transformer = CountVectorizer().fit_transform(df['message'])
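So when you unpickle it in temp.py, all you have is the document-term matrix. The correct way to pickle the vectorizer is to fit it first and keep the fitted object: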
bow_transformer = CountVectorizer().fit(df['message'])
bow_transformer_dtm = bow_transformer.transform(df['message'])
Now you can pickle your bow_transformer using
pickle.dump(bow_transformer, open("vector.pkl", "wb"))
The pickled object will now be a transformer rather than a document-term matrix.
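To make the difference concrete, here is a minimal sketch comparing what fit_transform and fit return. The toy messages list is a hypothetical stand-in for df['message'] and is not part of the original question.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages standing in for df['message'].
messages = ["free prize now", "see you at lunch", "win a free prize"]

# fit_transform returns the document-term matrix (a SciPy sparse matrix).
dtm = CountVectorizer().fit_transform(messages)

# fit returns the fitted vectorizer itself, which keeps the learned vocabulary.
vectorizer = CountVectorizer().fit(messages)

print(type(dtm).__name__, issparse(dtm))
print(hasattr(dtm, "transform"))          # the matrix cannot vectorize new text
print(type(vectorizer).__name__)
print(sorted(vectorizer.vocabulary_))     # the vocabulary lives on the vectorizer
```

Only the second object can turn new text into features later, so that is the one worth pickling.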
In your temp.py you can unpickle it and use it as shown below:
selector = pickle.load(open("vector.pkl", "rb"))
test_set=["heloo how are u"]
new_test=selector.transform(test_set)
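Putting both sides together, the round trip might look like the sketch below. The toy messages and labels are hypothetical stand-ins for spam.csv, and the files are written to a temporary directory instead of the working directory. Note that the prediction side reuses the loaded vectorizer. Calling transform on a freshly created CountVectorizer(), as the cv object in the question's temp.py does, is what raises NotFittedError.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for spam.csv (0 = ham, 1 = spam).
messages = [
    "win a free prize now",
    "free cash prize, call now",
    "are we still on for lunch",
    "see you at the meeting",
]
labels = [1, 1, 0, 0]

# Training side (app.py): fit the vectorizer, then train on its output.
vectorizer = CountVectorizer().fit(messages)
clf = MultinomialNB().fit(vectorizer.transform(messages), labels)

# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "vector.pkl")
    clf_path = os.path.join(tmp, "my_model.pkl")
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(clf_path, "wb") as f:
        pickle.dump(clf, f)

    # Prediction side (temp.py): load both objects, create no new vectorizer.
    with open(vec_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)
    with open(clf_path, "rb") as f:
        loaded_clf = pickle.load(f)

# Vectorize the new text with the fitted vocabulary, then classify it.
new_test = loaded_vectorizer.transform(["heloo how are u"])
prediction = loaded_clf.predict(new_test)
print(new_test.shape, prediction)
```

The with blocks close each file handle promptly, which the bare open(...) calls in the original snippets do not guarantee.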
Hope this helps!