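python - Unable to understand joblib.load
Problem description
I have a sklearn pipeline that I saved with joblib.
When I load it again, it even executes the print statements from the original file. Is it retraining the model?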
# classifier code
# File name: classifier.py
import joblib
import pyprind
import pandas as pd
import os
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import re
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

basepath = 'aclImdb'
labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        print(path)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]],
                           ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

nltk.download('stopwords')
stop = stopwords.words('english')
count = CountVectorizer()

def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

df['review'] = df['review'].apply(preprocessor)
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        tokenizer=tokenizer
                        )
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(penalty='l2'))])
lr_tfidf.fit(X_train, y_train)
print(lr_tfidf.score(X_test, y_test))
joblib.dump(lr_tfidf, 'LRClassifier.pkl')
print("DONE")
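Now, when I load this LRClassifier.pkl, it loads the dataset, prints the score, and then prints the "DONE" message. Is it retraining and then running all the steps again?
The code is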
from classifier import tokenizer
import joblib
clf2 = joblib.load('LRClassifier.pkl')
The output of this code is
(deepTest) ahmad@ahmad:~/Desktop/ML Website$ python app.py
aclImdb/test/pos
0% [####### ] 100% | ETA: 00:00:33aclImdb/test/neg
0% [############### ] 100% | ETA: 00:00:31aclImdb/train/pos
0% [###################### ] 100% | ETA: 00:00:19aclImdb/train/neg
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:28
[nltk_data] Downloading package stopwords to /home/ahmad/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
0.88116
DONE
Solution
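joblib.load itself does not retrain anything. The output most likely comes from the line `from classifier import tokenizer`: importing a module runs all of its top-level code, so classifier.py reads the dataset, fits the pipeline, prints the score and "DONE", and only then does app.py reach joblib.load. The import is still needed, because pickle stores a function such as tokenizer by reference rather than by value. The sketch below reproduces the effect on a small scale and shows the usual fix, an `if __name__ == "__main__":` guard. The module names unguarded_demo and guarded_demo, the helper import_and_capture, and the "training..." message are hypothetical stand-ins for classifier.py, not part of the original code.

```python
import contextlib
import importlib
import io
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for classifier.py as posted: the "training" step
# (here just a print) sits at module top level.
UNGUARDED = textwrap.dedent('''
    def tokenizer(text):
        return text.split()

    print("training...")  # top level: runs on every import
''')

# Same module with the training step moved behind a __main__ guard.
GUARDED = textwrap.dedent('''
    def tokenizer(text):
        return text.split()

    def main():
        print("training...")

    if __name__ == "__main__":  # true only when run as a script
        main()
''')


def import_and_capture(module_name, source, directory):
    """Write source to <directory>/<module_name>.py, import it,
    and return the module plus whatever it printed during import."""
    Path(directory, module_name + ".py").write_text(source)
    importlib.invalidate_caches()
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        module = importlib.import_module(module_name)
    return module, buffer.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        unguarded, unguarded_out = import_and_capture(
            "unguarded_demo", UNGUARDED, tmp)
        guarded, guarded_out = import_and_capture(
            "guarded_demo", GUARDED, tmp)
    finally:
        sys.path.remove(tmp)

print(repr(unguarded_out))          # 'training...\n' (import ran the print)
print(repr(guarded_out))            # '' (nothing ran on import)
print(guarded.tokenizer("a b c"))   # ['a', 'b', 'c'] (function still usable)
```

Applied to classifier.py, this would mean keeping the imports and the preprocessor and tokenizer definitions at module level and moving everything else, from building df through print("DONE"), into a main() function called under the guard. Running python classifier.py then trains and saves the model as before, while `from classifier import tokenizer` in app.py only defines the functions.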