Unable to understand joblib.load

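Problem description

I have an sklearn pipeline that I saved with joblib.

When I load it again, it even executes the print statements from the original file. Is it retraining the model?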

# classifier code
# File name: classifier.py

import joblib
import pyprind
import pandas as pd
import os
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import re
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

basepath = 'aclImdb'
labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)
rows = []  # collect [review, sentiment] pairs; DataFrame.append was removed in pandas 2.0

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        print(path)

        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()

            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])
nltk.download('stopwords')
stop = stopwords.words('english')
count = CountVectorizer()

def preprocessor(text):
    # Strip HTML tags, then append any emoticons to the lowercased, cleaned text
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

df['review'] = df['review'].apply(preprocessor)
porter = PorterStemmer()
def tokenizer(text):
    return text.split()

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        tokenizer=tokenizer
                        )
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(penalty='l2'))])

lr_tfidf.fit(X_train, y_train)
print(lr_tfidf.score(X_test, y_test))
joblib.dump(lr_tfidf, 'LRClassifier.pkl')

print("DONE")

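Now, when I load this LRClassifier.pkl, it loads the dataset, prints the score, and then prints the "DONE" message. Is it retraining and then repeating all of the steps?

The loading code is: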

from classifier import tokenizer
import joblib

clf2 = joblib.load('LRClassifier.pkl')

The output of this code is:

(deepTest) ahmad@ahmad:~/Desktop/ML Website$ python app.py 
aclImdb/test/pos
0% [#######                       ] 100% | ETA: 00:00:33aclImdb/test/neg
0% [###############               ] 100% | ETA: 00:00:31aclImdb/train/pos
0% [######################        ] 100% | ETA: 00:00:19aclImdb/train/neg
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:28
[nltk_data] Downloading package stopwords to /home/ahmad/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
0.88116
DONE

Tags: python, scikit-learn, joblib

Solution
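joblib.load does not retrain anything by itself. The output shown in the question most likely comes from the line `from classifier import tokenizer`: importing a module executes every top-level statement in it, so the dataset loading, the `fit` call, the score print and the final "DONE" all run again before `joblib.load` is even reached. A common way around this is to keep only definitions such as `tokenizer` at module level and move the training steps behind an `if __name__ == "__main__":` guard.

The sketch below illustrates the difference under stated assumptions. The module names `clf_unguarded` and `clf_guarded`, the `main` function and the helper `import_and_capture` are hypothetical stand-ins, and the two print calls stand in for the real training code in classifier.py.

```python
import contextlib
import importlib
import io
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for classifier.py as posted: work at module level.
UNGUARDED = textwrap.dedent('''
    def tokenizer(text):
        return text.split()

    print("training...")   # stands in for loading data, fit, dump
    print("DONE")
''')

# Same module with the work moved behind a __main__ guard.
GUARDED = textwrap.dedent('''
    def tokenizer(text):
        return text.split()

    def main():
        print("training...")   # stands in for loading data, fit, dump
        print("DONE")

    if __name__ == "__main__":
        main()
''')


def import_and_capture(module_name, source):
    """Write source to a temporary module, import it, return (module, stdout)."""
    tmpdir = tempfile.mkdtemp()
    Path(tmpdir, module_name + ".py").write_text(source)
    sys.path.insert(0, tmpdir)
    importlib.invalidate_caches()
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            module = importlib.import_module(module_name)
    finally:
        sys.path.remove(tmpdir)
    return module, buf.getvalue()


unguarded_mod, unguarded_out = import_and_capture("clf_unguarded", UNGUARDED)
guarded_mod, guarded_out = import_and_capture("clf_guarded", GUARDED)

print(repr(unguarded_out))               # 'training...\nDONE\n'
print(repr(guarded_out))                 # ''
print(guarded_mod.tokenizer("a b c"))    # ['a', 'b', 'c']
```

With the guard in place, running `python classifier.py` would still train and dump the model, while `from classifier import tokenizer` in app.py would only define the function. The import itself is probably still needed: the pickle stores the custom tokenizer by reference rather than by value, so a function with that name has to be importable when `joblib.load` rebuilds the pipeline.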

