What causes the execution-time difference between a pickled and an unpickled transformer?

Problem description

I trained a dimensionality-reduction model in scikit-learn. It applies PCA to term frequencies extracted from text. After training, running the model takes about 1.7 seconds. When I now pickle the model with joblib or dill and then unpickle it in the same Python shell, the execution time rises to about 6 seconds.

I profiled with %prun. This is the normal model:

2430 function calls (2410 primitive calls) in 1.708 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    1.685    0.842    1.685    0.842 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.010    0.010    1.694    1.694 compressed.py:464(_mul_multivector)
        1    0.001    0.001    0.001    0.001 {built-in method scipy.sparse._sparsetools.csr_matmat_pass2}
       54    0.001    0.000    0.001    0.000 numeric.py:424(asarray)

And this is the pickled/unpickled one:

2428 function calls (2408 primitive calls) in 5.806 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    5.786    2.893    5.786    2.893 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.009    0.009    5.796    5.796 compressed.py:464(_mul_multivector)
        1    0.001    0.001    0.001    0.001 {built-in method scipy.sparse._sparsetools.csr_matmat_pass2}
        1    0.001    0.001    0.001    0.001 {built-in method scipy.sparse._sparsetools.csr_matmat_pass1}
        1    0.001    0.001    5.804    5.804 pipeline.py:752(transform)

So it looks like numpy.ndarray's ravel method is consuming most of the extra time. Also, there is no difference in execution time when reducing only 1 or 10 samples. What could cause this?
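One general numpy fact worth knowing here (not a diagnosis of this specific model): `ravel` is essentially free on a contiguous array, but silently degenerates into a full copy of the buffer when the array's memory layout is not contiguous, which is a plausible place for a timing gap to hide:

```python
import numpy as np

# ravel() returns a view when the array is already C-contiguous,
# but must allocate and copy the entire buffer when it is not.
a = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

view = a.ravel()     # C-contiguous input: no copy, just a view
copy = a.T.ravel()   # transposed (non-contiguous) input: full copy

print(np.shares_memory(view, a))  # True  -> no data copied
print(np.shares_memory(copy, a))  # False -> ravel copied the data
```

If unpickling changed the layout of the model's internal arrays, the same `ravel` call could go from the cheap path to the expensive one.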

Update: I've added a simple reproducible example below.

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from time import time
import joblib  # sklearn.externals.joblib is deprecated; use the joblib package directly

dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

tfidf_vectorizer_subl = TfidfVectorizer(max_df=0.95, min_df=2,
                                        max_features=None,
                                        strip_accents='unicode',
                                        ngram_range=(1, 2),
                                        sublinear_tf=True)
tfidf_subl = tfidf_vectorizer_subl.fit_transform(dataset.data[:5000])

n_components = 1000
svd = TruncatedSVD(n_components)
svd.fit(tfidf_subl)

t0 = time()
X_lsa = svd.transform(tfidf_subl)
print("done in %fs" % (time() - t0)) #done in 2.240295s

with open('test', 'wb') as file:
    joblib.dump(svd, file)  

with open('test' ,'rb') as f:
    svd_unpickled = joblib.load(f)

t0 = time()
X_lsa = svd_unpickled.transform(tfidf_subl)
print("done in %fs" % (time() - t0)) #done in 3.551007s WHY DOES THIS TAKE LONGER

Tags: python, scikit-learn, pickle, joblib, dill

Solution
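The source does not include an accepted answer. One commonly cited explanation for this kind of slowdown (an assumption here, not confirmed by the question) is that arrays restored by pickle/joblib can come back with a different memory layout or alignment than the freshly fitted ones, pushing operations like `ravel` off the fast path. A minimal diagnostic/workaround sketch, using a hypothetical `realign` helper applied to the question's `svd_unpickled`:

```python
import numpy as np

def realign(estimator, attrs=("components_",)):
    """Replace the listed fitted arrays with aligned, C-contiguous
    versions; np.require copies only when an array fails a check."""
    for name in attrs:
        arr = getattr(estimator, name, None)
        if isinstance(arr, np.ndarray):
            setattr(estimator, name,
                    np.require(arr, requirements=["C", "ALIGNED"]))
    return estimator
```

If this is the cause, `realign(svd_unpickled)` before the second `transform` call should bring the timing back in line; comparing `svd.components_.flags` against `svd_unpickled.components_.flags` would confirm whether the layouts actually differ.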

