Fast way to compute cosine similarity on a large dataset without memory errors

Problem Description

I need to compute cosine similarity on a scipy.sparse.csr.csr_matrix. With the most straightforward sklearn implementation I run into memory errors for larger matrix shapes. Even for smaller shapes the performance is not great, and CPU load never really goes above 25%. So I would like to make it faster and more robust for larger datasets.

I found a very good resource on the speed issue. I removed all versions except the fastest one from the original post and added my naive sklearn implementation as "method 2". I can confirm that on my machine (32 GB RAM, Windows 10, Python 3.6.4), "method 1" takes only about 4% of the time that "method 2" needs on the dataset built in the code. Below is the adapted code from zbinsd:

# Code adapted from zbinsd @ https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1

# Imports (only the ones actually used below)
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print("Input data shape:", Asp.shape)

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 2:
        return cosine_similarity(A)  # works for both dense arrays and sparse matrices

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 3):
    if m in [2]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
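
To see what method 1 computes: with S = A·Aᵀ, entry S[i, j] is the dot product of rows i and j, and the diagonal holds the squared magnitudes, so scaling by the inverse magnitudes row-wise and then column-wise yields x·y / (|x||y|) for every pair. A tiny sanity check (illustrative values, not from the original post):

# Tiny check of calc_sim against sklearn on made-up data
B = np.array([[1.0, 0.0],
              [1.0, 1.0]])
np.testing.assert_allclose(calc_sim(B, method=1), cosine_similarity(B))
# Both yield [[1.0, 0.7071...], [0.7071..., 1.0]]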

I also found a good resource on the memory issue, but it turns out I had already applied icm's suggestion of only working with unique occurrences, so I don't know how to improve further in that direction.
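
For reference, a minimal sketch of that "unique occurrences" idea (variable names and strings are illustrative; np.unique with return_inverse gives the mapping from every original message back to its unique representative):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

all_msgs = ["disk full", "net down", "disk full"]   # illustrative data
unique_msgs, inverse = np.unique(all_msgs, return_inverse=True)

# Similarities between unique messages only (a 2x2 matrix here instead of 3x3)
sim_unique = cosine_similarity(CountVectorizer().fit_transform(unique_msgs))

# Similarity of any two original messages via the inverse mapping
assert np.isclose(sim_unique[inverse[0], inverse[2]], 1.0)  # both are "disk full"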

Moving on to my original data, which comes from sklearn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# log_tokenizer, unique_msgs and all_msgs come from my own preprocessing
TFvectorizer = CountVectorizer(lowercase=False, tokenizer=log_tokenizer, ngram_range=(1,1))
TF = TFvectorizer.fit_transform(unique_msgs)      # learn the vocabulary from the unique messages
all_msgs_vect = TFvectorizer.transform(all_msgs)  # vectorize all messages with that vocabulary
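
For clarity: fit_transform learns the vocabulary from the unique messages, while transform reuses that fixed vocabulary for everything else. A toy illustration with made-up strings and the default tokenizer:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(lowercase=False)
vec.fit_transform(["error disk full", "error net down"])  # learns the vocabulary
extra = vec.transform(["error disk down"])                # reuses it, no refit
print(sorted(vec.vocabulary_))  # ['disk', 'down', 'error', 'full', 'net']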

I have two questions:

Question #1: For a small sample of my original dataset, method 1 is faster than method 2, but neither method actually uses more than ~25% of the CPU resources. (Note: the labels in the session below are flipped relative to calc_sim above; here "Method 1" is sklearn's cosine_similarity and "Method 2" is zbinsd's manual approach.)

In [1]: type(all_msgs_vect)
Out[1]: scipy.sparse.csr.csr_matrix

In [2]: all_msgs_vect.shape
Out[2]: (5000, 529)


# Method 1 here = sklearn's cosine_similarity
In [3]: start = datetime.now()
   ...: print(datetime.now())
   ...: msg_CosSim = cosine_similarity(all_msgs_vect)
   ...: print('Method 1 took', datetime.now() - start)
2019-09-09 10:44:33.039660
Method 1 took 0:00:00.117537

# Method 2 here = zbinsd's manual approach
In [4]: start = datetime.now()
   ...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
   ...: square_mag = np.diag(similarity)
   ...: inv_square_mag = 1 / square_mag
   ...: inv_square_mag[np.isinf(inv_square_mag)] = 0
   ...: inv_mag = np.sqrt(inv_square_mag)
   ...: cosine = similarity * inv_mag
   ...: msg_CosSim2 = cosine.T * inv_mag
   ...: print('Method 2 took', datetime.now() - start)
Method 2 took 0:00:08.399767
__main__:4: RuntimeWarning: divide by zero encountered in true_divide
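
The divide-by-zero warning comes from inv_square_mag = 1 / square_mag when a message vector is all zeros; the very next line replaces the resulting inf with 0, so it is harmless and could be silenced explicitly:

# Silence the expected divide-by-zero for all-zero rows (inf is replaced right after)
with np.errstate(divide='ignore'):
    inv_square_mag = 1 / square_mag
inv_square_mag[np.isinf(inv_square_mag)] = 0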

Any idea why, unlike in zbinsd's example, his proposed method is actually slower on my data, and how I can put the idle 75% of my CPU resources to use?
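
As a side note on the CPU question (an assumption worth verifying, not a definitive diagnosis): the dense np.dot path goes through whatever BLAS library NumPy was built against, while scipy's sparse-sparse multiplication is single-threaded, which on a 4-core machine would pin one core, i.e. ~25% total load. Which BLAS is in use, and how many threads it may take, can be inspected and set like this:

import os

# Thread caps must be set before numpy is imported; which variable
# applies depends on the BLAS backend (MKL, OpenBLAS, ...):
os.environ['MKL_NUM_THREADS'] = '4'
os.environ['OPENBLAS_NUM_THREADS'] = '4'
os.environ['OMP_NUM_THREADS'] = '4'

import numpy as np
np.show_config()  # prints the BLAS/LAPACK build information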

Question #2: For a large sample of my original data, I run into memory errors with both methods; "Method 1" never exceeds about 20% memory load, while "Method 2" quickly reaches about 60% before raising the error.

In [2]: all_msgs_vect.shape
Out[2]: (1063867, 3128)

In [3]: start = datetime.now()
   ...: msg_CosSim = cosine_similarity(all_msgs_vect)
   ...: print('Method 1 took', datetime.now() - start)
   ...: 
2019-09-09 11:13:53.808270
Traceback (most recent call last):

  File "<ipython-input-3-11dcc36bb82a>", line 3, in <module>
    msg_CosSim = cosine_similarity(all_msgs_vect)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\metrics\pairwise.py", line 925, in cosine_similarity
    K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
    ret = a * b

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
    return self._mul_sparse_matrix(other)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=idx_dtype)

MemoryError


In [4]: start = datetime.now()
   ...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
   ...: square_mag = np.diag(similarity)
   ...: inv_square_mag = 1 / square_mag
   ...: inv_square_mag[np.isinf(inv_square_mag)] = 0
   ...: inv_mag = np.sqrt(inv_square_mag)
   ...: cosine = similarity * inv_mag
   ...: msg_CosSim2 = cosine.T * inv_mag
   ...: print('Method 2 took', datetime.now() - start)
Traceback (most recent call last):

  File "<ipython-input-4-070750736bc5>", line 2, in <module>
    similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)

MemoryError

Any idea how to make use of all the available memory for these data volumes? I have a vague feeling that .toarray() is the problem in "Method 2", but how do I avoid it? Simply removing it does not solve the memory problem, and I am not sure the dot product computation still works correctly in that case:

In [5]: similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
Traceback (most recent call last):

  File "<ipython-input-5-e006c93b9bfd>", line 1, in <module>
    similarity = np.dot(all_msgs_vect, all_msgs_vect.T)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
    return self._mul_sparse_matrix(other)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=idx_dtype)

MemoryError
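
For scale: a dense 1,063,867 × 1,063,867 float64 result would take roughly 1.06e6² × 8 bytes ≈ 8 TiB, so no amount of RAM can hold the full matrix, and even the sparse-sparse product fails above while allocating the index arrays for its output. One escape hatch (a sketch under the assumption that only strong similarities matter; chunk_size and threshold are illustrative tuning knobs) is to normalize the rows once, then compute the product block by block and keep only entries above a threshold:

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

def cosine_similarity_chunked(X, chunk_size=10000, threshold=0.8):
    """Cosine similarity in row blocks, keeping only entries >= threshold."""
    X = normalize(X, norm='l2', axis=1)  # unit-length rows: dot product == cosine
    blocks = []
    for start in range(0, X.shape[0], chunk_size):
        block = X[start:start + chunk_size].dot(X.T)  # sparse (chunk x n) result
        block.data[block.data < threshold] = 0        # drop weak similarities
        block.eliminate_zeros()
        blocks.append(block)
    return sp.vstack(blocks).tocsr()

# msg_CosSim = cosine_similarity_chunked(all_msgs_vect)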

I hope I have provided enough information about my original data, since I cannot really upload it here; if not, please let me know! Any input is highly appreciated...

Thanks, Mark

Tags: python, memory
