python - Fast way to calculate cosine similarity on a large dataset without memory errors
Problem description
I need to compute cosine similarity on a scipy.sparse.csr.csr_matrix. With the most straightforward sklearn implementation I run into memory errors for larger matrix shapes. Even for smaller shapes the performance is not great, and CPU load never really exceeds 25%. So I would also like to make it faster and more stable for larger datasets.
I found a very good resource on the speed issue; I removed all versions from the original post except the fastest one and added my simple sklearn implementation as "Method 2". I confirmed that on my machine (32 GB RAM, Win10, Python 3.6.4), "Method 1" takes only about 4% of the time that "Method 2" needs, using the dataset built in the code. Here is the adapted code from zbinsd:
# Code adopted from zbinsd @ https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1
# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T
# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))
print("Input data shape:", Asp.shape)
# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 2:
        return cosine_similarity(A)
# Assert that all results are consistent with the first model ("truth")
for m in range(1, 3):
    if m in [2]:  # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))
# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
I also found a very good resource on the memory issue; it turns out I had already taken icm's advice and use only the unique occurrences, so I don't know how to improve things further.
Moving on to my original data, which comes from sklearn's CountVectorizer:
TFvectorizer = CountVectorizer(lowercase=False, tokenizer=log_tokenizer, ngram_range=(1,1))
TF = TFvectorizer.fit_transform(unique_msgs)
all_msgs_vect = TFvectorizer.transform(all_msgs)
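For context, this vectorization step can be reproduced end to end on toy data. In the sketch below, the three example messages stand in for unique_msgs/all_msgs and a plain whitespace split stands in for the custom log_tokenizer (both assumptions, since the real data and tokenizer are not shown):

```python
# Minimal stand-alone version of the vectorization step above.
# Toy messages replace unique_msgs/all_msgs; str.split replaces log_tokenizer.
from sklearn.feature_extraction.text import CountVectorizer

unique_msgs = ["error disk full", "user login ok", "error timeout"]
all_msgs = unique_msgs * 2  # pretend each message occurs twice

vectorizer = CountVectorizer(lowercase=False, tokenizer=str.split, ngram_range=(1, 1))
TF = vectorizer.fit_transform(unique_msgs)        # vocabulary is learned here
all_msgs_vect = vectorizer.transform(all_msgs)    # reuses the same vocabulary

print(type(all_msgs_vect))   # a scipy.sparse CSR matrix
print(all_msgs_vect.shape)   # (n_messages, vocabulary_size)
```

The result is the same kind of sparse CSR matrix as in the question, one row per message and one column per token.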
I have two questions:
Question #1: For a small sample of my original dataset, Method 1 is faster than Method 2, but neither method actually uses more than ~25% of CPU resources
In [1]: type(all_msgs_vect)
Out[1]: scipy.sparse.csr.csr_matrix
In [2]: all_msgs_vect.shape
Out[2]: (5000, 529)
# Method 1
In [3]: start = datetime.now()
...: print(datetime.now())
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
2019-09-09 10:44:33.039660
Method 1 took 0:00:00.117537
# Method 2
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Method 2 took 0:00:08.399767
__main__:4: RuntimeWarning: divide by zero encountered in true_divide
Any idea why, in contrast to zbinsd's example, the method he proposed is actually slower on my data, and how I can make use of the idle 75% of CPU resources?
Question #2: For a large sample of my original data I run into memory errors with both methods, where "Method 1" never exceeds about 20% memory load while "Method 2" quickly reaches about 60% before raising the error.
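As far as I can tell from the traceback further down, sklearn's cosine_similarity essentially L2-normalizes the rows and then takes one sparse-sparse product, and scipy's sparse matrix product runs on a single thread, which on a four-core machine would show up as roughly 25% total CPU. A minimal sketch of that normalize-then-multiply approach, on random stand-in data rather than the real messages:

```python
# Sketch of cosine similarity as "normalize rows, then sparse product".
# sp.random stands in for the real message/term matrix.
import scipy.sparse as sp
from sklearn.preprocessing import normalize

X = sp.random(2000, 529, density=0.05, format="csr", random_state=0)
Xn = normalize(X, norm="l2", axis=1)  # scale each row to unit length
S = Xn @ Xn.T                         # sparse product; S[i, j] = cosine(row i, row j)
print(S.shape)
```

A quick sanity check is the diagonal: every nonzero row's self-similarity comes out as 1.0.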
In [2]: all_msgs_vect.shape
Out[2]: (1063867, 3128)
In [3]: start = datetime.now()
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
...:
2019-09-09 11:13:53.808270
Traceback (most recent call last):
File "<ipython-input-3-11dcc36bb82a>", line 3, in <module>
msg_CosSim = cosine_similarity(all_msgs_vect)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\metrics\pairwise.py", line 925, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
ret = a * b
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Traceback (most recent call last):
File "<ipython-input-4-070750736bc5>", line 2, in <module>
similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
MemoryError
Any idea how to make use of all the available memory in order to handle these data volumes? I have a vague feeling that the .toarray() in "Method 2" is the problem, but how do I avoid it? Simply dropping it does not solve the memory error, and I'm not sure the dot matrix computation still works correctly in that case:
In [5]: similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
Traceback (most recent call last):
File "<ipython-input-5-e006c93b9bfd>", line 1, in <module>
similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
I hope I have provided enough information about my original data, since I can't really upload it here; if not, please let me know! Any input is highly appreciated...
Thanks, Mark