Persisting a large scipy.sparse.csr_matrix

Problem description

I have a very large sparse scipy matrix. Trying to save it with save_npz results in the following error:

>>> sp.save_npz('/projects/BIGmatrix.npz',W)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 716, in _savez
    pickle_kwargs=pickle_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/format.py", line 597, in write_array
    array.tofile(fp)
OSError: 6257005295 requested and 3283815408 written

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/_matrix_io.py", line 78, in save_npz
    np.savez_compressed(file, **arrays_dict)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 659, in savez_compressed
    _savez(file, args, kwds, True)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 721, in _savez
    raise IOError("Failed to write to %s: %s" % (tmpfile, exc))
OSError: Failed to write to /projects/BIGmatrix.npzg6ub_z3y-numpy.npy: 6257005295 requested and 3283815408 written

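So I wanted to try persisting it to Postgres via psycopg2, but I have not found a way to iterate over all the non-zero values so that I can persist them as rows in a table.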

What is the best way to handle this task?

Tags: python, numpy, scipy, persistence, sparse-matrix

Solution


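Save the attributes held in the matrix object's __dict__ that define it (here the data, indices and indptr arrays, plus the shape), and recreate the csr_matrix when loading: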

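from scipy import sparse
import numpy as np

# Build a small test matrix with up to 100 random non-zero entries
a = np.zeros((1000, 2000))
a[np.random.randint(0, 1000, 100), np.random.randint(0, 2000, 100)] = np.random.randn(100)

b = sparse.csr_matrix(a)

# Save the three CSR arrays plus the shape (np.savez appends ".npz" to the name)
np.savez("tmp", data=b.data, indices=b.indices, indptr=b.indptr, shape=np.array(b.shape))

# Load the arrays back and rebuild the matrix
f = np.load("tmp.npz")
b2 = sparse.csr_matrix((f["data"], f["indices"], f["indptr"]), shape=tuple(f["shape"]))
(b != b2).sum()  # 0 when the two matrices are identical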
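
As for the other route raised in the question, iterating over the non-zero values so they can be stored as table rows, one possible approach is to convert the matrix to COO format, which exposes parallel row, col and data arrays. The following is only a minimal sketch: the helper name iter_nonzero_rows and the chunk_size parameter are hypothetical, and the database side is left out entirely. Note that tocoo() builds additional index arrays, so it needs extra memory on a very large matrix, and that explicitly stored zeros, if any, are yielded as well.

```python
import numpy as np
from scipy import sparse


def iter_nonzero_rows(matrix, chunk_size=100000):
    """Yield lists of (row, col, value) tuples for the stored entries.

    Hypothetical helper for illustration; chunk_size bounds how many
    tuples are materialized as Python objects at once.
    """
    coo = matrix.tocoo()  # COO format: parallel row / col / data arrays
    for start in range(0, coo.nnz, chunk_size):
        stop = start + chunk_size
        # .tolist() converts NumPy scalars to plain Python ints and floats
        yield list(zip(coo.row[start:stop].tolist(),
                       coo.col[start:stop].tolist(),
                       coo.data[start:stop].tolist()))


# Small demonstration matrix with three stored entries
demo = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                   [2.0, 0.0, 0.0],
                                   [0.0, 0.0, 3.5]]))
chunks = list(iter_nonzero_rows(demo, chunk_size=2))
```

Each yielded chunk is a plain list of tuples, so it could be passed to a batch insert such as a psycopg2 cursor's executemany with a statement written for your own table; the table layout is up to you and is not shown here.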
