Splitting a sparse matrix by rows

Problem description

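I have a scipy.sparse.csr.csr_matrix of shape (8723, 1741277).

How can I efficiently split it by rows into n chunks?

Ideally the chunks would have roughly equal numbers of rows.

I say roughly because it depends on whether (number of rows) / (number of chunks) leaves a remainder.

I thought this would be easy with numpy.split, as it is for arrays, but it does not seem to work on sparse matrices.

Specifically, if I choose a number of chunks n that does not divide 8723 evenly, I get this error: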

ValueError: array split does not result in an equal division

If I choose a number of chunks n that does divide 8723 evenly, I get this error:

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

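The reason I want to split the sparse matrix into chunks is that I want to convert it to a (dense) array, but I cannot do that directly because it is too large as a whole.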

Tags: python, numpy, sparse-matrix

Solution


In [6]: from scipy import sparse                                                                     
In [7]: M = sparse.random(12,3,.1,'csr')                                                             
In [8]: np.split?                                                                                    
In [9]: np.split(M,3)                                                                                
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     55     try:
---> 56         return getattr(obj, method)(*args, **kwds)
     57 

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
    687         else:
--> 688             raise AttributeError(attr + " not found")
    689 

AttributeError: swapaxes not found

During handling of the above exception, another exception occurred:

AxisError                                 Traceback (most recent call last)
<ipython-input-9-11a4dcdd89af> in <module>
----> 1 np.split(M,3)

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
    848             raise ValueError(
    849                 'array split does not result in an equal division')
--> 850     res = array_split(ary, indices_or_sections, axis)
    851     return res
    852 

/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in array_split(ary, indices_or_sections, axis)
    760 
    761     sub_arys = []
--> 762     sary = _nx.swapaxes(ary, axis, 0)
    763     for i in range(Nsections):
    764         st = div_points[i]

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in swapaxes(a, axis1, axis2)
    583 
    584     """
--> 585     return _wrapfunc(a, 'swapaxes', axis1, axis2)
    586 
    587 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     64     # a downstream library like 'pandas'.
     65     except (AttributeError, TypeError):
---> 66         return _wrapit(obj, method, *args, **kwds)
     67 
     68 

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     44     except AttributeError:
     45         wrap = None
---> 46     result = getattr(asarray(obj), method)(*args, **kwds)
     47     if wrap:
     48         if not isinstance(result, mu.ndarray):

AxisError: axis1: axis 0 is out of bounds for array of dimension 0

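If we apply np.array to M we get a 0d object array, which is just a naive wrapper around the sparse object. That is why np.split fails: it sees an array with no axis 0 to split along.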

In [10]: np.array(M)                                                                                 
Out[10]: 
array(<12x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>, dtype=object)
In [11]: _.shape                                                                                     
Out[11]: ()
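
As a small sketch of the difference (the matrix size and the random_state seed below are arbitrary choices for illustration, not from the question), the wrapped object can be compared with a proper dense conversion via toarray():

```python
import numpy as np
from scipy import sparse

# Small random CSR matrix; the shape and seed are illustrative assumptions.
M = sparse.random(12, 3, density=0.1, format='csr', random_state=0)

# NumPy does not know how to unpack a sparse matrix, so it wraps the
# whole object in a 0-d array of dtype object.
wrapped = np.asarray(M)

# toarray() is the sparse matrix's own conversion to a real 2-D ndarray.
dense = M.toarray()

print(wrapped.shape, wrapped.dtype)
print(dense.shape)
```

Functions such as np.split operate on what np.asarray returns, so they never see the rows of the sparse matrix.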

Splitting the dense equivalent works correctly:

In [12]: np.split(M.A,3)                                                                             
Out[12]: 
[array([[0.        , 0.61858517, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]), array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]), array([[0.        , 0.89573059, 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.02334738],
        [0.        , 0.        , 0.        ]])]

And a direct sparse split, by slicing rows:

In [13]: [M[i:j,:] for i,j in zip([0,4,8],[4,8,12])]                                                 
Out[13]: 
[<4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>]
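
The hard-coded boundaries above can be generalized to n chunks of roughly equal height. The following is a minimal sketch, not a SciPy function: split_rows is a hypothetical helper name, and it assumes the same distribution rule that np.array_split uses for arrays (the first rows % n chunks get one extra row).

```python
import numpy as np
from scipy import sparse

def split_rows(M, n_chunks):
    """Split sparse matrix M into n_chunks row blocks of near-equal height.

    Hypothetical helper: the first (rows % n_chunks) blocks get one extra row.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    n_rows = M.shape[0]
    base, extra = divmod(n_rows, n_chunks)
    sizes = [base + 1] * extra + [base] * (n_chunks - extra)
    # Cumulative sizes give the row boundaries; tolist() yields plain ints.
    bounds = np.cumsum([0] + sizes).tolist()
    # Each slice is a copy in CSR format, not a view.
    return [M[bounds[i]:bounds[i + 1], :] for i in range(n_chunks)]

# Illustrative data: 10 rows do not divide evenly into 3 chunks.
M = sparse.random(10, 3, density=0.3, format='csr', random_state=0)
chunks = split_rows(M, 3)
print([c.shape for c in chunks])
```

With 10 rows and 3 chunks the block heights are 4, 3 and 3, and no stored element is lost or duplicated across the blocks.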

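Slicing like this is not as efficient for sparse matrices as it is for dense ones. Dense slices are views. Sparse ones have to be copies. The only exception is the lil format, which has a getrowview method. While there are a number of functions for constructing sparse matrices from blocks, there hasn't been much need for functions that split them.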
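
Since the stated goal is to densify a matrix that is too large to convert at once, one option is to slice and convert one row block at a time, so that only a single dense block is alive in memory. This is a sketch under assumptions: iter_dense_chunks is a hypothetical helper name, and it is parameterized by rows per block (chosen to fit memory) rather than by the number of blocks.

```python
import numpy as np
from scipy import sparse

def iter_dense_chunks(M, chunk_rows):
    """Yield consecutive row blocks of sparse matrix M as dense ndarrays.

    Hypothetical helper: each block has at most chunk_rows rows; the last
    block may be shorter.
    """
    if chunk_rows < 1:
        raise ValueError("chunk_rows must be at least 1")
    n_rows = M.shape[0]
    for start in range(0, n_rows, chunk_rows):
        # The sparse slice is a copy; toarray() then densifies only that block.
        yield M[start:start + chunk_rows, :].toarray()

# Illustrative data: 10 rows processed 4 rows at a time.
M = sparse.random(10, 3, density=0.3, format='csr', random_state=0)
for block in iter_dense_chunks(M, 4):
    # Process each dense block here, e.g. write it to disk or reduce it.
    print(block.shape)
```

For the matrix in the question, a dense float64 row of 1741277 columns takes about 14 MB, so chunk_rows would need to be chosen with that in mind.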

sklearn may have some splitting functionality. It has some sparse utility functions to support its own use of sparse matrices.

