Segmentation fault (core dumped) error with scikit-learn cosine similarity on a large sparse matrix (86196 x 15497)

Problem description

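I ran the code below on an input matrix of size 86196 x 15497 (10.3 GB). The matrix is mostly sparse. I am using cosine_similarity from sklearn.metrics.pairwise.

I can run a 43098 x 15497 input matrix without any problem (the resulting cosine-similarity output matrix is 13.8 GB, with dimensions 43098 x 43098). However, with the 86196 x 15497 input matrix I get a "Segmentation fault (core dumped)" error. This is on a Google Compute instance (16 vCPUs, 128 GB RAM).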

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# df is TFIDF calculations on data set of features
df = pd.read_csv('/path...')

df_piv = pd.pivot(df, index='individuals', columns='features', values='TFIDF')

df_piv.fillna(0, inplace=True)

cm = cosine_similarity(df_piv)

I also tried passing the input as a numpy.ndarray instead of a pandas.core.frame.DataFrame by using df_piv.values, but I still get the error.
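Since the matrix is described as mostly sparse, one thing that might be worth trying is passing a SciPy sparse matrix rather than a dense array. The sketch below assumes SciPy is installed and uses a tiny hypothetical array in place of the pivoted TF-IDF table. It has not been tried at the full 86196 x 15497 size, and the result can still be large if most pairs of rows have a non-zero similarity.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for df_piv.values (rows = individuals, columns = features).
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [2.0, 0.0, 4.0]])

# CSR format stores only the non-zero entries.
X = sparse.csr_matrix(dense)

# With sparse input, dense_output=False keeps the similarity matrix sparse too.
sim = cosine_similarity(X, dense_output=False)

print(sparse.issparse(sim))
print(sim.toarray())
```

Rows 0 and 2 point in the same direction, so their similarity is 1, while row 1 shares no features with either and gets 0.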

Here are the first lines of the C stack trace:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff5f86e29 in dgemm_otcopy_HASWELL () from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so

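I don't think this is a memory problem. I used gdb to get a C stack trace, and I wonder whether the problem lies in a third-party package (i.e. numpy or sklearn). Also, I expect the resulting cosine-similarity matrix to be 55.4 GB.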
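As a quick sanity check on the sizes quoted above, the sketch below assumes a dense float64 result (8 bytes per element) and reports sizes in GiB. It also prints whether the element count of the output exceeds the largest signed 32-bit integer. That second check is only an observation about the numbers; nothing in this post establishes that it is related to the crash.

```python
BYTES_PER_FLOAT64 = 8
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def dense_gib(rows, cols):
    """Size in GiB of a dense float64 matrix with the given shape."""
    return rows * cols * BYTES_PER_FLOAT64 / 2**30


for n in (43098, 86196):
    elements = n * n  # the similarity matrix is n x n
    print(n, round(dense_gib(n, n), 1), elements > INT32_MAX)

# prints:
# 43098 13.8 False
# 86196 55.4 True
```

Both sizes fit comfortably in 128 GB of RAM, which is consistent with the impression that this is not a plain out-of-memory failure.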

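I have copied and saved the C stack trace in case I need to report an issue to numpy/sklearn, but I wanted to ask whether anyone can see something I am doing wrong, or whether there is an alternative way to implement this.

I looked at this SO post, where the user got no response after running a C stack trace.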
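One possible alternative, sketched here under assumptions rather than offered as a confirmed fix, is to normalize the rows once and compute the similarity matrix in row blocks, so that no single matrix product has to produce the full n x n result. The function name cosine_similarity_chunked and the chunk_rows parameter are made up for this illustration, and the input is a small random array rather than the real data.

```python
import numpy as np


def cosine_similarity_chunked(X, chunk_rows=1000):
    """Yield (start, block) pairs; block holds the cosine similarities of
    rows start:start+len(block) against every row of X."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero rows stay all-zero instead of dividing by 0
    Xn = X / norms
    for start in range(0, Xn.shape[0], chunk_rows):
        # Each product is only (chunk_rows x n), not (n x n).
        yield start, Xn[start:start + chunk_rows] @ Xn.T


# Hypothetical small input standing in for df_piv.values.
rng = np.random.default_rng(0)
A = rng.random((7, 4))

for start, block in cosine_similarity_chunked(A, chunk_rows=3):
    print(start, block.shape)
```

Each block can be written to disk, thresholded, or reduced (for example to the top-k neighbours per row) before the next one is computed, which keeps peak memory well below that of the full matrix.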

Python version 3.8.5

sklearn version 0.24.1

numpy version 1.20.1

pandas version 1.2.2

Update

Below is the full backtrace reported by the C stack trace:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff5f86e29 in dgemm_otcopy_HASWELL () from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
(gdb) bt
#0  0x00007ffff5f86e29 in dgemm_otcopy_HASWELL ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
#1  0x00007ffff5311766 in inner_thread ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
#2  0x00007ffff5432635 in exec_blas ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
#3  0x00007ffff5312113 in dsyrk_thread_LN ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
#4  0x00007ffff5222ea8 in cblas_dsyrk ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
#5  0x00007ffff7314e76 in DOUBLE_matmul_matrixmatrix.constprop.6 ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
#6  0x00007ffff73191fb in DOUBLE_matmul ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
#7  0x00007ffff732a77e in PyUFunc_GenericFunction_int ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
#8  0x00007ffff732acb1 in ufunc_generic_call ()
   from /home/gndumbri/.local/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
#9  0x00000000005f46d6 in _PyObject_MakeTpCall (callable=<numpy.ufunc at remote 0x7ffff7683a90>,
    args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>)
    at ../Include/internal/pycore_pyerrors.h:13
#10 0x00000000005f4de5 in _PyObject_Vectorcall (kwnames=0x0, nargsf=2, args=0x7fffffffd5b0,
    callable=<numpy.ufunc at remote 0x7ffff7683a90>) at ../Include/cpython/abstract.h:125
#11 _PyObject_Vectorcall (kwnames=0x0, nargsf=2, args=0x7fffffffd5b0,
    callable=<numpy.ufunc at remote 0x7ffff7683a90>) at ../Include/cpython/abstract.h:115
#12 _PyObject_FastCall (nargs=2, args=0x7fffffffd5b0,
    func=<numpy.ufunc at remote 0x7ffff7683a90>) at ../Include/cpython/abstract.h:147
#13 object_vacall (base=<optimized out>, callable=<numpy.ufunc at remote 0x7ffff7683a90>,
    vargs=<optimized out>) at ../Objects/call.c:1186
#14 0x00000000005f4f65 in PyObject_CallFunctionObjArgs (callable=<optimized out>)
    at ../Objects/call.c:1259
#15 0x000000000050e327 in binary_op1 (v=<numpy.ndarray at remote 0x7fff97579b10>,
    w=<numpy.ndarray at remote 0x7fff97579c90>, op_slot=<optimized out>)
    at ../Objects/abstract.c:808
#16 0x000000000061884a in binary_op (v=<numpy.ndarray at remote 0x7fff97579b10>,
    w=<numpy.ndarray at remote 0x7fff97579c90>, op_slot=272, op_name=0x6d858f "@")
    at ../Objects/abstract.c:837
#17 0x00000000004b907e in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:1490
#18 0x000000000056955a in PyEval_EvalFrameEx (throwflag=0,
    f=Frame 0x7fff99274840, for file /home/gndumbri/.local/lib/python3.8/site-packages/sklearn/utils/extmath.py, line 152, in safe_sparse_dot (a=<numpy.ndarray at remote 0x7fff97579b10>, b=<numpy.ndarray at remote 0x7fff97579c90>, dense_output=True)) at ../Python/ceval.c:741
#19 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>,
    locals=<optimized out>, args=<optimized out>, argcount=<optimized out>,
    kwnames=<optimized out>, kwargs=0x7fff992851a0, kwcount=<optimized out>, kwstep=1, defs=0x0,
    defcount=0, kwdefs={'dense_output': False}, closure=0x0, name='safe_sparse_dot',
    qualname='safe_sparse_dot') at ../Python/ceval.c:4298
#20 0x00000000005f7323 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fff99285190,
    nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:435
#21 0x00000000005f3d42 in PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>,
    callable=<function at remote 0x7fff9aa56d30>) at ../Objects/call.c:1296
#22 PyObject_Call (callable=<function at remote 0x7fff9aa56d30>, args=<optimized out>,
    kwargs=<optimized out>) at ../Objects/call.c:227
#23 0x000000000056ca92 in do_call_core (kwdict={'dense_output': True},
    callargs=(<numpy.ndarray at remote 0x7fff97579b10>, <numpy.ndarray at remote 0x7fff97579c90>),
 func=<function at remote 0x7fff9aa56d30>, tstate=<optimized out>) at ../Python/ceval.c:5010
#24 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:3559
#25 0x000000000056955a in PyEval_EvalFrameEx (throwflag=0,
    f=Frame 0x7fffc5d1a240, for file /home/gndumbri/.local/lib/python3.8/site-packages/sklearn/utils/validation.py, line 63, in inner_f (args=(<numpy.ndarray at remote 0x7fff97579b10>, <numpy.ndarray at remote 0x7fff97579c90>), kwargs={'dense_output': True}, extra_args=0))
    at ../Python/ceval.c:741
#26 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>,
    locals=<optimized out>, args=<optimized out>, argcount=<optimized out>,
    kwnames=<optimized out>, kwargs=0x7fff9926ef10, kwcount=<optimized out>, kwstep=1, defs=0x0,
    defcount=0, kwdefs=0x0,
    closure=(<cell at remote 0x7fff9ab059d0>, <cell at remote 0x7fff9ab05910>, <cell at remote 0x7fff9ab05a00>, <cell at remote 0x7fff9ab05a60>, <cell at remote 0x7fff9ab05070>),
    name='safe_sparse_dot', qualname='safe_sparse_dot') at ../Python/ceval.c:4298
#27 0x00000000005f7323 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fff9926ef00,
    nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:435
#28 0x000000000056c451 in _PyObject_Vectorcall (kwnames=('dense_output',),
    nargsf=<optimized out>, args=<optimized out>, callable=<function at remote 0x7fff9aa56e50>)
    at ../Include/cpython/abstract.h:127
#29 call_function (kwnames=('dense_output',), oparg=<optimized out>,
    pp_stack=<synthetic pointer>, tstate=<optimized out>) at ../Python/ceval.c:4963
#30 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:3515
#31 0x000000000056955a in PyEval_EvalFrameEx (throwflag=0,
    f=Frame 0x7fff9926ed60, for file /home/gndumbri/.local/lib/python3.8/site-packages/sklearn/metrics/pairwise.py, line 1444, in cosine_similarity (X=<numpy.ndarray at remote 0x7fff97579ab0>, Y=<numpy.ndarray at remote 0x7fff97579ab0>, dense_output=True, X_normalized=<numpy.ndarray at remote 0x7fff97579b10>, Y_normalized=<numpy.ndarray at remote 0x7fff97579b10>)) at ../Python/ceval.c:741
#32 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>,
    locals=<optimized out>, args=<optimized out>, argcount=<optimized out>,
    kwnames=<optimized out>, kwargs=0x7ffff77ef7c0, kwcount=<optimized out>, kwstep=1,
    defs=0x7fff9ac78ed8, defcount=2, kwdefs=0x0, closure=0x0, name='cosine_similarity',
    qualname='cosine_similarity') at ../Python/ceval.c:4298
#33 0x00000000005f7323 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff77ef7b8,
    nargsf=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:435
#34 0x000000000056b26e in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>,
    args=0x7ffff77ef7b8, callable=<function at remote 0x7fff9a0d08b0>)
    at ../Include/cpython/abstract.h:127
#35 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>,
    tstate=0x962400) at ../Python/ceval.c:4963
#36 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:3500
#37 0x000000000056955a in PyEval_EvalFrameEx (throwflag=0,
    f=Frame 0x7ffff77ef640, for file cos_sim.py, line 46, in <module> ())
    at ../Python/ceval.c:741
#38 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>,
    locals=<optimized out>, args=<optimized out>, argcount=<optimized out>,
    kwnames=<optimized out>, kwargs=0x0, kwcount=<optimized out>, kwstep=2, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at ../Python/ceval.c:4298
#39 0x000000000068c4a7 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0,
    kwcount=0, kws=0x0, argcount=0, args=0x0, locals=<optimized out>, globals=<optimized out>,
    _co=<optimized out>) at ../Python/ceval.c:4327
#40 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at ../Python/ceval.c:718
#41 0x000000000067bc91 in run_eval_code_obj (co=0x7ffff777a920,
    globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFi
leLoader(name='__main__', path='cos_sim.py') at remote 0x7ffff7800550>, '__spec__': None, '__annot
ations__': {}, '__builtins__': <module at remote 0x7ffff787b0e0>, '__file__': 'cos_sim.py', '__cac
hed__': None, 'time': <module at remote 0x7ffff7860b30>, 'pd': <module at remote 0x7ffff76e50e0>,
'np': <module at remote 0x7ffff76e5810>, 'cosine_similarity': <function at remote 0x7fff9a0d08b0>,
 'path_A': '/home/gndumbri/diag.csv', 'calc_tfidf': <function at remote 0x7ffff78421f0>, 'start':
<float at remote 0x7fff9a0be250>, 'full': <DataFrame(_is_copy=None, _mgr=<BlockManager at remote 0
x7fff95b30ac0>, _item_cache={'TOTL_BENE_CNT': <Series(_is_copy=None, _mgr=<SingleBlockManager at r
emote 0x7fff99c23a00>, _item_cache={}, _attrs={}, _flags=<Flags(_allows_duplicate_labels=True, _ob
j=<weakref at remote 0x7fff992d81d0>) at remote 0x7fff9ac32970>, _name='TOTL_BENE_CNT', _index=<Ra
ngeIndex(_range=<range at remote 0x7ffff78007e0>, _name=...(truncated),
    locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFil
eLoader(name='__main__', path='cos_sim.py') at remote 0x7ffff7800550>, '__spec__': None, '__annota
tions__': {}, '__builtins__': <module at remote 0x7ffff787b0e0>, '__file__': 'cos_sim.py', '__cach
ed__': None, 'time': <module at remote 0x7ffff7860b30>, 'pd': <module at remote 0x7ffff76e50e0>, '
np': <module at remote 0x7ffff76e5810>, 'cosine_similarity': <function at remote 0x7fff9a0d08b0>,
'path_A': '/home/gndumbri/diag.csv', 'calc_tfidf': <function at remote 0x7ffff78421f0>, 'start': <
float at remote 0x7fff9a0be250>, 'full': <DataFrame(_is_copy=None, _mgr=<BlockManager at remote 0x
7fff95b30ac0>, _item_cache={'TOTL_BENE_CNT': <Series(_is_copy=None, _mgr=<SingleBlockManager at re
mote 0x7fff99c23a00>, _item_cache={}, _attrs={}, _flags=<Flags(_allows_duplicate_labels=True, _obj
=<weakref at remote 0x7fff992d81d0>) at remote 0x7fff9ac32970>, _name='TOTL_BENE_CNT', _index=<Ran
geIndex(_range=<range at remote 0x7ffff78007e0>, _name=...(truncated))
    at ../Python/pythonrun.c:1125
#42 0x000000000067bd0f in run_mod (mod=<optimized out>, filename=<optimized out>,
    globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFi
leLoader(name='__main__', path='cos_sim.py') at remote 0x7ffff7800550>, '__spec__': None, '__annot
ations__': {}, '__builtins__': <module at remote 0x7ffff787b0e0>, '__file__': 'cos_sim.py', '__cac
hed__': None, 'time': <module at remote 0x7ffff7860b30>, 'pd': <module at remote 0x7ffff76e50e0>,
'np': <module at remote 0x7ffff76e5810>, 'cosine_similarity': <function at remote 0x7fff9a0d08b0>,
 'path_A': '/home/gndumbri/diag.csv', 'calc_tfidf': <function at remote 0x7ffff78421f0>, 'start':
<float at remote 0x7fff9a0be250>, 'full': <DataFrame(_is_copy=None, _mgr=<BlockManager at remote 0
x7fff95b30ac0>, _item_cache={'TOTL_BENE_CNT': <Series(_is_copy=None, _mgr=<SingleBlockManager at r
emote 0x7fff99c23a00>, _item_cache={}, _attrs={}, _flags=<Flags(_allows_duplicate_labels=True, _ob
j=<weakref at remote 0x7fff992d81d0>) at remote 0x7fff9ac32970>, _name='TOTL_BENE_CNT', _index=<Ra
ngeIndex(_range=<range at remote 0x7ffff78007e0>, _name=...(truncated),
    locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFil
eLoader(name='__main__', path='cos_sim.py') at remote 0x7ffff7800550>, '__spec__': None, '__annota
tions__': {}, '__builtins__': <module at remote 0x7ffff787b0e0>, '__file__': 'cos_sim.py', '__cach
ed__': None, 'time': <module at remote 0x7ffff7860b30>, 'pd': <module at remote 0x7ffff76e50e0>, '
np': <module at remote 0x7ffff76e5810>, 'cosine_similarity': <function at remote 0x7fff9a0d08b0>,
'path_A': '/home/gndumbri/diag.csv', 'calc_tfidf': <function at remote 0x7ffff78421f0>, 'start': <
float at remote 0x7fff9a0be250>, 'full': <DataFrame(_is_copy=None, _mgr=<BlockManager at remote 0x
7fff95b30ac0>, _item_cache={'TOTL_BENE_CNT': <Series(_is_copy=None, _mgr=<SingleBlockManager at re
mote 0x7fff99c23a00>, _item_cache={}, _attrs={}, _flags=<Flags(_allows_duplicate_labels=True, _obj
=<weakref at remote 0x7fff992d81d0>) at remote 0x7fff9ac32970>, _name='TOTL_BENE_CNT', _index=<Ran
geIndex(_range=<range at remote 0x7ffff78007e0>, _name=...(truncated), flags=<optimized out>,
    arena=<optimized out>) at ../Python/pythonrun.c:1147
#43 0x000000000067bdcb in PyRun_FileExFlags (fp=0x95f340, filename_str=<optimized out>,
    start=<optimized out>,
    globals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFi
leLoader(name='__main__', path='cos_sim.py') at remote 0x7ffff7800550>, '__spec__': None, '__annot
ations__': {}, '__builtins__': <module at remote 0x7ffff787b0e0>, '__file__': 'cos_sim.py', '__cac
hed__': None, 'time': <module at remote 0x7ffff7860b30>, 'pd': <module at remote 0x7ffff76e50e0>,
'np': <module at remote 0x7ffff76e5810>, 'cosine_similarity': <function at remote 0x7fff9a0d08b0>,
 'path_A': '/home/gndumbri/diag.csv', 'calc_tfidf': <function at remote 0x7ffff78421f0>, 'start':
<float at remote 0x7fff9a0be250>, 'full': <DataFrame(_is_copy=None, _mgr=<BlockManager at remote 0
x7fff95b30ac0>, _item_cache={'TOTL_BENE_CNT': <Series(_is_copy=None, _mgr=<SingleBlockManager at r
emote 0x7fff99c23a00>, _item_cache={}, _attrs={}, _flags=<Flags(_allows_duplicate_labels=True, _ob
j=<weakref at remote 0x7fff992d81d0>) at remote 0x7fff9ac32970>, _name='TOTL_BENE_CNT', _index=<Ra
ngeIndex(_range=<range at remote 0x7ffff78007e0>, _name=...(truncated),
    locals={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFil
eLoader(name='__main__', path='cos_sim.py') at remote 0x7ffff7800550>, '__spec__': None, '__annota
tions__': {}, '__builtins__': <module at remote 0x7ffff787b0e0>, '__file__': 'cos_sim.py', '__cach
ed__': None, 'time': <module at remote 0x7ffff7860b30>, 'pd': <module at remote 0x7ffff76e50e0>, '
np': <module at remote 0x7ffff76e5810>, 'cosine_similarity': <function at remote 0x7fff9a0d08b0>,
'path_A': '/home/gndumbri/diag.csv', 'calc_tfidf': <function at remote 0x7ffff78421f0>, 'start': <
float at remote 0x7fff9a0be250>, 'full': <DataFrame(_is_copy=None, _mgr=<BlockManager at remote 0x
7fff95b30ac0>, _item_cache={'TOTL_BENE_CNT': <Series(_is_copy=None, _mgr=<SingleBlockManager at re
mote 0x7fff99c23a00>, _item_cache={}, _attrs={}, _flags=<Flags(_allows_duplicate_labels=True, _obj
=<weakref at remote 0x7fff992d81d0>) at remote 0x7fff9ac32970>, _name='TOTL_BENE_CNT', _index=<Ran
geIndex(_range=<range at remote 0x7ffff78007e0>, _name=...(truncated), closeit=1,
    flags=0x7fffffffe2d8) at ../Python/pythonrun.c:1063
#44 0x000000000067de4e in PyRun_SimpleFileExFlags (fp=0x95f340, filename=<optimized out>,
    closeit=1, flags=0x7fffffffe2d8) at ../Python/pythonrun.c:428
#45 0x00000000006b6032 in pymain_run_file (cf=0x7fffffffe2d8, config=0x9617d0)
    at ../Modules/main.c:381
#46 pymain_run_python (exitcode=0x7fffffffe2d0) at ../Modules/main.c:606
#47 Py_RunMain () at ../Modules/main.c:685
#48 0x00000000006b63bd in Py_BytesMain (argc=<optimized out>, argv=<optimized out>)
    at ../Modules/main.c:739
#49 0x00007ffff7df70b3 in __libc_start_main (main=0x4eea30 <main>, argc=2, argv=0x7fffffffe4b8,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe4a8) at ../csu/libc-start.c:308
#50 0x00000000005fa4de in _start () at ../Objects/bytesobject.c:2560

Tags: python, numpy, scikit-learn, segmentation-fault, cosine-similarity

Solution

