Dask is inefficient when concatenating large pandas DataFrames and raises a MemoryError

Problem description

At first, I tried the typical concatenation of pandas DataFrames:

df=pd.concat([df,df_filtered2],axis=1,sort=False)

But it raised an error:

/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
Traceback (most recent call last):
  File "process_data_interpolation.py", line 435, in <module>
    df=pd.concat([df,df_filtered2],axis=1,sort=False)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 255, in concat
    sort=sort,
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 335, in __init__
    obj._consolidate(inplace=True)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5270, in _consolidate
    self._consolidate_inplace()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5252, in _consolidate_inplace
    self._protect_consolidate(f)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5241, in _protect_consolidate
    result = f()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5250, in f
    self._data = self._data.consolidate()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 932, in consolidate
    bm._consolidate_inplace()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
    new_values = new_values[argsort]
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64

So I tried Dask:

df = dd.concat([df,df_filtered2],axis=1)

But it also gave me a MemoryError:

/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
Traceback (most recent call last):
  File "process_data_interpolation.py", line 443, in <module>
    df = dd.concat([df,df_filtered2],axis=1)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/multi.py", line 1045, in concat
    dfs = _maybe_from_pandas(dfs)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in _maybe_from_pandas
    for df in dfs
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in <listcomp>
    for df in dfs
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in from_pandas
    for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in <dictcomp>
    for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
    return self._get_slice_axis(key, axis=axis)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1308, in _get_slice_axis
    return self._slice(indexer, axis=axis, kind="iloc")
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 166, in _slice
    return self.obj._slice(obj, axis=axis, kind=kind)
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 3371, in _slice
    result = self._constructor(self._data.get_slice(slobj, axis=axis))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 755, in get_slice
    bm._consolidate_inplace()
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
    new_values = new_values[argsort]
MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64

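What else can I try? I am running the Python script on a Linux node with 128 GB of RAM. In my case, after dropping unnecessary columns and converting some columns to integers, one of the pandas DataFrames is 44.48 GB in size.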

Tags: python, pandas, dask

Solution


This question is answered in the Dask Best Practices documentation:

https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask

