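python - Dask is inefficient when concatenating pandas DataFrames and raises a MemoryError
Problem description
At first, I tried the usual pandas DataFrame concatenation: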
df=pd.concat([df,df_filtered2],axis=1,sort=False)
But it raised an error:
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
Traceback (most recent call last):
File "process_data_interpolation.py", line 435, in <module>
df=pd.concat([df,df_filtered2],axis=1,sort=False)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 335, in __init__
obj._consolidate(inplace=True)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5270, in _consolidate
self._consolidate_inplace()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5252, in _consolidate_inplace
self._protect_consolidate(f)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5241, in _protect_consolidate
result = f()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 5250, in f
self._data = self._data.consolidate()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 932, in consolidate
bm._consolidate_inplace()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
new_values = new_values[argsort]
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64
So I tried Dask:
df = dd.concat([df,df_filtered2],axis=1)
But it also gave me a MemoryError:
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
Traceback (most recent call last):
File "process_data_interpolation.py", line 443, in <module>
df = dd.concat([df,df_filtered2],axis=1)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/multi.py", line 1045, in concat
dfs = _maybe_from_pandas(dfs)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in _maybe_from_pandas
for df in dfs
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/core.py", line 4465, in <listcomp>
for df in dfs
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in from_pandas
for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dask/dataframe/io/io.py", line 209, in <dictcomp>
for i, (start, stop) in enumerate(zip(locations[:-1], locations[1:]))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
return self._get_slice_axis(key, axis=axis)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 1308, in _get_slice_axis
return self._slice(indexer, axis=axis, kind="iloc")
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/indexing.py", line 166, in _slice
return self.obj._slice(obj, axis=axis, kind=kind)
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/generic.py", line 3371, in _slice
result = self._constructor(self._data.get_slice(slobj, axis=axis))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 755, in get_slice
bm._consolidate_inplace()
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 937, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1913, in _consolidate
list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
File "/home/user/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3323, in _merge_blocks
new_values = new_values[argsort]
MemoryError: Unable to allocate array with shape (41, 156082680) and data type float64
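What else can I try? I am running the Python script on a Linux node with 128 GB of RAM. In my case, one of the pandas DataFrames is 44.48 GB after dropping unnecessary columns and converting some columns to integers.
Solution
This question is answered in the Dask Best Practices documentation: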
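As a rough check of the scale involved, the size of the array named in the MemoryError can be worked out from its shape and dtype. This is only arithmetic on the numbers in the traceback; the constant name below is chosen for illustration.

```python
# Shape copied from the MemoryError message: (41, 156082680), dtype float64.
shape = (41, 156082680)
FLOAT64_BYTES = 8  # a float64 value occupies 8 bytes

needed_bytes = shape[0] * shape[1] * FLOAT64_BYTES
needed_gib = needed_bytes / 1024**3

print(f"{needed_bytes:,} bytes, about {needed_gib:.1f} GiB")
# prints: 51,195,119,040 bytes, about 47.7 GiB
```

That single block has to be allocated contiguously, on top of the DataFrames already held in memory, which is likely why the allocation fails even on a node with 128 GB of RAM.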
https://docs.dask.org/en/latest/best-practices.html#load-data-with-dask