Reducing RAM consumption when fetching data from ClickHouse into a Pandas DataFrame

Problem description

I am trying to reduce RAM consumption when fetching data from ClickHouse into a Pandas DataFrame. This is what I have now:

Filename: full.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    37     89.1 MiB     89.1 MiB           1   @profile
    38                                         def full():
    39    754.0 MiB    664.9 MiB           1       df = pd.read_sql(sql=query, con=conn)
    40    754.0 MiB      0.0 MiB           1       return df


CPU times: user 1.25 s, sys: 280 ms, total: 1.53 s
Wall time: 1min 8s

and:

Filename: it.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    37     89.4 MiB     89.4 MiB           1   @profile
    38                                         def it():
    39     89.4 MiB      0.0 MiB           1       df = pd.DataFrame()
    40    661.8 MiB    572.4 MiB           1       iterator = pd.read_sql(sql=query, con=conn, chunksize=100_000)
    41    813.1 MiB     41.4 MiB           9       for chunk in iterator:
    42    813.1 MiB    109.9 MiB           8           df = df.append(chunk)
    43    813.1 MiB      0.0 MiB           1       return df


CPU times: user 1.37 s, sys: 367 ms, total: 1.73 s
Wall time: 1min 12s

The measurements were taken with memory-profiler. conn is an SQLAlchemy engine created with create_engine. The query returns 712 300 rows.

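As you can see, the two versions are more or less equal in run time. Memory is a different matter: I expected the result of pandas.read_sql, when returned as an iterator (that is, with chunksize set), to be much smaller. Or is this entirely expected and normal?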

UPD: pandahouse gives the following result:

Filename: ph.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    30     78.9 MiB     78.9 MiB           1   @profile
    31                                         def ph():
    32    290.1 MiB    211.2 MiB           1       df = read_clickhouse(query, connection={...})
    33    290.1 MiB      0.0 MiB           1       return df

Tags: pandas, clickhouse

Solution
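
What follows is a minimal sketch, not a definitive answer. In the it.py profile above, the line that creates the iterator already shows an increment of 572.4 MiB, which suggests the rows are fetched before pandas splits them into chunks. Appending every chunk back into one DataFrame then rebuilds the full result anyway. Chunking lowers peak memory only when each chunk is reduced or discarded before the next one is read. The example assumes an in-memory SQLite table in place of the ClickHouse connection so that it runs anywhere; the table name events, its columns, and the helper names are hypothetical. It illustrates the two access patterns and does not measure memory. DataFrame.append was removed in pandas 2.0, so the chunks are combined with pd.concat instead.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the ClickHouse connection and query in the question.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"user_id": range(1_000), "amount": [i % 7 for i in range(1_000)]}
).to_sql("events", conn, index=False)
query = "SELECT user_id, amount FROM events"


def concat_chunks(chunksize=100):
    """Read in chunks but keep every chunk: the full result ends up in memory."""
    chunks = pd.read_sql(sql=query, con=conn, chunksize=chunksize)
    # A single concat replaces repeated df.append(chunk) calls.
    return pd.concat(list(chunks), ignore_index=True)


def sum_in_chunks(chunksize=100):
    """Reduce each chunk to a scalar so no chunk outlives its loop iteration."""
    total = 0
    for chunk in pd.read_sql(sql=query, con=conn, chunksize=chunksize):
        total += int(chunk["amount"].sum())
    return total


df = concat_chunks()
total = sum_in_chunks()
print(len(df), total)
```

The first helper is equivalent to the it() function in the question and should not be expected to use less memory than a plain read_sql call. The second helper shows the pattern where chunking can pay off, provided the driver in use actually streams rows instead of buffering the whole result, which this sketch does not verify for ClickHouse.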

