Converting an entire (large) schema to hdf5

Problem description

I'm trying to export an entire database schema (about 20 GB) via a PostgreSQL query, to end up with a single hdf5 file.

Since that doesn't fit in my machine's memory, I'm using the chunksize parameter.

First I use this function to build the connection string:

def make_connectstring(prefix, db, uname, passa, hostname, port):
    """return an sql connectstring"""
    connectstring = prefix + "://" + uname + ":" + passa + "@" + hostname + \
                    ":" + port + "/" + db
    return connectstring
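As an aside, raw string concatenation like this breaks if the password contains characters such as "@" or "/". A minimal sketch of the same idea using sqlalchemy.engine.URL.create (available in SQLAlchemy 1.4+), which handles the escaping; make_url is a hypothetical name, the parameters mirror make_connectstring:

from sqlalchemy.engine import URL

def make_url(prefix, db, uname, passa, hostname, port):
    """Return a URL object accepted by create_engine, with escaping handled."""
    return URL.create(
        drivername=prefix,   # e.g. "postgresql+psycopg2"
        username=uname,
        password=passa,      # special characters are escaped automatically
        host=hostname,
        port=int(port),
        database=db,
    )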

Then I create a temporary folder to hold each hdf5 chunk file:

import tempfile

import pandas as pd
import sqlalchemy


def query_to_hdf5(connectstring, query, verbose=False, chunksize=50000):

    engine = sqlalchemy.create_engine(connectstring,
        server_side_cursors=True)

    # write the data to temporary chunk files
    i = 0
    paths_chunks = []
    with tempfile.TemporaryDirectory() as td:
        for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
            path = td + "/chunk" + str(i) + ".hdf5"
            df.to_hdf(path, key='data')
            if verbose:
                print("wrote", path)
            paths_chunks.append(path)
            i += 1
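Worth noting: as written, query_to_hdf5 never returns paths_chunks, and tempfile.TemporaryDirectory deletes the folder (and every chunk file in it) the moment the with block exits, so any merge step has to run inside that block, or the chunks have to be written somewhere more durable. A minimal sketch of the latter, using tempfile.mkdtemp so the directory outlives the call; query_to_chunk_files is a hypothetical name and the caller owns cleanup:

import shutil  # for the eventual cleanup

def query_to_chunk_files(connectstring, query, chunksize=50000):
    """Variant of query_to_hdf5 whose chunk files survive the function call.
    The caller must remove the directory with shutil.rmtree(td) when done."""
    engine = sqlalchemy.create_engine(connectstring,
        server_side_cursors=True)
    td = tempfile.mkdtemp()  # not auto-deleted, unlike TemporaryDirectory
    paths_chunks = []
    for i, df in enumerate(pd.read_sql_query(sql=query, con=engine, chunksize=chunksize)):
        path = td + "/chunk" + str(i) + ".hdf5"
        df.to_hdf(path, key='data')
        paths_chunks.append(path)
    return td, paths_chunks

# usage:
# td, paths_chunks = query_to_chunk_files(connectstring, query)
# ... merge the chunk files ...
# shutil.rmtree(td)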


connectstring = make_connectstring(prefix, db, uname, passa, hostname, port)
query = "SELECT * FROM public.zz_ges"
query_to_hdf5(connectstring, query)

What is the best way to merge all these files into a single file representing the whole dataframe?

I tried something like this:

    df = pd.DataFrame()
    for path in paths_chunks:
        df_scratch = pd.read_hdf(path)
        df = pd.concat([df, df_scratch])
        if verbose:
            print("read", path)

However, memory grows very quickly that way. I need something more efficient.
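The repeated pd.concat in that loop copies the accumulated frame on every iteration, so peak memory approaches the full 20 GB. A merge that never holds more than one chunk in memory appends each file straight into a single on-disk store instead, which is essentially what the update below does in one pass from the database. A minimal sketch, assuming paths_chunks is still valid and with 'merged.h5' as a hypothetical output name:

# Stream the chunk files into one table; only one chunk is in RAM at a time.
with pd.HDFStore('merged.h5', 'w') as store:
    for path in paths_chunks:
        df_scratch = pd.read_hdf(path)
        store.append('data', df_scratch)  # rows go to the on-disk table, not RAM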

Update:

def make_connectstring(prefix, db, uname, passa, hostname, port):
    """return an sql connectstring"""
    connectstring = prefix + "://" + uname + ":" + passa + "@" + hostname + \
                    ":" + port + "/" + db
    return connectstring

def query_to_df(connectstring, query, verbose=False, chunksize=50000):

    engine = sqlalchemy.create_engine(connectstring, 
        server_side_cursors=True)    

    # stream each chunk straight into a single HDF5 store
    with pd.HDFStore('output.h5', 'w') as store:
        for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
            store.append('data', df)
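Because store.append writes a table-format dataset, the merged file can also be read back lazily afterwards. A minimal sketch, reusing the 'output.h5' file and 'data' key from the update above:

with pd.HDFStore('output.h5', 'r') as store:
    print(store.get_storer('data').nrows)  # row count without loading the data
    for chunk in store.select('data', chunksize=50000):
        ...  # process one chunk at a time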

Tags: python, pandas, postgresql, dask

Solution


I'd suggest using an HDFStore directly; that way you can append each chunk as you fetch it from the database, e.g.:

with pd.HDFStore('output.h5', 'w') as store:
    for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
        store.append('data', df)

This is based on your existing code, so it's not complete; let me know if anything is unclear.

Note that I'm opening the store in 'w' mode, so it will overwrite the file every time; otherwise append would just keep adding the same rows to the end of the table. Alternatively, you could remove the key first.
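A minimal sketch of that alternative (removing the key first), reusing query, engine, and chunksize from the question; this variant keeps any other keys that may already live in the file:

with pd.HDFStore('output.h5', 'a') as store:  # append mode keeps other keys
    if 'data' in store:
        store.remove('data')                  # drop only the stale table
    for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
        store.append('data', df)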

When you open the store you also get lots of options, such as compression, though they don't seem to be well documented; help(pd.HDFStore) describes complevel and complib for me.
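For example, the complevel and complib options that help(pd.HDFStore) mentions can be passed when opening the store. A minimal sketch using blosc at the highest level; the right library and level depend on your data:

with pd.HDFStore('output.h5', 'w', complevel=9, complib='blosc') as store:
    for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
        store.append('data', df)  # chunks are compressed as they are written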

