How to store a class instance in HDF5

Problem description

TL;DR: Question is in the title. See code snippet.

I need to store pandas.DataFrame objects in a dictionary-like data structure and save them to disk. My current implementation uses a non-nested Python dict of the form Dict[str, pandas.DataFrame], and every minute I save each pandas.DataFrame to disk as a CSV file. However, these two responsibilities (holding the data in memory and persisting it to disk) might be unified more elegantly with a storage format such as HDF5.

One important constraint is that I cannot change the type of the objects stored in the pandas.DataFrame, and apparently not all object types can be stored in an HDF5 file. The reason is that I'm implementing a third-party interface with predefined data types that need to be handled in their native form. Mapping instances to a different kind of object (e.g. instance to dict) would require an additional layer of logic to convert back and forth (dict to instance), which I want to avoid.

A similar question with an answer is here. However, I'm not necessarily interested in querying the stored instances afterwards. In addition, I would ideally keep the extra logic needed to serialise the instances to a minimum (if any is needed at all). Data compression is also not a concern. Alternatively, an answer could point to a well-established Python package that already encapsulates the logic for storing class instances in HDF5 or a similar data model.

import pandas as pd


class C:
    def __init__(self, a=0):
        self.a = a

    def return_42(self):
        return self.a

df = pd.DataFrame([C()])
df.dtypes
#    0    object
#    dtype: object

store = pd.HDFStore('store1.hdf5')
store.append('c', pd.DataFrame([C()]))
#    TypeError: Cannot serialize the column [0] because 
#    its data contents are [mixed] object dtype.

Tags: python, pandas, data-structures, io, hdf5

Solution
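
One possible direction, offered as a sketch rather than a definitive answer: the TypeError in the question comes from trying to write an object-dtype column into an HDF5 table, but the instances themselves are ordinary Python objects and can be pickled. The standard-library shelve module provides a dict-like store that is persisted to disk and pickles each value, which covers both responsibilities described in the question without a hand-written mapping layer. It is not HDF5, only "a similar data model" in the sense of a persistent key/value store. To stay within the standard library, the example stores a plain list of C instances. The assumption is that a pandas.DataFrame holding such instances could be stored the same way, since DataFrames are picklable. The names save_items, load_items and the file name store are hypothetical.

```python
import shelve
import tempfile
from pathlib import Path


class C:
    """Stand-in for the third-party type from the question."""

    def __init__(self, a=0):
        self.a = a

    def return_a(self):
        return self.a


def save_items(path, items):
    """Write every key/value pair to a shelve file; values are pickled as-is."""
    with shelve.open(str(path)) as db:
        for key, value in items.items():
            db[key] = value


def load_items(path):
    """Read all key/value pairs back into a plain in-memory dict."""
    with shelve.open(str(path)) as db:
        return dict(db)


# Round trip through a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "store"
    save_items(store_path, {"c": [C(), C(a=42)]})
    restored = load_items(store_path)

# The restored values are real C instances, not dicts.
print(type(restored["c"][1]).__name__, restored["c"][1].return_a())  # C 42
```

Two caveats apply to any pickle-based approach: the class definition must be importable when the data is loaded, and pickled data from untrusted sources should never be loaded.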

