Efficiently growing a DataFrame in pandas

Question

I am iteratively generating DataFrames that look like this:

              RIC RICRoot ISIN ExpirationDate                      Exchange           ...            OpenInterest  BlockVolume  TotalVolume2  SecurityDescription  SecurityLongDescription
closingDate                                                                           ...                                                                                                 
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                     NaN         None          None       SP500 IDX MAR0                     None

I turn each one into a DataFrame with MultiIndex columns:

tmp.columns = pd.MultiIndex.from_arrays( [ [contract]*len(tmp.columns), tmp.columns.tolist() ] )

where contract is the reference name for that data. You can see it in the output below as SPH0:

    SPH0                                                                     ...                                                                                            
              RIC RICRoot ISIN ExpirationDate                      Exchange           ...           OpenInterest BlockVolume TotalVolume2 SecurityDescription SecurityLongDescription
closingDate                                                                           ...                                                                                            
2018-03-15   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-16   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-19   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-20   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None
2018-03-21   SPH0      SP          2020-03-20  CME:Index and Options Market           ...                    NaN        None         None      SP500 IDX MAR0                    None

I currently have a very inefficient way of merging these DataFrames:

# runs once per contract inside the loop
if df is None:
    df = tmp
else:
    df = df.merge(tmp, how='outer', left_index=True, right_index=True)

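This is very slow. I would like to store all of these temporary DataFrames in an associative-map style, keyed by their respective contract names, and be able to reference their data easily in a vectorized way. What is the best solution? Does it matter whether the frame grows horizontally or vertically?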

Tags: python, pandas, numpy, dataframe

Answer


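IIUC, you can use pd.concat() and pass it a list of your DataFrames together with the keys for the resulting MultiIndex DataFrame. Take the following sample DataFrames: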

import pandas as pd

df1 = pd.DataFrame([                                                                                            
['2018-03-11',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-12',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-15',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-23',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'SPH0',      'SP',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df2 = pd.DataFrame([                                                                                            
['2018-03-15',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-22',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-24',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'HAB3',      'HA',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

df3 = pd.DataFrame([                                                                                            
['2018-03-15',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-16',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-18',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-20',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market'],
['2018-03-21',   'UHA6',      'UH',          '2020-03-20',  'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])

Now call pd.concat():

pd.concat([df1, df2, df3], keys=['SPH0','HAB3','UHA6'])

This yields:

       closingDate              ...                                   Exchange
SPH0 0  2018-03-11              ...               CME:Index and Options Market
     1  2018-03-12              ...               CME:Index and Options Market
     2  2018-03-15              ...               CME:Index and Options Market
     3  2018-03-23              ...               CME:Index and Options Market
     4  2018-03-24              ...               CME:Index and Options Market
HAB3 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-22              ...               CME:Index and Options Market
     3  2018-03-24              ...               CME:Index and Options Market
     4  2018-03-20              ...               CME:Index and Options Market
UHA6 0  2018-03-15              ...               CME:Index and Options Market
     1  2018-03-16              ...               CME:Index and Options Market
     2  2018-03-18              ...               CME:Index and Options Market
     3  2018-03-20              ...               CME:Index and Options Market
     4  2018-03-21              ...               CME:Index and Options Market
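
To address the "reference their data easily" part of the question, here is a minimal sketch of how one contract's rows might be pulled back out of a frame built this way. The two small frames are illustrative stand-ins with made-up values, not the asker's data, and the variable names are assumptions.

```python
import pandas as pd

# Two tiny illustrative frames (made-up values, same idea as df1/df2 above).
df1 = pd.DataFrame({'closingDate': ['2018-03-11', '2018-03-12'],
                    'RIC': ['SPH0', 'SPH0']})
df2 = pd.DataFrame({'closingDate': ['2018-03-15', '2018-03-16'],
                    'RIC': ['HAB3', 'HAB3']})

# keys become the outer level of the row MultiIndex.
combined = pd.concat([df1, df2], keys=['SPH0', 'HAB3'])

# All rows for one contract, selected through the outer index level.
sph0 = combined.loc['SPH0']

# Cross-section on the inner level: row 0 of every contract.
first_rows = combined.xs(0, level=1)
```

Here combined.loc['SPH0'] returns the original rows of that contract with their original integer index, and xs() slices across contracts without an explicit loop.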

You can also use a list comprehension to build the list of DataFrames to pass to pd.concat(), for example:

my_keys = ['SPH0','HAB3','UHA6']
dfs = [create_df(key) for key in my_keys]
pd.concat(dfs, keys=my_keys)

where the function create_df() returns the DataFrame for a given key.
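
If the side-by-side layout from the question is what you want (contract names as the top column level, rows aligned on closingDate), the same idea should work with axis=1. The sketch below is only an illustration: this create_df() is a hypothetical stand-in that fabricates two rows per contract, and the dict named frames is an assumption about how the pieces might be collected. The point is that the loop only gathers the pieces and pd.concat() runs once at the end, whereas merging inside the loop rebuilds the accumulated frame on every iteration.

```python
import pandas as pd

def create_df(contract):
    """Hypothetical stand-in for whatever produces one contract's data."""
    dates = pd.to_datetime(['2018-03-15', '2018-03-16'])
    return pd.DataFrame(
        {'RIC': [contract] * 2, 'RICRoot': [contract[:2]] * 2},
        index=pd.Index(dates, name='closingDate'),
    )

my_keys = ['SPH0', 'HAB3', 'UHA6']

# Collect the pieces in a dict keyed by contract name; no merging yet.
frames = {key: create_df(key) for key in my_keys}

# A single concat at the end. When a dict is passed, its keys are used
# as the outer level; axis=1 puts the contracts side by side and aligns
# the rows on the closingDate index (outer join by default).
wide = pd.concat(frames, axis=1)

sph0 = wide['SPH0']                          # all columns of one contract
all_rics = wide.xs('RIC', axis=1, level=1)   # one field across all contracts
```

The dict already gives the associative, per-contract lookup asked for, and the concatenated frame gives the vectorized view across contracts.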

