python - Growing a DataFrame efficiently in pandas
Problem description
On each iteration, I generate a DataFrame that looks like this:
RIC RICRoot ISIN ExpirationDate Exchange ... OpenInterest BlockVolume TotalVolume2 SecurityDescription SecurityLongDescription
closingDate ...
2018-03-15 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-16 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-19 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-20 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-21 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
I turn it into a DataFrame with MultiIndex columns:
tmp.columns = pd.MultiIndex.from_arrays( [ [contract]*len(tmp.columns), tmp.columns.tolist() ] )
where contract is the reference name for that data, which appears as SPH0 in the output below:
SPH0 ...
RIC RICRoot ISIN ExpirationDate Exchange ... OpenInterest BlockVolume TotalVolume2 SecurityDescription SecurityLongDescription
closingDate ...
2018-03-15 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-16 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-19 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-20 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
2018-03-21 SPH0 SP 2020-03-20 CME:Index and Options Market ... NaN None None SP500 IDX MAR0 None
I currently merge these DataFrames in a very inefficient way:
if df is None:
    df = tmp
else:
    df = df.merge(tmp, how='outer', left_index=True, right_index=True)
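This is very slow. I want to store all of these temporary DataFrames together with their respective contract names in an associative, map-like structure, and be able to reference their data easily in a vectorized way. What is the best solution? Does it matter whether I grow the result horizontally or vertically?
Solution
IIUC, you can use pd.concat() and pass it your list of DataFrames along with the keys for the resulting MultiIndex DataFrame. Take the following sample DataFrames: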
import pandas as pd
df1 = pd.DataFrame([
['2018-03-11', 'SPH0', 'SP', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-12', 'SPH0', 'SP', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-15', 'SPH0', 'SP', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-23', 'SPH0', 'SP', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-24', 'SPH0', 'SP', '2020-03-20', 'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])
df2 = pd.DataFrame([
['2018-03-15', 'HAB3', 'HA', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-16', 'HAB3', 'HA', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-22', 'HAB3', 'HA', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-24', 'HAB3', 'HA', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-20', 'HAB3', 'HA', '2020-03-20', 'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])
df3 = pd.DataFrame([
['2018-03-15', 'UHA6', 'UH', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-16', 'UHA6', 'UH', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-18', 'UHA6', 'UH', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-20', 'UHA6', 'UH', '2020-03-20', 'CME:Index and Options Market'],
['2018-03-21', 'UHA6', 'UH', '2020-03-20', 'CME:Index and Options Market']],
columns=['closingDate', 'RIC', 'RICRoot', 'ExpirationDate', 'Exchange'])
Now call pd.concat():
pd.concat([df1, df2, df3], keys=['SPH0','HAB3','UHA6'])
Yields:
closingDate ... Exchange
SPH0 0 2018-03-11 ... CME:Index and Options Market
1 2018-03-12 ... CME:Index and Options Market
2 2018-03-15 ... CME:Index and Options Market
3 2018-03-23 ... CME:Index and Options Market
4 2018-03-24 ... CME:Index and Options Market
HAB3 0 2018-03-15 ... CME:Index and Options Market
1 2018-03-16 ... CME:Index and Options Market
2 2018-03-22 ... CME:Index and Options Market
3 2018-03-24 ... CME:Index and Options Market
4 2018-03-20 ... CME:Index and Options Market
UHA6 0 2018-03-15 ... CME:Index and Options Market
1 2018-03-16 ... CME:Index and Options Market
2 2018-03-18 ... CME:Index and Options Market
3 2018-03-20 ... CME:Index and Options Market
4 2018-03-21 ... CME:Index and Options Market
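The question builds the contract level on the columns rather than on the rows. As a rough sketch of that horizontal layout, pd.concat() can also be called with axis=1, in which case the keys become the top column level and the rows are outer-joined on the index. The helper make_frame() and the trimmed-down columns are assumptions made for illustration; they are not part of the original post.

```python
import pandas as pd

def make_frame(ric, dates):
    # Hypothetical helper: one small frame per contract, indexed by closingDate.
    return pd.DataFrame(
        {"RIC": ric, "ExpirationDate": "2020-03-20"},
        index=pd.Index(dates, name="closingDate"),
    )

frames = {
    "SPH0": make_frame("SPH0", ["2018-03-11", "2018-03-12", "2018-03-15"]),
    "HAB3": make_frame("HAB3", ["2018-03-15", "2018-03-16"]),
}

# axis=1 places the frames side by side; the dict keys become the top
# level of the column MultiIndex, and dates missing from a frame become NaN.
wide = pd.concat(frames, axis=1)

# All columns of one contract:
sph0 = wide["SPH0"]

# One field across every contract (cross-section on the inner column level):
rics = wide.xs("RIC", axis=1, level=1)
```

This should give the same shape as the repeated outer merge on the index, but with a single concatenation instead of one merge per contract.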
You can also use a list comprehension to build the list of DataFrames to pass to pd.concat(), for example:
my_keys = ['SPH0','HAB3','UHA6']
dfs = [create_df(key) for key in my_keys]
pd.concat(dfs, keys=my_keys)
where the function create_df() returns a DataFrame.
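If the frames arrive one at a time inside a loop, one possible pattern is to collect them in a plain dict keyed by contract name and concatenate once at the end. pd.concat() accepts a mapping and uses its keys as the outer index level. In the sketch below, create_df() is a hypothetical stand-in that returns a tiny frame, and the level names "contract" and "row" are chosen only for illustration.

```python
import pandas as pd

def create_df(key):
    # Hypothetical stand-in for whatever produces one contract's DataFrame.
    return pd.DataFrame(
        {"closingDate": ["2018-03-15", "2018-03-16"], "RIC": [key, key]}
    )

my_keys = ["SPH0", "HAB3", "UHA6"]

frames = {}
for key in my_keys:
    # Storing a reference is cheap; nothing already collected is copied.
    frames[key] = create_df(key)

# A single concatenation at the end; the dict keys form the outer index level.
combined = pd.concat(frames, names=["contract", "row"])

# Look up one contract by name:
sph0 = combined.loc["SPH0"]

# Vectorized operation across all contracts at once:
rows_per_contract = combined.groupby(level="contract").size()
```

The dict gives the map-style lookup by contract name while the loop runs, and the combined frame keeps that lookup available through .loc afterwards.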