How do I split data in a DataFrame grouped by two keys?

Problem description

I have a large DataFrame with two key columns, 'date' and 'con':

In [28]: df = pd.read_csv('~/futures_min_all.csv')

In [29]: df
Out[29]: 
            open   close    high     low     tvr    oi  vol  ticker      date     tme con
0         2854.0  2850.0  2854.0  2850.0  5696.0  1226    2  MA1506  20140618   93000  MA
1         2854.0  2854.0  2854.0  2854.0  5698.0  1228    2  MA1506  20140618   93100  MA
2         2854.0  2854.0  2854.0  2854.0     0.0  1228    0  MA1506  20140618   93200  MA
3         2854.0  2854.0  2854.0  2854.0     0.0  1228    0  MA1506  20140618   93300  MA
4         2854.0  2851.0  2854.0  2851.0  5698.0  1228    2  MA1506  20140618   93400  MA
...          ...     ...     ...     ...     ...   ...  ...     ...       ...     ...  ..
27277818  2359.0  2359.0  2359.0  2359.0     0.0     0    0  PM2105  20201223  145600  PM
27277819  2359.0  2359.0  2359.0  2359.0     0.0     0    0  PM2105  20201223  145700  PM
27277820  2359.0  2359.0  2359.0  2359.0     0.0     0    0  PM2105  20201223  145800  PM
27277821  2359.0  2359.0  2359.0  2359.0     0.0     0    0  PM2105  20201223  145900  PM
27277822  2359.0  2359.0  2359.0  2359.0     0.0     0    0  PM2105  20201223  150000  PM

The date range is different for each con.

In [30]: df.groupby(['con']).apply(lambda x: set(x['date']))
Out[30]: 
con
A     {20160512, 20160513, 20160516, 20160517, 20160...
AG    {20160512, 20160513, 20160516, 20160517, 20160...
AL    {20160512, 20160513, 20160516, 20160517, 20160...
AP    {20181008, 20181009, 20181010, 20181011, 20181...
AU    {20160512, 20160513, 20160516, 20160517, 20160...
                            ...                        
WH    {20160512, 20160513, 20160516, 20160517, 20160...
WR    {20160512, 20160513, 20160516, 20160517, 20160...
Y     {20160512, 20160513, 20160516, 20160517, 20160...
ZC    {20160512, 20160513, 20160516, 20160517, 20160...
ZN    {20160512, 20160513, 20160516, 20160517, 20160...
Length: 68, dtype: object

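I want to split the DataFrame into two parts:

the first 70% and the last 30%

Let me explain:

For con == 'IC', the 'date' range is [20200101 - 20201230] (about 365 days)

The first 70% for IC would be roughly [20200101 - 20200820]

For con == 'NI', the 'date' range might be [20200801 - 20201230]

In the first 70%, the 'NI' rows would only keep dates in [20200801 - 20201020]

P.S. I don't think head(0.7 * len(df)) will solve this, because I want 70% of set(date), not 70% of the rows of df.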
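To make the distinction concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and not taken from the question. It compares a row-based cut with a date-based cut for a single con whose first date has many more rows than the others.

```python
import pandas as pd

# Hypothetical toy data: one con, 10 distinct dates,
# with 10 rows on the first date and 1 row on each of the others.
toy = pd.DataFrame({
    "con": ["IC"] * 19,
    "date": [20200101] * 10 + list(range(20200102, 20200111)),
})

# Row-based cut: the first 70% of the rows.
by_rows = toy.head(int(0.7 * len(toy)))

# Date-based cut: rows whose date is among the first 70% of the sorted unique dates.
dates = sorted(toy["date"].unique())
first_dates = dates[: int(0.7 * len(dates))]
by_dates = toy[toy["date"].isin(first_dates)]

print(by_rows["date"].nunique(), "dates covered by the row-based cut")
print(by_dates["date"].nunique(), "dates covered by the date-based cut")
```

Because rows are unevenly distributed across dates, the two cuts cover different date ranges, which is why the split has to be computed on the unique dates of each con.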

I think there must be some elegant way to split the original df the way I want. Can you help?

Tags: python, pandas

Solution


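Here is a modified solution that selects, for each con, the rows whose date falls in the first 70% of that con's unique dates (sorted chronologically). The original index is kept so that the remaining rows can be selected afterwards:

# keep rows whose date is among the first 70% of this con's sorted unique dates
f = lambda x: x[x['date'].isin(sorted(pd.unique(x['date']))[:int(x['date'].nunique() * .7)])]
df1 = df.groupby('con', group_keys=False).apply(f)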

Finally, select all rows that were not matched; these form the last 30%:

df2 = df[~df.index.isin(df1.index)]
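As an alternative sketch that avoids `groupby(...).apply`, the same split can be expressed with a dense rank of each row's date within its con, compared against 70% of that con's number of unique dates. The helper name `split_by_date_fraction` and the `toy` data are hypothetical, and the sketch assumes the 'date' values sort chronologically (which holds for YYYYMMDD integers).

```python
import pandas as pd

def split_by_date_fraction(df, frac=0.7, key="con", date="date"):
    """Per key, send rows on the earliest `frac` of unique dates to the first part."""
    grouped = df.groupby(key)[date]
    # 1-based position of each row's date among that key's sorted unique dates
    date_rank = grouped.rank(method="dense")
    # number of unique dates for the row's key, broadcast to every row
    n_dates = grouped.transform("nunique")
    in_first = date_rank <= (n_dates * frac).astype(int)
    return df[in_first], df[~in_first]

# Hypothetical toy data: IC has 10 dates with 2 rows each, NI has 5 dates with 1 row each.
ic_dates = list(range(20200101, 20200111))
ni_dates = list(range(20200801, 20200806))
toy = pd.DataFrame({
    "con": ["IC"] * 20 + ["NI"] * 5,
    "date": ic_dates * 2 + ni_dates,
})

first, rest = split_by_date_fraction(toy)
print(first.groupby("con")["date"].agg(["min", "max", "nunique"]))
print(rest.groupby("con")["date"].agg(["min", "max", "nunique"]))
```

Since the two parts are built from a single boolean mask, they are complementary by construction and do not depend on the index being unique.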
