Reshaping data in pandas with double-string column headers

Question

I have the following DataFrame:

                    AAPL,Price AAPL,Volume  ...  GOOG,Ivol  GOOG,Shares
  Date                                                                
0 2019-12-25              21.5        1879            20.0         2010                             
1 2019-12-26              22.1        1887            19.9         2000                              
2 2019-12-27              23.0        1888            19.9         2045       
3 2019-12-30              22.3        1887            NaN          2050                                
4 2019-12-31              22.4        1900            20.1         1998       

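I need to run a regression on each feature of these stocks, and I also have to append a specific set of dummy variables. My goal is therefore to reshape the data so that it has a two-level index, with the date as the first level and the stock name as the second, i.e.: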

Date        Stock  Price  Volume ... Ivol Shares
2019-12-25  AAPL    21.5    1879 ... 22.1   3121
...         ...    ...    ...    ... ...  ... 
2019-12-25  GOOG    45.8    NaN  ... 20.0   2000 
...
2019-12-25   VER     NaN    NaN  ...  NaN    NaN   
2019-12-26  AAPL   ...  
...                                           
2019-12-31   VER    42.4    1900 ... 50.1   1998 

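My problem is that I don't know how to handle the column names: they are strings of the form "Stock1,Feature1", so I can't see how to build the proper two-level index from them. Can anyone help? Note that the features are the same for every stock, i.e. the list of features does not change from stock to stock. I'm open to other kinds of solutions for reshaping the data as well.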

Tags: pandas, reshape, multi-index

Answer


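You can reshape it with pd.wide_to_long. To get the right format, I would first rewrite the column Index so that 'AAPL,Price' becomes 'Price,AAPL', which puts the feature name first and lets it act as the stub name: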

import pandas as pd

# Swap 'Stock,Feature' -> 'Feature,Stock' so each feature becomes a stub name.
df.columns = [','.join(x[::-1]) for x in df.columns.str.split(',')]
stubs = df.columns.str.split(',').str[0].unique().tolist()
# ['Price', 'Volume', 'Ivol', 'Shares']

df = df.reset_index('Date')  # wide_to_long needs 'Date' as a column, not in the index
df = pd.wide_to_long(df, i='Date', j='Stock', stubnames=stubs, sep=',', suffix='.*')

# Resulting df, based on the 4 columns shown in the question:

                  Price  Volume  Ivol  Shares
Date       Stock                             
2019-12-25 AAPL    21.5  1879.0   NaN     NaN
           GOOG     NaN     NaN  20.0  2010.0
2019-12-26 AAPL    22.1  1887.0   NaN     NaN
           GOOG     NaN     NaN  19.9  2000.0
2019-12-27 AAPL    23.0  1888.0   NaN     NaN
           GOOG     NaN     NaN  19.9  2045.0
2019-12-30 AAPL    22.3  1887.0   NaN     NaN
           GOOG     NaN     NaN   NaN  2050.0
2019-12-31 AAPL    22.4  1900.0   NaN     NaN
           GOOG     NaN     NaN  20.1  1998.0
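
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the wide_to_long approach. The sample frame is an assumption: it is rebuilt by hand from the four columns visible in the question, with 'Date' placed in the index, and the name long_df is just illustrative.

```python
import numpy as np
import pandas as pd

# Assumed sample data: only the four columns visible in the question.
dates = pd.to_datetime(
    ["2019-12-25", "2019-12-26", "2019-12-27", "2019-12-30", "2019-12-31"]
).rename("Date")
df = pd.DataFrame(
    {
        "AAPL,Price": [21.5, 22.1, 23.0, 22.3, 22.4],
        "AAPL,Volume": [1879, 1887, 1888, 1887, 1900],
        "GOOG,Ivol": [20.0, 19.9, 19.9, np.nan, 20.1],
        "GOOG,Shares": [2010, 2000, 2045, 2050, 1998],
    },
    index=dates,
)

# Swap 'Stock,Feature' -> 'Feature,Stock' so the feature is the stub name.
df.columns = [",".join(parts[::-1]) for parts in df.columns.str.split(",")]
stubs = df.columns.str.split(",").str[0].unique().tolist()

# 'Date' must be a regular column; the text after the comma becomes 'Stock'.
long_df = pd.wide_to_long(
    df.reset_index(), i="Date", j="Stock", stubnames=stubs, sep=",", suffix=".*"
).sort_index()

print(long_df)
```

Because this sample has Price and Volume only for AAPL and Ivol and Shares only for GOOG, the missing combinations come out as NaN; with the full data, where every stock has every feature, those cells would be populated.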

Alternatively, and perhaps more simply, you can turn the columns into a MultiIndex and then stack the stock level:

df.columns = pd.MultiIndex.from_tuples([tuple(x) for x in df.columns.str.split(',')])

df = df.stack(level=0) # Index level won't have a name.
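
A self-contained sketch of the MultiIndex-and-stack route follows. Again, the sample frame is an assumption reconstructed from the four columns in the question, and the level names "Stock" and "Feature" are my own labels rather than anything from the original data. Naming the levels is one way to avoid ending up with an unnamed index level after stacking.

```python
import numpy as np
import pandas as pd

# Assumed sample data: only the four columns visible in the question.
dates = pd.to_datetime(
    ["2019-12-25", "2019-12-26", "2019-12-27", "2019-12-30", "2019-12-31"]
).rename("Date")
df = pd.DataFrame(
    {
        "AAPL,Price": [21.5, 22.1, 23.0, 22.3, 22.4],
        "AAPL,Volume": [1879, 1887, 1888, 1887, 1900],
        "GOOG,Ivol": [20.0, 19.9, 19.9, np.nan, 20.1],
        "GOOG,Shares": [2010, 2000, 2045, 2050, 1998],
    },
    index=dates,
)

# Split 'Stock,Feature' strings into a two-level column index with named levels.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split(",")) for col in df.columns], names=["Stock", "Feature"]
)

# Move the 'Stock' level from the columns into the row index.
long_df = df.stack(level="Stock").sort_index()

print(long_df)
```

The column order after stack can differ between pandas versions, so select columns by name rather than by position if the order matters for the regression.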
