Stacking a DataFrame with a multi-level column index

Problem description

I have a pandas DataFrame with a multi-level column index. I am trying to stack the data so that it looks like the "expected" DataFrame.

I tried the .stack method, but it does not stack the data exactly the way the second DataFrame shows.

import pandas as pd

df = pd.read_excel(
    "https://testme162.s3.amazonaws.com/just_stack.xlsx", header=list(range(5))
)

expected = pd.read_excel(
    "https://testme162.s3.amazonaws.com/just_stack.xlsx", sheet_name="expected",
)
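For context, `.stack()` alone moves one column level of a MultiIndex into the row index; which level you stack determines the shape you get. A minimal sketch with toy data (the `b1`/`b2` buckets and values here are hypothetical, not the real file):

```python
import pandas as pd

# Hypothetical miniature of the wide frame: one row per key,
# with a (bucket, field) MultiIndex on the columns.
df = pd.DataFrame(
    {
        ("b1", "filename"): ["a.gz"],
        ("b1", "mycount"): [66],
        ("b2", "filename"): ["b.gz"],
        ("b2", "mycount"): [16],
    },
    index=pd.Index(["cwl-2020.07.23"], name="key"),
)

# stack() moves the innermost column level into the index by default;
# stacking level 0 instead is what brings the bucket label down,
# leaving one (key, bucket) row with filename/mycount columns.
long = df.stack(level=0)
print(long)
```

This shows why a plain `df.stack()` on the real file does not immediately match the expected frame: the level being stacked has to be the bucket level, and the subtotal columns still have to be removed first.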

Here is an image explaining what the data looks like:

[Image: original and expected DataFrames]

As of July 23, about 82 records were found. 2 files were uploaded into 2 (s3) buckets. The totals may or may not match (66 + 16 = 82). The bucket, filename, and its count need to be stacked per day. There is only 1 row per day.

Tags: pandas

Solution


Problem setup

Essentially, you start with a transposed and nested DataFrame that packs all grouped rows, together with totals and subtotals, into a single row.

Your expected result is a reconstruction of the raw data, aggregated in a different way.

Approach

So I believe the "best practice" solution is to break this data back down into its raw form (filenames and doc_counts) and work from there.

### split df on columns by bucket, setting (key, bucket) as the index,
### then concatenate on axis=0
df.set_index(df.columns[0], inplace=True)
clean_dfs = []
j_bucket1, j_bucket2 = 3, 15  # input the column positions where each bucket's block starts
for df1 in (df.iloc[:, j_bucket1:j_bucket2], df.iloc[:, j_bucket2:]):
    df1 = df1.set_index(df1.columns[0], append=True)  # include the bucket in the index
    df1 = df1.drop(df1.columns[0:3], axis=1)  # drop the subtotals hardcoded into df
    df1.columns = ['filename', 'mycount'] * 4  # rename columns, preparing for concatenation and stacking
    clean_dfs.append(df1)
out = pd.concat(clean_dfs).rename_axis(['key', 'bucket'])  # name the MultiIndex levels

### this is the stack you were referring to in the question
out = out.stack().dropna().rename('value').to_frame().reset_index(-1, drop=False)
### reset the index level containing the value type (filename / mycount) so it is
### out of the index before concatenating; leaving it in would duplicate the
### key+bucket rows again. It is now the first column of 'out' -- pass it to groupby.
out = pd.concat(
    [out1.rename(var) for var, out1 in out.groupby(out.columns[0])['value']],
    axis=1,
).reset_index()

print(out)
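The final `groupby` + `concat` step above is a long-to-wide pivot: one output column per value type, aligned on the shared (key, bucket) index. A minimal sketch on toy data (the `k1`/`b1`/`b2` names and the `kind` column label are illustrative, not the real values):

```python
import pandas as pd

# Toy version of the long frame: a 'kind' column saying whether
# 'value' holds a filename or a count, indexed by (key, bucket).
out = pd.DataFrame(
    {
        "kind": ["filename", "mycount", "filename", "mycount"],
        "value": ["a.gz", 66, "b.gz", 16],
    },
    index=pd.MultiIndex.from_tuples(
        [("k1", "b1")] * 2 + [("k1", "b2")] * 2, names=["key", "bucket"]
    ),
)

# Group the 'value' series by kind, rename each group after its kind,
# and concatenate the groups side by side on the shared index.
wide = pd.concat(
    [grp.rename(kind) for kind, grp in out.groupby("kind")["value"]],
    axis=1,
).reset_index()
print(wide)
```

Each group keeps its original (key, bucket) index, so `pd.concat(..., axis=1)` lines the filename and count back up on the same row.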

Output

               key        bucket                                           filename mycount
0   cwl-2020.07.22    tempexport                          email_newsletters_old.csv    7327
1   cwl-2020.07.22    tempexport                               kyc_match_api.csv.gz    2053
2   cwl-2020.07.18  eiva-temp-s3                           perconabackupncpr.tar.gz    2109
3   cwl-2020.07.23    tempexport                                             000.gz      66
4   cwl-2020.07.20    tempexport                                             000.gz      33
5   cwl-2020.07.20    tempexport                                     svc_dlr.csv.gz       1
6   cwl-2020.07.20    tempexport                               svc_receiving.csv.gz       1
7   cwl-2020.07.20    tempexport                                 svc_sending.csv.gz       1
8   cwl-2020.07.25      mmm_eiva  mmm_servers_backup/tomcat_233/schedule_backup/...      15
9   cwl-2020.07.22      mmm_eiva  mmm_servers_backup/tomcat_233/schedule_backup/...      15
10  cwl-2020.07.18      mmm_eiva  mmm_servers_backup/tomcat_233/schedule_backup/...      15
11  cwl-2020.07.23      mmm_eiva  mmm_servers_backup/tomcat_233/schedule_backup/...      16
12  cwl-2020.07.20      mmm_eiva  mmm_servers_backup/tomcat_233/schedule_backup/...      15
13  cwl-2020.07.20      mmm_eiva  mmm_servers_backup/app_db_198/db/email360_emai...       1

Now we have clean data to work with. Compute the subtotals per key and per bucket and insert them into the DataFrame:

### calculate doc_count totals per key and per (key, bucket)
### (np.int is removed in recent NumPy; plain int works the same here)
out.insert(loc=1, column='doc_count',
    value=out.groupby('key').mycount.transform('sum').astype(int))
out.insert(loc=3, column='b_doc_count',
    value=out.groupby(['key', 'bucket']).mycount.transform('sum').astype(int))

### sort to preference; one call with all three keys keeps the
### tie-breaking order explicit (repeated sorts are only equivalent
### when the sort is stable, which pandas' default is not)
out.sort_values(['doc_count', 'b_doc_count', 'mycount'],
    ascending=False, inplace=True)
print(out)

Output

               key  doc_count        bucket  b_doc_count                                           filename mycount
0   cwl-2020.07.22       9395    tempexport         9380                          email_newsletters_old.csv    7327
1   cwl-2020.07.22       9395    tempexport         9380                               kyc_match_api.csv.gz    2053
9   cwl-2020.07.22       9395      mmm_eiva           15  mmm_servers_backup/tomcat_233/schedule_backup/...      15
2   cwl-2020.07.18       2124  eiva-temp-s3         2109                           perconabackupncpr.tar.gz    2109
10  cwl-2020.07.18       2124      mmm_eiva           15  mmm_servers_backup/tomcat_233/schedule_backup/...      15
3   cwl-2020.07.23         82    tempexport           66                                             000.gz      66
11  cwl-2020.07.23         82      mmm_eiva           16  mmm_servers_backup/tomcat_233/schedule_backup/...      16
4   cwl-2020.07.20         52    tempexport           36                                             000.gz      33
5   cwl-2020.07.20         52    tempexport           36                                     svc_dlr.csv.gz       1
6   cwl-2020.07.20         52    tempexport           36                               svc_receiving.csv.gz       1
7   cwl-2020.07.20         52    tempexport           36                                 svc_sending.csv.gz       1
12  cwl-2020.07.20         52      mmm_eiva           16  mmm_servers_backup/tomcat_233/schedule_backup/...      15
13  cwl-2020.07.20         52      mmm_eiva           16  mmm_servers_backup/app_db_198/db/email360_emai...       1
8   cwl-2020.07.25         15      mmm_eiva           15  mmm_servers_backup/tomcat_233/schedule_backup/...      15
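The subtotal trick above relies on `transform('sum')` returning a result aligned row-for-row with the original frame, so it can be inserted as an ordinary column. A minimal sketch with toy data (the `k1`/`b1`/`b2` values are hypothetical):

```python
import pandas as pd

# Toy clean frame: three files across two buckets under one key.
out = pd.DataFrame(
    {
        "key": ["k1", "k1", "k1"],
        "bucket": ["b1", "b1", "b2"],
        "mycount": [66, 10, 16],
    }
)

# transform('sum') broadcasts each group's total back to every row of that
# group, so the per-key and per-(key, bucket) totals slot in as new columns.
out.insert(1, "doc_count", out.groupby("key")["mycount"].transform("sum"))
out.insert(3, "b_doc_count",
           out.groupby(["key", "bucket"])["mycount"].transform("sum"))
print(out)
```

Unlike `agg`, which would collapse each group to one row, `transform` keeps the frame's shape, which is exactly what makes the `insert` calls line up.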
