首页 > 解决方案 > 多索引数据帧中数据的年份总和

问题描述

我有一个多索引数据框

df

                        2021-06-28                                                                         2021-07-05      
                        total_orders    total_sales     last_year_sales     last_year_total_orders         total_orders     total_sales     last_year_sales     last_year_total_orders
group       category                                
A           Amazon      195.000         1,268.850       5,194.950           195.000                         ...            ...              ...                 ...
            Netflix     37.000          15,323.800      11,461.250          20.00
            Apple       61.000          18,274.385      19,102.220          30.00   
            Facebook    106.000         19,722.650      22,796.420          50.00
            Tesla       13.000          1,688.675       992.600             25.00
            Uber        4.000           1,906.000       3,671.150           15.00
            Google      61.000          55,547.255      30,853.115          10.00
            total       477.000         113,731.615     94,071.705          56.00
B           Amazon      50.000          3,219.650       6,022.300           400.00
            Netflix     17.000          5,918.500       4,749.000           30.00
            Apple       50.000          15,852.060      7,373.600           27.00
            Facebook    75.000          17,743.700      33,514.000          15.00
            Tesla       14.000          1,708.750       902.010             66.00
            Uber        3.000           937.010         349.300             94.00
            Google      43.000          37,795.150      35,943.450          65.00
            total       252.000         83,174.820      88,853.660          61.00

我正在尝试创建一个从今天开始的year_to_date专栏。正如你在上面看到的,我每周有 4 列,从本周开始,我有几周的时间。因此,从我开始的每个星期都会有一个列,它将简单地保存从 开始的所有前几周的累积总和。sumsumtotal_sales2021-06-282020-06-292021-10-112021-06-28sales2021-06-28

我希望结果看起来像这样(我删除了一些列以获得更好的视觉效果):

                        2021-06-28                              2021-07-05                                                              2021-07-12
                        total_orders    total_sales             total_orders     total_sales            year_to_date_sales              total_orders     total_sales            year_to_date_sales     
group       category                                
A           Amazon      195.000         1,268.850                ...             1000                   1,268.850 + 1000                ...              5000                   1,268.850 + 1000 + 5000
            Netflix     37.000          15,323.800     
            Apple       61.000          18,274.385               ...             2000                   15,323.800 + 2000               ...              6000                   1,268.850 + 1000 + 6000
            Facebook    106.000         19,722.650     
            Tesla       13.000          1,688.675      
            Uber        4.000           1,906.000      
            Google      61.000          55,547.255     
            total       477.000         113,731.615    
B           Amazon      50.000          3,219.650      
            Netflix     17.000          5,918.500      
            Apple       50.000          15,852.060     
            Facebook    75.000          17,743.700     
            Tesla       14.000          1,708.750      
            Uber        3.000           937.010        
            Google      43.000          37,795.150     
            total       252.000         83,174.820     

我试过了:

# Adding year to date sales

s = df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')] 
s = np.sum(s, axis = 1)
s = s.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
df = df.combine_first(s)

# I tried ['2021-06-28':'2021-10-11'] to select all columns from - to but it does not let me

我假设我会按行使用np.sum(s, axis = 1)sum,所以我会得到每个groupand的结果category。但是目前,我无法选择整个感兴趣的范围,我的方法会为每周而不是下周创建相同的值,而不是2021-06-28累积总和。

我怎样才能达到这样的结果?

更新

使用后

df1 = (df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
        .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))

df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)

由于某种原因,我没有添加新列:

df.loc[:,'2021-07-05'].columns


MultiIndex([('2021-07-05',       'total_orders'),
            ('2021-07-05',        'total_sales'),
            ('2021-07-05',        'last_year_sales'),
            ('2021-07-05',        'last_year_total_orders')]
           names=['created_at', None])

当我检查df1它只有一列的内容时2021-06-28,我预计每周都有多列。决赛中的新栏目df仅存在于2021-06-28未来的其他日期,但不存在。

df data

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'A',
  8: 'A',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B',
  14: 'B',
  15: 'B',
  16: 'B',
  17: 'B',
  18: 'all',
  19: 'all'},
 ('category', ''): {0: 'Amazon',
  1: 'Apple',
  2: 'Facebook',
  3: 'Google',
  4: 'Netflix',
  5: 'Tesla',
  6: 'Total',
  7: 'Uber',
  8: 'total',
  9: 'Amazon',
  10: 'Apple',
  11: 'Facebook',
  12: 'Google',
  13: 'Netflix',
  14: 'Tesla',
  15: 'Total',
  16: 'Uber',
  17: 'total',
  18: 'Total',
  19: 'total'},
 (pd.Timestamp('2020-06-29 00:00:00'), 'last_sales'): {0: 195.0,
  1: 61.0,
  2: 106.0,
  3: 61.0,
  4: 37.0,
  5: 13.0,
  6: 954.0,
  7: 4.0,
  8: 477.0,
  9: 50.0,
  10: 50.0,
  11: 75.0,
  12: 43.0,
  13: 17.0,
  14: 14.0,
  15: 504.0,
  16: 3.0,
  17: 252.0,
  18: 2916.0,
  19: 2916.0},
 (pd.Timestamp('2020-06-29 00:00:00'), 'total_sales'): {0: 1268.85,
  1: 18274.385000000002,
  2: 19722.65,
  3: 55547.255,
  4: 15323.800000000001,
  5: 1688.6749999999997,
  6: 227463.23,
  7: 1906.0,
  8: 113731.615,
  9: 3219.6499999999996,
  10: 15852.060000000001,
  11: 17743.7,
  12: 37795.15,
  13: 5918.5,
  14: 1708.75,
  15: 166349.64,
  16: 937.01,
  17: 83174.82,
  18: 787625.7400000001,
  19: 787625.7400000001},
 (pd.Timestamp('2020-06-29 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2020-07-06 00:00:00'), 'last_sales'): {0: 26.0,
  1: 39.0,
  2: 79.0,
  3: 49.0,
  4: 10.0,
  5: 10.0,
  6: 436.0,
  7: 5.0,
  8: 218.0,
  9: 89.0,
  10: 34.0,
  11: 133.0,
  12: 66.0,
  13: 21.0,
  14: 20.0,
  15: 732.0,
  16: 3.0,
  17: 366.0,
  18: 2336.0,
  19: 2336.0},
 (pd.Timestamp('2020-07-06 00:00:00'), 'total_sales'): {0: 3978.15,
  1: 12138.96,
  2: 19084.175,
  3: 40033.46000000001,
  4: 4280.15,
  5: 1495.1,
  6: 165548.29,
  7: 1764.15,
  8: 82774.145,
  9: 8314.92,
  10: 12776.649999999996,
  11: 28048.075,
  12: 55104.21000000002,
  13: 6962.844999999999,
  14: 3053.2000000000003,
  15: 231049.11000000002,
  16: 1264.655,
  17: 115524.55500000001,
  18: 793194.8000000002,
  19: 793194.8000000002},
 (pd.Timestamp('2020-07-06 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-06-28 00:00:00'), 'last_sales'): {0: 96.0,
  1: 56.0,
  2: 106.0,
  3: 44.0,
  4: 34.0,
  5: 13.0,
  6: 716.0,
  7: 9.0,
  8: 358.0,
  9: 101.0,
  10: 22.0,
  11: 120.0,
  12: 40.0,
  13: 13.0,
  14: 8.0,
  15: 610.0,
  16: 1.0,
  17: 305.0,
  18: 2652.0,
  19: 2652.0},
 (pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 5194.95,
  1: 19102.219999999994,
  2: 22796.420000000002,
  3: 30853.115,
  4: 11461.25,
  5: 992.6,
  6: 188143.41,
  7: 3671.15,
  8: 94071.705,
  9: 6022.299999999998,
  10: 7373.6,
  11: 33514.0,
  12: 35943.45,
  13: 4749.000000000001,
  14: 902.01,
  15: 177707.32,
  16: 349.3,
  17: 88853.66,
  18: 731701.46,
  19: 731701.46},
 (pd.Timestamp('2021-06-28 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-07-07 00:00:00'), 'last_sales'): {0: 45.0,
  1: 47.0,
  2: 87.0,
  3: 45.0,
  4: 13.0,
  5: 8.0,
  6: 494.0,
  7: 2.0,
  8: 247.0,
  9: 81.0,
  10: 36.0,
  11: 143.0,
  12: 56.0,
  13: 9.0,
  14: 9.0,
  15: 670.0,
  16: 1.0,
  17: 335.0,
  18: 2328.0,
  19: 2328.0},
 (pd.Timestamp('2021-07-07 00:00:00'), 'total_sales'): {0: 7556.414999999998,
  1: 14985.05,
  2: 16790.899999999998,
  3: 36202.729999999996,
  4: 4024.97,
  5: 1034.45,
  6: 163960.32999999996,
  7: 1385.65,
  8: 81980.16499999998,
  9: 5600.544999999999,
  10: 11209.92,
  11: 32832.61,
  12: 42137.44500000001,
  13: 3885.1499999999996,
  14: 1191.5,
  15: 194912.34000000003,
  16: 599.0,
  17: 97456.17000000001,
  18: 717745.3400000001,
  19: 717745.3400000001},
 (pd.Timestamp('2021-07-07 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0}}).set_index(['group','category'])

标签: pythonpandasmulti-index

解决方案


DataFrame.cumsum与 一起使用axis=1,添加到原始列和排序列MultiIndex

idx = pd.IndexSlice
df1 = (df.loc[:, idx['2021-06-28':'2021-10-11', 'total_sales']]
        .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))

df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)

推荐阅读