python - Year-to-date sum of data in a MultiIndex DataFrame
Problem description
I have a MultiIndex DataFrame df:
                2021-06-28                                                          2021-07-05
                total_orders  total_sales  last_year_sales  last_year_total_orders  total_orders  total_sales  last_year_sales  last_year_total_orders
group category
A     Amazon         195.000    1,268.850        5,194.950                 195.000           ...          ...              ...                     ...
      Netflix         37.000   15,323.800       11,461.250                   20.00
      Apple           61.000   18,274.385       19,102.220                   30.00
      Facebook       106.000   19,722.650       22,796.420                   50.00
      Tesla           13.000    1,688.675          992.600                   25.00
      Uber             4.000    1,906.000        3,671.150                   15.00
      Google          61.000   55,547.255       30,853.115                   10.00
      total          477.000  113,731.615       94,071.705                   56.00
B     Amazon          50.000    3,219.650        6,022.300                  400.00
      Netflix         17.000    5,918.500        4,749.000                   30.00
      Apple           50.000   15,852.060        7,373.600                   27.00
      Facebook        75.000   17,743.700       33,514.000                   15.00
      Tesla           14.000    1,708.750          902.010                   66.00
      Uber             3.000      937.010          349.300                   94.00
      Google          43.000   37,795.150       35,943.450                   65.00
      total          252.000   83,174.820       88,853.660                   61.00
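I am trying to create a year_to_date sum column of total_sales starting from 2021-06-28. As you can see above, I have 4 columns for each week, and the weeks run from 2020-06-29 to 2021-10-11. So every week from 2021-06-28 onwards would get a column that simply holds the cumulative sum of sales over all previous weeks starting from 2021-06-28.
I would like the result to look like this (I removed some columns for better visuals):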
                2021-06-28                 2021-07-05                                     2021-07-12
                total_orders  total_sales  total_orders  total_sales  year_to_date_sales  total_orders  total_sales       year_to_date_sales
group category
A     Amazon         195.000    1,268.850           ...         1000    1,268.850 + 1000           ...         5000  1,268.850 + 1000 + 5000
      Netflix         37.000   15,323.800
      Apple           61.000   18,274.385           ...         2000   15,323.800 + 2000           ...         6000  1,268.850 + 1000 + 6000
      Facebook       106.000   19,722.650
      Tesla           13.000    1,688.675
      Uber             4.000    1,906.000
      Google          61.000   55,547.255
      total          477.000  113,731.615
B     Amazon          50.000    3,219.650
      Netflix         17.000    5,918.500
      Apple           50.000   15,852.060
      Facebook        75.000   17,743.700
      Tesla           14.000    1,708.750
      Uber             3.000      937.010
      Google          43.000   37,795.150
      total          252.000   83,174.820
I tried:
# Adding year to date sales
s = df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
s = np.sum(s, axis = 1)
s = s.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
df = df.combine_first(s)
# I tried ['2021-06-28':'2021-10-11'] to select all columns from - to but it does not let me
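I assumed I would sum row-wise with np.sum(s, axis = 1), so I would get a result for each group and category. But at the moment I cannot select the whole range of interest, and my approach would create the same value for every week instead of a cumulative sum from 2021-06-28 up to that week.
How can I achieve such a result?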
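The difference described above between a row-wise sum and a running total can be sketched on a toy frame. The frame `weekly` and its numbers are made up for illustration and are not taken from the real df:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three weekly total_sales columns for two rows.
weekly = pd.DataFrame(
    [[10.0, 20.0, 30.0], [1.0, 2.0, 3.0]],
    index=["Amazon", "Netflix"],
    columns=["2021-06-28", "2021-07-05", "2021-07-12"],
)

# A row-wise sum collapses all weeks into a single value per row.
row_total = np.sum(weekly, axis=1)

# A row-wise cumulative sum keeps one column per week (a running total).
running = weekly.cumsum(axis=1)
```

Here `row_total` is a Series with one number per row, while `running` keeps the shape of `weekly`, which is closer to what a year-to-date column per week needs.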
Update
After using
df1 = (df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
for some reason the new columns are not added:
df.loc[:,'2021-07-05'].columns
MultiIndex([('2021-07-05', 'total_orders'),
('2021-07-05', 'total_sales'),
('2021-07-05', 'last_year_sales'),
('2021-07-05', 'last_year_total_orders')],
names=['created_at', None])
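When I inspect df1, it only contains a single column, for 2021-06-28, while I expected one column per week. In the final df the new column exists only for 2021-06-28, not for the later dates.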
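One possible reading of this observation, offered as a sketch and not a diagnosis of the real df: a list passed to .loc selects only the labels it names, whereas a slice selects every label in the range. The toy frame below uses made-up values and string week labels:

```python
import pandas as pd

# Hypothetical toy frame: string week labels, two metrics per week.
cols = pd.MultiIndex.from_product(
    [["2021-06-28", "2021-07-05", "2021-07-12"], ["total_orders", "total_sales"]]
)
toy = pd.DataFrame([[1, 10.0, 2, 20.0, 3, 30.0]], columns=cols)

# A list selects only the labels it names, not the weeks between them.
by_list = toy.loc[:, (["2021-06-28", "2021-07-12"], "total_sales")]

# A slice through pd.IndexSlice selects every week in the (sorted) range.
idx = pd.IndexSlice
by_slice = toy.loc[:, idx["2021-06-28":"2021-07-12", "total_sales"]]
```

In this sketch `by_list` has two columns and `by_slice` has three, so only the slice covers the intermediate week.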
df data
df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'A',
8: 'A',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'B',
15: 'B',
16: 'B',
17: 'B',
18: 'all',
19: 'all'},
('category', ''): {0: 'Amazon',
1: 'Apple',
2: 'Facebook',
3: 'Google',
4: 'Netflix',
5: 'Tesla',
6: 'Total',
7: 'Uber',
8: 'total',
9: 'Amazon',
10: 'Apple',
11: 'Facebook',
12: 'Google',
13: 'Netflix',
14: 'Tesla',
15: 'Total',
16: 'Uber',
17: 'total',
18: 'Total',
19: 'total'},
(pd.Timestamp('2020-06-29 00:00:00'), 'last_sales'): {0: 195.0,
1: 61.0,
2: 106.0,
3: 61.0,
4: 37.0,
5: 13.0,
6: 954.0,
7: 4.0,
8: 477.0,
9: 50.0,
10: 50.0,
11: 75.0,
12: 43.0,
13: 17.0,
14: 14.0,
15: 504.0,
16: 3.0,
17: 252.0,
18: 2916.0,
19: 2916.0},
(pd.Timestamp('2020-06-29 00:00:00'), 'total_sales'): {0: 1268.85,
1: 18274.385000000002,
2: 19722.65,
3: 55547.255,
4: 15323.800000000001,
5: 1688.6749999999997,
6: 227463.23,
7: 1906.0,
8: 113731.615,
9: 3219.6499999999996,
10: 15852.060000000001,
11: 17743.7,
12: 37795.15,
13: 5918.5,
14: 1708.75,
15: 166349.64,
16: 937.01,
17: 83174.82,
18: 787625.7400000001,
19: 787625.7400000001},
(pd.Timestamp('2020-06-29 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'last_sales'): {0: 26.0,
1: 39.0,
2: 79.0,
3: 49.0,
4: 10.0,
5: 10.0,
6: 436.0,
7: 5.0,
8: 218.0,
9: 89.0,
10: 34.0,
11: 133.0,
12: 66.0,
13: 21.0,
14: 20.0,
15: 732.0,
16: 3.0,
17: 366.0,
18: 2336.0,
19: 2336.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'total_sales'): {0: 3978.15,
1: 12138.96,
2: 19084.175,
3: 40033.46000000001,
4: 4280.15,
5: 1495.1,
6: 165548.29,
7: 1764.15,
8: 82774.145,
9: 8314.92,
10: 12776.649999999996,
11: 28048.075,
12: 55104.21000000002,
13: 6962.844999999999,
14: 3053.2000000000003,
15: 231049.11000000002,
16: 1264.655,
17: 115524.55500000001,
18: 793194.8000000002,
19: 793194.8000000002},
(pd.Timestamp('2020-07-06 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_sales'): {0: 96.0,
1: 56.0,
2: 106.0,
3: 44.0,
4: 34.0,
5: 13.0,
6: 716.0,
7: 9.0,
8: 358.0,
9: 101.0,
10: 22.0,
11: 120.0,
12: 40.0,
13: 13.0,
14: 8.0,
15: 610.0,
16: 1.0,
17: 305.0,
18: 2652.0,
19: 2652.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 5194.95,
1: 19102.219999999994,
2: 22796.420000000002,
3: 30853.115,
4: 11461.25,
5: 992.6,
6: 188143.41,
7: 3671.15,
8: 94071.705,
9: 6022.299999999998,
10: 7373.6,
11: 33514.0,
12: 35943.45,
13: 4749.000000000001,
14: 902.01,
15: 177707.32,
16: 349.3,
17: 88853.66,
18: 731701.46,
19: 731701.46},
(pd.Timestamp('2021-06-28 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'last_sales'): {0: 45.0,
1: 47.0,
2: 87.0,
3: 45.0,
4: 13.0,
5: 8.0,
6: 494.0,
7: 2.0,
8: 247.0,
9: 81.0,
10: 36.0,
11: 143.0,
12: 56.0,
13: 9.0,
14: 9.0,
15: 670.0,
16: 1.0,
17: 335.0,
18: 2328.0,
19: 2328.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'total_sales'): {0: 7556.414999999998,
1: 14985.05,
2: 16790.899999999998,
3: 36202.729999999996,
4: 4024.97,
5: 1034.45,
6: 163960.32999999996,
7: 1385.65,
8: 81980.16499999998,
9: 5600.544999999999,
10: 11209.92,
11: 32832.61,
12: 42137.44500000001,
13: 3885.1499999999996,
14: 1191.5,
15: 194912.34000000003,
16: 599.0,
17: 97456.17000000001,
18: 717745.3400000001,
19: 717745.3400000001},
(pd.Timestamp('2021-07-07 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0}}).set_index(['group','category'])
Solution
Use DataFrame.cumsum with axis=1, join the result to the original DataFrame, and sort the column MultiIndex:
idx = pd.IndexSlice
df1 = (df.loc[:, idx['2021-06-28':'2021-10-11', 'total_sales']]
.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
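To try the approach above without the full dataset, here is a minimal self-contained sketch. The frame `toy`, its values, and the names `ytd` and `result` are assumptions made for illustration; the week labels are plain strings here, unlike the Timestamp labels in the posted data.

```python
import pandas as pd

# Hypothetical toy frame standing in for df: three weeks, two metrics each.
cols = pd.MultiIndex.from_product(
    [["2021-06-28", "2021-07-05", "2021-07-12"], ["total_orders", "total_sales"]]
)
index = pd.MultiIndex.from_tuples(
    [("A", "Amazon"), ("A", "Netflix")], names=["group", "category"]
)
toy = pd.DataFrame(
    [[195, 100.0, 10, 1000.0, 20, 5000.0],
     [37, 200.0, 11, 2000.0, 21, 6000.0]],
    index=index,
    columns=cols,
)

idx = pd.IndexSlice
# Select total_sales for every week in the range and relabel the inner level.
ytd = (toy.loc[:, idx["2021-06-28":"2021-07-12", "total_sales"]]
          .rename({"total_sales": "year_to_date_sales"}, axis=1, level=1))

# Running total across weeks, joined back and sorted so that
# each week's columns sit next to each other.
result = toy.join(ytd.cumsum(axis=1)).sort_index(axis=1)

# Only the new running-total columns, one per week.
ytd_only = result.xs("year_to_date_sales", axis=1, level=1)
```

Slicing a MultiIndex this way needs the columns to be sorted; if pandas raises an UnsortedIndexError, calling sort_index(axis=1) on the frame before slicing is the usual remedy.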