python - Is there a way to optimize a pandas apply function during groupby?
Problem description
I have a DataFrame, df, as follows:
Stud_id card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now I want two DataFrames, as shown below:
df_1:
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique card and unique Code, take the sum of Amount (e.g. card 1, Code 543: 350).
2. Avg_Amount: divide Total_Amount by the number of unique yearmonth values for each unique card and unique Code (e.g. Total_Amount = 350 and the number of unique yearmonth values is 2, so 350 / 2 = 175).
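To make the df_1 rule concrete, here is a minimal plain-Python sketch for a single group (card 1, Code 543), using the three matching rows from the sample data. The variable names are only illustrative.

```python
# Rows of the sample data with card == 1 and Code == 543
amounts = [100, 100, 150]
yearmonths = [201601, 201601, 201602]

total_amount = sum(amounts)        # sum of Amount within the group
n_months = len(set(yearmonths))    # number of distinct yearmonth values
avg_amount = total_amount / n_months

print(total_amount, n_months, avg_amount)  # 350 2 175.0
```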
df_2:
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: for each Code, sum the Avg_Amount values in df_1 and divide by the number of rows (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; dividing by the 4 rows gives 625 / 4 = 156.25).
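The same arithmetic for Code 543, as a small plain-Python sketch. Note that this is an unweighted mean over the df_1 rows, so each card counts once regardless of how many transactions it has. The variable names are only illustrative.

```python
# Avg_Amount values of the df_1 rows with Code == 543 (cards 1, 2, 3, 4)
avg_amounts_543 = [175, 100, 200, 150]

avg_543 = sum(avg_amounts_543) / len(avg_amounts_543)
print(avg_543)  # 156.25
```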
Code to create the DataFrame df:
import pandas as pd

df=pd.DataFrame({'Cus_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                 'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                 'Nation':('India','India','India','India','India','India','India','India','India','India','India','India'),
                 'Gender': ('M','M','M','M','M','M','M','M','M','M','M','M'),
                 'Age':('Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult','Adult'),
                 'Code':(543,543,543,612,715,715,543,543,543,543,543,612),
                 'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                 'yearmonth':(201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_2:
# Per (Card, Code): total Amount, and total divided by the number of distinct yearmonth values
df1 = df.groupby(['Card','Code'])[['yearmonth','Amount']].apply(
    lambda x: [sum(x.Amount),sum(x.Amount)/len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns= ['Card','Code','Total_Amount','Avg_Amount']
# Per Code: plain mean of the per-card Avg_Amount values
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x)/len(x)).reset_index(
    name='Avg_Amount')
The code works, but it is slow because my dataset is large. I suspect the apply calls are what take the time. Is there a faster, better-optimized way to do this?
Solution
For DataFrame 1, you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
Card Code Amount Avg_Amount
0 1 543 350 175.0
1 1 612 100 100.0
2 1 715 200 200.0
3 2 543 200 100.0
4 2 715 200 200.0
5 3 543 200 200.0
6 4 543 150 150.0
7 4 612 100 100.0
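The output above keeps the column name Amount rather than the Total_Amount asked for in the question. One possible variant, sketched here on the assumption that your pandas version supports named aggregation (0.25 or later), names the columns directly. The helper column n_months is a made-up name for illustration, and only the columns needed for the calculation are rebuilt.

```python
import pandas as pd

# Only the columns the aggregation needs, copied from the question's sample data
df = pd.DataFrame({
    'Card': (1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
    'Code': (543, 543, 543, 612, 715, 715, 543, 543, 543, 543, 543, 612),
    'Amount': (100, 100, 150, 100, 200, 200, 100, 100, 100, 100, 150, 100),
    'yearmonth': (201601, 201601, 201602, 201602, 201603, 201601,
                  201604, 201603, 201601, 201601, 201602, 201607),
})

# Named aggregation: output_column=(input_column, aggregation)
tmp = df.groupby(['Card', 'Code'], as_index=False).agg(
    Total_Amount=('Amount', 'sum'),
    n_months=('yearmonth', 'nunique'),  # hypothetical helper column
)

# Divide the group total by the number of distinct months, then drop the helper
df1 = tmp.assign(Avg_Amount=tmp['Total_Amount'] / tmp['n_months']) \
         .drop(columns=['n_months'])
print(df1)
```

The values are the same as in the output above; only the column name differs.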
For DataFrame 2, you can do this:
df1.groupby('Code', as_index=False) \
   .agg({'Avg_Amount': 'mean'})
Code Avg_Amount
0 543 156.25
1 612 100.00
2 715 200.00
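Before swapping this into a large pipeline, it may be worth checking that the agg-based version gives the same numbers as an apply-based one on data closer to your real size. The following is a rough sketch on synthetic data: the sizes, value ranges, and function names are arbitrary assumptions, and no timings are shown because they depend on the data and the pandas version.

```python
import numpy as np
import pandas as pd

# Synthetic data; row count and value ranges are arbitrary assumptions
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'Card': rng.integers(1, 50, n),
    'Code': rng.choice([543, 612, 715], n),
    'Amount': rng.integers(50, 500, n),
    'yearmonth': rng.choice([201601, 201602, 201603, 201604], n),
})

def with_apply(frame):
    # One Python-level function call per (Card, Code) group
    return frame.groupby(['Card', 'Code'])[['yearmonth', 'Amount']].apply(
        lambda x: pd.Series({
            'Total_Amount': x['Amount'].sum(),
            'Avg_Amount': x['Amount'].sum() / x['yearmonth'].nunique(),
        })
    ).reset_index()

def with_agg(frame):
    # Built-in aggregations, then a single vectorized division
    tmp = frame.groupby(['Card', 'Code'], as_index=False) \
               .agg({'Amount': 'sum', 'yearmonth': 'nunique'})
    out = tmp.assign(Total_Amount=tmp['Amount'],
                     Avg_Amount=tmp['Amount'] / tmp['yearmonth'])
    return out[['Card', 'Code', 'Total_Amount', 'Avg_Amount']]

res_apply = with_apply(df)
res_agg = with_agg(df)

# Both results are sorted by (Card, Code), so rows line up one to one
same_keys = (res_apply[['Card', 'Code']].to_numpy()
             == res_agg[['Card', 'Code']].to_numpy()).all()
same_values = np.allclose(res_apply[['Total_Amount', 'Avg_Amount']],
                          res_agg[['Total_Amount', 'Avg_Amount']])
print(same_keys, same_values)

# Second step: mean of Avg_Amount per Code
df2 = res_agg.groupby('Code', as_index=False).agg({'Avg_Amount': 'mean'})
print(df2)
```

To compare speed on your own data, you could wrap each call with time.perf_counter or use %timeit in IPython.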