Alternative to looping over all rows in pandas

Problem description

I have a DataFrame like this:

import pandas as pd

d = {
    'jobid': [100, 101, 103, 104, 100, 100, 101],
    'memberid': [1, 2, 3, 3, 3, 2, 1],
    'cluster': ['bronze', 'silver', 'gold', 'gold', 'gold', 'silver', 'silver'],
}
df = pd.DataFrame(data=d)
df
   jobid  memberid cluster
0    100         1  bronze
1    101         2  silver
2    103         3    gold
3    104         3    gold
4    100         3    gold
5    100         2  silver
6    101         1  silver

I used the following code to find the percentage of each cluster within each job:

for i in df['jobid']:
    perc_bronze = round((df.loc[(df['jobid'] == i) & (df['cluster'] == 'bronze')].count().iloc[0] / df.loc[(df['jobid'] == i)].count().iloc[0]) * 100, 2)
    df.loc[df['jobid'] == i, 'BronzeCluster'] = perc_bronze
    perc_silver = round((df.loc[(df['jobid'] == i) & (df['cluster'] == 'silver')].count().iloc[0] / df.loc[(df['jobid'] == i)].count().iloc[0]) * 100, 2)
    df.loc[df['jobid'] == i, 'SilverCluster'] = perc_silver
    perc_gold = round((df.loc[(df['jobid'] == i) & (df['cluster'] == 'gold')].count().iloc[0] / df.loc[(df['jobid'] == i)].count().iloc[0]) * 100, 2)
    df.loc[df['jobid'] == i, 'GoldCluster'] = perc_gold

Output:

   jobid  memberid cluster  BronzeCluster  SilverCluster  GoldCluster
0    100         1  bronze          33.33          33.33        33.33
1    101         2  silver           0.00         100.00         0.00
2    103         3    gold           0.00           0.00       100.00
3    104         3    gold           0.00           0.00       100.00
4    100         3    gold          33.33          33.33        33.33
5    100         2  silver          33.33          33.33        33.33
6    101         1  silver           0.00         100.00         0.00

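The final result is correct, but the loop takes a long time to run on large datasets. Is there another way to get this output at a lower computational cost?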

Tags: python, pandas, dataframe

Solution

This code:

unstacked_df = df.groupby(['jobid', 'cluster']).count().unstack()
frequency_df = ((unstacked_df / unstacked_df.sum())*100).fillna(0)
print(frequency_df)

Output:

        memberid                      
cluster   bronze       gold     silver
jobid                                 
100        100.0  33.333333  33.333333
101          0.0   0.000000  66.666667
103          0.0  33.333333   0.000000
104          0.0  33.333333   0.000000

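Is this the behavior you expected? Note that unstacked_df.sum() adds up each column, so every value here is a job's share of one cluster's total across all jobs, whereas the output in the question divides by each job's own total.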
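
For the per-job percentages shown in the question, a minimal sketch follows. It is one possible approach, not the answer's original method: it assumes pd.crosstab with normalize="index" to divide each job's counts by that job's total, then joins the result back onto the rows. The names pct, wanted and result are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "jobid": [100, 101, 103, 104, 100, 100, 101],
    "memberid": [1, 2, 3, 3, 3, 2, 1],
    "cluster": ["bronze", "silver", "gold", "gold", "gold", "silver", "silver"],
})

# Share of each cluster within each job, in percent.
# normalize="index" divides every row (one jobid) by its own row total.
pct = (pd.crosstab(df["jobid"], df["cluster"], normalize="index") * 100).round(2)

# Rename to the column names used in the question (bronze -> BronzeCluster, ...).
pct.columns = [f"{name.capitalize()}Cluster" for name in pct.columns]

# Use the question's column order; a cluster absent from the data becomes 0.
wanted = ["BronzeCluster", "SilverCluster", "GoldCluster"]
pct = pct.reindex(columns=wanted, fill_value=0.0)

# Broadcast the per-job percentages back onto every original row.
result = df.join(pct, on="jobid")
print(result)
```

Because the counting is done once per (jobid, cluster) pair instead of rescanning the whole frame for every row, this should scale much better than the loop, though the actual speedup depends on the data.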

