pd.DataFrame: how to count the values of variables and find probabilities

Question

Here is my data:

import pandas as pd

df1 = pd.DataFrame()
df1['a1'] = ['ABC','ACC','BCC','ABC','ABC','ACC','BCC']
df1['b1'] = ['ACC','AAC','BAC','ACC','ACC','AAC','BAC']
df1['group'] = ['A1','A2','A1','A3','A2','A1','A1']
df1['names'] = ['n1','n2','n3','n4','n1','n3','n3']

df2 = pd.DataFrame()
df2['a2'] = ['ACC','BCC','ABC']
df2['b2'] = ['AAC','BAC','ACC']
df2['types'] = ['t1','t2','t3']

DF = pd.merge(df1, df2, left_on=['a1','b1'], right_on=['a2','b2'])

>>> DF.sort_values('group')
    a1   b1 group names   a2   b2 types
0  ABC  ACC    A1    n1  ABC  ACC    t3
4  ACC  AAC    A1    n3  ACC  AAC    t1
5  BCC  BAC    A1    n3  BCC  BAC    t2
6  BCC  BAC    A1    n3  BCC  BAC    t2
2  ABC  ACC    A2    n1  ABC  ACC    t3
3  ACC  AAC    A2    n2  ACC  AAC    t1
1  ABC  ACC    A3    n4  ABC  ACC    t3

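For each name, I want to compute the probability of each type occurring, taken over the total number of rows (nrow of the df), and then sum these probabilities within each group.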

For example, for group A1:

for n1: 
P_1 = P(t1_n1)+P(t2_n1)+P(t3_n1) = 0+0+1/7 = 1/7
for n2: 
P_2 = P(t1_n2)+P(t2_n2)+P(t3_n2) = 0
for n3: 
P_3 = P(t1_n3)+P(t2_n3)+P(t3_n3) = 1/7+2/7+0 = 3/7
for n4:
P_4 = P(t1_n4)+P(t2_n4)+P(t3_n4) = 0 

P_total = P_1+P_2+P_3+P_4

Expected output:

  groups  P_n1  P_n2  P_n3  P_n4  P_total
0     A1   1/7     0   3/7     0      4/7
1     A2  ....
2     A3
3     A4

How can I do this in an elegant way, without writing a lot of loops? Thanks.

Tags: python, pandas, dataframe, probability

Answer


You can use pd.crosstab with normalize=True, which divides every (group, names) count by the total number of rows (7 here):

pd.crosstab(DF['group'],DF['names'],normalize=True)

names        n1        n2        n3        n4
group                                        
A1     0.142857  0.000000  0.428571  0.000000
A2     0.142857  0.142857  0.000000  0.000000
A3     0.000000  0.000000  0.000000  0.142857
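
As a quick check on what normalize=True does here, the following sketch rebuilds the same table by hand: it counts the (group, names) pairs and divides by the number of rows in DF. The data is the question's; the names counts, manual and auto are illustrative only.

```python
import pandas as pd

# Rebuild the question's data so the snippet runs on its own.
df1 = pd.DataFrame({
    'a1': ['ABC', 'ACC', 'BCC', 'ABC', 'ABC', 'ACC', 'BCC'],
    'b1': ['ACC', 'AAC', 'BAC', 'ACC', 'ACC', 'AAC', 'BAC'],
    'group': ['A1', 'A2', 'A1', 'A3', 'A2', 'A1', 'A1'],
    'names': ['n1', 'n2', 'n3', 'n4', 'n1', 'n3', 'n3'],
})
df2 = pd.DataFrame({
    'a2': ['ACC', 'BCC', 'ABC'],
    'b2': ['AAC', 'BAC', 'ACC'],
    'types': ['t1', 't2', 't3'],
})
DF = pd.merge(df1, df2, left_on=['a1', 'b1'], right_on=['a2', 'b2'])

# Raw counts of each (group, names) pair.
counts = pd.crosstab(DF['group'], DF['names'])

# Dividing by the total row count gives the joint probabilities.
manual = counts / len(DF)

# The same table, produced directly by crosstab.
auto = pd.crosstab(DF['group'], DF['names'], normalize=True)

print(counts)
print(manual)
```

Because every cell is divided by the same grand total, the whole table sums to 1 and each row sum is the share of rows belonging to that group.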

To add the per-group total as well:

pd.crosstab(DF['group'],DF['names'],normalize=True)\
.assign(total = lambda x : x.sum(axis=1)).reset_index()

names group        n1        n2        n3        n4     total
0        A1  0.142857  0.000000  0.428571  0.000000  0.571429
1        A2  0.142857  0.142857  0.000000  0.000000  0.285714
2        A3  0.000000  0.000000  0.000000  0.142857  0.142857
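
If the individual per-type terms from the question (P(t1_n1), P(t2_n1), and so on) are also needed, one possible sketch is to cross-tabulate group against both names and types, then sum over the types level to recover the per-name columns. The P_ prefix and the P_total column follow the expected output in the question; the names per_type, per_name and result are assumptions made for illustration.

```python
import pandas as pd

# Rebuild the question's data so the snippet runs on its own.
df1 = pd.DataFrame({
    'a1': ['ABC', 'ACC', 'BCC', 'ABC', 'ABC', 'ACC', 'BCC'],
    'b1': ['ACC', 'AAC', 'BAC', 'ACC', 'ACC', 'AAC', 'BAC'],
    'group': ['A1', 'A2', 'A1', 'A3', 'A2', 'A1', 'A1'],
    'names': ['n1', 'n2', 'n3', 'n4', 'n1', 'n3', 'n3'],
})
df2 = pd.DataFrame({
    'a2': ['ACC', 'BCC', 'ABC'],
    'b2': ['AAC', 'BAC', 'ACC'],
    'types': ['t1', 't2', 't3'],
})
DF = pd.merge(df1, df2, left_on=['a1', 'b1'], right_on=['a2', 'b2'])

# Joint probability of each (names, types) pair per group,
# as a fraction of all rows in DF. Columns form a MultiIndex.
per_type = pd.crosstab(DF['group'], [DF['names'], DF['types']],
                       normalize=True)

# Sum over the 'types' level so that one column per name remains.
per_name = per_type.T.groupby(level='names').sum().T

# Rename the columns to P_n1 ... P_n4 and append the row total.
result = (per_name.add_prefix('P_')
          .assign(P_total=lambda x: x.sum(axis=1))
          .reset_index())

print(per_type)
print(result)
```

Only the (names, types) combinations that actually occur in DF appear as columns of per_type; combinations that never occur contribute 0 to the sums.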
