How do I find the sum and count for each ordering of two columns in a Pandas DataFrame?

Problem description

I have a data frame of email exchanges that looks like this:

sender receiver words 
a        b       10
a        c       5
a        c       15
b        a       50
b        a       30
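
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame. It assumes the three columns are plain strings and integers, which the question does not state explicitly.

```python
import pandas as pd

# Sample data copied from the table above (dtypes are an assumption)
df = pd.DataFrame({
    'sender':   ['a', 'a', 'a', 'b', 'b'],
    'receiver': ['b', 'c', 'c', 'a', 'a'],
    'words':    [10, 5, 15, 50, 30],
})
print(df)
```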

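I want to get to the data frame below. words_xy is the number of words sent from the sender to the receiver, and words_yx is the number of words sent from the receiver back to the sender. email_xy is the total number (count) of messages sent from the sender to the receiver, and email_yx is the same count in the opposite direction.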

x-y      y-x   words_xy  words_yx   email_xy   email_yx
ab       ba       10      (50+30)      1         2
ac       ca     (5+15)      0          2         0 

I was able to get the x-y combination by first concatenating sender and receiver with df['x-y']=df['sender']+df['receiver']. Then I applied the code below to get words_xy and email_xy:

# number of emails per x-y pair, repeated on every row of that pair
df['email_xy'] = df.groupby('x-y')['x-y'].transform('count')
# total words per x-y pair, repeated on every row of that pair
df['words_xy'] = df.groupby('x-y')['words'].transform('sum')

How can I get the reverse (y-x) values and append them to the data frame? Or is there perhaps a better way to do this with Pandas?

Tags: python, pandas

Solution


My approach is as follows:

Use groupby with agg to get the sum and the count for each (sender, receiver) pair:

new_df = (
    df
        .groupby(['sender', 'receiver'], as_index=False)
        .agg(words=('words', 'sum'), email=('words', 'count'))
)
  sender receiver  words  email
0      a        b     10      1
1      a        c     20      2
2      b        a     80      2

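Self-join (merge) the frame with a copy of itself in which the sender and receiver columns are swapped, so each row also picks up the values for the reverse direction: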

new_df = (
    new_df
        .merge(new_df.rename(columns={'sender': 'receiver',
                                      'receiver': 'sender'}),
               on=['sender', 'receiver'],
               suffixes=('_xy', '_yx'),
               how='left')
        .fillna(0)
        .convert_dtypes()
)
  sender receiver  words_xy  email_xy  words_yx  email_yx
0      a        b        10         1        80         2
1      a        c        20         2         0         0
2      b        a        80         2        10         1
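
To make the self-join easier to follow, here is a small sketch that splits it into two steps. It rebuilds the aggregated frame by hand so that it runs on its own; the names agg, swapped and merged are only illustrative and do not come from the answer.

```python
import pandas as pd

# Aggregated frame as shown above, rebuilt so the snippet is self-contained
agg = pd.DataFrame({
    'sender':   ['a', 'a', 'b'],
    'receiver': ['b', 'c', 'a'],
    'words':    [10, 20, 80],
    'email':    [1, 2, 2],
})

# Swapping the two column names turns every x->y row into a y->x row
swapped = agg.rename(columns={'sender': 'receiver', 'receiver': 'sender'})

# The left merge keeps every x->y pair; a pair with no reply (a->c here)
# has no partner row, so its *_yx columns are NaN until fillna(0) is applied
merged = agg.merge(swapped, on=['sender', 'receiver'],
                   suffixes=('_xy', '_yx'), how='left')
print(merged)
```

The NaN values are also why the numeric columns become floats after the merge, which the answer's convert_dtypes() call turns back into integer types.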

Then add the new columns:

new_df['x-y'] = new_df['sender'] + new_df['receiver']
new_df['y-x'] = new_df['receiver'] + new_df['sender']
  sender receiver  words_xy  email_xy  words_yx  email_yx x-y y-x
0      a        b        10         1        80         2  ab  ba
1      a        c        20         2         0         0  ac  ca
2      b        a        80         2        10         1  ba  ab

Get rid of the duplicates: there are many ways to do this, but I chose stack + drop_duplicates + unstack:

new_df = (
    new_df.drop(columns=['sender', 'receiver'])
        .set_index(['words_xy', 'words_yx', 'email_xy', 'email_yx'])
        .stack()
        .drop_duplicates()
        .unstack()
        .reset_index()
)
   words_xy  words_yx  email_xy  email_yx x-y y-x
0        10        80         1         2  ab  ba
1        20         0         2         0  ac  ca
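
As one of the "many ways" mentioned above, here is a hedged alternative sketch that is not the answerer's method. It builds an order-independent key from sender and receiver and keeps the first row for each key. This may be more robust when two different pairs happen to share identical word and email totals, since unstack raises an error on duplicate index entries. The names pairs, key and deduped are hypothetical.

```python
import pandas as pd

# Frame as it looks before the de-duplication step, rebuilt by hand
pairs = pd.DataFrame({
    'sender':   ['a', 'a', 'b'],
    'receiver': ['b', 'c', 'a'],
    'words_xy': [10, 20, 80],
    'email_xy': [1, 2, 2],
    'words_yx': [80, 0, 10],
    'email_yx': [2, 0, 1],
})

# Order-independent key: ('a', 'b') for both a->b and b->a
key = pd.Series(
    [tuple(sorted(p)) for p in zip(pairs['sender'], pairs['receiver'])],
    index=pairs.index,
)

# Keep only the first row seen for each unordered pair
deduped = pairs[~key.duplicated()].reset_index(drop=True)
print(deduped)
```

Which direction survives depends on row order: the first occurrence of each pair is kept, which here means a->b rather than b->a.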

Finally, reorder the columns:

new_df = new_df[['x-y', 'y-x', 'words_xy', 'words_yx', 'email_xy', 'email_yx']]
  x-y y-x  words_xy  words_yx  email_xy  email_yx
0  ab  ba        10        80         1         2
1  ac  ca        20         0         2         0
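
For convenience, the whole pipeline can be wrapped in one function. The sketch below is an assumption-laden condensation rather than the answer's exact code: pair_summary is a hypothetical name, it de-duplicates with a sorted (sender, receiver) key instead of stack/unstack, and it casts the filled columns back to int instead of calling convert_dtypes().

```python
import pandas as pd

def pair_summary(df):
    """Summarise words and email counts per pair, in both directions."""
    # Step 1: sum of words and number of emails per (sender, receiver)
    agg = (df.groupby(['sender', 'receiver'], as_index=False)
             .agg(words=('words', 'sum'), email=('words', 'count')))

    # Step 2: self-merge with sender/receiver swapped to get the reverse side
    swapped = agg.rename(columns={'sender': 'receiver', 'receiver': 'sender'})
    out = (agg.merge(swapped, on=['sender', 'receiver'],
                     suffixes=('_xy', '_yx'), how='left')
              .fillna(0))

    # Pairs without a reply were NaN, so the *_yx columns became floats
    num_cols = ['words_xy', 'words_yx', 'email_xy', 'email_yx']
    out[num_cols] = out[num_cols].astype(int)

    # Step 3: keep one row per unordered pair (first direction seen wins)
    key = pd.Series(
        [tuple(sorted(p)) for p in zip(out['sender'], out['receiver'])],
        index=out.index,
    )
    out = out[~key.duplicated()].reset_index(drop=True)

    # Step 4: build the pair labels and put columns in the requested order
    out['x-y'] = out['sender'] + out['receiver']
    out['y-x'] = out['receiver'] + out['sender']
    return out[['x-y', 'y-x'] + num_cols]

# Usage with the sample data from the question
df = pd.DataFrame({
    'sender':   ['a', 'a', 'a', 'b', 'b'],
    'receiver': ['b', 'c', 'c', 'a', 'a'],
    'words':    [10, 5, 15, 50, 30],
})
result = pair_summary(df)
print(result)
```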
