首页 > 解决方案 > 合并具有多列条件和比较的数据框

问题描述

不确定这是做我想做的最好或正确的方法。

我有以下df:

df = pd.DataFrame(np.array([['1-1-2020', '123','How can I Help?', 'Delivered'], ['1-1-2020', '123','How can I Help?', 'Opened'], ['1-2-2021', '100','New Offer', 'Delivered'],['1-2-2021', '100','New Offer', 'Delivered'],['1-4-2021', '144','Last chance, buy now!', 'Delivered'],['1-4-2021', '144','Last chance, buy now!', 'Delivered'],['2-4-2021', '144','Last chance, buy now!', 'Opened']]),

                   columns=['Date', 'Customer_ID','Subject', 'Status'])


    Date    Customer_ID     Subject              Status
0   1-1-2020    123     How can I Help?         Delivered
1   1-1-2020    123     How can I Help?         Opened
2   1-2-2021    100     New Offer               Delivered
3   1-2-2021    100     New Offer               Delivered
4   1-4-2021    144     Last chance, buy now!   Delivered
5   1-4-2021    144     Last chance, buy now!   Delivered
6   2-4-2021    144     Last chance, buy now!   Opened

在这个 df 客户 123 收到了一封电子邮件,然后它在第二行被打开。客户 100 发送了两次电子邮件,客户 144 发送了两次电子邮件,其中一个打开了。

我正在尝试使用最后操作日期跟踪每位客户的每封电子邮件的已发送和打开状态。

因此,我创建了两个数据框:一个用于交付,一个用于打开,并将它们合并到交付的一个上以跟踪打开的内容。

df_del = df.loc[(df['Status'] == 'Delivered')]
df_open = df.loc[(df['Status'] == 'Opened')]

d = df_del.rename(columns={'Date': 'Date Delivered'})
o = df_open.rename(columns={'Date': 'last action date', 'Status': 'Open Status'})

w = d.merge(o, on=['Customer_ID','Subject'], how='left')

w

由此可见:

Date Delivered  Customer_ID       Subject            Status     last action date Open Status
0   1-1-2020    123         How can I Help?           Delivered     1-1-2020       Opened
1   1-2-2021    100         New Offer                 Delivered         NaN        NaN
2   1-2-2021    100         New Offer                 Delivered         NaN        NaN
3   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened
4   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened

我所期待的:

Date Delivered  Customer_ID       Subject            Status     last action date Open Status
0   1-1-2020    123         How can I Help?           Delivered     1-1-2020       Opened
1   1-2-2021    100         New Offer                 Delivered     1-2-2021       NaN
2   1-2-2021    100         New Offer                 Delivered     1-2-2021       NaN
3   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened
4   1-4-2021    144         Last chance, buy now!     Delivered     1-4-2021       NaN

标签: pythonpython-3.xpandasdataframe

解决方案


让我们使用一个伪“订单”列:

df_del = df.loc[(df['Status'] == 'Delivered')].copy()
df_open = df.loc[(df['Status'] == 'Opened')].copy()

df_del['order'] = df_del.groupby(['Customer_ID']).cumcount()
df_open['order'] = df_open.groupby(['Customer_ID']).cumcount()

d = df_del.rename(columns={'Date': 'Date Delivered'})
o = df_open.rename(columns={'Date': 'last action date', 'Status': 'Open Status'})

w = d.merge(o, on=['Customer_ID','Subject','order'], how='left')

w['last action date'] = w['last action date'].fillna(w['Date Delivered'])

W

输出:

  Date Delivered Customer_ID                Subject     Status  order last action date Open Status
0       1-1-2020         123        How can I Help?  Delivered      0         1-1-2020      Opened
1       1-2-2021         100              New Offer  Delivered      0         1-2-2021         NaN
2       1-2-2021         100              New Offer  Delivered      1         1-2-2021         NaN
3       1-4-2021         144  Last chance, buy now!  Delivered      0         2-4-2021      Opened
4       1-4-2021         144  Last chance, buy now!  Delivered      1         1-4-2021         NaN

推荐阅读