Python Pandas: summing over multiple conditions

Problem description

Here is my sample data:

        Customer   Document Date   Clearing Date   Invoice_Amount
0       A          09/13/2016      11/04/2016      2,007,324
1       A          04/18/2016      07/11/2016      631,714
2       A          09/13/2016      09/16/2016      4,000,000
3       A          07/11/2017      09/23/2017      5,000,000
4       A          05/03/2016      06/17/2016      2,000,000
---     ---        ---             ---             ---
1158    H          04/21/2017      06/28/2017      3,000,000
1159    H          04/25/2017      05/19/2017      1,000,000
1160    H          11/03/2017      12/11/2017      4,500,000
1161    H          03/15/2018      05/27/2018      3,500,000
1162    H          02/21/2018      05/03/2018      1,500,000

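I want to create a new variable, No_Paid (a new column added after Invoice_Amount), that counts "the number of invoices paid before the Document Date of the customer's new invoice".

The expected output is as follows: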

        Customer   Document Date   Clearing Date   Invoice_Amount No_Paid*
0       A          09/13/2016      11/04/2016      2,007,324          8 
1       A          04/18/2016      07/11/2016      631,714            1
2       A          09/13/2016      09/16/2016      4,000,000          8
3       A          07/11/2017      09/23/2017      5,000,000          6
4       A          05/03/2016      06/17/2016      2,000,000          1
---     ---        ---             ---             ---              ---
1158    H          04/21/2017      06/28/2017      3,000,000          5 
1159    H          04/25/2017      05/19/2017      1,000,000          3
1160    H          11/03/2017      12/11/2017      4,500,000          7
1161    H          03/15/2018      05/27/2018      3,500,000         37
1162    H          02/21/2018      05/03/2018      1,500,000         37

Currently I use a for loop to produce the expected output:

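import pandas as pd

# Raw string so the backslash in the Windows path is not read as an escape
df = pd.read_csv(r'E:\data.csv')
df['Document Date'] = pd.to_datetime(df['Document Date'], format="%m/%d/%Y")
df['Clearing Date'] = pd.to_datetime(df['Clearing Date'], format="%m/%d/%Y")
df["No_Paid"] = 0
for i in df.index:
    # "Customer" is the column name shown in the sample data above
    customer = df.loc[i, "Customer"]
    doc_date = df.loc[i, "Document Date"]
    six_months_ago = doc_date - pd.Timedelta(days=180)
    # Same customer, cleared before this invoice's document date,
    # and issued within the previous 180 days
    paid_before = ((df["Customer"] == customer)
                   & (df["Clearing Date"] < doc_date)
                   & (df["Document Date"] >= six_months_ago))
    df.loc[i, "No_Paid"] = df.loc[paid_before, "Invoice_Amount"].count()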

In practice I have more than 100,000 invoices, so this loop takes a long time. I tried using df.apply, but could not get the same output.

Tags: python, pandas

Solution


Using your example:

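import pandas as pd
# Read in the CSV (save as CSV, or read a spreadsheet with pd.read_excel).
# thousands=',' parses amounts such as 2,007,324 as numbers.
df = pd.read_csv('file.csv', thousands=',')
# Column names here (Doc_Date, Exp_Date, ID) are placeholders; map them to your own columns
# Convert to datetime just in case
df['Doc_Date'] = pd.to_datetime(df['Doc_Date'])
df['Exp_Date'] = pd.to_datetime(df['Exp_Date'])
df['Overdue'] = df['Doc_Date'] - df['Exp_Date']
# 180 days for 6 months
df['6M_Age'] = df['Doc_Date'] - pd.Timedelta(days=180)
# Hard to tell what the line in the middle of the data means
# You can group by two columns if you need to
# Assumption: the amount to accumulate is the numeric Invoice_Amount column
df['Sum_of_paid'] = df.groupby('ID')['Invoice_Amount'].cumsum()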
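
A running sum per group does not by itself give the count described in the question. As a hedged alternative, here is a minimal sketch that replaces the row-by-row loop with one NumPy comparison matrix per customer. It assumes the column names from the sample data (Customer, Document Date, Clearing Date) and the same rule as the loop in the question: same customer, cleared strictly before the current document date, and issued within the previous 180 days. The function name `count_paid_before` and the `window_days` parameter are made up for illustration.

```python
import numpy as np
import pandas as pd


def count_paid_before(df: pd.DataFrame, window_days: int = 180) -> pd.Series:
    """Per invoice, count the same customer's invoices that were cleared
    before its document date and issued within the previous window_days."""
    doc = df["Document Date"].to_numpy()
    clr = df["Clearing Date"].to_numpy()
    counts = np.zeros(len(df), dtype=np.int64)

    # .indices maps each customer to the integer positions of its rows
    for positions in df.groupby("Customer").indices.values():
        d = doc[positions]
        c = clr[positions]
        window_start = d - np.timedelta64(window_days, "D")
        # Row i, column j: invoice j cleared before invoice i was issued
        paid_before = c[None, :] < d[:, None]
        # Row i, column j: invoice j issued inside invoice i's look-back window
        in_window = d[None, :] >= window_start[:, None]
        counts[positions] = (paid_before & in_window).sum(axis=1)

    return pd.Series(counts, index=df.index, name="No_Paid")


# Small made-up data set, only to show the call
demo = pd.DataFrame({
    "Customer": ["A", "A", "A", "A", "B", "B"],
    "Document Date": ["04/18/2016", "05/03/2016", "09/13/2016",
                      "07/11/2017", "09/13/2016", "10/01/2016"],
    "Clearing Date": ["07/11/2016", "06/17/2016", "11/04/2016",
                      "09/23/2017", "09/16/2016", "10/20/2016"],
})
for col in ["Document Date", "Clearing Date"]:
    demo[col] = pd.to_datetime(demo[col], format="%m/%d/%Y")

demo["No_Paid"] = count_paid_before(demo)
print(demo)
```

The comparison matrices grow with the square of the number of invoices per customer, so this trades memory for speed. With a handful of customers and 100,000+ rows, it may be necessary to process each customer's rows in chunks.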
