Filtering a pandas dataframe by aggregating on two columns

Question

I have a pandas dataframe. Here are the first five rows:

      InvoiceNo StockCode                          Description  Quantity           InvoiceDate  UnitPrice  CustomerID         Country
    0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   2010-12-01 08:26:00       2.55     17850.0  United Kingdom
    1    536365     71053                  WHITE METAL LANTERN         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom
    2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   2010-12-01 08:26:00       2.75     17850.0  United Kingdom
    3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom
    4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   2010-12-01 08:26:00       3.39     17850.0  United Kingdom

I would like to group by StockCode and CustomerID, and sum Quantity. Then, I'd like to throw out all of the StockCode/CustomerID pairs where this sum is negative. The desired final product is the original dataframe with the rows corresponding to these StockCode/CustomerID pairs removed.

I have a working solution:

retail_df.groupby(['CustomerID','StockCode']).filter(lambda x: x['Quantity'].sum() >= 0)

However, it takes my laptop four minutes to run it. There are 406829 rows. Is there a faster way?

Tags: pandas, dataframe, pandas-groupby

Answer


This should do the trick:

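# Boolean Series indexed by (CustomerID, StockCode): True where the summed Quantity is non-negative
df2=retail_df.groupby(['CustomerID','StockCode'])["Quantity"].sum().ge(0)

# Keep only the rows whose (CustomerID, StockCode) pair is flagged True, then turn the keys back into columns
retail_df=retail_df.set_index(['CustomerID','StockCode']).loc[df2.loc[df2].index].reset_index(drop=False)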
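
Another option that may be worth timing is `groupby(...).transform('sum')`, which broadcasts each pair's total back onto its rows so that the result can be used directly as a boolean mask. The sketch below uses a small made-up frame in place of the real `retail_df`. The toy values and the names `pair_total` and `filtered` are only for illustration, and I have not benchmarked it on the 406829-row data:

```python
import pandas as pd

# Toy stand-in for retail_df; the values are invented for illustration only.
retail_df = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    "StockCode": ["A", "A", "B", "A", "A", "B"],
    "Quantity": [5, -7, 3, 4, -4, 1],
})

# For every row, the total Quantity of its (CustomerID, StockCode) pair.
pair_total = retail_df.groupby(["CustomerID", "StockCode"])["Quantity"].transform("sum")

# Keep only the rows whose pair total is non-negative.
filtered = retail_df[pair_total >= 0]
print(filtered)
```

Because this is plain boolean indexing, it should keep the original row order, index, and column order, the same as the `filter` call in the question.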
