python - Why are columns not mentioned in groupby and sum dropped?

Problem description

I have this dataframe:

    InvoiceID   PaymentDate          TotalRevenue   Discount     Discount_Revenue
0   72A04E22    2020-07-03 17:25:13   1650000.0      0.0          1650000.0
1   54FCFCB9    2021-03-17 14:26:08   5500000.0      0.0          5500000.0
...

After the following aggregation, the PaymentDate column is dropped:

df.groupby(by=['InvoiceID'])[['TotalRevenue','Discount','Discount_Revenue']].sum().reset_index(drop=True, inplace=True)

How can I still keep the columns that are not mentioned in the group by or in the aggregation function?

Tags: python, pandas, pandas-groupby

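When you use groupby with sum, you are aggregating the data: out of the several rows in df that share the same InvoiceID, you keep only one, and its values are the sums of the values from all of those rows.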

Suppose this is your dataframe, with the same invoice appearing twice:

  InvoiceID          PaymentDate  TotalRevenue  Discount  Discount_Revenue
0  72A04E22  2020-07-03 17:25:13     1650000.0       0.0         1650000.0
1  54FCFCB9  2021-03-17 14:26:08     5500000.0       0.0         5500000.0
2  54FCFCB9  2021-03-17 14:26:08     5500000.0       1.0         5500000.0

Then you can see the effect of summing Discount, for example:

>>> df.groupby('InvoiceID')['Discount'].sum()
InvoiceID
54FCFCB9    1.0
72A04E22    0.0
Name: Discount, dtype: float64

To answer your question specifically: the PaymentDate column is dropped because you did not specify how to aggregate it.

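For columns where summing makes no sense, such as PaymentDate, you need to define another aggregation function. Do you want to keep the first payment date? The last one?
Note that InvoiceID does not disappear on its own in the example above: you are deliberately dropping it in your code with .reset_index(drop=True).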
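To make the point about reset_index concrete, here is a minimal sketch. It rebuilds the three-row example dataframe by hand; the variable names (value_cols, summed, without_id, with_id, nothing) are chosen only for illustration and are not from the question.

```python
import pandas as pd

# Hand-built copy of the three-row example dataframe.
df = pd.DataFrame({
    "InvoiceID": ["72A04E22", "54FCFCB9", "54FCFCB9"],
    "PaymentDate": pd.to_datetime([
        "2020-07-03 17:25:13",
        "2021-03-17 14:26:08",
        "2021-03-17 14:26:08",
    ]),
    "TotalRevenue": [1650000.0, 5500000.0, 5500000.0],
    "Discount": [0.0, 0.0, 1.0],
    "Discount_Revenue": [1650000.0, 5500000.0, 5500000.0],
})

value_cols = ["TotalRevenue", "Discount", "Discount_Revenue"]
summed = df.groupby("InvoiceID")[value_cols].sum()

# After groupby, InvoiceID is the index of the result, not a regular column.
# drop=True throws that index away, so InvoiceID is lost.
without_id = summed.reset_index(drop=True)

# Without drop=True, the index is moved back into a regular column.
with_id = summed.reset_index()

# inplace=True makes reset_index return None, so chaining it onto a
# temporary result leaves nothing to assign or print.
nothing = summed.copy().reset_index(drop=True, inplace=True)

print(list(without_id.columns))
print(list(with_id.columns))
print(nothing)
```

The last case is worth noting because the line in the question chains reset_index(drop=True, inplace=True) directly onto the aggregation, so the whole expression evaluates to None.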

Suppose we choose to keep the last payment date, and use reset_index without drop=True so that InvoiceID is kept as well. We get:

>>> invoice_groups = df.groupby('InvoiceID')
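>>> # numeric_only=True keeps the non-numeric PaymentDate out of the sum, so the join below does not collide
>>> invoices = invoice_groups.sum(numeric_only=True).join(invoice_groups['PaymentDate'].max()).reset_index()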
>>> invoices
  InvoiceID  TotalRevenue  Discount  Discount_Revenue         PaymentDate
0  54FCFCB9    11000000.0       1.0        11000000.0 2021-03-17 14:26:08
1  72A04E22     1650000.0       0.0         1650000.0 2020-07-03 17:25:13

Those are all your columns, each aggregated in some way (sum or max) from the rows of the original dataframe.
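As an alternative to sum followed by join, the same result can probably be expressed in a single agg call that states one aggregation per column. This is only a sketch: it uses pandas named aggregation, rebuilds the example dataframe by hand, and assumes that the last payment date is the one you want (swap "max" for "min" to keep the first).

```python
import pandas as pd

# Hand-built copy of the three-row example dataframe.
df = pd.DataFrame({
    "InvoiceID": ["72A04E22", "54FCFCB9", "54FCFCB9"],
    "PaymentDate": pd.to_datetime([
        "2020-07-03 17:25:13",
        "2021-03-17 14:26:08",
        "2021-03-17 14:26:08",
    ]),
    "TotalRevenue": [1650000.0, 5500000.0, 5500000.0],
    "Discount": [0.0, 0.0, 1.0],
    "Discount_Revenue": [1650000.0, 5500000.0, 5500000.0],
})

# One explicit aggregation per output column: (source column, function).
invoices = (
    df.groupby("InvoiceID")
    .agg(
        PaymentDate=("PaymentDate", "max"),        # latest payment date
        TotalRevenue=("TotalRevenue", "sum"),
        Discount=("Discount", "sum"),
        Discount_Revenue=("Discount_Revenue", "sum"),
    )
    .reset_index()  # no drop=True, so InvoiceID comes back as a column
)

print(invoices)
```

Spelling out every column this way makes it explicit that nothing is dropped silently: a column appears in the result only if an aggregation was chosen for it.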
