Replacing a "Total" label with an "Other" label by taking row differences in Pandas

Question

I currently have a DataFrame that looks like this:

 account     region     measure     value     date
 acct1       USA        Expense     100       1/31/2019
 acct1       USA        Sales       150       1/31/2019
 acct2       USA        Expense     1000      1/31/2019
 acct2       USA        Sales       1500      1/31/2019
 acct2       East       Expense     500       1/31/2019
 acct2       East       Sales       800       1/31/2019
 acct1       West       Expense     90        1/31/2019
 acct1       West       Sales       140       1/31/2019
 acct2       West       Expense     450       1/31/2019
 acct2       West       Sales       500       1/31/2019

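There are accounts and regions, each associated with a date, a measure, and a value.

Although the "USA" region label is a total, the East and West regions do not necessarily add up to the USA total. My goal is to change the "USA" rows to an "Other" label, so that the resulting three regions (East, West, and Other) sum to the former "USA" figure.

The result would therefore look like this: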

 account     region     measure     value     date
 acct1       Other      Expense     10        1/31/2019
 acct1       Other      Sales       10        1/31/2019
 acct2       Other      Expense     50        1/31/2019
 acct2       Other      Sales       200       1/31/2019
 acct2       East       Expense     500       1/31/2019
 acct2       East       Sales       800       1/31/2019
 acct1       West       Expense     90        1/31/2019
 acct1       West       Sales       140       1/31/2019
 acct2       West       Expense     450       1/31/2019
 acct2       West       Sales       500       1/31/2019

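As you can see, the "East" and "West" regions are unchanged, and the three regions sum to the previous "USA" total.

I have tried many different approaches to achieve this, to no avail. First, I tried splitting the data into three dfs, one per region: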

 df_usa = df[df['region'] == 'USA']
 df_east = df[df['region'] == 'East']
 df_west = df[df['region'] == 'West']

Then I tried to create the "Other" df by subtracting column by column:

 df_usa['value'] = df_usa['value'] - df_east['value'] - df_west['value']

This does not work because each regional df has a different number of accounts/rows. I also need to account for other dates in the dataset.
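
For readers wondering about the mechanism: pandas aligns Series on their index labels before subtracting, so rows taken from different parts of the original frame never line up. The following is a minimal sketch of that effect, using only the Expense rows from the sample data above; the variable names are illustrative.

```python
import pandas as pd

# Expense rows only, taken from the sample data in the question.
df = pd.DataFrame({
    "account": ["acct1", "acct2", "acct2", "acct1", "acct2"],
    "region": ["USA", "USA", "East", "West", "West"],
    "value": [100, 1000, 500, 90, 450],
})

usa = df.loc[df["region"] == "USA", "value"]    # index labels 0, 1
east = df.loc[df["region"] == "East", "value"]  # index label 2
west = df.loc[df["region"] == "West", "value"]  # index labels 3, 4

# Subtraction aligns on index labels, not on position or account.
# No label is shared between the three Series, so every result is NaN.
diff = usa - east - west
print(diff)
```

The subtraction needs a shared key (account, measure, date) rather than the default row labels.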

As I'm sure you can tell, I'm still new to pandas calculations.

Tags: python, python-3.x, pandas

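Answer

The data is a bit easier to work with after some reshaping. First, moving the identifying fields into the index and the regions into columns gives a clearer picture of what is where: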

In [46]: regions = df.set_index(["account", "region", "measure", "date"]).sort_index().unstack(1)

In [47]: regions.columns = regions.columns.droplevel()

In [48]: regions
Out[48]:
region                      East     USA   West
account measure date
acct1   Expense 1/31/2019    NaN   100.0   90.0
        Sales   1/31/2019    NaN   150.0  140.0
acct2   Expense 1/31/2019  500.0  1000.0  450.0
        Sales   1/31/2019  800.0  1500.0  500.0

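In this shape, it is easy to create another field. The row sum includes the "USA" column itself and skips NaN values, so doubling "USA" before subtracting leaves "USA" minus the other regions: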

In [49]: regions['Other'] = (regions['USA'] * 2) - regions.sum(axis=1)

In [50]: regions
Out[50]:
region                      East     USA   West  Other
account measure date
acct1   Expense 1/31/2019    NaN   100.0   90.0   10.0
        Sales   1/31/2019    NaN   150.0  140.0   10.0
acct2   Expense 1/31/2019  500.0  1000.0  450.0   50.0
        Sales   1/31/2019  800.0  1500.0  500.0  200.0

You can then keep working with the data in this format, or stack things back up to recover the original form:

In [51]: regions.drop("USA", axis=1).stack().reset_index().rename(columns={0: 'value'})
Out[51]:
  account  measure       date region  value
0   acct1  Expense  1/31/2019   West   90.0
1   acct1  Expense  1/31/2019  Other   10.0
2   acct1    Sales  1/31/2019   West  140.0
3   acct1    Sales  1/31/2019  Other   10.0
4   acct2  Expense  1/31/2019   East  500.0
5   acct2  Expense  1/31/2019   West  450.0
6   acct2  Expense  1/31/2019  Other   50.0
7   acct2    Sales  1/31/2019   East  800.0
8   acct2    Sales  1/31/2019   West  500.0
9   acct2    Sales  1/31/2019  Other  200.0
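
Because the date is part of the index, the same reshaping should carry over to datasets with several dates. Below is a sketch that wraps the steps in a helper and runs it on two dates. The function name total_to_other, its parameters, and the 2/28/2019 rows are hypothetical and added only for illustration. The explicit dropna() is there because whether stack() discards NaN entries depends on the pandas version.

```python
import pandas as pd

def total_to_other(df, total="USA", other="Other"):
    """Replace the `total` region rows with a remainder labelled `other`."""
    keys = ["account", "measure", "date"]
    # One row per account/measure/date, one column per region.
    wide = df.set_index(keys + ["region"])["value"].unstack("region")
    # Remainder = total minus the sum of the other regions (NaN skipped).
    parts = wide.drop(columns=total)
    wide[other] = wide[total] - parts.sum(axis=1)
    # Back to long form, without the total and without empty combinations.
    tidy = (
        wide.drop(columns=total)
        .stack()
        .dropna()
        .rename("value")
        .reset_index()
    )
    return tidy[["account", "region", "measure", "value", "date"]]

rows = [
    ("acct1", "USA", "Expense", 100, "1/31/2019"),
    ("acct1", "West", "Expense", 90, "1/31/2019"),
    ("acct2", "USA", "Expense", 1000, "1/31/2019"),
    ("acct2", "East", "Expense", 500, "1/31/2019"),
    ("acct2", "West", "Expense", 450, "1/31/2019"),
    # Hypothetical second date, not present in the original question.
    ("acct1", "USA", "Expense", 120, "2/28/2019"),
    ("acct1", "West", "Expense", 100, "2/28/2019"),
]
df = pd.DataFrame(rows, columns=["account", "region", "measure", "value", "date"])

result = total_to_other(df)
print(result)
```

This assumes each account/region/measure/date combination appears at most once; unstack raises an error on duplicate index entries, in which case the rows would need to be aggregated first.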
