Show the start and end dates for a percentage difference (in Python)

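Problem description

I have a dataset, df, in which I want to find the difference and the percentage difference between values. For each id, I want to take an earlier date and compare its value with the value on the next consecutive date. The desired output should show both dates being compared.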

 id    date         value

 1     11/01/2020   10
 2     11/01/2020   5
 1     10/01/2020   20
 2     10/01/2020   30
 1     09/01/2020   15
 2     09/01/2020   10
 3     11/01/2020   5

Desired output

  id    startdate       enddate           diff     percent

  1     09/01/2020      10/01/2020        5         33
  1     10/01/2020      11/01/2020       -10       -50
  2     09/01/2020      10/01/2020        20        200
  2     10/01/2020      11/01/2020       -25       -83.33
  3     11/01/2020      11/01/2020        0         0

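I want to look at one group at a time, compare each value with the next one, and find the difference and the percentage increase.

For example:

ID 1, from 09/01/2020 to 10/01/2020: 15 to 20, a difference of 5 and a percentage difference of 33%.

From 10/01/2020 to 11/01/2020: 20 to 10, a difference of -10 and a percentage difference of -50%.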
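The arithmetic in this example can be sketched in plain Python. The helper name `change` is made up for illustration and is not part of the original code; it assumes the percentage is measured relative to the earlier value and rounded to two decimals.

```python
def change(previous, current):
    """Return (diff, percent) when going from `previous` to `current`."""
    diff = current - previous
    # Percent change is relative to the earlier value (assumption stated above).
    percent = round(diff / previous * 100, 2)
    return diff, percent


# ID 1 values in date order: 15 (09/01), 20 (10/01), 10 (11/01)
values = [15, 20, 10]

# Pair each value with the one that follows it and compute both measures.
steps = [change(a, b) for a, b in zip(values, values[1:])]
print(steps)
```

Each tuple holds the difference and the percentage change for one pair of consecutive dates, which is what every row of the desired output contains.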

Thanks to members of this platform, the code below works well, but how can I show both dates?

result = (df.sort_values(["id", "date", "value"])
              # use this later to drop the first row per group
              # if number is greater than 1, else leave as-is
      .assign(counter=lambda x: x.groupby("id").date.transform("size"),
              date_shift=lambda x: x.groupby(["id"]).date.shift(1),
              value_shift=lambda x: x.groupby("id").value.shift(1),
              diff=lambda x: x.value - x.value_shift,
              percent=lambda x: x["diff"].div(x.value_shift).mul(100).round(2))
       # here is where the counter column becomes useful
       # drop rows where date_shift is null and counter is > 1
       # this way if number of rows in the group is just one it is kept, 
       # if greater than one, the first row is dropped, 
       # as the first row would have nulls due to the `shift` method.
      .query("not (date_shift.isna() and counter>1)")
      .loc[:, ["id", "date", "diff", "percent"]]
      .fillna(0))

result

Any suggestions are appreciated.

Tags: python, pandas, numpy, time-series

Solution

There is no magic here. Just use groupby and shift to generate the required columns one by one.
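As a minimal sketch of why the shift is done per group, the toy frame below (made-up data, with hypothetical column names `prev_plain` and `prev_grouped`) compares a plain `shift()` with a grouped one.

```python
import pandas as pd

# Toy data: two ids with two rows each, already in date order.
toy = pd.DataFrame({"id": [1, 1, 2, 2], "value": [15, 20, 10, 30]})

# A plain shift ignores group boundaries, so the first row of id 2
# receives the last value of id 1.
toy["prev_plain"] = toy["value"].shift()

# A grouped shift restarts for every id, so the first row of each
# group has no previous value (NaN).
toy["prev_grouped"] = toy.groupby("id")["value"].shift()

print(toy)
```

The grouped version is the one that makes "previous date within the same id" comparisons safe.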

Data

import pandas as pd
import io

df = pd.read_csv(io.StringIO("""
id    date         value
1     11/01/2020   10
2     11/01/2020   5
1     10/01/2020   20
2     10/01/2020   30
1     09/01/2020   15
2     09/01/2020   10
3     11/01/2020   5
"""), sep=r"\s{2,}", engine="python")

df["date"] = pd.to_datetime(df["date"])

Code

# sort by id and date, then group by id
df2 = df.sort_values(["id", "date", "value"])
gp_obj = df2.groupby("id")

# produce the required columns; shift() and diff() operate within each id
df2["startdate"] = gp_obj["date"].shift()
df2["diff"] = gp_obj["value"].diff()
# divide by the previous value of the same id (grouped shift, not a plain shift)
df2["percent"] = df2["diff"] / gp_obj["value"].shift() * 100

# drop the first row of each group (it has no previous date) and reorder the columns
df2 = df2.rename(columns={"date": "enddate"})
df2 = df2.dropna(subset=["diff"])[["id", "startdate", "enddate", "diff", "percent"]]

Output

print(df2)
   id  startdate    enddate  diff     percent
2   1 2020-09-01 2020-10-01   5.0   33.333333
0   1 2020-10-01 2020-11-01 -10.0  -50.000000
3   2 2020-09-01 2020-10-01  20.0  200.000000
1   2 2020-10-01 2020-11-01 -25.0  -83.333333
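Note that this output has no row for id 3, whereas the desired output in the question keeps single-row groups with a diff and percent of 0. One possible way to get that, sketched here as a variation and not as part of the original answer, is to keep the lone row of such groups and fill in its missing values. The names `out`, `prev_value`, and `group_size` are illustrative only.

```python
import pandas as pd

# Same data as above, built inline so the sketch is self-contained.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 1, 2, 3],
    "date": pd.to_datetime(
        ["11/01/2020", "11/01/2020", "10/01/2020", "10/01/2020",
         "09/01/2020", "09/01/2020", "11/01/2020"],
        format="%m/%d/%Y",
    ),
    "value": [10, 5, 20, 30, 15, 10, 5],
})

out = df.sort_values(["id", "date"]).rename(columns={"date": "enddate"})
grouped = out.groupby("id")

# Per-group helpers: previous date, previous value, and group size.
start = grouped["enddate"].shift()
prev_value = grouped["value"].shift()
group_size = grouped["value"].transform("size")

out["startdate"] = start
out["diff"] = out["value"] - prev_value
out["percent"] = (out["diff"] / prev_value * 100).round(2)

# Keep rows that have a previous date, plus the only row of single-row groups.
out = out[out["startdate"].notna() | (group_size == 1)].copy()

# A single-row group is compared with itself: same date, zero change.
out["startdate"] = out["startdate"].fillna(out["enddate"])
out[["diff", "percent"]] = out[["diff", "percent"]].fillna(0)

out = out[["id", "startdate", "enddate", "diff", "percent"]].reset_index(drop=True)
print(out)
```

Rounding to two decimals matches the precision used in the question; drop `.round(2)` to keep full precision.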
