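python - Show the start and end dates for a percentage difference (in Python)
Question
I have a dataset, df, in which I want to find the difference and the percentage difference between values. For each id, I want to take an earlier date and compare its value with the value at the next consecutive date. The desired output should show the two dates being compared.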
id date value
1 11/01/2020 10
2 11/01/2020 5
1 10/01/2020 20
2 10/01/2020 30
1 09/01/2020 15
2 09/01/2020 10
3 11/01/2020 5
Desired output
id startdate enddate diff percent
1 09/01/2020 10/01/2020 5 33
1 10/01/2020 11/01/2020 -10 -50
2 09/01/2020 10/01/2020 20 200
2 10/01/2020 11/01/2020 -25 -83.33
3 11/01/2020 11/01/2020 0 0
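I want to look at one group at a time, compare each value with the next one, and find the difference and the percentage change.
For example,
ID 1, from 09/01/2020 to 10/01/2020: from 15 to 20, a difference of 5 and a percentage difference of 33%
From 10/01/2020 to 11/01/2020: from 20 to 10, a difference of -10 and a percentage difference of -50%.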
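The arithmetic in this example can be checked with a small helper. This is only a sketch: the function name `change` is hypothetical, and it assumes the earlier value is the base of the percentage, which is what the desired output implies.

```python
def change(old, new):
    """Return (difference, percent change) going from old to new."""
    diff = new - old
    # The earlier value is the denominator of the percentage.
    percent = round(diff / old * 100, 2)
    return diff, percent


print(change(15, 20))  # (5, 33.33)
print(change(20, 10))  # (-10, -50.0)
```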
Thanks to members of this platform, the code below works well, but how do I display both dates?
result = (df.sort_values(["id", "date", "value"])
          # use this later to drop the first row per group
          # if number is greater than 1, else leave as-is
          .assign(counter=lambda x: x.groupby("id").date.transform("size"),
                  date_shift=lambda x: x.groupby(["id"]).date.shift(1),
                  value_shift=lambda x: x.groupby("id").value.shift(1),
                  diff=lambda x: x.value - x.value_shift,
                  percent=lambda x: x["diff"].div(x.value_shift).mul(100).round(2))
          # here is where the counter column becomes useful
          # drop rows where date_shift is null and counter is > 1
          # this way if number of rows in the group is just one it is kept,
          # if greater than one, the first row is dropped,
          # as the first row would have nulls due to the `shift` method.
          .query("not (date_shift.isna() and counter>1)")
          .loc[:, ["id", "date", "diff", "percent"]]
          .fillna(0))
result
Any suggestions are appreciated.
Solution
No magic here. Just use groupby and shift,
and generate the columns one by one as needed.
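To make the groupby-shift idea concrete, here is a minimal sketch on a toy frame. The frame `toy` and the column name `prev` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Toy frame (hypothetical data), already sorted by id and then by time.
toy = pd.DataFrame({"id": [1, 1, 2, 2], "value": [15, 20, 10, 30]})

# shift() within each group moves values down by one row, so the first
# row of every group has no previous value and becomes NaN.
toy["prev"] = toy.groupby("id")["value"].shift()
toy["diff"] = toy["value"] - toy["prev"]
print(toy)
```

Because the shift is done per group, a value from one id never leaks into the first row of the next id.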
Data
import pandas as pd
import io
df = pd.read_csv(io.StringIO("""
id date value
1 11/01/2020 10
2 11/01/2020 5
1 10/01/2020 20
2 10/01/2020 30
1 09/01/2020 15
2 09/01/2020 10
3 11/01/2020 5
"""), sep=r"\s+")  # the columns above are separated by single spaces
# the dates are month/day/year
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
Code
# sort so that each id's rows are in date order, then group by id
df2 = df.sort_values(["id", "date", "value"])
gp_obj = df2.groupby("id")
# produce the required columns
df2["startdate"] = gp_obj["date"].shift()
df2["diff"] = gp_obj["value"].diff()
# divide by the previous value within the same id
df2["percent"] = df2["diff"] / gp_obj["value"].shift() * 100
# drop the first row of each id (it has no previous date) and reorder the columns
df2 = df2.rename(columns={"date": "enddate"})
df2 = df2.loc[df2["diff"].notna(), ["id", "startdate", "enddate", "diff", "percent"]]
Output
print(df2)
   id  startdate    enddate  diff     percent
2   1 2020-09-01 2020-10-01   5.0   33.333333
0   1 2020-10-01 2020-11-01 -10.0  -50.000000
3   2 2020-09-01 2020-10-01  20.0  200.000000
1   2 2020-10-01 2020-11-01 -25.0  -83.333333
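Note that the output above has no row for id 3: its only row has no previous date, so it is dropped along with the other NaN rows. The desired output in the question keeps that row, with the same start and end date and zeros for both measures. The following is one possible sketch of that variant, not the original answer's code. The names `out`, `grp`, `prev_value` and `size` are chosen here for illustration, and rounding to two decimals is an assumption taken from the question.

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""
id date value
1 11/01/2020 10
2 11/01/2020 5
1 10/01/2020 20
2 10/01/2020 30
1 09/01/2020 15
2 09/01/2020 10
3 11/01/2020 5
"""), sep=r"\s+")
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

out = df.sort_values(["id", "date"]).rename(columns={"date": "enddate"})
grp = out.groupby("id")

# Compute every per-group series first, then attach them as columns.
prev_date = grp["enddate"].shift()
prev_value = grp["value"].shift()
size = grp["value"].transform("size")

out["startdate"] = prev_date
out["diff"] = out["value"] - prev_value
out["percent"] = (out["diff"] / prev_value * 100).round(2)

# Drop the first row of multi-row groups, keep the only row of single-row groups.
out = out[out["startdate"].notna() | (size == 1)].copy()

# A single-row group is compared with itself: same date, zero change.
out["startdate"] = out["startdate"].fillna(out["enddate"])
out[["diff", "percent"]] = out[["diff", "percent"]].fillna(0)

out = out[["id", "startdate", "enddate", "diff", "percent"]].reset_index(drop=True)
print(out)
```

The group size is used instead of a blanket `fillna(0)` so that only groups with a single row are kept as zero-change rows.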