python - 与不为零的更近的邻居重复值,直到值改变
问题描述
我有一个包含大约 3000 万条记录的数据框,按 ID 分类并按日期打开。
我需要在以前的日期数中分配一列最近行的值,我已经尝试使用 windows 参数使用滚动函数,但不幸的是我必须分配的数字是可变的(我的意思是,它可以是12, 6, 7, ...),并且此参数不接受来自同一 df 的变量。
工作正常,但 12 数字可以是可变的:
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=12)
df['REPEAT_AMOUNT'] = df['AMOUNT'].rolling(window, min_periods=1).sum()
我尝试通过使用 df 中的变量来更改 window_size 的数量:
df['REPEAT_AMOUNT'] = df['AMOUNT'].rolling(window=df['VARIABLE_DISTRIBUTION'].astype(int), min_periods=1).sum()
但我收到以下错误:
ValueError: window must be an integer
此外,第一个记录的月份差异可能与后续记录的日期数字不同。
这是我目前拥有的:
+----+-----------------------+------------+---------+
| ID | VARIABLE_DISTRIBUTION | DATE | AMOUNT |
+----+-----------------------+------------+---------+
| 1 | 12 | 30-04-2021 | - |
| 1 | 12 | 31-05-2021 | - |
| 1 | 12 | 30-06-2021 | - |
| 1 | 12 | 31-07-2021 | 100 |
| 1 | 12 | 31-08-2021 | - |
| 1 | 12 | 30-09-2021 | - |
| 1 | 12 | 31-10-2021 | - |
| 1 | 12 | 30-11-2021 | - |
| 1 | 12 | 31-12-2021 | - |
| 1 | 12 | 31-01-2022 | - |
| 1 | 12 | 28-02-2022 | - |
| 1 | 12 | 31-03-2022 | - |
| 1 | 12 | 30-04-2022 | - |
| 1 | 12 | 31-05-2022 | - |
| 1 | 12 | 30-06-2022 | - |
| 1 | 12 | 31-07-2022 | 150 |
| 2 | 8 | 30-04-2021 | - |
| 2 | 8 | 31-05-2021 | 200 |
| 2 | 8 | 30-06-2021 | - |
| 2 | 8 | 31-07-2021 | - |
| 2 | 8 | 31-08-2021 | - |
| 2 | 8 | 30-09-2021 | - |
| 2 | 8 | 31-10-2021 | - |
| 2 | 8 | 30-11-2021 | - |
| 2 | 8 | 31-12-2021 | - |
| 2 | 8 | 31-01-2022 | 300 |
| 2 | 8 | 28-02-2022 | - |
| 2 | 8 | 31-03-2022 | - |
| 2 | 8 | 30-04-2022 | - |
| 2 | 8 | 31-05-2022 | - |
| 2 | 8 | 30-06-2022 | - |
| 2 | 8 | 31-07-2022 | - |
+----+-----------------------+------------+---------+
这就是我想要的:
+----+-----------------------+------------+----------+----------------+
| ID | VARIABLE_DISTRIBUTION | DATE | AMOUNT | REPEAT_AMOUNT |
+----+-----------------------+------------+----------+----------------+
| 1 | 12 | 30-04-2021 | - | 100 |
| 1 | 12 | 31-05-2021 | - | 100 |
| 1 | 12 | 30-06-2021 | - | 100 |
| 1 | 12 | 31-07-2021 | 100 | 100 |
| 1 | 12 | 31-08-2021 | - | 150 |
| 1 | 12 | 30-09-2021 | - | 150 |
| 1 | 12 | 31-10-2021 | - | 150 |
| 1 | 12 | 30-11-2021 | - | 150 |
| 1 | 12 | 31-12-2021 | - | 150 |
| 1 | 12 | 31-01-2022 | - | 150 |
| 1 | 12 | 28-02-2022 | - | 150 |
| 1 | 12 | 31-03-2022 | - | 150 |
| 1 | 12 | 30-04-2022 | - | 150 |
| 1 | 12 | 31-05-2022 | - | 150 |
| 1 | 12 | 30-06-2022 | - | 150 |
| 1 | 12 | 31-07-2022 | 150 | 150 |
| 2 | 8 | 30-04-2021 | - | 200 |
| 2 | 8 | 31-05-2021 | 200 | 200 |
| 2 | 8 | 30-06-2021 | - | 300 |
| 2 | 8 | 31-07-2021 | - | 300 |
| 2 | 8 | 31-08-2021 | - | 300 |
| 2 | 8 | 30-09-2021 | - | 300 |
| 2 | 8 | 31-10-2021 | - | 300 |
| 2 | 8 | 30-11-2021 | - | 300 |
| 2 | 8 | 31-12-2021 | - | 300 |
| 2 | 8 | 31-01-2022 | 300 | 300 |
| 2 | 8 | 28-02-2022 | - | - |
| 2 | 8 | 31-03-2022 | - | - |
| 2 | 8 | 30-04-2022 | - | - |
| 2 | 8 | 31-05-2022 | - | - |
| 2 | 8 | 30-06-2022 | - | - |
| 2 | 8 | 31-07-2022 | - | - |
+----+-----------------------+------------+----------+----------------+
我感谢有关窗口参数或任何其他类型解决方案的任何帮助。
对不起非母语英语
非常感谢。
解决方案
IIUC 用途groupby bfill
:
df['REPEAT_AMOUNT'] = df.groupby('ID')['AMOUNT'].bfill()
df
:
ID VARIABLE_DISTRIBUTION DATE AMOUNT REPEAT_AMOUNT
0 1 12 30-04-2021 NaN 100
1 1 12 31-05-2021 NaN 100
2 1 12 30-06-2021 NaN 100
3 1 12 31-07-2021 100 100
4 1 12 31-08-2021 NaN 150
5 1 12 30-09-2021 NaN 150
6 1 12 31-10-2021 NaN 150
7 1 12 30-11-2021 NaN 150
8 1 12 31-12-2021 NaN 150
9 1 12 31-01-2022 NaN 150
10 1 12 28-02-2022 NaN 150
11 1 12 31-03-2022 NaN 150
12 1 12 30-04-2022 NaN 150
13 1 12 31-05-2022 NaN 150
14 1 12 30-06-2022 NaN 150
15 1 12 31-07-2022 150 150
16 2 8 30-04-2021 NaN 200
17 2 8 31-05-2021 200 200
18 2 8 30-06-2021 NaN 300
19 2 8 31-07-2021 NaN 300
20 2 8 31-08-2021 NaN 300
21 2 8 30-09-2021 NaN 300
22 2 8 31-10-2021 NaN 300
23 2 8 30-11-2021 NaN 300
24 2 8 31-12-2021 NaN 300
25 2 8 31-01-2022 300 300
26 2 8 28-02-2022 NaN NaN
27 2 8 31-03-2022 NaN NaN
28 2 8 30-04-2022 NaN NaN
29 2 8 31-05-2022 NaN NaN
30 2 8 30-06-2022 NaN NaN
31 2 8 31-07-2022 NaN NaN
推荐阅读
- swift - 尝试加载大图像时 UIImagePickerController 崩溃
- java - 识别列表中已在数据库中的项目并将其删除
- c# - 如何在c#中执行cmd命令并返回值
- java - 我的插入排序有什么问题
- geode - spring boot 数据 geode 放入区域导致重复的小写 PDX 元素
- python - 使用扩张因果卷积和 Tensorflow 2.0 的时间序列预测
- php - Laravel:“强制”查询构建器返回空列表
- javascript - 试图简化这个功能,所以我不必运行它 9 次
- powerbi - 分页报告 - 将所有单个过滤器值分别导出为 pdf
- javascript - 我可以使用闭包来保存在功能范围内声明的路由器吗?