python - How to create an indicator column that flags specific changes from the previous entry in a DataFrame?
Problem description
Situation:
I currently have a client DataFrame, sorted by CLIENT_ID and CURRENT_DATE_STATUS.
It looks like this:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS |
---|---|---|
10002 | 2017-07-21 | STARTED |
10002 | 2017-07-21 | STARTED |
10002 | 2018-07-01 | CHURNED |
10002 | 2018-07-01 | CHURNED |
10002 | 2019-01-01 | RESTARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-12-31 | RESTARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-11-01 | CHURNED |
22300 | 2018-05-06 | STARTED |
22300 | 2018-05-06 | STARTED |
Question:
How do I create a Boolean (1 or 0) indicator column that shows, for each CLIENT_ID:
- whether the STATUS has changed from the previous entry to CHURNED or RESTARTED.
Goal:
The resulting DataFrame should look like this:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS | STOPPED |
---|---|---|---|
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2018-07-01 | CHURNED | 1 |
10002 | 2018-07-01 | CHURNED | 0 |
10002 | 2019-01-01 | RESTARTED | 1 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-12-31 | RESTARTED | 1 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-11-01 | CHURNED | 1 |
22300 | 2018-05-06 | STARTED | 0 |
22300 | 2018-05-06 | STARTED | 0 |
Code to generate the DataFrame described above:
import pandas as pd
data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)
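Solution
You can test the current values for equality with Series.eq, test the values shifted per group with DataFrameGroupBy.shift for inequality with Series.ne, chain each pair with & for bitwise AND, then chain the two results with | for bitwise OR, and convert to integers: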
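# Previous STATUS within each CLIENT_ID (missing on each client's first row)
s = df.groupby('CLIENT_ID')['STATUS'].shift()
# Row is RESTARTED / CHURNED and the previous row of the same client was not
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
# Combine both masks and convert True/False to 1/0
df['STOPPED'] = (m1 | m2).astype(int)
print(df)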
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
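To see what the per-group shift contributes, it can help to keep the shifted Series as a column and inspect it next to the result. The following is only a minimal sketch on a smaller, hypothetical frame (`df_demo`, `PREV_STATUS` and the sample rows are assumptions for illustration, not part of the question's data). It also shows an edge case worth knowing about: a client whose very first row is already CHURNED has no previous value, so the comparison with the missing value counts as "not equal" and the row is flagged.

```python
import pandas as pd

# Hypothetical sample: the second client starts directly with CHURNED
df_demo = pd.DataFrame({
    'CLIENT_ID': [10002, 10002, 10002, 11811, 11811],
    'STATUS': ['STARTED', 'CHURNED', 'CHURNED', 'CHURNED', 'RESTARTED'],
})

# Previous STATUS per client; the first row of each client has no predecessor
prev_status = df_demo.groupby('CLIENT_ID')['STATUS'].shift()

# Same masks as in the solution above
m1 = df_demo['STATUS'].eq('RESTARTED') & prev_status.ne('RESTARTED')
m2 = df_demo['STATUS'].eq('CHURNED') & prev_status.ne('CHURNED')

df_demo['PREV_STATUS'] = prev_status
df_demo['STOPPED'] = (m1 | m2).astype(int)
print(df_demo)
```

If a first-row CHURNED or RESTARTED should not count as a change, one option is to additionally require `prev_status.notna()` in the final mask.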
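Another solution is to compare the shifted values with the current values for inequality, test membership in a list with Series.isin, and chain the two masks with & for bitwise AND: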
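# True where STATUS differs from the previous row of the same client
m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
# True where STATUS is one of the statuses of interest
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])
df['STOPPED'] = (m3 & m4).astype(int)
print(df)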
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
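Because both solutions depend on row order within each client, the second approach could be wrapped in a small helper that sorts first and leaves the input untouched. This is a sketch under stated assumptions: the function name `add_stopped_flag` and its `stop_statuses` parameter are hypothetical, and the dates are assumed to be ISO-formatted strings (or datetimes) so that sorting them gives chronological order.

```python
import pandas as pd

def add_stopped_flag(frame, stop_statuses=('CHURNED', 'RESTARTED')):
    """Return a sorted copy of `frame` with a 1/0 STOPPED column (hypothetical helper)."""
    # A stable sort keeps the original order of rows that share client and date
    out = frame.sort_values(['CLIENT_ID', 'CURRENT_DATE_STATUS'], kind='stable').copy()
    # Previous STATUS within each client
    prev = out.groupby('CLIENT_ID')['STATUS'].shift()
    # Flag rows whose STATUS changed and is one of the statuses of interest
    changed = prev.ne(out['STATUS'])
    is_stop = out['STATUS'].isin(list(stop_statuses))
    out['STOPPED'] = (changed & is_stop).astype(int)
    return out

data = {'CLIENT_ID': [10002, 10002, 10002, 10002, 10002, 11811, 11811, 11811,
                      22101, 22101, 22101, 22101, 22300, 22300],
        'CURRENT_DATE_STATUS': ['2017-07-21', '2017-07-21', '2018-07-01', '2018-07-01',
                                '2019-07-01', '2019-08-15', '2019-08-15', '2019-12-31',
                                '2020-03-11', '2020-03-11', '2020-03-11', '2020-11-01',
                                '2018-05-06', '2018-05-06'],
        'STATUS': ['STARTED', 'STARTED', 'CHURNED', 'CHURNED', 'RESTARTED',
                   'STARTED', 'STARTED', 'RESTARTED',
                   'STARTED', 'STARTED', 'STARTED', 'CHURNED',
                   'STARTED', 'STARTED']}
result = add_stopped_flag(pd.DataFrame(data))
print(result)
```

On the question's data this should reproduce the STOPPED column from the goal table, and passing a different `stop_statuses` tuple would flag transitions into other statuses.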