pandas - Select all records from the 6 months preceding each occurrence of a specific value in a pandas column
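Problem description
Whenever a customer completes a specific transaction, I want to select all of that customer's records from the preceding 6 months. The data looks like this: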
Cust_ID  Transaction_Date  Amount  Description
1        08/01/2017        12      Moved
1        03/01/2017        15      X
1        01/01/2017        8       Y
2        10/01/2018        6       Moved
2        02/01/2018        12      Z
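Here, I want to look for the description "Moved" and then select all records from the past 6 months for each Cust_ID.
The output should look like this: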
Cust_ID  Transaction_Date  Amount  Description
1        08/01/2017        12      Moved
1        03/01/2017        15      X
2        10/01/2018        6       Moved
I want to do this in Python. Please help.
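Solution
The idea is to build a Series of datetimes by filtering the Moved rows and shifting their dates back by a six-month offset, then look up each row's cutoff with Series.map and keep only the rows dated later than that cutoff.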
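A minimal sketch of this first idea, assuming every customer has exactly one Moved row (Series.map needs a unique index to look up by Cust_ID). The sample frame is rebuilt from the question's data, the dates are assumed to be month/day/year, and the names cutoff and result are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'Cust_ID': [1, 1, 1, 2, 2],
    'Transaction_Date': ['08/01/2017', '03/01/2017', '01/01/2017',
                         '10/01/2018', '02/01/2018'],
    'Amount': [12, 15, 8, 6, 12],
    'Description': ['Moved', 'X', 'Y', 'Moved', 'Z'],
})
# Assumption: dates are month/day/year
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'], format='%m/%d/%Y')

# Cutoff per customer: date of the Moved row minus six months
cutoff = (df.loc[df['Description'].eq('Moved')]
            .set_index('Cust_ID')['Transaction_Date'] - pd.DateOffset(months=6))

# Keep rows dated strictly later than the customer's cutoff
result = df[df['Cust_ID'].map(cutoff) < df['Transaction_Date']]
print(result)
```

On the question's data this keeps rows 0, 1 and 3, which matches the expected output. The filter has no upper bound, so rows dated after the Moved transaction would be kept as well.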
EDIT: To get all datetimes for each Moved value:
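import pandas as pd

df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
# label each row with the next Moved row at or after it (reverse cumulative count)
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
# cutoff for each Moved row: its date minus six months
s = (df[df['Description'].eq('Moved')]
       .set_index(['Cust_ID','g'])['Transaction_Date'] - pd.DateOffset(months=6))
# keep rows dated after the cutoff of their own Moved row
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)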
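The helper column g is the least obvious step above. The sketch below, on a made-up single-customer Series (not data from the question), shows how a reversed cumulative sum gives every row the label of the next Moved row at or after it, so that each row can be joined to that Moved row's cutoff.

```python
import pandas as pd

# Hypothetical descriptions for one customer, already sorted by date
desc = pd.Series(['Y', 'X', 'Moved', 'Z', 'Moved'])

# Walk the rows from last to first and count the Moved rows seen so far
g_reversed = desc.iloc[::-1].eq('Moved').cumsum()

# Restore the original row order (column assignment does this via index alignment)
g = g_reversed.sort_index()
print(g.tolist())  # [2, 2, 2, 1, 1]
```

Rows 0 to 2 share the label of the Moved row at position 2, and rows 3 to 4 share the label of the Moved row at position 4.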
EDIT1: To get all datetimes relative to the earliest Moved datetime of each group, removing the other Moved rows of each group:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
# convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
# mask that selects the Moved rows
mask = df['Description'].eq('Moved')
# filter and sort these rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
# flag every Moved row after the first one of each customer in df1
mask = df1.duplicated('Cust_ID')
# create the Series used for mapping: first Moved date minus six months
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.DateOffset(months=6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
# mask that filters out the other Moved rows (keeps only the first of each group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
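The cutoff dates printed in s come from calendar-month subtraction. A short sketch of that arithmetic with pd.DateOffset: the first two dates are the Moved dates from the sample above, and the month-end date is a hypothetical extra case.

```python
import pandas as pd

six_months = pd.DateOffset(months=6)

# Moved dates from the sample data
a = pd.Timestamp('2017-01-23') - six_months
b = pd.Timestamp('2018-10-01') - six_months
print(a)  # 2016-07-23 00:00:00
print(b)  # 2018-04-01 00:00:00

# Hypothetical month-end case: the day is clipped to the last valid day of the target month
c = pd.Timestamp('2017-08-31') - six_months
print(c)  # 2017-02-28 00:00:00
```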
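EDIT2:
# rebuild the sorted Moved rows, because df1 was overwritten by the EDIT1 result
df1 = df[df['Description'].eq('Moved')].sort_values(['Cust_ID','Transaction_Date'])
# flag every Moved row except the last one of each customer in df1
mask = df1.duplicated('Cust_ID', keep='last')
# create the Series used for mapping: last Moved date per customer
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']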
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
# keep rows between the last Moved date and six months after it
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.DateOffset(months=6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved
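As an alternative sketch of the same window (from the last Moved row of each customer to six months later), groupby(...).max() can stand in for the duplicated(keep='last') mask. This is a swapped-in formulation, not the answer's code. The frame is rebuilt from the sample above, dates are assumed to be month/day/year, and the names start, lower, upper and out are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the EDIT1 example
df = pd.DataFrame({
    'Cust_ID': [1, 1, 1, 1, 2, 2],
    'Transaction_Date': ['10/01/2017', '01/23/2017', '03/01/2017',
                         '08/08/2017', '10/01/2018', '02/01/2018'],
    'Amount': [12, 15, 8, 12, 6, 12],
    'Description': ['X', 'Moved', 'Y', 'Moved', 'Moved', 'Z'],
})
# Assumption: dates are month/day/year
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'], format='%m/%d/%Y')

# Date of the last Moved row per customer
start = (df.loc[df['Description'].eq('Moved')]
           .groupby('Cust_ID')['Transaction_Date'].max())

# Per-row window bounds, looked up by customer
lower = df['Cust_ID'].map(start)
upper = df['Cust_ID'].map(start + pd.DateOffset(months=6))

# between() is inclusive on both ends by default
out = (df[df['Transaction_Date'].between(lower, upper)]
         .sort_values(['Cust_ID', 'Transaction_Date']))
print(out)
```

Earlier Moved rows of the same customer fall before the window start, so they drop out without a separate mask. Customers with no Moved row get missing bounds and are excluded entirely.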