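python - Need a way to do a many-to-one merge with pandas.merge_asof()
Problem description
I have a problem similar to the one in this post: pandas merging based on a timestamp which do not fully match
However, I need many-to-one matching while keeping the behavior of pandas.merge_asof().
I have two dataframes, df1 and df2.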
import pandas as pd
import numpy as np
from io import StringIO
dtc = [['CALL_DATE']]
df1 = pd.read_csv(StringIO(u'''
CALL_DATE,customer,status
2017-01-03 14:12:58,70892,P
2017-01-06 20:00:25,70892,P
2017-01-07 09:42:58,70892,X
2017-01-03 13:56:41,70928,N
2017-01-07 15:16:26,70928,C
2017-01-03 15:39:11,71075,U
2017-01-03 15:46:29,71075,N
'''))
df2 = pd.read_csv(StringIO(u'''
CALL_DATE,customer,Note
2017-01-03 14:09:00,70892,Call to return
2017-01-06 19:59:00,70892,Wrong Item shipped
2017-01-07 09:36:00,70892,Survey denied
2017-01-03 13:56:00,70928,TGGT
2017-01-03 13:53:00,70928,Open issue
2017-01-03 13:56:00,70928,No Record of listings
2017-01-07 15:15:00,70928,Need Translator
2017-01-07 15:16:00,70928,rescheduled appointment
2017-01-03 15:39:11,71075,New Contact
2017-01-03 15:46:29,71075,open membership
2017-01-03 15:46:29,71075,recurring delivery scheduled
'''))
df1['CALL_DATE'] = pd.to_datetime(df1['CALL_DATE'], format = '%Y-%m-%d %H:%M:%S')
df2['CALL_DATE'] = pd.to_datetime(df2['CALL_DATE'], format = '%Y-%m-%d %H:%M:%S')
These two dataframes need to be merged so that the final result looks something like this:
df3 = pd.read_csv(StringIO(u'''
2017-01-03 14:12:58,70892,P,2017-01-03 14:09:00,Call to return
2017-01-06 20:00:25,70892,P,2017-01-06 19:59:00,Wrong Item shipped
2017-01-07 09:42:58,70892,X,2017-01-07 09:36:00,Survey denied
2017-01-03 13:56:41,70928,N,2017-01-03 13:56:00,TGGT
2017-01-03 13:56:41,70928,N,2017-01-03 13:53:00,Open issue
2017-01-03 13:56:41,70928,N,2017-01-03 13:56:00,No Record of listings
2017-01-07 15:16:26,70928,C,2017-01-07 15:15:00,Need Translator
2017-01-07 15:16:26,70928,C,2017-01-07 15:16:00,rescheduled appointment
2017-01-03 15:39:11,71075,U,2017-01-03 15:39:11,New Contact
2017-01-03 15:46:29,71075,N,2017-01-03 15:46:29,open membership
2017-01-03 15:46:29,71075,N,2017-01-03 15:46:29,recurring delivery scheduled
'''))
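In the sample data the time differences are small, but in many cases they can range from a few hours to almost a full day. I am trying to match each note to the nearest entry in time for that customer. A df2 entry can fall either before or after the df1 entry.
When I run pandas.merge_asof(), it only performs a one-to-one merge, and I lose notes that should go with the customer's record.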
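The lost notes can be reproduced on a tiny example. The sketch below is only an illustration: the frames `status` and `notes` are hypothetical miniatures of df1 and df2, and it assumes the original call had df1 as the left frame. Since merge_asof returns one row per row of the left frame, three notes collapse into a single row.

```python
import pandas as pd

# Hypothetical miniature of df1: a single status entry for one customer.
status = pd.DataFrame({
    "CALL_DATE": pd.to_datetime(["2017-01-03 13:56:41"]),
    "customer": [70928],
    "status": ["N"],
})

# Hypothetical miniature of df2: three notes close to that entry (already sorted).
notes = pd.DataFrame({
    "CALL_DATE": pd.to_datetime([
        "2017-01-03 13:53:00",
        "2017-01-03 13:56:00",
        "2017-01-03 13:56:00",
    ]),
    "customer": [70928, 70928, 70928],
    "Note": ["Open issue", "TGGT", "No Record of listings"],
})

# merge_asof keeps one output row per row of the LEFT frame,
# so with the status frame on the left only one note survives.
one_to_one = pd.merge_asof(
    status, notes, on="CALL_DATE", by="customer", direction="nearest"
)
print(len(notes), "notes in,", len(one_to_one), "row out")
```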
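Solution
Perhaps all you need to do is switch the order of the dataframes in your merge_asof call? merge_asof keeps every row of the left frame, so putting df2 on the left keeps every note. This works for me: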
df1.sort_values(by='CALL_DATE', inplace=True)
df2.sort_values(by='CALL_DATE', inplace=True)
df1['STATUS_DATE'] = df1.CALL_DATE # preserves times from df1
df3 = pd.merge_asof(df2, df1, on='CALL_DATE', by='customer', direction='nearest')
Calling print(df3) gives the following output (on my machine):
CALL_DATE customer Note status \
0 2017-01-03 13:53:00 70928 Open issue N
1 2017-01-03 13:56:00 70928 TGGT N
2 2017-01-03 13:56:00 70928 No Record of listings N
3 2017-01-03 14:09:00 70892 Call to return P
4 2017-01-03 15:39:11 71075 New Contact U
5 2017-01-03 15:46:29 71075 open membership N
6 2017-01-03 15:46:29 71075 recurring delivery scheduled N
7 2017-01-06 19:59:00 70892 Wrong Item shipped P
8 2017-01-07 09:36:00 70892 Survey denied X
9 2017-01-07 15:15:00 70928 Need Translator C
10 2017-01-07 15:16:00 70928 rescheduled appointment C
STATUS_DATE
0 2017-01-03 13:56:41
1 2017-01-03 13:56:41
2 2017-01-03 13:56:41
3 2017-01-03 14:12:58
4 2017-01-03 15:39:11
5 2017-01-03 15:46:29
6 2017-01-03 15:46:29
7 2017-01-06 20:00:25
8 2017-01-07 09:42:58
9 2017-01-07 15:16:26
10 2017-01-07 15:16:26
If the column order bothers you, you can always reorder the columns.
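As a rough sketch of that reordering, the example below repeats the swapped merge on a small hypothetical pair of frames and then selects the columns in the order the question asked for. The name `NOTE_DATE` is an assumption introduced here for readability; it does not appear in the original post.

```python
import pandas as pd

# Hypothetical miniatures of df1 and df2, both already sorted by CALL_DATE.
status = pd.DataFrame({
    "CALL_DATE": pd.to_datetime(["2017-01-03 13:56:41", "2017-01-07 15:16:26"]),
    "customer": [70928, 70928],
    "status": ["N", "C"],
})
notes = pd.DataFrame({
    "CALL_DATE": pd.to_datetime([
        "2017-01-03 13:53:00",
        "2017-01-03 13:56:00",
        "2017-01-07 15:15:00",
    ]),
    "customer": [70928, 70928, 70928],
    "Note": ["Open issue", "TGGT", "Need Translator"],
})

# Keep a copy of the status timestamp, because the merged CALL_DATE
# column comes from the left frame (the notes).
status["STATUS_DATE"] = status["CALL_DATE"]
merged = pd.merge_asof(
    notes, status, on="CALL_DATE", by="customer", direction="nearest"
)

# Rename the note timestamp (NOTE_DATE is an illustrative name) and
# put the status columns first, as in the desired result.
merged = merged.rename(columns={"CALL_DATE": "NOTE_DATE"})
merged = merged[["STATUS_DATE", "customer", "status", "NOTE_DATE", "Note"]]
print(merged)
```

Selecting with a list of column names returns a new frame, so the merged result from the answer is left untouched if you assign to a different variable.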