python - Pandas DataFrame identifiying rows that share column values according to a specific logic
问题描述
I have a dataset of tickets bought from a transportation service provider, with columns
['ID', 'EMAIL', 'FROM', 'TO', 'DATE']
And I want to detect tickets that are two-way tickets aka. FROM = TO of each other
for example :
The ticket {125, 'ana.alvez@gmail.com', 'Paris', 'Berlin', '01/01/2020'}
and ticket {426, 'ana.alvez@gmail.com', 'Berlin', 'Paris', '01/01/2020'}
are linked.
I did the following code to handle it but it has bad performance.
def isTwoway(data, EMAIL, FROM, TO, DATE):
t = data.query('EMAIL == "{0}" & FROM == "{1}" & TO == "{2}" & DATE == "{3}"'.format(EMAIL, TO, FROM, DATE))
return len(t) > 0
df['isTwoway'] = df.apply(lambda x: isTwoway(df, x.EMAIL, x.FROM, x.TO, x.DATE), axis=1)
MCVE
import pandas as pd
df = pd.DataFrame({'ID' : [1, 2, 3, 4, 5],
'EMAIL' : ['ana.alvez@gmail.com', 'sara.dispa@yahoo.com', 'mona.talbi@hotmail.com', 'mona.talbi@hotmail.com',
'ana.alvez@gmail.com'],
'FROM' : ['Paris', 'Madrid', 'Casablanca', 'Berlin', 'Berlin'],
'TO' : ['Berlin', 'Dublin', 'Porto', 'Paris', 'Paris'],
'DATE' : ['12/01/2020', '13/01/2020', '27/01/2020', '27/01/2020', '12/01/2020']})
def isTwoway(data, EMAIL, FROM, TO, DATE):
t = data.query('EMAIL == "{0}" & FROM == "{1}" & TO == "{2}" & DATE == "{3}"'.format(EMAIL, TO, FROM, DATE))
return len(t) > 0
df['isTwoway'] = df.apply(lambda x: isTwoway(df, x.EMAIL, x.FROM, x.TO, x.DATE), axis=1)
解决方案
pd.merge
您可以尝试对同一数据框进行左连接或内连接 ( ),并为left_on
和right_on
参数指定不同的列,以便您可以设置 FROM=TO。之后,在 EMAIL 和 DATE 列中删除重复项以查找匹配的票对。
import pandas as pd
# make data
df = pd.DataFrame({'ID': [1,2,3,4,5], 'EMAIL': ['a@a.com', 'b@b.com', 'a@a.com', 'b@b.com', 'c@c.com'], 'FROM': ['Berlin', 'Paris', "Paris", 'Berlin', "Berlin"], 'TO': ["Paris", "Berlin", "Berlin", "Paris", "Paris"], 'DATE': ["01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020"]})
df
ID EMAIL FROM TO DATE
0 1 a@a.com Berlin Paris 01/01/2020
1 2 b@b.com Paris Berlin 01/01/2020
2 3 a@a.com Paris Berlin 01/01/2020
3 4 b@b.com Berlin Paris 01/01/2020
4 5 c@c.com Berlin Paris 01/01/2020
# carry out a left merge with the indicator flag set to TRUE to see the result (could alternatively do an "inner" merge)
dfout = df.merge(df, how="left", left_on=["EMAIL", "FROM", "TO", "DATE"], right_on=["EMAIL", "TO", "FROM", "DATE"], indicator=True)
dfout
ID_x EMAIL FROM_x TO_x DATE ID_y FROM_y TO_y _merge
0 1 a@a.com Berlin Paris 01/01/2020 3.0 Paris Berlin both
1 2 b@b.com Paris Berlin 01/01/2020 4.0 Berlin Paris both
2 3 a@a.com Paris Berlin 01/01/2020 1.0 Berlin Paris both
3 4 b@b.com Berlin Paris 01/01/2020 2.0 Paris Berlin both
4 5 c@c.com Berlin Paris 01/01/2020 NaN NaN NaN left_only
# drop the duplicates to see the results
dfout.drop_duplicates(subset=["EMAIL", "DATE"])
ID_x EMAIL FROM_x TO_x DATE ID_y FROM_y TO_y _merge
0 1 a@a.com Berlin Paris 01/01/2020 3.0 Paris Berlin both
1 2 b@b.com Paris Berlin 01/01/2020 4.0 Berlin Paris both
4 5 c@c.com Berlin Paris 01/01/2020 NaN NaN NaN left_only
推荐阅读
- python - Numpy.where(ValueError:只能比较相同标签的系列对象)
- ios - UIKit:如何将多个 CALayer 混合到一个视图层中?
- azure - Pulumi azure-native 提供程序:Azure WebApp:参数 LinuxFxVersion 的值无效
- feathersjs - 使用 feathersjs 登录 github 的问题
- android - Material Theme Colors 导致空白 Jetpack Compose 按钮
- python - Flask 将在 __init__.py 中执行代码,但无法让它在 app.py 中运行
- angular - 如何通过 Angular 中的 POST 请求将 CSV 文件发送到 Spring 应用程序?
- esp32 - 如何修复 - 无法在 Esp32 中打开端口“/dev/cu.SLAB_USBtoUART”
- azure-devops - Azure devops 发布管道触发器
- postgresql - 如何在 azure 管道中使用的 windows-latest 代理上使用 postgres db?