首页 > 解决方案 > Pandas DataFrame identifiying rows that share column values according to a specific logic

问题描述

I have a dataset of tickets bought from a transportation service provider, with columns

['ID', 'EMAIL', 'FROM', 'TO', 'DATE']

And I want to detect tickets that are two-way tickets aka. FROM = TO of each other for example : The ticket {125, 'ana.alvez@gmail.com', 'Paris', 'Berlin', '01/01/2020'} and ticket {426, 'ana.alvez@gmail.com', 'Berlin', 'Paris', '01/01/2020'} are linked.

I did the following code to handle it but it has bad performance.

def isTwoway(data, EMAIL, FROM, TO, DATE):
    t = data.query('EMAIL == "{0}" & FROM == "{1}" & TO == "{2}" & DATE == "{3}"'.format(EMAIL, TO, FROM, DATE))
    return len(t) > 0

df['isTwoway'] = df.apply(lambda x: isTwoway(df, x.EMAIL, x.FROM, x.TO, x.DATE), axis=1)

MCVE

import pandas as pd
df = pd.DataFrame({'ID' : [1, 2, 3, 4, 5], 
                   'EMAIL' : ['ana.alvez@gmail.com', 'sara.dispa@yahoo.com', 'mona.talbi@hotmail.com', 'mona.talbi@hotmail.com',
                              'ana.alvez@gmail.com'],
                   'FROM' : ['Paris', 'Madrid', 'Casablanca', 'Berlin', 'Berlin'],
                   'TO' : ['Berlin', 'Dublin', 'Porto', 'Paris', 'Paris'],
                   'DATE' : ['12/01/2020', '13/01/2020', '27/01/2020', '27/01/2020', '12/01/2020']})

def isTwoway(data, EMAIL, FROM, TO, DATE):
    t = data.query('EMAIL == "{0}" & FROM == "{1}" & TO == "{2}" & DATE == "{3}"'.format(EMAIL, TO, FROM, DATE))
    return len(t) > 0

df['isTwoway'] = df.apply(lambda x: isTwoway(df, x.EMAIL, x.FROM, x.TO, x.DATE), axis=1)

标签: pythonpandas

解决方案


pd.merge您可以尝试对同一数据框进行左连接或内连接 ( ),并为left_onright_on参数指定不同的列,以便您可以设置 FROM=TO。之后,在 EMAIL 和 DATE 列中删除重复项以查找匹配的票对。

import pandas as pd

# make data
df = pd.DataFrame({'ID': [1,2,3,4,5], 'EMAIL': ['a@a.com', 'b@b.com', 'a@a.com', 'b@b.com', 'c@c.com'], 'FROM': ['Berlin', 'Paris', "Paris", 'Berlin', "Berlin"], 'TO': ["Paris", "Berlin", "Berlin", "Paris", "Paris"], 'DATE': ["01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020"]})

df
   ID    EMAIL    FROM      TO        DATE
0   1  a@a.com  Berlin   Paris  01/01/2020
1   2  b@b.com   Paris  Berlin  01/01/2020
2   3  a@a.com   Paris  Berlin  01/01/2020
3   4  b@b.com  Berlin   Paris  01/01/2020
4   5  c@c.com  Berlin   Paris  01/01/2020

# carry out a left merge with the indicator flag set to TRUE to see the result (could alternatively do an "inner" merge)
dfout = df.merge(df, how="left", left_on=["EMAIL", "FROM", "TO", "DATE"], right_on=["EMAIL", "TO", "FROM", "DATE"], indicator=True)

dfout
   ID_x    EMAIL  FROM_x    TO_x        DATE  ID_y  FROM_y    TO_y     _merge
0     1  a@a.com  Berlin   Paris  01/01/2020   3.0   Paris  Berlin       both
1     2  b@b.com   Paris  Berlin  01/01/2020   4.0  Berlin   Paris       both
2     3  a@a.com   Paris  Berlin  01/01/2020   1.0  Berlin   Paris       both
3     4  b@b.com  Berlin   Paris  01/01/2020   2.0   Paris  Berlin       both
4     5  c@c.com  Berlin   Paris  01/01/2020   NaN     NaN     NaN  left_only

# drop the duplicates to see the results
dfout.drop_duplicates(subset=["EMAIL", "DATE"])
   ID_x    EMAIL  FROM_x    TO_x        DATE  ID_y  FROM_y    TO_y     _merge
0     1  a@a.com  Berlin   Paris  01/01/2020   3.0   Paris  Berlin       both
1     2  b@b.com   Paris  Berlin  01/01/2020   4.0  Berlin   Paris       both
4     5  c@c.com  Berlin   Paris  01/01/2020   NaN     NaN     NaN  left_only

推荐阅读