Pandas - Get the pattern that matches a URL between two DataFrames

Problem description

I have two DataFrames of the following form:

import pandas as pd

d1 = {'Domain': ['amazon.com', 'apple.com', 'amazon.com','xyz.com'], 'Pattern': ['kindle','music','subscribe-and-save',''],'Other Important Info':['a','b','c','d']}
df1 = pd.DataFrame(d1)

d2 = {'Domain': ['google.com','google.com','amazon.com','amazon.com', 'youtube.com', 'amazon.com'], 'Url': ['https://google.com/kindle','https://google.com/','https://amazon.com/subscribe-and-save','https://amazon.com/abc','https://youtube.com/music','https:amazon.com/kindle']}
df2 = pd.DataFrame(d2)

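The main goal is to merge the two DataFrames on "Domain", keeping only the rows where "Pattern" occurs inside "Url".

So the result should be the following DataFrame: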

{'Domain':['amazon.com','amazon.com'],'Url':['https://amazon.com/subscribe-and-save','https:amazon.com/kindle'],'Other Important Info':['c','a']}

My current approach is:

def lookup_table(value, df):
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out

df2['Pattern'] = df2['Url'].apply(lambda x: lookup_table(x, df1[df1['Pattern']!='']))

merged = pd.merge(df2[df2['Pattern'].notnull()], df1[df1['Pattern']!=''],on=['Domain','Pattern'],how='left')

However, because of the for loop, the lookup_table function takes too long to run.

How can I make this faster? I am using Python 2 on Windows.

Tags: python, pandas, python-2.7

Solution


df1

       Domain             Pattern Other Important Info
0  amazon.com              kindle                    a
1   apple.com               music                    b
2  amazon.com  subscribe-and-save                    c
3     xyz.com                                        d

df2

        Domain                                    Url
0   google.com              https://google.com/kindle
1   google.com                    https://google.com/
2   amazon.com  https://amazon.com/subscribe-and-save
3   amazon.com                 https://amazon.com/abc
4  youtube.com              https://youtube.com/music
5   amazon.com                https:amazon.com/kindle

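The main goal is to merge the two DataFrames on "Domain", keeping only the rows where "Pattern" occurs inside "Url". Merge on "Domain" first, then filter row by row: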

df = df1.merge(df2, on='Domain')
df.loc[df.apply(lambda x: x.Pattern in x.Url, axis=1)]

Output

       Domain             Pattern Other Important Info  \
2  amazon.com              kindle                    a   
3  amazon.com  subscribe-and-save                    c   

                                     Url  
2                https:amazon.com/kindle  
3  https://amazon.com/subscribe-and-save  
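
As a variation on the merge-then-filter idea above, here is a minimal sketch that is not part of the original answer. It makes two assumptions: empty patterns should never count as a match (an empty string is a substring of every URL), and only the three columns from the expected result are wanted. The names patterns, merged, mask and result are illustrative. A list comprehension over the two columns tends to be cheaper than apply(axis=1), though no timing was measured here.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Domain': ['amazon.com', 'apple.com', 'amazon.com', 'xyz.com'],
    'Pattern': ['kindle', 'music', 'subscribe-and-save', ''],
    'Other Important Info': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Domain': ['google.com', 'google.com', 'amazon.com',
               'amazon.com', 'youtube.com', 'amazon.com'],
    'Url': ['https://google.com/kindle', 'https://google.com/',
            'https://amazon.com/subscribe-and-save', 'https://amazon.com/abc',
            'https://youtube.com/music', 'https:amazon.com/kindle'],
})

# Assumption: drop empty patterns up front, because '' is contained
# in every string and would otherwise match every URL of its domain.
patterns = df1[df1['Pattern'] != '']

# The default inner merge pairs each pattern only with URLs of the same domain.
merged = patterns.merge(df2, on='Domain')

# Plain substring test for each (Pattern, Url) pair.
mask = [p in u for p, u in zip(merged['Pattern'], merged['Url'])]

# Keep matching rows and only the columns from the expected result.
result = merged.loc[mask, ['Domain', 'Url', 'Other Important Info']]
result = result.reset_index(drop=True)
print(result)
```

With the sample data, only the two amazon.com rows whose URLs contain kindle and subscribe-and-save remain.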
