Pandas: given two columns, find similar elements across rows to create a new column

Problem description

My dataframe looks like this:

df = 
     query    subject     HPSame
0    cat      dog         HPS_1
1    cat      horse       HPS_2
2    king     queen       HPS_3
3    queen    people      HPS_4
4    CAR      VAN         HPS_5
5    dog      tiger       HPS_6
6    CAR      TRUCK       HPS_7
7    horse    deer        HPS_8
8    CAR      JEEP        HPS_9
9    TRUCK    LORRY       HPS_10
10   VAN      TRAIN       HPS_11
11   people   children    HPS_12

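In df, each query is similar to its subject. For example, cat is similar to dog, so that row is labeled HPS_1. Cat is also similar to horse, and dog is similar to tiger, so all of those rows should share the same match label, HPS_1. In other words, I am looking for chains of similar elements (if a = b = c = d) and want to give them the same label in a new column. I have simplified the problem here: the real subject and query values are alphanumeric identifiers, e.g. WP_020314852.1 = WP_004217899.1 = WP_150395973.1 means all three are of the same type. The expected result is shown below.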

df = 

     query    subject     HPSame   match
0    cat      dog         HPS_1    HPS_1
1    cat      horse       HPS_2    HPS_1
2    king     queen       HPS_3    HPS_3
3    queen    people      HPS_4    HPS_3
4    CAR      VAN         HPS_5    HPS_5
5    dog      tiger       HPS_6    HPS_1
6    CAR      TRUCK       HPS_7    HPS_5
7    horse    deer        HPS_8    HPS_1
8    CAR      JEEP        HPS_9    HPS_5
9    TRUCK    LORRY       HPS_10   HPS_5
10   VAN      TRAIN       HPS_11   HPS_5
11   people   children    HPS_12   HPS_3  

I tried:

df['query_s'] = df['query'].shift(-1)
df['HPSame_s'] = df['HPSame'].shift(-1)
condition = [(df['query'] == df['query_s'])]
ifTrue = df['HPSame']
ifFalse = df['HPSame_s']
df['match'] = np.where(condition, ifTrue, ifFalse)

This throws ValueError: Length of values does not match length of index
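A likely cause of the ValueError, shown as a minimal sketch on a three-row toy frame (the names same_as_next, wrapped and flat are made up for illustration): wrapping the boolean Series in a list gives np.where a condition of shape (1, n), so the result is two-dimensional and cannot be assigned to a column of length n. Removing the brackets fixes the error, although comparing each row only with the next one still would not follow chains such as cat -> dog -> tiger.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "query": ["cat", "cat", "king"],
    "HPSame": ["HPS_1", "HPS_2", "HPS_3"],
})

same_as_next = df["query"] == df["query"].shift(-1)

# Condition wrapped in a list, as in the attempt above: result is 2-D
wrapped = np.where([same_as_next], df["HPSame"], df["HPSame"].shift(-1))

# Condition passed as a plain boolean Series: result is 1-D
flat = np.where(same_as_next, df["HPSame"], df["HPSame"].shift(-1))

print(wrapped.shape)  # (1, 3)
print(flat.shape)     # (3,)
```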

Tags: python, pandas, dataframe

Solution


We can do this with the NetworkX library, using connected components from graph theory:

import pandas as pd
import networkx as nx
import numpy as np

# Copy your input dataframe from question
df = pd.read_clipboard()

# Create a graph network
G = nx.from_pandas_edgelist(df, 'query', 'subject')

# Use connected_components method to find groups
grps = dict(enumerate(nx.connected_components(G)))

# Match back to dataframe
df['match'] = [k for i in df['query'] for k, v in grps.items() if i in v]
df['match'] = df.groupby('match')['HPSame'].transform('first')

print(df)

Output:

     query   subject  HPSame  match
0      cat       dog   HPS_1  HPS_1
1      cat     horse   HPS_2  HPS_1
2     king     queen   HPS_3  HPS_3
3    queen    people   HPS_4  HPS_3
4      CAR       VAN   HPS_5  HPS_5
5      dog     tiger   HPS_6  HPS_1
6      CAR     TRUCK   HPS_7  HPS_5
7    horse      deer   HPS_8  HPS_1
8      CAR      JEEP   HPS_9  HPS_5
9    TRUCK     LORRY  HPS_10  HPS_5
10     VAN     TRAIN  HPS_11  HPS_5
11  people  children  HPS_12  HPS_3
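For readers who cannot use pd.read_clipboard(), here is a self-contained sketch of the same idea. It builds the sample frame inline and replaces the nested list comprehension with a node-to-group dictionary (node_to_group and group are names chosen here, not from the answer), which avoids scanning every component for every row.

```python
import networkx as nx
import pandas as pd

# Sample data from the question, built inline instead of from the clipboard
df = pd.DataFrame({
    "query": ["cat", "cat", "king", "queen", "CAR", "dog",
              "CAR", "horse", "CAR", "TRUCK", "VAN", "people"],
    "subject": ["dog", "horse", "queen", "people", "VAN", "tiger",
                "TRUCK", "deer", "JEEP", "LORRY", "TRAIN", "children"],
    "HPSame": [f"HPS_{i}" for i in range(1, 13)],
})

# Each row is an undirected edge between query and subject
G = nx.from_pandas_edgelist(df, "query", "subject")

# Map every node to the index of the connected component it belongs to
node_to_group = {
    node: group_id
    for group_id, nodes in enumerate(nx.connected_components(G))
    for node in nodes
}

# Label each row with its component, then take the first HPSame per component
group = df["query"].map(node_to_group).rename("group")
df["match"] = df.groupby(group)["HPSame"].transform("first")

print(df)
```

Because query and subject of a row are always in the same component, looking up only the query column is enough to find the row's group.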

Plot of the graph network built from the dataframe:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,8))
nx.draw_networkx(G, node_color='y', ax=ax)
plt.show()
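If installing NetworkX is not an option, the same grouping can be approximated with a small union-find (disjoint-set) structure. This is a different technique from the answer above, sketched here under the assumption that the labels are hashable values; component_labels is a hypothetical helper, not a pandas or NetworkX function.

```python
import pandas as pd


def component_labels(pairs):
    """Return a dict mapping each element to a representative of its group."""
    parent = {}

    def find(x):
        # Register unseen elements as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    return {node: find(node) for node in parent}


# Small subset of the question's data, for illustration
df = pd.DataFrame({
    "query": ["cat", "cat", "king", "dog"],
    "subject": ["dog", "horse", "queen", "tiger"],
    "HPSame": ["HPS_1", "HPS_2", "HPS_3", "HPS_4"],
})

labels = component_labels(zip(df["query"], df["subject"]))
roots = df["query"].map(labels).rename("root")
df["match"] = df.groupby(roots)["HPSame"].transform("first")
print(df)
```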
