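python - Given two columns in the same pandas DataFrame, find similar elements across rows to create a new column
Problem description
My dataframe looks like this: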
df =
query subject HPSame
0 cat dog HPS_1
1 cat horse HPS_2
2 king queen HPS_3
3 queen people HPS_4
4 CAR VAN HPS_5
5 dog tiger HPS_6
6 CAR TRUCK HPS_7
7 horse deer HPS_8
8 CAR JEEP HPS_9
9 TRUCK LORRY HPS_10
10 VAN TRAIN HPS_11
11 people children HPS_12
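In df, each query is similar to its subject: cat is similar to dog, so that row is labeled HPS_1. Cat is also similar to horse, and dog is similar to tiger, so all of these rows should get the same match label, HPS_1. In other words, I am looking for similar elements, e.g. if a = b = c = d, and want to give them the same label in a new column. I have simplified my problem here. The real subject and query values are alphanumeric identifiers, e.g. WP_020314852.1 = WP_004217899.1 = WP_150395973.1 denotes the same type. The expected result is as follows.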
df =
query subject HPSame match
0 cat dog HPS_1 HPS_1
1 cat horse HPS_2 HPS_1
2 king queen HPS_3 HPS_3
3 queen people HPS_4 HPS_3
4 CAR VAN HPS_5 HPS_5
5 dog tiger HPS_6 HPS_1
6 CAR TRUCK HPS_7 HPS_5
7 horse deer HPS_8 HPS_1
8 CAR JEEP HPS_9 HPS_5
9 TRUCK LORRY HPS_10 HPS_5
10 VAN TRAIN HPS_11 HPS_5
11 people children HPS_12 HPS_3
I tried:
df['query_s'] = df['query'].shift(-1)
df['HPSame_s'] = df['HPSame'].shift(-1)
condition = [(df['query'] == df['query_s'])]
ifTrue = df['HPSame']
ifFalse = df['HPSame_s']
df['match'] = np.where(condition, ifTrue, ifFalse)
This throws ValueError: Length of values does not match length of index
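A likely cause of this error is that `condition` is a list wrapping a single boolean Series, so `np.where` treats it as a 2-D array with one row rather than a 1-D mask. The sketch below uses a hypothetical three-row frame to compare the result shapes with and without the list; the names `wrapped`, `mask`, and `flat` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame
df = pd.DataFrame({
    "query": ["cat", "cat", "king"],
    "HPSame": ["HPS_1", "HPS_2", "HPS_3"],
})
next_query = df["query"].shift(-1)
next_label = df["HPSame"].shift(-1)

# As in the attempt: a list holding one Series becomes a 2-D condition
wrapped = [(df["query"] == next_query)]
wrapped_shape = np.where(wrapped, df["HPSame"], next_label).shape

# Passing the boolean Series itself keeps the result 1-D
mask = df["query"] == next_query
flat = np.where(mask, df["HPSame"], next_label)
flat_shape = flat.shape

print(wrapped_shape, flat_shape)
```

Even with the mask passed directly, comparing each row only with the next one cannot capture transitive links such as cat, dog, and tiger, so this fix alone does not produce the expected `match` column.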
Solution
We can do this with the NetworkX library, using connected components from graph theory:
import pandas as pd
import networkx as nx
import numpy as np
# Copy your input dataframe from question
df = pd.read_clipboard()
# Create a graph network
G = nx.from_pandas_edgelist(df, 'query', 'subject')
# Use connected_components method to find groups
grps = dict(enumerate(nx.connected_components(G)))
# Match back to dataframe
df['match'] = [k for i in df['query'] for k, v in grps.items() if i in v]
df['match'] = df.groupby('match')['HPSame'].transform('first')
print(df)
Output:
query subject HPSame match
0 cat dog HPS_1 HPS_1
1 cat horse HPS_2 HPS_1
2 king queen HPS_3 HPS_3
3 queen people HPS_4 HPS_3
4 CAR VAN HPS_5 HPS_5
5 dog tiger HPS_6 HPS_1
6 CAR TRUCK HPS_7 HPS_5
7 horse deer HPS_8 HPS_1
8 CAR JEEP HPS_9 HPS_5
9 TRUCK LORRY HPS_10 HPS_5
10 VAN TRAIN HPS_11 HPS_5
11 people children HPS_12 HPS_3
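For readers who cannot install NetworkX, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python. This is an alternative illustration, not the method used in the answer above; the helper name `match_labels` is hypothetical, and the lists simply restate the question's data.

```python
def match_labels(queries, subjects, labels):
    """For each row, return the label of the first row in its connected group."""
    queries, subjects, labels = list(queries), list(subjects), list(labels)
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each (query, subject) pair is an edge: merge the two groups
    for q, s in zip(queries, subjects):
        root_q, root_s = find(q), find(s)
        if root_q != root_s:
            parent[root_s] = root_q

    # The first label seen for a group becomes that group's match label
    first_label = {}
    result = []
    for q, label in zip(queries, labels):
        root = find(q)
        first_label.setdefault(root, label)
        result.append(first_label[root])
    return result


queries = ["cat", "cat", "king", "queen", "CAR", "dog",
           "CAR", "horse", "CAR", "TRUCK", "VAN", "people"]
subjects = ["dog", "horse", "queen", "people", "VAN", "tiger",
            "TRUCK", "deer", "JEEP", "LORRY", "TRAIN", "children"]
labels = [f"HPS_{i}" for i in range(1, 13)]

matches = match_labels(queries, subjects, labels)
print(matches)
```

With a DataFrame, the call would look like `df['match'] = match_labels(df['query'], df['subject'], df['HPSame'])`.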
Plot of the graph network built from the dataframe:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,8))
nx.draw_networkx(G, node_color='y', ax=ax)
plt.show()