Concatenating embeddings based on a key column in Python

Problem description

I have two DataFrames:

>>> df1

 key       
 a, b      
 c        
 a, d, c   


>>> df2

 key       embeddings (dtype=float32) 
 a         array([-1.1132643 ,  0.793635  ,  0.8664889])
 a         array([-1.1132643 ,  0.793635  ,  0.8664889])
 b         array([-0.19276126,  -0.48233205,  0.17549737])
 c         array([0.2080252 ,  0.01567003, 0.0717131])
 d         array([4.74671781e,  6.70781136, -1.19117641])

I want to concatenate the embeddings from df2 according to the keys in df1. The desired df1 output is:

>>> df1

 key       embeddings
 a, b      array([-1.1132643 ,  0.793635  ,  0.8664889]), array([-0.19276126,  -0.48233205,  0.17549737])
 c         array([0.2080252 ,  0.01567003, 0.0717131])
 a, d, c   array([-1.1132643 ,  0.793635  ,  0.8664889]), array([4.74671781e,  6.70781136, -1.19117641]), array([0.2080252 ,  0.01567003, 0.0717131])

Any suggestions on which approach I should use? Much appreciated!

Tags: python

Solution


Set up a minimal example dataset:

import numpy as np
import pandas as pd
from tabulate import tabulate

# create mockup dataset
data = [
    np.array([-1.1132643, 0.793635, 0.8664889]),
    np.array([-1.1132643, 0.793635, 0.8664889]),
    np.array([-0.19276126, -0.48233205, 0.17549737]),
    np.array([0.2080252, 0.01567003, 0.0717131]),
    np.array([4.74671781, 6.70781136, -1.19117641])
]
keys = ['a', 'a', 'b', 'c', 'd']

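# define merging keys
# note: ('c') is just the string 'c', not a one-element tuple;
# it works below only because every key is a single character
key_pairs = [('a', 'b'), ('c'), ('a', 'd', 'c')]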
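One detail in the mock data is worth checking before reusing it with real keys: parentheses alone do not make a tuple in Python, so `('c')` is the string `'c'`. The short sketch below illustrates the difference; the multi-character key `'ab'` is hypothetical and only there to show where the string form would start to misbehave.

```python
# Parentheses without a trailing comma do not create a tuple.
as_string = ('c')
as_tuple = ('c',)

print(type(as_string).__name__)  # str
print(type(as_tuple).__name__)   # tuple

# With a hypothetical multi-character key, iterating over the string
# form yields single characters instead of the key itself.
chars_from_string = [k for k in ('ab')]
keys_from_tuple = [k for k in ('ab',)]

print(chars_from_string)  # ['a', 'b']
print(keys_from_tuple)    # ['ab']
```

Writing single-key groups as `('c',)` avoids this, at the cost of the key being displayed as `('c',)` rather than `c`.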

Plain Python solution:

# build a lookup dictionary, keeping the first embedding seen for each key
# (despite its name, df is a plain dict here, not a DataFrame)
df = {}
for key, value in zip(keys, data):
    if key not in df:
        df[key] = value

# merge by keys: collect the embeddings of every key in each group
merged_dict = {}
for key_pair in key_pairs:
    for key in key_pair:
        merged_dict.setdefault(key_pair, []).append(df[key])

Convert to a DataFrame so the result prints nicely:

df = pd.DataFrame([merged_dict.keys(), merged_dict.values()], index=["key", "embeddings"]).transpose()
print(tabulate(df, headers=df.columns, tablefmt="psql"))

Output:

+----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+
|    | key             | embeddings                                                                                                                                 |
|----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | ('a', 'b')      | [array([-1.1132643,  0.793635 ,  0.8664889]), array([-0.19276126, -0.48233205,  0.17549737])]                                              |
|  1 | c               | [array([0.2080252 , 0.01567003, 0.0717131 ])]                                                                                              |
|  2 | ('a', 'd', 'c') | [array([-1.1132643,  0.793635 ,  0.8664889]), array([ 4.74671781,  6.70781136, -1.19117641]), array([0.2080252 , 0.01567003, 0.0717131 ])] |
+----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------+
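The output above keeps each row as a list of separate arrays. If "concatenate" is meant literally, that is, one flat vector per row, a sketch along the following lines might work. It assumes all embeddings are 1-D, reuses the values from the mock dataset, and the names `embeddings`, `key_groups`, and `concatenated` are illustrative only.

```python
import numpy as np

# One embedding per key (values taken from the mock dataset above).
embeddings = {
    "a": np.array([-1.1132643, 0.793635, 0.8664889]),
    "b": np.array([-0.19276126, -0.48233205, 0.17549737]),
    "c": np.array([0.2080252, 0.01567003, 0.0717131]),
    "d": np.array([4.74671781, 6.70781136, -1.19117641]),
}

# Every group is a real tuple, including the single-key one.
key_groups = [("a", "b"), ("c",), ("a", "d", "c")]

# Join the embeddings of each group end to end into a single 1-D array.
concatenated = {
    group: np.concatenate([embeddings[key] for key in group])
    for group in key_groups
}

for group, vector in concatenated.items():
    print(group, vector.shape)
```

The resulting vectors have different lengths (3 values per key), so they still cannot be stacked into one rectangular array without padding.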

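A final suggestion:
Your dataset is not flat, so you have two options:
1 - Keep using the dictionary structure in your project.
2 - Flatten the dataset, move it into a DataFrame, and use pandas.
Good luck.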
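For the second option, here is a minimal sketch that stays in pandas from start to finish. It assumes df1 stores its keys as comma-separated strings (as the question's printout suggests), that duplicate keys in df2 carry identical embeddings so the first one can be kept, and it reuses the values from the mock dataset. The name `lookup` is illustrative.

```python
import numpy as np
import pandas as pd

# df1: comma-separated key strings, as shown in the question.
df1 = pd.DataFrame({"key": ["a, b", "c", "a, d, c"]})

# df2: one embedding per row; key 'a' appears twice.
df2 = pd.DataFrame({
    "key": ["a", "a", "b", "c", "d"],
    "embeddings": [
        np.array([-1.1132643, 0.793635, 0.8664889], dtype=np.float32),
        np.array([-1.1132643, 0.793635, 0.8664889], dtype=np.float32),
        np.array([-0.19276126, -0.48233205, 0.17549737], dtype=np.float32),
        np.array([0.2080252, 0.01567003, 0.0717131], dtype=np.float32),
        np.array([4.74671781, 6.70781136, -1.19117641], dtype=np.float32),
    ],
})

# Keep the first embedding per key and build a key -> array lookup.
unique = df2.drop_duplicates(subset="key")
lookup = dict(zip(unique["key"], unique["embeddings"]))

# Split each key string, strip whitespace, and collect the embeddings.
df1["embeddings"] = df1["key"].str.split(",").apply(
    lambda keys: [lookup[k.strip()] for k in keys]
)

print(df1)
```

A key in df1 that is missing from df2 raises a `KeyError` here; using `lookup.get(k.strip())` instead would leave `None` in its place.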

