Combining and expanding DataFrames based on IDs stored in columns

Problem description

I have three DataFrames, A, B, and C:

import numpy as np
import pandas as pd

A = pd.DataFrame({"id": [1,2], 
                "connected_to_B_id1":["A","B"],
                "connected_to_B_id2":["B","C"],
                "connected_to_B_id3":["C", np.nan],
                # entry can have multiple ids from B
                })

B = pd.DataFrame({"id": ["A","B","C"], 
                "connected_to_C_id1":[1,1,2],
                "connected_to_C_id2":[2,2,np.nan],
                # entry can have multiple ids from C
                })
C = pd.DataFrame({"id": [1,2], 
                "name":["a","b"],
                })


# Output should be D:
D = pd.DataFrame({"id_A": [1,1,1,1,1,2,2,2],
                  "id_B": ["A","A","B","B","C","B","B","C"],
                  "id_C": [1,2,1,2,2,1,2,2],
                  "name": ["a","b","a","b","b","a","b","b"]
                })

I want to use the IDs stored in each DataFrame's "connected_to_X" columns to build a single DataFrame that contains every relationship recorded across the three separate DataFrames.

What is the most elegant way to combine DataFrames A, B, and C into D? At the moment I am using dicts, lists, and for loops, which is messy and complicated.

D:

|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
|  0 | 1 | A | 1 | a |
|  1 | 1 | A | 2 | b |
|  2 | 1 | B | 1 | a |
|  3 | 1 | B | 2 | b |
|  4 | 1 | C | 2 | b |
|  5 | 2 | B | 1 | a |
|  6 | 2 | B | 2 | b |
|  7 | 2 | C | 2 | b |

Tags: python, pandas, dataframe

Solution


You only need to unpivot (melt) A and B; after that you can join the tables.
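To make the unpivot step concrete, here is a minimal sketch of what `melt` does to A on its own. The names `A_long` and `A_links` are only illustrative, and the `dropna` call is an optional extra that the chained answer does not use (there, the unmatched NaN rows fall out during the final join instead).

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "id": [1, 2],
    "connected_to_B_id1": ["A", "B"],
    "connected_to_B_id2": ["B", "C"],
    "connected_to_B_id3": ["C", np.nan],
})

# melt keeps "id" fixed and stacks every other column into two columns:
# "variable" (the old column name) and "value" (the B id stored there).
A_long = A.melt(id_vars="id")

# Optional: drop the padding rows where no B id was recorded.
A_links = A_long.dropna(subset=["value"])

print(A_long)
print(A_links)
```

After melting, each row is a single (A id, B id) link, which is the shape a merge on the B id needs.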

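(
    A.melt(id_vars='id')                    # unpivot A: one row per (A id, B id) link
    .merge(B.melt(id_vars='id'),            # unpivot B the same way, join on the B id
           left_on='value', right_on='id', how='left')
    .merge(C, left_on='value_y', right_on='id')  # inner join on the C id; NaN links drop out
    .drop(columns=['variable_x', 'variable_y', 'value_x'])
    .sort_values(['id_x', 'id_y'])
    .reset_index(drop=True)
    .reset_index()                          # expose the row number as an 'index' column
)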

   index  id_x id_y  value_y  id name
0      0     1    A      1.0   1    a
1      1     1    A      2.0   2    b
2      2     1    B      1.0   1    a
3      3     1    B      2.0   2    b
4      4     1    C      2.0   2    b
5      5     2    B      1.0   1    a
6      6     2    B      2.0   2    b
7      7     2    C      2.0   2    b
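The result above still carries the merge suffixes (`id_x`, `id_y`) and a float `value_y` column. If the goal is the exact column layout of D, one possible variant is sketched below. The helper `to_links` is a hypothetical name introduced here for illustration, not part of pandas or of the original answer, and the cast to `int64` assumes every remaining C id is a whole number.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan]})
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan]})
C = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})


def to_links(df, left, right):
    """Hypothetical helper: unpivot df into one row per recorded link."""
    return (
        df.melt(id_vars="id", value_name=right)  # stack the connected_to_* columns
        .dropna(subset=[right])                  # discard empty link slots
        .drop(columns="variable")                # the old column name is not needed
        .rename(columns={"id": left})
    )


a_b = to_links(A, "id_A", "id_B")
# NaN forced the C ids to float; once NaN rows are gone they can be ints again.
b_c = to_links(B, "id_B", "id_C").astype({"id_C": "int64"})

D = (
    a_b.merge(b_c, on="id_B")                            # A -> B -> C links
    .merge(C.rename(columns={"id": "id_C"}), on="id_C")  # attach the name from C
    .sort_values(["id_A", "id_B", "id_C"])
    .reset_index(drop=True)
)
print(D)
```

Naming the key columns before merging avoids the `_x`/`_y` suffixes altogether, so no columns have to be dropped afterwards. With the data given, the last row comes out as (2, C, 2, b), because B links "C" only to C id 2.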
