Creating new columns by joining a table to itself multiple times

Problem description

I have a pandas DataFrame containing a list of members of an extended family.

import pandas as pd

data = {'child':['Joe','Anna','Anna','Steffani','Bob','Rea','Dani','Dani','Selma','John','Kevin'],
        'parents':['Steffani','Bob','Steffani','Dani','Selma','Anna','Selma','John','Kevin','-','Robert'],
       }
df = pd.DataFrame(data)

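From this DataFrame, I need to build a new table by adding several columns on the right that show how the records are related. The values in the right-hand columns are each person's elders, and each column represents one level of the relationship. As a diagram, it might look like this: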

child --> parents --> grandparents --> parents of grandparents --> grandparents of grandparents --> etc.

So the expected output DataFrame would look like this:

    child       parents     A           B           C           D (etc)
---------------------------------------------------------------------------------
0   Joe         Steffani    Dani        Selma       Kevin       <If still possible>
1   Joe         Steffani    Dani        John        -
2   Anna        Bob         Selma       Kevin       Robert
3   Anna        Steffani    Dani        Selma       Kevin
4   Anna        Steffani    Dani        John        -
5   Steffani    Dani        Selma       Kevin       Robert
6   Steffani    Dani        John        -           -
7   Bob         Selma       Kevin       Robert      -
8   Rea         Anna        Bob         Selma       Kevin
9   Rea         Anna        Steffani    Dani        Selma
10  Rea         Anna        Steffani    Dani        John
11  Dani        Selma       Kevin       Robert      -
12  Dani        John        -           -           -
13  Selma       Kevin       Robert      -           -
14  John        -           -           -           -
15  Kevin       Robert      -           -           -

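Currently, I build the new table manually with pandas.merge. But I have to repeat this many times, until the last column has no elder relationship left to follow from the column before it. For example:

Step 1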

df2 = pd.merge(df, df, left_on='parents', right_on='child', how='left').fillna('-')
df2 = df2[['child_x','parents_x','parents_y']]
df2.columns = ['child','parents','A']

Step 2

df3 = pd.merge(df2, df, left_on='A', right_on='child', how='left').fillna('-')
df3 = df3[['child_x','parents_x','A','parents_y']]
df3.columns = ['child','parents','A','B']

Step 3

df4 = pd.merge(df3, df, left_on='B', right_on='child', how='left').fillna('-')
df4 = df4[['child_x','parents_x','A','B','parents_y']]
df4.columns = ['child','parents','A','B','C']

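Step 4

If the values in column C still have elder relationships, write similar code to add a sixth column, D.

Question:

Since my DataFrame is large (more than 10K data points), how can I solve this without writing the code step by step? I don't know how many steps it will take to build the final table.

Thanks in advance for your help.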

Tags: python, dataframe, join, merge

Solution


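Consider a chained merge with reduce, using the suffixes argument of merge to handle the duplicate column names, and then dropping the intermediate columns: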

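from functools import reduce

def proc_build(x, y):
    # x always has an even number of columns (2, 4, 6, ...), so n gives a
    # distinct left suffix per merge. Reusing a single suffix would create
    # duplicate column names, which newer pandas versions reject.
    n = x.shape[1] // 2
    temp = (pd.merge(x, y, left_on='parents', right_on='child',
                     how='left', suffixes=(f'_{n}', ''))
              .fillna('-'))

    return temp

final_df = (reduce(proc_build, [df, df, df, df])
               # set_axis returns a new frame by default; its inplace
               # argument is no longer accepted by pandas 2.x
               .set_axis(['child', 'parents',
                          'child1', 'A',
                          'child2', 'B',
                          'child3', 'C'], axis='columns')
               .reindex(['child', 'parents'] + list('ABC'), axis='columns')
           )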

print(final_df)

#        child   parents         A       B       C
# 0        Joe  Steffani      Dani   Selma   Kevin
# 1        Joe  Steffani      Dani    John       -
# 2       Anna       Bob     Selma   Kevin  Robert
# 3       Anna  Steffani      Dani   Selma   Kevin
# 4       Anna  Steffani      Dani    John       -
# 5   Steffani      Dani     Selma   Kevin  Robert
# 6   Steffani      Dani      John       -       -
# 7        Bob     Selma     Kevin  Robert       -
# 8        Rea      Anna       Bob   Selma   Kevin
# 9        Rea      Anna  Steffani    Dani   Selma
# 10       Rea      Anna  Steffani    Dani    John
# 11      Dani     Selma     Kevin  Robert       -
# 12      Dani      John         -       -       -
# 13     Selma     Kevin    Robert       -       -
# 14      John         -         -       -       -
# 15     Kevin    Robert         -       -       -
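
To see why set_axis above lists two names per merge, it may help to look at a single self-merge in isolation. The following is a minimal sketch on a two-row frame (the small frame and the '_' left suffix are chosen only for illustration): because left_on and right_on name different columns, both 'child' and 'parents' overlap, so the left copies get the suffix and the right copies keep their names.

```python
import pandas as pd

# Tiny illustrative frame: Joe -> Steffani -> Dani
small = pd.DataFrame({'child': ['Joe', 'Steffani'],
                      'parents': ['Steffani', 'Dani']})

# Join each row's parent to the row where that parent appears as a child
merged = pd.merge(small, small, left_on='parents', right_on='child',
                  how='left', suffixes=('_', ''))

print(merged.columns.tolist())
# ['child_', 'parents_', 'child', 'parents']
```

The unsuffixed 'parents' column on the right always holds the newest generation, which is why the next merge in the chain can keep using left_on='parents'.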

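To extend the result by another column, such as D, add another df to the iterable argument of reduce, along with the additional list items in set_axis and reindex, specifically ['child4', 'D'] and list('ABCD'). While there are ways to make these items dynamic, reduce can get expensive, so it is better handled somewhat declaratively.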

final_df = (reduce(proc_build, [df] * 5)
               .set_axis(['child', 'parents',
                          'child1', 'A',
                          'child2', 'B',
                          'child3', 'C',
                          'child4', 'D'], axis='columns')
               .reindex(['child', 'parents'] + list('ABCD'), axis='columns')
           )

print(final_df)

#        child   parents         A       B       C       D
# 0        Joe  Steffani      Dani   Selma   Kevin  Robert
# 1        Joe  Steffani      Dani    John       -       -
# 2       Anna       Bob     Selma   Kevin  Robert       -
# 3       Anna  Steffani      Dani   Selma   Kevin  Robert
# 4       Anna  Steffani      Dani    John       -       -
# 5   Steffani      Dani     Selma   Kevin  Robert       -
# 6   Steffani      Dani      John       -       -       -
# 7        Bob     Selma     Kevin  Robert       -       -
# 8        Rea      Anna       Bob   Selma   Kevin  Robert
# 9        Rea      Anna  Steffani    Dani   Selma   Kevin
# 10       Rea      Anna  Steffani    Dani    John       -
# 11      Dani     Selma     Kevin  Robert       -       -
# 12      Dani      John         -       -       -       -
# 13     Selma     Kevin    Robert       -       -       -
# 14      John         -         -       -       -       -
# 15     Kevin    Robert         -       -       -       -
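
If the number of generations is not known in advance, one possible way to make the merge count dynamic is a plain loop that stops once a new column contains only the placeholder. This is a rough sketch rather than part of the answer above: the function name expand_ancestors, the temporary '_key' column, and the max_levels guard (to avoid looping forever if the data contains a cycle) are all assumptions introduced for illustration, and it assumes the input has no existing columns named 'A', 'B', and so on.

```python
import pandas as pd

def expand_ancestors(df, child_col="child", parent_col="parents",
                     missing="-", max_levels=50):
    """Left-merge df onto itself until a new generation adds no one."""
    lookup = df[[child_col, parent_col]]
    result = lookup.copy()
    last = parent_col  # column holding the oldest generation found so far
    for level in range(max_levels):
        # Labels A..Z, then gen_26, gen_27, ... (the naming is arbitrary)
        new_col = chr(ord("A") + level) if level < 26 else f"gen_{level}"
        step = result.merge(
            lookup.rename(columns={child_col: "_key", parent_col: new_col}),
            left_on=last, right_on="_key", how="left",
        ).drop(columns="_key")
        step[new_col] = step[new_col].fillna(missing)
        if (step[new_col] == missing).all():
            break  # nobody in the last column has a recorded parent
        result, last = step, new_col
    return result

data = {'child': ['Joe', 'Anna', 'Anna', 'Steffani', 'Bob', 'Rea',
                  'Dani', 'Dani', 'Selma', 'John', 'Kevin'],
        'parents': ['Steffani', 'Bob', 'Steffani', 'Dani', 'Selma', 'Anna',
                    'Selma', 'John', 'Kevin', '-', 'Robert']}
df = pd.DataFrame(data)

final_df = expand_ancestors(df)
print(final_df.columns.tolist())
# ['child', 'parents', 'A', 'B', 'C', 'D', 'E']
```

On the sample data the loop goes one level past D, because the chain Rea, Anna, Steffani, Dani, Selma, Kevin, Robert is seven names long. Each pass still performs a full merge, so the cost concern mentioned above applies here as well.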
