Merging DataFrames in a dictionary of DataFrames

Problem description

I have a dictionary of DataFrames, dict, for example:

{
'table_1':              name             color             type
                        Banana           Yellow            Fruit,
'another_table_1':      city             state             country
                        Atlanta          Georgia           United States,
'and_another_table_1':  firstname        middlename        lastname
                        John             Patrick           Snow,
'table_2':              name             color             type
                        Red              Apple             Fruit,
'another_table_2':      city             state             country
                        Arlington        Virginia          United States,
'and_another_table_2':  firstname        middlename        lastname
                        Alex             Justin            Brown,
'table_3':              name             color             type
                        Lettuce          Green             Vegetable,
'another_table_3':      city             state             country
                        Dallas           Texas             United States,
'and_another_table_3':  firstname        middlename        lastname
                        Michael          Alex              Smith             }

I want to merge these DataFrames according to their names, so that I end up with only 3 DataFrames:

table

name        color       type
Banana      Yellow      Fruit
Red         Apple       Fruit
Lettuce     Green       Vegetable

another_table

city        state          country
Atlanta     Georgia        United States
Arlington   Virginia       United States
Dallas      Texas          United States

and_another_table

firstname        middlename        lastname
John             Patrick           Snow
Alex             Justin            Brown
Michael          Alex              Smith

From my initial research, it seems this should be possible in Python:

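  1. Group the DataFrames in the dictionary by key name, using .split, a dictionary comprehension, and itertools.groupby
  2. Build a dictionary of dictionaries from the grouped results
  3. Loop over these dictionaries and combine the DataFrames with the pandas.concat function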

I don't have much experience with Python, and I'm a bit lost on how to actually write the code.

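I have already looked at the posts "How do I group similar items in a list?" and "Merge dataframes in a dictionary", but they were not that helpful, because in my case the DataFrame names vary in length.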

Also, I don't want to hard-code any DataFrame names, since there are more than 1,000 of them.

Tags: python, pandas, dictionary, group-by, nested

Solution


Here is one approach.

Given this dictionary of DataFrames:

import pandas as pd

dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':['Fruit']}),
      'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':['Fruit']}),
      'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
      'another_table_2':pd.DataFrame({'city':['Arlington'],'state':['Virginia'], 'Country':['United States']}),
      'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastname':['Snow']}),
      'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastname':['Brown']}),
     }

# Group name = key with its trailing "_<number>" removed
tables = {key.rsplit('_', 1)[0] for key in dd}

# Compare group names exactly; startswith() would also match longer names sharing a prefix
dict_of_dfs = {name: pd.concat([df for key, df in dd.items()
                                if key.rsplit('_', 1)[0] == name])
               for name in tables}

This produces a new dictionary of combined tables:

dict_of_dfs['table']

#      Name   color   type
# 0  Banana  Yellow  Fruit
# 0   Apple     Red  Fruit

dict_of_dfs['another_table']

#         city     state        Country
# 0    Atlanta   Georgia  United States
# 0  Arlington  Virginia  United States

dict_of_dfs['and_another_table']

#   firstname middlename lastname
# 0      John    Patrick     Snow
# 0      Alex     Justin    Brown
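
The combined tables shown above keep each source frame's own index, which is why every row is labelled 0. If a fresh 0..n-1 index is preferred, pd.concat accepts ignore_index=True. A minimal sketch, using two hypothetical single-row frames in place of the real tables:

```python
import pandas as pd

# Hypothetical stand-ins for 'table_1' and 'table_2'
frames = [
    pd.DataFrame({'Name': ['Banana'], 'color': ['Yellow'], 'type': ['Fruit']}),
    pd.DataFrame({'Name': ['Apple'], 'color': ['Red'], 'type': ['Fruit']}),
]

# Default behaviour: every frame keeps its own index, so labels repeat
kept = pd.concat(frames)

# ignore_index=True drops the old labels and renumbers the rows from 0
renumbered = pd.concat(frames, ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Whether to renumber is a judgment call: repeated labels are harmless for display, but they make label-based lookups such as .loc[0] return several rows.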

Another approach uses defaultdict from collections and builds a list of the combined DataFrames:

from collections import defaultdict
import pandas as pd

dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':['Fruit']}),
      'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':['Fruit']}),
      'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
      'another_table_2':pd.DataFrame({'city':['Arlington'],'state':['Virginia'], 'Country':['United States']}),
      'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastname':['Snow']}),
      'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastname':['Brown']}),
     }

# Group the frames in a single pass; the group name is the key minus its trailing "_<number>"
d = defaultdict(list)
for key, df in dd.items():
    d[key.rsplit('_', 1)[0]].append(df)

# Groups appear in the order in which their first key occurs in dd
l_of_dfs = [pd.concat(frames) for frames in d.values()]
print(l_of_dfs[0])
print('\n')
print(l_of_dfs[1])
print('\n')
print(l_of_dfs[2])

Output:

     Name   color   type
0  Banana  Yellow  Fruit
0   Apple     Red  Fruit


        city     state        Country
0    Atlanta   Georgia  United States
0  Arlington  Virginia  United States


  firstname middlename lastname
0      John    Patrick     Snow
0      Alex     Justin    Brown
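
The question's own outline (split the key, group with itertools.groupby, then pandas.concat) can also be followed fairly directly. The sketch below is one possible reading of that outline, not part of the answers above; the helper name prefix and the miniature input are made up for illustration. Note that groupby only merges adjacent items, so the keys are sorted by prefix first.

```python
from itertools import groupby

import pandas as pd


def prefix(key):
    """Hypothetical helper: the key without its trailing '_<suffix>' part."""
    return key.rsplit('_', 1)[0]


# Hypothetical miniature input with interleaved groups
dd = {
    'table_1': pd.DataFrame({'Name': ['Banana']}),
    'another_table_1': pd.DataFrame({'city': ['Atlanta']}),
    'table_2': pd.DataFrame({'Name': ['Apple']}),
    'another_table_2': pd.DataFrame({'city': ['Arlington']}),
}

# Sort by prefix so that keys of the same group sit next to each other;
# sorted() is stable, so keys within a group keep their original order
sorted_keys = sorted(dd, key=prefix)

# One concatenated frame per prefix
grouped = {
    name: pd.concat([dd[k] for k in keys], ignore_index=True)
    for name, keys in groupby(sorted_keys, key=prefix)
}

print(grouped['table'])
print(grouped['another_table'])
```

Because the group name is derived from each key with rsplit, no table name has to be hard-coded, and names of different lengths are handled the same way.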
