Joining multiple spreadsheets with Python and appending data to a row when duplicates exist

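Problem description

I am trying to join selected fields from several .CSV files on a key field. Where the key has duplicates, however, I want to append the information to the existing field rather than create another row.

Example DF data and current/desired results

Does anyone know a way to do this?

Here is a sample of my current code. It selects the specified fields and joins the data, but it produces duplicate entries: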

import pandas as pd

# dtype=object keeps every value as read, with no numeric conversion
DF1 = pd.read_csv('facilities.csv', header=0, dtype=object)
DF2 = pd.read_csv('permits.csv', header=0, dtype=object)
DF3 = pd.read_csv('programs.csv', header=0, dtype=object)


# Select only necessary columns from CSVs
DF1_reduc = DF1[['ID','FACILITY_TYPE_CODE','FACILITY_NAME','LOCATION_ADDRESS']]
DF2_reduc = DF2[['ID','ACTIVITY_ID','PERMIT_NAME','PERMIT_STATUS_CODE']]
DF3_reduc = DF3[['ID','PROG_CODE']]


# Join all tables together on ID
joined_tables = [DF1_reduc, DF2_reduc, DF3_reduc]
joined_tables = [table.set_index('ID') for table in joined_tables]
joined_tables = joined_tables[0].join(joined_tables[1:])

Tags: python, pandas, join, merge, append

Solution


Group by the columns that hold a single value per facility, then use agg() to combine the remaining rows:

df = df1.join([df2, df3])

df = df.groupby(['ID'
                ,'FACILITY_TYPE_CODE'
                ,'FACILITY_NAME'
                ,'LOCATION_ADDRESS']) \
       .agg(lambda s: ', '.join(s.fillna('')
                                 .unique()
                                 .astype('str')))

# Drop index for concise output.
print(df.reset_index(drop=True))
#   ACTIVITY_ID        PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0  1111, 1234  PERMIT 1, permit1               A, C       ABC
# 1        1111           PERMIT 2                  B       ABC
# 2  2000, 1234           PERMIT 3               A, C          
# 3        1111            permit4                  C          
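
One caveat with fillna('') is that a group mixing real and missing values keeps the empty string as one of its "unique" values. The following is a minimal sketch of a variant that drops missing values first. The miniature data and the helper name combine_unique are made up for illustration, and it uses merge on the ID column instead of an index join:

```python
import io
import pandas as pd

# Hypothetical miniature data, shaped like the permits/programs tables.
permits = pd.read_csv(io.StringIO(
    "ID,ACTIVITY_ID,PERMIT_STATUS_CODE\n"
    "04R1,1111,A\n"
    "04R1,1234,C\n"
    "05R7,2000,A\n"
), dtype=str)
programs = pd.read_csv(io.StringIO(
    "ID,PROG_CODE\n"
    "04R1,ABC\n"
    "04R1,\n"
    "05R7,\n"
), dtype=str)

def combine_unique(s):
    """Join the distinct non-missing values of one group into a string."""
    return ', '.join(s.dropna().astype(str).unique())

# The left merge repeats rows for every matching pair, which is where
# the duplicate entries come from.
merged = permits.merge(programs, on='ID', how='left')

# Collapse back to one row per ID.
combined = merged.groupby('ID').agg(combine_unique)
print(combined)
```

In this sketch 04R1 has one real and one missing PROG_CODE. With the fillna('') version above, that group would come out as 'ABC, ' with a trailing separator; dropping the missing values first avoids that.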

Alternatively, if you want the values grouped into sets, it is even simpler:

df = df1.join([df2, df3])

df = df.groupby(['ID'
                ,'FACILITY_TYPE_CODE'
                ,'FACILITY_NAME'
                ,'LOCATION_ADDRESS']) \
       .agg(set)

# Drop index for concise output.
print(df.reset_index(drop=True))
#     ACTIVITY_ID          PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0  {1234, 1111}  {PERMIT 1, permit1}             {A, C}     {ABC}
# 1        {1111}           {PERMIT 2}                {B}     {ABC}
# 2  {2000, 1234}           {PERMIT 3}             {A, C}     {nan}
# 3        {1111}            {permit4}                {C}     {nan}
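
Sets have no defined order and, as the output above shows, they keep nan as a member. If a stable order matters, one possible variation is to aggregate into sorted lists with missing values removed. This is only a sketch: the small frame and the helper name sorted_unique are hypothetical stand-ins for the joined data.

```python
import pandas as pd

# Hypothetical joined frame: one row per (facility, permit) pair.
joined = pd.DataFrame({
    'ID': ['04R1', '04R1', '05R7', '05R7'],
    'PERMIT_STATUS_CODE': ['C', 'A', 'C', 'A'],
    'PROG_CODE': ['ABC', 'ABC', None, None],
})

def sorted_unique(s):
    """Return the distinct non-missing values of one group, sorted."""
    return sorted(set(s.dropna()))

# One row per ID; each cell holds a sorted list of distinct values.
collapsed = joined.groupby('ID').agg(sorted_unique)
print(collapsed)
```

A group with no non-missing values ends up as an empty list rather than {nan}, which may be easier to test for downstream.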

Further reading: https://pandas.pydata.org/docs/user_guide/groupby.html


Sample data:

import io
import pandas as pd

facilities = io.StringIO("""
ID,FACILITY_TYPE_CODE,FACILITY_NAME,LOCATION_ADDRESS
04R1,GAB,Facility 1,HIGHWAY 1 E
05R2,GAB,Facility 2,1200 MOUNTAIN ROAD
05R7,VOR,Facility 3,500 MARSH PASS
0K09,FOP,Facility 4,67 SEA LANE
""")

permits = io.StringIO("""
ID,ACTIVITY_ID,PERMIT_NAME,PERMIT_STATUS_CODE
04R1,1111,PERMIT 1,A
04R1,1234,permit1,C
05R2,1111,PERMIT 2,B
05R7,2000,PERMIT 3,A
05R7,1234,PERMIT 3,C
0K09,1111,permit4,C
""")

programs = io.StringIO("""
ID,PROG_CODE
04R1,ABC
05R2,ABC
05R7,
0K09,
""")

df1 = pd.read_csv(facilities, index_col='ID')
df2 = pd.read_csv(permits, index_col='ID')
df3 = pd.read_csv(programs, index_col='ID')
