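python - Joining multiple spreadsheets with Python and appending data to a row when duplicates exist
Problem description
I am trying to join certain fields from several .CSV files on a single key field. Where the key is duplicated, however, I want to append the information to the existing field rather than create another row.
Does anyone know a way to do this?
Here is a sample of my current code. It selects the specified fields and joins the data, but it produces duplicate entries: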
import pandas as pd

# Read each CSV, keeping every column as a string
DF1 = pd.read_csv('facilities.csv', header=0, dtype=object)
DF2 = pd.read_csv('permits.csv', header=0, dtype=object)
DF3 = pd.read_csv('programs.csv', header=0, dtype=object)
# Select only the necessary columns from the CSVs
DF1_reduc = DF1[['ID', 'FACILITY_TYPE_CODE', 'FACILITY_NAME', 'LOCATION_ADDRESS']]
DF2_reduc = DF2[['ID', 'ACTIVITY_ID', 'PERMIT_NAME', 'PERMIT_STATUS_CODE']]
DF3_reduc = DF3[['ID', 'PROG_CODE']]
# Join all tables together on ID
joined_tables = [DF1_reduc, DF2_reduc, DF3_reduc]
joined_tables = [table.set_index('ID') for table in joined_tables]
joined_tables = joined_tables[0].join(joined_tables[1:])
Solution
df = df1.join([df2, df3])
df = (
    df.groupby(['ID', 'FACILITY_TYPE_CODE', 'FACILITY_NAME', 'LOCATION_ADDRESS'])
      .agg(lambda s: ', '.join(s.fillna('').unique().astype('str')))
)
# Drop index for concise output.
print(df.reset_index(drop=True))
#   ACTIVITY_ID        PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0  1111, 1234  PERMIT 1, permit1               A, C       ABC
# 1        1111           PERMIT 2                  B       ABC
# 2  2000, 1234           PERMIT 3               A, C
# 3        1111            permit4                  C
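To see what the aggregation lambda does to a single group, here is a minimal sketch on hand-made Series. The helper name `join_unique` and the two toy Series are illustrative assumptions, not part of the original answer; the values mirror the sample data.

```python
import numpy as np
import pandas as pd

def join_unique(s: pd.Series) -> str:
    """Join the distinct values of one group into a comma-separated string."""
    # fillna('') replaces missing values with empty strings,
    # unique() keeps distinct values in first-seen order,
    # astype('str') converts numbers to strings so they can be joined.
    return ', '.join(s.fillna('').unique().astype('str'))

# Hypothetical group contents: repeated activity IDs and an all-missing column
activity = pd.Series([1111, 1234, 1111])
prog = pd.Series([np.nan, np.nan], dtype=object)

print(join_unique(activity))
print(repr(join_unique(prog)))
```

Because missing values become empty strings, a group with no program code ends up as an empty string, which is why PROG_CODE is blank for the last two rows above.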
Or, if you would rather group the values into sets, it is even simpler:
df = df1.join([df2, df3])
df = (
    df.groupby(['ID', 'FACILITY_TYPE_CODE', 'FACILITY_NAME', 'LOCATION_ADDRESS'])
      .agg(set)
)
# Drop index for concise output.
print(df.reset_index(drop=True))
#     ACTIVITY_ID          PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0  {1234, 1111}  {PERMIT 1, permit1}             {A, C}     {ABC}
# 1        {1111}           {PERMIT 2}                {B}     {ABC}
# 2  {2000, 1234}           {PERMIT 3}             {A, C}     {nan}
# 3        {1111}            {permit4}                {C}     {nan}
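The `{nan}` entries appear because `set` keeps missing values. If empty sets are preferable, one possible variation (a sketch, not part of the original answer) is to drop missing values before building each set. The helper `to_set` and the small `toy` frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def to_set(s: pd.Series) -> set:
    # dropna() removes missing values before the set is built,
    # so an all-missing group yields an empty set instead of {nan}.
    return set(s.dropna())

# Hypothetical already-joined data with one row per permit
toy = pd.DataFrame({
    "ID": ["04R1", "04R1", "05R7", "05R7"],
    "PERMIT_STATUS_CODE": ["A", "C", "A", "C"],
    "PROG_CODE": ["ABC", "ABC", np.nan, np.nan],
})

result = toy.groupby("ID").agg(to_set)
print(result)
```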
Further reading: https://pandas.pydata.org/docs/user_guide/groupby.html
Sample data:
import io
import pandas as pd
facilities = io.StringIO("""
ID,FACILITY_TYPE_CODE,FACILITY_NAME,LOCATION_ADDRESS
04R1,GAB,Facility 1,HIGHWAY 1 E
05R2,GAB,Facility 2,1200 MOUNTAIN ROAD
05R7,VOR,Facility 3,500 MARSH PASS
0K09,FOP,Facility 4,67 SEA LANE
""")
permits = io.StringIO("""
ID,ACTIVITY_ID,PERMIT_NAME,PERMIT_STATUS_CODE
04R1,1111,PERMIT 1,A
04R1,1234,permit1,C
05R2,1111,PERMIT 2,B
05R7,2000,PERMIT 3,A
05R7,1234,PERMIT 3,C
0K09,1111,permit4,C
""")
programs = io.StringIO("""
ID,PROG_CODE
04R1,ABC
05R2,ABC
05R7,
0K09,
""")
df1 = pd.read_csv(facilities, index_col='ID')
df2 = pd.read_csv(permits, index_col='ID')
df3 = pd.read_csv(programs, index_col='ID')
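As an alternative sketch that differs from the answer above: the one-to-many table can be collapsed to one row per ID first and joined afterwards, so the facility columns do not need to serve as group keys. The trimmed-down data and the names `fac`, `per`, `per_one_row` and `join_unique` are assumptions made for this illustration.

```python
import io
import pandas as pd

# Trimmed-down versions of the sample data, for illustration only
facilities = io.StringIO("""ID,FACILITY_NAME
04R1,Facility 1
05R7,Facility 3
""")
permits = io.StringIO("""ID,ACTIVITY_ID,PERMIT_NAME
04R1,1111,PERMIT 1
04R1,1234,permit1
05R7,2000,PERMIT 3
05R7,1234,PERMIT 3
""")

# Read everything as strings so the values can be joined directly
fac = pd.read_csv(facilities, index_col="ID", dtype=str)
per = pd.read_csv(permits, index_col="ID", dtype=str)

def join_unique(s):
    # Distinct non-missing values, in first-seen order, comma-separated
    return ", ".join(s.dropna().unique())

# Collapse permits to a single row per ID, then join onto facilities
per_one_row = per.groupby(level="ID").agg(join_unique)
combined = fac.join(per_one_row)
print(combined)
```

Aggregating before the join keeps the facilities table at one row per ID throughout, which may be easier to reason about when several one-to-many tables are involved.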