python - Join multiple CSV files using python pandas
Problem description
I am trying to create a single CSV file from multiple CSV files using python pandas.
accreditation.csv:
"pid","accreditation_body","score"
"25799","TAAC","4.5"
"25796","TAAC","5.6"
"25798","DAAC","5.7"
ref_university.csv:
"id","pid","survey_year","end_year"
"1","25799","2018","2018"
"2","25797","2016","2018"
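I want to create a new table by reading the instructions in table_structure.csv. I want to join the two tables and rewrite accreditation.csv. The directive REFERENCES ref_university(id, survey_year) means: join with ref_university.csv by matching the pid column values, and insert the id and survey_year column values.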
table_structure.csv:
table_name,attribute_name,attribute_type,Description
,,,
accreditation,accreditation_body,varchar,
,grading,varchar,
,pid,int4,"REFERENCES ref_university(id, survey_year)"
,score,float8,
The modified CSV file should look like this.
New accreditation.csv:
"accreditation_body","grading","pid","id","survey_year","score"
"TAAC","","25799","1","2018","2018","4.5"
"TAAC","","25797","2","2016","2018","5.6"
"DAAC","","25798","","","","5.7"
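I can read the CSV in pandas:
import pandas as pd
df = pd.read_csv("accreditation.csv")
But what is the recommended way to read the REFERENCES directive and select the column values? If there is no value, the column should be blank. We cannot hard-code pid in the pandas function. We have to read table_structure.csv, check whether there is a reference, and then pull in the columns it mentions. The tables should not be fully merged; only the specified columns should be added.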
Solution
A dynamic solution is possible, but it is not that easy:
import pandas as pd

df = pd.read_csv("table_structure.csv")
# drop rows in which every value is NaN
df = df.dropna(how='all')
# fill the missing table names by forward filling
df['table_name'] = df['table_name'].ffill()
# keep the rows that have a Description and attach each table's full column list
df = (df.dropna(subset=['Description'])
        .join(df.groupby('table_name')['attribute_name'].apply(list)
                .rename('cols'), 'table_name'))
# extract the name of the referenced DataFrame and the new column names
df['df1'] = df['Description'].str.extract(r'REFERENCES\s*(.*)\s*\(', expand=False)
df['new_cols'] = df['Description'].str.extract(r'\(\s*(.*)\s*\)', expand=False)
df['new_cols'] = df['new_cols'].str.split(', ')
# remove the columns that are no longer needed
df = df.drop(['attribute_type', 'Description'], axis=1).set_index('table_name')
print(df)
              attribute_name                                       cols  \
table_name
accreditation            pid  [accreditation_body, grading, pid, score]

                          df1           new_cols
table_name
accreditation  ref_university  [id, survey_year]
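The two `str.extract` calls above can also be tried on a single string, which makes it easier to see what each part of the directive yields. The following is only a sketch: `parse_reference` and `REFERENCE_RE` are hypothetical names that are not part of the answer, and the pattern assumes the directive always has the form `REFERENCES table(col1, col2, ...)`.

```python
import re

# Hypothetical helper, not from the original answer: splits a
# "REFERENCES table(col1, col2)" directive into table name and columns.
REFERENCE_RE = re.compile(r"REFERENCES\s+(\w+)\s*\(\s*([^)]*?)\s*\)")


def parse_reference(description):
    """Return (table_name, [columns]), or None when no directive is present."""
    if not isinstance(description, str):
        return None  # NaN or otherwise empty Description cells
    match = REFERENCE_RE.search(description)
    if match is None:
        return None
    table, cols = match.groups()
    return table, [c.strip() for c in cols.split(",") if c.strip()]


print(parse_reference("REFERENCES ref_university(id, survey_year)"))
# ('ref_university', ['id', 'survey_year'])
```

Returning `None` for cells without a directive mirrors the `dropna(subset=['Description'])` step: attributes without a reference simply add no columns.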
# build a dictionary of DataFrames so that each one can be selected by name
data = {'accreditation': pd.read_csv("accreditation.csv"),
        'ref_university': pd.read_csv("ref_university.csv")}
# select the row for one table by its index label
v = df.loc['accreditation']
print(v)
attribute_name                                          pid
cols              [accreditation_body, grading, pid, score]
df1                                          ref_university
new_cols                                  [id, survey_year]
Name: accreditation, dtype: object
Select from the dictionary using the values in the Series v:
df = pd.merge(data[v.name],
              data[v['df1']][v['new_cols'] + [v['attribute_name']]],
              on=v['attribute_name'],
              how='left')
which translates to:
df = pd.merge(data['accreditation'],
              data['ref_university'][['id', 'survey_year'] + ['pid']],
              on='pid',
              how='left')
and returns:
print(df)
     pid accreditation_body  score   id  survey_year
0  25799               TAAC    4.5  1.0       2018.0
1  25796               TAAC    5.6  NaN          NaN
2  25798               DAAC    5.7  NaN          NaN
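The left join can be checked on the question's sample data without touching any files. The sketch below makes a few assumptions: the CSV contents are inlined through `io.StringIO` as a stand-in for the real files, and the header is spelled `survey_year` to match the REFERENCES directive. `validate` and `indicator` are optional `merge` parameters that are not used in the answer; here they guard against duplicate `pid` values in the referenced table and show which rows found a match.

```python
import io

import pandas as pd

# Inline copies of the sample files (assumed stand-ins for the CSVs on disk).
accreditation = pd.read_csv(io.StringIO(
    '"pid","accreditation_body","score"\n'
    '"25799","TAAC","4.5"\n'
    '"25796","TAAC","5.6"\n'
    '"25798","DAAC","5.7"\n'
))
ref_university = pd.read_csv(io.StringIO(
    '"id","pid","survey_year","end_year"\n'
    '"1","25799","2018","2018"\n'
    '"2","25797","2016","2018"\n'
))

# Restricting the right-hand frame to the key plus the referenced columns
# means only those columns are added; end_year never reaches the result.
merged = accreditation.merge(
    ref_university[["pid", "id", "survey_year"]],
    on="pid",
    how="left",               # keep every accreditation row
    validate="many_to_one",   # raise if pid is duplicated in ref_university
    indicator=True,           # adds a _merge column: both / left_only
)
print(merged)
```

Rows marked `left_only` in `_merge` are the ones for which `id` and `survey_year` stay empty.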
# add the columns listed in table_structure.csv that are still missing (here: grading)
df = df.reindex(columns=df.columns.union(v['cols']))
print(df)
  accreditation_body  grading   id    pid  score  survey_year
0               TAAC      NaN  1.0  25799    4.5       2018.0
1               TAAC      NaN  NaN  25796    5.6          NaN
2               DAAC      NaN  NaN  25798    5.7          NaN
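The `union` call sorts the column names alphabetically, and the unmatched rows turn `id` and `survey_year` into floats. If the file should instead follow the column order from the question, with blank cells for missing values, something like the following may work. It is a sketch under assumptions: the DataFrame is a hand-built stand-in for the merged result, and `cols`, `new_cols` and `key` stand for `v['cols']`, `v['new_cols']` and `v['attribute_name']`.

```python
import io

import pandas as pd

# Assumed stand-in for the merged result printed earlier.
df = pd.DataFrame({
    "pid": [25799, 25796, 25798],
    "accreditation_body": ["TAAC", "TAAC", "DAAC"],
    "score": [4.5, 5.6, 5.7],
    "id": [1.0, None, None],
    "survey_year": [2018.0, None, None],
})
cols = ["accreditation_body", "grading", "pid", "score"]  # v['cols']
new_cols = ["id", "survey_year"]                          # v['new_cols']
key = "pid"                                               # v['attribute_name']

# Place the referenced columns directly after the key column.
pos = cols.index(key) + 1
order = cols[:pos] + new_cols + cols[pos:]
out = df.reindex(columns=order)

# Nullable integers keep 1 and 2018 from being written as 1.0 and 2018.0.
out = out.astype({c: "Int64" for c in new_cols})

# Write to an in-memory buffer here; pass a file path to write to disk.
buffer = io.StringIO()
out.to_csv(buffer, index=False, na_rep="")
print(buffer.getvalue())
```

`na_rep=""` is what leaves the cells empty; the `Int64` cast is optional and only affects how the matched values are printed.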