首页 > 解决方案 > 合并具有相同id的csv行并对列进行处理

问题描述

编辑1

@Serge 答案适用于我的测试数据,但在我的真实数据中出现错误而失败

开始.csv

id;index;level1;level2;level3;strate;value
;1;Recruitement;;Recruitement start date;Total;2021-07-01
;1;Recruitement;;Recruitement start date;Ivory Coast;2021-07-01
;1;Recruitement;;Recruitement start date;Madagascar;2021-07-01
;1;Recruitement;;Recruitement start date;South Africa;2021-07-01
;2;Recruitement;;Sex (Woman), %;Total;47.0
;2;Recruitement;;Sex (Woman), %;Ivory Coast;40.0
;2;Recruitement;;Sex (Woman), %;Madagascar;40.0
;2;Recruitement;;Sex (Woman), %;South Africa;60.0
5-Total;3;Follow up;;D7 visits: N performed;Total;4                # -> LINE TO MERGE WITH...
5-Ivory Coast;3;Follow up;;D7 visits: N performed;Ivory Coast;1
5-Madagascar;3;Follow up;;D7 visits: N performed;Madagascar;2
5-South Africa;3;Follow up;;D7 visits: N performed;South Africa;1
5-Total;4;Follow up;;D7 visits: N expected;Total;12                #.... THIS LINE
5-Ivory Coast;4;Follow up;;D7 visits: N expected;Ivory Coast;4
5-Madagascar;4;Follow up;;D7 visits: N expected;Madagascar;4
5-South Africa;4;Follow up;;D7 visits: N expected;South Africa;4
;6;Recruitement;;Age, median;Total;35.0
;6;Recruitement;;Age, median;Ivory Coast;35.0
;6;Recruitement;;Age, median;Madagascar;31.0
;6;Recruitement;;Age, median;South Africa;36.0
;7;Recruitement;;Age, IQR;Total;(26.5-48.5)
;7;Recruitement;;Age, IQR;Ivory Coast;(26.0-48.0)
;7;Recruitement;;Age, IQR;Madagascar;(26.0-49.0)
;7;Recruitement;;Age, IQR;South Africa;(28.0-40.0)
;8;Recruitement;;Randomized all, N;Total;15
;8;Recruitement;;Randomized all, N;Ivory Coast;5
;8;Recruitement;;Randomized all, N;Madagascar;5
;8;Recruitement;;Randomized all, N;South Africa;5
15-Total;9;Recruitement;;Included severity stage 1, N;Total;1
15-South Africa;9;Recruitement;;Included severity stage 1, N;South Africa;1
16-Total;10;Recruitement;;Included severity stage 2, N;Total;8
16-Ivory Coast;10;Recruitement;;Included severity stage 2, N;Ivory Coast;4
16-Madagascar;10;Recruitement;;Included severity stage 2, N;Madagascar;1
16-South Africa;10;Recruitement;;Included severity stage 2, N;South Africa;3
17-Total;11;Recruitement;;Included severity stage 3, N;Total;6
17-Ivory Coast;11;Recruitement;;Included severity stage 3, N;Ivory Coast;1
17-Madagascar;11;Recruitement;;Included severity stage 3, N;Madagascar;4
17-South Africa;11;Recruitement;;Included severity stage 3, N;South Africa;1
15-Total;12;Recruitement;;Included severity stage 1, %;Total;7.0
15-Ivory Coast;12;Recruitement;;Included severity stage 1, %;Ivory Coast;nan
15-Madagascar;12;Recruitement;;Included severity stage 1, %;Madagascar;nan
15-South Africa;12;Recruitement;;Included severity stage 1, %;South Africa;20.0
16-Total;13;Recruitement;;Included severity stage 2, %;Total;53.0
16-Ivory Coast;13;Recruitement;;Included severity stage 2, %;Ivory Coast;80.0
16-Madagascar;13;Recruitement;;Included severity stage 2, %;Madagascar;20.0
16-South Africa;13;Recruitement;;Included severity stage 2, %;South Africa;60.0
17-Total;14;Recruitement;;Included severity stage 3, %;Total;40.0
17-Ivory Coast;14;Recruitement;;Included severity stage 3, %;Ivory Coast;20.0
17-Madagascar;14;Recruitement;;Included severity stage 3, %;Madagascar;80.0
17-South Africa;14;Recruitement;;Included severity stage 3, %;South Africa;20.0

我想合并具有相同id的csv行并对行的列进行一些处理

开始.csv

index;level1;level2;level3;strate;value;id
3;Follow up;;D7 visits: N performed;Total;2;1
4;Follow up;;D7 visits: N expected;Total;6;1
3;Follow up;;D7 visits: N performed;Ivory Coast;1;2
4;Follow up;;D7 visits: N expected;Ivory Coast;4;2
3;Follow up;;D7 visits: N performed;Madagascar;1;3
4;Follow up;;D7 visits: N expected;Madagascar;2;3

我想像这样合并具有相同 id 的行:

结束.csv

index;level1;level2;level3;strate;value;id
3;Follow up;;D7 visits: N performed/expected %;Total;0.33;1
4;Follow up;;D7 visits: N performed/expected %;Ivory Coast;0.75;2
4;Follow up;;D7 visits: N performed/expected %;IMadagascar;0.50;3

我可能需要使用 groupby('id') 并为列处理值应用一个函数,但是......

df.replace('D7 visits: N expected','D7 visits: N performed/expected %').groupby("id",as_index=False,sort=False).last()

给出输出

index;level1;level2;level3;strate;value;id
3;Follow up;;D7 visits: N performed/expected %;Total;6;1
4;Follow up;;D7 visits: N performed/expected %;Ivory Coast;4;2
4;Follow up;;D7 visits: N performed/expected %;IMadagascar;2;3

标签: pythonpandas

解决方案


旋转数据框可能会有所帮助:

tmp = df.pivot(index='id', columns='level3', values='value')
tmp['value'] = tmp['D7 visits: N performed']/tmp['D7 visits: N expected']
tmp['level3'] = 'D7 visits: N performed/expected %'

resul = df[['index', 'level1', 'level2', 'strate', 'id']].drop_duplicates(
    'id').merge(tmp[['level3', 'value']], left_on='id',right_index=True).reindex(
        columns=['index', 'level1', 'level2', 'level3', 'strate', 'value', 'id'])

resul.to_csv('end.csv', sep=';')

应该生成预期的 csv 文件。


推荐阅读