首页 > 解决方案 > 按组将行插入数据帧,并且条目来自另一个 dataframe_complex 匹配

问题描述

我希望为每个组在名为“df_recorded”的数据框中插入一些条目,并从另一个名为“df_missed”的数据框中搜索该条目。

import pandas as pd

df_recorded = pd.DataFrame({
    'id': ['2008 11', '2008 11', '2008 11', '2008 07', '2008 07', '2008 12', '2008 12', '2008 12'],
    'info': ['recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', ],
    'score': [98, 68, 79, 75, 66, 62, 60, 60],
    'date' : ['2010-12-10', '2010-10-01', '2010-09-12', '2010-12-10', '2010-11-01', '2010-12-07', '2010-11-10', '2010-09-12']
})

df_missed = pd.DataFrame({
    'id': ['2008 11', '2008 07', '2008 12'],
    'missed_score': [62, 72, 80],
    'missed_date': ['2010-08-01', '2010-10-20', '2010-07-23']
})


        id      info  score        date
0  2008 11  recorded     98  2010-12-10
1  2008 11  recorded     68  2010-10-01
2  2008 11  recorded     79  2010-09-12
3  2008 07  recorded     75  2010-12-10
4  2008 07  recorded     66  2010-11-01
5  2008 12  recorded     62  2010-12-07
6  2008 12  recorded     60  2010-11-10
7  2008 12  recorded     60  2010-09-12

df_missed
        id  missed_score missed_date
0  2008 11            62  2010-08-01
1  2008 07            72  2010-10-20
2  2008 12            80  2010-07-23

我想在'df_recorded'中为每个组添加一行,例如在'info'列中添加相同的'id = 2008 11'和一个名为'missed'的新条目,然后添加分数和日期通过搜索 df_missed 表,结果应如下所示:

Target result:
         id      info  score        date
0   2008 11  recorded     98  2010-12-10
1   2008 11  recorded     68  2010-10-01
2   2008 11  recorded     79  2010-09-12
3   2008 11    missed     62  2010-08-01 # new record
4   2008 07  recorded     75  2010-12-10
5   2008 07  recorded     66  2010-11-01
6   2008 07    missed     72  2010-10-20 # new record
7   2008 12  recorded     62  2010-12-07
8   2008 12  recorded     60  2010-11-10
9   2008 12  recorded     60  2010-09-12
10  2008 12    missed     80  2010-07-23 # new record

我尝试使用循环进行编码,但速度非常慢且效率低下。因此,如果您有任何改进的想法,请提供帮助。非常感谢。

标签: pythonpandasdataframeinsertmatch

解决方案


IIUC,您可以简单地重命名缺少的 df 和 中的列concat

df_missed.columns = ["id", "score", "date"]

df = pd.concat([df_recorded,df_missed], ignore_index=True, sort=False).sort_values("id", ascending=False)
df.loc[df["info"].isnull(),"info"] = "missing"
print (df)

         id      info  score        date
5   2008 12  recorded     62  2010-12-07
6   2008 12  recorded     60  2010-11-10
7   2008 12  recorded     60  2010-09-12
10  2008 12   missing     80  2010-07-23
0   2008 11  recorded     98  2010-12-10
1   2008 11  recorded     68  2010-10-01
2   2008 11  recorded     79  2010-09-12
8   2008 11   missing     62  2010-08-01
3   2008 07  recorded     75  2010-12-10
4   2008 07  recorded     66  2010-11-01
9   2008 07   missing     72  2010-10-20

推荐阅读