Pandas: merging on a column with mixed data types

Problem description

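I have a primary and a secondary DataFrame. Wherever the combination of ID variables matches, I want to replace the value in the primary DataFrame with the value from the secondary DataFrame. One of the ID variables has mixed data types in the primary DataFrame. I was able to solve this, but my solution seems overly complicated, and I am hoping someone here can help me find a more elegant approach.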

Note that rows where ID2 = 'Missing' or 'indicator' = 1 never need to be replaced.

import numpy as np
import pandas as pd

primary_df = pd.DataFrame(data=
        {'ID1': ['XXX111','XXX111','XXX111','XXX111','YYY222','YYY222','ZZZ333','ZZZ333','ZZZ333'],
         'ID2': ['0-100', -1.0, -2.0, -3.0, '0-10', -1.0,'300-400', 'Missing', '-4.0'],
         'value' : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
         'indicator': [1,np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})

secondary_df = pd.DataFrame(data=
        {'ID1': list(['XXX111','ZZZ333']),
         'ID2': list([-3,-4]),
         'value': list([0.04, 0.09])})

desired_df = pd.DataFrame(data=
        {'ID1': ['XXX111','XXX111','XXX111','XXX111','YYY222','YYY222','ZZZ333','ZZZ333','ZZZ333'],
         'ID2': ['0-100', -1, -2, -3, '0-10', -1,'300-400', 'Missing', -4],
         'value' : [0.1, 0.2, 0.3, 0.04, 0.5, 0.6, 0.7, 0.8, 0.09],
         'indicator': [1,np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})

In [6]: primary_df
Out[6]: 
      ID1      ID2  value  indicator
0  XXX111    0-100    0.1        1.0
1  XXX111       -1    0.2        NaN
2  XXX111       -2    0.3        NaN
3  XXX111       -3    0.4        NaN
4  YYY222     0-10    0.5        1.0
5  YYY222       -1    0.6        NaN
6  ZZZ333  300-400    0.7        1.0
7  ZZZ333  Missing    0.8        NaN
8  ZZZ333     -4.0    0.9        NaN

In [7]: secondary_df
Out[7]: 
      ID1  ID2  value
0  XXX111   -3   0.04
1  ZZZ333   -4   0.09

In [8]: desired_df
Out[8]: 
      ID1      ID2  value  indicator
0  XXX111    0-100   0.10        1.0
1  XXX111       -1   0.20        NaN
2  XXX111       -2   0.30        NaN
3  XXX111       -3   0.04        NaN
4  YYY222     0-10   0.50        1.0
5  YYY222       -1   0.60        NaN
6  ZZZ333  300-400   0.70        1.0
7  ZZZ333  Missing   0.80        NaN
8  ZZZ333       -4   0.09        NaN

Here is my rather inelegant solution:

pdfIndctr  = primary_df[primary_df['indicator'] == 1].copy() # rows with indicator = 1, never need to be replaced
pdfMissing = primary_df[primary_df['ID2'] == 'Missing'].copy() # rows with ID2 = 'Missing', never need to be replaced
pdfRest    = primary_df[(primary_df['ID2'] != 'Missing') & (primary_df['indicator'].isnull())].copy() # the rest of the rows
pdfRest['ID2'] = pdfRest['ID2'].apply(lambda x: int(float(x))) # cast ID2 to int so it can be merged with secondary_df

pdfRest_fixed = pd.merge(pdfRest, secondary_df, on=['ID1','ID2'], how='inner', suffixes=('drop','')) # rows to be replaced; 'value' comes from secondary_df
pdfRest_same  = pd.merge(pdfRest, secondary_df, on=['ID1','ID2'], how='left',  suffixes=('','drop'), indicator=True) # merge again to identify rows not to be replaced
pdfRest_same  = pdfRest_same[pdfRest_same['_merge'] == 'left_only'].copy() # keep only the rows that have no match in secondary_df

# DataFrame.append was removed in pandas 2.0, so the pieces are stacked with pd.concat instead
desired_df = pd.concat([pdfIndctr, pdfMissing, pdfRest_fixed, pdfRest_same], sort=True) # put everything back together
desired_df = desired_df.drop(columns=['_merge','valuedrop']) # drop the helper columns

Tags: pandas, merge, missing-data, complex-data-types

Solution
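
One possible way to simplify this is sketched below. It is only an illustration under a few assumptions, not necessarily the approach the original answer took. The idea is to build a numeric copy of ID2 with `pd.to_numeric(..., errors='coerce')`, so that labels such as '0-100' and 'Missing' become NaN, and then do a single left merge on that key. The helper name `replace_from_secondary` and the temporary columns `_key` and `_new` are made up for this example. The sketch also assumes that each (ID1, ID2) pair occurs at most once in secondary_df, which `validate='many_to_one'` checks.

```python
import numpy as np
import pandas as pd

primary_df = pd.DataFrame({
    'ID1': ['XXX111', 'XXX111', 'XXX111', 'XXX111', 'YYY222', 'YYY222', 'ZZZ333', 'ZZZ333', 'ZZZ333'],
    'ID2': ['0-100', -1.0, -2.0, -3.0, '0-10', -1.0, '300-400', 'Missing', '-4.0'],
    'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'indicator': [1, np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan],
})
secondary_df = pd.DataFrame({
    'ID1': ['XXX111', 'ZZZ333'],
    'ID2': [-3, -4],
    'value': [0.04, 0.09],
})


def replace_from_secondary(primary, secondary):
    """Hypothetical helper: overwrite primary['value'] where (ID1, ID2) matches secondary."""
    # Numeric view of ID2; anything that is not a number ('0-100', 'Missing') becomes NaN.
    key = pd.to_numeric(primary['ID2'], errors='coerce')
    # Only rows with a numeric ID2 and indicator != 1 may be replaced.
    eligible = key.notna() & primary['indicator'].ne(1)

    # Give the secondary frame a float key so both merge keys share one dtype.
    right = (secondary
             .rename(columns={'ID2': '_key', 'value': '_new'})
             .assign(_key=lambda d: d['_key'].astype(float)))

    # Ineligible rows get a NaN key, so they find no match in `right`.
    # A left merge keeps the rows of `primary` in their original order.
    merged = primary.assign(_key=key.where(eligible)).merge(
        right, on=['ID1', '_key'], how='left', validate='many_to_one')

    out = primary.copy()
    new = merged['_new'].to_numpy()
    # Take the secondary value where one was found, otherwise keep the original.
    out['value'] = np.where(np.isnan(new), primary['value'].to_numpy(), new)
    # Normalise eligible ID2 entries to int, as in desired_df; leave the rest untouched.
    out['ID2'] = [int(k) if ok else orig
                  for orig, k, ok in zip(primary['ID2'], key, eligible)]
    return out


result = replace_from_secondary(primary_df, secondary_df)
print(result)
```

Working positionally through `to_numpy()` avoids relying on index alignment between `primary` and the merged frame, so the sketch should also behave when primary_df does not have a default RangeIndex.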

