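pandas - Pandas: Merging on columns with mixed data types
Problem description
I have a primary and a secondary DataFrame. Where the combination of ID variables matches, I want to replace values in the primary DataFrame with values from the secondary DataFrame. One of the ID variables has mixed data types in the primary DataFrame. I was able to solve this, but my solution seems overly complicated, and I'm hoping someone here can help me find a more elegant approach.
Note that rows with ID2 = 'Missing' or indicator = 1 never need to be replaced.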
import numpy as np
import pandas as pd

# ID2 deliberately mixes strings ('0-100', 'Missing', '-4.0') and floats (-1.0, -2.0, -3.0)
primary_df = pd.DataFrame(data=
{'ID1': ['XXX111','XXX111','XXX111','XXX111','YYY222','YYY222','ZZZ333','ZZZ333','ZZZ333'],
'ID2': ['0-100', -1.0, -2.0, -3.0, '0-10', -1.0,'300-400', 'Missing', '-4.0'],
'value' : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'indicator': [1,np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})
# Replacement values, keyed by (ID1, ID2) with integer ID2
secondary_df = pd.DataFrame(data=
{'ID1': ['XXX111','ZZZ333'],
'ID2': [-3,-4],
'value': [0.04, 0.09]})
# Expected result: 'value' replaced in the two matching rows
desired_df = pd.DataFrame(data=
{'ID1': ['XXX111','XXX111','XXX111','XXX111','YYY222','YYY222','ZZZ333','ZZZ333','ZZZ333'],
'ID2': ['0-100', -1, -2, -3, '0-10', -1,'300-400', 'Missing', -4],
'value' : [0.1, 0.2, 0.3, 0.04, 0.5, 0.6, 0.7, 0.8, 0.09],
'indicator': [1,np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan]})
In [6]: primary_df
Out[6]:
      ID1      ID2  value  indicator
0  XXX111    0-100    0.1        1.0
1  XXX111       -1    0.2        NaN
2  XXX111       -2    0.3        NaN
3  XXX111       -3    0.4        NaN
4  YYY222     0-10    0.5        1.0
5  YYY222       -1    0.6        NaN
6  ZZZ333  300-400    0.7        1.0
7  ZZZ333  Missing    0.8        NaN
8  ZZZ333     -4.0    0.9        NaN
In [7]: secondary_df
Out[7]:
      ID1  ID2  value
0  XXX111   -3   0.04
1  ZZZ333   -4   0.09
In [8]: desired_df
Out[8]:
      ID1      ID2  value  indicator
0  XXX111    0-100   0.10        1.0
1  XXX111       -1   0.20        NaN
2  XXX111       -2   0.30        NaN
3  XXX111       -3   0.04        NaN
4  YYY222     0-10   0.50        1.0
5  YYY222       -1   0.60        NaN
6  ZZZ333  300-400   0.70        1.0
7  ZZZ333  Missing   0.80        NaN
8  ZZZ333       -4   0.09        NaN
Here is my very inelegant solution:
pdfIndctr = primary_df[primary_df.indicator == 1].copy() # rows with indicator = 1; these never need to be replaced
pdfMissing = primary_df[primary_df['ID2'] == 'Missing'].copy() # rows with ID2 = 'Missing'; these never need to be replaced
pdfRest = primary_df[(primary_df['ID2'] != 'Missing') & (primary_df.indicator.isnull())].copy() # the rest of the rows
pdfRest['ID2'] = pdfRest.ID2.apply(lambda x: int(float(x))) # change the data type of ID2 for merging with secondary_df
pdfRest_fixed = pd.merge(pdfRest, secondary_df, on=['ID1','ID2'], how='inner', suffixes=('drop','')) # merge to fix the rows to be replaced
pdfRest_same = pd.merge(pdfRest, secondary_df, on=['ID1','ID2'], how='left', suffixes=('','drop'), indicator=True) # merge again to identify rows not to be replaced
pdfRest_same = pdfRest_same[pdfRest_same._merge == 'left_only'].copy() # drop the rows in the second merge that were also found in secondary_df
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to put everything back together
desired_df = pd.concat([pdfIndctr, pdfMissing, pdfRest_fixed, pdfRest_same], sort=True)
desired_df.drop(columns=['_merge','valuedrop'], inplace=True) # drop unnecessary columns
Solution
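One possible simplification, offered as a sketch rather than a definitive answer: build a numeric view of ID2 with `pd.to_numeric(..., errors="coerce")`, do a single left merge against the secondary frame, and overwrite `value` only where a match was found. This assumes each (ID1, ID2) pair occurs at most once in `secondary_df` and that non-numeric ID2 entries (ranges, 'Missing') should never match. The names `id2_num`, `eligible`, `lookup`, `new_value` and `result` are illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

primary_df = pd.DataFrame({
    "ID1": ["XXX111", "XXX111", "XXX111", "XXX111", "YYY222",
            "YYY222", "ZZZ333", "ZZZ333", "ZZZ333"],
    "ID2": ["0-100", -1.0, -2.0, -3.0, "0-10", -1.0, "300-400", "Missing", "-4.0"],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "indicator": [1, np.nan, np.nan, np.nan, 1, np.nan, 1, np.nan, np.nan],
})
secondary_df = pd.DataFrame({
    "ID1": ["XXX111", "ZZZ333"],
    "ID2": [-3, -4],
    "value": [0.04, 0.09],
})

# Numeric view of ID2: entries such as '0-100' and 'Missing' become NaN,
# while -3.0 and the string '-4.0' both become floats.
id2_num = pd.to_numeric(primary_df["ID2"], errors="coerce")

# Rows that may be replaced: no indicator set and a numeric ID2.
eligible = primary_df["indicator"].isna() & id2_num.notna()

# Left merge on (ID1, numeric ID2). Assumption: keys in secondary_df are
# unique, which validate="many_to_one" checks at merge time.
keys = primary_df[["ID1"]].assign(ID2=id2_num)
lookup = (secondary_df.astype({"ID2": "float64"})
          .rename(columns={"value": "new_value"}))
merged = keys.merge(lookup, on=["ID1", "ID2"], how="left", validate="many_to_one")

# A left merge keeps the left row order, so positions line up with primary_df.
new_value = merged["new_value"].to_numpy()
replace = eligible.to_numpy() & ~np.isnan(new_value)

result = primary_df.copy()
result.loc[replace, "value"] = new_value[replace]

# Optional: store numeric ID2 entries as ints, as in desired_df.
result["ID2"] = [int(n) if pd.notna(n) else raw
                 for raw, n in zip(primary_df["ID2"], id2_num)]

print(result)
```

The merge runs once and the original row order is preserved, so there is no need to split the frame into pieces and reassemble it. If `secondary_df` could contain duplicate keys, the `validate` argument would raise an error instead of silently duplicating rows.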