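python - How to remove rows from a Pandas DataFrame if there is a match across a set of columns in another DataFrame?

Problem description

I have a set of people in a DataFrame, and I need the list of people who do not appear in the master dataset. For now I am checking the first name and last name.

data_to_check_dataset is the input data that needs checking. It contains many columns, but for now I only need to check first_name and last_name.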

	first_name	last_name	...
0	James	Apple	...
1	Alice	Test	...
...	...	...	...
10000	Paul	Test	...

Sometimes a data field can be completely blank, in which case it is read in as NaN values.

	first_name	last_name	...
0	James Comp	NaN	...
1	Paul Ltd	NaN	...
...	...	...	...
10000	Paul Other	NaN	...

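The DataFrame I am checking against is current_people_dataset. It contains many columns, and I have renamed its name columns to first_name and last_name. For some reason its empty values are blank rather than NaN; I think this is because ...

	first_name	last_name	...
0	F A	l_A	...
1	B		...
...	...	...	...
900000	Paul	Smith	...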

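data_to_check_dataset is always smaller than current_people_dataset. The column order is not fixed and can change depending on where the data is loaded from.

So far I have been trying to adapt the code from here.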

new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))

When the columns are compared, this raises ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat
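The error can be reproduced on a small scale. The sketch below uses made-up toy frames (the values are hypothetical, not the real data) and assumes that the all-blank last_name column was inferred as float64 while the other frame holds text in that column. It prints the dtypes of both frames and then attempts the same merge inside a try block so the message can be inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: last_name is missing in every row,
# so pandas infers a float64 column made up only of NaN.
data_to_check_dataset = pd.DataFrame(
    {"first_name": ["James Comp", "Paul Ltd"], "last_name": [np.nan, np.nan]}
)

# Here the missing last name is an empty string, so the column holds text.
current_people_dataset = pd.DataFrame(
    {"first_name": ["F A", "B"], "last_name": ["l_A", ""]}
)

# Comparing the dtypes side by side shows which key column disagrees.
print(data_to_check_dataset.dtypes)
print(current_people_dataset.dtypes)

# Same call as in the question: with no `on=`, pandas joins on all shared columns.
try:
    pd.merge(data_to_check_dataset, current_people_dataset, indicator=True, how="outer")
    merge_error = None
except ValueError as exc:
    merge_error = str(exc)

print(merge_error)
```

Checking .dtypes on both frames before merging is usually the quickest way to see which key column needs converting.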

Tags: python, pandas

Solution

That is exactly what the error is saying: the key columns have different dtypes in the two DataFrames. One way to fix it is to typecast first_name and last_name in both DataFrames to string using astype():

data_to_check_dataset[['first_name','last_name']]=data_to_check_dataset[['first_name','last_name']].astype(str)
current_people_dataset[['first_name','last_name']]=current_people_dataset[['first_name','last_name']].astype(str)
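To see that the cast removes the mismatch, here is a minimal sketch on hypothetical toy frames (the values and the key_cols name are illustrative assumptions). It records the dtype of last_name before and after the cast and then runs the merge on the two key columns. How a missing value is represented after astype(str) may depend on the pandas version, so the sketch only looks at dtypes and the row count.

```python
import numpy as np
import pandas as pd

key_cols = ["first_name", "last_name"]  # the two columns named in the question

# Hypothetical toy frames standing in for the real datasets.
data_to_check_dataset = pd.DataFrame(
    {"first_name": ["James", "Paul Ltd"], "last_name": [np.nan, np.nan]}
)
current_people_dataset = pd.DataFrame(
    {"first_name": ["James", "B"], "last_name": ["Apple", ""]}
)

# float64 here, because every value in the column is NaN.
dtype_before = data_to_check_dataset["last_name"].dtype

# Cast the key columns of both frames to string, as in the answer.
data_to_check_dataset[key_cols] = data_to_check_dataset[key_cols].astype(str)
current_people_dataset[key_cols] = current_people_dataset[key_cols].astype(str)

dtype_after = data_to_check_dataset["last_name"].dtype

# With matching key dtypes the merge no longer raises.
merged = pd.merge(
    data_to_check_dataset,
    current_people_dataset,
    on=key_cols,
    how="outer",
    indicator=True,
)

print(dtype_before, "->", dtype_after)
print(merged["_merge"].value_counts())
```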

Finally, chain replace() onto your current method to convert the string 'nan' back to a real NaN:

new_people_names = (pd.merge(data_to_check_dataset,current_people_dataset, indicator=True, how='outer',on=['first_name','last_name'])
         .query('_merge=="left_only"')
         .drop('_merge', axis=1)
         # exact match only: with regex=True, any name containing "nan" (e.g. "Fernando") would also become NaN
         .replace('nan',float('NaN')))
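One thing to watch: after the cast, a missing last name in one frame and a blank last name in the other are different strings, so those rows will not match. If blanks and NaN should count as the same value (an assumption about your data, not something the question states), a possible variant is to normalise both to an empty string and run the anti-join on the key columns only. This is a sketch on made-up data; KEY_COLS, normalized_keys and rows_not_in are hypothetical helper names, not pandas functions.

```python
import numpy as np
import pandas as pd

KEY_COLS = ["first_name", "last_name"]


def normalized_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the key columns as text, with missing values turned into ""."""
    keys = df[KEY_COLS].astype(object).fillna("")
    return keys.astype(str)


def rows_not_in(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `left` whose key pair does not occur in `right`."""
    # De-duplicate the right-hand keys so the left merge keeps one row per left row.
    right_keys = normalized_keys(right).drop_duplicates()
    flagged = normalized_keys(left).merge(
        right_keys, on=KEY_COLS, how="left", indicator=True
    )
    # A left merge preserves the left row order, so the mask lines up by position.
    keep = (flagged["_merge"] == "left_only").to_numpy()
    return left[keep]


# Hypothetical toy data.
data_to_check_dataset = pd.DataFrame(
    {
        "first_name": ["James", "Alice", "Paul Ltd", "Paul"],
        "last_name": ["Apple", "Test", np.nan, "Smith"],
    }
)
current_people_dataset = pd.DataFrame(
    {
        "first_name": ["F A", "B", "Paul", "Paul Ltd"],
        "last_name": ["l_A", "", "Smith", ""],
    }
)

new_people = rows_not_in(data_to_check_dataset, current_people_dataset)
print(new_people)
```

Because only the key columns take part in the merge, the other columns of data_to_check_dataset come back untouched, and the column order of either frame does not matter.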
