python - Pandas fastest way to get duplicates in multiple dataframes
Problem description
Given these dataframes :
df1 = pd.DataFrame({'id' : ['A', 'B', 'C', 'D', 'E'],
                    'line' : ['1', '2', '3', '4', '5'],
                    'source file' : ['df1', 'df1', 'df1', 'df1', 'df1']})
df2 = pd.DataFrame({'id' : ['F', 'G', 'H', 'D', 'J'],
                    'line' : ['1', '2', '3', '4', '5'],
                    'source file' : ['df2', 'df2', 'df2', 'df2', 'df2']})
df1
Out[27]:
id line source file
0 A 1 df1
1 B 2 df1
2 C 3 df1
3 D 4 df1
4 E 5 df1
df2
Out[28]:
id line source file
0 F 1 df2
1 G 2 df2
2 H 3 df2
3 D 4 df2
4 J 5 df2
I want to find the duplicate ids and, in a new column, record the source file and line of each duplicate.
So I concatenated the two dataframes and used the pd.Index.duplicated() method:
df3 = pd.concat([df1,df2])
df3.reset_index(inplace=True)
df3 = df3.drop(columns=['index'])
df3.insert(3, 'comment', '')
df3
Out[32]:
id line source file comment
0 A 1 df1
1 B 2 df1
2 C 3 df1
3 D 4 df1
4 E 5 df1
5 F 1 df2
6 G 2 df2
7 H 3 df2
8 D 4 df2
9 J 5 df2
for index, value in enumerate(pd.Index(df3['id']).duplicated(keep=False)):
    if value:
        comment = ''
        for item in df3.index[df3.id == df3.iloc[index]['id']].drop(index):
            id = df3.iloc[index]['id']
            comment += 'id already exists in {} file line {}'.format(
                df3.iloc[item]['source file'],
                df3.iloc[item]['line']
            )
        df3.iloc[index, df3.columns.get_loc('comment')] = comment
df3
Out[34]:
id line source file comment
0 A 1 df1
1 B 2 df1
2 C 3 df1
3 D 4 df1 id already exists in df2 file line 4
4 E 5 df1
5 F 1 df2
6 G 2 df2
7 H 3 df2
8 D 4 df2 id already exists in df1 file line 4
9 J 5 df2
I find that this code is not optimized; is there a faster way to achieve this result?
Solution
- First, number the duplicate rows. You can use the groupby and cumcount functions in pandas. For example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id' : ['A', 'B', 'C', 'D', 'E'],
'line' : ['1', '2', '3', '4', '5'],
'source file' : ['df1', 'df1', 'df1', 'df1', 'df1']})
df2 = pd.DataFrame({'id' : ['F', 'G', 'H', 'D', 'J'],
'line' : ['1', '2', '3', '4', '5'],
'source file' : ['df2', 'df2', 'df2', 'df2', 'df2']})
alldata = pd.concat([df1, df2])
alldata['row_number'] = alldata.groupby(['id']).cumcount() + 1
alldata
- Then select the rows where row_number >= 2.
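Building on the answer's cumcount step, here is one way to go from there to the asker's comment column on both duplicate rows without a Python-level loop. This is a sketch, not the answerer's own code: it pairs each row with the other rows sharing its id via a self-merge (the '_other' suffix is just a label choice), then joins the generated comments back.

```python
import pandas as pd

df1 = pd.DataFrame({'id' : ['A', 'B', 'C', 'D', 'E'],
                    'line' : ['1', '2', '3', '4', '5'],
                    'source file' : ['df1', 'df1', 'df1', 'df1', 'df1']})
df2 = pd.DataFrame({'id' : ['F', 'G', 'H', 'D', 'J'],
                    'line' : ['1', '2', '3', '4', '5'],
                    'source file' : ['df2', 'df2', 'df2', 'df2', 'df2']})

# Keep an explicit row id so self-pairings can be dropped after the merge.
alldata = pd.concat([df1, df2], ignore_index=True).reset_index()

# Step from the answer: number each occurrence of an id.
alldata['row_number'] = alldata.groupby('id').cumcount() + 1
dupes = alldata[alldata['row_number'] >= 2]  # second and later occurrences

# Pair every row with every other row that has the same id.
pairs = alldata.merge(alldata, on='id', suffixes=('', '_other'))
pairs = pairs[pairs['index'] != pairs['index_other']]  # drop self-pairings

# Build one comment fragment per pair, then concatenate per row.
pairs['comment'] = ('id already exists in ' + pairs['source file_other']
                    + ' file line ' + pairs['line_other'])
comments = pairs.groupby('index')['comment'].agg(''.join)

alldata['comment'] = alldata['index'].map(comments).fillna('')
alldata = alldata.drop(columns=['index', 'row_number'])
```

This avoids per-row .iloc lookups, which is where the original loop spends most of its time; for large frames, the merge and groupby are vectorized in pandas.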