Pandas fastest way to get duplicates in multiple dataframes

Problem description

Given these dataframes:

df1 = pd.DataFrame({'id' : ['A', 'B', 'C', 'D', 'E'],
                   'line' : ['1', '2', '3', '4', '5'],
                   'source file' : ['df1', 'df1', 'df1', 'df1', 'df1']})

df2 = pd.DataFrame({'id' : ['F', 'G', 'H', 'D', 'J'],
                   'line' : ['1', '2', '3', '4', '5'],
                   'source file' : ['df2', 'df2', 'df2', 'df2', 'df2']})
df1
Out[27]: 
  id line source file
0  A    1         df1
1  B    2         df1
2  C    3         df1
3  D    4         df1
4  E    5         df1

df2
Out[28]: 
  id line source file
0  F    1         df2
1  G    2         df2
2  H    3         df2
3  D    4         df2
4  J    5         df2

I want to find the duplicate ids and, for each one, record in a new column the source file and line of the other occurrence. So I concatenated the two dataframes and used the pd.Index.duplicated() method:

df3 = pd.concat([df1,df2])
df3.reset_index(inplace=True)
df3 = df3.drop(columns=['index'])
df3.insert(3, 'comment', '')

df3
Out[32]: 
  id line source file comment
0  A    1         df1        
1  B    2         df1        
2  C    3         df1        
3  D    4         df1        
4  E    5         df1        
5  F    1         df2        
6  G    2         df2        
7  H    3         df2        
8  D    4         df2        
9  J    5         df2   

for index, value in enumerate(pd.Index(df3['id']).duplicated(keep=False)):
    if value:
        comment = ''
        for item in df3.index[df3.id == df3.iloc[index]['id']].drop(index):
            comment += 'id already exists in {} file line {}'.format(
                df3.iloc[item]['source file'],
                df3.iloc[item]['line']
            )
        df3.iloc[index, df3.columns.get_loc('comment')] = comment
df3
Out[34]: 
  id line source file                               comment
0  A    1         df1                                      
1  B    2         df1                                      
2  C    3         df1                                      
3  D    4         df1  id already exists in df2 file line 4
4  E    5         df1                                      
5  F    1         df2                                      
6  G    2         df2                                      
7  H    3         df2                                      
8  D    4         df2  id already exists in df1 file line 4
9  J    5         df2  

This code does not seem well optimized. Is there a faster way to achieve this result?

Tags: python

Solution


  1. Find the duplicated rows. You can use the groupby and cumcount functions in pandas. For example:

import pandas as pd
df1 = pd.DataFrame({'id' : ['A', 'B', 'C', 'D', 'E'],
                   'line' : ['1', '2', '3', '4', '5'],
                   'source file' : ['df1', 'df1', 'df1', 'df1', 'df1']})

df2 = pd.DataFrame({'id' : ['F', 'G', 'H', 'D', 'J'],
                   'line' : ['1', '2', '3', '4', '5'],
                   'source file' : ['df2', 'df2', 'df2', 'df2', 'df2']})
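# Stack both dataframes into one
alldata = pd.concat([df1, df2])

# Number each occurrence of an id: 1 for the first, 2 for the second, and so on
alldata['row_number'] = alldata.groupby(['id']).cumcount() + 1
alldata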


  2. Then find the rows where row_number >= 2.
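
The two steps above can be sketched as follows. Note that the row_number filter only flags the second and later occurrences of an id, while the question wants a comment on every occurrence. The second half of the sketch therefore goes beyond the answer and shows one possible vectorized way to build the comment column with a self-merge on id. The names repeats, occ, pairs and pos are illustrative only, and ignore_index=True is an assumption added here so that each row has a unique position.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E'],
                    'line': ['1', '2', '3', '4', '5'],
                    'source file': ['df1'] * 5})
df2 = pd.DataFrame({'id': ['F', 'G', 'H', 'D', 'J'],
                    'line': ['1', '2', '3', '4', '5'],
                    'source file': ['df2'] * 5})

# Assumption: ignore_index=True so every row gets a unique 0..n-1 label
alldata = pd.concat([df1, df2], ignore_index=True)
alldata['row_number'] = alldata.groupby('id').cumcount() + 1

# Step 2 of the answer: rows whose id has already appeared earlier
repeats = alldata[alldata['row_number'] >= 2]
print(repeats)

# Beyond the answer: pair every row with the *other* rows sharing its id
occ = alldata.reset_index().rename(columns={'index': 'pos'})
pairs = occ.merge(occ, on='id', suffixes=('', '_other'))
pairs = pairs[pairs['pos'] != pairs['pos_other']].copy()  # drop self-pairs

# Build one message per pair, then join the messages for each row
pairs['comment'] = ('id already exists in ' + pairs['source file_other']
                    + ' file line ' + pairs['line_other'])
comments = pairs.groupby('pos')['comment'].agg(''.join)

# Rows without a duplicate get an empty comment
alldata['comment'] = comments.reindex(alldata.index, fill_value='')
print(alldata[['id', 'line', 'source file', 'comment']])
```

The self-merge grows quadratically with the number of rows that share one id, so this sketch suits data where each id repeats only a handful of times.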
