首页 > 解决方案 > 使用字典将重复行映射到原始行 - Python 3.6

问题描述

我正在尝试在我的数据框中找到重复的行pandas。实际上,df.shape438796, 4531,但我将下面这个玩具示例用于MRE

|   id   | ft1 | ft2 | ft3 | ft4 | ft5 |  label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 |  1  |  1  |  43 |  1  |  1  |  High  |
| id_101 |  1  |  1  |  33 |  0  |  1  | Medium |
| id_102 |  1  |  1  |  12 |  1  |  1  |   Low  |
| id_103 |  1  |  1  |  46 |  1  |  0  |   Low  |
| id_104 |  1  |  1  |  10 |  1  |  1  |  High  |
| id_105 |  0  |  1  |  99 |  0  |  1  |   Low  |
| id_106 |  0  |  0  |  0  |  0  |  0  |  High  |
| id_107 |  1  |  1  |  6  |  0  |  1  |  High  |
| id_108 |  1  |  1  |  29 |  1  |  1  | Medium |
| id_109 |  1  |  0  |  27 |  0  |  0  | Medium |
| id_110 |  0  |  1  |  32 |  0  |  1  |  High  |

我想要完成的是观察特征的一个子集,如果有重复的行,保留第一行,然后表示哪id: label对是重复的。

我看过以下帖子:

我知道pandasduplicated()电话。所以我尝试实现它并且它有点工作:

import pandas as pd

# Read in example data
df = pd.read_clipboard()

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]

# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()

# Loop through the duplicates and print out the values I want
for idx in dupes:
#    print(df[:idx])
    print(df.loc[[idx],['id', 'label']])

但是,我想要做的是对于特定行,通过将这些行保存为id: label组合来确定哪些行是它的副本。因此,虽然我能够为每个重复项提取idlabel,但我无法将其映射回它是重复项的原始行。

理想的数据集如下所示:

|   id   | ft1 | ft2 | ft3 | ft4 | ft5 |  label |                  duplicates                 |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 |  1  |  1  |  43 |  1  |  1  |  High  | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 |  1  |  1  |  33 |  0  |  1  | Medium |                {id_107: High}               |
| id_102 |  1  |  1  |  12 |  1  |  1  |   Low  |                                             |
| id_103 |  1  |  1  |  46 |  1  |  0  |   Low  |                                             |
| id_104 |  1  |  1  |  10 |  1  |  1  |  High  |                                             |
| id_105 |  0  |  1  |  99 |  0  |  1  |   Low  |                {id_110: High}               |
| id_106 |  0  |  0  |  0  |  0  |  0  |  High  |                                             |
| id_107 |  1  |  1  |  6  |  0  |  1  |  High  |                                             |
| id_108 |  1  |  1  |  29 |  1  |  1  | Medium |                                             |
| id_109 |  1  |  0  |  27 |  0  |  0  | Medium |                                             |
| id_110 |  0  |  1  |  32 |  0  |  1  |  High  |                                             |

如何获取重复值并将它们有效地映射回原始值(了解我的实际数据集的大小)?

标签: pythonpandasdataframeduplicates

解决方案


在列中使用字典确实很复杂,这是一种可能的解决方案:

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]

#mask for first dupes
m = sub_df.duplicated()
#create tuples, aggregate to list of tuples
s = (df.assign(a = df[['id','label']].apply(tuple, 1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))

#add new column
df = df.join(s.rename('duplicates'), on=cols)
#repalce missing values and not first duplciates to empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')

print (df)

        id  ft1  ft2  ft3  ft4  ft5   label  \
0   id_100    1    1   43    1    1    High   
1   id_101    1    1   33    0    1  Medium   
2   id_102    1    1   12    1    1     Low   
3   id_103    1    1   46    1    0     Low   
4   id_104    1    1   10    1    1    High   
5   id_105    0    1   99    0    1     Low   
6   id_106    0    0    0    0    0    High   
7   id_107    1    1    6    0    1    High   
8   id_108    1    1   29    1    1  Medium   
9   id_109    1    0   27    0    0  Medium   
10  id_110    0    1   32    0    1    High   

                                           duplicates  
0   {'id_102': 'Low', 'id_104': 'High', 'id_108': ...  
1                                  {'id_107': 'High'}  
2                                                      
3                                                      
4                                                      
5                                  {'id_110': 'High'}  
6                                                      
7                                                      
8                                                      
9                                                      
10                   

替代自定义函数,用于将没有第一个的所有重复项分配给每个组的新列的第一个值,最后更改掩码以替换空字符串:

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

m = ~df.duplicated(subset=cols)  & df.duplicated(subset=cols, keep=False)

def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x

df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')

print (df)
        id  ft1  ft2  ft3  ft4  ft5   label  \
0   id_100    1    1   43    1    1    High   
1   id_101    1    1   33    0    1  Medium   
2   id_102    1    1   12    1    1     Low   
3   id_103    1    1   46    1    0     Low   
4   id_104    1    1   10    1    1    High   
5   id_105    0    1   99    0    1     Low   
6   id_106    0    0    0    0    0    High   
7   id_107    1    1    6    0    1    High   
8   id_108    1    1   29    1    1  Medium   
9   id_109    1    0   27    0    0  Medium   
10  id_110    0    1   32    0    1    High   

                                           duplicated  
0   {'id_102': 'Low', 'id_104': 'High', 'id_108': ...  
1                                  {'id_107': 'High'}  
2                                                      
3                                                      
4                                                      
5                                  {'id_110': 'High'}  
6                                                      
7                                                      
8                                                      
9                                                      
10                                                     

推荐阅读