首页 > 解决方案 > 根据python中的2个列表删除列中的重复项

问题描述

我有一个数据框,例如:

Groups NAME          LETTER
G1     Canis_lupus   A
G1     Canis_lupus   B
G1     Canis_lupus   F
G1     Cattus_cattus C
G1     Cattus_cattus C
G2     Canis_lupus   C
G2     Zebra_fish    A
G2     Zebra_fish    D
G2     Zebra-fish    B
G2     Cattus_cattus D
G2     Cattus_cattus E

这个想法是我想在里面Groups只保留两个重复的NAME地方LETTERlist1=['A','B','C']list2=['D','E','F']

例如,当有重复的AandB时,我保持A字母顺序

在示例中,我应该得到:

 Groups NAME          LETTER
G1     Canis_lupus   A
G1     Canis_lupus   F
G1     Cattus_cattus C
G2     Canis_lupus   C
G2     Zebra_fish    A
G2     Zebra_fish    D
G2     Cattus_cattus D

这是数据框

{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'}, 'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus', 3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus', 6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra-fish', 9: 'Cattus_cattus', 10: 'Cattus_cattus'}, 'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C', 6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}}

标签: python-3.xpandas

解决方案


如有必要,想法首先排序LETTER,然后按两个列表过滤Series.isin,删除重复项DataFrame.drop_duplicates,最后连接在一起concat

#sorting per groups
df = df.sort_values(['Groups','LETTER'])

#sorting by one column
#df = df.sort_values('LETTER')


list1=['A','B','C'] 
list2=['D','E','F']

df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups','NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups','NAME'])


df = pd.concat([df1, df2]).sort_index(ignore_index=True)
print (df)
  Groups           NAME LETTER
0     G1    Canis_lupus      A
1     G1    Canis_lupus      F
2     G1  Cattus_cattus      C
3     G2    Canis_lupus      C
4     G2     Zebra_fish      A
5     G2     Zebra_fish      D
6     G2  Cattus_cattus      D

使用合并字典将值映射到新列的想法,类似于另一个解决方案,如果不匹配两个lists by ,则仅删除行DataFrame.dropna,最后删除辅助列和排序:

d = {**dict.fromkeys(list1, 'a'),
     **dict.fromkeys(list2, 'b')}


df = (df.assign(new = df.LETTER.map(d))
        .dropna(subset=['new'])
        .drop_duplicates(subset=['Groups', 'NAME', 'new'])
        .sort_index(ignore_index=True)
        .drop('new', 1)
        )

print (df)
  Groups           NAME LETTER
0     G1    Canis_lupus      A
1     G1    Canis_lupus      F
2     G1  Cattus_cattus      C
3     G2    Canis_lupus      C
4     G2     Zebra_fish      A
5     G2     Zebra_fish      D
6     G2  Cattus_cattus      D

推荐阅读