Remove a group if a column contains more than x values from a list

Problem description

Hello, I have a list of elements, for example:

list_element=['Elephant','Monkey','Cow','Human','Bird','Snail','Snake','Donkey','Baboon','Orang-Outan']

and a dataframe:

name  value
G1    Gr.1:4282399-4282564(+):Elephant
G1    SEQAHAHHE
G1    Zr.2:4282387-428245(-):Monkey
G1    GrA.2:42845-428289(+):Monkey
G1    QYEH897EH.3
G1    GrA2S2_ED:42845-4282789(+):Cow
G1    UDDKDDH6
G1    YDDIJBDIB778
G2    Gr.1:423663-4282542(-):Elephant
G2    Gr7E:423609-4282552(+):Elephant
G2    UEHHEE88E8E.2
G2    AP_UUD1_CU_OK-lQGGQ
G2    GrEH:423663-4282542(+):Baboon
G2    Gr7JE:42356-428257(+):Snail
G2    AP_UUD1_CU_OK-lQ8900
G2    ASGSG_E553:423663-4282542(-):Human
G3    GrA98_OK:42845-42867(+):Bird
G3    AGGAGA5567

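I keep G1 because it contains at most 3 distinct elements in total (Monkey, Elephant and Cow).

I remove G2 because it contains more than 3 distinct elements in total (Elephant, Human, Snail and Baboon).

I keep G3 because it contains at most 3 distinct elements in total (Bird).

As you can see, the elements are found in the values that contain a '):'.

The expected output would be: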

name  value
G1    Gr.1:4282399-4282564(+):Elephant
G1    SEQAHAHHE
G1    Zr.2:4282387-428245(-):Monkey
G1    GrA.2:42845-428289(+):Monkey
G1    QYEH897EH.3
G1    GrA2S2_ED:42845-4282789(+):Cow
G1    UDDKDDH6
G1    YDDIJBDIB778
G3    GrA98_OK:42845-42867(+):Bird
G3    AGGAGA5567

Thanks for your help.

Tags: python, pandas

Solution


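You can use .str.extract to pull the element out of each value, then groupby().transform('nunique') to count the distinct elements in each group and broadcast that count back to every row: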

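# Extract the first listed element found in each value (NaN if none matches)
s = (df['value'].str.extract('({})'.format('|'.join(list_element)))[0]
     .groupby(df['name'])       # group the extracted elements by name
     .transform('nunique'))     # distinct elements per group, repeated per row

# Keep only the rows whose group has at most 3 distinct elements
df[s <= 3]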

Output:

   name                             value
0    G1  Gr.1:4282399-4282564(+):Elephant
1    G1                         SEQAHAHHE
2    G1     Zr.2:4282387-428245(-):Monkey
3    G1      GrA.2:42845-428289(+):Monkey
4    G1                       QYEH897EH.3
5    G1    GrA2S2_ED:42845-4282789(+):Cow
6    G1                          UDDKDDH6
7    G1                      YDDIJBDIB778
16   G3      GrA98_OK:42845-42867(+):Bird
17   G3                        AGGAGA5567
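
To check this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same idea. Two things in it are assumptions rather than part of the answer above: the pattern is anchored to '):' at the end of the value (based on the question's note that elements follow '):'), and the names are passed through re.escape in case one ever contains a regex metacharacter. The variable names rows, pattern, elements, n_unique and result are illustrative only.

```python
import re
import pandas as pd

list_element = ['Elephant', 'Monkey', 'Cow', 'Human', 'Bird',
                'Snail', 'Snake', 'Donkey', 'Baboon', 'Orang-Outan']

# Sample data copied from the question
rows = [
    ('G1', 'Gr.1:4282399-4282564(+):Elephant'),
    ('G1', 'SEQAHAHHE'),
    ('G1', 'Zr.2:4282387-428245(-):Monkey'),
    ('G1', 'GrA.2:42845-428289(+):Monkey'),
    ('G1', 'QYEH897EH.3'),
    ('G1', 'GrA2S2_ED:42845-4282789(+):Cow'),
    ('G1', 'UDDKDDH6'),
    ('G1', 'YDDIJBDIB778'),
    ('G2', 'Gr.1:423663-4282542(-):Elephant'),
    ('G2', 'Gr7E:423609-4282552(+):Elephant'),
    ('G2', 'UEHHEE88E8E.2'),
    ('G2', 'AP_UUD1_CU_OK-lQGGQ'),
    ('G2', 'GrEH:423663-4282542(+):Baboon'),
    ('G2', 'Gr7JE:42356-428257(+):Snail'),
    ('G2', 'AP_UUD1_CU_OK-lQ8900'),
    ('G2', 'ASGSG_E553:423663-4282542(-):Human'),
    ('G3', 'GrA98_OK:42845-42867(+):Bird'),
    ('G3', 'AGGAGA5567'),
]
df = pd.DataFrame(rows, columns=['name', 'value'])

# Assumption: an element only counts when it follows "):" at the end of the value.
pattern = r'\):(' + '|'.join(map(re.escape, list_element)) + r')$'

# One capture group, so extract() returns a one-column frame; column 0 holds
# the matched element, or NaN where the value does not match.
elements = df['value'].str.extract(pattern)[0]

# Distinct elements per group (NaN is ignored), broadcast back to every row.
n_unique = elements.groupby(df['name']).transform('nunique')

# Keep the rows of groups with at most 3 distinct elements.
result = df[n_unique <= 3]
print(result)
```

Anchoring the pattern is a design choice: it avoids counting an element name that merely happens to appear inside an unrelated identifier. If your values are not shaped like the sample, the unanchored pattern from the answer may be the better fit.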
