Remove duplicated seq names in pandas

Problem description

I have a dataframe; here is an example:

cluster     seq_sp1      seq_sp2
1           seq20        seq56
1           seq56        seq20
2           seq3         seq5
3           seq9         seq5
3           seq7         seq4
3           seq4         seq7

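I would like to remove the duplicated sequence pairs. In this example the duplicates are seq20 seq56 (because of seq56 seq20) and seq7 seq4 (because of seq4 seq7).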

I think one solution would be to first sort the values across the columns, for example:

cluster     seq_sp1      seq_sp2
1           seq20        seq56
1           seq20        seq56
2           seq3         seq5
3           seq9         seq5
4           seq7         seq4
4           seq7         seq4

Then remove one of each two duplicated rows and get:

   cluster     seq_sp1      seq_sp2
    1           seq20        seq56
    3           seq3         seq5
    4           seq9         seq5
    6           seq7         seq4

Thanks for your help :)

Report on the script you gave me:

Here is the head of my first data (see the picture for the colours of the duplicated groups):

cluster_name    qseqid  sseqid  pident_x    pident_y    length  qstart  qend    sstart  send    qspec   sspec
13  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
14  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
16  cluster_016663  EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
17  cluster_016663  EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    0035    0042
19  cluster_016663  EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1 0.93    93.0    1179    1   1179    1   1175    0042    0035
20  cluster_016663  EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1 0.93    93.0    1179    1   1179    1   1175    0042    0035
22  cluster_016663  EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1 0.93    93.0    1179    1   1179    1   1175    0042    0035
23  cluster_016663  EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1 0.93    93.0    1179    1   1179    1   1175    0042    0035

Here is the result I should get:

    Unnamed: 0  cluster_name    qseqid  sseqid  pident_x    pident_y    length  qstart  qend    sstart  send    qspec   sspec
0   13  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    35  42
1   14  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    35  42
8   27  cluster_015764  EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0042_1 0.8059999999999999  82.3    1013    1   1013    1   1008    35  42
9   28  cluster_015764  EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0035_1 0.784   78.4    1013    1   1013    1   963 35  42
11  32  cluster_015764  EOG090X00LI_0042_0035_1 g1726.t1_0035_0042  0.67    58.5    1010    1   963 1   751 42  35

But I actually get:

Unnamed: 0  cluster_name    qseqid  sseqid  pident_x    pident_y    length  qstart  qend    sstart  send    qspec   sspec
0   13  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93    93.0    1179    1   1175    1   1179    35  42
1   14  cluster_016663  EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93    93.0    1179    1   1175    1   1179    35  42

I used this code:

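import numpy as np
import pandas as pd

df = pd.read_table("dataframe.txt", header=0, sep='\t')

# Sort the two IDs within each row so that swapped pairs become identical
df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
# Drop rows whose sorted (qseqid, sseqid) pair has already been seen
df = df.drop_duplicates(subset=['qseqid','sseqid'])
df.to_csv("df_test", sep='\t')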

[Image]

Tags: python, pandas, sorting, duplicates

Solution


I think you need numpy.sort followed by drop_duplicates. Note that the output contains the row-sorted values:

df[['seq_sp1','seq_sp2']] = np.sort(df[['seq_sp1','seq_sp2']], axis=1)
df = df.drop_duplicates(subset=['seq_sp1','seq_sp2'])
print (df)
   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq5    seq9
4        3    seq4    seq7
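To try the snippet above without an input file, here is a minimal self-contained sketch. It assumes the six-row example frame from the question, typed in by hand; the variable names pair_cols and deduped are only illustrative.

```python
import numpy as np
import pandas as pd

# Example frame copied by hand from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

pair_cols = ["seq_sp1", "seq_sp2"]  # illustrative name

# Sort the two values inside each row, so (a, b) and (b, a) become identical.
df[pair_cols] = np.sort(df[pair_cols].to_numpy(), axis=1)

# Keep the first row of every (now ordered) pair.
deduped = df.drop_duplicates(subset=pair_cols)
print(deduped)
```

The sort is a plain string comparison, so the order inside a pair is lexicographic rather than numeric. That does not matter here, because the only requirement is that both orientations of a pair end up in the same order.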

Or use DataFrame.duplicated to build a mask, invert it with ~ and filter by boolean indexing. This keeps the original, unsorted values in the output:

mask = pd.DataFrame(np.sort(df[['seq_sp1','seq_sp2']], axis=1), index=df.index).duplicated()
df = df[~mask]

print (df)
   cluster seq_sp1 seq_sp2
0        1   seq20   seq56
2        2    seq3    seq5
3        3    seq9    seq5
4        3    seq7    seq4
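The mask approach can be wrapped in a small helper so that the original frame is never modified. This is only a sketch: the function name drop_swapped_pairs is hypothetical, and the data is again the six-row example from the question typed in by hand.

```python
import numpy as np
import pandas as pd


def drop_swapped_pairs(df, col_a, col_b):
    """Keep the first row of each unordered (col_a, col_b) pair.

    The returned rows keep their original, unsorted values.
    """
    # Row-wise sorted copy of the two columns, used only to detect duplicates.
    sorted_pairs = np.sort(df[[col_a, col_b]].to_numpy(), axis=1)
    mask = pd.DataFrame(sorted_pairs, index=df.index).duplicated()
    return df[~mask]


df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

result = drop_swapped_pairs(df, "seq_sp1", "seq_sp2")
print(result)
```

Because the sorted values live only in a temporary frame, df itself still has all six rows afterwards, which makes it easy to compare the input and the filtered result side by side.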

EDIT:

I tested it with the new data:

df = df[['qseqid','sseqid']]
print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1
19  EOG090X00GO_0042_0035_1  EOG090X00GO_0035_0035_1
20  EOG090X00GO_0042_0035_1  EOG090X00GO_0035_0042_1
22  EOG090X00GO_0042_0042_1  EOG090X00GO_0035_0035_1
23  EOG090X00GO_0042_0042_1  EOG090X00GO_0035_0042_1

df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])

print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1

# Here df is the original 8-row frame again (before the drop_duplicates call above)
mask = pd.DataFrame(np.sort(df[['qseqid','sseqid']], axis=1), index=df.index).duplicated()
print (~mask)
13     True
14     True
16     True
17     True
19    False
20    False
22    False
23    False
dtype: bool

df = df[~mask]
print (df)
                     qseqid                   sseqid
13  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0035_1
14  EOG090X00GO_0035_0035_1  EOG090X00GO_0042_0042_1
16  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0035_1
17  EOG090X00GO_0035_0042_1  EOG090X00GO_0042_0042_1
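One way to sanity-check a result like this is to count the distinct unordered pairs with plain Python and compare that count with the number of rows kept. The sketch below assumes the eight qseqid/sseqid rows from the question, rebuilt by hand with their original index labels; the names ids, unordered and kept are illustrative.

```python
import numpy as np
import pandas as pd

# The four distinct IDs that appear in the eight rows from the question.
ids = [
    "EOG090X00GO_0035_0035_1",
    "EOG090X00GO_0035_0042_1",
    "EOG090X00GO_0042_0035_1",
    "EOG090X00GO_0042_0042_1",
]
df = pd.DataFrame(
    {
        "qseqid": [ids[0], ids[0], ids[1], ids[1], ids[2], ids[2], ids[3], ids[3]],
        "sseqid": [ids[2], ids[3], ids[2], ids[3], ids[0], ids[1], ids[0], ids[1]],
    },
    index=[13, 14, 16, 17, 19, 20, 22, 23],
)

# Independent check: represent every row as an unordered pair.
unordered = {frozenset(pair) for pair in zip(df["qseqid"], df["sseqid"])}

# Mask approach applied to the untouched 8-row frame.
sorted_pairs = np.sort(df[["qseqid", "sseqid"]].to_numpy(), axis=1)
mask = pd.DataFrame(sorted_pairs, index=df.index).duplicated()
kept = df[~mask]

print(len(unordered), len(kept))
```

If the two numbers agree, every unordered pair is represented exactly once in the filtered frame.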
