python - Remove duplicated seq names with pandas
Question
I have a dataframe; here is an example:
cluster seq_sp1 seq_sp2
1 seq20 seq56
1 seq56 seq20
2 seq3 seq5
3 seq9 seq5
3 seq7 seq4
3 seq4 seq7
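I want to remove the duplicated sequence pairs. In this example, seq20 seq56 is a duplicate of seq56 seq20, and seq7 seq4 is a duplicate of seq4 seq7.
I think one solution would be to first sort the values across the columns, for example: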
cluster seq_sp1 seq_sp2
1 seq20 seq56
1 seq20 seq56
2 seq3 seq5
3 seq9 seq5
4 seq7 seq4
4 seq7 seq4
Then remove one of the two duplicated rows to get:
cluster seq_sp1 seq_sp2
1 seq20 seq56
3 seq3 seq5
4 seq9 seq5
6 seq7 seq4
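Thanks for your help :)
Feedback on the script you gave me:
Here is the head of my actual data (see the picture for the colors marking the duplicate groups):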
cluster_name qseqid sseqid pident_x pident_y length qstart qend sstart send qspec sspec
13 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
14 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
16 cluster_016663 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
17 cluster_016663 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 0035 0042
19 cluster_016663 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
20 cluster_016663 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
22 cluster_016663 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
23 cluster_016663 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1 0.93 93.0 1179 1 1179 1 1175 0042 0035
This is the result I should get:
Unnamed: 0 cluster_name qseqid sseqid pident_x pident_y length qstart qend sstart send qspec sspec
0 13 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 35 42
1 14 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 35 42
8 27 cluster_015764 EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0042_1 0.8059999999999999 82.3 1013 1 1013 1 1008 35 42
9 28 cluster_015764 EOG090X00LI_0035_0035_1 EOG090X00LI_0042_0035_1 0.784 78.4 1013 1 1013 1 963 35 42
11 32 cluster_015764 EOG090X00LI_0042_0035_1 g1726.t1_0035_0042 0.67 58.5 1010 1 963 1 751 42 35
But this is what I actually get:
Unnamed: 0 cluster_name qseqid sseqid pident_x pident_y length qstart qend sstart send qspec sspec
0 13 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1 0.93 93.0 1179 1 1175 1 1179 35 42
1 14 cluster_016663 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1 0.93 93.0 1179 1 1175 1 1179 35 42
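I used this code:
import numpy as np
import pandas as pd
df = pd.read_table("dataframe.txt", header=0, sep='\t')
# Sort each (qseqid, sseqid) pair so that reversed pairs compare equal
df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])
df.to_csv("df_test", sep='\t')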
Solution
I think you need numpy.sort followed by drop_duplicates; this returns the rows with the values sorted:
df[['seq_sp1','seq_sp2']] = np.sort(df[['seq_sp1','seq_sp2']], axis=1)
df = df.drop_duplicates(subset=['seq_sp1','seq_sp2'])
print (df)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq5 seq9
4 3 seq4 seq7
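As a self-contained sketch of the approach above, the snippet below rebuilds the sample frame from the question and applies the same two steps. The variable names `pair_cols` and `deduped` are only illustrative, and the `.to_numpy()` call is an explicit conversion added here for clarity.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

pair_cols = ["seq_sp1", "seq_sp2"]  # illustrative name

# Sort the two values inside each row, so (seq56, seq20) and
# (seq20, seq56) end up as the same ordered pair.
df[pair_cols] = np.sort(df[pair_cols].to_numpy(), axis=1)

# Keep only the first occurrence of each ordered pair.
deduped = df.drop_duplicates(subset=pair_cols)
print(deduped)
```

The sort is a plain string comparison, so the order within a pair is lexicographic rather than numeric. That does not matter for deduplication, because both orientations of a pair are sorted the same way.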
Alternatively, build a mask with DataFrame.duplicated, invert it with ~, and filter by boolean indexing; the output then keeps the original, unsorted values:
mask = pd.DataFrame(np.sort(df[['seq_sp1','seq_sp2']], axis=1), index=df.index).duplicated()
df = df[~mask]
print (df)
cluster seq_sp1 seq_sp2
0 1 seq20 seq56
2 2 seq3 seq5
3 3 seq9 seq5
4 3 seq7 seq4
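A runnable sketch of the mask variant on the same sample data may make the difference clearer: the sorted pairs live only in a temporary frame, so the rows that survive are untouched. The names `sorted_pairs`, `result` and `all_members` are illustrative; `keep=False` is a standard `duplicated` option used here only to inspect whole duplicate groups.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 3, 3, 3],
    "seq_sp1": ["seq20", "seq56", "seq3", "seq9", "seq7", "seq4"],
    "seq_sp2": ["seq56", "seq20", "seq5", "seq5", "seq4", "seq7"],
})

pair_cols = ["seq_sp1", "seq_sp2"]  # illustrative name

# Temporary frame holding each pair in sorted order, aligned on df's index.
sorted_pairs = pd.DataFrame(
    np.sort(df[pair_cols].to_numpy(), axis=1), index=df.index
)

# True for every row whose pair was already seen in either order.
mask = sorted_pairs.duplicated()

# Boolean indexing with the inverted mask; df itself is never modified.
result = df[~mask]
print(result)

# Optional inspection: every row that belongs to a duplicated pair.
all_members = df[sorted_pairs.duplicated(keep=False)]
```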
Edit:
I tested it with the new data:
df = df[['qseqid','sseqid']]
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
19 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0035_1
20 EOG090X00GO_0042_0035_1 EOG090X00GO_0035_0042_1
22 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0035_1
23 EOG090X00GO_0042_0042_1 EOG090X00GO_0035_0042_1
df[['qseqid','sseqid']] = np.sort(df[['qseqid','sseqid']], axis=1)
df = df.drop_duplicates(subset=['qseqid','sseqid'])
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
# the mask below is computed on the original 8-row frame, not on the deduplicated one
mask = pd.DataFrame(np.sort(df[['qseqid','sseqid']], axis=1), index=df.index).duplicated()
print (~mask)
13 True
14 True
16 True
17 True
19 False
20 False
22 False
23 False
dtype: bool
df = df[~mask]
print (df)
qseqid sseqid
13 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0035_1
14 EOG090X00GO_0035_0035_1 EOG090X00GO_0042_0042_1
16 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0035_1
17 EOG090X00GO_0035_0042_1 EOG090X00GO_0042_0042_1
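To reproduce this check without the input file, the sketch below rebuilds the eight qseqid/sseqid rows shown above and wraps the mask logic in a small function. `drop_reversed_pairs` is a hypothetical helper name introduced for illustration, not a pandas function.

```python
import numpy as np
import pandas as pd


def drop_reversed_pairs(frame, col_a, col_b):
    """Hypothetical helper: drop rows whose (col_a, col_b) pair already
    appeared earlier in either order; surviving rows are left unmodified."""
    sorted_pairs = pd.DataFrame(
        np.sort(frame[[col_a, col_b]].to_numpy(), axis=1), index=frame.index
    )
    return frame[~sorted_pairs.duplicated()]


# The four sequence IDs that occur in the eight rows of the question's data.
ids = [
    "EOG090X00GO_0035_0035_1",
    "EOG090X00GO_0035_0042_1",
    "EOG090X00GO_0042_0035_1",
    "EOG090X00GO_0042_0042_1",
]
df = pd.DataFrame(
    {
        "qseqid": [ids[0], ids[0], ids[1], ids[1], ids[2], ids[2], ids[3], ids[3]],
        "sseqid": [ids[2], ids[3], ids[2], ids[3], ids[0], ids[1], ids[0], ids[1]],
    },
    index=[13, 14, 16, 17, 19, 20, 22, 23],
)

result = drop_reversed_pairs(df, "qseqid", "sseqid")
print(result)
```

Rows 19, 20, 22 and 23 are the reversed copies of rows 13, 16, 14 and 17, so four rows remain, matching the output above.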