Trying (but failing) to compare 2 lists of protein fragment sequences

Problem description

I wrote this script:

from Bio import SeqIO
from Bio import SeqUtils

protein = SeqIO.parse('short_protein.fasta', 'fasta')
id_protein = []
sequence_protein = 0   
weight_protein = 0   

for i in protein:
    sequence_protein = (f'{i.seq}')
    id_protein.append(f'{i.id}')
    weight = SeqUtils.molecular_weight(i.seq, seq_type="protein", circular=True)
    weight_protein += (round(weight, 1))

print(f'{str(id_protein)[2:-2]}  {weight_protein:6n}  {sequence_protein}')

protein_fragments = SeqIO.parse('short_protein_fragments.fasta', 'fasta')
id_fragments = []
sequence_fragments = []   
weight_fragments = []   

for i in protein_fragments:
    sequence_fragments.append(f'{i.seq}')
    id_fragments.append(f'{i.id}')
    weight = SeqUtils.molecular_weight(i.seq, seq_type="protein", circular=True)
    weight_fragments.append(round(weight, 1))
    
for item_a, item_b, item_c in zip(id_fragments, weight_fragments, sequence_fragments):
    print(f'{str(item_a)}  {item_b:6n}  {item_c}')

import itertools
from math import isclose

combinations = []
for i in range(len(weight_fragments)):
    for weight_subset in itertools.combinations(weight_fragments, i): 
        if isclose(sum(weight_subset), weight_protein): 
            combinations.append(weight_subset)
print(combinations)

data_fragments = dict(zip(weight_fragments, sequence_fragments))
print(data_fragments)

weight_combinations = [tuple(data_fragments[i] for i in c) for c in combinations]
print(weight_combinations)

import itertools
used_fragments = []
for el in sequence_fragments:
    if el in sequence_protein:
        used_fragments.append(el)

sequence_combinations = []
for i in range(0, len(used_fragments)+1):
    for seq_subset in itertools.permutations(used_fragments, i):
        if (''.join(seq_subset)) == (sequence_protein):
            sequence_combinations.append(seq_subset)
print(sequence_combinations)

if sorted(weight_combinations) == sorted(sequence_combinations): 
    print(f'The sequence "{sequence_protein}" with molecular weight {weight_protein}\ncan be covered by the fragments {str(weight_combinations)[1:-1]}\nwith molecular weights {str(combinations)[1:-1]}')
else:
    print(f'The computed weight combinations do not cover the protein sequence')
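Two details in the script above may be worth checking. `range(len(weight_fragments))` stops one short, so the subset that uses every fragment is never tried, and `dict(zip(weight_fragments, sequence_fragments))` keeps only one sequence per weight if two fragments happen to share the same rounded weight. The sketch below avoids both by combining (weight, sequence) pairs directly. The helper name `find_weight_matches`, its `abs_tol` default and the toy data are assumptions made for illustration, not part of the original script.

```python
from itertools import combinations
from math import isclose

def find_weight_matches(fragments, target, abs_tol=0.05):
    """Return tuples of sequences whose weights sum to roughly `target`.

    `fragments` is a list of (weight, sequence) pairs. The name, the
    signature and the tolerance are illustrative assumptions.
    """
    matches = []
    # len(fragments) + 1 so that the subset using every fragment is included.
    for size in range(1, len(fragments) + 1):
        for subset in combinations(fragments, size):
            total = sum(weight for weight, _ in subset)
            if isclose(total, target, abs_tol=abs_tol):
                # Carry the sequences along, so no weight -> sequence lookup is needed.
                matches.append(tuple(seq for _, seq in subset))
    return matches

# Toy data (not from the FASTA files): two fragments share the weight 2.0.
toy = [(1.0, 'AA'), (2.0, 'BB'), (2.0, 'CC'), (4.0, 'DD')]
print(find_weight_matches(toy, 5.0))
# [('AA', 'DD'), ('AA', 'BB', 'CC')]
```

Because the sequences travel with their weights, fragments with equal weights stay distinct, which a dictionary keyed by weight cannot guarantee.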

For the following protein sequence and fragments:

seq_compl  3788.4  IEEATHMTPCYELHGLRWVQIQDYAINVMQCL
seq_0000  3125.4  SKEPFKTRIDKKPCDHNTEPYMSGGNY
seq_0001  1963.4  KMITKARPGCMHQMGEY
seq_0002   397.5  AINV
seq_0003   484.5  QIQD
seq_0004  1036.3  YAINVMQCL
seq_0005  2267.6  IEEATHMTPCYELHGLRWV
seq_0006   475.6  MQCL
seq_0007    1724  HMTPCYELHGLRWV
seq_0008  2000.2  DHTAQPCRSWPMDYPLT
seq_0009   811.9  IEEATHM
seq_0010  1397.7  MVGKMDMLEQYA
seq_0011   681.8  GWPDII
seq_0012   647.7  QIQDY
seq_0013  2174.4  TPCYELHGLRWVQIQDYA
seq_0014    1794  HGLRWVQIQDYAINV
seq_0015  1040.3  KKKNARKW
seq_0016  1455.7  TPCYELHGLRWV

This gives me the following lists.

Sequence combinations:

[('IEEATHMTPCYELHGLRWV', 'QIQD', 'YAINVMQCL'),
 ('IEEATHMTPCYELHGLRWV', 'QIQDY', 'AINV', 'MQCL'),
 ('IEEATHM', 'TPCYELHGLRWV', 'QIQD', 'YAINVMQCL'),
 ('IEEATHM', 'TPCYELHGLRWV', 'QIQDY', 'AINV', 'MQCL')]

Weight combinations:

[('QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV'),
 ('AINV', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'QIQDY'),
 ('QIQD', 'YAINVMQCL', 'IEEATHM', 'TPCYELHGLRWV'),
 ('AINV', 'MQCL', 'IEEATHM', 'QIQDY', 'TPCYELHGLRWV')]

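Both lists contain essentially the same sets of fragments, but in a different order. One is computed from the possible weight combinations, the other from the permutations of fragments that reproduce the given protein sequence. I tried sorting both lists so that I could compare their elements, but the two lists seem to sort differently?

sorted(sequence_combinations)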

[('IEEATHM', 'TPCYELHGLRWV', 'QIQD', 'YAINVMQCL'),
 ('IEEATHM', 'TPCYELHGLRWV', 'QIQDY', 'AINV', 'MQCL'),
 ('IEEATHMTPCYELHGLRWV', 'QIQD', 'YAINVMQCL'),
 ('IEEATHMTPCYELHGLRWV', 'QIQDY', 'AINV', 'MQCL')]

sorted(weight_combinations)

[('AINV', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'QIQDY'),
 ('AINV', 'MQCL', 'IEEATHM', 'QIQDY', 'TPCYELHGLRWV'),
 ('QIQD', 'YAINVMQCL', 'IEEATHM', 'TPCYELHGLRWV'),
 ('QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV')]
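The difference seems to come from how `sorted()` treats a list of tuples: it reorders the outer list by comparing tuples position by position, but it never changes the order of the fragments inside each tuple. A small sketch with one pair of tuples taken from the output above:

```python
# The same three fragments, once in sequence order and once in weight order.
a = ('IEEATHMTPCYELHGLRWV', 'QIQD', 'YAINVMQCL')
b = ('QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV')

print(a == b)                      # False: same fragments, different positions
print(sorted([a]) == sorted([b]))  # False: sorting the outer list leaves each tuple as it was
print(sorted(a) == sorted(b))      # True: sorting inside the tuples lines the fragments up
```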

Is there a way to verify that the fragment combinations in the two lists are the same (list_a == list_b)?

Tags: python

Solution
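One possible approach, offered as a sketch and not as the only way to do it: put every combination into a canonical form by sorting the fragments inside each tuple, then compare the two collections as multisets so that the order of the outer lists does not matter either. The helper name `canonical` is hypothetical; the two lists are copied from the question.

```python
from collections import Counter

def canonical(combos):
    """Multiset of combinations, ignoring fragment order inside each tuple."""
    return Counter(tuple(sorted(c)) for c in combos)

sequence_combinations = [
    ('IEEATHMTPCYELHGLRWV', 'QIQD', 'YAINVMQCL'),
    ('IEEATHMTPCYELHGLRWV', 'QIQDY', 'AINV', 'MQCL'),
    ('IEEATHM', 'TPCYELHGLRWV', 'QIQD', 'YAINVMQCL'),
    ('IEEATHM', 'TPCYELHGLRWV', 'QIQDY', 'AINV', 'MQCL'),
]
weight_combinations = [
    ('QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV'),
    ('AINV', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'QIQDY'),
    ('QIQD', 'YAINVMQCL', 'IEEATHM', 'TPCYELHGLRWV'),
    ('AINV', 'MQCL', 'IEEATHM', 'QIQDY', 'TPCYELHGLRWV'),
]

same = canonical(sequence_combinations) == canonical(weight_combinations)
print(same)  # True
```

With this in place, the final check in the script could compare `canonical(weight_combinations)` with `canonical(sequence_combinations)` in place of the two `sorted(...)` calls. A `Counter` of sorted tuples is used here instead of a set of frozensets so that a fragment occurring twice in one combination, or a combination occurring twice in a list, is still counted.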

