Biopython: extracting the CDS from a modified GenBank record?

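Question

I have some basic knowledge of Python and have been extracting coding sequences from GenBank records. However, I am not sure how to handle records in which the coding sequence has been modified (for example, to correct internal stop codons). An example of such a sequence is this GenBank record (accession XM_021385495.1, in case the link does not work).

In this example, I can translate the two coding sequences I have access to, but both contain internal stop codons, and according to the note there are indels as well! This is how I access the CDS:
1 - gb_record.seq
2 - cds.location.extract(gb_record) for where feature == "CDS"

However, I need the corrected sequence. As far as I can tell, I need to use the "transl_except" tag in the CDS feature, but I do not know how to go about it.

I was wondering whether anyone could provide an example or some insight into how to do this?

Thanks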

Tags: biopython

Answer


I have some demo code written in Python 3 that should help explain this GenBank record.

import re

aa_convert_codon_di =  {
    'A':['[GRSK][CYSM].'],
    'B':['[ARWM][ARWM][CTYWKSM]', '[GRSK][ARWM][TCYWKSM]'],
    'C':['[TYWK][GRSK][TCYWKSM]'],
    'D':['[GRSK][ARWM][TCYWKSM]'],
    'E':['[GRSK][ARWM][AGRSKWM]'],
    'F':['[TYWK][TYWK][CTYWKSM]'],
    'G':['[GRSK][GRSK].'],
    'H':['[CYSM][ARWM][TCYWKSM]'],
    'I':['[ARWM][TYWK][^G]'],
    'J':['[ARWM][TYWK][^G]', '[CYSM][TYWK].', '[TYWK][TYWK][AGRSKWM]'],
    'K':['[ARWM][ARWM][AGRSKWM]'],
    'L':['[CYSM][TYWK].', '[TYWK][TYWK][AGRSKWM]'],
    'M':['[ARWM][TYWK][GRSK]'],
    'N':['[ARWM][ARWM][CTYWKSM]'],
    'O':['[TYWK][ARWM][GRSK]'],
    'P':['[CYSM][CYSM].'],
    'Q':['[CYSM][ARWM][AGRSKWM]'],
    'R':['[CYSM][GRSK].', '[ARWM][GRSK][GARSKWM]'],
    'S':['[TYWK][CYSM].', '[ARWM][GRSK][CTYWKSM]'],
    'T':['[ARWM][CYSM].'],
    'U':['[TYWK][GRSK][ARWM]'],
    'V':['[GRSK][TYWK].'],
    'W':['[TYWK][GRSK][GRSK]'],
    'X':['...'],
    'Y':['[TYWK][ARWM][CTYWKSM]'],
    'Z':['[CYSM][ARWM][AGRSKWM]','[GRSK][ARWM][AGRSKWM]'],
    '_':['[TYWK][ARWM][AGRSKWM]', '[TYWK][GRSK][ARWM]'],
    '*':['[TYWK][ARWM][AGRSKWM]', '[TYWK][GRSK][ARWM]'],
    'x':['[TYWK][ARWM][AGRSKWM]', '[TYWK][GRSK][ARWM]']}

dna_convert_aa_di = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}


dna_str = (
    "ATGACCGAGGTGCAAGACCTTGCACTTGGATTTGTTGAACCTCATGAGGTTCCCCTGGGCCCCTGGACATCGCCTTTTTCCAGCGTTCCACCAGAGACTTCACCCAACTGCTGTGACTTTTCAAACATCATTGAGAGCGGCTTGATACAGTTAGGCCACTCTCGCAGCTGTGAAGTTGTGAAGGCAAACTCCAGCGACCCATTCCTTCTTCCTTCAGAAAAGCAACTCGAGGAGCAGCGGGAGGAAACCCAGCTCTATCCTGCAGCGAGCGGGGCTGCGCAAGAGGCAGGTGCTGCTCTCACGGCCCGAAGGCAGCTCCGAGCTGCCGGGTGCGGTCACGTCAGCGGCCGAGCTGCCCGGCGGGGTGTGCATAAGAGCGAGCTATATGTGCTGCGTGTCATCACGGAGCCTTTCAAGTCCCTCCCTCCTTCTCCACTGCTGGGGCTGCAGTGGGCACCGGGCAGGAGGAGCGGCCGCAGCCCCGCGGGGGTGGGACGAGTCTCTGGGGGCTGCGCCACTTGGAAGATTTGCATTGGGTACATTGATAGCATTGTGATTGATGGCCTATTTAATACCATAATGTGTTCTTTAGATTTCTTTTTGGAGAACTCAGAAGAAAATTTGAAGCCAGCTCCACTTTTTCCAGCACAAATGACCCTTACTGGCACAGAAATTCATTTTAAACTTTCTCTAGATAAAGAGGCTGATGATGGCTTTTATGACCTTATGGATGAACTACTGGGTGATATTTTCCGAATGTCTGCCCAAGTGAAGAGACTAGAAGCCCACCTGGAATCAGAACATTAGGAGGACTATATGAACAGTGTGTTTGATCTGTCTGAACTCAGGCAGGAGAGTATGGAGAGAGTAATAAACGTCACCAACAAGGCCTTGAAGTACAGAAGATCTCATGATAGCTATGCTTATCTCTGACTAGAGGATCAGCTTGAGTTTATGAGGCAATTTCTTCCTTGTGCTCGTGGTTTAATGTCCACACAGATATCTCTTACTGGCATCCCACTACTAAACTGTGTAAAAAGCAGGCAAGAAAGAAACTAGTTTAAATAACTTCCTATTTATGAAAATCTCTGTGTTCAGATGAGTAAGTTTGAAGACCCAAGAATTTTTGAAAGCTGGTTTAAGGTGATTATGAAGCCTTTCAAAATGACACTTCTAAACATTACTAAGAAGTGGAGCTGGATGTTTAAGTAGTACACTATAGAAATAATAAGATTGAGTCTGAATGACTTCAAAGACTTTATAAAAGTGACAGATGCTGGACTTCAAAGAGGGAGGCATTATTGTGCACTGGCAGAAATCACCGGTCACCTCTTGGCTGTGAAAGAGAGGCAGACAGCTGCTGGTGAATCCTTTGAACCTTTAAAAGAANTTGTTGCATTGTTGGAAAGCTACAGACAGAAGATGCCAGATCAAGTTTGCATCCAGTGTCAAATCAGTTGTATCCTGGGAGCCTTTAAGGGTTATGTACTTCTGGTTGGAGTAGGTGGTAGTGATAAATGAAGCTTGTCAAGGCTGGCAGCATGCATCTCTTCCCTGGAGGTCTTTTAAATCATATGGAAGAAAGACCATGAGAGCAAGAACCTGAAGGTAGATGTTGCCAGTTTGTGCATCAAGACTGGTGCCAAGAACATGCCCACAGTGTTTTTGCTGACAGATGCCCAGGTTCCAGATGAACGCTTTCTTGTGCTGATTAATGACTTGTTGGCATCAAGAGATCTTCCTGATCTGTTCAGTGGTGAAGATGAGGAGGGCAAAGTTGCAGGAGTCAGAAAAGAAGTCNNCCTGGGCTTGATGGACACCACAGAAAGCTGCTGGAGGTGGTTCTTTGGTAGAGCGCAGCAGCTGTTAAAAGTGTATGGTGAAGTAGAGTCGAAATGTTGTGCACTGGTCCAGGCAAATACAAAATTAGCAACAGCTAAAGAGAATCTAGAAACAATCTTGAAAAAGCTTATTTCTGAAAATGTGCAT"
    "TGGAGCCAATCTGTTGAAAACCTCAAAGCATAAAAGAAAACTGTACTCAAGGATGTTACATCAGCAGCAGCGTTTGCATCTTTCTTTGGAGCCTTCACAAAACCATATAGTCAAGAACAGATGGAACATTTCTGGATTCTTTCTCTAAAGTCACAGGAGTGTCCTGTTCCTGTGATAGAGGGGCCAGACTCTGCCATCCTGATGAATGATGCTCCAAGAGCAGCACAGAGTAACAAGAGTCTGCTTGCTGATAGGGTGTCAGCAGAAAATGCCACTGCTCTGACACACTGTGAGCAGGGCCCTCTGATGATAGATCCCCAGAAACAGGGAATTGAATGGACACAGAATAAATACAGAACTGACTTTAAAGTCATGCATCTAGGAGAGAATGGTTATGTGTGTACTATTGATACAGCTTTGGCTTGTGGAGAGATTATACTAATTGAAAACATGGCTGAATCTATCGATCTCTTACTTGATCCCCTAACTGGAAGACATACAGGTAAAAGGGGAAGGAATACTTGCGCAATCAGAATTTCTTGAAGACAAAAAAAAAAAAAGTGTGAATTCTACAGGAATTTCCATCTCATCCTTCACACTAAGCTGGCTAACCCTCCCTGCAAGCCAGAGCTTNAGGCTCAGACCACTCTCATTATTTTCACAGATACCAGGGGCAGGCTGGAAGAACAGCTGTTGGCTGAGGTGGTGAGTGCTGAAAGGCCTGACTTGGAAAACCATACGTCAGCACTGGCGAAACAGAAGAGTGTCTCTGAAATCAAGCCCAAGCAGCTTGAGGACAACATGCTGCTCAGTCTGTCAGCTGCCCAGAGCACTTTTGTAGGTGACAGTGAACTTGAAGAGAAATTCAAGTCAACTGCAGGAGAAATGATTGTCCGCCCACATGTTCACAGCTTCTTATTTTGGCAAAAAGCTTCCACTGTAGACTCTGGAAGATTTCATATCTCTTTAGGACAAGGGCAGGAGATGGTTGTGGAGNGACAACTTGAGAAGGCTGCCAAGCCTGGCCACTGGCTTCTTCTCCAAAATATTAATGTGGTAGCCAAGTGGCTAGGAACCTTGGAAAAACTCCTCGAGCAATAGAGTGAAGAAAGTCACTGGTATTTCCGTGTCTTCACTAGTGCTGAACCAGCTCCAGCCCCAGAAGAGCACATCATTCTTCAAGGAGTACTTGAAAACTGAATTAAAATTACCAGACTATCAATAACACTGCCAGTTGTTAAGTGGATAAATGTATTCCTTTTTTTCCTTTGGCAGGATACCCTTGAACTGTGTGGCAAAGAACAGGAATTTAAGAGCATTCTTTTCTCCCTTCGTTATTTTCACACCCGTGTTGCCAGCAGACTCATTTGGCCTTCCAGGCTGCAATTAAGATACCCATACAATACTAGAGATCTCACTGTTTGCATCAGTGTGCCCTGCAACTATTTAGACACTTACACAGAGGTCAGACGCAGTGGTCAGAAAAACAAGTCTATAAAATCAGCTGATTCCAACCCTTAG"
)
aa_str = "MTEVQDLALGFVEPHEVPLGPWTSPFSSVPPETSPNCCDFSNIIESGLIQLGHSRSCEVVKANSSDPFLLPSEKQLEEQREETQLYPAASGAAQEAGAALTARRQLRAAGCGHVSGRAARRGVHKSELYVLRVITEPFKSLPPSPLLGLQWAPGRRSGRSPAGVGRVSGGCATWKICIGYIDSIVIDGLFNTIMCSLDFFLENSEENLKPAPLFPAQMTLTGTEIHFKLSLDKEADDGFYDLMDELLGDIFRMSAQVKRLEAHLESEHXEDYMNSVFDLSELRQESMERVINVTNKALKYRRSHDSYAYLXLEDQLEFMRQFLPCARGLMSTQISLTGIPLLNCVKSRQERNXFKXLPIYENLCVQMSKFEDPRIFESWFKVIMKPFKMTLLNITKKWSWMFKXYTIEIIRLSLNDFKDFIKVTDAGLQRGRHYCALAEITGHLLAVKERQTAAGESFEPLKEXVALLESYRQKMPDQVCIQCQISCILGAFKGYVLLVGVGGSDKXSLSRLAACISSLEVFXIIWKKDHESKNLKVDVASLCIKTGAKNMPTVFLLTDAQVPDERFLVLINDLLASRDLPDLFSGEDEEGKVAGVRKEVXLGLMDTTESCWRWFFGRAQQLLKVYGEVESKCCALVQANTKLATAKENLETILKKLISENVHWSQSVENLKAXKKTVLKDVTSAAAFASFFGAFTKPYSQEQMEHFWILSLKSQECPVPVIEGPDSAILMNDAPRAAQSNKSLLADRVSAENATALTHCEQGPLMIDPQKQGIEWTQNKYRTDFKVMHLGENGYVCTIDTALACGEIILIENMAESIDLLLDPLTGRHTGKRGRNTCAIRISXRQKKKKCEFYRNFHLILHTKLANPPCKPELXAQTTLIIFTDTRGRLEEQLLAEVVSAERPDLENHTSALAKQKSVSEIKPKQLEDNMLLSLSAAQSTFVGDSELEEKFKSTAGEMIVRPHVHSFLFWQKASTVDSGRFHISLGQGQEMVVEXQLEKAAKPGHWLLLQNINVVAKWLGTLEKLLEQXSEESHWYFRVFTSAEPAPAPEEHIILQGVLENXIKITRLSITLPVVKWINVFLFFLWQDTLELCGKEQEFKSILFSLRYFHTRVASRLIWPSRLQLRYPYNTRDLTVCISVPCNYLDTYTEVRRSGQKNKSIKSADSN"

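mod_dna_str = ""
# Residues of the annotated protein still to be matched; one is consumed per matched codon.
mod_aa_str = aa_str[:]
start = 0

# Walk the CDS codon by codon, starting in frame at `start`.
for index in range(start, len(dna_str), 3):
    codon = dna_str[index:index+3]
    if len(mod_aa_str) == 0:
        break
    # The codon translates directly to the expected residue: nothing to report.
    if codon in dna_convert_aa_di and dna_convert_aa_di[codon] == mod_aa_str[0]:
        mod_aa_str = mod_aa_str[1:]
    else:
        # Otherwise test the codon against the ambiguity-aware patterns for that residue.
        codon_match = "|".join(aa_convert_codon_di[mod_aa_str[0]])
        if len(re.findall(codon_match, codon)) > 0:
            # Report the 0-based offset in the CDS, the pattern used, and the codon.
            print(index, codon_match, codon)
            mod_aa_str = mod_aa_str[1:]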

Output:

804 ... TAG
930 ... TGA
1056 ... TAG
1065 ... TAA
1209 ... TAG
1389 ... NTT
1518 ... TGA
1566 ... TAA
1800 ... NNC
2019 ... TAA
2529 ... TGA
2622 ... NAG
2985 ... NGA
3087 ... TAG
3186 ... TGA
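
For readers who only want the positions, a similar list can be produced without the regex table by translating codon by codon and flagging any internal stop codon or any codon that contains an ambiguous base. The sketch below is a simplified variant, not the code above: the names `flag_codons` and `toy_cds` are hypothetical, the standard genetic code is assumed, and the toy sequence is made up rather than taken from the record.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def flag_codons(cds):
    """Return (offset, codon, reason) for internal stops and ambiguous codons.

    Offsets are 0-based positions within the CDS string. The final complete
    codon is allowed to be a stop and is not reported.
    """
    flagged = []
    last = len(cds) - len(cds) % 3 - 3  # offset of the final complete codon
    for offset in range(0, len(cds) - 2, 3):
        codon = cds[offset:offset + 3].upper()
        aa = CODON_TABLE.get(codon)
        if aa is None:
            # Any codon with a non-ACGT base (for example N) is not in the table.
            flagged.append((offset, codon, "ambiguous"))
        elif aa == "*" and offset != last:
            flagged.append((offset, codon, "internal stop"))
    return flagged


# Hypothetical toy CDS: ATG GCT TAG AAA NTT TGG TAA
toy_cds = "ATGGCTTAGAAANTTTGGTAA"
print(flag_codons(toy_cds))
# [(6, 'TAG', 'internal stop'), (12, 'NTT', 'ambiguous')]
```

Unlike the demo code, this variant never looks at the annotated protein, so it cannot confirm that the remaining codons agree with the annotated translation.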

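From the note on the CDS feature we have: "inserted 5 bases in 4 codons; deleted 2 bases in 2 codons; substituted 11 bases at 11 genomic stop codons".

How does this relate to our output? The output lists 15 codons: 11 stop codons and 4 codons containing N. The reading frame never changes, which indicates that the 2 deleted bases are not present in the nucleotide sequence given. The 5 unknown nucleotides (N) fall in 4 codons, each translated as an unknown amino acid (X). In other words, the author of this sequence has already accounted for the indels. The 11 premature stop codons are simply translated as unknown amino acids. The "transl_except" tags match the positions of the premature stop codons, and the nucleotides at these sites are left unchanged. The author provides XP_021241170 as the possible corrected translation product, but it is still quite poor.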
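
To make the role of "transl_except" concrete, here is a minimal sketch of how its values could be parsed and applied to a naive translation. It rests on several assumptions: the values have the simple form "(pos:START..END,aa:NAME)" with 1-based record positions, the CDS is a single unspliced forward-strand interval, and the helper names (`parse_transl_except`, `apply_transl_except`) and the toy data are hypothetical. In Biopython the strings would typically come from `feature.qualifiers.get("transl_except", [])`, since qualifier values are stored as lists of strings.

```python
import re

# Matches only the simple forward-strand form "(pos:START..END,aa:NAME)".
TRANSL_EXCEPT_RE = re.compile(r"\(pos:(\d+)\.\.(\d+),aa:(\w+)\)")

# Assumed mapping from residue names in the qualifier to one-letter codes;
# extend it if a record uses other names.
AA_NAMES = {"OTHER": "X", "TERM": "*", "Sec": "U", "Pyl": "O"}


def parse_transl_except(value):
    """Return (start, end, one_letter) with 1-based inclusive record positions."""
    match = TRANSL_EXCEPT_RE.fullmatch(value.replace(" ", ""))
    if match is None:
        raise ValueError(f"unsupported transl_except value: {value!r}")
    start, end, name = match.groups()
    if name not in AA_NAMES:
        raise ValueError(f"unknown residue name: {name!r}")
    return int(start), int(end), AA_NAMES[name]


def apply_transl_except(protein, values, cds_start):
    """Overwrite residues of `protein` at the codons named by `values`.

    cds_start is the 1-based record position of the first base of the CDS.
    """
    residues = list(protein)
    for value in values:
        start, _end, letter = parse_transl_except(value)
        # Convert the record position into a 0-based codon index.
        residues[(start - cds_start) // 3] = letter
    return "".join(residues)


# Hypothetical toy values, not taken from XM_021385495.1.
raw_translation = "MA*KW"           # naive translation with an internal stop
values = ["(pos:17..19,aa:OTHER)"]  # third codon of a CDS starting at position 11
print(apply_transl_except(raw_translation, values, cds_start=11))
# MAXKW
```

This only rewrites the protein string. As noted above, the nucleotides at these sites stay as they are in the record.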

