首页 > 技术文章 > Biopython学习笔记

xiezh 2021-12-10 14:00 原文

一、Biopython 学习资料
Biopython教程和cookbook:http://biopython.org/DIST/docs/tutorial/Tutorial.html
Biopython教程(中文版cookbook):http://biopython-cn.readthedocs.io/zh_CN/latest/
更多文档:
http://biopython.org/wiki/Documentation
http://biopython.org/wiki/Category%3ACookbook
 
PyCogent:https://github.com/pycogent/pycogent/
 
二、序列输入和输出
1. 通过SeqIO读取序列文件
主要使用 Bio.SeqIO模块。
导入:
from Bio import SeqIO
 
Bio.SeqIO.read() 读取一个record,并返回record对象。如:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:10638'])
 
Bio.SeqIO.parse() ,它用于读取序列文件生成 SeqRecord 对象,包含两个参数:
第一个参数是一个文件名或者一个句柄( handle )。句柄可以是打开的文件,命令行程序的输出,或者来自下载的数据(请见第 5.3 节)。
第二个参数是一个小写字母字符串,用于指定序列格式,支持的文件格式请见 http://biopython.org/wiki/SeqIO
一般支持的文件格式有(常见的):fasta、fastq、genbank、embl、swiss、clustal、phylip、sff等。
 
Bio.SeqIO.parse() 用于读取序列文件并返回 SeqRecord 对象,并通常用在循环中,例如:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(seq_record.description.replace(seq_record.id, ""))
    print(seq_record.seq)
    print(len(seq_record))    # 等同于 len(seq_record.seq)
 
此外,可通过 SeqIO.to_dict() 函数将SeqRecord对象转为dict对象,如:
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("NC_005816.gbf", "genbank"))
>>> orchid_dict
{'NC_005816.1': SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['BioProject:PRJNA224116', 'BioSample:SAMN02602970', 'Assembly:GCF_000007885.1'])}
 
Since this variable orchid_dict is an ordinary Python dictionary, we can look at all of the keys we have available:
>>> len(orchid_dict)
>>> list(orchid_dict.keys())
 
You can leave out the “list(...)“ bit if you are still using Python 2. Under Python 3 the dictionary methods like “.keys()“ and “.values()“ are iterators rather than lists.
 
We can access a single SeqRecord object via the keys and manipulate the object as normal:
>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
..................................
>>> print(repr(seq_record.seq))
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
 
 
2. 将序列写入文件
使用 Bio.SeqIO.write() 输出序列(写入文件)。该函数需要三个参数:某些 SeqRecord 对象(A list (or iterator) of SeqRecord objects, or (if using Biopython 1.54 or later) a single SeqRecord.),要写入的句柄或文件名,和序列格式。
例如:
from Bio import SeqIO
SeqIO.write(my_records, "my_example.faa", "fasta")
 
或者使用文件句柄(handle),例如:
from Bio import SeqIO
records = ...
with open("example.faa", "w") as handle:
SeqIO.write(records, handle, "fasta")
 
注意,如果你 Bio.SeqIO.write() 函数要写入的文件已经存在,旧文件将会被覆写,并且不会得到任何警告信息。
 
例子1:
records_new = SeqIO.parse(options.input, "fasta")
result = open(options.output, "w")
for rec in records_new:
    seq_id = rec.id
    alist = seq_id.split("|")
    if options.tag != "":
        alist[0] = alist[0] + "|" + options.tag
    seq_id = " ".join(alist)

    seq_desc = rec.description.replace(rec.id, "")
    seq_desc = re.sub(r">", "", seq_desc)
    seq_desc = seq_desc.strip()
    # 修改原先的ID、description
    rec.id = seq_id
    rec.description = seq_desc
    #print(seq_desc)
    #print(rec.id)

    seq_len = len(rec.seq)
    if seq_len >= options.length:
        SeqIO.write(rec, result, "fasta")  # 将record对象写入输出文件

result.close()
 
3. 将GenBank 转为 FASTA
方法1:
通过 Bio.SeqIO.parse() 和 Bio.SeqIO.write() 函数实现函数的转换。如:
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")
print("Converted %i records" % count)
 
方法2:
使用 SeqIO.convert 函数进行格式转换。
>>> from Bio import SeqIO
>>> SeqIO.convert("NC_011833.gbk", "genbank", "NC_011833_cov.fna", "fasta")
 
方法3:
import Bio.SeqIO
gbk_file = "NC_011833.gbk"
fna_file = "NC_011833_cov.fna"
input = open(gbk_file, "r")
output = open(fna_file, "w")

for seq in SeqIO.parse(input, "genbank") :
    print("Convert %s from GenBank to FASTA format" % seq.id)
    output.write(">%s %s\n%s\n" % (
        seq.id, seq.description, seq.seq))

input.close()
output.close()
 
参考:http://biopython-cn.readthedocs.io/zh_CN/latest/cn/chr05.html#sec-seqio-conversion
https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/
 
可以从fasta中提取基因序列的4款软件 - 简书 (jianshu.com)
 
4. SeqRecord 对象内置的方法
Sequence record 或称之为 SeqRecord 类,该类在 Bio.SeqRecord 模块中有定义。
导入 SeqRecord 类:
from Bio.SeqRecord import SeqRecord
更多详情见:http://biopython.org/wiki/SeqRecord
 
SeqRecord 类包括下列属性:
.seq
– The sequence itself, typically a Seq object.
.id
– The primary ID used to identify the sequence – a string. In most cases this is something like an accession number.
.name
– A “common” name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record.
.description
– A human readable description or expressive name for the sequence – a string.
.letter_annotations
– Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores (e.g. Section 20.1.6) or secondary structure information (e.g. from Stockholm/PFAM alignment files).
.annotations
– A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more “unstructured” information to the sequence.
.features
– A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). The structure of sequence features is described below in Section 4.3.
.dbxrefs
- A list of database cross-references as strings.
 
SeqRecord 类包括下列方法:
.lower() # 将record序列全部变为小写字母
.upper() # 将record序列全部变为大写字母
.reverse_complement() # 对record对象进行反向互补,注意此时record的id、name、description等属性都会变化,如果不指定则为'<unknown>'。
.format() # 对record对象进行格式化,常用于文件格式的转换。
 
注意,SeqRecord 对象没有 .len 或者 .length 这样求序列长度的方法,想获得序列长度,可以:len(seq_record) 或者 len(seq_record.seq)
 
5. SeqRecord.Seq 对象内置的方法
.count()
- 对Seq对象中的字符计数,类似字符串。
.translate()
- 将核酸序列翻译为蛋白序列并返回,需提供密码子表,默认是1.例如:
seq = seq_record.seq
seq.translate(table=set_table)
 
# 注意,SeqRecord对象没有 translate() 方法,是Seq对象才有。
# 支持的编码表为:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
.complement()
- 获得核酸序列的互补链,例如:seq.complement()
.reverse_complement()
- 获得核酸序列的反向互补链,例如: seq.reverse_complement()
 
针对complement()、reverse_complement(),例子如下:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
 
As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
 
针对translate(),例子如下:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
 
备注,Bio.Alphabet 在biopython 1.78以上版本已经删除。详情见:https://biopython.org/wiki/Alphabet
 
6. 统计序列中的GC含量
1)使用 Seq.count() 方法
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
 
2)使用 Bio.SeqUtils 模块
Bio.SeqUtils 模块已经 建立了好几个GC函数,类如:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
# 返回的是GC百分比
 
注意在使用 Bio.SeqUtils.GC() 函数时会自动处理序列和可代表G或者C的歧意核苷酸 字母S混合的情况。
 
Bio.SeqUtils 模块中,GC_skew(seq, window=100) Calculate GC skew (G-C)/(G+C) for multiple windows along the sequence.
 
更多详情用法见:http://biopython.org/DIST/docs/api/Bio.SeqUtils-module.html
 
7. 在N处切分序列
例如:
def splitScaff_N(scaff_file, out_file):
    output = open(out_file, "w")
    for seq_record in SeqIO.parse(scaff_file, "fasta"):
        num = 1
        seq_seq = str(seq_record.seq)
        compile_mo = re.compile(r"N+", re.I)
        substrs = compile_mo.split(seq_seq)
        seq_id = seq_record.id
        seq_desc = seq_record.description.replace(seq_id, "")
        seq_desc = seq_desc.strip()
        for item in substrs:
            new_id = seq_id + "_" + str(num)
            if seq_desc != "":
                output.write(">%s %s\n%s\n" % (new_id, seq_desc, item))
            else:
                output.write(">%s\n%s\n" % (new_id, item))
            num += 1
    output.close()
 
8. fastq 处理
(1)识别quality
参考:http://biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html
 
(2)切取部分序列(针对fastq)
如果对SeqRecord进行切片,letter_annotations会自动被切片,例如取最后10个基数:
>>> sub_record = record[-10:]
>>> print("%s %s" % (sub_record.id, sub_record.seq))
slxa_0001_1_0001_01 ACGTNNNNNN
>>> print(sub_record.letter_annotations["solexa_quality"])
[4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
 
如果只截取前50bp的fastq序列。
示例:
import sys, os, re
import shutil
import glob
import gzip
from Bio import SeqIO

def cut50fq(fq1):
    # yield
    with gzip.open(fq1, 'rt') as f1:
        for rec in SeqIO.parse(f1, 'fastq'):
            sub50 = rec[0:50]
            yield sub50

qq = sys.argv[1]
out_dir = sys.argv[2]
out_qq = os.path.join(out_dir, os.path.basename(qq))
with gzip.open(out_qq, 'wt') as fout2:
    for rec in cut50fq(qq):
        SeqIO.write(rec, fout2, 'fastq')
 
三、创建SeqRecord对象
参考:http://biopython.org/wiki/SeqRecord
简单地创建一个 SeqRecord 对象,例如:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                       IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print(record)
 
下面我们应用 SeqRecord 对象进行序列处理。
如何将基因核酸序列翻译为蛋白序列?
例子:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

seqFile = os.path.abspath(sys.argv[1])
outSeq = os.path.abspath(sys.argv[2])
code_table = sys.argv[3]
if not code_table:
    code_table = 11

handle = open(outSeq, "w")
for seq_record in SeqIO.parse(seqFile, "fasta"):
    seq_id = seq_record.id
    seq_seq = seq_record.seq
    protein_seq = seq_seq.translate(table=code_table)
    protein_record = SeqRecord(Seq(str(protein_seq).rstrip("*"), IUPAC.protein), id=seq_id, description="")
    SeqIO.write(protein_record, handle, "fasta")
handle.close()
 
 
根据prodigal预测结果(gff文件)取序列。代码如下:
import sys, os, re
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

def Prodigal_Parser(fastaf, gfff, outPrefix, codon=11):
    fastaf = os.path.abspath(fastaf)
    gfff = os.path.abspath(gfff)
    
    # read fasta sequences
    seq_Dicts = {}
    for seq_record in SeqIO.parse(fastaf, 'fasta'):
        seq_id = seq_record.id
        #seq_desc = seq_record.description.replace(seq_id, "")
        #seq_desc = seq_desc.strip()
        seq_Dicts[seq_id] = seq_record

    # read the gff file of prodigal result
    number = 1
    nuc_out = open(outPrefix+"_gene.ffn", 'w')
    protein_out = open(outPrefix+'_protein.faa', 'w')
    with open(gfff, 'r') as fh:
        for line in fh:
            if re.search(r"^#|^\s*$", line):
                continue
            if re.search(r"CDS", line):
                fields = line.strip().split("\t")
                scaf_id = fields[0]
                start_site = int(fields[3]) - 1
                end_site = int(fields[4])
                substr = seq_Dicts[scaf_id].seq[start_site:end_site]
                subrecord = SeqRecord(substr, id=scaf_id+"_orf"+str(number), description='')
                if fields[6] == '-':
                    rc_seq = subrecord.seq.reverse_complement()
                    subrecord = SeqRecord(rc_seq, id=subrecord.id, description='')
                # 输出核酸序列
                SeqIO.write(subrecord, nuc_out, 'fasta')
                # protein
                protein = subrecord.seq.translate(table=codon)
                protein_record = SeqRecord(Seq(str(protein).rstrip("*"), IUPAC.protein), id=subrecord.id, description='')
                SeqIO.write(protein_record, protein_out, 'fasta')
                number += 1
    
    nuc_out.close()
    protein_out.close()

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python3 %s <genome> <prodigal_gff> <out_prefix> [codon_table]" % sys.argv[0], file=sys.stderr)
        sys.exit(1)

    fastaFile = sys.argv[1]
    gffFile = sys.argv[2]
    out_prefix = sys.argv[3]
    if len(sys.argv) >= 5:
        codon_table = sys.argv[4]
    else:
        codon_table = 11
    #Prodigal_Parser(fastaFile, gffFile, 'prodigal')
    Prodigal_Parser(fastaFile, gffFile, out_prefix, codon_table)
 
四、解析Genbank文件(gbk,gbff)
详情参考:https://biopython-cn.readthedocs.io/zh_CN/latest/cn/chr04.html
Dealing with GenBank files in Biopython (warwick.ac.uk)
 
4.1 简单说明
这里以 NC_005816.gbf 为例。
>>> from Bio import SeqIO
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> record = SeqIO.read("NC_005816.gbf", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['BioProject:PRJNA224116', 'BioSample:SAMN02602970', 'Assembly:GCF_000007885.1'])
>>> len(record) # 返回seq长度(bp)
9609
>>> len(record.features) # 返回features个数
27
 
说明:
genbank 的 SeqRecord 类非常简单,包括下列属性:
.seq
– 序列自身(即 Seq 对象)。
.id
– 序列主ID(-字符串类型)。通常类同于accession number。
.name
– 序列名/id (-字符串类型)。 可以是accession number, 也可是clone名(类似GenBank record中的LOCUS id)。
.description
– 序列描述(-字符串类型)。
备注,name 源于 LOCUS行, id 附加了版本后缀。description源于DEFINITION 行。
.letter_annotations
– 对照序列的每个字母逐字注释(per-letter-annotations),以信息名为键(keys),信息内容为值(value)所构成的字典。值与序列等长,用Python列表、元组或字符串表示。.letter_annotations可用于质量分数(如第 18.1.6 节) 或二级结构信息 (如 Stockholm/PFAM 比对文件)等数据的存储。
 
dbxrefs 列表中的数据来自 PROJECT 或DBLINK行:
>>> record.dbxrefs
['BioProject:PRJNA224116', 'BioSample:SAMN02602970', 'Assembly:GCF_000007885.1']
# 返回的是列表
 
多数注释信息储存在 annotations 字典中,例如:
>>> record.annotations
{'molecule_type': 'DNA', 'topology': 'circular', 'data_file_division': 'CON', 'date': '10-APR-2017', 'accessions': ['NC_005816'], 'sequence_version': 1, 'keywords': ['RefSeq'], 'source': 'Yersinia pestis biovar Microtus str. 91001', ... }
 
features 列表以 SeqFeature 对象(通过函数 record.features获得)的形式保存了features table中的所有entries(如genes和CDS等)。
Genbank中feature列表说明见:https://www.insdc.org/documents/feature-table
 
4.2 SeqFeature 对象
SeqFeature 对象包含的属性:
.type
– 用文字描述的feature类型 (如 ‘CDS’ 或 ‘gene’)。
.location
– SeqFeature 在序列中所处的位置。见第 4.3.2 节。 SeqFeature 设计了众多针对location对象的功能,包含一系列简写的属性。
.ref
– .location.ref 简写 –location对象相关的参考序列。通常为空(None)。
.ref_db
– .location.ref_db 简写 – 指定 .ref 相关数据库名称。通常为空(None)。
.strand
– .location.strand 简写 – 表示feature所处序列的strand。在双链核酸序列中,1表示正链, -1表示负链, 0 表示strand信息很重要但未知, None表示strand信息未知且不重要。蛋白和单链核酸序列为None。
.location.start 起始位置,为ExactPosition(准确位点),用一个数字表示。可通过int()强制性转为一个整数。(注意,这里是python计数,从0开始)
.location.end 起始位置。
.location.nofuzzy_start 和 .location.nofuzzy_end 为了兼容旧版Biopython,保留了整数形式的 nofuzzy_start and nofuzzy_end,其跟int(.location.start) 和 int(.location.end) 等同。
.qualifiers
– 存储feature附加信息(Python字典)。键(key)为值(value)所存信息的单字简要描述,值为实际信息。比如,键为 “evidence” ,而值为 “computational (non-experimental)”。 这只是为了提醒人们注意,该feature没有被实验所证实(湿实验)。Note:为与GenBank/EMBL文件中的feature tables对应,规定.qualifiers 中值为字符串数组(即使只有一个字符串)。
.sub_features
– 只有在描述复杂位置时才使用,如 GenBank/EMBL文件中的 ‘joins’ 位置。 已被 CompoundLocation 对象取代,因此略过不提。
 
>>> for seq_feature in record.features:
... print(seq_feature.qualifiers)
...
OrderedDict([('locus_tag', ['YP_RS22210']), ('old_locus_tag', ['pPCP01', 'YP_pPCP01'])])
...
# seq_feature为字典(dict)对象,每个feature中特征值为字典中元素。
 
示例:
>>> for seq_feature in record.features:
... if seq_feature.type == "CDS":
... print(seq_feature.type)
... print(seq_feature.location)
... print(seq_feature.ref)
... print(seq_feature.ref_db)
... print(seq_feature.strand)
...
CDS
[3102:3337](+)
None
None
1
CDS
[<3358:3442](+)
None
None
1
 
4.3 SeqFeature 对象的location的描述
location的模糊Positions用5个类分别描述:
ExactPosition
– 精确位点,用一个数字表示。从该对象的 position 属性可得知精确位点信息。
BeforePosition
– 位于某个特定位点前。如 `<13' , 在GenBank/EMBL中代表实际位点位于13之前。从该对象的 position 属性可得知上边界信息。
AfterPosition
– 与 BeforePosition 相反,如 `>13' , 在GenBank/EMBL中代表实际位点位于13以后。从该对象的 position 属性可获知下边界信息。
WithinPosition
– 介于两个特定位点之间,偶尔在GenBank/EMBL locations用到。如 ‘(1.5)’, GenBank/EMBL中代表实际位点位于1到5之间。该对象需要两个position属性表示,第一个 position 表示下边界(本例为1), extension 表示上边界与下边界的差值(本例为4)。因此在WithinPosition中, object.position 表示下边界, object.position + object.extension 表示上边界。
OneOfPosition
– 表示几个位点中的一个(GenBank/EMBL文件中偶尔能看到),比如在基因起始位点不明确或者有两个候选位点的时候可以使用,或者用于明确表示两个相关基因特征时使用。
UnknownPosition
– 代表未知位点。在GenBank/EMBL文件中没有使用,对应 UniProt中的 ‘?’ feature坐标。
 
在创建SeqFeature对象时,需要FeatureLocation 对象作为其location属性,例如创建一个FeatureLocation对象,则为:my_location = SeqFeature.FeatureLocation(start_pos, end_pos),那么创建SeqFeature对象则为:
example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
 
Location检测:
可用Python关键词 in 检验某个碱基或氨基酸残基的父坐标是否位于 feature/location中。
假定你想知道某个SNP位于哪个feature里,并知道该SNP的索引位置是4350(Python 计数)。一个简单的实现方案是用循环遍历所有features:
>>> for seq_feature in record.features:
... if my_snp in seq_feature:
... print(seq_feature.type, seq_feature.qualifiers.get('locus_tag'), seq_feature.qualifiers.get('db_xref'))
...
source None ['taxon:229193']
 
根据feature位置信息取子序列:
SeqFeature 或 location object对象并没有直接包含任何序列,只是可根据储存的location (见第 4.3.2 节),从父序列中取得。例如:某一短基因位于负链5:18位置(注意,这里是python计数,从0开始),由于GenBank/EMBL文件以1开始计数,那么在GenBank/EMBL文件中表示为 complement(6..18) :
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
# 构建一个SeqFeature对象
 
你可以用切片从父序列截取5:18,然后取反向互补序列。如果是Biopython 1.59或以后版本,可使用如下方法:
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print feature_seq
AGCCTTTGCCGTC
 
不过在处理复合 features (joins)时,此法相当繁琐。此时可以使用 SeqFeature 对象的 extract 方法处理:
>>> feature_seq = example_feature.extract(example_parent) # example_parent是一个Seq对象,返回一个Seq对象。
>>> print feature_seq
AGCCTTTGCCGTC
 
SeqFeature 或 location对象的长度等同于所表示序列的长度。
>>> print example_feature.extract(example_parent)
AGCCTTTGCCGTC
>>> print len(example_feature.extract(example_parent))
13
>>> print len(example_feature)
13
>>> print len(example_feature.location)
13
 
4.4 CompoundLocation对象
CompoundLocation 对象用于描述复杂位置,如 GenBank/EMBL文件中的 ‘joins’ 位置。
详情参考:http://biopython.org/DIST/docs/api/Bio.SeqFeature.CompoundLocation-class.html
 
CompoundLocation 对象有以下属性:
.strand Overall strand of the compound location.
.start Start location - left most (minimum) value, regardless of strand.
.end End location - right most (maximum) value, regardless of strand.
.nofuzzy_start Start position (integer, approximated if fuzzy, read only) (OBSOLETE).
.nofuzzy_end End position (integer, approximated if fuzzy, read only) (OBSOLETE).
.ref Not present in CompoundLocation, dummy method for API compatibility.
.ref_db Not present in CompoundLocation, dummy method for API compatibility.
.operator 一般返回"join"
.parts 不同Location区域的信息
 
如何取CompoundLocation 对象中部分区域Location信息?例如:
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import FeatureLocation
>>> dna = Seq("nnnnnAGCATCCTGCTGTACnnnnnnnnGAGAMTGCCATGCCCCTGGAGTGAnnnnn")
>>> small = FeatureLocation(5, 20, strand=1)
>>> large = FeatureLocation(28, 52, strand=1)
>>> location = small + large
>>> print(small)
[5:20](+)
>>> print(large)
[28:52](+)
>>> print(location)
join{[5:20](+), [28:52](+)}
>>> for part in location.parts:
... print(len(part))
...
15
24
 
如何提取序列?
.extract(self, parent_sequence)
Extract the sequence from supplied parent sequence using the CompoundLocation object.
 
The parent_sequence can be a Seq like object or a string, and will generally return an object of the same type. The exception to this is a MutableSeq as the parent sequence will return a Seq object.
 
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> from Bio.SeqFeature import FeatureLocation, CompoundLocation
>>> seq = Seq("MKQHKAMIVALIVICITAVVAAL", generic_protein)
>>> fl1 = FeatureLocation(2, 8)
>>> fl2 = FeatureLocation(10, 15)
>>> fl3 = CompoundLocation([fl1,fl2])
>>> fl3.extract(seq)
Seq('QHKAMILIVIC', ProteinAlphabet())
 
 
4.5 参考文献(reference)
Biopython通过 Bio.SeqFeature.Reference 对象来储存相关的文献信息。
References属性储存了 期刊名(journal) 、 题名(title) 、 作者(authors) 等信息。此外还包括 medline_id 、 pubmed_id 以及 comment 。
文献对象都以列表储存在 SeqRecord 对象的 annotations 字典中。 字典的键为 “references”。
如:
>>> record = SeqIO.read("NC_005816.gbf", "genbank")
>>> record.annotations['references']
[Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...)]
>>> record.annotations['references'][0] # 取第一个reference信息
Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...)
>>> record.annotations['references'][0].title
'Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus'
>>> record.annotations['references'][0].authors
'Zhou,D., Tong,Z., Song,Y., Han,Y., Pei,D., Pang,X., Zhai,J., Li,M., Cui,B., Qi,Z., Jin,L., Dai,R., Du,Z., Wang,J., Guo,Z., Wang,J., Huang,P. and Yang,R.'
>>> record.annotations['references'][0].journal
'J. Bacteriol. 186 (15), 5147-5152 (2004)'
 
4.6 格式化方法
SeqRecord 类中的 format() 能将字符串转换成被 Bio.SeqIO 支持的格式,如FASTA:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
             +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
         +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                      +"SSAC", generic_protein),
                   id="gi|14150838|gb|AAK54648.1|AF376133_1",
                   description="chalcone synthase [Cucumis sativus]")
print(record.format("fasta"))
 
format 方法接收单个必选参数,小写字母字符串是 Bio.SeqIO 模块支持的输出格式。但是不适用于包含多条序列的文件格式 (如多序列比对格式)。
 
4.7 输出genbank文件
首先构建SeqRecord对象,为SeqRecord对象添加SeqFeature属性,然后利用SeqIO.write()输出genbank文件。
例子1:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

out_file = "your_file.gbf"
seq = Seq("GATCGATCGATCGATCGATC", IUPAC.ambiguous_dna)
rec = SeqRecord(seq, id="ID1", name="ID1", description="test gbk seq ID1")
qualifiers = {"source": "prediction", "score": 10.0, "other": ["Some", "annotations"],
              "ID": "gene1"}
sub_qualifiers = {"source": "prediction"}
top_feature = SeqFeature(FeatureLocation(0, 20), type="gene", strand=1,
                         qualifiers=qualifiers)
sub_features = SeqFeature(FeatureLocation(0, 5), type="exon", strand=1,qualifiers=sub_qualifiers)
sub_features2 = SeqFeature(FeatureLocation(15, 20), type="exon", strand=1,qualifiers=sub_qualifiers)
rec.features = [top_feature, sub_features, sub_features2]
with open(out_file, "w") as out_handle:
    SeqIO.write(rec, out_handle, 'genbank')
 
更多详情见:http://biopython-cn.readthedocs.io/zh_CN/latest/cn/chr04.html
http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html
 
五、解析GFF文件
目前,Biopython没有整合解析GFF文件的功能,可以使用 bcbio-gff 和 gffutils 模块。见:
bcbio-gff :https://github.com/chapmanb/bcbb/tree/master/gff
gffutils https://github.com/daler/gffutils
 
参考:
http://biopython.org/wiki/GFF_Parsing
http://biopython.org/wiki/Gene_predictions_to_protein_sequences
 
1. 安装模块
安装 gffutils:
pip install gffutils
 
安装bcbio-gff:
pip3 install bcbio-gff
导入模块:
from BCBio.GFF import GFFExaminer
 
2. 将其他文件格式转为gff
例如,将genbank文件转为gff文件,
>>> from Bio import SeqIO
>>> from BCBio import GFF
>>> in_file = "Escherichia_coli_M1.gbf"
>>> out_file = "test.gff"
>>> in_handle = open(in_file)
>>> out_handle = open(out_file, "w")
>>> GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)
 
注意了,结果中会将genbank文件中的complete header info 取出作为一行,同时FEATURES中的source info 也被取出作为一行,然而我们往往要忽略这2个信息。所以,代码可以改为这样:
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "Escherichia_coli_M1.gbf"
    out_file = "test.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    records = []
    for record in SeqIO.parse(in_handle, "genbank"):
        # delete annotations
        record.annotations = {}
        # loop through features to find the source
        for i in range(0,len(record.features)):
            # if found, delete it and stop (only expect one source)
            if(record.features[i].type == "source"):
                record.features.pop(i)
                break
        records.append(record)

    GFF.write(records, out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()
 
参考了:https://stackoverflow.com/questions/20190209/incorrect-parse-with-bcbios-gff-parser
 
 
六、序列格式之间的转换
Dealing with GenBank files in Biopython (warwick.ac.uk)
 
根据ID或者根据序列去除重复项:http://www.mamicode.com/info-detail-1596125.html
 
 
七、解析比对结果文件
1、提取BSR(BLAST score ratios)值
使用模块:blast-score-ratio :https://pypi.python.org/pypi/blast-score-ratio/1.0.6
备注,
Requires NCBI BLAST+ command line >=2.2.29 application available here:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
BLAST binaries must be in your computer's path.
Also requires biopython and matplotlib packages.
 
示例:
import blast_score_ratio.bsr as bsr
bsr.test()
 

 

 

推荐阅读