Trimming a fasta file with BioPython

Problem description

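I have a fasta file containing multiple sequences. Some of the sequences end with '-' characters, and I would like to trim those from the final sequences. Is there a clean way to trim them and write a new fasta file without the dashes using Biopython?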

I saw the post "How to remove all-N sequence entries from fasta file(s)" and tried to adapt some of its code, but it didn't work...

The file contains sequences like this:

sequence_of_interest CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAGACAT ---------------

from Bio import SeqIO

def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    # keeps every record that has at least one non-'-' character; nothing is trimmed
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")

All of the sequences should end up trimmed like this:

sequence_of_interest CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAGA

Any help would be greatly appreciated!

Tags: python-3.x, trim, biopython, fasta

Solution


All of the options that use sed are great because they are faster, but here is a way to do it in BioPython.

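The idea is to call rstrip on the seq attribute of each record. rstrip works on a sequence just as it does on any other string in Python, so rstrip('-') removes only the dashes at the end.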
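As a quick illustration of what rstrip does, here is a minimal sketch on a plain Python string (no Biopython needed). The example string is made up for illustration; it shows that only the trailing dashes are removed, while leading and internal ones are left alone.

```python
# Hypothetical aligned sequence with leading, internal and trailing gaps
aligned = "--AC-GT----"

# rstrip('-') strips '-' characters from the right-hand end only
trimmed = aligned.rstrip("-")

print(trimmed)  # --AC-GT
```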

from Bio import SeqIO
import io

seq = """>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCAT
GTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAA
TGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCA
CCAGGCCAGATGAGAGAA--------------------------------------------------------------"""

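handle = io.StringIO(seq)  # for a real file, use: handle = open('my_fasta.fa')
clean_records = []
for record in SeqIO.parse(handle, "fasta"):
    # strip trailing '-' characters; leading and internal ones are kept
    record.seq = record.seq.rstrip('-')
    clean_records.append(record)

# write the trimmed records to a new fasta file
with open('clean_fasta.fa', 'w') as out_handle:
    SeqIO.write(clean_records, out_handle, 'fasta')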
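For large files, the same idea can be sketched as a function that streams records instead of collecting them in a list. This is only one possible arrangement: dash_removal reuses the name from the question, trim_trailing_dashes is a hypothetical helper name, and the file names in the usage note are placeholders.

```python
from Bio import SeqIO


def trim_trailing_dashes(records):
    """Yield each record with trailing '-' removed from its sequence."""
    for record in records:
        record.seq = record.seq.rstrip("-")
        yield record


def dash_removal(file_in, file_out):
    """Trim trailing dashes from every record in file_in and write file_out.

    Returns the number of records written, as reported by SeqIO.write.
    """
    records = SeqIO.parse(file_in, "fasta")
    return SeqIO.write(trim_trailing_dashes(records), file_out, "fasta")
```

It could then be called as dash_removal("dash_removal_test.fasta", "clean_fasta.fa"). Note that a record made up entirely of dashes would be left with an empty sequence, so it may be worth filtering those out separately if they can occur.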
