python - 如何在不使用 Biopython 的情况下从 FASTA 文件中获取此输出?
问题描述
我需要从 FASTA 文件中获取下面显示的输出,但不使用 BioPython。有人有想法吗?
这是使用 BioPython 的代码:
from Bio import SeqIO
records = SeqIO.parse("data/assembledSeqs.fa", "fasta")
for i, seq_record in enumerate(records):
print("Sequence %d:" % i)
print("Number of A's: %d" % seq_record.seq.count("A"))
print("Number of C's: %d" % seq_record.seq.count("C"))
print("Number of G's: %d" % seq_record.seq.count("G"))
print("Number of T's: %d" % seq_record.seq.count("T"))
print()
FASTA 文件如下所示:
>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTG
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGC
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86
CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
CAAACCACTTTCACCGCCACACGACC
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120
TGAACCTACGACTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCG
我需要获得以下输出:
Sequence 0:
Number of A's: 14
Number of C's: 17
Number of G's: 24
Number of T's: 24
Sequence 1:
Number of A's: 17
Number of C's: 30
Number of G's: 16
Number of T's: 17
Sequence 2:
Number of A's: 27
Number of C's: 31
Number of G's: 12
Number of T's: 16
Sequence 3:
Number of A's: 31
Number of C's: 41
Number of G's: 20
Number of T's: 28
我已经尝试过了,但我无法获得相同的输出。
def count_bases (fasta_file_name):
with open(fasta_file_name) as file_content:
for seqs in file_content:
if seqs.startswith('>'):
for i, seq in enumerate('>'):
print("Sequence %d:" % i)
else:
print("Number of A's: %d" % seqs.count("A"))
print("Number of C's: %d" % seqs.count("C"))
print("Number of G's: %d" % seqs.count("G"))
print("Number of T's: %d" % seqs.count("T"))
print()
return bases
result = count_bases('data/assembledSeqs.fa')
解决方案
这似乎有效。它使用正则表达式re
模块来识别标题行并获得以下核苷酸序列的总长度。此值用于确定要读取并连接在一起的后续数据行数。
因为它实际上解析了序列头中的信息,所以它应该能够处理任何长度的序列。
我不知道每行的官方碱基总数是多少,但在示例 fasta 文件中似乎是 60,因此已将其硬编码到代码中。
import re
pattern = r""">.+?length:\s(\d+)"""
regex = re.compile(pattern)
MAX_PER_LINE = 60
def count_bases (fasta_file_name):
bases = None # Undefined.
with open(fasta_file_name) as inp:
i = 0 # Sequence counter.
line = next(inp, None) # Read first line.
while line:
match = regex.search(line)
if match:
length = int(match.group(1))
nlines = (length + MAX_PER_LINE-1) // MAX_PER_LINE
# Read and concatenate data from the required number of lines.
seqs = ''.join(next(inp).rstrip() for _ in range(nlines))
print("Sequence %d:" % i)
print("Number of A's: %d" % seqs.count("A"))
print("Number of C's: %d" % seqs.count("C"))
print("Number of G's: %d" % seqs.count("G"))
print("Number of T's: %d" % seqs.count("T"))
print()
i += 1
line = next(inp, None) # Read next line.
return bases
fasta_file = 'assembledSeqs.fa'
result = count_bases(fasta_file)
这是一个变体,它使用 a 一次collections.Counter
计算所有碱基的数量。这可能比count
对每个单独使用序列方法更快,因此如果您有大量要处理的数据,这很有用。
def count_bases (fasta_file_name):
bases = None # Undefined.
with open(fasta_file_name) as inp:
i = 0 # Sequence counter.
line = next(inp, None) # Read first line.
while line:
match = regex.search(line)
if match:
length = int(match.group(1))
nlines = (length + MAX_PER_LINE-1) // MAX_PER_LINE
# Read and concatenate data from the required number of lines.
seqs = ''.join(next(inp).rstrip() for _ in range(nlines))
print("Sequence %d:" % i)
counter = Counter(seqs) # Count the number of each base.
for base in 'ACGT':
if base in counter:
print("Number of {}'s: {}".format(base, counter[base]))
print()
i += 1
line = next(inp, None) # Read next line.
return bases
推荐阅读
- schema.org - Schema.org 的“reviewBody”的完整评论或摘录
- ios - AppStore 导航栏 iOS 13
- javascript - 搜索动态生成的多级对象数组
- sql - 连接两个表,其中一个包含另一个
- sql - 函数返回的触发器如何影响 BEFORE 或 AFTER 语句?
- python - Vscode - Python 调试器:无法识别的参数
- c++ - 如何从 Linux 上的 picocom 等串口读取数据?
- python - how to add location constraints for an object in blender?
- android - 如何在 Android Studio 中显示稳定版本(项目结构)
- javascript - 单击其中的按钮后如何淡出div?