python - 如何在不使用 biopython 的情况下编写脚本来汇总多 fasta 文件中的信息?
问题描述
我有一个生物信息学类的任务,它要求一个 python 脚本对一个包含多个蛋白质序列的 FASTA 文件执行以下操作:
-打开用户输入指定的.fasta文件
-打印标题行(即以“>”开头的行)
- 打印前 10 个氨基酸并在同一行上,报告序列中的氨基酸数量
经过几个小时的尝试,我只设法为第一个序列打印了标题行和前 10 个氨基酸。我编写的 for 循环似乎并没有超出此范围(如果这是垃圾,请道歉,我是一个完整的初学者!)
input_file = open(input("\nPlease enter the name of a FASTA file (inlcuding the .fasta extension): "))
# Opens the file specified by the user
for line in input_file:
line = line.rstrip()
if line[0] == ">":
header = line
print("\nHeader:", header) # prints the header line
no_lines_searched = 0 # To stop loop after first line of sequence when printing first 10 AAs
for i in input_file:
if ">" not in i and no_lines_searched < 1:
print("First ten amino acid residues: ", i[0:10], "# Amino acids: ") # Prints the first 10 AAs
no_lines_searched = no_lines_searched+1 # To stop loop after first line of sequence
我试图变得聪明并设计第二个循环,使其返回序列的前 10 个氨基酸,然后停止,直到遇到另一个序列(用“>”表示)。
然后我打算使用占位符%s
以某种方式计算文件中每个序列的总序列长度,但似乎无法超越这一点!
我得到的输出如下:
Header: >sp|P03378|ENV_HV1A2 Envelope glycoprotein gp160 OS=Human immunodeficiency virus type 1 group M subtype B (isolate ARV2/SF2) GN=env PE=3 SV=1
First ten amino acid residues: MKVKGTRRNY # Amino acids:
非常感激任何的帮助!
解决方案
在今天大部分时间都花在这个问题上之后,我回答了我自己的问题。我以完全错误的方式去做这件事。这是我用来实现我的目标的代码:
#!/usr/bin/env python3
seq = "" # seq is a variable used to store each line of sequence data as the loop below iterates.
ten_AAs = '' # ten_AAs is a variable used to store the first ten amino acids of each sequence (it actually stores
# the first 10 AAs from each line of the sequence, but only the first 10 residues of each sequence are printed).
seq_count = 0 # seq_count is a variable used to track the total number of sequences in the file.
infile = input("\nPlease enter the name of a FASTA file (including the .fasta extension): ")
# Stores the file specified by the user in the variable, infile.
with open(infile, "r") as fasta: # Opens the above file such that we do not need to call filename.close() at the end.
print("\n")
for line in fasta: # Initiates a for loop to summarise the file: headers, first ten residues and sequence lengths.
line = line.rstrip() # Removes trailing newline characters.
if line[0] == ">": # Indicates the start of a new FASTA sequence and is a condition for printing summary data
if seq != "": # Second condition for printing summary data (see lines 28-30).
print("Header: {0}\nFirst ten amino acids: {1} \tSequence length: {2} residues\n"
.format(header, ten_AAs[0:10], len(seq))) # Placeholders for summary info.
seq_count += 1 # Increases seq_count by one for each header encountered.
header = line # Stores the header line for printing.
seq = "" # Resets seq for each iteration.
ten_AAs = '' # Resets ten_AAs for each iteration.
else:
seq += line[:] # Appends seq with the characters of the line if it is not a header (i.e. sequence data).
ten_AAs += line[0:10] # Appends ten_AAs with the first ten characters of the line if it is not a header.
# After reading all lines of one sequence, the loop encounters the next FASTA sequence, starting with a ">" character,
# but 'seq' is not blank. Therefore, both 'if' statements are now TRUE and all summary data for the first sequence
# (stored in 'header', 'ten_AAs' and 'seq') are printed as per lines 18-19.
print("Header: {0}\nFirst ten amino acids: {1} \tSequence length: {2} residues"
.format(header, ten_AAs[0:10], len(seq)))
# The final sequence is not followed by another accession number; therefore, the summary data for the final sequence
# must be printed manually, outside of the loop.
print("\nThis multi-FASTA file contains", str(seq_count), "sequences.")
# For completeness' sake, reports the number of sequences in the multi-FASTA file.
这是输出:
Please enter the name of a FASTA file (including the .fasta extension): multiprotein.fasta
Header: >1433G_HUMAN (P61981) 14-3-3 protein gamma (Protein kinase C inhibitor protein 1) (KCIP-1) [Homo sapiens]
First ten amino acids: VDREQLVQKA Sequence length: 246 residues
Header: >ATP8_RAT (P11608) ATP synthase protein 8 (EC 3.6.3.14) (ATPase subunit 8) (A6L) (Chargerin II) [Rattus norvegicus]
First ten amino acids: MPQLDTSTWF Sequence length: 67 residues
Header: >ALR_LISIN (Q92DC9) Alanine racemase (EC 5.1.1.1)
First ten amino acids: MVTGWHRPTW Sequence length: 368 residues
Header: >CDCA4_HUMAN (Q9BXL8) Cell division cycle-associated protein 4 (Hematopoietic progenitor protein) [Homo sapiens]
First ten amino acids: MFARGLKRKC Sequence length: 241 residues
This multi-FASTA file contains 4 sequences.
推荐阅读
- javascript - React-redux 不会在状态更改时重新渲染
- c# - 如何在 ASP.NET Core 视图中正确显示复选框表单控件?
- flutter - 如何使用 url_launcher 在 Flutter 应用程序中打开 URL?
- python - 正则表达式:删除点前长度为 1-3 的字母
- ios - 更新 ObservableObject 中的变量而不滞后
- c++ - 在 win32/winapi 中取消排序的 TreeView 列表
- xslt - 重新链接跳过元素的子元素的 parentID
- qt - cmake 上的 Windows 错误将 Opencv 与 Qt 链接
- python-3.x - Windows/Python 检查文件是否打开或正在使用
- teradata - 如何将表作为参数传递给存储过程