python - Edit a file in Python and create a new file
Problem description
I have a large text file ("|"-delimited), like this small example:
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|RNF14|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|HAS-0|HAS|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
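The first line of each record is the ID line, which starts with ">".
The second line is the character sequence that belongs to the ID line above it. Column 6 contains repeated names, and column 7 is the length of the line (character sequence) that follows the ID. For each repeated name, I want to select one ID line based on column 7, namely the one with the greatest length. The expected output for the small example is: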
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
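So, for each name that repeats in column 6, only the ID line with the greatest length in column 7 is kept.
I tried the following code in Python, but it does not work. Do you know how to fix it?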
from __future__ import print_function
import sys

def parse_fasta(data):
    name, seq = None, []
    for line in data:
        line = line.rstrip()
        if line.startswith('>'):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

isoforms = dict()
for defline, sequence in parse_fasta(sys.stdin):
    geneid = '.'.join(defline[1:].split('.')[:-1])
    if geneid in isoforms:
        otherdefline, othersequence = isoforms[geneid]
        if len(sequence) > len(othersequence):
            isoforms[geneid] = (defline, sequence)
    else:
        isoforms[geneid] = (defline, sequence)

for defline, sequence in isoforms.values():
    print(defline, sequence, sep='\n')
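Solution
I suggest you use Biopython for this instead of building your own parser. Note that I also added a sanity check (in your case, the sequence under the FASTA header line ending in 474 actually has a length of just 388):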
from Bio import SeqIO

def yield_records():
    seen = set()
    for record in SeqIO.parse('in.fa', 'fasta'):
        header_seq_len = int(record.description.split('|')[-1])
        seq_len = len(record)
        if header_seq_len != seq_len:
            print('Warning: the seq length {} != that stated in the header {}'
                  .format(seq_len, header_seq_len))
        if header_seq_len not in seen:
            yield record
            seen.add(header_seq_len)

SeqIO.write(yield_records(), 'out.fa', 'fasta')
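The `seen` set in the code above is keyed on the length stated in the header. To keep the longest record per name in column 6, as the question describes, the records can instead be grouped on that column. The following is a minimal standard-library sketch, not part of the original answer. The helper name `longest_per_gene` and the shortened example sequences are hypothetical. It also assumes every header has at least seven "|"-separated fields and that column 7 is the length to compare; if the header value is not trusted, `len(seq)` could be compared instead.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file object."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

def longest_per_gene(records):
    """For each column-6 name, keep the record with the largest column 7."""
    best = {}
    for header, seq in records:
        fields = header[1:].split('|')           # drop the leading '>'
        gene, length = fields[5], int(fields[6])  # columns 6 and 7
        # On ties, the record seen first is kept.
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, header, seq)
    return [(header, seq) for _, header, seq in best.values()]

# Headers taken from the question; sequences are shortened placeholders.
example = io.StringIO(
    ">ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278\n"
    "MSSEDREAQ\n"
    ">ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|RNF14|132\n"
    "MSSED\n"
    ">ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|HAS-0|HAS|99\n"
    "MSS\n"
    ">ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474\n"
    "MSSEDREAQEDELL\n"
)

for header, seq in longest_per_gene(parse_fasta(example)):
    print(header, seq, sep='\n')
```

To run it on a real file, `example` could be replaced with an opened file such as `open('in.fa')`, and the `print` calls redirected to an output file.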