首页 > 解决方案 > 在 Python 中连接两个 fasta 文件

问题描述

我有两个数据文件(FASTA),每个文件代表一个基因,序列由物种和本地识别。我想将这些文件连接成一个作为示例:

psbki.fas:
>E_oleracea_Docas_de_Belm 
AACCT
ycf1b.fas: 
>E_oleracea_Docas_de_B 
GGTTC

output:
>E_oleracea_Docas_de_Belm 
AACCTGGTTC

如果您查看两个文件中的物种名称,它们的编写存在一些语法问题,因此彼此不同。另外,我还有另一个问题:一些物种不在两个文件中。

为了解决这些问题,我编写了以下代码:

ids, sequences = parse_fasta(open('psbki.fas', 'r').read().split('\n'))
ids2, sequences2 = parse_fasta(open('ycf1b.fas', 'r').read().split('\n'))

for i, j, z, h in zip(ids, sequences, sequences2, ids2):
    if i != h:
        print(">"+i + "\n"+j)
    else:
        print(">"+i + "\n"+j+z)

前两个序列的输出正常。但对于其他序列,代码只打印一个文件中的文件,但它们在两个文件中。我的代码有什么问题?我是python的初学者

输出是:

>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAATTGGTATTATTCTCATTATCAGCATAAATTATCACACGTCTGGCTCTTCTTGAACGAATTTCAATATCTTCTATCGGTTTTTCCTCATTTTCTTCCTCCTGTTCTTCCAGAAGATTGGTCAATTTATATGACCATCGAGAAACCTTTTTACTGATTTCTTCTATTCCAATAGATTCATTTCTAGTTGTTTTATCATTTGGATCAATTGTCATTATATCGAATACAAATTTCAAAGATTTTGCTTGACTTTCTGAATCCATTTTTCTTTGTTCTGCCAATAAAGAACAGTTTTTCAAACAAAAATTGGGTGTGAATTCAAAAGAAAATGAAGTTAAGGAATTACCGATATAATTCAAAAATGATTTACCACCACCAAGTGAATTCTTTTGATGTTCAAATTCTCTGAAATTATTAGGAAGTAGCTCATGGATCTTATTTATCCAAAGACTTTTTATGGAATCCTCCATATAAGGGAAAAAATCATTTATGATTGTACGTAAATCAAAATCTTTTATTGCTCCACGGCATGGTCCGCTCAATAAAGGATCATATGTTTTGGTCAAGCATTTTTGTTTATTCTCATGATTGCAAAATCTAGTCTTTTTTTCGAGCATATCTAGAGCAAGAAATCCCTTTTCTTTTTTTTCTTTTTCTAGAGCTTTTATTCGACTTATTAATTCATTGCTCAAGTTGTATTTTTTTTGTTCATTGGTAAAAACCCAAAAATTATACAGGTCTCCATGGGATAATTTTTT-GTCGTGTACAAAAACATTTTTCGTTCTATCATTTCC
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAATTGGTATTATTCTCATTATCAGCATAAATTATCACACGTCTGGCTCTTCTTGAACGAATTTCAATATCTTCTATCGGTTTTTCCTCAATTTCTTCCTCCTGTTCTTCCAGAAGATTGGTCAATTTATATGACCATCGAGAAACCTTTTTACTGATTTCTTCTATTCCAATAGATTCATTTCTAGTTGTTTTATCATTTGGATCAATTGTCATTATATCGAATACAAATTTCAAAGATTTTGCTTGACTTTCTGAATCCATTTTTCTTTGTTCTGCCAATAAAGAACAGTTTTTCAAACAAAAATTGGGTGTGAATTCAAAAGAAAATGAAGTTAAGGAATTACCGATATAATTCAAAAATGATTTACCACCACCAAGTGAATTCTTTTGATGTTCAAATTCTCTGAAATTATTAGGAAGTAGCTCATGGATCTTATTTATCCAAAGACTTTTTATGGAATCCTCCATATAAGGGAAAAAATCATTTATGATTGTACGTAAATCAAAATCTTTTATTGCTCCACGGCATGGTCCGCTCAATAAAGGATCATATGTTTTGGTCAAGCATTTTTGTTTATTCTCATGATTGCAAAATCTAGTCTTTTTTTCGAGCATATCTAGAGCAAGAAATCCCTTTTCTTTTTTTTCTTTTTCTAGAGCTTTTATTCGACTTATTAATTCATTGCTCAAGTTGTATTTTTTTTGTTCATTGGTAAAAACCCAAAAATTATACAGGTCTCCATGGGATAATTTTTTTGTCGTGTACAAAAACATTTTTCGTTCTATCATTTCC
>E_edulis_F7
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
>E_edulis_R10
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_R11
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGKGTATGTGGTAAAGTAAAAAATAASTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_R12
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGWGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAA----ATCTTG
>E_edulis_IFES
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGARCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
>E_oleracea_Ilha_do_combu_1
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_5
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_10
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Mangal_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Mangal_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Docas_de_Belm
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Utinga
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_1
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_IFES
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG

换句话说,我想连接两个文件中的基因,而不是打印,也不连接只出现在一个文件中的物种。我不知道如何解决这个物种写得很少有错误的问题。

编辑 1:我已经更改了代码,使用 Levenshtein 比率来解决某些物种名称中的书写错误,但输出是相同的。

新代码是:

import Levenshtein as lev
Str1 = str(ids)
Str2 = str(ids2)
Ratio = lev.ratio(Str1.lower(),Str2.lower())

for i, j, z, h in zip(ids, sequences, sequences2, ids2):
    if lev.ratio(i,h) > 0.70 and i in h:
         print(">"+i + "\n"+j+z)
    else:
        print(">"+i + "\n"+j)

编辑 2

    Input File1: gene 1
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_F7
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG

Input File 2: gene 2
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_ed_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG

My desired output:
 >E_edulis_I1
    AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
    >E_edulis_I2
    AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG

PS 在第二个文件中,我有相同的物种 E_edulis_I2,但名称不完整 -> E_ed_I2。我希望脚本能够识别并将序列与第一个序列连接起来(文件 1 = E_edulis_I2)。另一个问题是 E_edulis_F7 物种只出现在文件 1 中,所以我不希望在我的输出中出现这种物种。

标签: pythonfor-loopif-statementconcatenationfasta

解决方案


这比我预期的要复杂一些。问题是,例如,比is"E_edulis_I1"更接近匹配。我认为最好的解决方案是比较两个文件之间的每一对名称,并确定它们是否相似到足以被称为匹配。然后,一旦你有了这组匹配的名称及其相似度,你就可以按照最相似到最不相似的顺序浏览它,并随时将结果添加到组合的 FASTA 文件中。因此,由于两个文件之间存在完全匹配,因此将首先将其放入组合的 FASTA 中。然后当我们找到匹配的名字and时,我们会看到我们已经使用了,所以这对不能匹配。"E_edulis_I2""E_ed_I""E_edulis_I1""E_edulis_I1""E_edulis_I2""E_edulis_I1"

这对我来说似乎仍然有点脆弱。您只需要注意您的相似度函数和相似度阈值是什么。您可能会添加的一件事是在匹配完成时打印出名称但相似度不为 1。这样,您可以快速扫描这些名称(希望没有太多)并确定是否有任何名称匹配到没有。

无论如何,这是代码。它至少适用于您提供的示例。

from pathlib import Path
from typing import Dict
import Levenshtein as lev

def fasta_to_dict(path: str) -> Dict[str, str]:
    """Read in a fasta file and return a dict mapping names to sequences."""
    fasta = Path(path).read_text().split("\n")
    return {fasta[i]: fasta[i + 1] for i in range(0, len(fasta) - 1, 2)}

# put in the paths to your fasta files here
fasta1 = fasta_to_dict("f1.fasta")
fasta2 = fasta_to_dict("f2.fasta")

def get_similarity(name1: str, name2: str) -> float:
    """Return a number describing how similar the names are."""
    # this could be any string comparison function you want
    return lev.ratio(name1, name2)

similarity_threshold = 0.7 # must be at least this similar to be called a match

def get_correct_name(name1: str, name2: str) -> str:
    """Determine which of 2 close-match names is the correct one."""
    # this can also be whatever function makes sense in your application
    # I'm just using the longer name as the "correct" one
    return name1 if len(name1) > len(name2) else  name2

possible_matches = []
for name1 in fasta1:
    for name2 in fasta2:
        similarity = get_similarity(name1, name2)
        if similarity > similarity_threshold:
            possible_matches.append((name1, name2, similarity))

# sort by most similar matches first
possible_matches = sorted(possible_matches, key=lambda match: -match[-1])

combined_fasta = {}
used_names1 = set()
used_names2 = set()
for name1, name2, _ in possible_matches:
    if name1 in used_names1 or name2 in used_names2:
        continue
    correct_name = get_correct_name(name1, name2)
    combined_fasta[correct_name] = fasta1[name1] + fasta2[name2]
    used_names1.add(name1)
    used_names2.add(name2)

# put the path to your output fasta file here
with open("combined.fasta", "w") as f:
    for name, seq in combined_fasta.items():
        f.write(name + "\n")
        f.write(seq + "\n")

推荐阅读