python - 在 Python 中连接两个 fasta 文件
问题描述
我有两个数据文件(FASTA),每个文件代表一个基因,序列由物种和本地识别。我想将这些文件连接成一个作为示例:
psbki.fas:
>E_oleracea_Docas_de_Belm
AACCT
ycf1b.fas:
>E_oleracea_Docas_de_B
GGTTC
output:
>E_oleracea_Docas_de_Belm
AACCTGGTTC
如果您查看两个文件中的物种名称,它们的编写存在一些语法问题,因此彼此不同。另外,我还有另一个问题:一些物种不在两个文件中。
为了解决这些问题,我编写了以下代码:
ids, sequences = parse_fasta(open('psbki.fas', 'r').read().split('\n'))
ids2, sequences2 = parse_fasta(open('ycf1b.fas', 'r').read().split('\n'))
for i, j, z, h in zip(ids, sequences, sequences2, ids2):
if i != h:
print(">"+i + "\n"+j)
else:
print(">"+i + "\n"+j+z)
前两个序列的输出正常。但对于其他序列,代码只打印一个文件中的文件,但它们在两个文件中。我的代码有什么问题?我是python的初学者
输出是:
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAATTGGTATTATTCTCATTATCAGCATAAATTATCACACGTCTGGCTCTTCTTGAACGAATTTCAATATCTTCTATCGGTTTTTCCTCATTTTCTTCCTCCTGTTCTTCCAGAAGATTGGTCAATTTATATGACCATCGAGAAACCTTTTTACTGATTTCTTCTATTCCAATAGATTCATTTCTAGTTGTTTTATCATTTGGATCAATTGTCATTATATCGAATACAAATTTCAAAGATTTTGCTTGACTTTCTGAATCCATTTTTCTTTGTTCTGCCAATAAAGAACAGTTTTTCAAACAAAAATTGGGTGTGAATTCAAAAGAAAATGAAGTTAAGGAATTACCGATATAATTCAAAAATGATTTACCACCACCAAGTGAATTCTTTTGATGTTCAAATTCTCTGAAATTATTAGGAAGTAGCTCATGGATCTTATTTATCCAAAGACTTTTTATGGAATCCTCCATATAAGGGAAAAAATCATTTATGATTGTACGTAAATCAAAATCTTTTATTGCTCCACGGCATGGTCCGCTCAATAAAGGATCATATGTTTTGGTCAAGCATTTTTGTTTATTCTCATGATTGCAAAATCTAGTCTTTTTTTCGAGCATATCTAGAGCAAGAAATCCCTTTTCTTTTTTTTCTTTTTCTAGAGCTTTTATTCGACTTATTAATTCATTGCTCAAGTTGTATTTTTTTTGTTCATTGGTAAAAACCCAAAAATTATACAGGTCTCCATGGGATAATTTTTT-GTCGTGTACAAAAACATTTTTCGTTCTATCATTTCC
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAATTGGTATTATTCTCATTATCAGCATAAATTATCACACGTCTGGCTCTTCTTGAACGAATTTCAATATCTTCTATCGGTTTTTCCTCAATTTCTTCCTCCTGTTCTTCCAGAAGATTGGTCAATTTATATGACCATCGAGAAACCTTTTTACTGATTTCTTCTATTCCAATAGATTCATTTCTAGTTGTTTTATCATTTGGATCAATTGTCATTATATCGAATACAAATTTCAAAGATTTTGCTTGACTTTCTGAATCCATTTTTCTTTGTTCTGCCAATAAAGAACAGTTTTTCAAACAAAAATTGGGTGTGAATTCAAAAGAAAATGAAGTTAAGGAATTACCGATATAATTCAAAAATGATTTACCACCACCAAGTGAATTCTTTTGATGTTCAAATTCTCTGAAATTATTAGGAAGTAGCTCATGGATCTTATTTATCCAAAGACTTTTTATGGAATCCTCCATATAAGGGAAAAAATCATTTATGATTGTACGTAAATCAAAATCTTTTATTGCTCCACGGCATGGTCCGCTCAATAAAGGATCATATGTTTTGGTCAAGCATTTTTGTTTATTCTCATGATTGCAAAATCTAGTCTTTTTTTCGAGCATATCTAGAGCAAGAAATCCCTTTTCTTTTTTTTCTTTTTCTAGAGCTTTTATTCGACTTATTAATTCATTGCTCAAGTTGTATTTTTTTTGTTCATTGGTAAAAACCCAAAAATTATACAGGTCTCCATGGGATAATTTTTTTGTCGTGTACAAAAACATTTTTCGTTCTATCATTTCC
>E_edulis_F7
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
>E_edulis_R10
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_R11
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGKGTATGTGGTAAAGTAAAAAATAASTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_R12
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGWGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAA----ATCTTG
>E_edulis_IFES
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGARCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
>E_oleracea_Ilha_do_combu_1
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_5
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Ilha_do_combu_10
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Mangal_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Mangal_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Docas_de_Belm
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Utinga
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_1
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_2
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_Canto_de_Roa_3
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
>E_oleracea_IFES
AAATCGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAGATCTTATCTTG
换句话说,我想连接两个文件中的基因,而不是打印,也不连接只出现在一个文件中的物种。我不知道如何解决这个物种写得很少有错误的问题。
编辑 1:我已经更改了代码,使用 Levenshtein 比率来解决某些物种名称中的书写错误,但输出是相同的。
新代码是:
import Levenshtein as lev
Str1 = str(ids)
Str2 = str(ids2)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
for i, j, z, h in zip(ids, sequences, sequences2, ids2):
if lev.ratio(i,h) > 0.70 and i in h:
print(">"+i + "\n"+j+z)
else:
print(">"+i + "\n"+j)
编辑 2
Input File1: gene 1
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_F7
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCTTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTT-GGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAA-G----ATCTTG
Input File 2: gene 2
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_ed_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
My desired output:
>E_edulis_I1
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
>E_edulis_I2
AAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTGAAATAGAAATTCTTGTATATTGAATAACCGCGGCGATGAATTTTGATCAACTTATTTCCTCGTTCTGACCTTACAGTGAGCAAAGACTTTATTAGGTTGCCTACAATACCTAATTATTCATATGACAAGAAATTTTTGATAACGAAGGAATCAAAATCTTATTCCAAAGAAATTCGTGAAAATGACTTTCTTTTCAAAAAACACTTCATTTTTTTTGGGGGTGTCATGTCAAAACAAAATAGTGTATGTGGTAAAGTAAAAAATAAGTAACCTATTCCCTTTTTCAAAAAAAAAAG----ATCTTG
PS 在第二个文件中,我有相同的物种 E_edulis_I2,但名称不完整 -> E_ed_I2。我希望脚本能够识别并将序列与第一个序列连接起来(文件 1 = E_edulis_I2)。另一个问题是 E_edulis_F7 物种只出现在文件 1 中,所以我不希望在我的输出中出现这种物种。
解决方案
这比我预期的要复杂一些。问题是,例如,比is"E_edulis_I1"
更接近匹配。我认为最好的解决方案是比较两个文件之间的每一对名称,并确定它们是否相似到足以被称为匹配。然后,一旦你有了这组匹配的名称及其相似度,你就可以按照最相似到最不相似的顺序浏览它,并随时将结果添加到组合的 FASTA 文件中。因此,由于两个文件之间存在完全匹配,因此将首先将其放入组合的 FASTA 中。然后当我们找到匹配的名字and时,我们会看到我们已经使用了,所以这对不能匹配。"E_edulis_I2"
"E_ed_I"
"E_edulis_I1"
"E_edulis_I1"
"E_edulis_I2"
"E_edulis_I1"
这对我来说似乎仍然有点脆弱。您只需要注意您的相似度函数和相似度阈值是什么。您可能会添加的一件事是在匹配完成时打印出名称但相似度不为 1。这样,您可以快速扫描这些名称(希望没有太多)并确定是否有任何名称匹配到没有。
无论如何,这是代码。它至少适用于您提供的示例。
from pathlib import Path
from typing import Dict
import Levenshtein as lev
def fasta_to_dict(path: str) -> Dict[str, str]:
"""Read in a fasta file and return a dict mapping names to sequences."""
fasta = Path(path).read_text().split("\n")
return {fasta[i]: fasta[i + 1] for i in range(0, len(fasta) - 1, 2)}
# put in the paths to your fasta files here
fasta1 = fasta_to_dict("f1.fasta")
fasta2 = fasta_to_dict("f2.fasta")
def get_similarity(name1: str, name2: str) -> float:
"""Return a number describing how similar the names are."""
# this could be any string comparison function you want
return lev.ratio(name1, name2)
similarity_threshold = 0.7 # must be at least this similar to be called a match
def get_correct_name(name1: str, name2: str) -> str:
"""Determine which of 2 close-match names is the correct one."""
# this can also be whatever function makes sense in your application
# I'm just using the longer name as the "correct" one
return name1 if len(name1) > len(name2) else name2
possible_matches = []
for name1 in fasta1:
for name2 in fasta2:
similarity = get_similarity(name1, name2)
if similarity > similarity_threshold:
possible_matches.append((name1, name2, similarity))
# sort by most similar matches first
possible_matches = sorted(possible_matches, key=lambda match: -match[-1])
combined_fasta = {}
used_names1 = set()
used_names2 = set()
for name1, name2, _ in possible_matches:
if name1 in used_names1 or name2 in used_names2:
continue
correct_name = get_correct_name(name1, name2)
combined_fasta[correct_name] = fasta1[name1] + fasta2[name2]
used_names1.add(name1)
used_names2.add(name2)
# put the path to your output fasta file here
with open("combined.fasta", "w") as f:
for name, seq in combined_fasta.items():
f.write(name + "\n")
f.write(seq + "\n")
推荐阅读
- r - 如何在 Rtsne 图中对我的样本进行颜色编码
- c# - 设置数据上下文时如何使用用户控件依赖属性?
- java - Java 8 Streams:为什么 mapToInt 需要 Integer::parseInt 作为参数?
- java - 此代码中的继承错误是什么?
- python - Python-Django [SSL: WRONG_VERSION_NUMBER] 错误
- ios - 删除 TableView 中的单个分隔符
- design-patterns - 微服务 CQRS 分离构建(写入)查询模型和读取模型
- javascript - 如何在javascript中展平数组?
- ruby - 如何始终将类方法附加到链式方法调用
- javascript - mocha 似乎无法识别文件系统