首页 > 解决方案 > 如何在序列中找到特定字母旁边的回文?

问题描述

我有一个脚本,它返回 DNA 序列中的回文子串。

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"
print(len(sequence))
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
for i in range(len(sequence) - 6 + 1):
    pal = True
    for j in range(2):
        if pairs[ sequence[i+j] ] != sequence[i+5-j]:
            pal = False
            break
    if pal:
        print(sequence[i : i+6])

它返回:

704
GATCTC
GAGCTC
GCAGGC
GTTCAC
GAGATC
TCAAGA
AAATTT
GACGTC
CAGTTG
TGGACA
AAGATT
CTTCAG
CCAAGG
CGACCG
TTGGAA
CTCGAG
TCTTGA
CTTGAG
TGAGCA
CGGGCG
ATAGAT
ACGGGT
TCCAGA
CTCGAG
TCGAGA
TGTTCA
GTTCAC
GGCACC
AGATCT
CACCTG
GCCTGC
GACTTC
CAGATG
AGAACT
TCAAGA
GAAGTC
TCAGGA
AGGACT
TCTGGA
TGTTCA
TTCAAA
TCAAGA
GAGGTC
AGGTCT
TAGATA
AGTTCT
AGTTCT

我想查找这些子字符串是否位于“[ATCG]CC”或“[ATCG]GG”旁边我想找到这些回文在序列中的位置(例如从第 i 个到 (i+5 )th 因为回文的长度为 6),然后检查第 (i+6) 到 (i+8) 个字母是 [ATCG]CC 还是 [ATCG]GG。你知道我怎么写这样的脚本吗?或者你有更好的逻辑?谢谢

标签: pythonsubstring

解决方案


只需添加一些额外的检查。

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"
print(len(sequence))
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
ans = []
for i in range(len(sequence) - 9 + 1):
    pal = True
    for j in range(2):
        if pairs[ sequence[i+j] ] != sequence[i+5-j]:
            pal = False
            break
    if not pal:
        continue

    if (sequence[i+7] == sequence[i+8]) and (sequence[i+7] in ('C', 'G')):
        print(sequence[i : i+9])
        ans.append(sequence[i : i+9])
    else:
        print(sequence[i : i+6] + " (X)")
print("Count of answer: %d" % len(ans))

输出:

704
GATCTC (X)
GAGCTC (X)
GCAGGC (X)
GTTCAC (X)
GAGATC (X)
TCAAGA (X)
AAATTT (X)
GACGTC (X)
CAGTTG (X)
TGGACA (X)
AAGATT (X)
CTTCAG (X)
CCAAGG (X)
CGACCG (X)
TTGGAA (X)
CTCGAG (X)
TCTTGA (X)
CTTGAG (X)
TGAGCA (X)
CGGGCG (X)
ATAGAT (X)
ACGGGT (X)
TCCAGA (X)
CTCGAG (X)
TCGAGA (X)
TGTTCA (X)
GTTCAC (X)
GGCACC (X)
AGATCT (X)
CACCTG (X)
GCCTGCCCC
GACTTC (X)
CAGATG (X)
AGAACT (X)
TCAAGA (X)
GAAGTCAGG
TCAGGA (X)
AGGACT (X)
TCTGGA (X)
TGTTCA (X)
TTCAAA (X)
TCAAGA (X)
GAGGTC (X)
AGGTCT (X)
TAGATA (X)
AGTTCT (X)
AGTTCT (X)
Count of answer: 2

推荐阅读