Search for the k-mers of one file among the k-mers of another file and count the occurrences in Python

Problem description

I have this function, which generates all possible k-mers over the four bases in Python:

import itertools

def generate_kmers(k):
    # task (a) only asks for a function that generates k-mers over the four bases
    bases = ['A', 'C', 'T', 'G']
    # itertools.product returns the Cartesian product of the input iterables; here it
    # yields every length-k combination of bases, and each one is joined into a string
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

Now I want to go through a list of sequences from a FASTQ file (e.g. ['GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'CCTGTAGAGTCATAAAGACCTCTTGGGTCCATCCTAGAAATTTTTCAGCTGAGAATAACGGGTCTGTTTCAGTTATTGCTTCTACTATNNNNNNNNNNNNNNNNNNNNNNNNNNN']), count the occurrences of every k-mer produced by generate_kmers in that list of sequences, and save the counts in a dictionary (e.g. {AAAA: 2, AAAC: 1, ...}). First I tried modifying generate_kmers so that it returns all k-mers of the sequence file, and then iterating over kmerSequences and kmerBases, but that did not work.

Does anyone have an idea how I could do this?

Tags: python, biopython, fastq

Solution


You could try this with str.count, which gives one dictionary per sequence:

import itertools

def generate_kmers(k):
    # task (a) only asks for a function that generates k-mers over the four bases
    bases = ['A', 'C', 'T', 'G']
    # itertools.product returns the Cartesian product of the input iterables; here it
    # yields every length-k combination of bases, and each one is joined into a string
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

seqs=['GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'CCTGTAGAGTCATAAAGACCTCTTGGGTCCATCCTAGAAATTTTTCAGCTGAGAATAACGGGTCTGTTTCAGTTATTGCTTCTACTATNNNNNNNNNNNNNNNNNNNNNNNNNNN']
k = 4
mers4 = generate_kmers(k)
# one dictionary per sequence; str.count counts non-overlapping occurrences only
dcts = [{kmer: seq.count(kmer) for kmer in mers4} for seq in seqs]
print(dcts)
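
If a single dictionary over all sequences is wanted, as in the {AAAA: 2, AAAC: 1, ...} example from the question, and overlapping occurrences should be counted, a sliding window is one option. The following is only a minimal sketch under those assumptions; the helper name count_kmers is hypothetical and not part of the original answer.

```python
import itertools

def count_kmers(seqs, k):
    """Count overlapping k-mers over A, C, G, T across all sequences (hypothetical helper)."""
    # start every possible k-mer at 0 so that absent k-mers still appear in the result
    counts = {''.join(p): 0 for p in itertools.product('ACGT', repeat=k)}
    for seq in seqs:
        # slide a window of width k over the sequence, one base at a time
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in counts:  # windows containing N are not valid k-mers and are skipped
                counts[window] += 1
    return counts

example = count_kmers(['AAAAA', 'AAAC'], 4)
print(example['AAAA'], example['AAAC'])  # prints: 2 1
```

Each sequence is scanned once, so the work grows with the total sequence length rather than with the number of possible k-mers (4**k) times the number of sequences.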

Edit:

import itertools
import re
def generate_kmers(k):
    # task (a) only asks for a function that generates k-mers over the four bases
    bases = ['A', 'C', 'T', 'G']
    # itertools.product returns the Cartesian product of the input iterables; here it
    # yields every length-k combination of bases, and each one is joined into a string
    kmer = [''.join(p) for p in itertools.product(bases, repeat=k)]
    return kmer

k=4
mers4= generate_kmers(k)

#given sequence
s='GTATACACTAGTCCAGGATGTGCTTCTTGTAGAAAAGTAAAACAATGGTTAAAAGATCACAATCTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

# function that returns a dictionary with the number of occurrences of each k-mer
def dct_count(seq):
    # search the seq argument, not the global s
    return {mer: len(re.findall(mer, seq)) for mer in mers4}

dc=dct_count(s)
print(dc)
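
Note that re.findall with a plain pattern returns non-overlapping matches, just like str.count. If overlapping occurrences should be counted, one possible variation is to wrap the k-mer in a zero-width lookahead. This is a small sketch, and count_overlapping is a hypothetical helper name rather than part of the answer above.

```python
import re

def count_overlapping(mer, seq):
    """Count occurrences of mer in seq, including overlapping ones (hypothetical helper)."""
    # (?=...) matches at every position where mer starts without consuming characters,
    # so the next match may begin one base later
    return len(re.findall('(?=' + re.escape(mer) + ')', seq))

seq = 'AAAAA'
print(seq.count('AAAA'), len(re.findall('AAAA', seq)), count_overlapping('AAAA', seq))
# prints: 1 1 2
```

Replacing re.findall(mer, seq) in dct_count with count_overlapping(mer, seq) would apply the same idea to every k-mer.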
