首页 > 解决方案 > 构建 k-mers 集合的重叠图

问题描述

我正在学习基因组测序中的图形算法课程,目前的任务涉及从一堆 k-mers 构建重叠图。正式的问题如下:

输入: k-mers 模式的集合。
输出:邻接表形式的重叠图。

我有一个工作算法可以处理除自循环以外的大多数情况。问题是我不知道何时/为什么应该添加自循环。对此的任何帮助将不胜感激。以下是一些说明我的问题和我当前解决方案的示例。谢谢!

示例 1
输入:

ACT
CTT
TTT

输出:

TTT->TTT
ACT->CTT
CTT->TTT

示例 2
输入:

中交
输出:

中交->中交

示例 3
输入:

CT
TG
TG
TC
TT
TC
输出:

TC->CT
CT->TC,TG,TT
TT->TC,TG,TT

/*
 * Creates an overlap graph from a list of k-mers
 *
 * \param patterns      a vector of string k-mers
 *
 * \return a string containing an adjacency list representation of the
 * overlap graph as described in the problem specification
 * 
 * Output Format. The overlap graph Overlap(Patterns), in the form of an
 * adjacency list. Specifically, each line of the output represents all
 * edges (u,v) leaving a given node u in the format “u -> v”, where u and v
 * are both k-mers. If a given node u has multiple edges leaving it (e.g. v and w),
 * the destination nodes are comma-separated in any order. For example, if
 * there exist nodes ACG, CGT, and CGG, the resulting line of the adjacency
 * list would be: ACG -> CGT,CGG
 */


#include <algorithm>
#include <iostream>
#include <map>
#include <vector>
#include <set>
#include <string>


std::string overlap_graph(const std::vector<std::string>& patterns) {
    // Overlap Graph of the k-mers collection patterns, in which a vertex exists for
    // each unique k-mer, and an edge (u,v) exists if the last k-1 characters of u are
    // equal to the first k-1 characters of v.

    if (patterns.size() == 1) return patterns[0] + "->" + patterns[0];
    // Build adjancency list O(n^2)
    std::map <std::string, std::vector<std::string>> adj;
    for (const auto& kmer_u : patterns) {
        // Add kmer as key if it doesn't exist in the map and find its neighbours.
        if (adj.find(kmer_u) == adj.end()) {
            adj[kmer_u] = {};
            for (const auto& kmer_v : patterns) {
                if ((kmer_u != kmer_v) &&
                    (kmer_u.substr(1, kmer_u.length() - 1) == kmer_v.substr(0, kmer_v.length() - 1)) && 
                    (std::find(adj[kmer_u].begin(), adj[kmer_u].end(), kmer_v) == adj[kmer_u].end())) {
                    adj[kmer_u].push_back(kmer_v);
                }
            }
        } // Otherwise continue.
    }

    // Convert to string O(n).
    std::string ans;
    for (auto const& key_value : adj) {
        std::string kmer_u{ key_value.first };
        std::vector<std::string> kmers_v{ key_value.second };
        if (!kmers_v.empty()) {
            ans.append(kmer_u + "->" + kmers_v[0]);
            for (std::size_t kmer_v{ 1 }; kmer_v < kmers_v.size(); kmer_v++) {
                ans.append(',' + kmers_v[kmer_v]);
            }
            ans.push_back('\n');
        }
    }
    return ans;
}


void tests() {
    /* Sample Dataset */
    std::vector<std::string> patterns{ {"AAG"}, {"AGA"}, {"ATT"}, {"CTA"}, {"CTC"}, {"GAT"},
                                       {"TAC"}, {"TCT"}, {"TCT"}, {"TTC"} };
    std::string answer{ "TTC->TCT\nCTA->TAC\nAAG->AGA\nCTC->TCT\nTCT->CTA,CTC\nAGA->GAT\nGAT->ATT\nATT->TTC\n" };
    std::cout << answer << std::endl;
    // std::cout << (overlap_graph(patterns) == answer) << std::endl;
    std::cout << overlap_graph(patterns) << std::endl;
}


void solution() {
    std::vector<std::string> patterns;

    std::string pattern;
    while (std::cin >> pattern) {
        patterns.push_back(pattern);
    }
    std::cout << overlap_graph(patterns) << std::endl;
}


int main() {
    //tests();
    solution();
}

标签: c++bioinformaticsgraph-theorygraph-algorithmgenome

解决方案


推荐阅读