首页 > 解决方案 > 朴素字符串匹配算法的性能

问题描述

我需要对两个不匹配的字符串进行匹配。我已经尝试过一些方法,比如朴素算法和正则表达式,但是我的代码非常慢,并且没有通过作业评分器。

你能建议我在我的代码中做什么,这让它变得非常慢吗?或建议另一种更快的解决方案?

示例:节点为“GATGATGATGCT”,Neighbor 为“GATGATGCTCTC”,min_overlap_len 为 4,Overlap_neighbor_string 为“GATG”(因为overlap_len 为4)

预期的答案是最长重叠开始的节点的索引,并且可以将组合字符串扩展到邻居以找到一个完整的字符串。因此,在此示例中,答案应为 3 以创建组合字符串“GAT GATG ATGCT CTC”

朴素算法:

def find_overlap_naive(node, neighbor):
    min_overlap_len = 4
    overlap_string = neighbor[0:min_overlap_len]
    error_limit = 2
    
    for start in range(len(node)-min_overlap_len): 
        error_counter = 0
        start_counter = start
  
        for overlap_index in range(min_overlap_len): # For each character in node checking if min_overlap exists
            if error_counter > error_limit:
                break
            if node[start_counter] is not overlap_string[overlap_index]:
                error_counter += 1
            start_counter += 1
            continue
        
        if error_counter <= error_limit: # Chcking if remaining of the node is extended to neighbor
            remaining_node_len = len(node)-start_counter
            for y in range(remaining_node_len): # Checking if remaining of the node is similar to the neighbor
                if error_counter > error_limit:
                    break
                neighbor_index = min_overlap_len + y
    
                if node[start_counter] is not neighbor[neighbor_index]:
                    error_counter += 1 
                start_counter += 1
                continue             
        
        # If error>2 after the matching go to next char of node         
        if error_counter > error_limit:
            continue
    
        overlap_start = start
        overlap_length = len(node) - overlap_start
        return (overlap_start, overlap_length)
    
    return (-1, 0) # return (-1, 0) if no match found

PS:给出的示例是一个玩具示例,真实字符串(节点和邻居)的长度为 100 个字符,并且对于每个具有 1617 个邻居的节点,该函数将被调用 1618 次,因此代码变得非常慢。

标签: python-3.xalgorithmperformancebioinformatics

解决方案


推荐阅读