首页 > 解决方案 > Find the length of the longest common substring in 'n' binary strings

问题描述

I am given n strings (n>=2 and n<=4) and each one is constructed using 2 letters only: a and b. In this set of strings I have to find the length of the longest common substring that is present in all the strings. A solution is guaranteed to exist. Let's see an example:

n=4
abbabaaaaabb
aaaababab
bbbbaaaab
aaaaaaabaaab

The result is 5 (because the longest common substring is "aaaab").

I don't have to print (or even know) the substring, I just need to print its length.

It is also given that the result cannot be greater than 60, even though the length of each string can be as high as 13 000.

What I tried is this: I found the smallest length of any string of the given strings and then I compared it with 60 and I chose the smallest value between the two as starting point. Then I started taking sequences of the first string, and the length of each sequence of the first string is len, where len takes values from starting point to 1. At each iteration I take all possible sequences of the first string of length len and I use it as pattern. Using the KMP algorithm (thus, complexity of O(n+m)), I iterated through all the other strings (from 2 to n) and checked if pattern is found in string i. Whenever it isn't found, I break the iteration and try the next sequence available of length len or, if there isn't any, I decrease len and try all sequences that have as length the new, decreased value len. But if it matches, I stop the program and print the length len, since we started from the longest possible length, decreasing at each step, it is logical that the first match that we find represents the largest possible length. Here is the code (but it doesn't really matter since this method is not good enough; I know I shouldn't use using namespace std but it doesn't really affect this program so I just didn't bother):

#include <iostream>
#include <string>
#define nmax 50001
#define result_max 60

using namespace std;

int n,m,lps[nmax],starting_point,len;
string a[nmax],pattern,str;

void create_lps() {
    lps[0]=0;
    unsigned int len=0,i=1;
    while (i < pattern.length()) {
        if (pattern[i] == pattern[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else {
            if (len != 0) {
                len = lps[len-1];
            }
            else {
                lps[i] = 0;
                i++;
            }
        }
    }
}

bool kmp_MatchOrNot(int index) {
    unsigned int i=0,j=0;
    while (i < a[index].length()) {
        if (pattern[j] == a[index][i]) {
            j++;
            i++;
        }
        if (j == pattern.length()) {
            return true;
        }
        else if (i<a[index].length() && pattern[j]!=a[index][i]){
            if (j != 0) {
                j = lps[j-1];
            }
            else {
                i++;
            }
        }
    }
    return false;
}

int main()
{
    int i,left,n;
    unsigned int minim = nmax;
    bool solution;
    cin>>n;
    for (i=1;i<=n;i++) {
        cin>>a[i];
        if (a[i].length() < minim) {
            minim = a[i].length();
        }
    }

    if (minim < result_max) starting_point = minim;
    else starting_point = result_max;

    for (len=starting_point; len>=1; len--) {
        for (left=0; (unsigned)left<=a[1].length()-len; left++) {
            pattern = a[1].substr(left,len);
            solution = true;
            for (i=2;i<=n;i++) {
                if (pattern.length() > a[i].length()) {
                    solution = false;
                    break;
                }
                else {
                    create_lps();
                    if (kmp_MatchOrNot(i) == false) {
                        solution = false;
                        break;
                    }
                }
            }
            if (solution == true) {
                cout<<len;
                return 0;
            }
        }
    }
    return 0;
}

The thing is this: the program works correctly and it gives the right results, but when I sent the code on the website, it gave a 'Time limit exceeded' error, so I only got half the points.

This leads me to believe that, in order to solve the problem in a better time complexity, I have to take advantage of the fact that the letters of the string can only be a or b, since it looks like a pretty big thing that I didn't use, but I don't have any idea as to how exactly could I use this information. I would appreciate any help.

标签: c++stringalgorithmoptimizationlongest-substring

解决方案


The answer is to build the suffix trees of all of the strings individually, then intersect them. A suffix tree is like a trie that contains all suffixes of one string simultaneously.

Building a suffix tree for a fixed alphabet is O(n) with Ukkonen's algorithm. (If you don't like that explanation, you can use google to find others.) If you have m trees of size n, this is time O(nm).

Intersecting suffix trees is a question of traversing them in parallel, only going further when you can go further in all trees. If you have m trees of size n, this operation can be done in time no more than O(nm).

The overall time of this algorithm is time O(nm). Given that just reading the strings is of time O(nm), you can't do better than that.


Adding a small amount of detail, suppose that your suffix tree is written as one character per node. So each node is just a dictionary whose keys are characters and whose values are the rest of the tree. So to us your example, for the string ABABA the diagram at https://imgur.com/a/tnVlSI1 would turn into a data structure something like (see below) this one:

{
    'A': {
        'B': {
            '': None,
            'A': {
                'B': {
                    '': None
                }
            }
        }
    },
    'B': {
        '': None
        'A': {
            'B': {
                '': None
            }
        }
    }
}

And likewise BABA would turn into:

{
    'A': {
        '': None
        'B': {
            'A': {
                '': None
            }
        }
    },
    'B': {
        'A': {
            '': None,
            'B': {
                'A': {
                    '': None
                }
            }
        }
    }
}

With data structures that look like this, naive Python to compare them looks like:

def tree_intersection_depth (trees):
    best_depth = 0
    for (char, deeper) in trees[0].items():
        if deeper is None:
            continue
        failed = False

        deepers = [deeper]
        for tree in trees[1:]:
            if char in tree:
                deepers.append(tree[char])
            else:
                failed = True
                break

        if failed:
            continue

        depth = 1 + tree_intersection_depth(deepers)
        if best_depth < depth:
            best_depth = depth

    return best_depth

And you would call it like tree_intersection_depth([tree1, tree2, tree3, ...]).

With the above two trees it does indeed give 3 as the answer.

Now I actually cheated in writing out that data structure. What makes suffix trees efficient is that you DON'T actually have a data structure that looks like that. You have one that reuses all of the repeated structures. So the code to simulate setting up the data structures and calling it looks like this:

b_ = {'B': {'': None}}
ab_ = {'': None, 'A': b_}
bab_ = {'B': ab_}
abab = {'A': bab_, 'B': ab_}

a_ = {'A': {'': None}}
ba_ = {'': None, 'B': a_}
aba_ = {'A': ba_}
baba = {'B': aba_, 'A': ba_}

print(tree_intersection_depth([abab, baba]))

And now we can see that to get the promised performance, there is a missing step. The problem is that while the size of the tree is O(n), while searching it we will potentially visit O(n^2) substrings. In your case you don't have to worry about that, because the substrings are guaranteed to never go to depth more than 60. But in the fully general case you would need to add memoization so that when recursion results in comparing data structures that you have seen before, you immediately return the old answer, and not the new one. (In Python you would use the id() method to compare the address of the object to ones you've seen before. In C++, have a set of pointer tuples for the same purpose.)


推荐阅读