Wednesday, June 24, 2015

Road to Suffix Tree (05)

In the last part we talked about the $ O(m^2) $ algorithm for building a suffix tree. This time, we will talk about a $ O(m^3) $ time solution.

Why would one wanted to do that? This is because we will improve the $ O(m^3) $ algorithm all the way to linear time. We are following a similar exposition as in Algorithms on String, Tree and Sequences.

The algorithm is really simple, here it is, this is the only change from SuffixTree1.

void SuffixTree2Builder::BuildSuffixTree(string input, SuffixTree2* suffixTree)
{
    string s = "xabxac";
    for (unsigned int end = 1; end <= s.length(); end++)
    {
        for (unsigned int start = 0; start < end; start++)
        {
            suffixTree->Add(s.substr(start, end - start));
        }
    }
}

If we ignore the outer for loop, this is simply the $ O(m^2) $ algorithm we talked in the last post. The algorithm install the 'suffixes' in the compressed trie in the following order

x

xa
a

xab
ab
b

...

And so on - many of the 'suffix' are actually prefix of a suffix of the whole string, but that does not matter for a suffix tree because a suffix tree include all the prefixes anyway.

The blank lines are intentionally added between outer loop iterations. Following the book's convention, we call an outer loop iteration a phase, and an inner loop iteration an extension. So for a string with length $ m $, we have $ m $ phases, from phase 1 to phase $ m $. For phase $ k $, we have $ k $ extensions, from extension 1 to extension $ k $.

Unlike the book's convention, however, we will ALWAYS use 0 based index for string, this is so convenient, and we will ALWAYS use inclusive, exclusive index for string indexes, for example, [3, 7) for "Hello World" is "lo W", as indicated below:

0123456789A
Hello World

This algorithm is really stupid? Yes - it is, we will move forward in the next post.

No comments :

Post a Comment