Friday, June 26, 2015

In the last post, we finally come back to a working solution that can build a suffix tree. Remember the cost of insertion is mostly about finding where to insert, and after that is done, the insertion alone take only constant time, so it make sense to optimize that search.

To that end, we introduce suffix link. This is the most complicated concept in the Ukkonen's algorithm.

Suppose we have an internal node 'v' (i.e. not root, not leaf) representing 'aX', where X is a string, possibly empty, and 'a' is a single character, then the suffix link of 'v' is a pointer to the node 'w' representing X.

Three questions immediately arise.
1.) Why should 'w' exist? In general the string 'X' might not be represented by a node, and
2.) Suppose it does, how do one efficiently find it?

We will answer the first two questions at the same time by a lemma.

If a new internal node is introduced in extension j. It's suffix link will be found in the next extension j+1.

Proof:
Notice that a new internal is introduced in the tree only by the split rule. When inserting 'aXyY', we notice the tree already contains 'aXzZ' where X, Y, Z are strings, possibly empty, a, y and z are single character, and $y \ne z$.

In the next extension, we will be inserting XyY. By the insertion order, we can be sure that XzZ is already there, so the tree must split at the point X, and therefore the suffix link will always be there and it will always be found in the next extension!

With that in mind, we can build the suffix link relatively easily. That result in SuffixTree4.

We will answer the last question in the next post.