Wednesday, June 24, 2015

Road to suffix tree (06)

The algorithm in the previous post has a fatal flaw - what is that? It is wrong! Let's look at the generated 'compressed' trie now!

It is now building a trie - not a compressed trie. Can you see why? The compressed trie will always generate a node at the end of the suffix, so if 'a' is a string inserted in the compressed trie. 'a' will always lead to a node. If we wanted the compressed trie as in the baseline suffix tree algorithm. We need to make sure we do NOT create those immediate nodes with only one child. 

To do that, let examine more closely in the string insertion order. There is a property we can leverage in the insertion order here. Suppose, we are trying to insert "xac" to the tree. We can be sure that the string "xa" is already installed in the tree. 

So the insertion procedure can be split into two steps:
Finding where is "xa", and then install the "c" in the tree.

Depending on input, the step 1 search can end up somewhere in the tree. If we are lucky, it ends up in a node, but it could be in the middle of an edge as well, but conceptually it is just a location in the tree. It has three cases

Case 1) The location is a leaf node
Case 2) The location is not a leaf, and it extends to the next character we want.
Case 3) The location is not a leaf, and it does not extend to the next character we want.

Case 1 is simple - our current code in compressed trie create a new edge and new node in this case, we need to avoid that. Just add one more character on the edge label and we are done. This is what the book calls rule 1. 

Case 2 is also simple - do nothing at all! The tree already have the string we need, why bother doing anything at all? The book call this rule 3. (Notice we call this case 2, but book say rule 3)

Case 3 is the complicated beast. At this point we must branch out. The book call this rule 2. I personally like to put the simple case up front, this usually make the code easier to read.

The mismatch of the case number and the rule number is obviously annoying. In fact, I hate the rule numbering thing in the book. So I conceptually replaced those rule numbers by rule name, it made the exposition much more easy to follow:

Case 1, rule 1 is the leaf extension rule.
Case 2, rule 3 is the no-op rule.
Case 3, rule 2 is the split rule.

So hereafter I will always refer to the rules by name - never by numbers.

Too much concepts in this post, we will do the coding in the next one.

No comments :

Post a Comment