Monday, June 22, 2015

Road to suffix tree (01)

To explain what a suffix tree is, I need to explain what is a trie and how to compress it. So let's start there.

Consider an abstract dictionary like this:

class Dictionary<TKey, TValue>
{
  Add(TKey key, TValue value);
  TValue Get(TKey key)
}

Don't worry about the syntax for now, it is just pseudo code.

To realize a dictionary, there are many different ways. One can build a simple association list, or a binary search tree, or a hash table.

Now we consider a special case where the keys are strings. It is very common for us to use string as key. String has a sub-structure, it is a sequence of characters. Is it possible for us to use that sub-structure to build a better data structure?

Yes - and that is a trie. Consider these string value pairs and the associated trie will explain how a trie works.

(Hello, 5)
(Happy, 5)
(Halloween, 9)

The tree looks like

H->e->l->l->(o,5)
 ->a->p->p->(y, 5)
    ->l->l->o->w->e->e->(n,9)

With the tree, it is easy to imagine how would one implement the Add and Get method out of it. In code, the tree look like this

class TrieNode
{
  map<char, TrieNode*> children;
  TValue* value; // non-null if this is a leaf.
}

The data structure allows a worst case bound for Add and Get to have time proportional to the length of the key. This cannot be done with association list, binary search tree or hash table. That shows to superiority of this data structure compared to them.

See here for some sample code for work with a Trie.

Using the code, we can generate a visualization of the trie with our example string as follow:


This is probably the largest tree we will see, in the next post we will compress it.

No comments :

Post a Comment