Monday, June 22, 2015

Road to suffix tree (02)

In the last part, we mentioned about trie and its superiority above other data structures. There is some inefficiency associated with the data structure though - it is the memory consumption.

Notice that we have a node per character. That implies we need a pointer for each character in the tree, assuming a one byte character (or sometimes more), we need 4 bytes for the pointers in x86 and 8 bytes for pointers on amd64, this is a clear waste of space.

To save space (and for time complexity later on), we introduce compression. For example, the compressed trie of the same key-value pair will be illustrated as follow:

(Hello, 5)
(Happy, 5)
(Halloween, 9)

will be represented as a compressed trie like this

 ->a->(ppy, 3)
    ->(lloween, 9)

In general, each node could be a string instead of a single character. For each subtree, however, should always split with different characters, so the class definition look like this

class CompressedTrieNode
  map<char, CompressedTrieLink> children;

class CompressedTrieLink
  string linkLabel;
  CompressedTrieNode* child;

The introduction of CompressedTrieLink is just to make the intent clear. It is just pair<string, CompressedTrieNode*>. Note that the tree depth is reduced, and this can be used to our advantage later on in suffix trees.

See below for some sample code on how to work with a compressed trie

With the code, we can generate the visualization of the compressed trie as follow

Compared with the tree last time, it is significant memory saving.

No comments :

Post a Comment