Programming Praxis – Ternary Search Tries

Today’s Programming Praxis problem is about Ternary search tries, which are basically hashmaps of strings to values, but which can be faster in some cases. We have to implement the data structure and provide functions to find, insert, update and delete items and a function to convert the trie to a list. In order to match the provided solution, we will test our code by putting all the words in the bible, as well as their frequencies, into a ternary trie. Our target is 52 lines (the size of the provided solution). Let’s go.

Our imports:

import Data.Char
import qualified Data.List.Key as K
import Prelude hiding (lookup)

The data structure has two small differences from the one in the provided solution: I use a Maybe instead of the bool+value approach, since it’s more idiomatic Haskell, and the split is a list instead of a single type. The reason for this is that it makes the ternary trie far more generic: the provided solution works only for strings, because in the insert function it uses the first ASCII character  as a default. By making the split a list, I can use an empty list as the default, which means I can also define a trie with e.g. lists of Int as the key. This would not be possible with a single item, since there is no way to get a default value for an arbitrary type.
Another thing you might notice are the strictness annotations on the three branches of a node. If you omit these (as I initially did) and try to cram the bible in a trie you’re going to go through stack space like a hot knife through butter, which means the program will be terribly slow. Note that in order to see the benefit you will have to compile the program: running it in ghci will still produce a stack overflow.

data TernaryTrie k v = Empty | Node { val :: Maybe v,
    split :: [k], lb :: !(TernaryTrie k v),
    eb :: !(TernaryTrie k v), gb :: !(TernaryTrie k v) }

For lookup we have two terminating cases (an empty node is reached or the key is found). The rest is just simple recursion.

lookup :: Ord k => [k] -> TernaryTrie k v -> Maybe v
lookup _      Empty = Nothing
lookup []     t     = val t
lookup (x:xs) t     = case compare [x] $ split t of
                           GT -> lookup (x:xs) $ gb t
                           LT -> lookup (x:xs) $ lb t
                           EQ -> lookup xs     $ eb t

Insert, update and delete all follow the same pattern. To get rid of the boring ‘take the correct action for each branch’ code I factored that out into the modify function to better comply with the DRY principle.

insert :: Ord k => [k] -> v -> TernaryTrie k v -> TernaryTrie k v
insert k  v Empty = insert k v $
                    Node Nothing (take 1 k) Empty Empty Empty
insert [] v t     = t { val = Just v }
insert k  v t     = modify (flip insert v) k t

update :: Ord k => [k] -> v -> (v -> v) ->
          TernaryTrie k v -> TernaryTrie k v
update k  v _ Empty = insert k v Empty
update [] v p t     = t { val = Just . maybe v p $ val t }
update k  v p t     = modify (\x -> update x v p) k t

delete :: Ord k => [k] -> TernaryTrie k v -> TernaryTrie k v
delete _  Empty = Empty
delete [] t     = t { val = Nothing }
delete k  t     = modify delete k t

modify :: Ord k => ([k] -> TernaryTrie k v -> TernaryTrie k v) ->
                   [k] -> TernaryTrie k v -> TernaryTrie k v
modify f k t = case compare (take 1 k) (split t) of
                    LT -> t { lb = f (drop 0 k) $ lb t }
                    EQ -> t { eb = f (drop 1 k) $ eb t }
                    GT -> t { gb = f (drop 0 k) $ gb t }

And more recursion for the enlist method.

enlist :: TernaryTrie k v -> [([k], v)]
enlist = enlist' [] where
    enlist' _ Empty = []
    enlist' k t     =
        maybe [] (\v -> [(k, v)]) (val t) ++ enlist' k (lb t) ++
        enlist' (k ++ split t) (eb t) ++ enlist' k (gb t)

And we test it by finding the 25 most common words in the bible.

main :: IO ()
main = print . take 25 . reverse . K.sort snd . enlist .
       foldl (\t k -> update k 1 succ t) Empty .
       map (map toLower . filter isAlpha) . words =<<
       readFile "bible.txt"

And there we go. With 35 lines it’s not a huge reduction (although the provided solution grows by about 15 lines if you reformat it to the 65-ish character limit I use on this blog), but that was to be expected since it’s mostly basic recursion, which doesn’t easily lend itself to alternative solutions. Still, it’ll do.

About these ads

Tags: , , , , , , ,

5 Responses to “Programming Praxis – Ternary Search Tries”

  1. Christophe Poucet Says:

    I take it that you’ve made ‘split’ a list of keys since you plan to eventually encode more than one element?

  2. Remco Niemeijer Says:

    No, the list is always supposed to have one element. Perhaps it would have been slightly better to make it a Maybe instead.

  3. wren ng thornton Says:

    have you tried taking a look at bytestring-trie? It’s restricted to using ByteStrings for keys rather than being generic, but it has a number of nice additional features: node fusion for shared prefixes, extremely generalized versions of the primary functions,…

  4. Remco Niemeijer Says:

    … and it’s a binary instead of a ternary trie :)
    I didn’t look at it because I always try to solve these exercises myself, but if I ever need to use one in practice, then yeah, I’ll use the existing library.

  5. wren ng thornton Says:

    Well, the above version isn’t really all that ternary. It’s only “ternary” on account of not doing node fusion. If two keys are distinct then there exists some maximal equal prefix after which the suffix of one is initially greater/lesser than the suffix of the other. That binary split is the same for the above and the library.

    A real ternary trie would have instead that two distinct keys have some maximal equal prefix after which the suffix of one key is initially A/B/C than the suffix of the other. The trick is getting an element-wise function that partitions non-equal elements into three equal sized sets, rather than two. You can’t use equality as the third set, because equality is handled by the shared prefix and so is already accounted for.

    Trying to achieve a real ternary trie is desirable because many decision problems have e as the optimal branching factor. It’s easy to get any 2^n branching factor by just chomping off more elements at once. That’ll help to flatten the trie, but at the cost of additional width. Optimality is about balancing the cost of depth against the cost of width; node fusion reduces depth at no cost to width.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 35 other followers

%d bloggers like this: