Programming Praxis – A Statisticle Speling Korrecter

In today’s Programming Praxis exercise, we have to implement Peter Norvig’s spelling corrector. Let’s get started, shall we?

First, we’re going to need a bunch of imports.

import Data.Char
import Data.List
import qualified Data.List.Key as K
import qualified Data.Map as M
import qualified Data.Set as S
import qualified Data.Text as T

The first step is to create our dictionary of valid words. I’m using the Data.Text functions instead of the Prelude ones here for performance reasons, as the text file is fairly big at 6.2 MB.

nwords :: IO (M.Map String Int)
nwords = fmap (M.fromListWith (+) . map (flip (,) 1 . T.unpack) .
               T.splitBy (not . isLetter) . T.toLower . T.pack) $
         readFile "big.txt"

Creating all the words at edit distance 1 is fairly simple with some list comprehensions. The reason for using the Data.Set functions here is that nub from Data.List is horribly slow on large lists.

edits :: String -> [String]
edits word = S.elems . S.fromList $ dels ++ trans ++ repls ++ ins
    where s     = zip (inits word) (tails word)
          dels  = [a ++ b     | (a, _:b)   <- s]
          trans = [a ++ c:b:d | (a, b:c:d) <- s]
          repls = [a ++ c:b   | (a, _:b)   <- s, c <- ['a'..'z']]
          ins   = [a ++ c:b   | (a, b)     <- s, c <- ['a'..'z']]

The python version has separate functions for the words at edit distance 1 and 2, but thanks to lazy evaluation we can just write a one-size-fits all function.

known_edits :: M.Map String a -> String -> Int -> [String]
known_edits dict word n = filter (`M.member` dict) $
    iterate (edits =<<) [word] !! n

Correcting a word is then simply a matter of generating the alternatives and finding the most common one.

correct :: M.Map String Int -> String -> String
correct dict word = maybe word (K.maximum (`M.lookup` dict)) .
    find (not . null) $ map (known_edits dict word) [0..2]

Let’s see if everything went correctly:

main :: IO ()
main = do dict <- nwords
          print $ correct dict "speling"
          print $ correct dict "korrecter"

Yup. One spelling korrector… uhh… corrector, all done.

Tags: , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: