Today’s Programming Praxis problem is an interesting one: we have to write a program to solve monoalphebetic substitution ciphers.
The algorithm we’re going to use is by no means perfect, but for the example cryptogram, as well as a few other test cases, it generally gives a good enough approximation that only a little bit of human help is needed to get the full solution. It also has the benefit of being nice and short.
First, our imports:
import qualified Data.Map as M
We’re going to be using a dictionary-based approach. In order for that to work, we need a way to see if a given dictionary word can fit the pattern defined by the crypto word. We do this by looking at the first letter of the crypto and the dictionary word. If we already have a partial solution key, we check if the decoded letter is the same as that of the dictionary word. If we don’t, we add the relationship between the two letters to the solution key. Then we check the rest of the two words in the same way.
match :: M.Map Char String -> String -> String -> Bool
match _   = True
match k (c:cs) (p:ps) = M.findWithDefault [p] c k == [p] &&
match (M.insert c [p] k) cs ps
match _ _ _ = False
To apply the solution key to an encrypted string, we replace all the letters that exist in the key with their decoded equivalents, leaving the rest as is.
substitute :: M.Map Char String -> String -> String
substitute key = concatMap (\c -> M.findWithDefault [c] c key)
To get a series of potential keys, we look at every word of the encrypted text and see which dictionary words can still be substituted for them. If there are less than 100 options for a word, we generate all the keys that would result from those substitutions.
getKeys :: M.Map Char String -> [String] -> String -> [M.Map Char String]
getKeys k dict c = [M.fromList .
unionBy (\(a,x) (b,y) -> a == b || x == y) (M.assocs k) .
zip w $ map return p | w <- words c,
let ps = filter (match k w) dict, length ps < 100, p <- ps]
We score keys by the amount of letters of they decrypt; the more, the better.
score :: M.Map Char String -> String -> Int
score k = negate . length . filter isLower . substitute k
We iterate this process a couple of times by feeding the resulting keys into a new iteration of the algorithm. In order to prevent combinatorial explosion, we only keep the 10 best keys of each generation.
findBestKeys :: Int -> [String] -> String -> [M.Map Char String]
findBestKeys n dict c = iterate (take 10 . sortWith (`score` c) .
concatMap (\k -> getKeys k dict c)) [M.empty] !! n
‘Solving’ a crpytogram, i.e. showing the 10 best candidates and letting the user interpret them, means finding the best keys and showing the resulting substitutions. The first argument determines how many generations are performed, which determines the quality of the solution. For some cryptograms you may have to pay around with this a little. In my tests, something in the order of 3-7 usually works fairly well.
solve :: Int -> [String] -> String -> [String]
solve n dict c = map (`substitute` c) $ findBestKeys n dict $
filter (\x -> isAlpha x || isSpace x) c
Let’s see if our little algorithm works. For our dictionary we use english-words.10, a file that can be found in SCOWL, a Spell-Checker Oriented Word List (more info here). It is a list of common English words that offers a good balance between quality (it works better for my algorithm than the usr/dict/words file from Linux does) and speed (it’s just under 5000 words, which means that I get a solution in less than a second).
crypto = "P OYUUAOEXYW YM AEFGAD, FZGPEAG JAATUL, MYC ENA " ++
"AGFOPEXYW PWG AWSYLVAWE YM ENA DPHHL ZCYICPVVAC"
main :: IO ()
main = do dict <- fmap lines $ readFile "english-words.10"
mapM_ print $ solve 7 dict crypto
The resulting decrypted text is “a collection of etudew, updated JeeTls, for the education and enSosment of the waHHs programmer”, which looks strikingly similar to the old motto of Programming Praxis – “a collection of etudes, updated weekly, for the education and enjoyment of the savvy programmer”. As you can see, there are a few errors and it doesn’t get everything, but that is because these words are not in the dictionary file I used. Still, for an 18-line algorithm I’d say it works well enough. A trigram- or genetic algorithm-based solution might work better, but we’ll see that when the official solution is released.