Posts Tagged ‘regex’

Programming Praxis – Regular Expressions, Part 1

September 15, 2009

In today’s Programming Praxis problem our task is to write a parser for simple regular expressions. Since Haskell has a very good parser library called Parsec, we’re going to be using that. Let’s get started.

First, some imports:

import Control.Applicative ((<$>), (*>), (<*), (<*>))
import Data.Char
import Text.Parsec
import Text.Parsec.String

Next we define our data structure. There are seven constructs we have to implement, split into two groups based on whether they can be followed by a star.

data Elem = Lit Char | Esc Char | Any | Set Bool [Elem] deriving Show
data Chunk = Elem Elem | BoL | EoL | Star Elem deriving Show
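
To make the two types a bit more concrete, here is a hand-built value of the kind we would expect the parser to produce for the regex "^[0-9]*x$". This is only a sketch: the name exampleRegex is made up for illustration, the types are repeated so the snippet stands alone, and Eq is derived purely so values can be compared (the definitions above only derive Show).

```haskell
-- Same data types as in the post, repeated so the snippet is self-contained.
-- Deriving Eq is an addition for this sketch only.
data Elem = Lit Char | Esc Char | Any | Set Bool [Elem] deriving (Show, Eq)
data Chunk = Elem Elem | BoL | EoL | Star Elem deriving (Show, Eq)

-- Hypothetical hand-written representation of "^[0-9]*x$".
-- The Bool in Set is True for a plain set and False for a negated one.
exampleRegex :: [Chunk]
exampleRegex =
  [ BoL                                    -- leading ^
  , Star (Set True (map Lit "0123456789")) -- [0-9]* with the range expanded
  , Elem (Lit 'x')                         -- a plain literal
  , EoL                                    -- trailing $
  ]
```

Note that only an Elem can be wrapped in Star, so the types themselves rule out things like a starred BoL.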

The parser itself is not too difficult if you know how the operators from Control.Applicative work. f <$> p applies the function f to the result of the parser p. p <* q and p *> q run both parsers in order but keep only the result of the left or the right one respectively, while <*> applies the function produced by the parser on its left to the result of the parser on its right.
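
As a warm-up, here is a small sketch of those four operators on toy parsers. The parsers digitValue, bracketed and digitPair are made-up names for illustration and are not part of the solution.

```haskell
import Control.Applicative ((<$>), (<*>), (*>), (<*))
import Data.Char (digitToInt)
import Text.Parsec
import Text.Parsec.String (Parser)

-- <$> : apply a pure function to the result of a parser.
digitValue :: Parser Int
digitValue = digitToInt <$> digit

-- *> and <* : run both parsers, keep only the right or the left result.
-- Here the parentheses are consumed but only the digit's value is returned.
bracketed :: Parser Int
bracketed = char '(' *> digitValue <* char ')'

-- <*> : feed the results of successive parsers to one function.
-- The comma is parsed and discarded by *>.
digitPair :: Parser (Int, Int)
digitPair = (,) <$> digitValue <*> (char ',' *> digitValue)
```

In GHCi, parse bracketed "" "(7)" gives Right 7 and parse digitPair "" "3,4" gives Right (3,4), while parse bracketed "" "(7" returns a Left with a parse error.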

regex :: Parser [Chunk]
regex = (++) <$> bol <*> many chunk where
    bol = option [] (const [BoL] <$> char '^')
    chunk = choice [Star <$> try (element <* char '*'),
                    const EoL <$> try (char '$' <* eof),
                    Elem <$> element]
    element = choice [esc <$> try (char '\\' *> anyChar),
                      const Any <$> char '.',
                      Set False . expandSet <$> set "[^",
                      Set True . expandSet <$> set "[",
                      Lit <$> noneOf "]"]
    esc c = if elem c "nt" then Esc c else Lit c
    set s = try (string s *> many1 element <* char ']')
    expandSet (Lit a:Lit '-':Lit b:xs)
        | validRange a b = map Lit [a..b] ++ expandSet xs
    expandSet (x:xs) = x : expandSet xs
    expandSet _ = []
    validRange a b = b > a && ((isLower a && isLower b) ||
                               (isUpper a && isUpper b) ||
                               (isDigit a && isDigit b))
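
The range expansion is probably the least obvious part of the parser, so here is the same idea in isolation, working on plain Strings instead of [Elem]. This is a simplified sketch, not the code used above: expandRanges and sameClass are hypothetical names. The trick is that when the guard on the first clause fails, Haskell simply falls through to the next clause, so an invalid range like "a-Z" is kept as three literal characters.

```haskell
import Data.Char (isDigit, isLower, isUpper)

-- Hypothetical String-based analogue of expandSet.
expandRanges :: String -> String
expandRanges (a:'-':b:rest)
  | sameClass a b && a < b = [a..b] ++ expandRanges rest
-- Reached for ordinary characters and for ranges that failed the guard.
expandRanges (c:rest) = c : expandRanges rest
expandRanges []       = []

-- Both ends must be lowercase letters, uppercase letters or digits.
sameClass :: Char -> Char -> Bool
sameClass a b = (isLower a && isLower b)
             || (isUpper a && isUpper b)
             || (isDigit a && isDigit b)
```

For example, expandRanges "0-3x" gives "0123x", whereas expandRanges "a-Z" gives back "a-Z" unchanged.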

With the parser written, the function to parse a string is trivial:

parseRegex :: String -> Either ParseError [Chunk]
parseRegex = parse regex ""

Some tests to see if everything is working properly:

main :: IO ()
main = mapM_ print [parseRegex "[0-9][0-9]*",
                    parseRegex "^..*$",
                    parseRegex "hello",
                    parseRegex "^ *hello *$",
                    parseRegex "^[^x].*[0-9] *x$"]

Piece of cake. Next time we do the implementation.

Programming Praxis – Beautiful Code

September 11, 2009

Today’s Programming Praxis is about beautiful code. Specifically, it concerns a bit of C code that can match simple regular expressions and that is widely considered beautiful. Personally I’d say the idea behind the code is good, but that the beauty of the code sample itself is held back by a language that requires too much attention to trivial details (e.g. having to manually increment pointers to move through a string), making the code needlessly long. Fortunately, our assignment is to implement the algorithm using the features and idioms of our own language, so let’s see what we can do with a slightly more modern language.

First, an import:

import Data.List

Since the algorithm itself isn’t all that difficult, we’ll focus on the features of Haskell that are used in this version. The top-level function shows pattern matching (replacing all those if statements), first-class functions (the argument for map) and partial application (given only the regex, match returns a function that takes a string and returns a Bool).

match :: String -> String -> Bool
match ('^':r) = matchHere r
match r       = or . map (matchHere r) . tails
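
The same or . map f . tails shape can be tried out on a simpler problem. The sketch below uses a made-up function, contains, that checks whether one string occurs anywhere in another; it is only an illustration of the idiom, not part of the matcher.

```haskell
import Data.List (isPrefixOf, tails)

-- Hypothetical helper: does needle occur anywhere in the haystack?
-- tails "abc" is ["abc", "bc", "c", ""], i.e. every position to start from.
-- isPrefixOf needle is a partial application: a function still waiting for
-- the string to test, which is exactly what map needs.
contains :: String -> String -> Bool
contains needle = or . map (isPrefixOf needle) . tails
```

Like match, contains is written with only its first argument on the left-hand side, so contains "bc" is itself a function from String to Bool. contains "bc" "abcd" is True and contains "x" "abc" is False.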

matchHere shows more pattern matching and adds lazy evaluation: in the third clause, if the check on the first character of the regex fails, the recursive call on the right-hand side of && is never evaluated.

matchHere :: String -> String -> Bool
matchHere (c:'*':r) xs  = matchStar c r xs
matchHere "$"       xs  = null xs
matchHere (r:rs) (x:xs) = (r == '.' || r == x) && matchHere rs xs
matchHere r      _      = null r
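
To see that the right-hand side of && really is skipped, here is a small sketch in which the rest of the match is passed in as an argument. The names stepMatches, skipped and forced are made up for this illustration.

```haskell
-- Hypothetical helper mirroring one step of the third clause of matchHere:
-- the third argument stands in for the recursive call.
stepMatches :: Char -> Char -> Bool -> Bool
stepMatches r x rest = (r == '.' || r == x) && rest

-- The first character does not match, so (&&) returns False without ever
-- looking at its second argument, and the error is never raised.
skipped :: Bool
skipped = stepMatches 'a' 'b' (error "rest of the match was evaluated")

-- With a matching first character the result is whatever the rest yields.
forced :: Bool
forced = stepMatches '.' 'b' True
```

Evaluating skipped gives False rather than an exception. In a strict language the argument would be evaluated before the call and the error would fire.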

matchStar adds guards to the mix. If the guard on the first clause fails, evaluation falls through to the next clause.

matchStar :: Char -> String -> String -> Bool
matchStar _ r xs     | matchHere r xs = True
matchStar c r (x:xs) = (c == '.' || c == x) && matchStar c r xs
matchStar _ _ _      = False
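
The fall-through behaviour of a failed guard can be sketched with a generalised variant in which the "rest of the regex" is replaced by an arbitrary predicate. afterRun is a hypothetical name used only for this illustration.

```haskell
-- Hypothetical generalisation: can we reach a suffix that satisfies ok
-- by dropping nothing but leading copies of c?
afterRun :: Char -> (String -> Bool) -> String -> Bool
-- If the guard fails, this clause is abandoned and the next one is tried.
afterRun _ ok xs     | ok xs = True
-- Otherwise consume one copy of c and try again.
afterRun c ok (x:xs) = c == x && afterRun c ok xs
-- Nothing left to consume and the predicate never held.
afterRun _ _  _      = False
```

For instance, afterRun 'a' null "aaa" is True, because dropping all three characters leaves the empty string, while afterRun 'a' null "aab" is False, because the b can never be dropped.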

Using the test suite from Programming Praxis (shortened here due to length) we can see our function works correctly:

main :: IO ()
main = mapM_ print [
           match "a" "a",
           not (match "a" "b"),
           match "a*a*a" "aaa",
           not (match "a*a*a" "xxxxx")]

At less than half the size of the original and with a more high-level approach, I prefer this version, but I guess beauty is in the eye of the beholder.