Programming Praxis – Regular Expressions, Part 1

In today’s Programming Praxis problem our task is to write a parser for simple regular expressions. Since Haskell has a very good parser library called Parsec, we’re going to be using that. Let’s get started.

First, some imports:

import Control.Applicative ((<$>), (*>), (<*), (<*>))
import Data.Char
import Text.Parsec
import Text.Parsec.String

Next we define our data structure. There are seven constructs we have to implement, split into two groups based on whether or not they can be followed by a star or not.

data Elem = Lit Char | Esc Char | Any | Set Bool [Elem] deriving Show
data Chunk = Elem Elem | BoL | EoL | Star Elem deriving Show

The parser itself is not too difficult if you know how the operators from Control.Applicative work. <$> means apply the function on the left to the result of the parser on the right. <*, *> and <*> take only the result on the left, right and both sides respectively.

regex :: Parser [Chunk]
regex = (++) <$> bol <*> many chunk where
    bol = option [] (const [BoL] <$> char '^')
    chunk = choice [Star <$> try (element <* char '*'),
                    const EoL <$> try (char '$' <* eof),
                    Elem <$> element]
    element = choice [esc <$> try (char '\\' *> anyChar),
                      const Any <$> char '.',
                      Set False . expandSet <$> set "[^",
                      Set True . expandSet <$> set "[",
                      Lit <$> noneOf "]"]
    esc c = if elem c "nt" then Esc c else Lit c
    set s = try (string s *> many1 element <* char ']')
    expandSet (Lit a:Lit '-':Lit b:xs)
        | validRange a b = map Lit [a..b] ++ expandSet xs
    expandSet (x:xs) = x : expandSet xs
    expandSet _ = []
    validRange a b = b > a && ((isLower a && isLower b) ||
                               (isUpper a && isUpper b) ||
                               (isDigit a && isDigit b))

With the parser written, the function to parse a string is trivial:

parseRegex :: String -> Either ParseError [Chunk]
parseRegex = parse regex ""

Some tests to see if everything is working properly:

main :: IO ()
main = mapM_ print [parseRegex "[0-9][0-9]*",
                    parseRegex "^..*$",
                    parseRegex "hello",
                    parseRegex "^ *hello *$",
                    parseRegex "^[^x].*[0-9] *x$"]

Piece of cake. Next time we do the implementation.

About these ads

Tags: , , , , , , , , , , ,

4 Responses to “Programming Praxis – Regular Expressions, Part 1”

  1. programmingpraxis Says:

    For those of us not familiar with parsec, please explain your parser more slowly.

    Questions:

    1) The exercise is careful to require that the characters on the two ends of a range in a character class are either both lower-case, or both upper-case, or both digits, and the left character is less than the right character. Is yours so careful? For instance, how would [A-z] expand? According to the exercise, it should expand to the three-character class #\A, #\-, #\z.

    2) Have you implemented escape characters?

    3) Please explain the type declaration Set Bool [Char].

    4) Do ^ and $ lose their meta-ness when not at the beginning or end of the regular expression?

    5) What are the functions option and choice?

    Thanks for helping me clear up my confusion.

    Phil

  2. Remco Niemeijer Says:

    Hey Phil,

    Looks like I overlooked the text between the two bulleted lists, and since none of the cases mentioned in point 1 and 2 featured in the unit tests, I never noticed the problem. I’ve updated the code to handle those cases.

    As for the remaining questions:

    3) It’s now been changed to Set Bool [Elem] to fix a bug and handle escaped characters. The boolean determines whether it is a positive set (true) or a negative set (false). The [Elem] is the contents of the set.

    4) If you try parseRegex “^^$$”, you’ll get Right [BoL,Elem (Lit '^'),Elem (Lit '$'),EoL]. So yeah, they lose their meta-ness. I assume this is what’s supposed to happen.

    5) option takes an argument of type a and a parser that returns an a. If the parser fails, it returns the first argument, otherwise the result of the parser. choice takes a list of parsers and tries each one until it finds one that succeeds, after which it returns that parser’s output. If no parsers succeed, choice fails.

  3. Programming Praxis – Regular Expressions, Part 2 « Bonsai Code Says:

    [...] Bonsai Code Art is the elimination of the unnecessary « Programming Praxis – Regular Expressions, Part 1 [...]

  4. RobD Says:

    Hey very nice blog!!….I’m an instant fan, I have bookmarked you and I’ll be checking back on a regular….See ya :)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 35 other followers

%d bloggers like this: