Programming Praxis – Text File Databases: Part 1

October 19, 2010

In today’s Programming Praxis exercise our goal is to read data from four different types of text file databases. Let’s get started, shall we?

Some imports (the last one is only there for the Parser type synonym, which makes the type signatures easier to read):

import Control.Applicative ((<*), (<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String

Whenever I need to read any kind of text-based data format, the Parsec library is my go-to tool. First, let’s define what constitutes the end of a line, since we need it in all four types.

-- accepts \n, \r, \n\r, \r\n or the end of the input as a line ending
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof
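
As a quick check of what eol accepts, here is a minimal sketch that splits a string into lines with it. The helper names line and linesOf are made up for this illustration and are not part of the exercise.

```haskell
import Control.Applicative ((*>))
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

-- hypothetical helper: everything up to (and excluding) the line ending
line :: Parser String
line = manyTill anyChar eol

-- hypothetical helper: split a whole string into lines
linesOf :: String -> Either ParseError [String]
linesOf = parse (manyTill line eof) ""
```

In GHCi, linesOf "a\nb\r\nc" should give Right ["a","b","c"]: Unix and Windows line endings are both consumed, and the last line is closed by the end of the input.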

The first type to handle is the fixed-length record. All we do is create a parser for each field and collect the results in a list. There is currently no special handling for the header, as I can’t tell from the exercise text what we need to do with it, and unfortunately there are no test cases to show the expected behaviour.

fixedLength :: [Int] -> Parser [String]
fixedLength fields = foldr (\n p -> (:) <$> count n anyChar <*> p)
                           (return []) fields <* eol
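
To make the fold concrete, here is a small sketch that runs fixedLength on one record and, for comparison, writes the same parser with mapM. The name fixedLength' and the sample record are assumptions made for this illustration; the exercise does not supply test data.

```haskell
import Control.Applicative ((<*), (<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

fixedLength :: [Int] -> Parser [String]
fixedLength fields = foldr (\n p -> (:) <$> count n anyChar <*> p)
                           (return []) fields <* eol

-- hypothetical alternative: run one fixed-width parser per field, in order
fixedLength' :: [Int] -> Parser [String]
fixedLength' fields = mapM (\n -> count n anyChar) fields <* eol

-- made-up record: 5-character name, 3-character age, 4-character city code
sample :: String
sample = "Alice 30NYC1\n"

parsed, parsed' :: Either ParseError [String]
parsed  = parse (fixedLength  [5,3,4]) "" sample
parsed' = parse (fixedLength' [5,3,4]) "" sample
```

Both values should come out as Right ["Alice"," 30","NYC1"]. Padding is kept as-is, so trimming is left to the caller. Note also that anyChar matches line breaks too, so a record that is shorter than the declared widths will swallow its own line ending.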

The parser for character-delimited records is fairly self-evident: a record consists of fields and stops at the end of a line, while a field consists of characters and stops at a delimiter or the end of a line. The separator is itself a parser, so there’s plenty of flexibility.

charDelim :: Parser a -> Parser [String]
charDelim sep = manyTill field eol where
    field = manyTill anyChar ((sep *> return ()) <|> lookAhead eol)
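
The flexibility of a parser-valued separator can be sketched as follows. The sample inputs and the names pipeRecord and colonRecord are made up for this illustration. The one thing to watch, as far as I can tell, is that a multi-character separator should be wrapped in try, since string consumes input when it matches only partially.

```haskell
import Control.Applicative ((*>))
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

charDelim :: Parser a -> Parser [String]
charDelim sep = manyTill field eol where
    field = manyTill anyChar ((sep *> return ()) <|> lookAhead eol)

-- single-character separator, as in the post
pipeRecord :: Either ParseError [String]
pipeRecord = parse (charDelim (char '|')) "" "a|b|c\n"

-- multi-character separator; `try` lets a lone ':' fall through to anyChar
colonRecord :: Either ParseError [String]
colonRecord = parse (charDelim (try (string "::"))) "" "a::b:c\n"
```

Here pipeRecord should be Right ["a","b","c"] and colonRecord should be Right ["a","b:c"]. One consequence of this design is that a trailing delimiter does not produce an empty last field: "a|b|\n" parses to ["a","b"].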

Comma-separated files aren’t much more difficult. Fields are separated by commas and are either plain text or quoted values, where a doubled quote inside a quoted value stands for a literal quote character.

csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    field = quoted <|> many (noneOf ",\n\r")
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")
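
A short sketch of the quoting rules in action, using a made-up record with an embedded comma and an escaped quote (sampleLine and parsedLine are names invented for this example):

```haskell
import Control.Applicative ((<*), (*>))
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    field = quoted <|> many (noneOf ",\n\r")
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")

-- made-up record; as raw text it reads: 1,"Smith, John","say ""hi"""
sampleLine :: String
sampleLine = "1,\"Smith, John\",\"say \"\"hi\"\"\"\n"

parsedLine :: Either ParseError [String]
parsedLine = parse csv "" sampleLine
```

The result should be Right ["1","Smith, John","say \"hi\""]: the comma inside the quotes does not split the field, and each doubled quote collapses to a single one.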

For name-value records, we create a tuple of the name and the value for each line, and keep doing so until we find an empty line.

nameValue :: Parser a -> Parser [(String, String)]
nameValue sep = manyTill field eol where
    field = (,) <$> manyTill anyChar sep <*> manyTill anyChar eol
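
To illustrate where a record ends, here is a minimal sketch with made-up input containing two records separated by a blank line. The names sampleInput, firstRecord and firstAge are assumptions for this example only.

```haskell
import Control.Applicative ((<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

nameValue :: Parser a -> Parser [(String, String)]
nameValue sep = manyTill field eol where
    field = (,) <$> manyTill anyChar sep <*> manyTill anyChar eol

-- made-up input: two records separated by a blank line
sampleInput :: String
sampleInput = "name:Alice\nage:30\n\nname:Bob\n"

-- only the first record is parsed; parsing stops at the blank line
firstRecord :: Either ParseError [(String, String)]
firstRecord = parse (nameValue (char ':')) "" sampleInput

-- the result is an association list, so `lookup` finds a value by name
firstAge :: Maybe String
firstAge = either (const Nothing) (lookup "age") firstRecord
```

firstRecord should be Right [("name","Alice"),("age","30")] and firstAge should be Just "30". Whitespace after the separator is kept, so an input line like "name: Alice" would yield the value " Alice".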

The four parsers above only parse a single record. To read a file, we just keep reading records until we hit the end of the file.

readDB :: Parser a -> FilePath -> IO (Either ParseError [a])
readDB record = fmap (parse (manyTill record eof) "") . readFile
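
For experimenting without files, the same loop can be run on a String. The sketch below adds two hypothetical variants that are not in the original solution: parseDB, a pure version, and readDB', which uses parseFromFile from Text.Parsec.String so that error messages carry the file name.

```haskell
import Control.Applicative ((<*), (*>))
import Text.Parsec
import Text.Parsec.String (Parser, parseFromFile)

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    field = quoted <|> many (noneOf ",\n\r")
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")

-- hypothetical pure variant of readDB: parse records from a String
parseDB :: Parser a -> String -> Either ParseError [a]
parseDB record = parse (manyTill record eof) ""

-- hypothetical file variant: parseFromFile passes the path as the source name
readDB' :: Parser a -> FilePath -> IO (Either ParseError [a])
readDB' record = parseFromFile (manyTill record eof)

-- made-up input with two records
twoRecords :: Either ParseError [[String]]
twoRecords = parseDB csv "a,b\nc,d\n"
```

twoRecords should be Right [["a","b"],["c","d"]], and the result is the same if the final line break is missing, because eol also accepts the end of the input. A blank line at the end of a CSV file is read as one more record holding a single empty field.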

The lines below show some example usages:

main :: IO ()
main = do print =<< readDB (fixedLength [5,3,4]) "db_fl.txt"
          print =<< readDB (charDelim $ char '|') "db_cd.txt"
          print =<< readDB csv "db_csv.txt"
          print =<< readDB (nameValue $ char ':') "db_nv.txt"

Judging from my own limited test cases, everything seems to be working, and the code is significantly more compact than the provided solution. Yet another example of why I’m a fan of Parsec.