Programming Praxis – Text File Databases: Part 1

In today’s Programming Praxis exercise our goal is to read data from four different types of text file databases. Let’s get started, shall we?

Some imports (the last one is only there to make the type signatures easier to read):

import Control.Applicative ((<*), (<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String

Whenever I need to read any kind of text-based data format, the Parsec library is my go-to tool. First, let’s define what constitutes the end of a line, since we need it in all four types.

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

The first type to handle are fixed-length records. All we do is create a parser for each field and concatenate their results. There is currently no special consideration for the header, as I can’t tell from the exercise text what we need to do with it and unfortunately there are no test cases for me to see the expected behaviour.

fixedLength :: [Int] -> Parser [String]
fixedLength fields = foldr (\n p -> (:) <$> count n anyChar <*> p)
                           (return []) fields <* eol

The parser for character-delimited records is fairly self-evident: records consist of fields and stop at the end of a line, fields consist of characters and stop at delimiters or the end of a line. The separator is itself a parser, so there’s plenty of flexibility.

charDelim :: Parser a -> Parser [String]
charDelim sep = manyTill field eol where
    field = manyTill anyChar ((sep *> return ()) <|> lookAhead eol)

Comma separated files aren’t much more difficult. Fields are separated by commas and are either plain text or quoted values.

csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    field = quoted <|> many (noneOf ",\n\r")
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")

For name-value records, just create a tuple of the name and the value, keep doing so until you find an empty line.

nameValue :: Parser a -> Parser [(String, String)]
nameValue sep = manyTill field eol where
    field = (,) <$> manyTill anyChar sep <*> manyTill anyChar eol

The four parsers above only parse a single record.  To read a file, we just keep reading records until we hit the end of the file.

readDB :: Parser a -> FilePath -> IO (Either ParseError [a])
readDB record = fmap (parse (manyTill record eof) "") . readFile

The lines below show some example usages:

main :: IO ()
main = do print =<< readDB (fixedLength [5,3,4]) "db_fl.txt"
          print =<< readDB (charDelim $ char '|') "db_cd.txt"
          print =<< readDB csv "db_csv.txt"
          print =<< readDB (nameValue $ char ':') "db_nv.txt"

Judging from my own limited test cases, everything seems to be working, and the code is significantly more compact than the provided solution. Yet another example of why I’m a fan of Parsec.

Tags: , , , , , , , ,

2 Responses to “Programming Praxis – Text File Databases: Part 1”

  1. programmingpraxis Says:

    1) Regarding Parsec. I wish Scheme had a standard parser library like that.

    2) Regarding fixed-length records don’t have a defined specification. That was on purpose. You can define the positions as list-of-starts-and-ends or list-of-starts-and-lengths or list-of-starts-and-list-of-ends or list-of-starts-and-list-of-lengths or some other way, whatever is convenient for you. Mine is list-of-starts-and-ends, but sometimes I wish it was list-of-starts-and-lengths.

    3) Regarding test cases. There will be a large list of test cases in Part 2, in the paper attached to that exercise. Also complete documentation. I probably should have attached it to Part 1, also. And there’s another exercise coming up next Tuesday that will use the input library in a significant way. Ultimately, of course, there is no one “right” set of test cases, because everyone has different needs for their readers; I just proposed a fairly standard set of library functions that I have found useful.

    4) Regarding CSV parsing. I’m not sure your comma-separated values parser does the right thing all the time. How do you handle commas embedded in quoted strings? Embedded double-quotes? Embedded newlines? CSV parsing is hard, not to mention under-specified. Look at my earlier exercise for some guidance.

    5) Regarding code length. I purposely repeated some code in all the readers, so that each one stands on its own — you can just cut-and-paste one of the readers into your program and it will work fine. The other approach, which you have taken, is to write the code once and make the user import all the needed pieces. Both are right, they are just different.

    6) I am always interested to look at your Haskell solutions because they use so many library functions. Today you used manyTill, anyChar, noneOf and lookAhead; those are all part of Parsec, correct?


  2. Remco Niemeijer Says:

    As for csv parsing, my code already handles all three cases you mention (it wouldn’t be much of a csv parser if it didn’t, now would it? :)) For an explanation why, let’s look at the “quoted” parser in a little more detail, since that’s where all three cases occur. between simply applies its first, third and second parser in order, returning only the result of the third one, so in this case it ignores the surrounding quotes. many keeps applying its argument until it fails, returning a list of all results. try means that no input is consumed unless its argument succeeds. <* applies both of its arguments, keeping only the result of the first one. <|> applies its second argument only when the first one fails. noneOf parses any character except those in its argument. With that out of the way, let’s see how that relates to all three cases. Once we’re in a quoted string, commas are parsed by noneOf “\””. They’re added to the content just the same as letters etc., since they don’t get treated as separators again until the field parser has completed, returning control to the sepBy at the top. Double quotes are handled by the parser before the <|>. Whenever a quote character is encountered, if it is followed by a second one, the parser succeeds and returns the result of the left parser, i.e. a single quote character, which is added to the field content. If it’s not followed by a second quote, the parser fails. Normally, the first quote would be consumed, but try means it gets put back in the queue. Because of that, noneOf “\”” will fail, many will complete, and the quote is consumed by the second argument of between, completing the field. Embedded newlines work the same as commas; when they’re encountered by noneOf “\”” they’re not given any special treatment.

    As for the four library functions you mention, they are indeed all part of Parsec. if you’re interested, they are described in the following pages:

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: