Programming Praxis – Text File Databases: Part 1

October 19, 2010

In today’s Programming Praxis exercise our goal is to read data from four different types of text file databases. Let’s get started, shall we?

Some imports (the last one is only there to make the type signatures easier to read):

import Control.Applicative ((<*), (<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String

Whenever I need to read any kind of text-based data format, the Parsec library is my go-to tool. First, let’s define what constitutes the end of a line, since we need it in all four types.

eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof
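As a quick sanity check, here is a minimal sketch of how one might confirm which line endings eol accepts. The accepts helper is a hypothetical name introduced only for this illustration; it is not part of the exercise.

```haskell
import Control.Applicative ((*>), (<*))
import Text.Parsec
import Text.Parsec.String

-- Same definition as in the post.
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

-- Hypothetical helper: True when the parser matches the whole input.
accepts :: Parser a -> String -> Bool
accepts p = either (const False) (const True) . parse (p <* eof) ""
```

With these definitions, accepts eol holds for "\n", "\r\n", "\r" and the empty string, and fails for "x" or for two consecutive newlines, since eol consumes only one line ending.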

The first type to handle is fixed-length records. All we do is create a parser for each field and collect the results in a list. The header currently gets no special treatment: I can’t tell from the exercise text what we need to do with it, and unfortunately there are no test cases showing the expected behaviour.

fixedLength :: [Int] -> Parser [String]
fixedLength fields = foldr (\n p -> (:) <$> count n anyChar <*> p)
                           (return []) fields <* eol
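The fold above runs one fixed-width parser per field, in order, which is what mapM does as well. The sketch below shows that alternative spelling; fixedLength' and sampleFixed are names made up for this illustration, and the sample record is invented.

```haskell
import Control.Applicative ((*>), (<*))
import Text.Parsec
import Text.Parsec.String

-- Same definition as in the post.
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

-- Alternative spelling of the fold: one `count n anyChar` per width, in order.
fixedLength' :: [Int] -> Parser [String]
fixedLength' widths = mapM (\n -> count n anyChar) widths <* eol

-- Hypothetical sample record: fields of width 5, 3 and 4, then a newline.
sampleFixed :: Either ParseError [String]
sampleFixed = parse (fixedLength' [5,3,4]) "" "helloabc1234\n"
```

Here sampleFixed evaluates to Right ["hello","abc","1234"]. A record that is too short, such as "hello\n", is rejected because the remaining fields cannot be filled.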

The parser for character-delimited records is fairly self-evident: records consist of fields and stop at the end of a line; fields consist of characters and stop at a delimiter or the end of a line. The separator is itself a parser, so there’s plenty of flexibility.

charDelim :: Parser a -> Parser [String]
charDelim sep = manyTill field eol where
    field = manyTill anyChar ((sep *> return ()) <|> lookAhead eol)
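To illustrate the flexibility of passing a parser as the separator, here is a small sketch using a two-character delimiter. The doubleColon parser and the sample input are assumptions made for the example. Wrapping string in try is a precaution so that a lone ':' is treated as ordinary field data instead of a half-matched separator.

```haskell
import Control.Applicative ((*>), (<*))
import Text.Parsec
import Text.Parsec.String

-- Same definitions as in the post.
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

charDelim :: Parser a -> Parser [String]
charDelim sep = manyTill field eol where
    field = manyTill anyChar ((sep *> return ()) <|> lookAhead eol)

-- Hypothetical two-character delimiter; try backtracks if only one ':' is found.
doubleColon :: Parser String
doubleColon = try (string "::")

-- Invented sample line: three fields, the middle one containing a single ':'.
sampleDelim :: Either ParseError [String]
sampleDelim = parse (charDelim doubleColon) "" "a::b:c::d\n"
```

The result is Right ["a","b:c","d"]: the single colon stays inside the second field.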

Comma-separated files aren’t much more difficult. Fields are separated by commas and are either plain text or quoted values; inside a quoted value, a doubled quote stands for a literal quote character.

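csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    -- a field is either quoted or runs up to the next comma or line break
    field = quoted <|> many (noneOf ",\n\r")
    -- inside quotes, "" is an escaped quote; try backtracks on the closing quote
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")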
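The quoting rules are easiest to see on a concrete line. The sketch below repeats the parser and applies it to an invented record that contains a comma inside quotes and a doubled quote; sampleCsv is a name introduced only for this illustration.

```haskell
import Control.Applicative ((*>), (<*))
import Text.Parsec
import Text.Parsec.String

-- Same definitions as in the post.
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

csv :: Parser [String]
csv = sepBy field (char ',') <* eol where
    field = quoted <|> many (noneOf ",\n\r")
    quoted = between (char '"') (char '"') $
             many (try (char '"' <* char '"') <|> noneOf "\"")

-- Invented sample line:  a,"b,c","say ""hi"""
sampleCsv :: Either ParseError [String]
sampleCsv = parse csv "" "a,\"b,c\",\"say \"\"hi\"\"\"\n"
```

This yields Right ["a","b,c","say \"hi\""]: the quoted comma does not split the field, and each doubled quote collapses to a single one.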

For name-value records, just create a tuple of the name and the value, and keep doing so until you reach an empty line.

nameValue :: Parser a -> Parser [(String, String)]
nameValue sep = manyTill field eol where
    field = (,) <$> manyTill anyChar sep <*> manyTill anyChar eol
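As a rough illustration of where a record ends, the following sketch parses an invented input holding two records separated by a blank line. sampleRecord is a hypothetical name, and the ':' separator mirrors the usage shown for this post.

```haskell
import Control.Applicative ((<*), (<*>), (*>), (<$>))
import Text.Parsec
import Text.Parsec.String

-- Same definitions as in the post.
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

nameValue :: Parser a -> Parser [(String, String)]
nameValue sep = manyTill field eol where
    field = (,) <$> manyTill anyChar sep <*> manyTill anyChar eol

-- Invented input: the blank line ends the first record, so "name:Bob" is left unread.
sampleRecord :: Either ParseError [(String, String)]
sampleRecord = parse (nameValue (char ':')) "" "name:Alice\ncity:Paris\n\nname:Bob\n"
```

The result is Right [("name","Alice"),("city","Paris")]. Note that the value is taken verbatim, so "name: Alice" would keep the leading space.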

Each of the four parsers above parses only a single record. To read a file, we just keep reading records until we hit the end of the file.

readDB :: Parser a -> FilePath -> IO (Either ParseError [a])
readDB record = fmap (parse (manyTill record eof) "") . readFile
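For experimenting without files on disk, a pure counterpart of readDB can be handy. The sketch below is one possible shape; parseDB and the one-line-per-record parser line are hypothetical names introduced for this example and are not part of the original solution.

```haskell
import Control.Applicative ((*>))
import Text.Parsec
import Text.Parsec.String

-- Same definition as in the post.
eol :: Parser ()
eol = (char '\n' *> optional (char '\r')) <|>
      (char '\r' *> optional (char '\n')) <|> eof

-- Pure counterpart of readDB: the same record loop, applied to a String.
parseDB :: Parser a -> String -> Either ParseError [a]
parseDB record = parse (manyTill record eof) ""

-- Hypothetical stand-in record parser for the demo: one whole line per record.
line :: Parser String
line = manyTill anyChar eol
```

Both parseDB line "a\nb\n" and parseDB line "a\nb" give Right ["a","b"], because eol also accepts the end of the input, and an empty string gives Right [].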

The lines below show some example usages:

main :: IO ()
main = do print =<< readDB (fixedLength [5,3,4]) "db_fl.txt"
          print =<< readDB (charDelim $ char '|') "db_cd.txt"
          print =<< readDB csv "db_csv.txt"
          print =<< readDB (nameValue $ char ':') "db_nv.txt"

Judging from my own limited test cases, everything seems to be working, and the code is significantly more compact than the provided solution. Yet another example of why I’m a fan of Parsec.

Programming Praxis – Chronological Listing Of Exercises

July 2, 2010

In today’s Programming Praxis exercise our goal is to replicate a script Phil wrote to generate chronological and reverse chronological lists of all of his posts. He did it in 24 lines of AWK, so let’s see how Haskell measures up.

Some imports:

import Data.List
import Data.List.Split
import Text.Printf
import Text.Regex.Posix

We need a function to display the name of a month.

-- chunksOf is the current name for what older versions of split called chunk
toMonth :: Int -> String
toMonth m = chunksOf 3 "JanFebMarAprMayJunJulAugSepOctNovDec" !! (m - 1)
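The lookup works by cutting the string into three-letter pieces and indexing into the result. As a sketch of that idea using only base, the version below swaps the split package for a small hand-written helper and returns Nothing for out-of-range months. Both chunksOf' and toMonthSafe are hypothetical names for this illustration.

```haskell
import Data.List (unfoldr)

-- Hypothetical base-only helper: cut a list into consecutive pieces of length n.
-- Assumes n is positive; otherwise the pieces would never shrink the input.
chunksOf' :: Int -> [a] -> [[a]]
chunksOf' n = unfoldr step
  where step [] = Nothing
        step xs = Just (splitAt n xs)

-- Month abbreviation for 1..12, Nothing for anything else.
toMonthSafe :: Int -> Maybe String
toMonthSafe m
  | m >= 1 && m <= 12 = Just (months !! (m - 1))
  | otherwise         = Nothing
  where months = chunksOf' 3 "JanFebMarAprMayJunJulAugSepOctNovDec"
```

For example, toMonthSafe 7 is Just "Jul", while toMonthSafe 13 is Nothing instead of an index error.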

Generating the HTML for a post is roughly the same as in his version, save for the fact that I removed a bit of duplication in the links.

item :: [[String]] -> String
item xs = printf
    "<tr><td>%s</td><td>%02s %s %s</td><td>%s: %s</td>\
    \<td>%s%s<a href=\"http://programmingpraxis.codepad.org/%s\">\
    \codepad</a></td></tr>"
    (g "number") (g "pubday") (toMonth . read $ g "pubmon") (g "pubyear")
    (link "" (g "title")) (g "blurb") (link "" "exercise")
    (link ("/" ++ g "soln") "solution") (g "codepad")
    where g x = maybe "" last $ find ((== x) . head) xs
          link :: String -> String -> String
          link = printf "<a href=\"/%s/%02s/%02s/%s%s/\">%s</a>"
                 (g "pubyear") (g "pubmon") (g "pubday") (g "file")

Generating a list of items is pretty self-explanatory: separate the blocks, filter out the posts, parse the properties and generate the necessary HTML.

items :: String -> [String]
items = map (item . map (splitOn "\t") . lines) .
        filter (=~ "^number\t[1-9][0-9]*$") . splitOn "\n\n"
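To make the "parse the properties" step more concrete, here is a base-only sketch of reading one block of tab-separated name and value lines and looking a property up, defaulting to an empty string as g does. The sample block is invented, and property, properties and get are hypothetical names. Unlike the code above, this sketch keeps everything after the first tab as the value.

```haskell
import Data.Maybe (fromMaybe)

-- Invented block in the "name<TAB>value" layout the code above expects.
sample :: String
sample = "number\t42\npubyear\t2010\npubmon\t7\npubday\t2\ntitle\tSample Title"

-- Split one line at its first tab into (name, value).
property :: String -> (String, String)
property l = (key, drop 1 rest)
  where (key, rest) = break (== '\t') l

-- All properties of one block, one per line.
properties :: String -> [(String, String)]
properties = map property . lines

-- Look a property up, defaulting to "" when it is missing.
get :: String -> [(String, String)] -> String
get key = fromMaybe "" . lookup key
```

With this, get "pubyear" (properties sample) is "2010", and a missing name such as "soln" yields "".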

All that’s left to do is sort the items as required and put them in a table. Like the original implementation, this version requires that the file containing the list is sorted chronologically.

listing :: ([String] -> [String]) -> String -> String
listing f xs = "<table cellpadding=\"10\">" ++
               concat (f $ items xs) ++ "</table>"

Let’s see if everything works:

main :: IO ()
main = do x <- readFile "praxis.info"
          putStrLn $ listing id x
          putStrLn $ listing reverse x

Yup. And at 15 lines, I think I’ll continue to use Haskell for my text munging needs.