Monadic Parsing

Pedro Vasconcelos, DCC/FCUP

March 2021

Bibliography

“Functional Pearls: Monadic Parsing in Haskell”, Graham Hutton and Erik Meijer, Journal of Functional Programming.
Chapter 8 of “Programming in Haskell” (Graham Hutton)
Sample code: https://github.com/pbv/tapf-parsing

What is a parser?

A parser is program that converts text into some internal representation
- e.g. an abstract syntax tree (AST)
Can be implemented manually (e.g. top-down recursive descent)
Or using some automatic tool (e.g. bottom-up parser generators like yacc,bison or happy)

This lecture: writing top-down recursive descent parsers using monadic parsing combinators.

A type for parsers

A parser consumes an input string and yields pairs of result and remaining input.

-- parser producing values of type `a'
newtype Parser a = Parser (String -> [(a, String)])

Read this out loud as a rhyme:

A parser for things
Is a function from strings
To lists of pairs
Of things and strings

Running a parser

-- run a parser on some input 
runParser :: Parser a -> String -> [(a, String)]
runParser (Parser f) = f

Example

A parser that accepts a single next character.

next :: Parser Char
next = Parser (\cs -> case cs of
                       [] -> []
                       (c:cs') -> [(c, cs')])

Make it into a monad

instance Monad Parser where
   return a = Parser (\cs -> [(a, cs)])
   p >>= k  = Parser (\cs -> 
                  concat [runParser (k a) cs'
                         | (a,cs')<-runParser p cs])

return yields a value without consuming input
>>= applies a continuation to the result of the first parser
Combination of state (input stream) and choice (alternative parsings)

Failure and Choice

-- | a parser that always fails
empty :: Parser a
empty = Parser (\_ -> [])

-- | combine results of two parsers 
(+++) :: Parser a -> Parser a -> Parser a
p +++ q = Parser (\cs -> runParser p cs ++
                         runParser q cs)

Deterministic choice

The combinator +++ returns all results of both parsers; we often care only about the first result.

Let us define an infix operator for deterministic choice.

(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser (\cs -> case runParser p cs of
                           [] -> runParser q cs
                           (x:_) -> [x])

Both +++ and <|> are associative with neutral element empty.

Some simple combinators

-- | parse a char satisfying a predicate
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = do c <- next
               if p c then return c else empty 

-- | parse a specific char
char :: Char -> Parser Char
char c = satisfy (==c)

-- | parse a specific string
string :: String -> Parser String
string "" = return ""
string (c:cs) 
    = do char c; string cs; return (c:cs)

Combinators for repetition

-- | repeat a parser zero or more times
many :: Parser a -> Parser [a]
many p = many1 p <|> return []

-- | repeat a parser one or more times
many1 :: Parser a -> Parser [a]
many1 p = do a<-p; as<-many p; return (a:as)

Remember that <|> is left-biased; cue example…

Spaces and Tokens

-- | parse many whitespace charaters
spaces :: Parser String
spaces = many (satisfy isSpace)

-- | parse a token using a parser `p`
-- discarding any initial spaces
token :: Parser a -> Parser a
token p = spaces >> p

-- | parse a string token
symbol :: String -> Parser String
symbol s = token (string s)

Example 1: parsing CSVs

Text files for simple tabular data
Each item is either a string or an integer
Each row is a list of items separated by commas (,)
Rows are separated by \n

Example:

"Name", "Sex", "Age", "Height(in)", "Weight(lbs)"
"Alex",  "M",   41,   74,           170
"Bert",  "M",   42,   68,           166
"Carl",  "M",   32,   70,           155

Parsing integers and strings

integer :: Parser Integer
integer = token $ do
    s <- many1 (satisfy isDigit)
    return (read s) 

stringLit :: Parser String
stringLit = token $ do
  char '\"'
  s <- many (satisfy (/='\"'))
  char '\"'
  return s

NB: some cases missing (negative numbers, escape chars)

Parsing an item

Use a sum type to tag different results:

type Item = Either Integer String 

item :: Parser Item
item = do n <- integer; return (Left n)
       <|> 
       do s <- stringLit; return (Right s)

Parsing an item (2)

Alternativa: we could define our own data type.

data Item = Number Integer | Text String

item :: Parser Item
item = do n <- integer; return (Number n)
       <|> 
       do s <- stringLit; return (Text s)

Parsing a row

Need to “factor out” the first item because of the separators (commas).

row :: Parser [Item]
row = do v<-item; vs<-rest; return (v:vs)
      <|> 
      return []
  where
    rest = many (comma >> item) <|> return []

comma :: Parser Char
comma = token (char ',')

Avoiding repetition

The row parser is a bit repetitive
But we cannot simplify it using many because commas are separators not terminators
Let us define a new combinator for this purpose

Combinators for separators

-- | repeated applications of `p'
-- separated by applications of `sep'
sepBy :: Parser a -> Parser b -> Parser [a]
p `sepBy` sep = (p `sepBy1` sep) <|> return []

sepBy1 :: Parser a -> Parser b -> Parser [a]
p `sepBy1` sep = do a <- p
                    as <- many (sep >> p)
                    return (a:as)

Row parser revisited

The row parser is much simpler using sepBy:

row :: Parser [Item]
row = item `sepBy` comma

Parsing many rows

We can define the parser for rows using the previous parser and many1:

rows :: Parser [[Item]]
rows = many1 (do r<-row; newline; return r)

newline = char '\n'

We can even inline the definition of row directly:

rows :: Parser [[Item]]
rows = many1 (do r <- item`sepBy`comma
                 newline
                 return r)

Example 2: arithmetic expressions

A parser for arithmetic expressions with integers, +, -, *, / and parentheses
Result is the expression value (no need to build a syntax tree)

Example:

> runParser expr "1 + 2*(3+1)"
[(9, "")]
> runParser expr "12 * 3 + 1   "
[(37,"    ")]

Grammar

\[ \begin{array}{rcl} \text{expr} &::=& \text{expr} ~ \text{addop}~ \text{term} \quad|\quad \text{term} \\ \text{term} &::=& \text{term}~ \text{mulop}~ \text{factor} \quad|\quad \text{factor} \\ \text{factor} &::=& \text{integer} \quad|\quad \mathtt{(}~\text{expr}~\mathtt{)} \\ \text{addop} &::=& \mathtt{+} \quad|\quad \mathtt{-} \\ \text{mulop} &::=& \mathtt{*} \quad|\quad \mathtt{/}\\ \text{integer} &::=& \cdots \end{array} \]

Base expressions

factor :: Parser Integer
factor = integer <|> parens expr
   -- `expr' will be defined later

integer :: Parser Integer  -- as before

-- add parenthesis around a parser
parens :: Parser a -> Parser a
parens p = do
   symbol "("
   a <- p
   symbol ")"
   return a

Terms

term :: Parser Integer
term = do t<-factor; cont t
  where
  cont t1 = do op <- mulop
               t2 <- factor
               cont (t1 `op` t2)
            <|> return t1

mulop :: Parser (Integer -> Integer -> Integer)
mulop = do symbol "*"; return (*)
        <|> 
        do symbol "/"; return div

Complete expressions

expr :: Parser Integer
expr = do t<-term; cont t
  where
  cont t1 = do op <- addop
               t2 <- term
               cont (t1 `op` t2)
            <|> return t1

addop :: Parser (Integer -> Integer -> Integer)
addop = do symbol "+"; return (+)
        <|>
        do symbol "-"; return (-)

Avoiding repetition

expr and term are similar
- combine sub-expressions using operators
- expr allows + and -
- term allows * and /
But we can’t use sepBy because we need to combine acording to the operator
Let use define combinators for this: chainl and chainl1

More combinators (2)

-- | chain applications of a parser 
-- using some  operator associated to the left
chainl :: Parser a 
       -> Parser (a -> a -> a) -> a -> Parser a
chainl p op a = (p `chainl1` op) <|> return a

chainl1 :: Parser a 
        -> Parser (a -> a -> a) -> Parser a
chainl1 p op = do a <- p; cont a 
   where cont a = do f <- op
                     b <- p
                     cont (f a b) 
                  <|> return a

Expression parser using chaining

expr  = term `chainl1` addop

term  = factor `chainl1` mulop

addop = do symbol "+"; return (+)
        <|> 
        do symbol "-"; return (-) 

mulop = do symbol "*"; return (*) 
        <|> 
        do symbol "/"; return div 

factor =  integer <|> parens expr

Library

Parsec is an “industrial strength” monadic parsing library:

https://hackage.haskell.org/package/parsec

well documented
fast
generates good error messages
tokenizers, operator precedences, associativities, etc.
many pre-defined languages (can be “tweaked”)
parametrized by the input token type
can be “stacked” onto other monads (a monad transformer)