A Monadic Pratt Parser

Introduction

I recently read Beautiful Code which contains articles from several well known programmers about the code they consider beautiful.  In Top Down Operator Precedence, Douglas Crockford discusses the Pratt Parser (named after Vaughan Pratt who wrote the paper on it), also known as the Top Down Operator Precedence parser.  What makes this parsing technique interesting is that the syntax is defined in terms of tokens instead of grammar rules, an interesting way to define a language parser. As an example, Crockford defines the multiplication operator in terms of * token:

symbol("*", 60).led = function (left) {
    this.first = left;
    this.second = expression(60);
    this.arity = "binary";
    return this;
};

 

A Pratt parser works by reading through a stream of tokens and finding a symbol defined for the current token.  The symbol definition above for the * token defines how to build an abstract syntax tree (AST) node for that part of the token stream. The rest of the parser is defined similarly with symbols and functions for each significant piece of syntax in your language. I could go on about just how a Pratt parser works but someone beat me to it.

After reading about Pratt parsing I attempted my own implementation. I use F# which is a wonderful functional language supporting imperative style programming with mutable state. However, I feel a bit dirty using mutable state in a functional programming language (blame the hidden Haskell programmer in me) so I added the restriction that I will create my Pratt Parser in a pure functional manner.

Monads + Pratt?

Being purely functional meant fully embracing immutability. I manage the parser state by making use of the state monad which I have discussed previously which makes use of the computation expression feature in F#. This creates very clean looking code since the state is being passed along in the background which makes it very easy to string functions together.

To create the monadic Pratt parser I define a few key types which are used throughout the code: Symbol<’ex> and InputState<’ex>.

// Parser states consists of 'a which is the return type of any given parser combinator
// and 'ex which is the user defined type of the AST they are building
type Parser<'a, 'ex> = StateMonad<InputState<'ex>,'a>

// A symbol is the core component of the Pratt Parser. You define how the parsing works
// by defining symbols for operators/tokens
and Symbol<'ex> =
    {
        name : string;
        led : Token -> 'ex -> Parser<'ex,'ex>;
        nud : Token -> int -> Parser<'ex,'ex>;
        lbp : int;
    }

// The state type for the parser. Maintains the list of input tokens, the current token,
// the current symbol and the user defined mapping of tokens to symbols
and InputState<'ex> =
    {
        tokens : seq<Token>;
        token : Token option;
        symbol : Symbol<'ex> option;
        symbolMap : Map<string, Symbol<'ex>>
    }

 

The InputState<’ex> type defines all the state that is passed between the functions of the parser. It contains:

  • Tokens: The sequence of tokens received from the lexer
  • SymbolMap: The user defined mapping from a token’s name or value to a parser symbol.
  • Token:  The current token that the parser is on
  • Symbol: The corresponding symbol for the current token

The Symbol<’ex> type contain the user defined logic for how to parse the pieces of the language.It contains:

  • Name: The symbol name matches either a token’s name or value
  • Lbp: The left binding power of this symbol. Determines if it will grab the previous expression.
  • Nud: The function called when this symbol is executed with no arguments
  • Led: The function called when this symbol is executed with the previous expression’s result.

To parse a program you define several symbols and add them to InputState’s initial SymbolMap. The nud and led functions in these symbols will build up pieces of an AST. For example, given an AST defined by the Expression type:

type Expression =
    | Call of Expression * Expression list
    | Cond of Expression * Expression * Expression
    | Let of Expression * Expression
    | Fun of Expression * Expression list * Expression
    | Id of String
    | Int of int
    | Float of float
    | Bool of bool

 

Here is a symbol which describes how to build an identifier node using the Id data constructor:

// Error function for symbols that don't define led
let errorLed token left =
    parser {
        return! error token "Symbol does not take a left argument"
    }

// The nud function for id which returns an Expression type using the Id data constructor
let idNud token rbp =
    parser {
      return Id token.text
    };

// A symbol for identifiers. Defines a nud function but not a
// led function since identifiers don't consume neighboring tokens
let idSymbol =
    {
        name = "ID";
        lbp = 0;
        nud = idNud;
        led = errorLed;
    }

 

Most symbols will follow a similar pattern so I created several helper methods to create symbols and build up the symbol map. I won’t go into each here (you can read the source code) but they let you build the symbol map in a declarative way. For example, this is the definition of a symbol map for a simple language:

let languageSymbolMap = Map.empty |>  

                        // Add empty symbols
                        // Will error if called
                        addEmptySymbol ")" |>
                        addEmptySymbol "THEN" |>
                        addEmptySymbol "ELSE" |>

                        // Simple literal values
                        addLiteralSymbol "INT" Int int |>
                        addLiteralSymbol "FLOAT" Float double |>
                        addLiteralSymbol "BOOL" Bool bool  |>

                        // Binary right associative symbols
                        addInfixR "=" 10 neLed |>
                        addInfixR "!=" 10 neLed |>
                        addInfixR "&&" 30 andLed |>
                        addInfixR "||" 30 orLed |>

                        // Binary left associative symbols
                        addInfixL ">" 40 gtLed |>
                        addInfixL ">=" 40 gteLed |>
                        addInfixL "<" 40 ltLed |>
                        addInfixL "<=" 40 lteLed |>
                        addInfixL "+"  50 plusLed |>
                        addInfixL "-" 50 minusLed |>
                        addInfixL "*" 60 timesLed |>
                        addInfixL "/" 60 divideLed |>
                        addInfixL "^" 70 powerLed|>

                        // Custom symbols
                        addSymbol eqSymbol |>
                        addSymbol idSymbol |>
                        addSymbol leftParenSymbol |>
                        addSymbol letSymbol |>
                        addSymbol funSymbol |>
                        addSymbol ifSymbol |>
                        addSymbol endSymbol

 

Expression Parsing

The heart of the Pratt parser is the expression parsing code which calls the led and nud methods defined on the symbols.  My implementation follows the method in Crockford’s article very closely except that I use recursion instead of a loop and I extend the definition of the nud function to accept a right binding power.

let rec leftBoundExpression rbp token symbol left =
    parser {
        if (rbp < symbol.lbp) then
            do! advance()
            let! newLeft = symbol.led token left
            let! newToken, newSymbol = current()
            return! leftBoundExpression rbp newToken newSymbol newLeft
        else
            return left
    }

let expression rbp =
    parser {
        let! token, symbol = current()
        do!  advance()
        let! left = symbol.nud token rbp
        let! nextToken, nextSymbol = current()
        return! leftBoundExpression rbp nextToken nextSymbol left
    }

 

Adding the right binding power (rbp) parameter to the nud function makes it possible to parse a call expression like the ones found in Haskell or F#

someFunction arg1 arg2 arg2

To parse this expression the nud method for an someFunction must match the multiple expressions that come after it. This creates an issue since an identifier can be matched in two different ways: as a function call with arguments or as a stand alone identifier. We need to ensure the nud function for arg1 doesn’t try to grab arg2. Passing the right binding power to the nud method solves this since the nud for someFunction can indicate that it wants the next expressions with a high binding power. Below is an updated version of the identifier method’s nud function demonstrating this:

let idNud token rbp =
    parser {
        if rbp < 90 then
            let! args = manyExpressions 90
            if args.Length > 0
            then return Call (token, args)
            else return Id token.text
        else
            return Id token.text
    };

 

Parser Combinators

In the code above you may have noticed the call to manyExpressions. This function will match as many expressions as it can and then returns them in a list.

let rec manyExpressions rbp =
    parser {
        let! exp = expression rbp
        let! rest =  (manyExpressions rbp)
        return exp :: rest
    } <|> parser.Return []

 

What is interesting about this function is its base case. It will keep recursing trying to match expressions until it finds a symbol that does not define the nud method which causes the parser to return an error. This is where the <|> parser.Return [] expression takes over.  The <|> symbol is a function defined on the state monad which combines two monads/parsers into one. This kind of function is called a parser combinator. The first parser given to it will execute and if it eventually errors the parsing will backtrack until if finds an alternative (the right argument to <|>) or unwinds the stack completely and returns the error. In the code above the second parser is defined by  parser.Return []  which returns an empty list when the first parser errors.

Those familiar with the Haskell parser combinator library called Parsec may recognize the <|> combinator. In Parsec it is a predictive combinator and will only try the second alternative if the first didn’t consume input. This differs from my implementation which will try the second alternative even if the the first expression consumed hundreds of tokens before it errored.

Parser combinators offer great power since they can be used to effectively provide infinite look-ahead. They make it easy to parse ambiguous language constructs. I only use them in a limited way but there is much room to expand them.

Conclusion

I wasn’t sure at first what a monadic Pratt Parser would look and feel like. However, I am very pleased with how clear and readable it is. The use of computation expressions in F# enabled a simple and concise syntax. The melding of Pratt parsing with parser combinators creates really expressive and powerful parsers. I am interested in exploring this further to see what issues and pitfalls this kind of parser design may have.

All the code for my monadic Pratt parser can be found on this github page. This repository also contains an implementation of a simple functional language which demonstrates using the parser. The key files to look at are: