A Monadic Pratt Parser

Introduction

I recently read Beautiful Code which contains articles from several well known programmers about the code they consider beautiful.  In Top Down Operator Precedence, Douglas Crockford discusses the Pratt Parser (named after Vaughan Pratt who wrote the paper on it), also known as the Top Down Operator Precedence parser.  What makes this parsing technique interesting is that the syntax is defined in terms of tokens instead of grammar rules, an interesting way to define a language parser. As an example, Crockford defines the multiplication operator in terms of * token:

symbol("*", 60).led = function (left) {
    this.first = left;
    this.second = expression(60);
    this.arity = "binary";
    return this;
};

 

A Pratt parser works by reading through a stream of tokens and finding a symbol defined for the current token.  The symbol definition above for the * token defines how to build an abstract syntax tree (AST) node for that part of the token stream. The rest of the parser is defined similarly with symbols and functions for each significant piece of syntax in your language. I could go on about just how a Pratt parser works but someone beat me to it.

After reading about Pratt parsing I attempted my own implementation. I use F# which is a wonderful functional language supporting imperative style programming with mutable state. However, I feel a bit dirty using mutable state in a functional programming language (blame the hidden Haskell programmer in me) so I added the restriction that I will create my Pratt Parser in a pure functional manner.

Monads + Pratt?

Being purely functional meant fully embracing immutability. I manage the parser state by making use of the state monad which I have discussed previously which makes use of the computation expression feature in F#. This creates very clean looking code since the state is being passed along in the background which makes it very easy to string functions together.

To create the monadic Pratt parser I define a few key types which are used throughout the code: Symbol<’ex> and InputState<’ex>.

// Parser states consists of 'a which is the return type of any given parser combinator
// and 'ex which is the user defined type of the AST they are building
type Parser<'a, 'ex> = StateMonad<InputState<'ex>,'a>

// A symbol is the core component of the Pratt Parser. You define how the parsing works
// by defining symbols for operators/tokens
and Symbol<'ex> =
    {
        name : string;
        led : Token -> 'ex -> Parser<'ex,'ex>;
        nud : Token -> int -> Parser<'ex,'ex>;
        lbp : int;
    }

// The state type for the parser. Maintains the list of input tokens, the current token,
// the current symbol and the user defined mapping of tokens to symbols
and InputState<'ex> =
    {
        tokens : seq<Token>;
        token : Token option;
        symbol : Symbol<'ex> option;
        symbolMap : Map<string, Symbol<'ex>>
    }

 

The InputState<’ex> type defines all the state that is passed between the functions of the parser. It contains:

  • Tokens: The sequence of tokens received from the lexer
  • SymbolMap: The user defined mapping from a token’s name or value to a parser symbol.
  • Token:  The current token that the parser is on
  • Symbol: The corresponding symbol for the current token

The Symbol<’ex> type contain the user defined logic for how to parse the pieces of the language.It contains:

  • Name: The symbol name matches either a token’s name or value
  • Lbp: The left binding power of this symbol. Determines if it will grab the previous expression.
  • Nud: The function called when this symbol is executed with no arguments
  • Led: The function called when this symbol is executed with the previous expression’s result.

To parse a program you define several symbols and add them to InputState’s initial SymbolMap. The nud and led functions in these symbols will build up pieces of an AST. For example, given an AST defined by the Expression type:

type Expression =
    | Call of Expression * Expression list
    | Cond of Expression * Expression * Expression
    | Let of Expression * Expression
    | Fun of Expression * Expression list * Expression
    | Id of String
    | Int of int
    | Float of float
    | Bool of bool

 

Here is a symbol which describes how to build an identifier node using the Id data constructor:

// Error function for symbols that don't define led
let errorLed token left =
    parser {
        return! error token "Symbol does not take a left argument"
    }

// The nud function for id which returns an Expression type using the Id data constructor
let idNud token rbp =
    parser {
      return Id token.text
    };

// A symbol for identifiers. Defines a nud function but not a
// led function since identifiers don't consume neighboring tokens
let idSymbol =
    {
        name = "ID";
        lbp = 0;
        nud = idNud;
        led = errorLed;
    }

 

Most symbols will follow a similar pattern so I created several helper methods to create symbols and build up the symbol map. I won’t go into each here (you can read the source code) but they let you build the symbol map in a declarative way. For example, this is the definition of a symbol map for a simple language:

let languageSymbolMap = Map.empty |>  

                        // Add empty symbols
                        // Will error if called
                        addEmptySymbol ")" |>
                        addEmptySymbol "THEN" |>
                        addEmptySymbol "ELSE" |>

                        // Simple literal values
                        addLiteralSymbol "INT" Int int |>
                        addLiteralSymbol "FLOAT" Float double |>
                        addLiteralSymbol "BOOL" Bool bool  |>

                        // Binary right associative symbols
                        addInfixR "=" 10 neLed |>
                        addInfixR "!=" 10 neLed |>
                        addInfixR "&&" 30 andLed |>
                        addInfixR "||" 30 orLed |>

                        // Binary left associative symbols
                        addInfixL ">" 40 gtLed |>
                        addInfixL ">=" 40 gteLed |>
                        addInfixL "<" 40 ltLed |>
                        addInfixL "<=" 40 lteLed |>
                        addInfixL "+"  50 plusLed |>
                        addInfixL "-" 50 minusLed |>
                        addInfixL "*" 60 timesLed |>
                        addInfixL "/" 60 divideLed |>
                        addInfixL "^" 70 powerLed|>

                        // Custom symbols
                        addSymbol eqSymbol |>
                        addSymbol idSymbol |>
                        addSymbol leftParenSymbol |>
                        addSymbol letSymbol |>
                        addSymbol funSymbol |>
                        addSymbol ifSymbol |>
                        addSymbol endSymbol

 

Expression Parsing

The heart of the Pratt parser is the expression parsing code which calls the led and nud methods defined on the symbols.  My implementation follows the method in Crockford’s article very closely except that I use recursion instead of a loop and I extend the definition of the nud function to accept a right binding power.

let rec leftBoundExpression rbp token symbol left =
    parser {
        if (rbp < symbol.lbp) then
            do! advance()
            let! newLeft = symbol.led token left
            let! newToken, newSymbol = current()
            return! leftBoundExpression rbp newToken newSymbol newLeft
        else
            return left
    }

let expression rbp =
    parser {
        let! token, symbol = current()
        do!  advance()
        let! left = symbol.nud token rbp
        let! nextToken, nextSymbol = current()
        return! leftBoundExpression rbp nextToken nextSymbol left
    }

 

Adding the right binding power (rbp) parameter to the nud function makes it possible to parse a call expression like the ones found in Haskell or F#

someFunction arg1 arg2 arg2

To parse this expression the nud method for an someFunction must match the multiple expressions that come after it. This creates an issue since an identifier can be matched in two different ways: as a function call with arguments or as a stand alone identifier. We need to ensure the nud function for arg1 doesn’t try to grab arg2. Passing the right binding power to the nud method solves this since the nud for someFunction can indicate that it wants the next expressions with a high binding power. Below is an updated version of the identifier method’s nud function demonstrating this:

let idNud token rbp =
    parser {
        if rbp < 90 then
            let! args = manyExpressions 90
            if args.Length > 0
            then return Call (token, args)
            else return Id token.text
        else
            return Id token.text
    };

 

Parser Combinators

In the code above you may have noticed the call to manyExpressions. This function will match as many expressions as it can and then returns them in a list.

let rec manyExpressions rbp =
    parser {
        let! exp = expression rbp
        let! rest =  (manyExpressions rbp)
        return exp :: rest
    } <|> parser.Return []

 

What is interesting about this function is its base case. It will keep recursing trying to match expressions until it finds a symbol that does not define the nud method which causes the parser to return an error. This is where the <|> parser.Return [] expression takes over.  The <|> symbol is a function defined on the state monad which combines two monads/parsers into one. This kind of function is called a parser combinator. The first parser given to it will execute and if it eventually errors the parsing will backtrack until if finds an alternative (the right argument to <|>) or unwinds the stack completely and returns the error. In the code above the second parser is defined by  parser.Return []  which returns an empty list when the first parser errors.

Those familiar with the Haskell parser combinator library called Parsec may recognize the <|> combinator. In Parsec it is a predictive combinator and will only try the second alternative if the first didn’t consume input. This differs from my implementation which will try the second alternative even if the the first expression consumed hundreds of tokens before it errored.

Parser combinators offer great power since they can be used to effectively provide infinite look-ahead. They make it easy to parse ambiguous language constructs. I only use them in a limited way but there is much room to expand them.

Conclusion

I wasn’t sure at first what a monadic Pratt Parser would look and feel like. However, I am very pleased with how clear and readable it is. The use of computation expressions in F# enabled a simple and concise syntax. The melding of Pratt parsing with parser combinators creates really expressive and powerful parsers. I am interested in exploring this further to see what issues and pitfalls this kind of parser design may have.

All the code for my monadic Pratt parser can be found on this github page. This repository also contains an implementation of a simple functional language which demonstrates using the parser. The key files to look at are:

  • http://blog.barrkel.com/ Barry Kelly

    I think you’re slightly at risk of mistaking Crockford’s implementation of Pratt parsing for the core insight of Pratt parsing. I say this because of what you say at the start: “a Pratt parser works by reading through a stream of tokens and finding a symbol defined for the current token” – but this would deny that a token can occur multiple times in the grammar with different meanings (this happens all the time in practice). Pratt parsing doesn’t forego grammar rules; but it does require that the grammar rules be factored in a certain way – a way that happens to be a very good match for arithmetic operators.

    Encoding arithmetic precedence using the normal top-down technique of LL recursive descent is problematic in terms of performance, because a whole stack of functions needs to be recursed through, from the top-level Expression to the lowest-level Factor, in case the following operator binds more tightly.

    Pratt parsing instead encodes the maximum precedence of the operator as an argument to the recursion, enabling a flattening of the function stack. It’s essentially a way of removing redundant recursion in the anti-pattern that occurs in straight-forward recursive descent encoding of LL grammars. It works excellently in combination with recursive descent. Using Pratt parsing exclusively is likely to be problematic, though, because it lacks context for tokens that may have different precedences at different places in the grammar.

    As to whether Pratt parsing associates actions with rules or operators, that’s an implementation detail, as at the algorithm level most parser techniques are merely recognizers. If the grammar isn’t already suitable, it needs to be transformed to be amenable to Pratt parsing; at which point, whether you consider the action to be associated with the rule or the token you determine whether to parse or not based on precedence, is a matter of point of view.

    • https://www.google.com/accounts/o8/id?id=AItOawlrLeowWytSCscAcNv3ky4tdtP7AcgDAC8 Matthew

      Thanks for the comment Barry. You make several good points.

      I think in Pratt’s paper he argues that it is more natural to associate semantics with tokens over grammar rules. While this doesn’t preclude multiple tokens with different meaning it does dissuade it. From his paper:

      When a given class of phrases is characterized unambiguously by the presence of a particular token, the effect is the same, but this is not always the case in a BNF-style semantic specification, and I conjecture that the difficulty of learning and using a given language specified with a BNF grammar increases in proportion to the number of rules not identifiable by a single token.

      His argument is that structuring a parser this way is more natural since the language creator already has semantics in mind and just wants to apply syntax to it. Again from Pratt’s paper:

      The programmer has in mind a set of semantic objects. His natural inclination is to talk about them by assigning them names, or tokens. He then makes up programs using these tokens, together with other tokens useful for program control, and some purely syntactic tokens. (No clear-cut boundary separates these classes.) This suggests that it is more natural to associate semantics with tokens than with classes of phrases.

      The point you make about it being an implementation detail whether it associates with rule or operator is true. Given an existing grammar you can refactor it to be token based but this could lead to languages where one token has several different meanings. While this is fine it complicates implementation and leads to a more confusing language. Pratt’s point is that if you structure it from the point of view of semantics mapped to tokens you can avoid this confusing and make a clearer language. Once again from his paper:

      …phrase-structure rules interact more strongly than individual tokens because rules can share non-terminals whereas tokens have nothing to share. So our assignment of semantics to tokens has a much better chance of being modular than an assignment to rules. Thus one can tailor the language to one’s needs by selecting from a library, or writing, the semantics of just those objects that one needs for the task in hand, without having to worry about preordained interactions between two semantic objects at the syntactic level.

  • Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1012