Plingdollar

10 Sep 2020

Error Recovery in FParsec

An interesting discussion on error recovery with the nom parser combinator in Rust has been sitting on my “To Read” list for a few months. After re-reading it a few times I have been inspired to try out the approach in that article using the FParsec library.

Just as in that example we will attempt to parse the language defined by the following Parser Expression Grammar (PEG):

expr    ← sum
sum     ← product (('+' / '-') product)*
product ← value (('*' / '/') value)*
value   ← [0-9]+ / '(' expr ')'

The grammar is simple enough, and translating it into a parser combinator is fairly straightforward. Each non-terminal, on the left of the , becomes a parser function. Each production on the right is mapped into the appropriate FParsec primitives:

let expr, exprRef = createParserForwardedToRef()

let value =
    (pint64  |>> Ast.Value ) <|>
    between (pchar '(') (pchar ')') expr

let product =
    let op = pchar '*' <|> pchar '/'
    (value .>>. (many (op >>. value)))
    |>> binNode Product

let sum =
    let op = pchar '+' <|> pchar '-'
    (product .>>. (many (op >>. product)))
    |>> binNode Sum

exprRef := sum

So far so good. Parsing valid inputs produces a ParseResult.Ok containing our syntax tree and all is well.

'123' ~> Value 123L
'1*2' ~> Product (Value 1L,Value 2L)
'(1+2)*3' ~> Product (Sum (Value 1L,Value 2L),Value 3L)

The problem arises when the parser fails.

Error in Ln: 1 Col: 2
(*1)+2)
 ^
Expecting: integer number (64-bit, signed) or '('

FParsec has done an excellent job of making an error message for us. This is a good example of how a parser combinator library can make knocking together a quick parser a breeze. There are a few issues however:

  1. It would be super nice if we always got some form of parse tree from our parser.
  2. Our example (*1)+2) has more than one syntax error in it.

Parsers are generally used in two places. Either as part of a batch program performing a one-off compilation, or by a language server or IDE performing interactive syntax analysis. In either case, we want to provide as much information back to the programmer as possible. If we can get our hands on some form of syntax tree we can still provide a gracefully degraded IDE experience. Recognising all the syntax errors we can at once is helpful to the programmer who is in a tight edit -> compile loop.

The plan to solve this is simple.

The parser should always succeed.

Instead of the current output of the parser, which is just a SyntaxNode, we instead would like it to return a pair of SyntaxNode * Diagnostic list where Diagnostic is some structure containing a Position and an error message.

To achieve this we will take advantage of FParsec’s UserState to build our diagnostic list, and we will create some new parser combinators to backtrack on failure and emit diagnostics where needed.

First lets create a State type to be used by our parser:

type Diagnostic = Diagnostic of Position * string

type State =
    { mutable Diagnostics: Diagnostic list }

Simple enough, we have a mutable list of diagnostics. Let’s add a member function to encapsulate emitting a diagnostic:

    member s.EmitDiagnostic pos err =
        let diag = Diagnostic(pos, err)
        s.Diagnostics <- diag::s.Diagnostics

Nice. Now let’s create some parser combinators that use them. First we introduce the expect combinator. It takes a parser, and an error message. If the parser succeeds we wrap its return in a Some. If it fails we emit a diagnostic and return None.

let expect (p: Parser<'a, State>) e =
    let raiseErr (stream: CharStream<State>) =
        stream.UserState.EmitDiagnostic stream.Position e
        Reply(None)
    attempt p |>> Some <|> raiseErr

Our next is expectSyn. This is a thin wrapper over expect which maps failure to a new SyntaxNode.Error kind.

let expectSyn (p: Parser<SyntaxNode, State>) err =
    expect p err |>> Option.defaultValue SyntaxNode.Error

With this in place we can now start to re-write our parser to emit diagnostics and continue in the case of some errors. The general approach here is to wrap an expect over any right hand parts of a production that should not fail.

let sum =
    let op = pchar '+' <|> pchar '-'
    let err = "expected expression after operator"
    (product .>>.
        (many (op >>. expectSyn product err)))
    |>> (binNode Sum)

In our sum parser we don’t wrap the op, because we expect that to fail when there isn’t a + or -. If there is an operator however we don’t expect the right hand side to fail, so we use expectSyn to ensure it doesn’t. A similar transform is needed for the product parser. With this in place our parser succeeds even on invalid inputs:

'1+' ~>
  ([Sum (Value 1L,Error)],
   [Diagnostic (
     ("test", Ln: 1, Col: 3),
     "expected expression after operator")])

Here the right-hand side of our Sum expression has been stubbed out with an Error. We have a diagnostic too, with a location pointing to where the problem lies.

Buoyed on by this success, let’s add some expects to the value parser:

let value =
    let opn = pchar '('
    let close = expect (pchar ')') "missing closing ')'"
    let body =
      expectSyn expr "expected expression after ("

    (pint64  |>> Value ) <|> between opn close body

We use a plain expect for the closing ), and expectSyn for the body of the parenthesised expression. Once we have seen an opening ( neither of these are optional.

Now we can recognise errors in even more places:

'(1' ~>
  ([Value 1L],
   [Diagnostic (
     ("test", Ln: 1, Col: 3),
     "missing closing ')'")])
'()' ~>
  ([Error],
   [Diagnostic (
     ("test", Ln: 1, Col: 2),
     "expected expression after (")])
'(' ~>
  ([Error],
   [Diagnostic (
     ("test", Ln: 1, Col: 2),
      "missing closing ')'");
    Diagnostic (
     ("test", Ln: 1, Col: 2),
     "expected expression after (")])

Nice! We get two diagnostics for the last one: one for the missing expression body, and one for the closing ).

The last place we need to handle failure is at the root level of the parser. If the parser encounters something that doesn’t look much like an expression at all we need a way to record that and skip over it. To do this we introduce a new error parser. Its job is to skip over a single character, and record a diagnostic:

let error (s: CharStream<State>) =
    match s.ReadCharOrNewline() with
    | EOS -> Reply(ReplyStatus.Error,
                   expected "valid expression character")
    | ch ->
        sprintf "Unexpected character %c" ch
        |> s.UserState.EmitDiagnostic s.Position
        Reply(Error)

This is a little more complex because we need to handle the case that there are no characters left to skip over. In that case the parser should return ReplyStatus.Error so that FParsec can move on and match the end of file. In other cases we create a diagnostic message and return our SyntaxNode.Error value.

We add this in to our root parser as the last value in the choice. Because FParsec considers each choice in turn until one succeeds our error parser will only be called upon if the real parsers weren’t able to make any progress.

let parser = many (expr <|> error) .>> eof

With all that in place let’s try parsing our original malformed expression again:

'(*1)+2)' ~>
  ([Product (Error,Value 1L); Error; Value 2L; Error],
   [Diagnostic (("test", Ln: 1, Col: 8),
                "Unexpected character )");
    Diagnostic (("test", Ln: 1, Col: 5),
                "Unexpected character )");
    Diagnostic (("test", Ln: 1, Col: 2),
                "missing closing ')'");
    Diagnostic (("test", Ln: 1, Col: 2),
                "expected expression after (")])

Much better. We get some kind of parse tree at least, and the diagnostics point to the problems with the missing expression at 1:2. Where we fall down however is with the diagnostics for the missing opening (. This is as far as we come with this example though. Introducing these recovery expressions is as much an art as it is a science.

The full source code for these parsers is available on GitHub.