Plingdollar

Red Green Syntax Trees - an Overview

It’s no secret that I’m a big fan of a good parser. One of the key parsing techniques that I have tried to focus on in the past is “infallible parsing”. The idea that for any input the parser should always generate some output. This is important to try and give as accurate a set of errors as possible when malformed or partial code is being analysed. An idea that builds on this is lossless syntax trees. In this post I’ll discuss a high level overview of a kind of lossless tree, discuss where it came from, and how it can help when parsing and compiling.

The idea of red / green syntax trees initially came from the Roslyn C# compiler. It was adopted by Swift in their libsyntax implementation. The Rust Analyzer project built further upon this with the Rowan crate. Rowan extends the model of red-green syntax trees with the idea of dynamically typed nodes.

The core of the idea is to represent parsed syntax as two distinct trees. The base tree is the green tree. This represents pieces of concrete syntax, such as the character 1, the operator +, or the expression 1 + 1. The green tree contains no position information. Each node in the tree is immutable meaning that portions of the tree with the same shape can share nodes.

For example given the 1 + 1 expression we could represent it as the following green tree:

           Node BIN_EXPR
                 |
     +-----------+-----------+
     |           |           |
 Token '1'   Token '+'   Token '1'

But we could equally share the token’s for 1 and use the following tree:

           Node BIN_EXPR
                 |
     +-----------+-----------+
     |           |           |
     |       Token '+'       |
     |                       |
     +------ Token '1' ------+

To the outside observer there should be no difference. This sharing of sub-trees is more useful where larger parts of the tree can be shared, such as repeated uses of the same path or calls to the same method.

An simplified definition of a tree might look like the following in F#:

type GreenToken =
	{ Kind: SyntaxKind
	  Text: string }
and GreenNode =
	{ Kind: SyntaxKind
	  Width: int
	  Children: Choice<GreenNode, GreenToken> list }

In the green tree each node has no specific location, only a width. The width of a token is the width of the text it contains. The with of nodes is cached on the node. It is the sum of the widths of its children.

On its own the green tree represents abstract syntactic elements. To provide a richer view over the top the red tree is the a layer over the top similar to the following:

type SyntaxToken =
	{ Offset: int
	  Parent: SyntaxNode option
	  Green: GreenToken }
and SyntaxNode =
	{ Offset: int
	  Parent: SyntaxNode option
	  Green: GreenNode }

Each node is extended with two things: a reference to the parent node, and an absolute offset of the start of this node form the beginning of the source text. The offset paired with the widths from the green nodes provides a source range for each element.

Nodes in the red tree all have a reference to a parent node which can be used to traverse the tree and find siblings. A red node fabricates it’s children lazily when a traversal descends into the node. This means we only pay the cost for the portion of the red tree we actually use. Once we no longer reference a portion of the red tree it is free to be garbage collected.

I’m currently experimenting with these ideas for a re-write of the Feersum scheme Compiler’s parser. See the Firethorn project for the red / green tree implementation.

For more details on this approach see:

Another notable mention, although logically distinct, is the Oil shell’s lossless syntax tree.

Comments