24 Mar 2015
Tokenising in C++ with Ragel
Ragel is an alternative to Lex/Flex for creating lexers from a set of regular expressions, but it is far more powerful. Where Lex lets you chain together a set of regular expressions and run actions written in C or C++ in response to a match, Ragel lets you build an arbitrary state machine by combining regular expressions and then run actions at any point during the matching of those expressions. It also supports a wide variety of ‘host’ languages, including C#, Ruby and Go.
We will attempt to recognise a simple set of tokens consisting of variables, numbers and the plus sign.
Var  ::= [a-z][a-z0-9]*
Num  ::= [0-9]+
Plus ::= '+'
To recognise these tokens we define the following Ragel machine. It uses the shorthand syntax for defining a tokeniser machine, which lets us define a set of regular expressions and the actions to run when each expression is matched, similar to the functionality of Lex.
Notice that we have to match whitespace explicitly and ignore it. We will define the macro CAPTURE_TOKEN to expand to the C++ code required to return a token of the given type.
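A machine along these lines would do the job. This is a sketch rather than the article's exact definition; the token names TOKEN_VAR, TOKEN_NUM and TOKEN_PLUS are placeholders of my choosing:

```ragel
%%{
    machine lexer;

    main := |*
        [a-z][a-z0-9]*  => { CAPTURE_TOKEN(TOKEN_VAR);  fbreak; };
        [0-9]+          => { CAPTURE_TOKEN(TOKEN_NUM);  fbreak; };
        '+'             => { CAPTURE_TOKEN(TOKEN_PLUS); fbreak; };
        space;          # match whitespace and ignore it
    *|;
}%%
```

The `|* ... *|` block is Ragel's scanner shorthand: each pattern gets an action, and `fbreak` stops the machine after every match so tokens can be returned one at a time.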
We will represent tokens as a simple C++ struct:
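Something like the following would work; the field and enum names here are illustrative, not necessarily the article's:

```cpp
#include <cassert>
#include <string>

// Hypothetical token kinds matching the grammar above.
enum TokenType { TOKEN_NONE, TOKEN_VAR, TOKEN_NUM, TOKEN_PLUS };

struct Token {
    TokenType   type;
    std::string value;  // the matched text, e.g. "x1" or "42"
};
```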
We now just need to define a simple C++ class to encapsulate the lexer’s state. We define the member variables p, pe and eof, which Ragel needs to keep track of the state of the buffer, and initialise them in the constructor. If the data were being read in chunks from a buffered source you would need to keep updating these pointers until you reached the end of the input stream.
Since we are using a tokeniser machine we also need to set up the ts and te variables, which Ragel will set to point to the start and end of the matched token in the buffer before calling each action.
The final set of Ragel variables (cs, act, top and stack) holds the internal state of the machine.
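Putting those pieces together, the class might look like this. It is a sketch of the state only (the class name, accessor and stack size are my assumptions); the real class would also run Ragel's generated initialisation, e.g. a %% write init; block, in the constructor:

```cpp
#include <cassert>
#include <cstddef>

class Lexer {
public:
    Lexer(const char* data, std::size_t len)
        : p(data), pe(data + len), eof(data + len),
          ts(nullptr), te(nullptr),
          cs(0), act(0), top(0) {}

    // Illustrative accessor so the state can be inspected.
    std::size_t remaining() const { return static_cast<std::size_t>(pe - p); }

private:
    // Buffer pointers: current position, buffer end, end of input.
    const char* p;
    const char* pe;
    const char* eof;

    // Scanner pointers: start and end of the most recent token.
    const char* ts;
    const char* te;

    // State-machine internals used by the generated code.
    int cs;          // current state
    int act;         // scanner bookkeeping: last pattern matched
    int top;         // call-stack depth
    int stack[32];   // state call stack for fcall/fret
};
```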
The one last thing we need to define is the macro which captures the token. This macro writes directly into the token variable defined as a local in the next() function.
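A plausible definition is sketched below, repeating the Token struct so the snippet stands alone. It copies the text between Ragel's ts and te pointers into the local token; the article's actual macro may differ:

```cpp
#include <cassert>
#include <string>

enum TokenType { TOKEN_NONE, TOKEN_VAR, TOKEN_NUM, TOKEN_PLUS };

struct Token {
    TokenType   type;
    std::string value;
};

// Hypothetical definition: assumes a local `Token token;` and the
// Ragel-maintained ts/te pointers are in scope at the expansion site.
#define CAPTURE_TOKEN(t)                              \
    do {                                              \
        token.type  = (t);                            \
        token.value = std::string(ts, te - ts);       \
    } while (0)
```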
And there you go. All done. We can now pass a simple buffer to our lexer and keep calling next() until we run out of tokens. This lexer returns the tokens one at a time as it reads them. You can have Ragel tokenise the whole input in one pass if you want by removing the fbreak calls from the actions and looping until you receive either the end of the input or an error.
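To show the calling pattern self-contained, here is a hand-rolled stand-in for the Ragel-generated scanner; in the real program the body of next() is emitted by Ragel from the machine definition, and the names here are my own:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum TokenType { TOKEN_NONE, TOKEN_VAR, TOKEN_NUM, TOKEN_PLUS };

struct Token {
    TokenType   type;
    std::string value;
};

// Hand-written equivalent of the generated scanner, for illustration only.
class Lexer {
public:
    Lexer(const char* data, std::size_t len) : p(data), pe(data + len) {}

    Token next() {
        while (p != pe && std::isspace(static_cast<unsigned char>(*p)))
            ++p;                                    // skip whitespace
        if (p == pe) return Token{TOKEN_NONE, ""};  // end of input
        const char* ts = p;                         // token start
        if (*p == '+') {
            ++p;
            return Token{TOKEN_PLUS, "+"};
        }
        if (std::isdigit(static_cast<unsigned char>(*p))) {
            while (p != pe && std::isdigit(static_cast<unsigned char>(*p)))
                ++p;
            return Token{TOKEN_NUM, std::string(ts, p - ts)};
        }
        if (std::islower(static_cast<unsigned char>(*p))) {
            while (p != pe && (std::islower(static_cast<unsigned char>(*p)) ||
                               std::isdigit(static_cast<unsigned char>(*p))))
                ++p;
            return Token{TOKEN_VAR, std::string(ts, p - ts)};
        }
        return Token{TOKEN_NONE, ""};               // unrecognised input
    }

private:
    const char* p;   // current position
    const char* pe;  // end of buffer
};
```

Driving it is then just a loop: call next() repeatedly and stop when the sentinel TOKEN_NONE comes back.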
The full code for this article is in this Gist.