Plingdollar

24 Mar 2015

Tokenising in C++ with Ragel

Ragel is an alternative to Lex/Flex for creating lectures from a set of regular expressions. It is far more powerful however. Where Lex allows you to chain together a set of regular expressions and run some actions written in C or C++ in response to a match Ragel allows you to generate an arbitrary state machine by combining regular expressions and then run actions at any point in the matching of those expressions. It also supports a wide variety of ‘host’ languages, including C#, Ruby and Go.

We will attempt to recognise a simple set of tokens consisting of variables, numbers and the plus sign.

Var ::= [a-z][a-z0-9]*
Num ::= [0-9]+
Plus ::= '+'

To recognise it we define the following Ragel machine. It uses the shorthand syntax for defining a tokeniser machine. This allows us to define a set of regular expressions actions that will be run when the expression is matched similar to the functionality of lex.

%%{
  machine ExampleLexer;

  main := |*

    digit+ => {
      CAPTURE_TOKEN(Num);
      fbreak;
    };

    alpha (alpha | digit) * => {
      CAPTURE_TOKEN(Var);
      fbreak;
    };
    
    '+' => {
        CAPTURE_TOKEN(Plus);
        fbreak;
    };

    space+;
  *|;
}%%

Notice that we explicitly have to match whitespace and ignore it. We will define the macro CAPTURE_TOKEN to expand to the C++ code we require to return a token of that type.

We will represent tokens as a simple C++ struct:

struct Token
{
    enum TokenType {
        Var,
        Num,
        Plus,
        End,
        None
    };

    TokenType type;
    std::string value;
};

We now just need to define a simple C++ class to encapsulate the lexer’s state. We define the member variables p, pe and eof which Ragel needs to keep track of the state of the buffer and initialise them in the constructor. If the data was being read in chunks from a buffered source you would need to keep on updating these pointers until you reached the end of the input storem.

Since we are using a tokeniser machine we need to set up the ts and te variables which Ragel will set to point to the matched token in the buffer before calling each action.

The final set of Ragel variables (act, cs, top, stack) define the internal workings of the state machine.

class Lexer
{
public:
    Lexer(std::string& input)
      : buffer(input) {

        %%write init;

        // set up the buffer here
        p = buffer.c_str();
        pe = p + buffer.size();
        eof = pe;
    }

    Token* next() {
        auto token = new Token();
    
        token->type = Token::None;
    
        do {
    
            if (cs >= ExampleLexer_first_final) {
                token->type = Token::End;
            }
    
            %%write exec;
            
            if (cs == ExampleLexer_error) {
                token->type = Token::None
                return token;
            }
            
        } while (token->type == Token::None);
    
        return token;
    }

private:
    // buffer state
    const char* p, * pe, * eof;
    // current token
    const char* ts, * te;
    // machine state
    int act, cs, top, stack[1];

    std::string buffer;
};

The one last thing we need to define is the macro which captures the token. This macro basically writes directly into the token variable defined as a local in the next() method.

#define CAPTURE_TOKEN(t) \
  token->value = std::string(ts, te-ts); \
  token->type = Token::t

And there you go. All done. We can now pass a simple buffer to our lexer and keep on calling next() until we run out of tokens. This lecture will return the tokens one at a time as it reads them. You can have Ragel tokenise the whole input if you want by removing the fbreak calls from the actions and looping until you receive either None or End.

The full code for this article is in this Gist.