15 Apr 2015
A Rusty Guide to Tokenising by Hand
I recently wrote about creating a tokeniser in C++ using Ragel. In that post I created a lexer which recognised a set of three tokens. In this post I’ll discuss creating a lexer by hand in Rust to recognise the same language.
The job of a tokeniser is to break down a piece of text into a stream of tokens, or lexemes. A token consists of two parts: an identifier which tells us which regular expression the string matched, and an optional value. If the token represents a number then the value might be the integer representation of that number.
We defined the language with the following three definitions:
Var  ::= [a-z][a-z0-9]*
Num  ::= [0-9]+
Plus ::= '+'
There are four main stages to creating a tokeniser from this set of regular expressions:
- Create a nondeterministic finite-state automaton (NFA) which accepts each of the regular expressions.
- Join these automata together to create a single NFA which accepts all of the regular expressions.
- Convert the NFA to a deterministic finite-state automaton (DFA).
- Write a program which implements the DFA.
We will start with the simplest regular expression, '+'. The NFA that accepts this has two states. We move from the initial state to the final state when we see a '+'.
The machine that accepts Num tokens also has two states. The first transition from the start state ensures that at least one digit is matched. We then allow any number of additional digits with a looping transition.
The final FSA, which accepts Var tokens, has two main parts. The first is the two-node machine which accepts [a-z]. This is then joined with an ε transition to a one-state machine which accepts the Kleene star [a-z0-9]*.
Once we have the NFAs for each of the individual tokens we can combine them into a single machine. We do this by adding a new start state and then adding an ε transition from this new state to the start state of each of the individual machines.
We have also added in a machine which accepts any number of whitespace characters. This will allow us to separate each token by optional whitespace.
NFA → DFA Conversion
To convert the NFA to a DFA we will create a DFA in which each state represents a set of NFA states.
To calculate these sets of states we compute ε-closures. We start with the ε-closure of the start state, then repeatedly pick a transition we haven't yet followed and add a new DFA state for the ε-closure of the set of NFA states that transition reaches, until no unfollowed transitions remain.
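The ε-closure itself is a small worklist algorithm. As a sketch (the representation here, with NFA states as integers, is my own assumption rather than the post's code):

```rust
use std::collections::{HashMap, HashSet};

/// Compute the ε-closure of a set of NFA states: every state reachable
/// from the set by following only ε transitions.
fn epsilon_closure(
    states: &HashSet<u32>,
    epsilon: &HashMap<u32, Vec<u32>>, // state → states reachable via one ε step
) -> HashSet<u32> {
    let mut closure = states.clone();
    let mut stack: Vec<u32> = states.iter().copied().collect();
    while let Some(s) = stack.pop() {
        for &t in epsilon.get(&s).into_iter().flatten() {
            // Only push states we haven't seen, so the loop terminates.
            if closure.insert(t) {
                stack.push(t);
            }
        }
    }
    closure
}
```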
| DFA State | NFA States | Transitions | Final |
|-----------|------------|-------------|-------|
Now all that’s left to do is to write a program that implements the state machine we have constructed. We’ll start by defining the three structures that we will use.
The first is the Tok structure. This represents each token we generate. A nice feature of Rust is that we can store the value directly in the enum variant along with the identifier of the token. This is really just syntactic sugar for a tagged union in C-land, but with one major advantage: the type safety enforced by the Rust language means you can't read the wrong type of data out.
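As a sketch of what this might look like (the variant names here are my assumption, not necessarily the post's exact code):

```rust
// Each variant carries its value directly, making this a
// type-safe tagged union.
#[derive(Debug, PartialEq)]
enum Tok {
    Var(String), // an identifier matching [a-z][a-z0-9]*
    Num(i64),    // a number matching [0-9]+, stored as its integer value
    Plus,        // the '+' operator carries no value
}
```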
Next up is the state enumeration. It’s a pretty simple one. One thing to note is that we make sure it implements
PartialEq so we can compare values of it later on.
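A minimal sketch, with one state per DFA node (the state names are assumptions). Deriving PartialEq is what lets us compare states with == later on:

```rust
// Start is the only non-accepting state in this machine;
// every other state corresponds to a recognised lexeme.
#[derive(Debug, PartialEq, Clone, Copy)]
enum State {
    Start,
    InVar,
    InNum,
    SeenPlus,
    InWhitespace,
}
```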
The final structure is the tokeniser itself. This stores the buffer of characters to be tokenised, represented as a
String, and the start of the token currently being tokenised. Once we finish consuming a token we advance this index so that the next token in the stream can be recognised.
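A sketch of the tokeniser's state, with field names assumed:

```rust
// The tokeniser owns the text and remembers where the
// current token starts.
struct Tokeniser {
    buffer: String, // the text to tokenise
    index: usize,   // byte offset where the current token starts
}

impl Tokeniser {
    fn new(input: &str) -> Tokeniser {
        Tokeniser { buffer: input.to_string(), index: 0 }
    }
}
```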
The final state machine then loops through the characters in the buffer, starting from where it left off. If the character refers to a valid transition in the state machine it moves to that state and keeps on looking for more characters. Once we can no longer move to another state we return a token based on the last accepting state that we entered.
Since in this state machine we can't move to any non-accepting state after entering an accepting one, we make the simplifying assumption that the last accepting state was simply the last state. If the last state wasn't an accepting state then we return None. If it was, we return a Some(Tok) with the right data read from the slice of the buffer which we just recognised.
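Pulling the pieces together, the whole machine might be sketched as follows. The types are repeated here so the example is self-contained, and the names and transition encoding are assumptions rather than the post's exact code:

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Var(String),
    Num(i64),
    Plus,
}

#[derive(PartialEq, Clone, Copy)]
enum State {
    Start,
    InVar,
    InNum,
    SeenPlus,
    InWhitespace,
}

struct Tokeniser {
    buffer: String,
    index: usize,
}

impl Tokeniser {
    fn new(input: &str) -> Tokeniser {
        Tokeniser { buffer: input.to_string(), index: 0 }
    }

    /// Run the DFA from `self.index`, returning the next token,
    /// or `None` once no further token can be recognised.
    fn next_token(&mut self) -> Option<Tok> {
        let mut state = State::Start;
        let mut end = self.index;
        for c in self.buffer[self.index..].chars() {
            // Look up the transition for this character, if one exists.
            let next = match (state, c) {
                (State::Start, 'a'..='z') => State::InVar,
                (State::InVar, 'a'..='z') | (State::InVar, '0'..='9') => State::InVar,
                (State::Start, '0'..='9') | (State::InNum, '0'..='9') => State::InNum,
                (State::Start, '+') => State::SeenPlus,
                (State::Start, w) | (State::InWhitespace, w)
                    if w.is_whitespace() => State::InWhitespace,
                // No transition: stop and emit a token from the last state.
                _ => break,
            };
            state = next;
            end += c.len_utf8();
        }
        let lexeme = &self.buffer[self.index..end];
        self.index = end;
        match state {
            State::InVar => Some(Tok::Var(lexeme.to_string())),
            State::InNum => Some(Tok::Num(lexeme.parse().unwrap())),
            State::SeenPlus => Some(Tok::Plus),
            State::InWhitespace => self.next_token(), // skip whitespace, try again
            State::Start => None, // nothing consumed: end of input or bad character
        }
    }
}
```

Because the accepting state is always the last state reached, the loop never needs to back up; it just stops at the first character with no transition and emits the token for wherever it ended.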
For more code take a look at the full microcrate.