Lexical Grammar
The lexical grammar is defined by applying the Token derive macro to an enum type that represents the token type.
Each enum variant represents a token variant. To specify the scanning rule for an individual variant, you annotate that variant with the #[rule(...)] macro attribute and provide a regular expression to match the corresponding token. The macro uses these expressions to build an optimized finite-state automaton, from which it generates the scanning program.
From the JSON example:
```rust
use lady_deirdre::lexis::Token;

#[derive(Token, Clone, Copy, PartialEq, Eq)]
#[define(DEC = ['0'..'9'])]
#[define(HEX = DEC | ['A'..'F'])]
#[define(POSITIVE = ['1'..'9'] DEC*)]
#[define(ESCAPE = '\\' (
    | ['"', '\\', '/', 'b', 'f', 'n', 'r', 't']
    | ('u' HEX HEX HEX HEX)
))]
#[lookback(2)]
#[repr(u8)]
pub enum JsonToken {
    EOI = 0,

    Mismatch = 1,

    #[rule("true")]
    True,

    // ...

    #[rule('}')]
    BraceClose,

    // ...

    #[rule('"' (ESCAPE | ^['"', '\\'])* '"')]
    String,

    #[rule('-'? ('0' | POSITIVE) ('.' DEC+)? (['e', 'E'] ['-', '+']? DEC+)?)]
    Number,

    #[rule([' ', '\t', '\n', '\x0c', '\r']+)]
    Whitespace,
}
```
The type must be a Copy, Eq, #[repr(u8)] enum whose variants have no bodies.
The macro implements the Token trait on this enum, providing not only the scan function itself but also additional metadata about the lexical grammar.
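For instance, once the macro has derived the implementation for JsonToken, you can obtain the token stream of a string by loading it into a token storage. The following is a minimal sketch that assumes the TokenBuffer type, its From constructor, and its chunks iterator from lady_deirdre::lexis; consult the crate documentation for the exact signatures.

```rust
use lady_deirdre::lexis::{SourceCode, TokenBuffer};

fn main() {
    // Scanning happens when the buffer is constructed; the generated
    // JsonToken scanner runs under the hood.
    let buffer = TokenBuffer::<JsonToken>::from(r#"{"x": 10}"#);

    // Each chunk pairs a token instance with the substring it covers.
    // (The `chunks` iterator over the full span is assumed here.)
    for chunk in buffer.chunks(..) {
        if chunk.token == JsonToken::Mismatch {
            println!("unrecognized fragment: {:?}", chunk.string);
        }
    }
}
```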
Special Tokens
The EOI and Mismatch variants must be present in the enum.
The EOI ("end-of-input") variant denotes the end of the source code. While the scanner does not scan this token explicitly, it uses this variant in certain API functions that return this token instance to indicate the end of the input stream.
The Mismatch tokens are utilized by the scanning algorithm to represent source code fragments that do not adhere to any lexical grammar rules. In essence, this serves as a fallback token, indicating that the source code contains unrecognized elements ("abracadabra").
In Lady Deirdre, scanning is, in principle, an infallible process. If the scanning algorithm encounters a portion of the source code it cannot identify, it emits the Mismatch token into the output stream. Depending on the parser, this token may be recognized as a syntax parsing error.
Note that the macro identifies these two variants by their discriminants (0 and 1, respectively) rather than by their names.
Regular Expressions
Inside the #[rule(...)] macro attribute, you specify the regular expression that matches this token kind.
The expression language consists of a set of operators commonly found in typical regular expression languages. These include character range match operators (['a'..'z', '0'..'9']), repetition operators (+, *, ?), and character classes ($upper, $alpha, etc.).
The macro supports predefined expressions (via #[define(Name = Expr)], as shown in the example above) that can be inlined as-is, by name, into any other expression; recursive definitions are not allowed.
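A definition can also refer to previously defined names, just as HEX refers to DEC in the JSON example above. Here is a hypothetical token enum illustrating this composition (all names are illustrative):

```rust
use lady_deirdre::lexis::Token;

#[derive(Token, Clone, Copy, PartialEq, Eq)]
// LETTER is inlined by name into WORD, and WORD into the rule below.
// A definition may not refer to itself, directly or indirectly.
#[define(LETTER = ['a'..'z', 'A'..'Z', '_'])]
#[define(WORD = LETTER (LETTER | ['0'..'9'])*)]
#[repr(u8)]
pub enum IdentToken {
    EOI = 0,

    Mismatch = 1,

    #[rule(WORD)]
    Identifier,

    #[rule([' ', '\t', '\n', '\r']+)]
    Whitespace,
}
```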
Grammar Ambiguity
Every token scanning expression must match at least one character, as Lady Deirdre does not allow empty tokens.
Additionally, the rules must be mutually exclusive. For instance, if you have a scanning rule for identifiers, ['a'..'z']+, and a dedicated keyword rule, "package", both could potentially match the same text fragment "package".
To resolve this, you should prioritize the keyword variant over the identifier by annotating it with a higher priority number, #[priority(1)] (the default priority being zero). Prioritizing the identifier instead would render the keyword rule inapplicable, which is also considered an error.
The macro checks for these issues during compile time and yields corresponding error messages when ambiguity errors are detected.
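A sketch of the identifier-versus-keyword setup described above (the enum and variant names are illustrative):

```rust
use lady_deirdre::lexis::Token;

#[derive(Token, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum PkgToken {
    EOI = 0,

    Mismatch = 1,

    // Without the priority attribute, "package" would also match the
    // identifier rule below, and the macro would report an ambiguity error.
    #[priority(1)]
    #[rule("package")]
    PackageKeyword,

    #[rule(['a'..'z']+)]
    Identifier,

    #[rule([' ', '\t', '\n', '\r']+)]
    Whitespace,
}
```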
Lastly, the JsonToken example above uses a #[lookback(2)] macro attribute, which specifies how many characters the rescanning algorithm should step back to restart the incremental rescanning process. By default, this value is 1, meaning that the rescanner must start at least one character back from the edited fragment. In the case of JSON, we need to ensure that there are at least two characters available so that the rescanner can continue rescanning incomplete floating-point number literals that end with the dot character.
Debugging
You can debug regular expressions by surrounding them with the dump(...) operator: '"' dump((ESCAPE | ^['"', '\\'])*) '"'. This prompts the macro to print the contents of the "dump" argument, displaying the state machine transitions of this specific expression as interpreted by the macro. This feature can be particularly useful when crafting complex lexical rules.
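For instance, applied to the String rule from the JsonToken example above, the attribute would look like this (the transitions are printed during compilation):

```rust
// The macro prints the state-machine transitions of the wrapped
// repetition subexpression when it processes this rule.
#[rule('"' dump((ESCAPE | ^['"', '\\'])*) '"')]
String,
```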
Additionally, you can annotate the enum type with the #[dump] macro attribute. This instructs the macro to print its generated output to the terminal, allowing you to inspect the generated code. This is similar to using the macro-expand tool, but the output is properly formatted for readability.
Guidelines
It is advisable to keep the lexical grammar as simple and granular as possible, leaving the finer details to the syntax parsing stage.
In particular, I do not recommend scanning entire code comments and string literals during lexical analysis. While in the example above we scan string literals for the sake of simplicity and demonstration, in actual applications it would be preferable to scan just the " character as a single token and to define syntax parsing rules for strings in the syntax parser.
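A hypothetical fragment of this approach, where only the quote character forms a token and the parser assembles the string contents:

```rust
// Instead of a whole-literal rule like
//     #[rule('"' (ESCAPE | ^['"', '\\'])* '"')]
// scan just the quote character as its own token:
#[rule('"')]
Quote,
```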
Informally speaking, you should only scan text portions between which you would navigate when using the "ctrl ←" and "ctrl →" keyboard shortcuts in a code editor.