Introduction

Lady Deirdre Logo

Lady Deirdre is a framework that helps you develop front-end code analysis tools, such as code editor language extensions, programming language compilers and interpreters, and even new code editors.

This guide will explain the main concepts of the API and walk you through the steps of developing an analysis tool.

The book assumes that you already have some experience with the Rust programming language and that you understand the core concepts of classical compilers: lexical scanners, syntax parsers, regular expressions, context-free grammars, etc.

If you have prior experience with code editor plugin development and the LSP protocol in particular, it will certainly help you understand the material but is not strictly required. The book will provide you with a brief overview of the core concepts behind these tools.

This work is proprietary software with source-available code.

To copy, use, distribute, and contribute to this work, you must agree to the terms and conditions of the General License Agreement.

Copyright (c) 2024 Ilya Lakhin (Илья Александрович Лахин). All rights reserved.

Overview

The Domain

The program you are developing could function both as a programming language compiler and as a code editor extension simultaneously. Lady Deirdre does not strongly distinguish between these domains, so we will collectively refer to the program under development as the Compiler.

The input data for the Compiler is a compilation project: a set of semantically interconnected source code text files. We refer to each individual file within a project as a compilation unit. A compilation unit could be a real file stored on disk or have a more abstract source (e.g., transferred to the Compiler by the code editor through the LSP communication channel).

The primary purpose of the front-end part of the Compiler is to determine if the compilation project is well-formed (i.e., if there are syntax or semantic errors in the compilation units) and to infer the semantic connections between the source code objects (e.g., all call sites of a function in the source code).

The compilation project is subject to frequent changes, as the end user may modify the source code of the units with every keystroke. The Compiler should keep its internal representation in sync with these changes in real time.

Moreover, the compilation project is often not well-formed. While the end user is writing the source code, it is usually in an incomplete state, with syntax and semantic errors. Therefore, the Compiler should be resilient to these errors, able to continuously synchronize the program's abstract representation with the current state of the source code without halting at the first encountered error.

The Compiler should make its best effort to infer as much metadata as possible from the current state of the source code to assist the end user in the code editor: highlighting references between identifiers, providing code completion suggestions, and enabling semantically meaningful navigation between text symbols.

The Core Concepts

Lady Deirdre separates the processes of lexical scanning, syntax parsing, and semantic analysis.

Lexical and syntax analysis are performed on each compilation unit eagerly using an incremental reparsing approach. With every end-user keystroke, the framework patches the token stream and syntax tree relative to the changes.

As a result, incremental reparsing is usually a fast process even if the unit's source code is large. This reparsing does not alter the parts of the syntax tree outside the typically small reparsing area, which is important for the semantic analysis that relies on the states of the syntax tree.

Semantic analysis, in contrast, is a lazy, demand-driven process for the entire compilation project. Lady Deirdre infers individual features of the semantic model only when you explicitly request these features.

The model is described in terms of user-defined computable functions that compute a specific node's semantic metadata based on the compilation units' syntax tree states and other computable function values. Together, these computable functions form the Semantic Graph.

Lady Deirdre computes and incrementally updates this graph partially and automatically when you request a feature, taking into account the changes in the compilation units.

Compilation Units

The Document object represents the source code text, token stream, and syntax tree of an individual compilation unit.

Through the Document object, you can write to arbitrary fragments of the source code and read its data at any time.

let mut doc = Document::<JsonNode>::new_mutable(r#"{ "foo": 123 }"#);

// Absolute zero-based index.
doc.write(3..6, "bar");

// Line-column one-based index.
doc.write(Position::new(1, 4)..Position::new(1, 7), "baz");

assert_eq!(doc.substring(2..12), r#""baz": 123"#);

// Returns the root node of the syntax tree.
let _ = doc.root();

// Returns an iterator over the syntax errors.
let _ = doc.errors();

// Depth-first traversal (descending and ascending) of the syntax tree and its tokens.
doc.traverse_tree(&mut my_visitor);

// Reads tokens within the token stream.
let _ = doc.chunks(..);

The Document comes in two flavors: mutable and immutable. The mutable Document supports incremental reparsing (as shown in the example above), while the immutable Document does not support incremental reparsing but performs faster when you load the source code text once.

Otherwise, there are no API differences between the two document types, so you can switch between the modes seamlessly. For example, if you switch off the incremental compilation mode of your program, it will function as a pure one-pass compiler.
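
For example, here is a minimal sketch contrasting the two constructors (using the JsonNode grammar shown below):

use lady_deirdre::{lexis::SourceCode, units::Document};

// A mutable document supports subsequent write operations.
let mut mutable = Document::<JsonNode>::new_mutable(r#"{ "foo": 123 }"#);
mutable.write(3..6, "bar");

// An immutable document is loaded once and skips the incremental
// reparsing machinery, which makes one-time loading faster.
let immutable = Document::<JsonNode>::new_immutable(r#"{ "foo": 123 }"#);

assert_eq!(mutable.substring(..), r#"{ "bar": 123 }"#);
assert_eq!(immutable.substring(..), r#"{ "foo": 123 }"#);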

The Document is parameterized with the type that describes the lexical scanner and syntax parser of the language, specifying the individual token type and the syntax tree's node type.

Lexis

First, you need to specify the type of the lexis. Typically, you can do this using the derive macro on your enum type, where the enum variants denote individual token types. The token's lexical scanning rules are described in terms of regular expressions.

From the JSON example:

#[derive(Token)]
pub enum JsonToken {
    #[rule("true")]
    True,

    #[rule('{')]
    BraceOpen,

    #[rule('-'? ('0' | POSITIVE) ('.' DEC+)? (['e', 'E'] ['-', '+']? DEC+)?)]
    Number,
    
    //...
}

The macro will generate a highly optimized lexical scanner based on the provided regex rules.

In Lady Deirdre, the lexical scanning process is infallible. If there are source code fragments that do not match the specified rules, these fragments will be recognized as fallback "mismatch" tokens, which will generate syntax errors during the syntax parsing stage.

Syntax

The syntax grammar is described similarly using enum types and the derive macro.

The node's parsing rules are described in terms of LL(1) grammars, but you can also implement your own custom parsers for individual node types, allowing for custom parse logic with unlimited recursion, including possibly left recursion.

Within the macro's parsing rules, you can capture the results of descending rule applications and reference these results in the enum variant fields.

This system of references forms the node-to-child relationships between the syntax tree nodes, which is useful for depth-first tree traversal. Additionally, the parser establishes ascending node-to-parent relationships, allowing traversal from nodes to the tree root. Lady Deirdre's incremental reparser ensures both kinds of references are kept up to date.

From the JSON example:

#[derive(Node)]
#[token(JsonToken)] // Specifies a type of the Token.
pub enum JsonNode {
    #[rule($BracketOpen (items: ANY)*{$Comma} $BracketClose)]
    Array {
        #[parent] // Node-to-Parent relation.
        parent: NodeRef,
        #[child] // Node-to-Child relation.
        items: Vec<NodeRef>,
    },
    
    //...
}

Most of the language syntax constructs, which can be easily expressed in terms of LL(1) grammar rules, will be described this way. However, some complex parsing rules, such as infix expressions, will be implemented manually using hand-written recursive-descent parsers.

The syntax trees created by Lady Deirdre are, informally speaking, abstract syntax trees where all trivial elements such as whitespaces and comments are intentionally omitted. However, it is worth noting that you can also build a full ParseTree based on the same grammar, which has a different structure useful for implementing code formatters.

The parser generated by the macro is an error-resistant parser capable of recovering from syntax errors in the end user's code. It recovers from syntax errors using the standard "panic mode" algorithm, guided by internal heuristics statically inferred from the grammar. You have the option to explicitly configure the recovery rules for the entire grammar and for individual parsing rules for fine-tuning.

Ownership

In Lady Deirdre, the source code tokens and syntax tree nodes are owned by the Document.

The NodeRef and TokenRef objects are globally unique (composite) numerical indices that point to a specific node or token inside the Document. They are unique in the sense that whenever the incremental reparser removes a node or token, the corresponding index object becomes obsolete forever, and every newly created node or token instance receives a fresh index object that will never clash with index objects previously created by any Document.

Lady Deirdre uses NodeRefs, in particular, to establish parent-child relations between syntax tree nodes.

This approach is convenient in that NodeRefs/TokenRefs, being just numerical indices, are cheap and easy to Copy and are memory-allocation independent. You can easily spread them across the program to address specific objects within a particular document.

But the downside is that to dereference the addressed instance, you always have to have access to the corresponding Document at the dereferencing point.

let doc: Document<JsonNode>;
let token_ref: TokenRef;

let Some(token) = token_ref.deref(&doc) else {
    panic!("TokenRef obsolete.");
};

Traversing

You can traverse the syntax tree either manually by dereferencing the NodeRef index object and inspecting the enum variant fields, or generically, using the NodeRef's grammar-independent functions. These functions include getting the node's parent, children, or siblings, and addressing their children by string or numerical keys.

let doc: Document<JsonNode>;
let node_ref: NodeRef;

let foo_ref: NodeRef = node_ref
    .parent(&doc)
    .last_child(&doc)
    .get_child(&doc, 3)
    .prev_sibling(&doc)
    .get_child(&doc, "foo");

You can also traverse the entire syntax tree or a branch of the tree generically using a visitor.

let doc: Document<JsonNode>;
let branch_ref: NodeRef;

doc.traverse_subtree(&branch_ref, &mut MyVisitor);

struct MyVisitor;

impl Visitor for MyVisitor {
    fn visit_token(&mut self, _token_ref: &TokenRef) {}
    fn enter_node(&mut self, _node_ref: &NodeRef) -> bool { true }
    fn leave_node(&mut self, _node_ref: &NodeRef) {}
}

Semantics

The semantic graph of the compilation project consists of attributes.

An attribute represents a value of arbitrary user-defined type, along with the function that computes this value based on the values of other attributes read within the function.

The value of the attribute can be of any type that implements Clone and Eq.

#[derive(Clone, PartialEq, Eq)]
struct MyAttrValue {
    //...
}

impl Computable for MyAttrValue {
    type Node = MyNode;

    fn compute<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Self> {
        // Computes and returns attribute's value using the `context` object.
    }
}

The purpose of an attribute is to infer meaningful information related to a particular node of the syntax tree. For instance, one attribute of the variable declaration node could infer the type of that variable, while another attribute might infer all variable uses across the code.

Attribute instances are owned by the syntax tree nodes. When defining the node, you place the attributes inside a dedicated variant field annotated with the #[semantics] attribute:

#[derive(Node)]
enum MyNode {
    #[rule(<parse rule>)]
    SomeNodeVariant {
        //...
        #[semantics]
        semantics: Semantics<Attr<MyAttrValue>>,
    }
}

If a node has more than one attribute, you should define a dedicated struct where you would put these attributes. Then, you would use this struct as a parameter of the Semantics<...> object.
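
For instance, a rough sketch of such a struct might look like this (the Feature derive and the #[node(...)] attribute follow the Chain Analysis example; treat the exact names here as assumptions and consult that example for the precise API):

// A sketch only: grouping several attributes of one node variant.
// The attribute value types (VarType, VarUsages) are hypothetical.
#[derive(Feature)]
#[node(MyNode)]
struct VariableSemantics {
    var_type: Attr<VarType>,
    usages: Attr<VarUsages>,
}

// The variant's field then becomes:
//   #[semantics]
//   semantics: Semantics<VariableSemantics>,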

Lady Deirdre computes attribute values only when you query them explicitly.

let analysis_task; // You get this object from the Analyzer (see the next sections).
let node_ref: NodeRef;

let Some(MyNode::SomeNodeVariant { semantics, .. }) = node_ref.deref() else {...};

let (_, attribute_value) = semantics
    .get().unwrap()
    .my_attr.snapshot(&analysis_task).unwrap();

You can find a complete setup example of the syntax tree with semantics in the Chain Analysis example.

Concurrent Analysis

Lady Deirdre is capable of computing the semantic graph of your compilation project in a multi-threaded environment, handling concurrent requests to the semantic attributes. Specifically, the language server can manage parallel requests from the language client (code editor) in dedicated worker threads.

The Analyzer object manages a collection of documents of the compilation project and the semantic graph of the project.

It's assumed that this object will serve as the central synchronization point of your compiler. You can interact with the Analyzer from multiple threads if you're developing a multi-threaded compiler.

At any point in time, you can either edit the source code or query the semantic graph. Due to this reason, the Analyzer doesn't allow direct access to its inner content. Instead, it provides functions that grant access to specific operations.

These functions return RAII-guard-like objects called "tasks" through which necessary operations can be performed.

From the Chain Analysis example:

let analyzer = Analyzer::<ChainNode>::new(AnalyzerConfig::default());

let doc_id;

{
    // A handle object through which we can signal the task's worker
    // to cancel its job.
    let handle = TriggerHandle::new();

    // Requests Mutation task through which we can add new
    // or edit existing documents.
    let mut task = analyzer.mutate(&handle, 1).unwrap();

    // Returns a unique identifier of the new document
    doc_id = task.add_mutable_doc(INPUT);
}

{
    let handle = TriggerHandle::new();
    
    // Requests semantic-analysis task.
    let task = analyzer.analyze(&handle, 1).unwrap();

    // Here we can fetch the document by `doc_id`, traverse its nodes,
    // and query their semantics using the `task` object.
}

The Analyzer's task system supports task priorities and a graceful shutdown mechanism. The inner task manager of the Analyzer can signal the task's worker thread to temporarily interrupt its job based on the currently requested task priorities.

It's important to note that the Analyzer itself is not a thread manager and does not spawn any threads. Thread job management is not a goal of Lady Deirdre.

You can also use the Analyzer from a single main thread only. For example, you can build your compiler to the wasm target and use the Analyzer's tasks sequentially.

Lexis

Lexical scanning is the initial stage of source code text analysis.

During this process, the scanner iterates through the characters of a Unicode string, establishing token boundaries and associating each scanned fragment, delimited by these boundaries, with a corresponding token instance.

The lexical scanner is a simple program that implements a finite-state automaton, looking at most one character ahead. Consequently, the scanner can be restarted at any character of the text, which is particularly beneficial for incremental rescanning. For instance, when an end user modifies a specific portion of the source code text, the scanner restarts just before the altered fragment and eventually converges with the unchanged tail of the text.

The resulting token stream serves as input for the syntax parser.

Lexical Grammar

The lexical grammar is defined using the Token derive macro on an arbitrary enum type, which represents the type of the token.

Each enum variant represents a token variant. To specify the scanning rule for an individual variant, you annotate that variant with the #[rule(...)] macro attribute and provide a regular expression to match the corresponding token.

The macro uses these expressions to build an optimized finite-state automaton, from which it generates the scanning program.

From the JSON example:

use lady_deirdre::lexis::Token;

#[derive(Token, Clone, Copy, PartialEq, Eq)]
#[define(DEC = ['0'..'9'])]
#[define(HEX = DEC | ['A'..'F'])]
#[define(POSITIVE = ['1'..'9'] DEC*)]
#[define(ESCAPE = '\\' (
    | ['"', '\\', '/', 'b', 'f', 'n', 'r', 't']
    | ('u' HEX HEX HEX HEX)
))]
#[lookback(2)]
#[repr(u8)]
pub enum JsonToken {
    EOI = 0,

    Mismatch = 1,

    #[rule("true")]
    True,
    
    // ...

    #[rule('}')]
    BraceClose,
    
    // ...

    #[rule('"' (ESCAPE | ^['"', '\\'])* '"')]
    String,

    #[rule('-'? ('0' | POSITIVE) ('.' DEC+)? (['e', 'E'] ['-', '+']? DEC+)?)]
    Number,

    #[rule([' ', '\t', '\n', '\x0c', '\r']+)]
    Whitespace,
}

The type must be a Copy, Eq, #[repr(u8)] enum whose variants have no bodies.

The macro will implement the Token trait on this enum, providing not only the scan function itself but also additional metadata about the lexical grammar.

Special Tokens

The EOI and Mismatch variants must be present in the enum1.

The EOI ("end-of-input") variant denotes the end of the source code. While the scanner does not scan this token explicitly, it uses this variant in certain API functions that return this token instance to indicate the end of the input stream.

The Mismatch tokens are utilized by the scanning algorithm to represent source code fragments that do not adhere to any lexical grammar rules. In essence, this serves as a fallback token, indicating that the source code contains unrecognized elements ("abracadabra").

In Lady Deirdre, scanning is, in principle, an infallible process. If the scanning algorithm encounters a portion of the source code it cannot identify, it emits the Mismatch token into the output stream. Depending on the parser, this token may be recognized as a syntax parsing error.
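
For example, a minimal sketch (assuming the "@" character is not covered by any JsonToken rule):

use lady_deirdre::lexis::{SourceCode, TokenBuffer};

// "@" does not match any JsonToken scanning rule, so the scanner emits
// Mismatch tokens for these fragments instead of failing.
let buf = TokenBuffer::<JsonToken>::from("true @@@ null");

assert!(buf.chunks(..).any(|chunk| chunk.token == JsonToken::Mismatch));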

1

They are determined by discriminant rather than their names.

Regular Expressions

Inside the #[rule(...)] macro attribute, you specify the regular expression that matches this token kind.

The expression language consists of a set of operators commonly found in typical regular expression languages. These include character range match operators (['a'..'z', '0'..'9']), repetition operators (+, *, ?), and character classes ($upper, $alpha, etc.).

The macro supports predefined expressions (via #[define(Name = Expr)] as shown in the example above) that can be inlined as-is into any other expression by name, without recursion.

Grammar Ambiguity

Every token scanning expression must match at least one character, as Lady Deirdre does not allow empty tokens.

Additionally, the rules must be mutually exclusive. For instance, if you have a scanning rule for identifiers ['a'..'z']+ and a dedicated keyword rule "package", both could potentially match the same text fragment "package".

To resolve this, you should prioritize the keyword variant over the identifier by annotating it with a higher priority number #[priority(1)] (the default priority being zero). Prioritizing the identifier instead would render the keyword rule inapplicable, which is also considered an error.
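
For example, a hypothetical grammar fragment (not part of the JSON example) where a keyword rule is prioritized over an identifier rule:

#[derive(Token, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum MyToken {
    EOI = 0,

    Mismatch = 1,

    // Without the explicit priority, "package" would also match the
    // Identifier rule, making the grammar ambiguous.
    #[priority(1)]
    #[rule("package")]
    PackageKeyword,

    #[rule(['a'..'z']+)]
    Identifier,
}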

The macro checks for these issues during compile time and yields corresponding error messages when ambiguity errors are detected.

Lastly, in the above example, there is a #[lookback(2)] macro attribute that specifies how many characters the rescanning algorithm should step back to restart the incremental rescanning process. By default, this value is 1, indicating that the rescanner must start the process from at least one character back before the edited fragment. In the case of JSON, we need to ensure that there are at least two characters available so that the rescanner can continue rescanning of incomplete floating-point number literals ending with the dot character.

Debugging

You can debug the regular expressions by surrounding them with the dump(...) operator: '"' dump((ESCAPE | ^['"', '\\'])*) '"'. This will prompt the macro to print the contents within the "dump" argument, displaying the state machine transitions of this specific expression as interpreted by the macro. This feature can be particularly useful when crafting complex lexical rules.

Additionally, you can annotate the enum type with the #[dump] macro attribute. This will instruct the macro to print its generated output to the terminal, allowing you to inspect the generated code. This is similar to using macro expansion tools such as cargo-expand, but the output will be properly formatted for readability.

Guidelines

It is advisable to keep the lexical grammar as simple and granular as possible, leaving the finer details to the syntax parsing stage.

In particular, I do not recommend scanning entire code comments and string literals during lexical analysis. In the example above we scan string literals for the sake of simplicity and demonstration, but in actual applications it would be preferable to scan just the " character as a single token and define the parsing rules for strings in the syntax parser.

Informally speaking, you should only scan text portions between which you would navigate when using the "ctrl ←" and "ctrl →" keyboard shortcuts in a code editor.

Token Buffer

The simplest data structure for storing the token stream is the TokenBuffer.

It holds both the source code text and the token stream but has limited incremental rescanning capabilities, only allowing appending to the end of the source code. Token buffers are useful for loading large files from disk or network incrementally, particularly when data arrives in parts.

You are encouraged to use a token buffer if you don't need general incremental rescanning capabilities and only want to store the source code with its tokens, or if you plan to load the source code first and later transfer it to a general-purpose compilation unit storage such as Document.

The token buffer offers the fastest scanning implementation among Lady Deirdre's compilation unit storages, providing high performance when iterating through token chunks and source code substrings.

Also, this object is useful for examining the results of the lexical scanner output.

use lady_deirdre::lexis::{TokenBuffer, SourceCode};

let mut buffer = TokenBuffer::<JsonToken>::from("[1, 2, 3");

assert_eq!(buffer.substring(..), "[1, 2, 3");

buffer.append(", 4, 5, 6]");

assert_eq!(buffer.substring(..), "[1, 2, 3, 4, 5, 6]");

// Prints all tokens in the token buffer to the terminal.
for chunk in buffer.chunks(..) {
    println!("{:?}", chunk.token);
}

Source Code

Lady Deirdre distinguishes between the scanner's implementation, carried out by the Token type, and the source code manager, responsible for actual code storage and lexical scanning through interaction with the scanner's implementation.

Each source code manager implements a SourceCode trait, providing a minimal set of features enabling inspection of its content, including the source code text and its lexical structure.

The crate offers specialized source code managers with distinct API options and performance characteristics. For instance, while the TokenBuffer lacks general incremental rescanning capabilities, it offers slightly better performance than the mutable Document.

Code Inspection

Text Addressing

In Lady Deirdre, the minimal unit for indexing the source code text is the Unicode character.

A Site is a numeric type (an alias of usize) representing the absolute Unicode character index in a string.

SiteSpan is an alias type of Range<Site> that denotes a fragment (or span) of the source code.

Most API functions within the crate conveniently accept impl ToSite or impl ToSpan objects when users need to address specific source code characters or spans of characters. These traits facilitate automatic conversion between different representations of source code indices.

One example of a source code index type is the Position object, which references code in terms of the line and column within the line1. It implements the ToSite trait.

For any type implementing the ToSite trait, ToSpan is automatically implemented for all standard Rust range types with bounds of this type. For instance, both the Site and Position types implement the ToSite trait, making 10..20, 10..=20, 10.., and Position::new(10, 40)..Position::new(14, 2) valid span types.

However, a particular span instance could be invalid; for instance, 20..10 is invalid because its lower bound is greater than its upper bound.

Certain API functions in the crate (e.g., Document::write) require that the specified span must be valid; otherwise, the function would panic. This behavior aligns with Rust's behavior when indexing arrays with invalid ranges.

You can check the validity of a range upfront using the ToSpan::is_valid_span function.

The RangeFull .. object always represents the entire content and is always valid.

use lady_deirdre::lexis::{Position, ToSpan, TokenBuffer};

let mut buf = TokenBuffer::<JsonToken>::from("foo\nbar\nbaz");

assert!((2..7).is_valid_span(&buf));
assert!((2..).is_valid_span(&buf));
assert!((..).is_valid_span(&buf));
assert!(!(7..2).is_valid_span(&buf));
assert!((Position::new(1, 2)..Position::new(3, 1)).is_valid_span(&buf));
1

Please note that the line and column numbers in the Position object are one-based: 1 denotes the first line, 2 denotes the second line, and so forth.

Text Inspection

The following SourceCode functions enable you to query various metadata of the compilation unit's text.

use lady_deirdre::lexis::{SourceCode, TokenBuffer};

let mut buf = TokenBuffer::<JsonToken>::from("foo, bar, baz");

// The `substring` function returns a `Cow<str>` representing the substring
// within the specified span.
// The underlying implementation attempts to return a borrowed string whenever
// possible.
assert_eq!(buf.substring(2..7), "o, ba");

// Iterates through the Unicode characters in the span.
for ch in buf.chars(2..7) {
    println!("{ch}");
}

// A total number of Unicode characters.
assert_eq!(buf.length(), 13);

// Returns true if the code is empty (contains no text or tokens).
assert!(!buf.is_empty());

// A total number of lines (delimited by `\n`).
assert_eq!(buf.lines().lines_count(), 1);

From buf.lines(), you receive a LineIndex object that provides additional functions for querying metadata about the source code lines. For example, you can fetch the length of a particular line using this object.

Tokens Iteration

The SourceCode::cursor and its simplified version chunks allow you to iterate through the tokens of the source code.

Both functions accept a span of the source code text and yield tokens that "touch" the specified span. Touching means that the token's string is fully covered by, intersects with, or at least contacts the span within its bounds.

For example, if the text "FooBarBazWaz" consists of the tokens "Foo", "Bar", "Baz", and "Waz", the span 3..7 would contact the "Foo" token (3 is the end of the token's span), fully cover the "Bar" token, and intersect with the "Baz" token (by the "B" character). However, the "Waz" token is outside of this span and will not be yielded.

In other words, these functions attempt to yield the widest set of tokens that are in any way related to the specified span.

The chunks function simply returns a standard iterator over the token metadata. Each Chunk object contains the token instance, a reference to its string, the absolute Site of the beginning of the token, and the substring length in Unicode characters2.

use lady_deirdre::lexis::{Chunk, SourceCode, TokenBuffer};

let buf = TokenBuffer::<JsonToken>::from("123 true null");

for Chunk {
    token,
    string,
    site,
    length,
} in buf.chunks(..)
{
    println!("---");
    println!("Token: {token:?}");
    println!("String: {string:?}");
    println!("Site: {site}");
    println!("Length: {length}");
}

The cursor function returns a more complex TokenCursor object that implements a cursor-like API with built-in lookahead capabilities and manual control over the iteration process. This object is particularly useful for syntax parsing and will be discussed in more detail in the subsequent chapters of this guide.

To give you a brief overview of the token cursor, the above code could be rewritten with the token cursor as follows:

use lady_deirdre::lexis::{SourceCode, TokenBuffer, TokenCursor};

let buf = TokenBuffer::<JsonToken>::from("123 true null");

let mut cursor = buf.cursor(..);

loop {
    // 0 means zero lookahead -- we are looking at the token the cursor
    // is currently pointing to.
    let token = cursor.token(0);

    // If the cursor reached the end of input, we are breaking the loop.
    if token == JsonToken::EOI {
        break;
    }

    println!("---");
    println!("Token: {:?}", cursor.token(0));
    println!("String: {:?}", cursor.string(0));
    println!("Site: {:?}", cursor.site(0));
    println!("Length: {:?}", cursor.length(0));

    // Manually moves token cursor to the next token.
    cursor.advance();
}
2

Note that the Chunk object represents a valid span and implements the ToSpan trait.

Token References

Lexical structures (tokens) are owned by the source code managers, such as TokenBuffers and Documents, which implement the SourceCode trait through which you can access the tokens.

The TokenRef is a convenient interface containing a composite numeric index that uniquely addresses a token in the source code. As a numeric index, it is a Copy and lifetime-independent object that you can freely use throughout your program. However, TokenRef could potentially represent an invalid or obsolete pointer. Therefore, most TokenRef functions require passing a reference to the source code manager and may return None if the corresponding token does not exist in the manager.

For example, the TokenRef::deref function "dereferences" the token and returns Some if the token exists in the specified compilation unit, or None otherwise.

use lady_deirdre::lexis::{SourceCode, TokenBuffer, TokenCursor, TokenRef};

let mut buf_1 = TokenBuffer::<JsonToken>::from("123 true null");
let buf_2 = TokenBuffer::<JsonToken>::new();

// Get the reference to the first token in the TokenBuffer.
let first_token: TokenRef = buf_1.cursor(..).token_ref(0);

// Gets a copy of the Token instance.
assert_eq!(first_token.deref(&buf_1), Some(JsonToken::Number));

// Gets a string fragment covered by this token.
assert_eq!(first_token.string(&buf_1), Some("123"));

// Checks the validity of the TokenRef for the specified compilation unit.
assert!(first_token.is_valid_ref(&buf_1));

// However, this token reference is not valid for buf_2.
assert_eq!(first_token.deref(&buf_2), None);

// Removing all tokens from the TokenBuffer.
buf_1.clear();

// Consequently, the reference is no longer valid for buf_1 either.
assert_eq!(first_token.deref(&buf_1), None);

The source of TokenRef objects could be token cursors (as in the example above), but typically, you will obtain them by inspecting nodes of the syntax trees.

TokenRef Lifetime

The TokenRef reference is unique in the following ways:

  1. It uniquely addresses a particular compilation unit.
  2. It uniquely addresses a particular token within this unit.

If the incremental source code manager (such as Document) rescans the source code fragment to which the token belongs, its TokenRef reference would effectively become obsolete. Every new token in the Document would receive a new unique instance of the TokenRef object.

The TokenRef::is_valid_ref function tests the validity of the reference for a specified compilation unit.

Nil TokenRef

TokenRefs have one special "nil" value. Nil token references intentionally do not address any tokens within any compilation unit.

These TokenRefs are created with the TokenRef::nil function and can be tested using the is_nil function. The is_nil function returns true only for token references created this way; otherwise, it returns false, even if the TokenRef is obsolete.

Nil TokenRefs are useful to mimic the Option::None discriminant, so you don't have to wrap a TokenRef type in an Option.

The crate API never wraps TokenRef in an Option<TokenRef> type. Instead, if the value cannot be computed, the API uses a nil TokenRef.
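
For example:

use lady_deirdre::lexis::{SourceCode, TokenBuffer, TokenCursor, TokenRef};

let buf = TokenBuffer::<JsonToken>::from("123");

// A nil reference never addresses any token.
assert!(TokenRef::nil().is_nil());
assert_eq!(TokenRef::nil().deref(&buf), None);

// An ordinary reference is not nil, even if it later becomes obsolete.
let first_token: TokenRef = buf.cursor(..).token_ref(0);
assert!(!first_token.is_nil());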

Site References

The Site index, which represents the absolute offset of a Unicode character in the source code text, cannot reliably address a token's absolute offset after source code edits. This is because the token could be shifted left or right, or it could disappear during incremental rescanning, depending on the bounds of the edit.

In contrast, TokenRef::site returns the absolute offset of the beginning of the token's string fragment at the time of the call. In other words, this function returns an updated absolute offset of the token after an edit operation, provided the incremental rescanner did not remove the token during rescanning.

This allows for addressing a token's character bounds relative to changes in the source code.
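
Here is a small sketch of this behavior (assuming that TokenRef::site returns an Option, in line with the other TokenRef accessors, and that the edit below does not force the "222" token itself to be rescanned):

use lady_deirdre::{
    lexis::{SourceCode, TokenCursor, TokenRef},
    syntax::VoidSyntax,
    units::Document,
};

let mut doc = Document::<VoidSyntax<JsonToken>>::new_mutable("111 222");

// The third token (index 2) is "222", starting at site 4.
let second_number: TokenRef = doc.cursor(..).token_ref(2);
assert_eq!(second_number.site(&doc), Some(4));

// Rewriting "111" to "55555" shifts the "222" token to the right,
// and `site` reports the updated offset.
doc.write(0..3, "55555");
assert_eq!(second_number.site(&doc), Some(6));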

The SiteRef helper object (backed by the TokenRef under the hood) addresses token bounds. Specifically, this object addresses either the beginning of the token or the end of the source code.

SiteRef implements the ToSite trait, so it can be used as a valid bound of a range span.

use lady_deirdre::{
    lexis::{SiteRef, SourceCode, TokenCursor},
    syntax::VoidSyntax,
    units::Document,
};

let mut doc = Document::<VoidSyntax<JsonToken>>::new_mutable("foo [bar] baz");

let brackets_start: SiteRef = doc.cursor(..).site_ref(2);
let brackets_end: SiteRef = doc.cursor(..).site_ref(5);

assert_eq!(doc.substring(brackets_start..brackets_end), "[bar]");

// Rewriting "bar" to "12345".
doc.write(5..8, "12345");

assert_eq!(doc.substring(brackets_start..brackets_end), "[12345]");

Similar to TokenRef, the SiteRef interface has a special nil value and the is_nil test function.

Syntax

The next stage of source code analysis after lexical scanning is syntax parsing.

The input for syntax parser is stream of tokens produced during the lexical scanning or incremental rescanning stage.

Syntax parsing, and especially incremental reparsing, is a more complex process that includes error recovery and other features. However, you will find many similarities between the lexical scanning and syntax parsing APIs of Lady Deirdre.

The output of the syntax parser is a syntax tree, which is used in the semantic analysis stage.

Syntax Grammar

You define the syntax grammar using the Node derive macro on an arbitrary enum type that serves as the type for the syntax tree nodes.

Unlike the token enum, the node enum variants are required to have bodies with fields. These fields allow the parser to store parent-child relationships between nodes.

The node's parser is described in terms of LL(1) grammars using the #[rule(...)] macro attributes, which denote individual variant grammar rules.

Within these regex-like parse expressions, you can refer to other variants, establishing recursive descent parsing between the parse procedures.

Additionally, you can name any subexpression with the field: prefix inside the expression. This syntax instructs the generated parser to capture the result of the subexpression matching (whether it be a token or a syntax tree node) and place the matching result into the variant's field with the same name.

This process is called capturing, and it allows the parser to establish the node-to-children descending relationships between nodes.

The opposite ascending node-to-parent relationships are established automatically if you declare a variant field with the #[parent] macro attribute.

From the JSON example:


#[derive(Node)]
#[token(JsonToken)]
#[trivia($Whitespace)]
#[define(ANY = Object | Array | True | False | String | Number | Null)]
#[recovery(
    $BraceClose,
    $BracketClose,
    [$BraceOpen..$BraceClose],
    [$BracketOpen..$BracketClose],
)]
pub enum JsonNode {
    #[root]
    #[rule(object: Object)]
    Root {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        object: NodeRef,
    },

    #[rule(start: $BraceOpen (entries: Entry)*{$Comma} end: $BraceClose)]
    Object {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        start: TokenRef,
        #[child]
        entries: Vec<NodeRef>,
        #[child]
        end: TokenRef,
    },

    #[rule(key: String $Colon value: ANY)]
    Entry {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        key: NodeRef,
        #[child]
        value: NodeRef,
    },
    
    // ...

    #[rule(value: $String)]
    #[secondary]
    String {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        value: TokenRef,
    },
}

The Node macro generates an optimized and error-resistant syntax parser based on the provided grammar rules. The macro allows you to replace individual node-generated parsers with hand-written parsers, where you can implement custom recursive-descent logic with potentially unlimited lookahead and left recursion. Hand-written parsers will be discussed in more detail in the next chapters of this guide.

Macro API

In this chapter, I will intentionally omit some details, referring you to the macro documentation for a more verbose description of the available features, and to the JSON example for a node implementation that utilizes most of the macro's capabilities.

Some general points to note about the macro API are:

  1. The #[token(JsonToken)] macro attribute specifies the type of the token. This attribute is required and denotes the alphabet of the parsing input.
  2. The #[trivia($Whitespace)] macro attribute describes elements that you want to omit automatically in the variant parsers between matching tokens. Trivia is a normal parsing expression that will be repeated zero or more times between each token. Typically, this expression enumerates whitespace tokens and refers to the comment variants of the grammar. The trivia expression can be overridden for each parsable variant (e.g., comments and string parsers might have different trivia expressions).
  3. There is exactly one enum variant annotated with the #[root] macro attribute. This variant denotes the syntax tree root and serves as the entry point of the grammar.
  4. A variant field annotated with the #[node] attribute is a reference to the current node1.
  5. A variant field with the #[parent] attribute is a reference to the parent node of the current node. This field establishes a node-to-parent relation and will be automatically updated by the incremental reparser.
  6. A variant field with the #[child] attribute establishes a node-to-child relation. The name of the field must match one of the capturing operator keys, and the type must correspond to the capturing type (node or token) and the capturing repetition.
1

NodeRef references are similar to TokenRef composite-index references, as they point to particular syntax tree instances of the compilation unit. We will discuss them in more detail in the next chapters as well.

Incremental Reparsing

The parser generated by the macro will be suitable for incremental reparsing.

By default, all node variants are subject to eager caching during incremental reparsing. These variant nodes are called Primary.

If you annotate a variant with the #[secondary] macro attribute, you inform the macro that this node is Secondary, and it should not be cached.

Rule Expressions

The expression syntax of the #[rule(...)] macro attribute is similar to the regular expression syntax of the Token macro, except that inside the parse expression, we match tokens (prefixed with a dollar sign: $Colon) and nodes (without a dollar sign: Object) instead of Unicode characters.

Since the LL(1) parser is a recursive-descent parser that looks at most one token ahead to make a decision in the choice operator (A | B), you should consider the leftmost set2 of the descending rules.

For example, the expression A | B would be ambiguous if both A and B variant rules could start matching with the same token. Similarly, the expression A | $Foo would be ambiguous if A could start with the $Foo token.

All variant rules except the #[root] variant and the trivia expressions must parse at least one token. The Root variant is allowed to parse potentially empty token streams.

The macro will check these and other requirements and yield descriptive error messages if one of the requirements is violated.

Similar to the Token's regexes, you can use the dump(...) operator for debugging purposes, which prints the state-machine transitions, captures, and the leftmost set of the surrounding parse expression.

2

The set of tokens from which the parse rule starts matching directly or indirectly by descending into other rules is called the "leftmost set".

Capturing

The expression operator start: $BraceOpen means that the parser matches the token "BraceOpen" and puts its TokenRef reference into the "start" field of the variant's body.

The operator can capture either a node or a token. If there is something else on the right-hand side, the capture operator will be spread to the inner operands. For example, foo: (Bar | Baz*) means the same as (foo: Bar) | (foo: Baz)*.

The type of the corresponding field depends on what the operator captures (node or token) and how many times. If the operator could be applied no more than once, the field type would be NodeRef or TokenRef, respectively. If the operator could be applied more than once, the type would be a Vec of NodeRef or TokenRef.

Examples:

  • In foo: Bar, the "foo" field would have the type NodeRef.
  • In foo: $Bar, the "foo" field would have the type TokenRef.
  • In foo: Bar & foo: Baz, the "foo" field would have the type Vec<NodeRef>.
  • In foo: Bar*, the "foo" field would also have the type Vec<NodeRef>.
  • In foo: $Bar?, the "foo" field would have the type TokenRef because "Bar" can be matched no more than one time. If the parser never matches "Bar", the "foo" field receives the value TokenRef::nil.

Guidelines

  1. Keep the syntax grammar simple.

    The purpose of the syntax tree is to express the general nesting structure of the source code to assist the further semantic analysis stage. If your language grammar contains rules that require higher-level lookaheads or context-dependent parsing, it might be better to parse a more simplified subset of this syntax at the syntax parse stage, leaving the rest of the analysis to the semantic stage.

  2. Always capture the start and end bounds of the node.

    If your parse rules capture the start and end tokens of the node, either directly or indirectly (by capturing its starting and ending child nodes), these captures help Lady Deirdre properly determine the node's start and end sites. This is especially important for functions such as the NodeRef::span function.

    In particular, for this reason, in the JSON object and array rules, we capture start and end tokens even though they are meaningless in terms of syntax tree traversing.

  3. Annotate the captured fields with the #[child] attribute.

    By annotating the captured field with this attribute, you make it clear that the corresponding field should be treated as a node's child, even if this child is a token.

    For instance, the traverse_tree function relies on this metadata when performing the syntax tree depth-first traversal.

  4. Keep the #[child] fields in order.

    For the same reasons, the captured fields should appear in the variant's body in the same order as they appear in the parse expressions. For example, if you have a #[rule(foo: $Foo & bar: Bar* & baz: $Baz)] rule, the variant fields should come in this order: "foo", "bar", and "baz".

  5. Don't capture semantically meaningless inner tokens.

    Capturing a comma token in a comma-separated list, for example, is likely unnecessary because you wouldn't rely on the list separators when analyzing the list.

    Note that for source code formatting purposes, you would use a dedicated API where all tokens are intentionally included in the parse tree regardless of their omission in the syntax tree.

  6. Prefer wrapping semantically meaningful tokens into dedicated nodes.

    When you encounter an intermediate token in the rule's expression that potentially addresses semantic metadata (e.g., a variable identifier in let x), you always have a choice: either capture it as is or introduce a dedicated node (e.g., "Identifier") that captures the token separately, and then capture the introduced node in the rule.

    For semantic analysis purposes, it would be more convenient to always work with node captures, so you should prefer wrapping.

    For example, in the JSON object entry rule, we capture the entry's key as a node-wrapper (String node) rather than as a token for this reason.

  7. Make the leaf nodes the Secondary nodes.

    By default, all nodes of the syntax tree are subject to eager caching. In practice, the incremental reparser probably performs better if you limit the cache to the structurally complex nodes only and annotate the rest of the node variants with the #[secondary] attribute.

    In the JSON example syntax, Root, Object, Entry, and Array are the primary nodes and could be cached during incremental reparsing. However, the leaf nodes such as String, Number, and others are secondary nodes. The incremental reparser will prefer to reparse their values during reparsing, saving cache memory.

Error Recovering

Once the node variant parser takes control flow, it has to parse the input stream regardless of its content. It must consume at least one token from this stream (if the stream is non-empty, and unless the node variant is a root node) and must return a fully initialized instance of the corresponding node variant.

In other words, variant parsers behave eagerly in an attempt to parse the input, regardless of the context from which they were called.

Given that the input stream is potentially an arbitrary sequence of tokens, the parser must do its best to recognize the rule on this stream and is subject to heuristic error recovery.

The generated parser performs error recovery whenever it encounters a token that is not expected in the current parse state.

For instance, if we are parsing a set of identifiers separated by commas and the end user forgets to put a comma between two identifiers, the parser might decide to continue parsing from the next identifier, yielding a parse error at the position where the comma was absent (a so-called "insert recovery").

The generated parser performs such recoveries based on preliminary compile-time static analysis of the rule expression. However, it usually prefers a "panic recovery" strategy by default, which is the most common error recovery approach in LL grammars.

Panic Recovery

In the panic recovery mode, if the parser encounters a token that is not expected in the current parse position, it eagerly consumes this token and the following tokens until it finds the next token from which it can resume the normal parsing process from its current parse state.

This approach generally works well in many cases, except that the parser might consume too many tokens before finding something meaningful to resume. Sometimes it is better to halt the parsing process earlier and return control flow to the ascending parser. For example, if we are parsing Rust's let x syntax and encounter another let token before reading the variable identifier, it would be better to halt the parsing of the current let-statement, assuming that the user started another statement in the ascending context.

In the macro, you can specify a set of panic-recovery halting tokens using the #[recovery(...)] macro attribute.

In the JSON example, we specify the following recovery configuration:

#[recovery(
    $BraceClose,
    $BracketClose,
    [$BraceOpen..$BraceClose],
    [$BracketOpen..$BracketClose],
)]
pub enum JsonNode {
    // ...
}

This configuration will be applied to all parsing rules, but you can override it for specific rules using the same macro attribute.

In the example above, we configure two halting tokens: "BraceClose" and "BracketClose". Additionally, we set up so-called recovery groups ([$BraceOpen..$BraceClose]). A group consists of two tokens: the open token and the close token of the group. Whenever the recoverer encounters the group's open token that is eventually followed by its matching close token, it skips the entire sequence of tokens between them, regardless of whether that sequence contains halting tokens. In other words, the recoverer treats a system of nested groups as a whole to be skipped during recovery.

In a more realistic grammar than JSON, such as Rust syntax, you would probably use semicolons and the statement-starting tokens ("let", "use", etc.) as common halting tokens, and the open-close braces as groups.
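
As an illustration of a per-rule override, a hypothetical variant-level configuration might look like this (the token set is illustrative and not part of the JSON example):

pub enum JsonNode {
    // ...

    // Hypothetical override: while recovering inside an Entry, also halt
    // on a comma so that control returns to the enclosing Object parser.
    #[rule(key: String $Colon value: ANY)]
    #[recovery($Comma, $BraceClose)]
    Entry {
        // ...
    },

    // ...
}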

Mismatched Captures

If during error recovery the recoverer fails to recognize a token or a node that is a target for capturing, the parser sets enum variant fields to reasonable defaults:

  • TokenRef or NodeRef fields will be set to a nil value.
  • Vectors will be left empty or partially completed (if the parser managed to successfully pass some of the repetition iterations).

Debugging

When designing a syntax parser, it can be useful to perform quick and straightforward observations of the parser's step-by-step actions.

The built-in Node::debug function accepts a string of the source code text and prints to the terminal the hierarchical structure that shows how the parser descends into the node variant parsing procedures and what tokens these procedures consumed. Additionally, it will include the points where the parser detected syntax errors.

For example, the following code:

use lady_deirdre::syntax::Node;
    
JsonNode::debug(r#"{
    "foo": true,
    "bar": [123 "baz"]
}"# );

will print something like this:

 Root {
     Object {
         $BraceOpen
         $Whitespace
         Entry {
             String {
                 $String
             } String
             $Colon
             $Whitespace
             True {
                 $True
             } True
         } Entry
         $Comma
         $Whitespace
         Entry {
             String {
                 $String
             } String
             $Colon
             $Whitespace
             Array {
                 $BracketOpen
                 Number {
                     $Number
                 } Number
                 $Whitespace
                 --- error ---
                 String {
                     $String
                 } String
                 $BracketClose
             } Array
         } Entry
         $Whitespace
         $BraceClose
     } Object
 } Root

Errors Printing

Note that in the above example, the parser encountered a syntax error when parsing the JSON array (missing a comma between 123 and "baz").

You can generically iterate and print syntax errors using the SyntaxError::display function.

use lady_deirdre::{syntax::SyntaxTree, units::Document};

// Parsing the syntax and lexis of the source code into the immutable Document.
let doc = Document::<JsonNode>::new_immutable(r#"{
    "foo": true,
    "bar": [123 "baz"]
}"#);

for error in doc.errors() {
    println!("{:#}", error.display(&doc));
}

This code will print annotated snippets of the source code, pointing to the fragments where the errors occur, along with the default generated error messages.

   ╭──╢ Unit(1) ╟──────────────────────────────────────────────────────────────╮
 1 │ {                                                                         │
 2 │     "foo": true,                                                          │
 3 │     "bar": [123 "baz"]                                                    │
   │                ╰╴ missing ',' in Array                                    │
 4 │ }                                                                         │
   ├───────────────────────────────────────────────────────────────────────────┤
   │ Array syntax error.                                                       │
   ╰───────────────────────────────────────────────────────────────────────────╯

Syntax Tree Printing

Finally, using the CompilationUnit::display1 method, you can print the output syntax tree to the terminal.

use lady_deirdre::{
    syntax::SyntaxTree,
    units::{CompilationUnit, Document},
};

let doc = Document::<JsonNode>::new_immutable(r#"{
    "foo": true,
    "bar": [123 "baz"]
}"#);

println!("{:#}", doc.display(&doc.root_node_ref()));

Outputs:

Root(entry: 0) {
    object: Object(entry: 1) {
        start: $BraceOpen(chunk_entry: 0) {
            string: "{",
            length: 1,
            site_span: 0..1,
            position_span: 1:1 (1 char),
        },
        entries: [
            Entry(entry: 2) {
                key: String(entry: 3) {
                    value: $String(chunk_entry: 2) {
                        string: "\"foo\"",
                        length: 5,
                        site_span: 6..11,
                        position_span: 2:5 (5 chars),
                    },
                },
                value: True(entry: 4) {
                    token: $True(chunk_entry: 5) {
                        string: "true",
                        length: 4,
                        site_span: 13..17,
                        position_span: 2:12 (4 chars),
                    },
                },
            },
            Entry(entry: 5) {
                key: String(entry: 6) {
                    value: $String(chunk_entry: 8) {
                        string: "\"bar\"",
                        length: 5,
                        site_span: 23..28,
                        position_span: 3:5 (5 chars),
                    },
                },
                value: Array(entry: 7) {
                    start: $BracketOpen(chunk_entry: 11) {
                        string: "[",
                        length: 1,
                        site_span: 30..31,
                        position_span: 3:12 (1 char),
                    },
                    items: [
                        Number(entry: 8) {
                            value: $Number(chunk_entry: 12) {
                                string: "123",
                                length: 3,
                                site_span: 31..34,
                                position_span: 3:13 (3 chars),
                            },
                        },
                        String(entry: 9) {
                            value: $String(chunk_entry: 14) {
                                string: "\"baz\"",
                                length: 5,
                                site_span: 35..40,
                                position_span: 3:17 (5 chars),
                            },
                        },
                    ],
                    end: $BracketClose(chunk_entry: 15) {
                        string: "]",
                        length: 1,
                        site_span: 40..41,
                        position_span: 3:22 (1 char),
                    },
                },
            },
        ],
        end: $BraceClose(chunk_entry: 17) {
            string: "}",
            length: 1,
            site_span: 42..43,
            position_span: 4:1 (1 char),
        },
    },
}
1

Keep in mind that this function accepts either a TokenRef or a NodeRef. Supplying a NodeRef of a syntax tree branch allows you to print only the subtree of this branch, while providing a TokenRef enables you to print detailed metadata about the referred token.

Syntax Tree

The syntax API shares many similarities with the lexis API architecture:

  1. The syntax grammar, implemented by the Node type, is distinct from the syntax tree manager responsible for actual parsing and storage of the syntax tree.
  2. The syntax tree manager implements the SyntaxTree trait, providing access to the parsed syntax tree through its functions.
  3. There are several syntax manager implementations with distinct sets of features and performance characteristics.
  4. Individual nodes within the syntax tree are addressed using the NodeRef referential object, which points to concrete node instances owned by the syntax tree manager.

The simplest implementation of the syntax tree manager is the ImmutableSyntaxTree, which performs one-time parsing without incremental reparsing capabilities but has the fastest computation performance.

This object accepts a token cursor providing access to the input stream. For example, you can obtain this cursor from the TokenBuffer.

use lady_deirdre::{
    lexis::{SourceCode, TokenBuffer},
    syntax::{ImmutableSyntaxTree, SyntaxTree},
};

let tokens = TokenBuffer::<JsonToken>::from(r#"{
    "foo": true,
    "bar": [123, null]
}"#);

// Parsing the entire set of tokens in the token buffer.
let tree = ImmutableSyntaxTree::<JsonNode>::parse(tokens.cursor(..));

// Ensuring that the ImmutableSyntaxTree successfully parsed
// the input stream without syntax errors.
assert!(tree.errors().next().is_none());

The above code is verbose because it requires manual setup of the TokenBuffer and its token cursor.

More commonly, we can utilize the immutable Document, which is backed by the TokenBuffer and ImmutableSyntaxTree under the hood. Through this object, we can directly scan and parse the source code text. This object implements both the SourceCode and SyntaxTree traits, allowing us to access the lexical structure of the compilation unit as well.

use lady_deirdre::{syntax::SyntaxTree, units::Document};

let doc = Document::<JsonNode>::new_immutable(r#"{
   "foo": true,
   "bar": [123, null]
}"#);

assert!(doc.errors().next().is_none());

Node References

Instances of nodes in the syntax tree are owned by the syntax tree manager (e.g., by the Document or ImmutableSyntaxTree).

Similar to the TokenRef reference used to access individual tokens in the source code, the NodeRef referential object is used to obtain access to instances of syntax tree nodes.

NodeRefs are cheap to copy and are lifetime-independent objects representing globally unique composite numeric indices. However, their functions require references to the syntax tree managers in order to dereference corresponding nodes owned by the manager. It's important to note that a NodeRef could potentially represent an invalid reference if the node was removed.

use lady_deirdre::{
    syntax::{NodeRef, PolyRef, SyntaxTree},
    units::Document,
};

let doc = Document::<JsonNode>::new_immutable(r#"{
   "foo": true,
   "bar": [123, null]
}"#);

// Returns a referential object that points to the root of the syntax tree.
let root_ref: NodeRef = doc.root_node_ref();

// Documents always have a root.
assert!(root_ref.is_valid_ref(&doc));
assert!(!root_ref.is_nil());

let Some(JsonNode::Root { object, .. }) = root_ref.deref(&doc) else {
    // Validity checked above.
    unreachable!();
};

// Nil NodeRefs are intentionally invalid references within any compilation unit.
assert!(!NodeRef::nil().is_valid_ref(&doc));
assert!(NodeRef::nil().is_nil());
assert!(NodeRef::nil().deref(&doc).is_none());

Polymorphic References

Since both NodeRef and TokenRef can serve as types for the children of syntax tree nodes, they both implement a generic trait PolyRef that provides common functions for both.

For example, PolyRef::span returns the site span of the referred object's bounds.

The PolyRef trait is an object-safe trait, useful for handling tree children without breaking the call chain. For instance, if you are confident that a particular instance of a PolyRef type is a NodeRef, you can use the PolyRef::as_node_ref function to cast the instance to a NodeRef; otherwise, it returns a nil NodeRef without causing a panic if the instance is not a NodeRef.

use lady_deirdre::{
    lexis::{Position, ToSpan},
    syntax::{PolyRef, SyntaxTree},
    units::Document,
};

let doc = Document::<JsonNode>::new_immutable(r#"{
   "foo": true,
   "bar": [123, null]
}"#);

let root_span = doc
    .root_node_ref()
    .as_node_ref() // We are confident that `root_node_ref` returns a NodeRef.
    .span(&doc)
    .unwrap() // We are confident that the root NodeRef is a valid reference.
    .to_position_span(&doc)
    .unwrap(); // Site span can be casted to a Position span.

assert_eq!(root_span, Position::new(1, 1)..Position::new(4, 2));

Finally, Lady Deirdre provides an owned version of the PolyRef trait, known as PolyVariant. PolyVariant is a simple enum with NodeRef and TokenRef variants. You can convert either of these referential objects into a PolyVariant using the PolyRef::as_variant function whenever you need a generic owned referential object for the compilation unit's content.

Note that PolyVariant itself also implements the PolyRef trait.
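 
The following is a minimal sketch of this conversion, assuming PolyVariant is exported from the syntax module alongside PolyRef:

use lady_deirdre::{
    syntax::{PolyRef, PolyVariant, SyntaxTree},
    units::Document,
};

let doc = Document::<JsonNode>::new_immutable(r#"{ "foo": 123 }"#);

// An owned, generic reference to the root node of the syntax tree.
let variant: PolyVariant = doc.root_node_ref().as_variant();

// PolyVariant implements PolyRef itself, so the common functions
// are available on it directly.
assert!(!variant.is_nil());
assert!(variant.span(&doc).is_some());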

Tree Inspection

Query Nodes Manually

When you have a NodeRef reference, you can inspect the structure of the tree locally around this node by directly dereferencing node instances.

use lady_deirdre::{syntax::SyntaxTree, units::Document};

let doc = Document::<JsonNode>::new_immutable(r#"{
   "foo": true,
   "bar": [123, null]
}"#);

let root_ref = doc.root_node_ref();

let Some(JsonNode::Root { object, .. }) = root_ref.deref(&doc) else {
    panic!();
};

let Some(JsonNode::Object { entries, .. }) = object.deref(&doc) else {
    panic!();
};

let Some(JsonNode::Entry { value, .. }) = entries[1].deref(&doc) else {
    panic!();
};

let Some(JsonNode::Array { items, .. }) = value.deref(&doc) else {
    panic!();
};

let Some(JsonNode::Number { value, .. }) = items[0].deref(&doc) else {
    panic!();
};

let Some(string) = value.string(&doc) else {
    panic!();
};

assert_eq!(string, "123");

Alternatively, the above code could be rewritten in a more compact way using the NodeRef's inspection functions without breaking the call chain.

use lady_deirdre::{syntax::SyntaxTree, units::Document};

let doc = Document::<JsonNode>::new_immutable(r#"{
   "foo": true,
   "bar": [123, null]
}"#);

let string = doc
    .root_node_ref()
    .get_child(&doc, "object")
    .get_child(&doc, "entries") // returns the first entry
    .next_sibling(&doc) // gets the second entry
    .get_child(&doc, "value")
    .get_child(&doc, "items") // returns the first item
    .get_token(&doc, "value")
    .string(&doc)
    .unwrap();

assert_eq!(string, "123");

Each of these functions is infallible; they will return a nil NodeRef if they cannot fulfill the request. Therefore, we should be confident about the node configuration we are trying to query.
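 
For example, requesting a child name that the queried node's grammar does not define simply yields a nil NodeRef (the "items" name below deliberately does not exist on the Object node):

use lady_deirdre::{
    syntax::{PolyRef, SyntaxTree},
    units::Document,
};

let doc = Document::<JsonNode>::new_immutable(r#"{ "foo": 123 }"#);

// The Object node has "start", "entries", and "end" children,
// but no "items" child, so the second request yields a nil NodeRef.
let missing = doc
    .root_node_ref()
    .get_child(&doc, "object")
    .get_child(&doc, "items");

assert!(missing.is_nil());
assert!(missing.deref(&doc).is_none());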

Depth-First Traversing

You can perform a depth-first traversal of the entire syntax tree or a specific branch using the SyntaxTree::traverse_tree and SyntaxTree::traverse_subtree functions, respectively.

Both functions require a visitor object to be passed as an argument. This object should implement a Visitor trait, which includes functions that will be triggered when the traversal procedure visits a node or token in the tree, according to the node-child relationships between the nodes.

use lady_deirdre::{
    lexis::TokenRef,
    syntax::{NodeRef, PolyRef, SyntaxTree, Visitor},
    units::Document,
};

let doc = Document::<JsonNode>::new_immutable(r#"{
    "foo": true,
    "bar": [123, null]
}"#);

doc.traverse_tree(&mut MyVisitor(&doc));

struct MyVisitor<'a>(&'a Document<JsonNode>);

impl<'a> Visitor for MyVisitor<'a> {
    fn visit_token(&mut self, token_ref: &TokenRef) {
        println!("Token\n{}", token_ref.display(self.0));
    }
    
    fn enter_node(&mut self, node_ref: &NodeRef) -> bool {
        println!("Enter\n{}", node_ref.display(self.0));
    
        // Tells the traverser to continue descending into this node's branch.
        true
    }
    
    fn leave_node(&mut self, node_ref: &NodeRef) {
        println!("Leave\n{}", node_ref.display(self.0));
    }
}

The visitor is a stateful object that you can mutate during tree traversal. You can use this mechanism to collect common metadata from the syntax tree.

The enter_node function returns a boolean value that controls whether to further descend into the entered node branch.

The leave_node function is triggered after the traversal has finished visiting a node's descendants, effectively visiting the tree nodes in reverse order.
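 
For example, here is a minimal sketch of a stateful visitor that counts the visited nodes and tokens:

use lady_deirdre::{
    lexis::TokenRef,
    syntax::{NodeRef, SyntaxTree, Visitor},
    units::Document,
};

// A visitor that accumulates simple statistics during the traversal.
struct Counter {
    nodes: usize,
    tokens: usize,
}

impl Visitor for Counter {
    fn visit_token(&mut self, _token_ref: &TokenRef) {
        self.tokens += 1;
    }

    fn enter_node(&mut self, _node_ref: &NodeRef) -> bool {
        self.nodes += 1;

        // Continues descending into the node's branch.
        true
    }

    fn leave_node(&mut self, _node_ref: &NodeRef) {}
}

let doc = Document::<JsonNode>::new_immutable(r#"{ "foo": [1, 2, 3] }"#);

let mut counter = Counter { nodes: 0, tokens: 0 };

doc.traverse_tree(&mut counter);

println!("nodes: {}, tokens: {}", counter.nodes, counter.tokens);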

Hand-Written Parsers

The following chapters cover more advanced topics, providing an in-depth exploration of the anatomy of Lady Deirdre's parsers. They will guide you on how to override parsers generated by the Node macro with manually implemented parse functions.

One common case where you might want to implement the parse procedure manually is infix expression parsing. Infix expressions usually require left recursion, which cannot be directly expressed in terms of LL(1) grammars.

These chapters will guide you through the Expr Parser example. This example demonstrates how to parse boolean expressions (e.g., (true | false) & true) using the Pratt algorithm.

Overriding a Parser

To recap, the Node derive macro automatically implements a parse procedure for each enum variant annotated with the #[rule(...)] macro attribute. Inside the rule, you write a regex-like parse expression in terms of an LL(1) grammar, from which the macro generates the parse procedure and determines the leftmost set of tokens from which the procedure starts parsing. The leftmost set is used when you descend into this variant from another variant's rule.

There is a possibility to replace the generated parse procedure with a manually written Rust function using the #[parser(...)] macro attribute.

This attribute accepts a Rust expression that must return an instance of the enum that represents the parsing result product. As an input, you would use the session variable, which is a mutable reference to the SyntaxSession that represents the current state of the parsing environment.

Usually, inside this expression, you would call your parsing function passing the session variable as an argument.

From the Expr Parser example:


#[derive(Node)]
#[token(BoolToken)]
#[trivia($Whitespace)]
pub enum BoolNode {
    #[root]
    #[rule(expr: Expr)]
    Root {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        expr: NodeRef,
    },

    #[rule($ParenOpen | $True | $False)] // Leftmost set.
    #[denote(EXPR)]
    #[describe("expression", "<expr>")]
    #[parser(parse_expr(session))] // Overridden parser.
    Expr {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        content: NodeRef,
    },
    
    //...
    
    #[denote(AND)]
    #[describe("operator", "<and op>")]
    And {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        left: NodeRef,
        #[child]
        right: NodeRef,
    },
    
    //...
}

Leftmost Set is Required

Note that even though we are overriding the parse procedure for the BoolNode::Expr enum variant via the #[parser(parse_expr(session))] macro attribute, we still have to specify the #[rule($ParenOpen | $True | $False)] attribute too.

The macro requires this attribute because it needs to know the leftmost set of the parser. Specifically, when we refer to the Expr variant inside the Root's #[rule(expr: Expr)] parse expression, the macro knows that the Expr parser starts parsing from the "ParenOpen", "True", or "False" tokens, as described in its rule.

You don't need to reimplement the entire grammar of the overridden parse function inside the #[rule(...)] attribute (the macro will ignore it anyway). It is enough to enumerate the leftmost tokens via the | choice operator.

Variants Denotation

Another thing to notice in this snippet is that the BoolNode::And variant does not have a rule attribute, but instead, it has a pair of #[denote(AND)] and #[describe("operator", "<and op>")] macro attributes.

We don't specify the "rule" attribute here because we are going to parse this variant manually inside the "parse_expr" function too.

The denote attribute informs the macro that this variant is subject to parsing (even if it does not have an explicitly expressed grammar rule) and therefore is a legitimate part of the syntax tree.

The macro allows us to specify the #[child], #[parent], and other similar fields in the denoted variants, assuming that their values will be assigned manually. But more importantly, the macro reserves a parse rule number for the denoted variant that we will use inside the manually written parser to address this variant. The number can be accessed through the type's constant with the name that we specify inside the attribute (BoolNode::AND in this case).

If the variant is denoted but does not have a rule, the macro additionally requires specifying the describe attribute, which provides the end-user facing description of this syntax tree node variant. The first parameter is a string that describes the general class of this node variant ("operator"), and the second one is a more specific description of this particular variant ("<and op>"). Lady Deirdre will use this metadata to format error messages for the syntax errors.

Finally, the variants with the rule attribute are assumed to be denoted implicitly. We don't need to denote them manually, but as a rule of thumb, it is recommended to denote and describe all enum variants regardless.

Syntax Session

Inside the hand-written parse function, you will use the session variable provided by the macro-generated code when it invokes this function.

The variable is of type SyntaxSession, which provides an interface to read input tokens and manage the output syntax tree.

The final goal of the parse function is to read some tokens from the syntax session to recognize the target grammar rule and initialize and return an instance of the syntax tree node as a result of the parsing procedure.

Input Tokens Stream

The SyntaxSession trait is, first of all, a supertrait of TokenCursor, which represents the input token stream for the parser.

From this interface, you can read tokens' metadata ahead of the current stream position. For example, the session.token(0) function returns the first token that has not been consumed yet, session.token(1) reads the next token, and so on. Other similar lookahead functions allow you to observe more metadata about the tokens1. However, typically, the syntax parser should only rely on the token instances when making a decision to change its own inner parse state2.

None of these lookahead functions move the input stream forward. Once your parse algorithm has observed a few tokens ahead, analyzed them, and made a decision to actually "consume" these tokens, the algorithm calls the TokenCursor::advance function, which consumes one token and moves the stream position to the next token, or TokenCursor::skip, which allows you to consume several tokens.

For instance, in the Expr Parser example, we are parsing a sequence of whitespaces iteratively by reading the tokens one by one:

fn skip_trivia<'a>(session: &mut impl SyntaxSession<'a, Node = BoolNode>) {
    loop {
        // Looking ahead at the next token.
        let token = session.token(0);

        // If the token is not a whitespace token, finish the parse procedure.
        if token != BoolToken::Whitespace {
            break;
        }

        // Otherwise, if the token is a whitespace, consume it, and resume
        // the loop from the next token. 
        session.advance();
    }
}

Note that the above function, as a helper function in the overall procedure, could potentially parse zero tokens. However, the general algorithm is required to consume at least one token from the non-empty input stream.

1

For instance, session.string(0) would return a substring of the source code text covered by the first token.

2

Furthermore, ideally, the parser should look no more than a single token ahead of the stream position (session.token(0)). The fewer tokens you look ahead, the better incremental reparsing performance you gain. However, this is not a strict requirement; looking a few tokens ahead is generally acceptable.

Error Recovering

If the parsing algorithm hasn't finished yet but encounters an unexpected token in the middle of the parsing procedure — a token that normally shouldn't exist in the input stream based on the current parse state — this is a syntax error.

Conventionally, in a hand-written parser, you should follow the same error recovery approaches that are common to the parsers generated by the macro. Most likely, you would use the panic recovery procedure.

To avoid manually reimplementing the panic recovery algorithm and to be consistent with the auto-generated parsers, Lady Deirdre exposes the Recovery configurable object that implements this algorithm, which is also used inside the macro-generated code.

The Recovery object has the same configuration options that you would use inside the #[recovery(...)] macro attribute: the Recovery::unexpected function adds a halting token, and the Recovery::group function adds a group of tokens that should be treated as a whole.

It is assumed that this object will be constructed upfront in the const context and stored in a static for fast reuse.

Depending on the parsing procedure complexity, you may want to prepare several Recovery objects for various types of syntax errors. For instance, in the Expr Parser example, there are three prepared Recovery objects: one to recover from syntax errors in the operators, one for operands, and one for errors inside the parentheses.

static OPERAND_RECOVERY: Recovery =
    // Creates an unlimited recovery configuration without groups
    // and halting tokens.
    Recovery::unlimited() 
        // Adds a group of parenthesis tokens.
        // The sequence of tokens like `(...)` will be consumed as a whole
        // during the panic recovery.
        .group(BoolToken::ParenOpen as u8, BoolToken::ParenClose as u8)
        // If the recoverer encounters a `)` token somewhere outside of any
        // group, it halts the recovery procedure (the halting token will
        // not be consumed).
        .unexpected(BoolToken::ParenClose as u8);

To apply the panic recovery procedure, you call the Recovery::recover function, passing it the session variable and the set of tokens the recoverer should look for. The function will consume as many tokens as needed according to the configured rules and will return an object describing whether the procedure managed to find the required token or failed to do so due to a specific reason (e.g., a halting token has been reached).

Regardless of the recovery result, you should report the error using the SyntaxSession::failure function.

// A set of tokens that we expect as the leftmost token of an operand.
static OPERAND_TOKENS: TokenSet = TokenSet::inclusive(&[
    BoolToken::True as u8,
    BoolToken::False as u8,
    BoolToken::ParenOpen as u8,
]);

// ...

fn parse_operand<'a>(
    session: &mut impl SyntaxSession<'a, Node = BoolNode>,
    context: NodeRule,
) -> NodeRef {
    loop {
        let token = session.token(0);

        match token {
            // Handling expected tokens.
            BoolToken::True => return parse_true_operand(session),
            BoolToken::False => return parse_false_operand(session),
            BoolToken::ParenOpen => return parse_group(session),

            // Otherwise, try to recover using the panic recovery algorithm.
            _ => {
                // A SiteRef of where the unexpected token was encountered.
                let start_site_ref = session.site_ref(0);

                // Runs the recovery procedure. This function possibly consumes
                // some tokens from the input stream (using `session`).
                let result = OPERAND_RECOVERY.recover(session, &OPERAND_TOKENS);

                // A SiteRef of where the recoverer finishes.
                let end_site_ref = session.site_ref(0);

                // Regardless of the recovery result, the syntax error has
                // to be reported.
                session.failure(SyntaxError {
                    span: start_site_ref..end_site_ref,
                    context,
                    recovery: result,
                    expected_tokens: &OPERAND_TOKENS,
                    expected_nodes: &EMPTY_NODE_SET,
                });

                // If the recoverer failed to recover, finish the parse loop;
                // otherwise, resume parsing from the recovered token stream
                // position.
                if !result.recovered() {
                    return NodeRef::nil();
                }
            }
        }
    }
}

Rules Descending

Whenever your parser needs to descend into other rules, you basically have two options:

  1. Call the SyntaxSession::descend function, which gives control flow back to the parsing environment.
  2. Create and parse the node manually using a pair of functions: SyntaxSession::enter and SyntaxSession::leave.

The result of the descend function is the same as if the parsing environment parsed the requested node itself: it advances the session's token cursor by as many tokens as the rule covers, adds the resulting branch of nodes to the syntax tree, and returns a NodeRef reference to the top node of that branch. In other words, descend performs a normal parsing procedure, except that during incremental reparsing it could utilize the parsing environment's inner cache to bypass actual parsing steps.

You should prefer to use the descend function on the primary nodes whenever possible.

In the Expr Parser example, we are using this method to descend into the subexpression when parsing the expression group surrounded by the (...) parentheses.

fn parse_group<'a>(
   session: &mut impl SyntaxSession<'a, Node = BoolNode>,
) -> NodeRef {
    // Consumes the opening "(" token.
    session.advance();

    // Skips whitespaces in between.
    skip_trivia(session);

    // Parses the inner expression.
    let inner = session.descend(BoolNode::EXPR);

    // In the rest of the code, we are parsing the closing ")" token and return
    // the `inner` NodeRef to the parsed subexpression.
}

Calling the descend function requires you to follow the same requirements as if you were descending into the rule from the Node macro expression:

  1. Left recursion is forbidden. You should not call this function at the beginning of the parse procedure if descending into this rule could directly or indirectly lead to recursive calling of the current parsing procedure. Such a call is likely to result in infinite recursion. However, descending into the same rule in the middle of the parsing is perfectly fine. In particular, the parse_group function recursively descends into the same parse_expr procedure because we forcefully consume the ( token before the call.
  2. The variant you descend into must have a parser; that is, it must be annotated with a #[rule(...)] attribute.

The second method allows you to parse the subnode manually and is generally not subject to the above limitations.

Calling the enter function starts parsing of a subnode. Calling the leave function finishes the subparser and returns the syntax session to the parsing of the parent node. The enter and leave calls must be properly balanced: every entered subparse context must eventually be left.

In the leave function, you specify the instance of the node that is the product of the subparser. This function returns a NodeRef of the product deployed to the syntax tree (similarly to the descend function).

fn parse_true_operand<'a>(
   session: &mut impl SyntaxSession<'a, Node = BoolNode>,
) -> NodeRef {
    // Starts "true" node parsing.
    session.enter(BoolNode::TRUE);

    // Consumes the "true" token.
    session.advance();

    // A NodeRef of the syntax tree node currently being parsed.
    let node = session.node_ref();
    
    // A NodeRef of the parent node that we parsed before entering into
    // the "true" node subparser.
    let parent = session.parent_ref();

    // Finishes "true" node subparser, and returns its NodeRef.
    return session.leave(BoolNode::True { node, parent });
}

Note that both the descend and enter functions require the rule number as an argument. With this number, the parsing environment keeps track of how the parsing procedures nest, which is important for building the syntax tree.

These numbers are the constants that were specified in the #[denote(TRUE)] macro attributes when we set up the derive macro.

Left Recursion

Lady Deirdre handles left recursion by lifting syntax tree nodes to their siblings, such that a node becomes a child of its former sibling.

When parsing code such as true & false, you first parse the true operand. If the input stream finishes at this step, you return this node as the result product of parsing. Otherwise, when the parser encounters an & token, it starts a new binary operator parser, immediately lifts the previously created operand into the operator's context, then parses the second operand and finishes the operator parser. You can repeat this procedure iteratively to create a left-rotated binary tree.

The SyntaxSession::lift function "transplants" the syntax tree branch created just before we enter the new node subparser to the context of this subparser. In particular, this function automatically changes the parent NodeRef of the former sibling to the node that we start parsing.

From the operator parser of the Expr Parser example:

BoolToken::And => {
    if binding >= 2 {
        return accumulator;
    }

    // The `accumulator` is the result product of the overall parse procedure.
    let left = accumulator;

    // Entering into the `&` operator subparser.
    let node = session.enter(BoolNode::AND);
 
    // The accumulated product could be Nil due to syntax errors.
    // In this case, we should not and cannot lift it.
    if !left.is_nil() {
        // Makes the former accumulated node the child (the left-hand operand)
        // of the currently parsed operator.
        session.lift(&left);
    }

    let parent = session.parent_ref();

    // Consumes the `&` token.
    session.advance();
    skip_trivia(session);

    // Parses the right-hand side of the operator.
    let right = parse_operator(session, BoolNode::AND, 2);

    // Finishes operator subparser, and sets the result to the `accumulator`
    // for reuse on the next loop step.
    accumulator = session.leave(BoolNode::And {
        node,
        parent,
        left,
        right,
    });
}

Nodes Relations

In contrast to the macro-generated parsers, in the hand-written parser, you have to instantiate and properly initialize the instance of the node manually: when you return the final result from the overall parse procedure, and when you finish the inner subparsers via the leave function.

To set up the node-to-parent relation, you can use the SyntaxSession::parent_ref function that returns a NodeRef reference to the parent node in the syntax tree of the currently parsed node.

The SyntaxSession::node_ref function returns a NodeRef reference to the currently parsed node that will be deployed into the syntax tree when the parser finishes the parsing (or subparsing) process.

To set up the child NodeRefs, you can use the result of the descend and leave functions.

Whenever the parse procedure encounters syntax errors that cannot be recovered, the parser function should set the child references to the most reasonable defaults following the same approach as in the macro-generated parsers.

Pratt's Algorithm

In this chapter, I will explain how the algorithm implemented in the hand-written parser in the Expr Parser example works in general. You may find this approach useful for programming languages with infix expressions (math expressions with binary operators).

To recall, the example parses expressions of simple boolean logic: true and false are the atomic operands of the expression, _ & _ and _ | _ are conjunction and disjunction binary operators respectively, where the conjunction has a priority over disjunction (true & false | false & true means (true & false) | (false & true)). Finally, the language has a parenthesis grouping operator (e.g., (true | false)).

In theory, we could describe such a language in terms of an ordinary LL(1) grammar, for example, by parsing lists of operands separated by operator tokens and disregarding operator precedence, assuming that the precedence would be established manually during the semantic analysis stage based on the lists' content. While such an approach is workable, it is usually more convenient to work with an already prepared binary tree that properly reflects the nesting of the operands.

Parsing binary trees that require left recursion is generally impossible with LL parsers because LL grammars cannot express left recursion. However, inside a hand-written recursive descent parser, we can bypass this limitation.

The example utilizes Pratt's Parsing Algorithm to solve this problem.

The idea is that we associate each operator with a numeric priority, usually called a binding power: 0 for unbound precedence, 1 for the | operator, and 2 for the & operator1. There are two mutually recursive functions: the parse_operator function that parses a sequence of operators from the token stream in accordance with the specified binding power, and the parse_operand function that parses the atomic operand ("true" or "false"), or a parenthesis operator (which we treat as an operand too).

The parsing procedure starts by entering into the parse_operator function with zero binding power (which means that the function should attempt to parse all input tokens).

First, this function parses an operand by calling the parse_operand function and stores the result in the accumulator variable. The first operand that we parsed is going to be the left-hand operand.

Then the function enters a loop where it parses the next incoming pairs of operator tokens and the right-hand operands and reduces them to the left-rotated binary tree using the accumulator:

  1. Whenever the loop encounters the next operator token, it checks whether this operator's binding power is higher than the current binding power. If not, it breaks the loop.
  2. Otherwise, the loop consumes the token and parses the right-hand side operand by recursively calling the parse_operator function with the binding power of this operator.
  3. Finally, the loop folds the current accumulator as the left-hand side of the operation and the result of the right-hand side parsing product into a binary node representing this operation and stores it in the accumulator again, resuming the loop.
  4. The loop finishes when it encounters the end of the token stream input or the ) token that denotes that the function reached the end of the expression inside the (...) grouping expression.
fn parse_operator<'a>(
    session: &mut impl SyntaxSession<'a, Node = BoolNode>,
    context: NodeRule,
    binding: u8, // Current Binding Power
) -> NodeRef {
    let mut accumulator = parse_operand(session, context);

    loop {
        // Skipping the whitespaces between operands and operators.
        skip_trivia(session);

        // Looking ahead at the next token.
        let token = session.token(0);

        match token {
            // `&` operator encountered.
            BoolToken::And => {
                // Check the current binding power with the operator's binding
                // power.
                if binding >= 2 {
                    return accumulator;
                }
                
                // Folds the current accumulator as the left-hand operand and
                // the next right-hand operand into a single binary node.

                let left = accumulator;

                let node = session.enter(BoolNode::AND);

                if !left.is_nil() {
                    session.lift(&left);
                }

                let parent = session.parent_ref();

                session.advance(); // Consumes the operator token.
                skip_trivia(session);

                // Parses the right-hand side with the operator's binding power (2).
                let right = parse_operator(session, BoolNode::AND, 2);

                // Finishes folding and stores the result in the accumulator.
                accumulator = session.leave(BoolNode::And {
                    node,
                    parent,
                    left,
                    right,
                });
            }

            BoolToken::Or => {
                if binding >= 1 {
                    return accumulator;
                }

                // The same procedure, but uses the binding power 1 when parsing
                // the right-hand side.

                // ...
            }

            // The end of the input has been reached.
            // Breaking the loop and returning the accumulated result.
            BoolToken::ParenClose | BoolToken::EOI => return accumulator,

            _ => {
                // Syntax error handler
            }
        }
    }
}

The parse_operand function, in turn, parses just a single operand ("true" or "false") or a parenthesis operator, which is treated as an operand too.

fn parse_operand<'a>(
    session: &mut impl SyntaxSession<'a, Node = BoolNode>,
    context: NodeRule,
) -> NodeRef {
    loop {
        let token = session.token(0);

        match token {
            BoolToken::True => return parse_true_operand(session),
            BoolToken::False => return parse_false_operand(session),
 
            // Recursively descends into the `parse_operator` function again
            // with binding power 0 when parsing the inner expression
            // inside `(...)`.
            BoolToken::ParenOpen => return parse_group(session),

            _ => {
                // Syntax error handler.
            }
        }
    }
}

The above algorithm effectively constructs a left-rotated binary tree. However, the algorithm could be easily extended to cover more cases:

  • If you space out the operators' binding powers (e.g., the & power is 20 and the | power is 10), you can easily make any operator right-associative by passing a binding power one less than the operator's own to the right-hand side parser (e.g., parse_operator(session, BoolNode::AND, 19) turns the conjunction operator into a right-recursive one).

  • The unary operators without the right-hand side could be parsed the same way, except that in the parse_operator function, you don't need to parse the right-hand side.

  • The unary operators without the left-hand side could be parsed in the parse_operand function that would recursively call the parse_operator function with the corresponding operator's binding power to parse the right-hand side.

1

Some operators obviously could share the same binding power. For example, the "+" and "-" operators in arithmetic expressions would have the same priority, and therefore the same binding power.

Documents

A Document is the primary object designed to store the source code text of a compilation unit, along with its lexical and syntax structures. It ensures all three components remain synchronized.

This object has two constructors: Document::new_mutable and Document::new_immutable.

Both constructors take the source code text as the initial input for the Document. The first constructor creates a Document that supports write operations, allowing for the editing of arbitrary source code spans within the document's text. The second constructor creates a Document that does not support write operations but is slightly faster during the document's creation.

To edit a mutable Document, you use the Document::write function. This function takes an arbitrary span of the source code text that you wish to rewrite and the text you want to insert in place of the specified span. It rescans the tokens of the affected source code fragment (localized to the span) and incrementally reparses the part of the syntax tree related to these changes. The mutable Document is specifically designed to be efficient for write operations. Incremental updates typically take milliseconds, even for large compilation units, making it feasible to write into the mutable Document with each end-user keystroke.

use lady_deirdre::{lexis::SourceCode, units::Document};

let mut doc = Document::<JsonNode>::new_mutable(r#"{ "foo": 123 }"#);

doc.write(9..12, "456");

assert_eq!(doc.substring(..), r#"{ "foo": 456 }"#);

If the compiler serves the dual purpose of being a programming language compiler that compiles the entire codebase at once, and a language server that continuously analyzes a dynamically evolving compilation project, you can optimize the program's performance by switching between immutable and mutable documents depending on the current mode of the program.
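 
A rough sketch of such a switch follows. Re-extracting the current text via SourceCode::substring and re-creating the document from it is just one possible hand-over strategy, assumed here for illustration:

use lady_deirdre::{lexis::SourceCode, units::Document};

// Batch compilation mode: the immutable document is slightly cheaper to create.
let batch_doc = Document::<JsonNode>::new_immutable(r#"{ "foo": 123 }"#);

// Language server mode: re-create the document as a mutable one,
// reusing its current text.
let text = batch_doc.substring(..);
let mut editor_doc = Document::<JsonNode>::new_mutable(&*text);

editor_doc.write(9..12, "456");

assert_eq!(editor_doc.substring(..), r#"{ "foo": 456 }"#);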

Loading By Parts

When the content of a file is being transferred in parts, for example, through a network or by loading the file from disk in chunks, you can create a TokenBuffer and continuously append these chunks into the buffer.

Once the file loading is complete, you can use this token buffer as input for the Document constructor.

use lady_deirdre::{
    lexis::{SourceCode, TokenBuffer},
    units::Document,
};

// Ideally, you can use `TokenBuffer::with_capacity()` if the final length of
// the file is known upfront.
let mut buf = TokenBuffer::new();

buf.append(r#"{ "foo": "#);
buf.append(r#"123 }"#);

let doc = Document::<JsonNode>::new_immutable(buf);

assert_eq!(doc.substring(..), r#"{ "foo": 123 }"#);

This approach is likely more efficient than writing the chunks to the end of a mutable Document. TokenBuffer is more efficient for lexical scanning when new fragments are being appended, and this method postpones the syntax parsing of the not-yet-completed source code text.

Syntax-less Documents

Sometimes, you may want to use the Document to store just the source code text and the lexical analysis of the file, bypassing the syntax analysis stage. For example, a mutable Document can be used as a simple storage of strings with random read/write access.

In this case, you can use the VoidSyntax helper object to make the Document bypass the syntax analysis stage.

use lady_deirdre::{lexis::SourceCode, syntax::VoidSyntax, units::Document};

let mut doc = Document::<VoidSyntax<JsonToken>>::new_mutable(r#"{ "foo": 123 }"#);

doc.write(9..12, "456");

assert_eq!(doc.substring(..), r#"{ "foo": 456 }"#);

The above document has full capabilities of the SourceCode trait, but the SyntaxTree implementation represents a dummy syntax tree with just a single root node that covers empty text.

Documents Identification

Each instance of the Document (and similar source code storage objects such as the TokenBuffer) has a globally unique identifier within the current process.

The Document::id function returns an object of type Id. This object is Copy, Eq, and Hash, ensuring that two distinct instances of documents have distinct identifiers.

Related objects of a Document, such as NodeRef, TokenRef, and others, store the identifier of the document to which they belong.

For example, from a NodeRef referential object, you can determine the identifier of the document to which the referred node belongs. This is particularly useful when working with multi-document compilers.

use lady_deirdre::{
    arena::Identifiable,
    syntax::{NodeRef, SyntaxTree},
    units::Document,
};

let doc = Document::<JsonNode>::new_immutable(r#"{ "foo": 123 }"#);

let root: NodeRef = doc.root_node_ref();

assert_eq!(doc.id(), root.id());

Documents Naming

You can assign a possibly non-unique string name to the document to simplify document identification during debugging. For instance, you can use a file name as a document's name.

The Id::set_name and Id::name functions set and retrieve the current document name, respectively. Additionally, the crate API uses the document's name in various debugging functions. For example, the Display implementation of the Document object prints the source code text to the terminal with the document name as the snippet's caption.

use lady_deirdre::{arena::Identifiable, units::Document};

let doc = Document::<JsonNode>::new_immutable(r#"{ "foo": 123 }"#);

// By default, the document has an empty name.
assert_eq!(doc.id().name(), "");

doc.id().set_name("Foo Doc");

assert_eq!(doc.id().name(), "Foo Doc");

Semantics

Semantic analysis is the final stage of the compilation project processing.

The Semantic Model is a set of user-defined data objects that collectively form an abstraction over the syntax trees of the compilation project.

These data objects are constructed by associated user-defined computable functions. Together, the model's data object and its associated function are called an attribute.

Attributes are objects owned by the syntax tree nodes. By traversing the syntax tree and querying their attribute values (the data objects of the semantic model), you discover the semantics of the compilation project.

Some attributes are the inputs of the semantic model; they perform a direct initial mapping of the syntax and lexical structures of the compilation units to a subset of the semantic model. Other attributes infer derived information from other attribute values.

Dependencies between attributes form a semantic graph. This graph is a lazy-evolving, demand-driven structure and is subject to incremental recomputations. Subsets of the graph are computed or recomputed whenever you query attributes from these subsets. The rest of the graph remains in an uninitialized or outdated state.

Lady Deirdre's semantic analysis framework helps you organize these data structures and compute the semantic graph efficiently, possibly from multiple concurrent threads, while keeping it in sync with changes in the source code of the compilation project.

Chain Analysis Example

The subchapters of this book section refer to the Chain Analysis example, which illustrates the basic concepts of the framework.

The example program attempts to analyze a simple programming language consisting of nested code blocks, where each block contains variable assignment expressions and sub-blocks.

{
    x = 100;

    {
        y = x;

        {
            z = y;
            w = 200;
            u = w;
        }
    }
}

On the left-hand side of the assignment is the name of the variable (a "Key") being introduced. On the right-hand side is either a numeric value or a reference to a variable introduced previously. Variables can shadow each other.

The compiler computes the numeric values of the variables based on the system of references between them. For instance, the variable z in the above snippet has a value of 100 because it refers to the variable y, which in turn refers to the variable x, having a numeric value of 100.

The non-incremental approach to this problem is straightforward. We can create a hashmap ("namespace") with the keys representing variable names and the values representing the currently inferred numbers for these variables. Then, we perform a depth-first traversal through the entire syntax tree. Whenever we encounter an assignment expression, we insert an entry into the map with the key from the left-hand side of the expression and a value that is either a number from the right-hand side or, if the right-hand side is a reference, the numeric value associated with that reference in the same map. After processing each assignment expression, we associate the key of that expression with the inferred number.
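 
As a rough standalone illustration of this non-incremental strategy, consider the following sketch. It uses a toy model with hypothetical Stmt and Value types rather than Lady Deirdre's syntax tree:

use std::collections::HashMap;

// A toy model of the language: a block is a list of statements.
enum Stmt {
    Assign { key: String, value: Value },
    Block(Vec<Stmt>),
}

enum Value {
    Num(usize),
    Ref(String),
}

// Depth-first traversal that resolves every assignment to a number.
fn resolve(block: &[Stmt], namespace: &mut HashMap<String, usize>) {
    for stmt in block {
        match stmt {
            Stmt::Assign { key, value } => {
                let num = match value {
                    Value::Num(num) => *num,
                    // Unknown references fall back to 0 in this toy model.
                    Value::Ref(name) => namespace.get(name).copied().unwrap_or(0),
                };

                namespace.insert(key.clone(), num);
            }

            // Nested blocks receive a copy of the namespace, so inner
            // assignments shadow the outer ones without leaking back.
            Stmt::Block(inner) => resolve(inner, &mut namespace.clone()),
        }
    }
}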

The above approach is inefficient in practice for two reasons:

  1. In a code editor, the end-user typically views just one screen of the source code text at a time. Therefore, computing the entire tree is unnecessary most of the time.
  2. To keep the variable resolutions in sync with the changes, this procedure would have to be rerun on every end-user keystroke.

To make this procedure more incremental and lazily computable, instead of computing the entire procedure at once, we would split it into independent sub-procedures localized for each code block.

For each code block, we will create its own namespace hashmap and traverse the block's statements similarly to the previous approach:

  • Whenever we encounter an assignment expression, we will try to resolve it based on the current hashmap state as before. However, if the assignment refers to a variable that does not exist in the map, we assume that the referenced variable is external. In this case, we will use a string with this reference name in the map's entry as a marker that this variable is external.
  • If we encounter a nested block, we will not descend into this block. Instead, we will associate this block with the current copy of the hashmap.
{
    x = 100; // x = 100

    // namespace copy: x = 100
    {
        y = x; // y = "x"

        // namespace copy: y = "x"
        {
            z = y; // z = "y"
            w = 200; // w = 200
            u = w; // u = 200
        }
    }
}

Note that the above block procedures are independent of each other. They can be run in any order, and their execution can be postponed until needed.

To query a particular variable's resolution, we first run the procedure of the block in which the variable is nested. Then, we look at the local resolution: if the variable has already been resolved to a numeric value by its block procedure (as with the x, w, and u variables), we are done.

Otherwise, we run the procedure of the parent block and look at the copy of the namespace that the parent's procedure assigned to our block. If that namespace contains a numeric value for the referenced variable, we are done. Otherwise, we repeat this step with the grandparent block, and so on, until we climb up to the ancestor where the number is found.

This incremental approach offers two advantages:

  1. Whenever we need to run the block-resolution procedure, we can cache its results as well as all intermediate resolutions. This means that the next time we resolve this or another variable that directly or indirectly depends on this block's local resolutions, we can retrieve their values from the cache.
  2. If the end-user types something in the block, we can erase only this block's caches. As a result, the previously computed resolutions can still utilize the caches that were not erased by these changes.

This example illustrates the core concept of the incremental semantic analysis framework. The source codes of the compilation units should be split into independent scopes, so that the local semantic information can be inferred from the states of the scopes. Then, higher-level procedures will infer higher-level semantics from the local semantics of the scopes by seamlessly connecting their bounds. Finally, all of these procedures would cache their results for reuse, and these caches are subject to invalidation depending on the changes in the source code.

Partition into Scopes

To achieve efficiency in incremental semantic analysis, the language should allow for partitioning of the source code of compilation units into scopes. Within these scopes, local base semantic facts can be inferred relatively independently from other scopes.

For instance, in the Java programming language, each compilation unit (file) introduces a Java class, which can be segmented into several abstract semantic layers:

  • The class type declaration layer.
  • The declaration layer for class members (fields and methods) within the class.
  • The layer for implementing methods (method bodies).

Initially, we can consider each of these layers as independent from each other: each method's body code constitutes an independent scope, each class member signature forms an independent scope, and finally, the class type declaration stands as a scope initially separate from its members and method implementations.


class MyClass<T> {
    private T fieldFoo = 5;
    
    public void methodBar(int x) {
        //..
    }
    
    public void methodBaz(T y) {
        //..
    }
}

From each of these scopes, we infer as much localized information as needed, which we can later utilize to draw more comprehensive conclusions.

In the example above, from the signature of methodBaz, we can deduce that it possesses a parameter of type T. However, solely from this declaration, we cannot pinpoint where exactly this type has been declared. Conversely, from the signature of MyClass, we gather that the class has a generic type T that can be utilized within its members. Yet, we cannot determine solely from the type signature declaration which class members specifically employ this type. By linking these two independent pieces of information together, we can conclude that the parameter y of methodBaz has the generic type T declared in the class signature.

In terms of Lady Deirdre's semantic analysis framework, the localized facts we infer from the scopes constitute the inputs of the language's semantic model.

The semantic graph attributes, which map from the scope nodes to the semantic model objects, serve as the entry points of the model (the input attributes). Other attributes deduce more generalized facts based on the state of the model.

The granularity of the attributes within the semantic graph and the separation of scopes in the programming language syntax are core features that render incremental semantic analysis efficient.

Grammar Setup

The central component of your compiler is the Analyzer object. This object is responsible for managing the set of documents within the compilation project and their semantic graph. Further details regarding the Analyzer's API will be discussed in subsequent chapters. For now, this chapter focuses on configuring the programming language grammar.

The first generic parameter N of the Analyzer represents the type of the language grammar. Essentially, this parameter denotes the type of the syntax tree node. However, to fully describe the grammar, including semantics, you need to extend this enum type with additional metadata:

  1. Annotate enum variants that serve as the top nodes of the scopes with the #[scope] macro attribute.
  2. Add a semantics field to each parsable (and denoted) enum variant, annotated with #[semantics].
  3. Optionally, you can specify the syntax tree classifier using the #[classifier] macro attribute.

From the Chain Analysis example:

#[derive(Node)]
#[token(ChainToken)]
#[trivia($Whitespace)]
#[classifier(ChainNodeClassifier)] // Nodes classifier (this attribute is optional).
pub enum ChainNode {
    #[root]
    #[rule(block: Block)]
    Root {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        block: NodeRef,
        
        // Fields annotated with this macro attribute must be present in each
        // variant body, and they must be of type `Semantics`.
        #[semantics] 
        semantics: Semantics<VoidFeature<ChainNode>>,
    },

    #[rule($BraceOpen statements: (Block | Assignment)* $BraceClose)]
    #[scope] // This node is the top node of the scope.
    Block {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        statements: Vec<NodeRef>,
        #[semantics]
        semantics: Semantics<BlockSemantics>,
    },

    #[rule(key: Key $Assign value: (Ref | Num) $Semicolon)]
    Assignment {
        #[node]
        node: NodeRef,
        #[parent]
        parent: NodeRef,
        #[child]
        key: NodeRef,
        #[child]
        value: NodeRef,
        #[semantics]
        semantics: Semantics<VoidFeature<ChainNode>>,
    },
    
    // ...
}

Semantics Field

Each variant in the Node enum must contain a semantics field annotated with the #[semantics] attribute and of type Semantics.

This field will be automatically initialized1 and managed by the macro-generated code.

Through this field, you can access semantic graph attributes that describe the semantics specific to each node.

The Semantics object is parameterized by a user-defined type, typically a struct type, enumerating all semantic attributes logically associated with the node. In the example above, the Semantics of the ChainNode::Block variant is parameterized by the BlockSemantics type.

If a node variant doesn't have any attributes, you can parameterize its Semantics object with the VoidFeature type, as seen in the Root and Assignment node variants.

1

To initialize this field manually in the hand-written parser, use the Semantics::new function, passing the current NodeRef obtained from the SyntaxSession::node_ref function.
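 
Here is a minimal sketch of such manual initialization inside a hypothetical hand-written parse function for the Assignment variant. The surrounding token consumption is omitted, and the nil child references stand in for real parsing results:

fn parse_assignment<'a>(
    session: &mut impl SyntaxSession<'a, Node = ChainNode>,
) -> ChainNode {
    // ... token consumption omitted ...

    // The NodeRef of the node currently being parsed.
    let node = session.node_ref();
    let parent = session.parent_ref();

    ChainNode::Assignment {
        node,
        parent,
        // In a real parser, these would point to the parsed children.
        key: NodeRef::nil(),
        value: NodeRef::nil(),
        // Manual initialization of the semantics field.
        semantics: Semantics::new(node),
    }
}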

Feature Objects

The type you use as a parameter of the Semantics object is called a feature.

Typically, the semantic feature is a user-defined struct type derived from the Feature trait using the Feature derive macro. This structure consists of fields that are either attributes or other feature objects.

#[derive(Feature)]
#[node(ChainNode)] // Required by the macro trait.
pub struct BlockSemantics {
    #[scoped]
    pub analysis: Attr<BlockAnalysis>,
    pub assignments: Attr<Shared<BlockAssignmentMap>>,
    pub blocks: Attr<Shared<BlockNamespaceMap>>,
    pub namespace: Attr<Shared<BlockNamespace>>,
}

In the above code, all fields are semantic attributes (Attr types), but you are free to use other features as field types whenever you want to create more complex nested structures. You can also reuse the same feature type and attribute types in multiple places, as long as the feature or attribute logically belongs to different syntax tree nodes. The Analyzer will treat them as independent instances.

Additionally, in the above code, we annotated the analysis field as #[scoped]. This annotation informs the Analyzer that this specific attribute (or feature) is an entry point of the semantic model, performing the initial inspection and mapping of the syntax tree's scoped branch to the semantic model's initial objects.

Features with scoped attributes should be used as semantic objects of scoped nodes (BlockSemantics is the semantics of the ChainNode::Block, which is a #[scope]).

Attributes

We will discuss attributes in more detail in the next chapters, but to give you a brief overview, the generic parameter of Attr specifies the type of the attribute value. This value is part of the semantic model and can be any user-defined type (e.g., a struct or an enum) equipped with a function that computes this value based on the syntax tree values and other attribute values.

#[derive(Default, Clone, PartialEq, Eq)]
pub struct BlockAnalysis {
    pub assignments: Shared<BlockAssignmentMap>,
    pub blocks: Shared<BlockNamespaceMap>,
}

impl Computable for BlockAnalysis {
    type Node = ChainNode;

    fn compute<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Self> {
        // Computing the BlockAnalysis instances based on the inputs provided
        // by the `context`.
    }
}

The general requirements imposed on this type are that it should implement the Clone, Eq, and Computable traits.

Semantic Graph

A semantic attribute (Attr object) consists of a cache for a value of an arbitrary user-defined type and a function that computes this value when invoked by the Analyzer's inner algorithm.

Inside the Computable function that computes the value, you can access other attribute values, the syntax and lexical content of the compilation units, and other Analyzer-related elements from the context argument of the AttrContext type. This argument is the source of the inputs to the function, allowing you to infer and return the resulting attribute value.

The implementation of the function should be deterministic and generally free of side effects. Typically, it should compute the output value solely based on the inputs.

Attribute Value

What you compute inside the function depends on your semantics design. It could be the type of a variable introduced in the source code, or the occurrences of a particular identifier throughout the source code. Essentially, it encompasses anything needed to express the programming language's semantic rules and to enhance the language server, thereby assisting the end user in the code editor1.

Typically, an attribute computes a value that logically belongs to the syntax tree node on which it is instantiated. From the context argument, you can access the NodeRef that points to the node owning this attribute. Using this NodeRef reference, you can determine the document (by the document's id) that contains this node and read the corresponding node instance.

#[derive(Default, Clone, PartialEq, Eq)]
pub struct BlockAnalysis {
    pub assignments: Shared<BlockAssignmentMap>,
    pub blocks: Shared<BlockNamespaceMap>,
}

impl Computable for BlockAnalysis {
    type Node = ChainNode;

    fn compute<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Self> {
        // A NodeRef that points to the syntax tree node owning this
        // attribute (assumed to be a `ChainNode::Block` enum variant in this case).
        let block_ref = context.node_ref();
        
        // Requests the Document object that stores the corresponding
        // compilation unit and its syntax tree in particular.
        let doc_read = context.read_doc(block_ref.id).unwrap_abnormal()?;
        let doc: &Document<ChainNode> = doc_read.deref();

        // Dereferences the instance of the syntax tree node to iterate through
        // the block statements.
        let Some(ChainNode::Block { statements, .. }) = block_ref.deref(doc) else {
            // If we encounter that the syntax tree is broken for any reason,
            // we return the (possibly unfinished) state of the computable
            // value regardless.
            //
            // The computable functions strive to infer as much metadata
            // as possible without panicking.
            return Ok(Self::default());
        };
        
        // Traversing through the block statements.
        for st_ref in statements {
            match st_ref.deref(doc) {
                // ...
            }
        }
        
        // ...
    }
}

Similarly to the syntax analysis stage, semantic analysis should be resilient to errors. If the computable function cannot fully infer the target value, it should compute as much metadata as possible or fall back to reasonable defaults without panicking. For this reason, most semantic model objects in the Chain Analysis example implement the Default trait.

For instance, in Rust source code, when introducing a variable with let x;, the variable's type depends on the initialization expression. In the type-inference attribute's computable function, we attempt to infer the Rust type of the variable based on the known initialization points. If we cannot fully infer the type, we fall back to a reasonable candidate type (or a set of candidates), as sketched below.
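
To illustrate, here is a hedged sketch of what such an attribute's value type might look like. The TypeInference enum and the RustType placeholder are hypothetical and not part of the Chain Analysis example or Lady Deirdre's API.

#[derive(Default, Clone, PartialEq, Eq)]
pub enum TypeInference {
    // The initialization expression is missing or broken; nothing is known yet.
    #[default]
    Unknown,

    // Several initialization points suggest different types; all candidates
    // are kept so that dependent attributes can still work with them.
    Candidates(Vec<RustType>),

    // The type has been inferred unambiguously.
    Resolved(RustType),
}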

1

Attributes are general-purpose; you can store any arbitrary data inside them, not necessarily related to language semantics only. For example, when implementing a programming language compiler, you can store middle-end or even back-end artifacts of the compiler in some attributes. In this sense, Lady Deirdre's semantic analysis framework could serve as an entry point to the middle- or back-end compiler, even though these compilation stages are not the direct purpose of Lady Deirdre.

The Graph

Inside the computable function of the attribute, you can read other attribute values. This mechanism allows you to infer more specific semantic facts from more general facts.

For instance, in the Chain Analysis example, the LocalResolution attribute infers let-statement references within the local block in which it was declared based on all local assignments (BlockAssignmentMap attribute) within this block.

#[derive(Default, Clone, PartialEq, Eq)]
pub enum LocalResolution {
    #[default]
    Broken,
    Resolved(usize),
    External(String),
}

impl SharedComputable for LocalResolution {
    type Node = ChainNode;

    fn compute_shared<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Shared<Self>> {
        // The NodeRef that points to the `ChainNode::Key` node owning this attribute.
        let key_ref = context.node_ref();

        // The Document that owns this node's syntax tree.
        let doc_read = context.read_doc(key_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        // Dereferencing the "Key" node to access its semantics.
        let Some(ChainNode::Key { semantics, .. }) = key_ref.deref(doc) else {
            return Ok(Shared::default());
        };

        // Fetches the local `ChainNode::Block`'s NodeRef within the scope of
        // which the Key node resides.
        let block_ref = semantics
            .scope_attr()
            .unwrap_abnormal()?
            .read(context)?
            .scope_ref;

        // Dereferencing this block node instance.
        let Some(ChainNode::Block { semantics, .. }) = block_ref.deref(doc) else {
            return Ok(Shared::default());
        };

        // Accessing the block's semantics.
        let block_semantics = semantics.get().unwrap_abnormal()?;

        // Reading the `BlockAssignmentMap` attribute of the Block.
        let assignments = block_semantics
            .assignments
            .read(context)
            .unwrap_abnormal()?;

        // Looking up the entry in this map that belongs to the key.
        let Some(resolution) = assignments.as_ref().map.get(key_ref) else {
            return Ok(Shared::default());
        };

        // Cloning the value from the entry, which will be the value of
        // the computable attribute.
        Ok(resolution.clone())
    }
}

In this snippet, particularly on the line block_semantics.assignments.read(context), we are reading the value of another attribute. The Attr::read function takes the current context reference and returns a RAII read-guard of the attribute's value.

When we read an attribute inside another attribute, we're indicating to the context that the value of the reader depends on the value of what's being read.

The act of reading establishes dependency relations between the attributes, so that the cached value of the reader is subject to recomputations whenever any of its dependencies change.

The system of attributes and their dependencies forms a Semantic Graph.

You don't specify this graph upfront; Lady Deirdre reveals the structure of the graph at runtime when it calls the computable function, which tells the Analyzer how one specific attribute depends on another.

This graph is dynamically evolving and potentially subject to reconfiguration as the computable function passes through different control flow paths. However, Lady Deirdre imposes one important limitation: the graph should not have cycles. In other words, a computable function of an attribute cannot read the value of an attribute that directly or indirectly reads its own value.

In the example above, the LocalResolution attribute depends on the BlockAssignmentMap attribute, which in turn depends on the BlockAnalysis attribute, an entry-point attribute that does not read any other attributes. Thus, this dependency chain is straightforward and does not have any cycles by design.

Avoiding cyclic dependencies between attributes is a rule that you must enforce manually when designing the programming language's semantics. Lady Deirdre provides some tools to detect possible errors in the graph design, which we will discuss in the next chapters.

Incremental Computations

Lady Deirdre does not compute attribute values instantly. Instead, the Analyzer computes them or retrieves them from the cache whenever you explicitly read them. Therefore, most of the time, some attribute values exist in a not-yet-computed or outdated state.

However, the Analyzer is responsible for keeping the values of the graph up to date whenever you observe corresponding attribute values.

The process of synchronizing the semantic graph is called validation. Conversely, marking an attribute in the graph as subject to recomputation is called invalidation.

The inner algorithm of the Analyzer is inspired by the query-based approach of the Rust compiler and the Salsa framework. The algorithm avoids unnecessary recomputations of semantic graph attributes whenever possible. In general, it relies on equality of attribute values (the Eq trait implementation on the attribute value) to determine whether an attribute that depends on this one needs to be recomputed.

The validation procedure works as follows:

  1. Whenever the end user edits the source code of the compilation unit, the Analyzer incrementally reparses the corresponding Document of this unit.

  2. Then it detects all syntax tree nodes that have been altered during reparsing (nodes that have been updated, deleted, or created).

  3. Next, the Analyzer collects all top scope nodes (the nodes annotated with the #[scope] macro attribute).

  4. Then, the Analyzer marks the #[scoped] attributes of the scope nodes as invalid (subject to recomputation). At this point, the algorithm completes the "eager" stage of the validation. It does not make any further updates to the semantic graph values. This stage usually completes quickly.

  5. When you request a particular attribute value (e.g., by traversing the syntax tree and fetching an attribute value from the node's semantics), the algorithm checks whether the direct or indirect dependencies of the requested attribute are invalid (or not yet computed). In such cases, the algorithm calls the computable functions on the invalid attributes, updating their caches and propagating the changes down to the requested attribute.

    This process may finish earlier if the algorithm determines that the recomputation process converges to the previously stored caches (based on equality between the cached values and the new results of the computable function).

  6. Finally, the Analyzer returns an up-to-date clone of the attribute's value from its cache (hence, the value type should implement the Clone trait).
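
The following hedged sketch illustrates this on-demand behavior, reusing the Chain Analysis setup and the task API described later in this guide: the edit itself only reparses the document and invalidates the affected #[scoped] attributes, while the actual recomputation happens when an attribute value is requested.

let analyzer = Analyzer::<ChainNode>::new(AnalyzerConfig::default());
let handle = TriggerHandle::new();

// The "eager" stage: the edit reparses the document and marks the #[scoped]
// attributes of the affected scope nodes as invalid. No attribute values are
// recomputed at this point.
let mut task = analyzer.mutate(&handle, 1).unwrap();
let doc_id = task.add_mutable_doc("{ x: 10; }");
task.write_to_doc(doc_id, .., "{ y: 10; }").unwrap();
drop(task);

// The on-demand stage: requesting an attribute value triggers the validation
// of its invalid (or not-yet-computed) direct and indirect dependencies, after
// which an up-to-date clone of the value is returned.
let task = analyzer.analyze(&handle, 1).unwrap();

// ...traverse the syntax tree to a node's semantics and call
// `attr.snapshot(&task)`, as shown in the Analysis Task chapter.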

Input Attributes

An important aspect of this algorithm is that the Analyzer automatically invalidates only the #[scoped] attributes of the #[scope] syntax tree nodes whenever the end user changes the content of the syntax tree within the scope.

Therefore, typically only these attributes should perform the initial mapping of the scoped syntax tree structure to the initial semantic model objects. Informally, you can think of these attributes as the input attributes of the semantic graph.

Any other attribute should not directly rely on the current configuration of the compilation unit, such as the structure of a node's children or the strings covered by the scanned tokens. This metadata can change over time, and, in general, such changes will not be detected by the validator when it validates these attributes' caches. If this metadata is crucial to an attribute's computable function, it should be reflected in the initial semantic model objects by the input attributes.

In the Chain Analysis example, only the BlockAnalysis attribute (the #[scoped] attribute of the #[scope] syntax tree node) iterates through the block's inner let-statements and inner blocks, collecting them into HashMaps usable for further analysis. Moreover, this attribute does not inspect the inner structure of its nested blocks either, because a sub-block's inner syntax structure lies outside of the current block's scope.

Other attributes directly (e.g., BlockAssignmentMap) or indirectly (e.g., LocalResolution and GlobalResolution) read the BlockAnalysis's HashMaps, but they do not perform deeper inspection of the node's syntax tree structure inside their computable functions.

Side Effects

Typically, the implementation of an attribute's Computable function should be free of side effects: its result should not rely on the state of the external environment, and non-input attributes should be independent of changes in the syntax and lexical structure of the compilation units.

If the implementation has side effects that cannot be avoided, you have two ways to overcome the limitations of the validation procedure:

  1. You can invalidate any attribute manually using the Attr::invalidate function if you are aware that the external environment state has changed.

  2. Inside the computable function implementation, you can use the Context::subscribe function to subscribe the attribute to an Analyzer-wide event. Triggering that event independently bulk-invalidates all semantic graph attributes subscribed to it. The event object that you pass to this function is an arbitrary user-defined value of a numeric type1 (see the sketch after this list).
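
Here is a hedged sketch of the second approach. The SomeAttrValue type is hypothetical, and the exact signature of the subscribe function is an assumption; only the trigger_event call is shown verbatim elsewhere in this guide.

impl Computable for SomeAttrValue {
    type Node = ChainNode;

    fn compute<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Self> {
        let node_ref = context.node_ref();

        // Assumption: subscribes this attribute to the user-defined event 1234
        // within the node's document, so that triggering the event later
        // invalidates this attribute.
        context.subscribe(node_ref.id, 1234);

        // ...computes the value, possibly consulting the external environment.
        Ok(Self::default())
    }
}

// Later, from a mutation task, when the external environment changes:
// task.trigger_event(doc_id, 1234);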

Both methods should be used conservatively as they could potentially impact the incremental capabilities of the framework.

However, one scenario where you might find these mechanisms useful is when your compiler manages several Analyzers of distinct programming languages that logically build up a single compilation project. Within this setup, changes in the state of one Analyzer could be propagated to some attributes of another Analyzer's setup.

1

There are a couple of built-in events as well, such as the DOC_UPDATED_EVENT, which denotes document-wide edits within the specified document regardless of the scopes. However, the majority of the value range is available for user-defined events.

Scope Access

For any syntax tree node with semantics, you can obtain a NodeRef reference to the top node of the scope in which this node is nested.

The Semantics::scope_attr function returns a special built-in attribute that contains a NodeRef of the top node within the node's scope. The Analyzer is responsible for maintaining the accuracy of this attribute's value, and you can utilize it within any computable functions.

For instance, in the Chain Analysis example, the LocalResolution function accesses the scope block of the ChainNode::Key node by utilizing this attribute.

impl SharedComputable for LocalResolution {
    type Node = ChainNode;

    fn compute_shared<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Shared<Self>> {
        let key_ref = context.node_ref();
        let doc_read = context.read_doc(key_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        let Some(ChainNode::Key { semantics, .. }) = key_ref.deref(doc) else {
            return Ok(Shared::default());
        };

        let block_ref = semantics // The semantics of the Key node.
            .scope_attr() // The scope attribute of the `Key` node.
            .unwrap_abnormal()?
            .read(context)? // Reading this attribute.
            .scope_ref; // The NodeRef of the `Block` into which this `Key` node is nested.

        let Some(ChainNode::Block { semantics, .. }) = block_ref.deref(doc) else {
            return Ok(Shared::default());
        };
        
        // ...
    }
}

Note that the top nodes themselves are considered to be nested within their parent scopes. The ChainNode::Block node, which serves as a top node of a scope, is nested within its parent, Block1. By iteratively climbing up, you will eventually reach the root of the syntax tree.

The GlobalResolution attribute leverages this feature to calculate the ultimate resolution of the ChainNode::Key value by ascending through the hierarchy of nested blocks.

impl Computable for GlobalResolution {
    type Node = ChainNode;

    fn compute<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Self> {
        let key_ref = context.node_ref();
        let doc_read = context.read_doc(key_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        let Some(ChainNode::Key { semantics, .. }) = key_ref.deref(doc) else {
            return Ok(Self::default());
        };

        let key_semantics = semantics.get().unwrap_abnormal()?;

        let local_resolution = key_semantics
            .local_resolution
            .read(context)
            .unwrap_abnormal()?;

        // Checks if the `Key` has already been resolved locally.

        let mut ref_name = match local_resolution.as_ref() {
            LocalResolution::Broken => return Ok(Self::Broken),
            LocalResolution::Resolved(num) => return Ok(Self::Resolved(*num)),
            LocalResolution::External(name) => String::from(name),
        };
        
        // Otherwise, it climbs up through the system of nested blocks.

        // Fetches the NodeRef of the `Key`'s block node in a similar manner to
        // the `LocalResolution` computable function.
        let mut block_ref = semantics
            .scope_attr()
            .unwrap_abnormal()?
            .read(context)
            .unwrap_abnormal()?
            .scope_ref;

        loop {
            // Checks if the current block has the resolution we are seeking.
        
            let Some(ChainNode::Block { semantics, .. }) = block_ref.deref(doc) else {
                return Ok(Self::default());
            };

            let block_semantics = semantics.get().unwrap_abnormal()?;

            let block_namespace = block_semantics.namespace.read(context).unwrap_abnormal()?;

            match block_namespace.as_ref().namespace.get(&ref_name) {
                Some(LocalResolution::Broken) => return Ok(Self::Broken),
                Some(LocalResolution::Resolved(num)) => return Ok(Self::Resolved(*num)),
                Some(LocalResolution::External(name)) => ref_name = String::from(name),
                None => (),
            }

            // Otherwise, sets the `block_ref` to the parent block of
            // the current block to continue the climbing-up iteration.

            block_ref = semantics
                .scope_attr()
                .unwrap_abnormal()?
                .read(context)
                .unwrap_abnormal()?
                .scope_ref;
        }
    }
}
1

Or within the root node of the syntax tree. The root node is treated as the default scope for the entire syntax tree.

Granularity

As a general rule, it's preferable to keep the semantic graph granular, with small attributes that are cheap to clone and compare for equality. This ensures that each attribute value remains logically isolated from the others.

With a granular semantic graph, the validation procedure is likely to complete the propagation of incremental changes throughout the dependent attributes of the graph more efficiently. This is achieved by comparing newly computed attribute values with previous caches and stopping the propagation process if they are found to be equal.

In the Chain Analysis example, the BlockAnalysis input attribute initially collects all assignment statements and inner blocks into two dedicated maps: assignments and blocks.

#[derive(Default, Clone, PartialEq, Eq)]
pub struct BlockAnalysis {
    pub assignments: Shared<BlockAssignmentMap>,
    pub blocks: Shared<BlockNamespaceMap>,
}

Later on, these maps are utilized in the LocalResolution and GlobalResolution attributes. In theory, we could read the BlockAnalysis attribute directly from these computable functions. However, in practice, when the end user modifies the content of a block, one of the BlockAnalysis maps is likely to remain unchanged. Making an attribute that needs just one of the two maps depend on the entire BlockAnalysis attribute would therefore cause unnecessary recomputations1.

For these reasons, we spread both maps into the intermediate BlockAssignmentMap and BlockNamespaceMap attributes by cloning the hash maps into them. Subsequently, we read these maps in the final attributes through these intermediaries independently.

If the BlockAnalysis attribute becomes invalid, both BlockAssignmentMap and BlockNamespaceMap will be recomputed when the validation procedure refreshes the semantic graph. However, it's possible that some of the LocalResolution and GlobalResolution attributes will remain unaffected if the validator detects that the intermediate attribute values haven't changed. As a result, the entire validation procedure would proceed faster by skipping some of the heavy computation steps.

1

This example might seem artificial, but in real-world applications, it's probable that the input attributes for scope analysis would comprise many more potentially independent fields.

Shared Object

The Shared helper object is Lady Deirdre's reference-counting thread-safe container, akin to Rust's standard Arc, with two notable distinctions:

  1. Shared, unlike Arc, lacks a Weak counterpart. If a weak counterpart isn't required, Shared's computation and memory performance are slightly better than Arc's.
  2. The Shared::get_mut function accepts &mut self. This makes it more convenient to use when constructing Shared in place.

Shared was initially designed for Lady Deirdre's semantic analysis framework. However, you are free to utilize it anywhere you don't require Arc's weak counterpart as well.

use lady_deirdre::sync::Shared;

let mut shared_a = Shared::new(100);

// You can mutate the inner data in place when Shared does not have any clones yet.
*shared_a.get_mut().unwrap() += 20;

// However, to read the inner data, you need to explicitly call `.as_ref()`.
assert_eq!(*shared_a.as_ref(), 120);

// Does not clone the inner allocated data; it only creates another smart
// pointer to this allocation, similar to `Arc::clone`.
let shared_b = shared_a.clone();

assert_eq!(*shared_b.as_ref(), 120);

Shared Computable

The SharedComputable is a specialized helper trait that automatically implements the Computable trait on the type Shared<T> if SharedComputable is implemented on T.

The SharedComputable::compute_shared function is a mandatory computation function through which you return Shared<T> instead of T.

This trait is especially handy for propagating the Shared value through intermediate attributes. For instance, the BlockAssignmentMap simply clones a shared map from the BlockAnalysis (which is cheap, as it merely creates a new smart pointer to the same allocation).

#[derive(Default, Clone, PartialEq, Eq)]
pub struct BlockAssignmentMap {
    pub map: HashMap<NodeRef, Shared<LocalResolution>>,
}

impl SharedComputable for BlockAssignmentMap {
    type Node = ChainNode;

    fn compute_shared<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Shared<Self>> {
        let block_ref = context.node_ref();
        let doc_read = context.read_doc(block_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        let Some(ChainNode::Block { semantics, .. }) = block_ref.deref(doc) else {
            return Ok(Shared::default());
        };

        let block_semantics = semantics.get().unwrap_abnormal()?;

        // Cloning the `BlockAnalysis::assignments` field as the result value of
        // this computable function.
        Ok(block_semantics.analysis.read(context)?.assignments.clone())
    }
}

The Analyzer

To recap, the Analyzer serves as the central object of the compiler, managing the compilation project's set of documents and the semantic graph.

The state of this object is presumed to be shared among multiple threads1. Specifically, numerous threads can edit various documents concurrently without blocking (provided they edit distinct documents). Additionally, multiple threads can query the semantic graph concurrently and often without blocking (especially when the threads query independent attributes). However, it's not possible to edit the documents and query their attributes simultaneously. When querying an attribute, the graph undergoes incremental recomputations that require synchronization of its state with changes in documents. Therefore, the content of the documents should remain fixed at the synchronization point.

For this reason, the API restricts access to the Analyzer's state: at any given time, you either mutate the state of the Analyzer (e.g., apply edits to the documents) or analyze the current state (e.g., query attribute values).

The Analyzer grants access to specific operations with its data through a system of task objects. You can think of a "task" as an RAII guard, through which you gain access to specific operations on the Analyzer's data2.

The Analyzer offers three types of task objects:

  • The AnalysisTask: This task allows you to query semantic graph attributes. You can have as many simultaneous task objects of this type as you need.
  • The MutationTask: With this task, you can create, edit, or remove documents, and you can trigger analyzer-wide events. Similar to AnalysisTask, you can have multiple simultaneous task objects of this type.
  • The ExclusiveTask: This task enables you to sequentially perform analysis and mutation operations within a single thread. However, you cannot have more than one task of this type simultaneously.

You obtain the task objects by requesting them from the Analyzer. For instance, the Analyzer::analyze function returns an AnalysisTask instance.

Each of these request functions may block the current thread if the Analyzer cannot grant the requested access instantly. For instance, if two threads request analysis tasks, both of them obtain access. However, if one thread requests an analysis task and another thread requests a mutation task, one of the threads will be blocked until the other releases its task object.

Informally, you can view the task system as a "RwLock" with complex access-granting rules, and the task objects as "RAII guards".

1

That said, it's perfectly acceptable to use it from a single thread in a single-threaded process too.

2

Don't be confused by the term "task". A Task Object simply grants access to specific operations. While it's assumed that the task object would be associated with a thread worker in the end application architecture, Lady Deirdre doesn't manage threads itself, nor does it spawn any threads specifically. Managing threads isn't the focus of the crate; you have the freedom to organize the multithreaded (or single-threaded) architecture of your program however you see fit.

Mutation Task

The mutation task is utilized for creating, editing, or removing documents, as well as triggering analyzer-wide events.

let analyzer = Analyzer::<ChainNode>::new(AnalyzerConfig::default());

// A handle through which the task's thread could be gracefully interrupted.
// This interruption can be triggered either manually or by the Analyzer's inner
// task manager.
let handle = TriggerHandle::new();

// Requests the MutationTask.
let mut task = analyzer.mutate(&handle, 1).unwrap();

// Creates a new mutable document inside the Analyzer with the initial source
// code "{ x: 10; }".
// The function returns an identifier for the created document.
let doc_id = task.add_mutable_doc("{ x: 10; }");

// Edits the document by its ID.
// This function may block if the document is currently being edited in another
// thread within another mutation task.
task.write_to_doc(doc_id, .., "{ y: 10; }").unwrap();

// Invalidates all attributes that have been subscribed to the event.
task.trigger_event(doc_id, 1234);

// Removes the document.
task.remove_doc(doc_id).unwrap();

// Ensures that the document no longer exists in the Analyzer.
assert!(!task.contains_doc(doc_id));

In the above code, the add_mutable_doc function resembles Document::new_mutable, and the write_to_doc function resembles Document::write, except that the Document instance is managed by the Analyzer.

Analysis Task

With the analysis task, you can read attributes of the semantic graph, but you cannot edit existing documents.

// Requests the AnalysisTask.
let task = analyzer.analyze(&handle, 1).unwrap();

// Gets read-only access to the document by its id.
let doc_read = task.read_doc(doc_id).unwrap();
let doc = doc_read.deref();

// Searching for a `ChainNode::Key` node within the syntax tree.

let Some(ChainNode::Root { block, .. }) = doc.root_node_ref().deref(doc) else {
    panic!();
};

let Some(ChainNode::Block { statements, .. }) = block.deref(doc) else {
    panic!();
};

let Some(ChainNode::Assignment { key, .. }) = statements[0].deref(doc) else {
    panic!();
};

let Some(ChainNode::Key { semantics, .. }) = key.deref(doc) else {
    panic!();
};

let (attribute_version, resolution) = semantics
    .get()
    .unwrap()
    // The attribute of the node's semantic feature.
    .global_resolution
    // Returns a clone of the attribute's current value.
    .snapshot(&task)
    .unwrap();

assert_eq!(resolution, GlobalResolution::Resolved(100));

Note the snapshot function in the above code that we're calling on the global_resolution attribute of the node's semantics.

This function executes the validation procedure and returns a pair of objects: the Analyzer's inner version at which the value of the attribute was updated, and a copy of the attribute's value.

The version number represents the inner version of the semantic graph state. The Analyzer increments its version number each time it updates the values within the semantic graph. This number always increases and never decreases.

The snapshot function returns the version at which the cache was updated. This number is useful for quickly checking if the attribute has a new value by comparing it with the version number received from this function previously.

The second object of the pair is a copy of the attribute's value. Unlike the Attr::read function used within computable functions, which returns a reference to the value, the snapshot function used externally copies the value (by cloning it).

For this reason, it's recommended to make the attribute's value type cheap to copy if the attribute is intended to be observed from outside of computable functions. Otherwise, you can wrap the value type into Shared.
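
For example (a hedged sketch of the field shapes only; the HeavyValue type is hypothetical, and the Feature derive boilerplate of the semantics struct is omitted):

pub struct SomeSemantics {
    // Snapshotting this attribute clones only the `Shared` smart pointer,
    // not the underlying `HeavyValue` allocation.
    pub heavy: Attr<Shared<HeavyValue>>,

    // A small value like this one is cheap to clone directly.
    pub counter: Attr<usize>,
}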

Exclusive Task

You obtain the exclusive task using the Analyzer::exclusive function.

The Analyzer grants only one instance of this type of task at a time, but this task object provides both the analysis task and mutation task APIs.

The exclusive task is useful for both single-threaded and multi-threaded compilers.

In some language servers and code editors, a technique used to implement code-completion suggestions involves probing the source code by inserting a special secret word at the position of the end user cursor. This allows traversal of the tree to find the syntax tree node containing this word, thus identifying the part of the syntax the user was attempting to complete. Finally, the source code is restored by removing the inserted probing word.

All three steps — writing the probe word, analyzing the probed text, and removing the probe word — should be done as a single transaction to ensure atomicity. The exclusive task provides this atomicity, preventing other threads from reading or changing the probed text in between.
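
Here is a hedged sketch of such a transaction. The probe site, the probe word, and the analysis step are illustrative, and the exclusive function is assumed to follow the same signature pattern as analyze and mutate.

let handle = TriggerHandle::new();

// Only one exclusive task can be active at a time, and it combines the
// mutation and analysis APIs.
let mut task = analyzer.exclusive(&handle, 1).unwrap();

// 1. Inserts the probe word at the end user's cursor site.
task.write_to_doc(doc_id, 20..20, "__probe__").unwrap();

// 2. Analyzes the probed text: traverses the syntax tree down to the node
//    covering the probe word and queries its semantics to build completion
//    suggestions (see the Analysis Task chapter for the reading API).

// 3. Restores the original text by removing the probe word.
task.write_to_doc(doc_id, 20..29, "").unwrap();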

Documents Reading

From any kind of task, you can read the content of the document (both lexical and syntactical). The read_doc function returns a DocumentReadGuard RAII guard, through which you access the Document object immutably. While this guard is held, attempts to mutate this specific document (edit or remove) will be blocked. However, semantic analysis (e.g., querying attributes) is not affected because analysis does not require mutation of compilation units.

Shared Analyzer

As the Analyzer is going to be a central object of the compiler, it's recommended to either place it in a Shared or a Lazy static for easy access from multiple threads. This is somewhat analogous to placing a Mutex or RwLock with the program-wide state into an Arc to share it across threads.

All methods of the Analyzer's API are &self functions.

use lady_deirdre::{
    analysis::{Analyzer, AnalyzerConfig, TriggerHandle},
    sync::Lazy,
};

// This static will be initialized once you dereference it.
static MY_COMPILER: Lazy<Analyzer<ChainNode>> =
    Lazy::new(|| Analyzer::new(AnalyzerConfig::default()));

let handle = TriggerHandle::new();

let task = MY_COMPILER.mutate(&handle, 1).unwrap();

Single Document Compiler

Sometimes, the compiler you're developing is intended to compile a programming language without modules. For instance, vanilla JavaScript doesn't have modules; the entire JavaScript compilation project consists of just one file (one document).

In this case, you can configure the Analyzer when you instantiate it to manage no more than a single document.

use lady_deirdre::analysis::{Analyzer, AnalyzerConfig};

let mut config = AnalyzerConfig::default();

config.single_document = true;

let analyzer = Analyzer::<ChainNode>::new(config);

This configuration option removes some of the memory and performance overhead that the Analyzer otherwise incurs to handle more than one document.

However, note that the single document Analyzer is still capable of managing more than one document, but it is likely that multi-document management would be less efficient. Therefore, you can use this configuration option to design the compiler to usually manage a single compilation unit but not strictly limit it to just one unit.

Custom Hasher

Under the hood, the semantic analysis framework utilizes hash maps and hash sets to store various kinds of inner metadata. By default, these maps and sets use Rust's standard RandomState hasher, which prioritizes resistance to certain kinds of cryptographic attacks relevant to network services. However, it is slower than alternatives that don't provide such guarantees.

Since compilers and language servers intended to run solely on local machines usually don't require this level of security, the performance of the Analyzer could be improved by replacing the standard hasher with a faster compatible alternative from the Rust ecosystem, such as aHash.

To replace the hashing algorithm, you need to explicitly specify the third generic parameter of the Analyzer with the hashing algorithm type of your choice.
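
For example (a hedged sketch; it assumes the aHash crate is added as a dependency, and that the second generic parameter is the task handle type, as mentioned in the task management chapters below):

use lady_deirdre::analysis::{Analyzer, AnalyzerConfig, TriggerHandle};

// The third generic parameter replaces the standard RandomState hasher with
// aHash's implementation of the same BuildHasher interface.
let analyzer: Analyzer<ChainNode, TriggerHandle, ahash::RandomState> =
    Analyzer::new(AnalyzerConfig::default());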

Tasks Management

Under the hood, the Analyzer maintains a queue of tasks, both activated and pending.

To clarify, at any given point in time, the Analyzer activates task objects of only one type (and no more than one exclusive task).

Whenever you request a task object of a particular type, the task manager attempts to activate it immediately according to the current task queue. If activation is not possible, the Analyzer blocks the current thread, enqueues the request, and unblocks the requester thread once the pending request reaches activation (i.e., once all active task objects in the queue that conflict with this request have been released by the concurrent threads).

Graceful Shutdown

The job that your program's thread performs with the task object is subject to graceful shutdown. For this reason, each task request function of the Analyzer requires specifying the task handle through which the job could be signaled to shut down.

let handle = TriggerHandle::new();

let task = analyzer.analyze(&handle, 1).unwrap();

assert!(!handle.is_triggered());
assert!(!task.handle().is_triggered());

// Signals the job for interruption.
handle.trigger();

assert!(handle.is_triggered());
assert!(task.handle().is_triggered());

The TriggerHandle is the default implementation of the handle1. This object is thread-safe and cheap to clone. Once the handle is triggered (via the trigger function), all copies of the instance become triggered, which serves as a marker for the job thread to gracefully finish its job.

You can create the handle from outside of the worker thread where you are scheduling the worker's job and pass a copy of the handle to the worker thread. The worker thread will use it to request the task object from the Analyzer.

The worker thread should periodically check the handle's triggering state. Once the handle is triggered and if the worker hasn't completed its job yet, the worker thread should perform the minimal amount of work required to interrupt its job normally and drop the task object as soon as possible, releasing the acquired access grant back to the Analyzer.
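
A hedged sketch of this arrangement is shown below; the analysis work itself is elided, and MY_COMPILER is the Lazy static from the Shared Analyzer chapter.

use std::thread;

let handle = TriggerHandle::new();
let worker_handle = handle.clone();

let worker = thread::spawn(move || {
    // Requests the task through the handle's clone.
    let task = MY_COMPILER.analyze(&worker_handle, 1).unwrap();

    loop {
        // Periodically checks whether the job has been signaled to shut down.
        if worker_handle.is_triggered() {
            break;
        }

        // ...performs a portion of the analysis job using `task`...
    }

    // Dropping the task releases the acquired access grant back to the Analyzer.
    drop(task);
});

// Later, from the scheduling thread, signals the worker to finish gracefully.
handle.trigger();

worker.join().unwrap();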

The Analyzer could also trigger the handle. For example, if the task manager realizes that some thread has requested a task with higher priority and this kind of access cannot be granted instantly because there are lesser prioritized but still active task objects in the queue, the manager would trigger the handles of these active tasks.

1

Note that Lady Deirdre allows you to create your own task handle types with more complex triggering logic by implementing the TaskHandle trait on the user-defined type. In this case, you would use this type as the second generic parameter of the Analyzer object.

Tasks Interruption

The Analyzer itself examines the handle during semantic graph validation, between the validations of individual attributes. If the validation procedure determines that the handle was triggered in the middle of the analysis, the validator leaves the semantic graph in a not-yet-completed state (without breaking its integrity) and starts returning the Interrupted error from all corresponding functions.

For example, the Attr::read function used inside the computable functions and the Attr::snapshot function used to get a copy of the attribute value from outside both would start returning the Interrupted error.

Receiving this error signals that the task handle was triggered, indicating that you are no longer able to query the semantic graph using this task object, and that you should gracefully finish the worker's job by dropping the task object as soon as possible.

When you receive this error inside the computable function, you should return this error as well from the function.

#[derive(Default, Clone, PartialEq, Eq)]
pub struct BlockAssignmentMap {
    pub map: HashMap<NodeRef, Shared<LocalResolution>>,
}

impl SharedComputable for BlockAssignmentMap {
    type Node = ChainNode;

    fn compute_shared<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Shared<Self>> {
        log_attr::<Self, H, S>(context)?;

        let block_ref = context.node_ref();
        let doc_read = context.read_doc(block_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        let Some(ChainNode::Block { semantics, .. }) = block_ref.deref(doc) else {
            return Ok(Shared::default());
        };

        let block_semantics = semantics.get().unwrap_abnormal()?;

        // The `?` mark would return the Interrupted error from this function if
        // the `read` function was interrupted.
        Ok(block_semantics.analysis.read(context)?.assignments.clone())
    }
}

Note that the validator checks interruption events only between computable function calls. In principle, it is not capable of checking this event during function execution. To make the trigger handle state examination more granular, you can manually check its state in long-running computable functions.

For instance, in the BlockAnalysis attribute of the Chain Analysis example, we are checking the interruption state during the iteration through the assignment statements of the block.

impl Computable for BlockAnalysis {
    type Node = ChainNode;

    fn compute<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Self> {
        let block_ref = context.node_ref();
        let doc_read = context.read_doc(block_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        let mut result = Self::default();

        let Some(ChainNode::Block { statements, .. }) = block_ref.deref(doc) else {
            return Ok(result);
        };

        let mut block_namespace = BlockNamespace::default();

        for st_ref in statements {
            // Returns an Interrupted error if the task handle had been triggered.
            context.proceed()?;

            // ...
        }

        Ok(result)
    }
}

Tasks Priority

The second argument of the task request functions (e.g., analyze, mutate, etc.) is a numeric value denoting the task's priority.

The task manager attempts to prioritize tasks with a higher priority number over tasks with a lower priority number when enqueueing the task object into the task queue.
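
For instance (a hedged sketch; the concrete numbers are arbitrary), a background analysis job could use a lower priority than a mutation job that applies the end user's edits.

let handle = TriggerHandle::new();

// A low-priority background analysis job (e.g., collecting diagnostics).
let analysis_task = analyzer.analyze(&handle, 1).unwrap();

// Dropped here only to keep this single-threaded snippet from self-blocking;
// in a real server, the two requests would come from different worker threads.
drop(analysis_task);

// A higher-priority mutation job (e.g., applying an end user's edit). When
// such a request conflicts with lower-priority active tasks, the task manager
// favors it in the queue and may trigger the handles of those active tasks.
let mutation_task = analyzer.mutate(&handle, 100).unwrap();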

Bulk Interruption

You can specify the minimum tasks priority level allowed in the Analyzer by calling the Analyzer::set_access_level function and specifying the priority threshold.

Calling this function will interrupt all currently active tasks with a priority lower than the threshold.

Non-active pending tasks with a priority lower than the threshold will be removed from the task queue, and the corresponding requester threads will be unblocked, immediately receiving Interrupted errors.

All future task requests with a lower priority than the current threshold will also receive Interrupted errors.

This function is particularly useful for shutting down the entire compiler gracefully. By specifying the maximum threshold value, you can enforce all tasks of all kinds to shut down gracefully, preventing access from being granted to any new incoming task requests.
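
A hedged sketch of such a shutdown; the threshold value is arbitrary and only needs to exceed every priority used by the program's tasks.

// Assuming all tasks in this program use priorities in the 1..=100 range,
// raising the minimum allowed priority above that range interrupts all
// currently active tasks, rejects all pending ones, and causes any future
// task request to immediately receive an Interrupted error.
analyzer.set_access_level(101);

// From this point on, worker threads observe triggered handles and
// Interrupted errors, drop their task objects, and finish gracefully.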

Language Server Design

The rationale behind the Analyzer's complex data access API is that it is specifically designed for use in language server programs that handle language client (code editor) requests concurrently.

The client of the Language Server Protocol notifies the server about various events happening on the client side, allowing the server to handle these requests concurrently in dedicated working threads.

For example, when the client notifies the server that the end user has opened a file in the editor by sending the source code text, you can acquire a mutation task and create a Document to represent this file.

When the end-user edits the file, the client usually sends a notification to the server containing an edited fragment span and the text the user is typing. In this case, you would acquire a mutation task and apply the edit to the corresponding document.

When the user scrolls the document window, clicks or moves the cursor over symbols in the source code, or requests code completion suggestions, the client sends multiple requests to the server asking for various semantic facts about the source code spans that the user is currently observing. The server can use analysis tasks to query the Analyzer's document semantics and respond to these requests accordingly.

The observation requests from the client can be canceled if the client decides that a request is no longer relevant. In this case, once the server receives the cancellation notification, it can signal the corresponding working thread to interrupt its job by triggering the task handle used by the working thread.

Client-side requests can obviously conflict with each other. For example, an incoming document edit notification would conflict with in-progress semantic analysis requests.

These conflicts can be resolved through the Analyzer's task priority system. Analysis tasks used to handle client analysis requests should typically have lower priorities than mutation tasks handling document edits, as immediate synchronization of changes in the source code file on the client side with the server-side state is more important than analysis jobs.

Since the analysis job is subject to frequent interruptions by client-side cancellation notifications and mutation jobs, the typical analysis job workflow involves a loop with the following steps:

  1. At the beginning of the loop, check if the client-side request has been canceled. If it has, break the loop and respond to the client accordingly.
  2. Otherwise, acquire an analysis task from the Analyzer and execute the actual analysis procedure based on the client request inputs.
  3. If the analysis succeeds, return the response to the client with the analysis results and finish the loop.
  4. If the analysis job is interrupted because another thread with a higher priority attempts to acquire a conflicting (mutation) task object, the analysis worker should drop its analysis task object to allow the other thread to fulfill its request1. Then, restart the loop from step one to eventually complete the client-side analysis request.

An important feature of the above procedure is that even if we drop the analysis task in the middle of its execution, the Analyzer may still manage to complete part of the semantic graph validations. When the analysis procedure resumes, it is likely to execute much faster, continuing validation from the point where it was interrupted. This approach is particularly relevant for computation-heavy analysis procedures on highly granular semantic models.
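
A hedged sketch of this loop is shown below. The request_cancelled and run_request helpers, as well as the DocId and Response types, are hypothetical placeholders for the language server's own request bookkeeping and analysis logic, and MY_COMPILER is the Lazy static from the Shared Analyzer chapter.

fn handle_analysis_request(doc_id: DocId) -> Option<Response> {
    loop {
        // 1. The client canceled the request; finish and respond accordingly.
        if request_cancelled() {
            return None;
        }

        // 2. Acquires a low-priority analysis task. A dedicated handle lets
        //    the Analyzer (or the server) interrupt this particular job.
        let handle = TriggerHandle::new();

        let Ok(task) = MY_COMPILER.analyze(&handle, 1) else {
            return None;
        };

        // 3. Runs the actual analysis based on the client request inputs.
        match run_request(&task, &doc_id) {
            Ok(response) => return Some(response),

            // 4. The job was interrupted (e.g., by a higher-priority mutation
            //    request). Drop the task object and retry; the parts of the
            //    semantic graph already validated are preserved, so the retry
            //    is likely to run faster.
            Err(_) => {
                drop(task);
                std::thread::yield_now();
                continue;
            }
        }
    }
}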

1

At this step, you can even park or sleep the current thread for a short amount of time to ensure that the other thread acquires the requested task without race conditions.

Configuration Issues

Many functions in the semantic analysis framework API can return an AnalysisError, representing either a normal result (e.g., an Interrupted error) or an abnormal error indicating a configuration or usage issue with the framework.

For example, the write_to_doc function of the mutation task can return a MissingDocument error if you specify a document ID that does not exist in the Analyzer (e.g., if the document was previously removed from the Analyzer).

The API documentation for framework functions typically describes the types of errors that a function can return. Depending on the situation, you may handle certain errors manually. However, as a fallback, it is recommended to return normal errors from functions that return AnalysisResult and to panic immediately if an abnormal error occurs. This convention helps identify configuration or usage issues more quickly.

Canonical compilers written with Lady Deirdre should be designed to be infallible. If you receive an abnormal error from a framework function, it likely indicates a bug in your program's code that needs to be fixed.

In particular, the computable functions of the Chain Analysis example use the unwrap_abnormal helper function to filter out normal errors from abnormal ones, panicking if an abnormal error is encountered.

impl SharedComputable for BlockNamespaceMap {
    type Node = ChainNode;

    fn compute_shared<H: TaskHandle, S: SyncBuildHasher>(
        context: &mut AttrContext<Self::Node, H, S>,
    ) -> AnalysisResult<Shared<Self>> {
        log_attr::<Self, H, S>(context)?;

        let block_ref = context.node_ref();
        let doc_read = context.read_doc(block_ref.id).unwrap_abnormal()?;
        let doc = doc_read.deref();

        let Some(ChainNode::Block { semantics, .. }) = block_ref.deref(doc) else {
            return Ok(Shared::default());
        };

        // The `Semantics::get` function returns a reference to the node's
        // semantics. However, in theory, it could also return
        // an `UninitSemantics` error if the semantics of the node were not
        // properly initialized for some obscure reason.
        //
        // In such a case, the `unwrap_abnormal` function will panic accordingly.
        let block_semantics = semantics.get().unwrap_abnormal()?;

        Ok(block_semantics
            .analysis
            .read(context)
            // If the `Attr::read` function returns an Interrupted error, this
            // error will be propagated from this computable function as well.
            //
            // However, if the function returns any other type of error,
            // considered as abnormal, the `unwrap_abnormal` function will
            // also panic accordingly.
            .unwrap_abnormal()?
            .blocks
            .clone())
    }
}

Additionally, it is recommended to log every computable function at the beginning of its implementation. This practice aids in swiftly identifying various issues in the attributes logic by examining the log trace.

In the provided snippet, the log_attr function generates a debug message for the logger regarding the computable function about to be executed, along with the syntax tree node snippet on which this attribute is being computed. This function is implemented within the Chain Analysis example's codebase. Lady Deirdre does not include built-in functionality for logging computable functions, as it does not have a built-in logger and assumes that logging infrastructure is implementation-specific.

Cycles Detection

The absence of cycles in the semantic graph is a framework requirement that users need to implement manually.

Graph cycles share similarities with unbounded infinite recursion accidentally introduced into the source code. Lady Deirdre cannot proactively check the graph structure because it evolves at runtime based on custom logic within computable functions.

There are two primary methods for detecting accidentally introduced cycles. Firstly, logging computable functions helps establish how attributes refer to each other during execution via the log trace.

Secondly, a hard timeout limits computable function execution. Typically, if the semantic model of the language is granular, computable functions complete execution within a few milliseconds, even on low-end CPUs. By default, the Analyzer sets the timeout limit to a few seconds1. If a computable function exceeds this limit, it indicates a potential issue in the semantics design, and the corresponding analysis function (e.g., Attr::snapshot) yields a Timeout error.

This mechanism is also useful for detecting cycles. When the semantic graph validator encounters a cyclic reference between attributes, it deadlocks the validation procedure. However, due to the timeout limit, the validator eventually unblocks with a Timeout error.

During debugging (when the debug_assertions feature flag is enabled), the Timeout error is considered abnormal. Debug builds aim to detect attributes with long execution times and cycles in the semantic graph as early as possible for debugging purposes. However, in production builds, Timeout errors are treated as normal errors, assumed to be caused by edge cases in the project's source code compilation, and thus handled without panic2.

1

You can configure this limit via the AnalyzerConfig object, which you pass to the Analyzer's constructor.

2

The user of the code editor's extension would prefer the extension to gracefully ignore specific edge cases that the plugin is unable to handle, rather than causing the entire plugin to terminate abruptly.

Code Diagnostics

While most end attributes of the semantic graph aim to infer specific semantic facts about particular syntax tree nodes, code diagnostics (semantic errors and warnings) are intended to be collected from the entire syntax tree.

To tackle this issue and to preserve the incremental nature of code diagnostics, you can gather diagnostic messages locally per scope: iterate through the scope node's attributes that potentially contain diagnostic issues, and collect these issues into the hash set of the scope's local diagnostics attribute.

Subsequently, in the root node's global diagnostics attribute, you can iterate through all local diagnostics attributes of the scopes and aggregate their values into a single set, wrapping it into a Shared structure for efficient cloning. Furthermore, you can enhance the final diagnostics set with the compilation unit's syntax errors by reading them directly from the document1.

The resulting global diagnostics attribute would indirectly depend on the majority of the semantic graph. Despite potential optimizations by the validator due to granularity, querying this attribute could still be computationally intensive in edge cases. To mitigate this, the language server could periodically examine this attribute with a low-priority analysis task.

Moreover, when utilizing the Attr::snapshot function to retrieve a copy of the current diagnostics sets, you can leverage the version number of the attribute value to determine whether this set needs to be republished to the client.
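
For example (a hedged sketch; the root_semantics variable, the diagnostics attribute, and the publish_diagnostics callback are hypothetical placeholders for the server's own code):

// The state published previously (obtained from an earlier snapshot call).
let (mut last_published, _) = root_semantics.diagnostics.snapshot(&task).unwrap();

// ...later, after further edits and revalidations...

let (version, diagnostics) = root_semantics.diagnostics.snapshot(&task).unwrap();

// Republishes the diagnostics to the client only if the attribute's value was
// actually updated since the previous publication.
if version > last_published {
    publish_diagnostics(diagnostics);
    last_published = version;
}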

1

The Document::errors function would provide you with an iterator over all syntax errors within the compilation unit.

Tree Index

The semantic analysis framework of Lady Deirdre is capable of maintaining user-defined context-unaware indexes of the document's syntax tree nodes.

Examples of these indices include all methods within the document, all code blocks, identifiers partitioned by name, and so forth.

Indices serve various purposes. For instance, in the previous chapter, we discussed document-wide diagnostics, where the root's diagnostic attribute collects local diagnostics from all scope nodes within the document. To support this attribute, you can define an index of all scopes within the document. The attribute will then read this class of nodes inside the computable function to inspect all their local diagnostic attribute values.

Another example involves highlighting all identifiers in the code that relate to a single variable. Typically, within the attributes framework, it's easier to establish a variable usage node's definition node than the opposite relation. When the end user clicks on a variable usage symbol in the code editor, the language client asks the language server for all spans related to that symbol so they can be highlighted.

Here's how you can fulfill a request for highlighting all identifiers within the code that relate to a single variable:

  1. Traverse the syntax tree to determine the node on which the end user clicks1.
  2. Query this node's attribute to retrieve its variable definition node. At this point, we discover two spans: the variable usage span where the user's cursor is and the variable definition span. However, we don't yet know about other variable usage spans within the code that would be useful for the editor's user.
  3. Query the index of all variable usages within the code with the specific name. Some of these will be the identifiers we are looking for, while others may be false candidates located outside the variable definition scope.
  4. Since false candidates are likely to be a relatively small subset, owing to the practice of programmers using distinct variable names, filter out these false instances in the set. This can be achieved by querying their definition attributes again to determine if they have the same definition node as the one discovered in step 2.
1

You can utilize depth-first traversal using the Document::traverse_tree function. By skipping the descent into child nodes with spans that don't cover the targeted site, the traversal complexity averages to O(ln(N)), where N is the number of nodes in the tree. In other words, traversing will typically be quite fast.

Index Setup

To enable the index, you need to specify the nodes classifier using the #[classifier(...)] macro attribute.

#[derive(Node)]
#[classifier(ChainNodeClassifier)]
pub enum ChainNode {
   // ...
}

The parameter of this macro attribute is an arbitrary type that implements the Classifier trait. It denotes classes of nodes, essentially serving as indices, and the function that partitions requested nodes between these classes.

In the Chain Analysis example, we define just one class for all ChainNode::Key nodes within the syntax tree.

#[derive(Clone, PartialEq, Eq, Hash)]
pub enum ChainNodeClass {
    AllKeys, // All keys
}

pub struct ChainNodeClassifier;

impl Classifier for ChainNodeClassifier {
    // Must match the type of the syntax tree node.
    type Node = ChainNode;
    
    // Could be any user-defined type eligible for the HashSet. 
    type Class = ChainNodeClass;

    // Given the Document and the NodeRef that points to the node inside this
    // document, this function should compute a set of classes to which this
    // node belongs, possibly returning an empty set.
    fn classify<S: SyncBuildHasher>(
        doc: &Document<Self::Node>,
        node_ref: &NodeRef,
    ) -> HashSet<Self::Class, S> {
        let mut result = HashSet::with_hasher(S::default());

        let Some(node) = node_ref.deref(doc) else {
            return result;
        };

        match node {
            ChainNode::Key { .. } => {
                let _ = result.insert(ChainNodeClass::AllKeys);
            }

            _ => (),
        }

        result
    }
}

Inside the classification function, you need to dereference the specified node to determine its classes. You can examine its lexical structure, but you should avoid inspecting the structure of the node's parent and children, as the classification must be context-unaware: node classes are purely lexical classes.

In the above code, we classify the node by its enum discriminant only. In more complex setups, you can use the TokenRef references of the node, as these references are part of the node's lexical structure. For example, we could partition the Keys by their token strings as well, using the strings as part of the class.

Index Maintenance

The Analyzer automatically maintains the index. During the initial parsing of the newly created document, the Analyzer creates a document index by calling the above function on each syntax tree node, associating each class with the set of nodes of this class.

When you edit the document (using the write_to_doc function), the Analyzer incrementally updates this partition based on the changes in the structure of the syntax tree.

Index Access

You can query a set of nodes of the document that belong to a specified class both inside the computable functions of the attributes using the read_class function of the context variable, and outside using the snapshot_class function of the task object.

Both functions return a Shared set of the NodeRefs that point to the nodes in the document's syntax tree belonging to the class.

When you query the index from inside a computable function, the attribute subscribes to changes in this class: whenever the returned set changes, the attribute is invalidated. Therefore, inside the computable function of any kind of attribute, you can traverse and dereference the nodes from the returned set and read their semantics. However, in general, you should avoid inspecting these nodes' structures more deeply.
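
A rough sketch of both access paths follows. It is shown as a fragment rather than a complete function, and the exact signatures of read_class and snapshot_class (the document identifier argument, the return and error types) are assumptions; treat this only as the general shape.

// Inside a computable function of an attribute (the `context` handle and
// the `doc_id` identifier come from the attribute's surroundings; the
// exact read_class signature is an assumption). Querying the class also
// subscribes this attribute to changes of the class.
let all_keys = context.read_class(doc_id, &ChainNodeClass::AllKeys)?;

// The returned Shared set can be iterated; each NodeRef can be dereferenced
// and its semantics read, but deeper structural inspection should be avoided.
for key_ref in all_keys.as_ref() {
    // ...
}

// Outside of attributes, on a task object (signature likewise assumed):
let all_keys = task.snapshot_class(doc_id, &ChainNodeClass::AllKeys)?;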

Code Formatters

Programming language compilers and the language support extensions of code editors often include code formatting programs. These programs take the source code text as input and bring it into a canonical format according to the code formatting rules.

Code formatting presents a challenging problem that must consider both canonical formatting rules and the original intentions of the code author, such as preserving empty lines between code fragments and retaining code comments on their respective lines.

Lady Deirdre offers two tools to aid in implementing the code formatter:

  1. The ParseTree builder constructs a concrete parsing tree. Unlike abstract syntax trees, it intentionally preserves all original source code whitespaces and comments.
  2. The PrettyPrinter tool automatically decides on splitting the text into lines and determines line indentation to ensure that the final lines are aligned according to the predefined maximum line length.

The parse tree builder utilizes the same syntax parser defined within the Node derive macro. However, unlike the Document or ImmutableSyntaxTree parsers, this parser preserves all tokens and trivia nodes, such as comments, in the concrete parse tree. While the structure of the output concrete parse tree resembles the syntax tree in terms of structural nesting, parse tree nodes include all children that the syntax tree grammar would otherwise omit, regardless of the capturing rules.

Nodes of the parse tree comprehensively cover the original source code without gaps or overlaps. Consequently, the original source code text can be fully reconstructed by traversing the tree. To simplify tree traversal, parse tree nodes are owned by their parents.

The source code text is expected to be provided to the parse tree builder, and the concrete parse tree is then used as input for the formatting tool. During tree traversal, your program interprets the lexis and tree nesting of the source code in accordance with your programming language formatting rules, considering the concrete tree configuration. It then feeds the source code tokens into the syntax-unaware Pretty Printer. The printer generates the final output string.

The Json Formatter example demonstrates the basic usage of both tools.

Pretty Printer

The pretty printer object operates in a syntax-unaware manner, meaning it doesn't consider the meaning of the original grammar tokens and nodes. Instead, it deals with its own system of abstract tokens that shape the output in terms of string words, blanks between words, and word groups only.

The objective of the printer is to interpret each incoming blank token as either a whitespace or a line break, depending on the current line length and the preconfigured maximum line length. If the printer decides to break the line at the location of a blank token, it indents or dedents the subsequent lines in accordance with the nesting of the groups.

Printing Words

The PrettyPrinter::word function feeds a string word token into the printer. The word will be printed on the current line of the output, increasing the line length.

You invoke this function whenever encountering a parse tree token with content, such as a keyword, identifier, or anything else besides whitespace or a line break.

Additionally, you can call this function with a whitespace string to instruct the printer to preserve the whitespace on the line regardless of the current line length. This is useful, for instance, for the inner content of comments or for source code string literals. However, it's not recommended to use this function to forcibly break lines; in such cases, you should use the hardbreak function.

Blank Tokens

To separate the words in the output of the printer, you utilize one of the pretty printer's "blank" token functions:

  • PrettyPrinter::blank is the default blank token, interpreted either as a single whitespace or a line break. You call this function when the next word should be separated from the previous one, possibly with a line break, depending on the printing algorithm's decision.

  • PrettyPrinter::hardbreak is a blank token that enforces the printer to always interpret it as a line break. You call this function, for instance, when printing the inner content of multi-line comments, as the structure of the comment's text typically needs to be preserved.

  • PrettyPrinter::softbreak is similar to the blank function, but if the printer's algorithm decides to keep the next token on the line, it does not insert whitespace. This function is useful, for example, for delimiting the . dot tokens in a call chain (foo.bar). In such cases, you can insert a softbreak token before the dot but omit the delimiter after the dot word.
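
To see how these tokens combine, here is a minimal sketch that prints a short call chain. The printer's construction is omitted, and the assumption that word accepts plain string slices should be checked against the crate documentation.

// Prints `foo.bar`, allowing the printer to break the line before the dot;
// if no break happens, no whitespace is inserted around the dot.
fn print_chain(printer: &mut PrettyPrinter) {
    printer.word("foo");
    printer.softbreak(); // a potential line break without a whitespace
    printer.word(".");
    printer.word("bar");
}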

Word Groups

During the formatting process, when the current content exceeds the maximum line length, the algorithm attempts to break the content into lines by interpreting blank tokens as line breaks.

At this point, the algorithm can either interpret all blank tokens as line breaks, resulting in consistent line splitting, or it can selectively interpret some blank tokens, maximizing the utilization of line space.

Consistent line splitting is preferable for source code blocks, such as JSON objects.

{
    "foo": 123,
    "bar": 456,
    "baz": 789
}

For enumerations of simple items (e.g., JSON arrays), inconsistent breaking is more suitable.

[123, 456, 789, 1011, 1213, 1516, 1718, 1920, 2122,
    2324, 2526, 2728]

Content breaking occurs within word groups. To initiate a new group, you use either PrettyPrinter::cbox or PrettyPrinter::ibox. The former begins a consistent word group, while the latter starts an inconsistent group.

Each group must be closed by calling PrettyPrinter::end.

Both cbox and ibox functions accept indentation level shifting for the group, represented by a signed integer. Positive values increase the inner content's indentation, negative values decrease it (with zero having no effect). When the printer breaks the content inside the group, each new line is indented with whitespace according to the current indentation level.
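
For example, printing a JSON array as an inconsistent group might look roughly like the sketch below; the argument types of word and ibox are assumptions.

// An inconsistent group for a JSON array: blank tokens become line breaks
// only where needed, and continuation lines are indented by the group shift.
fn print_array(printer: &mut PrettyPrinter, items: &[&str]) {
    printer.ibox(4); // inconsistent group; inner lines shifted by +4
    printer.word("[");

    let mut first = true;

    for item in items {
        if !first {
            printer.word(",");
            printer.blank(); // a whitespace or a line break
        }

        first = false;
        printer.word(*item);
    }

    printer.word("]");
    printer.end(); // every cbox/ibox must be closed
}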

Overriding Indentations

You can manually adjust line indentation by calling the PrettyPrinter::indent function immediately after submitting any of the blank tokens. If the algorithm interprets the submitted token as a line break, the next line, as well as all subsequent lines, will be shifted accordingly.
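
For instance, a sketch of dedenting the line that closes a block, assuming indent accepts the same signed shift as the group functions:

// Force a line break, then shift the following lines back by the group's
// indentation so the closing delimiter aligns with the opening line.
// (hardbreak always breaks, so the shift always applies here.)
printer.hardbreak();
printer.indent(-4);
printer.word("}");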

Keeping Content In Line

In general, the algorithm aims to break lines as early as possible, so that parent word groups are split across lines while leaf groups remain on a single line.

{
    "foo": [123, 456, 789, 1011, 1213, 1516, 1718, 1920, 2122]
}

This approach is generally suitable for most practical use cases. However, there are situations where it's preferable to keep the parent content on the line and to split the nested groups instead.

In such cases, you can utilize the PrettyPrinter::neverbreak function, which instructs the printer to reset the current line length counter to zero. Consequently, the algorithm assumes that the previously submitted text fits on the line, and begins splitting from the subsequent nested submissions.

{ "foo": [123, 456, 789, 1011, 1213, 1516, 1718,
    1920, 2122] }

Trailing Commas

Depending on the grammar, some languages allow a trailing comma at the end of lists (e.g., Rust and JavaScript, but not JSON). This improves readability when the list is split into multiple lines: the last item receives a trailing comma, but the comma is omitted if the content remains on a single line.

This formatting rule depends on whether the algorithm decides to insert a line break at the blank token. To address this, you can use the PrettyPrinter::pre_break and PrettyPrinter::pre_space functions to configure the preceding blank token.

Both functions must be called immediately after submitting the blank token. The first function, pre_break, specifies a word that will be inserted before the line break (at the end of the previous line), while the second function, pre_space, inserts the specified word otherwise (the word will appear before the whitespace when paired with the blank function).

When your program formats a comma-separated list, you can insert regular , commas after each intermediate item and a normal blank token after each comma. At the end of the list, after the last submitted item, you can submit a softbreak token and configure it with pre_break(','), ensuring that if this trailing blank token receives a line break, the last line of the list is appended with a comma.
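
Putting it together, a sketch of such a list printer follows; as before, the exact argument types are assumptions.

// Prints a comma-separated list where the trailing comma appears only
// when the final blank token is rendered as a line break.
fn print_list(printer: &mut PrettyPrinter, items: &[&str]) {
    printer.cbox(4);
    printer.word("[");
    printer.softbreak();

    let mut first = true;

    for item in items {
        if !first {
            printer.word(",");
            printer.blank();
        }

        first = false;
        printer.word(*item);
    }

    printer.softbreak();
    printer.pre_break(','); // emits ',' at the end of the previous line on break
    printer.word("]");
    printer.end();
}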

Snippets

When a compilation project has errors or warnings, it is usually more beneficial for the end user to print source code snippets in the terminal, annotating the fragments where the issues occur.

While there are several similar tools in the Rust ecosystem that you can use with this crate, Lady Deirdre provides its own solution as well.

The Snippet is a configurable builder object that prints the source code text of a compilation unit, or a part of it, with emphasized fragments annotated with custom messages.

   ╭──╢ Unit(1) ╟──────────────────────────────────────────────────────────────╮
 1 │ {                                                                         │
 2 │     "foo": true,                                                          │
 3 │     "bar": [123 "baz"]                                                    │
   │                ╰╴ missing ',' in Array                                    │
 4 │ }                                                                         │
   ├───────────────────────────────────────────────────────────────────────────┤
   │ Array syntax error.                                                       │
   ╰───────────────────────────────────────────────────────────────────────────╯

You create the builder in the Display or Debug context, providing the Document (or any similar object with lexis, such as TokenBuffer) that needs to be printed, and annotate arbitrary code spans with string messages.

Once building is finished, the Snippet prints the annotated snippet into the Formatter's output.

The Json Highlight example demonstrates how to set up this builder on a custom object that wraps a compilation unit document.

pub struct JsonSnippet<'a> {
    pub doc: &'a Document<JsonNode>,
    pub annotation: Vec<(PositionSpan, AnnotationPriority, &'static str)>,
}

impl<'a> Display for JsonSnippet<'a> {
    fn fmt(&self, formatter: &mut Formatter<'_>) -> std::fmt::Result {
        // You create the Snippet builder by calling the snippet function on
        // the Formatter. The specified parameter is a reference to the document
        // that contains the source code text.
        let mut snippet = formatter.snippet(self.doc);

        snippet
            // Configure the Snippet's header and footer text.
            // If omitted, the header and footer decorations will also be
            // omitted accordingly.
            .set_caption("Header text")
            .set_summary("Footer text.")
            // Specifies the highlighter that instructs the Snippet on how to
            // stylize individual tokens of the printed text.
            
            // This configuration is optional. When omitted, the Snippet will
            // print all tokens uniformly using the default color scheme.
            .set_highlighter(JsonHighlighter);

        for (span, priority, message) in &self.annotation {
            // Adds an annotated span to the builder.
            //
            // The Snippet will print the specified fragment with an inverted
            // foreground (using the foreground color specified by
            // the annotation priority).
            //
            // If a message is present (the message string is not empty),
            // the Snippet will print this message near the spanned fragment.
            //
            // If the Snippet has specified annotations, it will print only
            // the source code lines that contain annotated fragments
            // (regardless of whether they have a message), plus some lines that
            // surround these lines before and after.
            //
            // If the Snippet does not have any annotations, it will print
            // the entire source code text.
            snippet.annotate(span, *priority, *message);
        }

        // Finishes the builder and prints the snippet to the Formatter's output.
        snippet.finish()
    }
}

The Snippet has several drawing configuration options that you can specify using the Snippet::set_config function. Here are a few:

  • You can show or hide line numbers, header and footer, and the outer frame.
  • You can enforce the Snippet to use ASCII-only drawing.
  • You can disable all terminal styles so that the Snippet will be monochrome.

By default (if you don't provide the drawing config manually), the builder draws the snippet with all drawing options turned off when the formatter is not in alternate mode (format!("{}")), and with all drawing options enabled when it is (format!("{:#}")).
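
For example, the JsonSnippet wrapper from the example above might be used as follows; the Position constructor, the shape of PositionSpan, and the AnnotationPriority variant name are illustrative assumptions.

// Given a parsed document of the unit, annotate a span and print the
// snippet with the full set of drawing options (alternate mode).
fn report(doc: &Document<JsonNode>) {
    let snippet = JsonSnippet {
        doc,
        annotation: vec![(
            // The position constructor and the priority variant are assumptions.
            Position::new(3, 17)..Position::new(3, 22),
            AnnotationPriority::Default,
            "missing ',' in Array",
        )],
    };

    println!("{snippet:#}");
}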

In the example above, we specify the JSON syntax highlighter using the Snippet::set_highlighter function.

The highlighter is a stateful object that implements the Highlighter trait and instructs the Snippet on how to stylize the source code tokens. The Snippet builder calls Highlighter::token_style for each token in the source code sequentially, and the function returns the Style of the token.

pub struct JsonHighlighter;

impl Highlighter<JsonToken> for JsonHighlighter {
    // The `dim` argument is set to true if this token is meant to have lesser
    // contrast than usual.
    //
    // The Snippet prefers to print the tokens outside of the annotated
    // fragments with lesser contrast to focus the user's attention on
    // the annotated spans.
    //
    // If the function returns None, it means that the token will be printed
    // without additional styles.
    fn token_style(&mut self, dim: bool, token: JsonToken) -> Option<Style> {
        match token {
            JsonToken::True | JsonToken::False | JsonToken::Null => Some(match dim {
                false => Style::new().blue(),
                true => Style::new().bright_blue(),
            }),

            JsonToken::String => Some(match dim {
                false => Style::new().green(),
                true => Style::new().bright_green(),
            }),

            JsonToken::BraceOpen | JsonToken::BraceClose => Some(Style::new().bold()),

            _ => None,
        }
    }
}

Since the highlighter is a stateful object, it can rely on previous tokens to make a decision about the next token style. For example, if the highlighter discovers that the token is part of a comment or a string literal context, it can stylize this token accordingly.
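
For instance, a highlighter could track bracket nesting across calls and embolden only the outermost braces. Below is a minimal sketch of that idea, reusing only the trait signature and token variants shown in the example above.

pub struct JsonDepthHighlighter {
    // State preserved between token_style calls.
    depth: usize,
}

impl Highlighter<JsonToken> for JsonDepthHighlighter {
    fn token_style(&mut self, dim: bool, token: JsonToken) -> Option<Style> {
        match token {
            JsonToken::BraceOpen => {
                self.depth += 1;

                // Embolden only the outermost object's opening brace.
                match self.depth == 1 && !dim {
                    true => Some(Style::new().bold()),
                    false => None,
                }
            }

            JsonToken::BraceClose => {
                // Embolden the matching outermost closing brace, then
                // decrease the tracked nesting depth.
                let style = match self.depth == 1 && !dim {
                    true => Some(Style::new().bold()),
                    false => None,
                };

                self.depth = self.depth.saturating_sub(1);

                style
            }

            _ => None,
        }
    }
}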