Code Inspection

Text Addressing

In Lady Deirdre, the minimal unit for indexing the source code text is the Unicode character.

A Site is a numeric type (an alias of usize) representing the absolute Unicode character index in a string.

SiteSpan is an alias type of Range<Site> that denotes a fragment (or span) of the source code.
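
Since Site is merely an alias of usize and SiteSpan an alias of Range<Site>, ordinary numeric ranges already serve as spans. A minimal sketch (assuming Site and SiteSpan are importable from lady_deirdre::lexis):

use lady_deirdre::lexis::{Site, SiteSpan};

// A `Site` is an absolute Unicode character index; a `SiteSpan` is just a
// `Range<Site>`.
let start: Site = 4;
let end: Site = 9;
let span: SiteSpan = start..end;

assert_eq!(span, 4..9);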

Most API functions within the crate conveniently accept impl ToSite or impl ToSpan objects when users need to address specific source code characters or spans of characters. These traits facilitate automatic conversion between different representations of source code indices.

One example of a source code index type is the Position object, which addresses code in terms of a line and a column within that line. It implements the ToSite trait.

For any type implementing the ToSite trait, ToSpan is automatically implemented for all standard Rust range types bounded by that type. For instance, since both Site and Position implement the ToSite trait, 10..20, 10..=20, 10.., and Position::new(10, 40)..Position::new(14, 2) are all valid span types.

However, a particular span instance could be invalid; for instance, 20..10 is invalid because its lower bound is greater than its upper bound.

Certain API functions in the crate (e.g., Document::write) require that the specified span must be valid; otherwise, the function would panic. This behavior aligns with Rust's behavior when indexing arrays with invalid ranges.

You can check the validity of a range upfront using the ToSpan::is_valid_span function.

The RangeFull .. object always represents the entire content and is always valid.

use lady_deirdre::lexis::{Position, ToSpan, TokenBuffer};

let mut buf = TokenBuffer::<JsonToken>::from("foo\nbar\nbaz");

assert!((2..7).is_valid_span(&buf));
assert!((2..).is_valid_span(&buf));
assert!((..).is_valid_span(&buf));
assert!(!(7..2).is_valid_span(&buf));
assert!((Position::new(1, 2)..Position::new(3, 1)).is_valid_span(&buf));

Please note that the line and column numbers in the Position object are one-based: 1 denotes the first line, 2 denotes the second line, and so forth.
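
When a span is computed at run time rather than written as a literal, it can be validated upfront before being passed to a function that requires a valid span (such as Document::write). A minimal sketch reusing the buffer from the example above:

use lady_deirdre::lexis::{SourceCode, ToSpan, TokenBuffer};

let buf = TokenBuffer::<JsonToken>::from("foo\nbar\nbaz");

// Imagine this range was derived from user input and is not known to be
// valid in advance.
let span = 2..7;

if span.is_valid_span(&buf) {
    // The span is known to be valid here, so functions that would panic on
    // an invalid span can be called safely.
    println!("{}", buf.substring(span));
}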

Text Inspection

The following SourceCode functions enable you to query various metadata about the compilation unit's text.

use lady_deirdre::lexis::{SourceCode, TokenBuffer};

let mut buf = TokenBuffer::<JsonToken>::from("foo, bar, baz");

// The `substring` function returns a `Cow<str>` representing the substring
// within the specified span.
// The underlying implementation attempts to return a borrowed string whenever
// possible.
assert_eq!(buf.substring(2..7), "o, ba");

// Iterates through the Unicode characters in the span.
for ch in buf.chars(2..7) {
    println!("{ch}");
}

// A total number of Unicode characters.
assert_eq!(buf.length(), 13);

// Returns true if the code is empty (contains no text or tokens).
assert!(!buf.is_empty());

// A total number of lines (delimited by `\n`).
assert_eq!(buf.lines().lines_count(), 1);

From buf.lines(), you receive a LineIndex object that provides additional functions for querying metadata about the source code lines. For example, you can fetch the length of a particular line using this object.
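
For instance, a sketch that relies only on the lines_count function shown above (the LineIndex API documentation describes the per-line queries, such as the length of an individual line; the exact method names are not reproduced here):

use lady_deirdre::lexis::{SourceCode, TokenBuffer};

let buf = TokenBuffer::<JsonToken>::from("foo\nbar\nbaz");

// The `lines` function returns the `LineIndex` describing the line
// structure of the text.
let lines = buf.lines();

// Three lines delimited by `\n`.
assert_eq!(lines.lines_count(), 3);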

Tokens Iteration

The SourceCode::cursor and its simplified version chunks allow you to iterate through the tokens of the source code.

Both functions accept a span of the source code text and yield tokens that "touch" the specified span. Touching means that the token's string is fully covered by the span, intersects with it, or at least contacts the span at one of its bounds.

For example, if the text "FooBarBazWaz" consists of the tokens "Foo", "Bar", "Baz", and "Waz", the span 3..7 would contact the "Foo" token (3 is the end of the token's span), fully cover the "Bar" token, and intersect with the "Baz" token (by the "B" character). However, the "Waz" token is outside of this span and will not be yielded.

In other words, these functions attempt to yield the widest set of tokens that are in any way related to the specified span.

The chunks function simply returns a standard iterator over the token metadata. Each Chunk object contains the token instance, a reference to its string, the absolute Site of the beginning of the token, and the substring length in Unicode characters.

use lady_deirdre::lexis::{Chunk, SourceCode, TokenBuffer};

let buf = TokenBuffer::<JsonToken>::from("123 true null");

for Chunk {
    token,
    string,
    site,
    length,
} in buf.chunks(..)
{
    println!("---");
    println!("Token: {token:?}");
    println!("String: {string:?}");
    println!("Site: {site}");
    println!("Length: {length}");
}
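
To illustrate the touching rule described above, the same buffer can be queried with a partial span. A sketch, assuming the JsonToken lexer from the earlier chapters yields separate whitespace tokens:

use lady_deirdre::lexis::{SourceCode, TokenBuffer};

let buf = TokenBuffer::<JsonToken>::from("123 true null");

// The span 2..6 covers the text "3 tr": it intersects the "123" token,
// fully covers the whitespace between "123" and "true", and intersects the
// "true" token. The "null" token is unrelated to the span and is not
// yielded.
for chunk in buf.chunks(2..6) {
    println!("{:?} at {}..{}", chunk.string, chunk.site, chunk.site + chunk.length);
}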

The cursor function returns a more complex TokenCursor object that implements a cursor-like API with built-in lookahead capabilities and manual control over the iteration process. This object is particularly useful for syntax parsing and will be discussed in more detail in the subsequent chapters of this guide.

To give you a brief overview of the token cursor, the above code could be rewritten with the token cursor as follows:

use lady_deirdre::lexis::{SourceCode, TokenBuffer, TokenCursor};

let buf = TokenBuffer::<JsonToken>::from("123 true null");

let mut cursor = buf.cursor(..);

loop {
    // 0 means zero lookahead -- we are looking at the token the cursor
    // is currently pointing at.
    let token = cursor.token(0);

    // If the cursor has reached the end of the input, break the loop.
    if token == JsonToken::EOI {
        break;
    }

    println!("---");
    println!("Token: {:?}", cursor.token(0));
    println!("String: {:?}", cursor.string(0));
    println!("Site: {:?}", cursor.site(0));
    println!("Length: {:?}", cursor.length(0));

    // Manually moves token cursor to the next token.
    cursor.advance();
}

Note that the Chunk object represents a valid span and implements the ToSpan trait.
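
For example, a chunk can be passed directly wherever a span is expected. A minimal sketch:

use lady_deirdre::lexis::{SourceCode, TokenBuffer};

let buf = TokenBuffer::<JsonToken>::from("123 true null");

for chunk in buf.chunks(..) {
    let string = chunk.string;

    // Since `Chunk` implements `ToSpan`, the chunk itself addresses the
    // token's span; the substring of that span is the chunk's own string.
    assert_eq!(buf.substring(chunk), string);
}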