refactor(parser): lexer replace Chars with Source (#2288)

This PR replaces the `Chars` iterator in the lexer with a new structure
`Source`.

## What it does

`Source` holds the source text, and allows:

* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used
later to rewind to that position, without having to clone the entire
`Source` struct.

`Source` has the same invariants as `Chars` - cursor must always be
positioned on a UTF-8 character boundary (i.e. not in the middle of a
multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break
that invariant, as long as they satisfy it again before they pass
control back to safe code. This will be useful for processing batches of
bytes.
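
As a rough illustration of the API shape described above, here is a minimal sketch. It is not the implementation in this PR: `MiniSource` and `MiniPosition` are hypothetical names, and the sketch uses a byte offset with safe slicing where the real `Source` uses raw pointers.

```rust
/// Hypothetical, simplified stand-in for `Source` (assumption: offset-based, fully safe).
struct MiniSource<'a> {
    text: &'a str,
    /// Byte offset of the cursor. Always on a UTF-8 character boundary.
    offset: usize,
}

/// Hypothetical stand-in for `SourcePosition`: a small `Copy` checkpoint.
#[derive(Clone, Copy)]
struct MiniPosition(usize);

impl<'a> MiniSource<'a> {
    fn new(text: &'a str) -> Self {
        Self { text, offset: 0 }
    }

    /// Peek the next byte without consuming it.
    fn peek_byte(&self) -> Option<u8> {
        self.text.as_bytes().get(self.offset).copied()
    }

    /// Consume the next char, advancing past all of its bytes.
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.offset..].chars().next()?;
        self.offset += c.len_utf8();
        Some(c)
    }

    /// Record the current position so it can be restored later.
    fn position(&self) -> MiniPosition {
        MiniPosition(self.offset)
    }

    /// Rewind to a recorded position. Checked here, unlike the unchecked real version.
    fn set_position(&mut self, pos: MiniPosition) {
        assert!(self.text.is_char_boundary(pos.0));
        self.offset = pos.0;
    }
}

/// Read a 2-byte char, peek the following byte, rewind, and read the char again.
fn demo() -> (Option<char>, Option<u8>, Option<char>) {
    let mut source = MiniSource::new("é=1");
    let checkpoint = source.position();
    let first = source.next_char();
    let byte = source.peek_byte();
    source.set_position(checkpoint);
    let again = source.next_char();
    (first, byte, again)
}
```

The checkpoint is a single integer here, so rewinding never needs to clone the whole cursor struct.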

## Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I
believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming
identifiers or whitespace) without the overhead of calculating code
points for every character. It should also make all the many `peek()`,
`next_char()` and `next_eq()` calls faster.
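
To make the batch idea concrete, here is a small sketch of consuming a run of whitespace by scanning bytes. The function `skip_ascii_whitespace` is a hypothetical helper for illustration only, and it works on a plain `&str` rather than on `Source`.

```rust
/// Skip leading spaces and tabs by scanning bytes, without decoding code points.
/// Hypothetical helper, not part of the lexer.
fn skip_ascii_whitespace(text: &str) -> &str {
    let consumed = text
        .as_bytes()
        .iter()
        .take_while(|&&b| matches!(b, b' ' | b'\t'))
        .count();
    // Only ASCII bytes were consumed, so `consumed` is on a UTF-8 character boundary
    // and slicing here cannot panic.
    &text[consumed..]
}
```

Because every consumed byte is ASCII, the cursor ends on a character boundary without any per-character decoding.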

`Source` is also more performant than `Chars` in itself. This wasn't my
intent, but seems to be a pleasant side-effect of it being less opaque
to the compiler than `Chars`, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire `Source`
struct, but only a `SourcePosition` (8 bytes), I was able to reduce the
size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both
`Copy`.
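
The size difference can be checked with a short sketch. `PositionSketch` is a hypothetical type that mirrors the pointer-plus-marker layout of `SourcePosition`. The 8-byte figure assumes a 64-bit target.

```rust
use std::marker::PhantomData;
use std::mem::size_of;
use std::str::Chars;

/// Hypothetical mirror of `SourcePosition`: one pointer plus a zero-sized lifetime marker.
#[derive(Clone, Copy)]
struct PositionSketch<'a> {
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

/// Returns (size of the position wrapper, size of `Chars`) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<PositionSketch<'static>>(), size_of::<Chars<'static>>())
}
```

The wrapper is the size of one pointer, whereas `Chars` carries both a current and an end pointer.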

## Notes on implementation

`Source` is heavily based on Rust's `std::str::Chars` and
`std::slice::Iter` iterators and I've copied the code/concepts from them
as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a *lot*
of unsafe code. I *think* I've crossed the T's and dotted the I's, and
I've commented the code extensively, but I'd appreciate a close review
if anyone has time.

I've split it into 2 commits.

* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer
needed, and replaces `lexer.current.token` with `lexer.token`
everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make
it easier to review.

### `SourcePosition`

There is one annoyance with the API which I haven't been able to solve:

`SourcePosition` is a wrapper around a pointer, which can only be
created from the current position of `Source`. Because of the invariant
mentioned above, a `SourcePosition` is therefore always in bounds of the
source text, and points to a UTF-8 character boundary. So `Source` can
be rewound to a `SourcePosition` cheaply, without any checks. I had
originally envisaged `Source::set_position` being a safe function, as
`SourcePosition` enforces the necessary invariants itself.

The fly in the ointment is that a `SourcePosition` could theoretically
have been created from *another* `Source`. If that was the case, it
would be out of bounds, and it would be instant UB. Consequently,
`Source::set_position` has to be an unsafe function.

This feels rather ridiculous. *Of course* the parser won't create 2
Lexers at the same time. But still it's *possible*, so I think it's
better to take the strict approach and make it unsafe until we can find
a way to statically prove the safety by some other means. Any ideas?
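
For illustration only, here is a sketch of what a runtime-checked version of the condition looks like. It mirrors the `debug_assert!` in `Source::set_position`, but operates on a plain `&str`. `position_is_valid_for` is a hypothetical helper, and this is a dynamic check, not the static proof asked for above.

```rust
/// Returns `true` if `ptr` lies within `text` (or at its end) and on a UTF-8
/// character boundary. Hypothetical helper, not part of `Source`.
fn position_is_valid_for(text: &str, ptr: *const u8) -> bool {
    let start = text.as_ptr() as usize;
    let addr = ptr as usize;
    // A pointer taken from a different source text falls outside `start..=end`.
    if addr < start || addr > start + text.len() {
        return false;
    }
    // In bounds, so check the offset is not in the middle of a multi-byte character.
    text.is_char_boundary(addr - start)
}
```

A check like this costs a few comparisons and a byte read on every rewind, which is the overhead the unchecked `set_position` avoids.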

## Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for
`pdf.mjs`.

While I was developing this, small and seemingly irrelevant changes
would flip that benchmark from +0.5% or so to -4%, and then another
small change would flip it back.

What I don't understand is that parsing happens outside of the
measurement loop in the semantic benchmark, so the parser shouldn't have
*any* effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect
appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for
`pdf.mjs` after this PR is identical to what it was before. And
semantic's `nodes` and `scopes` Vecs are the same length as they were
before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this
before?

One possibility is a fault in my unsafe code which is manifesting only
with `pdf.mjs`, and it's triggering UB, which I guess could explain the
weird effects. I'm running the parser on `pdf.mjs` in Miri now and will
see if it finds anything (Miri doesn't find any problem running the
tests). It's been running for over an hour now. Hopefully it'll be done
by morning!

I feel like this shouldn't be merged until that question is resolved, so
marking this as draft in the meantime.
overlookmotel 2024-02-05 13:51:46 +00:00 committed by GitHub
parent d9bea520ea
commit cdef41d552
13 changed files with 525 additions and 109 deletions

View file

@ -10,6 +10,7 @@ use crate::{
Context, Parser,
};
#[derive(Clone, Copy)]
pub struct ParserCheckpoint<'a> {
lexer: LexerCheckpoint<'a>,
cur_token: Token,
@ -254,7 +255,9 @@ impl<'a> Parser<'a> {
let ParserCheckpoint { lexer, cur_token, prev_span_end, errors_pos: errors_lens } =
checkpoint;
self.lexer.rewind(lexer);
// SAFETY: Parser only ever creates a single `Lexer`,
// therefore all checkpoints must be created from it.
unsafe { self.lexer.rewind(lexer) };
self.token = cur_token;
self.prev_token_end = prev_span_end;
self.errors.truncate(errors_lens);

View file

@ -6,9 +6,8 @@ use crate::diagnostics;
///
/// SAFETY:
/// * Lexer must not be at end of file.
/// * `byte` must be next byte of source code, corresponding to current position
/// of `lexer.current.chars`.
/// * Only `BYTE_HANDLERS` for ASCII characters may use the `ascii_byte_handler!` macro.
/// * `byte` must be next byte of source code, corresponding to current position of `lexer.source`.
/// * Only `BYTE_HANDLERS` for ASCII characters may use the `ascii_byte_handler!()` macro.
pub(super) unsafe fn handle_byte(byte: u8, lexer: &mut Lexer) -> Kind {
BYTE_HANDLERS[byte as usize](lexer)
}
@ -82,7 +81,7 @@ macro_rules! byte_handler {
///
/// These assertions produce no runtime code, but hint to the compiler that it can assume that
/// next char is ASCII, and it uses that information to optimize the rest of the handler.
/// e.g. `lexer.current.chars.next()` becomes just a single assembler instruction.
/// e.g. `lexer.consume_char()` becomes just a single assembler instruction.
/// Without the assertions, the compiler is unable to deduce the next char is ASCII, due to
/// the indirection of the `BYTE_HANDLERS` jump table.
///
@ -108,8 +107,8 @@ macro_rules! byte_handler {
/// unsafe {
/// use assert_unchecked::assert_unchecked;
/// let s = lexer.current.chars.as_str();
/// assert_unchecked!(!s.is_empty());
/// assert_unchecked!(s.as_bytes()[0] < 128);
/// assert_unchecked!(!lexer.source.is_eof());
/// assert_unchecked!(lexer.source.peek_byte_unchecked() < 128);
/// }
/// {
/// lexer.consume_char();
@ -125,9 +124,8 @@ macro_rules! ascii_byte_handler {
// SAFETY: This macro is only used for ASCII characters
unsafe {
use assert_unchecked::assert_unchecked;
let s = $lex.current.chars.as_str();
assert_unchecked!(!s.is_empty());
assert_unchecked!(s.as_bytes()[0] < 128);
assert_unchecked!(!$lex.source.is_eof());
assert_unchecked!($lex.source.peek_byte_unchecked() < 128);
}
$body
});
@ -150,14 +148,14 @@ ascii_byte_handler!(SPS(lexer) {
// <VT> <FF> Irregular Whitespace
ascii_byte_handler!(ISP(lexer) {
lexer.consume_char();
lexer.trivia_builder.add_irregular_whitespace(lexer.current.token.start, lexer.offset());
lexer.trivia_builder.add_irregular_whitespace(lexer.token.start, lexer.offset());
Kind::Skip
});
// '\r' '\n'
ascii_byte_handler!(LIN(lexer) {
lexer.consume_char();
lexer.current.token.is_on_new_line = true;
lexer.token.is_on_new_line = true;
Kind::Skip
});
@ -190,7 +188,7 @@ ascii_byte_handler!(HAS(lexer) {
lexer.consume_char();
// HashbangComment ::
// `#!` SingleLineCommentChars?
if lexer.current.token.start == 0 && lexer.next_eq('!') {
if lexer.token.start == 0 && lexer.next_eq('!') {
lexer.read_hashbang_comment()
} else {
lexer.private_identifier()

View file

@ -7,10 +7,10 @@ impl<'a> Lexer<'a> {
/// Section 12.4 Single Line Comment
#[allow(clippy::cast_possible_truncation)]
pub(super) fn skip_single_line_comment(&mut self) -> Kind {
let start = self.current.token.start;
let start = self.token.start;
while let Some(c) = self.next_char() {
if is_line_terminator(c) {
self.current.token.is_on_new_line = true;
self.token.is_on_new_line = true;
self.trivia_builder
.add_single_line_comment(start, self.offset() - c.len_utf8() as u32);
return Kind::Skip;
@ -25,11 +25,11 @@ impl<'a> Lexer<'a> {
pub(super) fn skip_multi_line_comment(&mut self) -> Kind {
while let Some(c) = self.next_char() {
if c == '*' && self.next_eq('/') {
self.trivia_builder.add_multi_line_comment(self.current.token.start, self.offset());
self.trivia_builder.add_multi_line_comment(self.token.start, self.offset());
return Kind::Skip;
}
if is_line_terminator(c) {
self.current.token.is_on_new_line = true;
self.token.is_on_new_line = true;
}
}
self.error(diagnostics::UnterminatedMultiLineComment(self.unterminated_range()));
@ -43,7 +43,7 @@ impl<'a> Lexer<'a> {
break;
}
}
self.current.token.is_on_new_line = true;
self.token.is_on_new_line = true;
Kind::HashbangComment
}
}

View file

@ -37,7 +37,7 @@ impl<'a> Lexer<'a> {
}
pub(crate) fn next_jsx_child(&mut self) -> Token {
self.current.token.start = self.offset();
self.token.start = self.offset();
let kind = self.read_jsx_child();
self.finish_next(kind)
}

View file

@ -1,3 +1,5 @@
#![allow(clippy::unnecessary_safety_comment)]
//! An Ecma-262 Lexer / Tokenizer
//! Prior Arts:
//! * [jsparagus](https://github.com/mozilla-spidermonkey/jsparagus/blob/master/crates/parser/src)
@ -14,6 +16,7 @@ mod number;
mod numeric;
mod punctuation;
mod regex;
mod source;
mod string;
mod string_builder;
mod template;
@ -23,25 +26,30 @@ mod typescript;
mod unicode;
use rustc_hash::FxHashMap;
use std::{collections::VecDeque, str::Chars};
use std::collections::VecDeque;
use oxc_allocator::Allocator;
use oxc_ast::ast::RegExpFlags;
use oxc_diagnostics::Error;
use oxc_span::{SourceType, Span};
use self::{byte_handlers::handle_byte, string_builder::AutoCow, trivia_builder::TriviaBuilder};
use self::{
byte_handlers::handle_byte,
source::{Source, SourcePosition},
string_builder::AutoCow,
trivia_builder::TriviaBuilder,
};
pub use self::{
kind::Kind,
number::{parse_big_int, parse_float, parse_int},
token::Token,
};
use crate::{diagnostics, MAX_LEN};
use crate::diagnostics;
#[derive(Debug, Clone)]
#[derive(Debug, Clone, Copy)]
pub struct LexerCheckpoint<'a> {
/// Remaining chars to be tokenized
chars: Chars<'a>,
/// Current position in source
position: SourcePosition<'a>,
token: Token,
@ -55,18 +63,25 @@ pub enum LexerContext {
JsxAttributeValue,
}
#[derive(Debug, Clone, Copy)]
struct Lookahead<'a> {
position: SourcePosition<'a>,
token: Token,
}
pub struct Lexer<'a> {
allocator: &'a Allocator,
source: &'a str,
// Wrapper around source text. Must not be changed after initialization.
source: Source<'a>,
source_type: SourceType,
current: LexerCheckpoint<'a>,
token: Token,
pub(crate) errors: Vec<Error>,
lookahead: VecDeque<LexerCheckpoint<'a>>,
lookahead: VecDeque<Lookahead<'a>>,
context: LexerContext,
@ -82,21 +97,16 @@ pub struct Lexer<'a> {
#[allow(clippy::unused_self)]
impl<'a> Lexer<'a> {
pub fn new(allocator: &'a Allocator, mut source: &'a str, source_type: SourceType) -> Self {
// If source exceeds size limit, substitute a short source which will fail to parse.
// `Parser::parse` will convert error to `diagnostics::OverlongSource`.
if source.len() > MAX_LEN {
source = "\0";
}
pub fn new(allocator: &'a Allocator, source_text: &'a str, source_type: SourceType) -> Self {
let source = Source::new(source_text);
// The first token is at the start of file, so is always on a new line
let token = Token::new_on_new_line();
let current = LexerCheckpoint { chars: source.chars(), token, errors_pos: 0 };
Self {
allocator,
source,
source_type,
current,
token,
errors: vec![],
lookahead: VecDeque::with_capacity(4), // 4 is the maximum lookahead for TypeScript
context: LexerContext::Regular,
@ -108,23 +118,30 @@ impl<'a> Lexer<'a> {
/// Remaining string from `Chars`
pub fn remaining(&self) -> &'a str {
self.current.chars.as_str()
self.source.remaining()
}
/// Creates a checkpoint storing the current lexer state.
/// Use `rewind` to restore the lexer to the state stored in the checkpoint.
pub fn checkpoint(&self) -> LexerCheckpoint<'a> {
LexerCheckpoint {
chars: self.current.chars.clone(),
token: self.current.token,
position: self.source.position(),
token: self.token,
errors_pos: self.errors.len(),
}
}
/// Rewinds the lexer to the same state as when the passed in `checkpoint` was created.
pub fn rewind(&mut self, checkpoint: LexerCheckpoint<'a>) {
///
/// # SAFETY
/// `checkpoint` must have been created from this `Lexer`.
#[allow(clippy::missing_safety_doc)] // Clippy is wrong!
pub unsafe fn rewind(&mut self, checkpoint: LexerCheckpoint<'a>) {
self.errors.truncate(checkpoint.errors_pos);
self.current = checkpoint;
// SAFETY: Caller guarantees `checkpoint` was created from this `Lexer`,
// and therefore `checkpoint.position` was created from `self.source`.
self.source.set_position(checkpoint.position);
self.token = checkpoint.token;
self.lookahead.clear();
}
@ -137,28 +154,30 @@ impl<'a> Lexer<'a> {
return self.lookahead[n - 1].token;
}
let checkpoint = self.checkpoint();
let position = self.source.position();
if let Some(checkpoint) = self.lookahead.back() {
self.current = checkpoint.clone();
if let Some(lookahead) = self.lookahead.back() {
// SAFETY: `self.lookahead` only contains lookaheads created by this `Lexer`.
// `self.source` never changes, so `lookahead.position` must have been created
// from `self.source`.
unsafe { self.source.set_position(lookahead.position) };
}
// reset the current token for `read_next_token`,
// otherwise it will contain the token from
// `self.current = checkpoint`
self.current.token = Token::default();
for _i in self.lookahead.len()..n {
let kind = self.read_next_token();
let peeked = self.finish_next(kind);
self.lookahead.push_back(LexerCheckpoint {
chars: self.current.chars.clone(),
token: peeked,
errors_pos: self.errors.len(),
});
self.lookahead.push_back(Lookahead { position: self.source.position(), token: peeked });
}
self.current = checkpoint;
// Call to `finish_next` in loop above leaves `self.token = Token::default()`.
// Only circumstance in which `self.token` wouldn't have been default at start of this
// function is if we were at very start of file, before any tokens have been read, when
// `token.is_on_new_line` is `true`. But `lookahead` isn't called before the first token is
// read, so that's not possible. So no need to restore `self.token` here.
// It's already in same state as it was at start of this function.
// SAFETY: `position` was created above from `self.source`. `self.source` never changes.
unsafe { self.source.set_position(position) };
self.lookahead[n - 1].token
}
@ -170,21 +189,23 @@ impl<'a> Lexer<'a> {
/// Main entry point
pub fn next_token(&mut self) -> Token {
if let Some(checkpoint) = self.lookahead.pop_front() {
self.current.chars = checkpoint.chars;
self.current.errors_pos = checkpoint.errors_pos;
return checkpoint.token;
if let Some(lookahead) = self.lookahead.pop_front() {
// SAFETY: `self.lookahead` only contains lookaheads created by this `Lexer`.
// `self.source` never changes, so `lookahead.position` must have been created
// from `self.source`.
unsafe { self.source.set_position(lookahead.position) };
return lookahead.token;
}
let kind = self.read_next_token();
self.finish_next(kind)
}
fn finish_next(&mut self, kind: Kind) -> Token {
self.current.token.kind = kind;
self.current.token.end = self.offset();
debug_assert!(self.current.token.start <= self.current.token.end);
let token = self.current.token;
self.current.token = Token::default();
self.token.kind = kind;
self.token.end = self.offset();
debug_assert!(self.token.start <= self.token.end);
let token = self.token;
self.token = Token::default();
token
}
@ -197,45 +218,36 @@ impl<'a> Lexer<'a> {
#[inline]
#[allow(clippy::cast_possible_truncation)]
fn offset(&self) -> u32 {
// Offset = current position of `chars` relative to start of `source`.
// Previously was `self.source.len() - self.current.chars.as_str().len()`,
// but that was slower because `std::str::Chars` internally is a current pointer + end pointer,
// whereas `&str` internally is a start pointer and len.
// So comparing `len()` of the two requires an extra memory read, and addition operation.
// https://godbolt.org/z/v46MWddTM
// This function is on hot path, so saving even a single instruction makes a measurable difference.
(self.current.chars.as_str().as_ptr() as usize - self.source.as_ptr() as usize) as u32
self.source.offset()
}
/// Get the current unterminated token range
fn unterminated_range(&self) -> Span {
Span::new(self.current.token.start, self.offset())
Span::new(self.token.start, self.offset())
}
/// Consume the current char if not at EOF
#[inline]
fn next_char(&mut self) -> Option<char> {
self.current.chars.next()
self.source.next_char()
}
/// Consume the current char
#[inline]
fn consume_char(&mut self) -> char {
self.current.chars.next().unwrap()
self.source.next_char().unwrap()
}
/// Peek the next char without advancing the position
#[inline]
fn peek(&self) -> Option<char> {
self.current.chars.clone().next()
self.source.peek_char()
}
/// Peek the next next char without advancing the position
#[inline]
fn peek2(&self) -> Option<char> {
let mut chars = self.current.chars.clone();
chars.next();
chars.next()
self.source.peek_char2()
}
/// Peek the next character, and advance the current position if it matches
@ -243,7 +255,7 @@ impl<'a> Lexer<'a> {
fn next_eq(&mut self, c: char) -> bool {
let matched = self.peek() == Some(c);
if matched {
self.current.chars.next();
self.source.next_char().unwrap();
}
matched
}
@ -267,16 +279,15 @@ impl<'a> Lexer<'a> {
fn read_next_token(&mut self) -> Kind {
loop {
let offset = self.offset();
self.current.token.start = offset;
self.token.start = offset;
let remaining = self.current.chars.as_str();
if remaining.is_empty() {
let byte = if let Some(byte) = self.source.peek_byte() {
byte
} else {
return Kind::Eof;
}
};
let byte = remaining.as_bytes()[0];
// SAFETY: Check for `remaining.is_empty()` ensures not at end of file,
// and `byte` is the byte at current position of `self.current.chars`.
// SAFETY: `byte` is byte value at current position in source
let kind = unsafe { handle_byte(byte, self) };
if kind != Kind::Skip {
return kind;

View file

@ -40,10 +40,7 @@ impl<'a> Lexer<'a> {
pub(super) fn read_minus(&mut self) -> Option<Kind> {
if self.next_eq('-') {
// SingleLineHTMLCloseComment `-->` in script mode
if self.current.token.is_on_new_line
&& self.source_type.is_script()
&& self.next_eq('>')
{
if self.token.is_on_new_line && self.source_type.is_script() && self.next_eq('>') {
None
} else {
Some(Kind::Minus2)

View file

@ -11,7 +11,7 @@ impl<'a> Lexer<'a> {
/// Which means the parser needs to re-tokenize on `PrimaryExpression`,
/// `RegularExpressionLiteral` only appear on the right hand side of `PrimaryExpression`
pub(crate) fn next_regex(&mut self, kind: Kind) -> (Token, u32, RegExpFlags) {
self.current.token.start = self.offset()
self.token.start = self.offset()
- match kind {
Kind::Slash => 1,
Kind::SlashEq => 2,

View file

@ -0,0 +1,410 @@
#![allow(clippy::unnecessary_safety_comment)]
use crate::MAX_LEN;
use std::{marker::PhantomData, slice, str};
/// `Source` holds the source text for the lexer, and provides APIs to read it.
///
/// It provides a cursor which allows consuming source text either as `char`s, or as bytes.
/// It replaces `std::str::Chars` iterator which performed the same function previously,
/// but was less flexible as only allowed consuming source char by char.
///
/// Consuming source text byte-by-byte is often more performant than char-by-char.
///
/// `Source` provides:
///
/// * Safe API for consuming source char-by-char (`Source::next_char`, `Source::peek_char`).
/// * Safe API for peeking next source byte (`Source::peek_byte`).
/// * Unsafe API for consuming source byte-by-byte (`Source::next_byte`).
/// * Mostly-safe API for rewinding to a previous position in source
/// (`Source::position`, `Source::set_position`).
///
/// # Composition of `Source`
///
/// * `start` is pointer to start of source text.
/// * `end` is pointer to end of source text.
/// * `ptr` is cursor for current position in source text.
///
/// # Invariants of `Source`
///
/// 1. `start` <= `end`
/// 2. The region of memory bounded between `start` and `end` must be initialized,
/// a single allocation, and contain the bytes of a valid UTF-8 string.
/// 3. `ptr` must always be >= `start` and <= `end`.
/// i.e. cursor always within bounds of source text `&str`, or 1 byte after last byte
/// of source text (positioned on EOF).
/// 4. `ptr` must always point to a UTF-8 character boundary, or EOF.
/// i.e. pointing to *1st* byte of a UTF-8 character.
///
/// These invariants are the same as `std::str::Chars`, except `Source` allows temporarily
/// breaking invariant (4) to step through source text byte-by-byte.
///
/// Invariants (1), (2) and (3) must be upheld at all times.
/// Invariant (4) can be temporarily broken, as long as caller ensures it's satisfied again.
///
/// Invariants (1) and (2) are enforced by initializing `start` and `end` from a valid `&str`,
/// and they are never modified after initialization.
///
/// Safe methods of `Source` enforce invariant (3) i.e. they do not allow reading past EOF.
/// Unsafe methods e.g. `Source::next_byte_unchecked` and `Source::peek_byte_unchecked`
/// require caller to uphold this invariant.
///
/// Invariant (4) is the most difficult to satisfy.
/// `Source::next_char` relies on source text being valid UTF-8 to provide a safe API which
/// upholds this invariant.
/// `Source::next_byte` requires very careful use as it may violate invariant (4).
/// That is fine temporarily, but caller *must* ensure the safety conditions of `Source::next_byte`
/// are satisfied, to restore this invariant before passing control back to other code.
/// It will often be preferable to instead use `Source::peek_byte`, followed by `Source::next_char`,
/// which are safe methods, and compiler will often reduce to equally efficient code.
#[derive(Clone)]
pub(super) struct Source<'a> {
/// Pointer to start of source string. Never altered after initialization.
start: *const u8,
/// Pointer to end of source string. Never altered after initialization.
end: *const u8,
/// Pointer to current position in source string
ptr: *const u8,
/// Marker for immutable borrow of source string
_marker: PhantomData<&'a str>,
}
impl<'a> Source<'a> {
/// Create `Source` from `&str`.
pub(super) fn new(mut source_text: &'a str) -> Self {
// If source text exceeds size limit, substitute a short source text which will fail to parse.
// `Parser::parse` will convert error to `diagnostics::OverlongSource`.
if source_text.len() > MAX_LEN {
source_text = "\0";
}
let start = source_text.as_ptr();
// SAFETY: Adding `source_text.len()` to the starting pointer gives a pointer
// at the end of `source_text`. `end` will never be dereferenced, only checked
// for direct pointer equality with `ptr` to check if at end of file.
let end = unsafe { start.add(source_text.len()) };
Self { start, end, ptr: start, _marker: PhantomData }
}
/// Get entire source text as `&str`.
#[inline]
pub(super) fn whole(&self) -> &'a str {
// SAFETY: `start` and `end` are created from a `&str` in `Source::new`,
// so guaranteed to be start and end of a valid UTF-8 string
unsafe {
let len = self.end as usize - self.start as usize;
let slice = slice::from_raw_parts(self.start, len);
str::from_utf8_unchecked(slice)
}
}
/// Get remaining source text as `&str`.
#[inline]
pub(super) fn remaining(&self) -> &'a str {
// SAFETY:
// `start` and `end` are created from a `&str` in `Source::new` so span a single allocation.
// Invariant of `Source` is that `ptr` is always >= `start` and <= `end`,
// so a slice spanning `ptr` to `end` will always be part of a single allocation.
// Invariant of `Source` is that `ptr` is always on a UTF-8 character boundary,
// so slice from `ptr` to `end` will always be a valid UTF-8 string.
unsafe {
let len = self.end as usize - self.ptr as usize;
let slice = slice::from_raw_parts(self.ptr, len);
debug_assert!(slice.is_empty() || !is_utf8_cont_byte(slice[0]));
str::from_utf8_unchecked(slice)
}
}
/// Return whether at end of source.
#[inline]
pub(super) fn is_eof(&self) -> bool {
self.ptr == self.end
}
/// Get current position.
///
/// The `SourcePosition` returned is guaranteed to be within bounds of `&str` that `Source`
/// was created from, and on a UTF-8 character boundary, so can be used by caller
/// to later move current position of this `Source` using `Source::set_position`.
///
/// `SourcePosition` lives as long as the source text `&str` that `Source` was created from.
#[inline]
pub(super) fn position(&self) -> SourcePosition<'a> {
SourcePosition { ptr: self.ptr, _marker: PhantomData }
}
/// Move current position.
///
/// # SAFETY
/// `pos` must be created from this `Source`, not another `Source`.
/// If this is the case, the invariants of `Source` are guaranteed to be upheld.
#[inline]
pub(super) unsafe fn set_position(&mut self, pos: SourcePosition) {
// `SourcePosition` always upholds the invariants of `Source`,
// as long as it's created from this `Source`.
// SAFETY: `read_u8`'s contract is upheld by:
// * The preceding checks that `pos.ptr` >= `self.start` and < `self.end`.
// * `Source`'s invariants guarantee that `self.start` - `self.end` contains allocated memory.
// * `Source::new` takes an immutable ref `&str`, guaranteeing that the memory `pos.ptr`
// addresses cannot be aliased by a `&mut` ref as long as `Source` exists.
// * `SourcePosition` can only live as long as the `&str` underlying `Source`.
debug_assert!(
pos.ptr >= self.start
&& pos.ptr <= self.end
&& (pos.ptr == self.end || !is_utf8_cont_byte(read_u8(pos.ptr)))
);
self.ptr = pos.ptr;
}
/// Get current position in source, relative to start of source.
#[allow(clippy::cast_possible_truncation)]
#[inline]
pub(super) fn offset(&self) -> u32 {
// Cannot overflow `u32` because of `MAX_LEN` check in `Source::new`
(self.ptr as usize - self.start as usize) as u32
}
/// Move current position back by `n` bytes.
///
/// # Panic
/// Panics if:
/// * `n` is 0.
/// * `n` is greater than current offset in source.
/// * Moving back `n` bytes would not place current position on a UTF-8 character boundary.
#[inline]
pub(super) fn back(&mut self, n: usize) {
// This assertion is essential to ensure safety of `read_u8()` call below.
// Without this check, calling `back(0)` on an empty `Source` would cause reading
// out of bounds.
// Compiler should remove this assertion when inlining this function,
// as long as it can deduce from calling code that `n` is non-zero.
assert!(n > 0, "Cannot call `Source::back` with 0");
// Ensure not attempting to go back to before start of source
let offset = self.ptr as usize - self.start as usize;
assert!(n <= offset, "Cannot go back {n} bytes - only {offset} bytes consumed");
// SAFETY: We have checked that `n` is no greater than distance between `start` and `ptr`,
// so `new_ptr` cannot be outside of allocation of original `&str`
let new_ptr = unsafe { self.ptr.sub(n) };
// Enforce invariant that `ptr` must be positioned on a UTF-8 character boundary.
// SAFETY: `new_ptr` is in bounds of original `&str`, and `n > 0` assertion ensures
// not at the end, so valid to read a byte.
// `Source`'s invariants guarantee that `self.start` - `self.end` contains allocated memory.
// `Source::new` takes an immutable ref `&str`, guaranteeing that the memory `new_ptr`
// addresses cannot be aliased by a `&mut` ref as long as `Source` exists.
let byte = unsafe { read_u8(new_ptr) };
assert!(!is_utf8_cont_byte(byte), "Offset is not on a UTF-8 character boundary");
// Move current position. The checks above satisfy `Source`'s invariants.
self.ptr = new_ptr;
}
/// Get next char of source, and advance position to after it.
#[inline]
pub(super) fn next_char(&mut self) -> Option<char> {
// Check not at EOF and handle ASCII bytes
let byte = self.peek_byte()?;
if byte.is_ascii() {
// SAFETY: We already exited if at EOF, so `ptr < end`.
// So incrementing `ptr` cannot result in `ptr > end`.
// Current byte is ASCII, so incremented `ptr` must be on a UTF-8 character boundary.
unsafe { self.ptr = self.ptr.add(1) };
return Some(byte as char);
}
// Multi-byte Unicode character.
// Check invariant that `ptr` is on a UTF-8 character boundary.
debug_assert!(!is_utf8_cont_byte(byte));
// Create a `Chars` iterator, get next char from it, and then update `self.ptr`
// to match `Chars` iterator's updated pointer afterwards.
// `Chars` iterator upholds same invariants as `Source`, so its pointer is guaranteed
// to be valid as `self.ptr`.
let mut chars = self.remaining().chars();
// SAFETY: We know that there's a byte to be consumed, so `chars.next()` must return `Some(_)`
let c = unsafe { chars.next().unwrap_unchecked() };
self.ptr = chars.as_str().as_ptr();
Some(c)
}
/// Get next byte of source, and advance position to after it.
///
/// # SAFETY
/// This function may leave `Source` positioned in middle of a UTF-8 character sequence,
/// which would violate one of `Source`'s invariants.
///
/// This is OK temporarily, but caller *must* ensure the invariant is restored again.
///
/// Caller must ensure one of:
///
/// 1. No byte is returned (end of file).
/// 2. The byte returned is ASCII.
/// 3. Further calls to `Source::next_byte` or `Source::next_byte_unchecked` are made
/// to consume the rest of the multi-byte UTF-8 character, before calling any other methods
/// of `Source` (even safe methods) which rely on `Source` being positioned on a UTF-8
/// character boundary, or before passing control back to other safe code which may call them.
///
/// In particular, safe methods `Source::next_char`, `Source::peek_char`, and `Source::remaining`
/// are *not* safe to call until one of above conditions is satisfied.
///
/// It will often be preferable to instead use `Source::peek_byte`, followed by `Source::next_char`,
/// which are safe methods, and compiler will often reduce to equally efficient code, if calling
/// code tests the byte returned. e.g.:
///
/// ```
/// // Consume a space
/// let byte = source.peek_byte();
/// if byte == Some(b' ') {
/// source.next_char().unwrap();
/// }
/// ```
#[allow(dead_code)]
#[inline]
unsafe fn next_byte(&mut self) -> Option<u8> {
if self.is_eof() {
None
} else {
// SAFETY: Safe to read from `ptr` as we just checked it's not out of bounds
Some(self.next_byte_unchecked())
}
}
/// Get next byte of source, and advance position to after it, without EOF bounds-check.
///
/// # SAFETY
/// Caller must ensure `Source` is not at end of file.
///
/// This function may leave `Source` positioned in middle of a UTF-8 character sequence,
/// which would violate one of `Source`'s invariants.
///
/// This is OK temporarily, but caller *must* ensure the invariant is restored again.
///
/// Caller must ensure one of:
///
/// 1. The byte returned is ASCII.
/// 2. Further calls to `Source::next_byte` or `Source::next_byte_unchecked` are made
/// to consume the rest of the multi-byte UTF-8 character, before calling any other methods
/// of `Source` (even safe methods) which rely on `Source` being positioned on a UTF-8
/// character boundary, or before passing control back to other safe code which may call them.
///
/// In particular, safe methods `Source::next_char`, `Source::peek_char`, and `Source::remaining`
/// are *not* safe to call until one of the above conditions is satisfied.
#[allow(dead_code)]
#[inline]
unsafe fn next_byte_unchecked(&mut self) -> u8 {
// SAFETY: Caller guarantees not at end of file i.e. `ptr != end`.
// Methods of this type provide no way for `ptr` to be before `start` or after `end`.
// Therefore always valid to read a byte from `ptr`, and incrementing `ptr` cannot result
// in `ptr > end`.
let byte = self.peek_byte_unchecked();
self.ptr = self.ptr.add(1);
byte
}
/// Peek next char of source, without consuming it.
#[inline]
pub(super) fn peek_char(&self) -> Option<char> {
// Check not at EOF and handle ASCII bytes
let byte = self.peek_byte()?;
if byte.is_ascii() {
return Some(byte as char);
}
// Multi-byte Unicode character.
// Check invariant that `ptr` is on a UTF-8 character boundary.
debug_assert!(!is_utf8_cont_byte(byte));
// Create a `Chars` iterator, and get next char from it
let mut chars = self.remaining().chars();
// SAFETY: We know that there's a byte to be consumed, so `chars.next()` must return `Some(_)`.
// Could just return `chars.next()` here, but making it clear to compiler that this branch
// always returns `Some(_)` may help it optimize the caller. Compiler seems to have difficulty
// "seeing into" `Chars` iterator and making deductions.
let c = unsafe { chars.next().unwrap_unchecked() };
Some(c)
}
/// Peek the char after the next char of source (i.e. the 2nd char), without consuming either.
#[inline]
pub(super) fn peek_char2(&self) -> Option<char> {
// Handle EOF
if self.is_eof() {
return None;
}
// Check invariant that `ptr` is on a UTF-8 character boundary.
debug_assert!(!is_utf8_cont_byte(self.peek_byte().unwrap()));
let mut chars = self.remaining().chars();
// SAFETY: We already checked not at EOF, so `chars.next()` must return `Some(_)`
unsafe { chars.next().unwrap_unchecked() };
chars.next()
}
/// Peek next byte of source without consuming it.
#[inline]
pub(super) fn peek_byte(&self) -> Option<u8> {
if self.is_eof() {
None
} else {
// SAFETY: Safe to read from `ptr` as we just checked it's not out of bounds
Some(unsafe { self.peek_byte_unchecked() })
}
}
/// Peek next byte of source without consuming it, without EOF bounds-check.
///
/// # SAFETY
/// Caller must ensure `Source` is not at end of file.
#[inline]
pub(super) unsafe fn peek_byte_unchecked(&self) -> u8 {
// SAFETY: Caller guarantees `ptr` is before `end` (i.e. not at end of file).
// Methods of this type provide no way to allow `ptr` to be before `start`.
// `Source`'s invariants guarantee that `self.start` - `self.end` contains allocated memory.
// `Source::new` takes an immutable ref `&str`, guaranteeing that the memory `self.ptr`
// addresses cannot be aliased by a `&mut` ref as long as `Source` exists.
debug_assert!(self.ptr >= self.start && self.ptr < self.end);
read_u8(self.ptr)
}
}
/// Wrapper around a pointer to a position in `Source`.
#[derive(Debug, Clone, Copy)]
pub struct SourcePosition<'a> {
ptr: *const u8,
_marker: PhantomData<&'a u8>,
}
/// Return if byte is a UTF-8 continuation byte.
#[inline]
const fn is_utf8_cont_byte(byte: u8) -> bool {
// 0x80 - 0xBF are continuation bytes i.e. not 1st byte of a UTF-8 character sequence
byte >= 0x80 && byte < 0xC0
}
/// Read `u8` from `*const u8` pointer.
///
/// Using `as_ref()` for reading is copied from `core::slice::iter::next`.
/// https://doc.rust-lang.org/src/core/slice/iter.rs.html#132
/// https://doc.rust-lang.org/src/core/slice/iter/macros.rs.html#156-168
///
/// This is about 7% faster than `*ptr` or `ptr.read()`, presumably because it tells the compiler
/// it can rely on the memory being immutable, because if a `&mut` reference existed, that would
/// violate Rust's aliasing rules.
///
/// # SAFETY
/// Caller must ensure pointer is non-null, and points to allocated, initialized memory.
/// Pointer must point to within an object for which no `&mut` references are currently held.
#[inline]
unsafe fn read_u8(ptr: *const u8) -> u8 {
// SAFETY: Caller guarantees pointer is non-null, and points to allocated, initialized memory.
// Caller guarantees no mutable references to same memory exist, thus upholding Rust's aliasing rules.
// Pointer is "dereferenceable" by definition as a `u8` is 1 byte and cannot span multiple objects.
// Alignment is not relevant as `u8` is aligned on 1 (i.e. no alignment requirements).
debug_assert!(!ptr.is_null());
*ptr.as_ref().unwrap_unchecked()
}
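
Reviewer note: the "consume the rest of the multi-byte UTF-8 character" rule in the safety docs above can be illustrated with safe code. The following is only a sketch under assumptions, not code from this PR: it works on a plain `&str` and a byte index instead of `Source`'s raw pointers, and `skip_one_char_bytewise` is a hypothetical helper name. It reuses the same continuation-byte test as `is_utf8_cont_byte` and assumes the starting index is already on a character boundary.

```rust
/// Same test as `is_utf8_cont_byte` in the diff: 0x80..=0xBF are continuation bytes.
const fn is_utf8_cont_byte(byte: u8) -> bool {
    byte >= 0x80 && byte < 0xC0
}

/// Hypothetical helper (not part of the PR). Consumes one byte at `pos`, then keeps
/// consuming continuation bytes until `pos` is back on a UTF-8 character boundary.
/// Assumes `pos` starts on a character boundary. Returns the new position.
fn skip_one_char_bytewise(text: &str, mut pos: usize) -> usize {
    let bytes = text.as_bytes();
    if pos >= bytes.len() {
        // End of file: no byte is consumed, so the invariant still holds.
        return pos;
    }
    // After this step `pos` may be in the middle of a multi-byte character.
    pos += 1;
    // Restore the invariant by consuming the rest of the character.
    while pos < bytes.len() && is_utf8_cont_byte(bytes[pos]) {
        pos += 1;
    }
    pos
}

fn main() {
    let text = "a€b"; // '€' is encoded as 3 bytes: E2 82 AC
    let after_a = skip_one_char_bytewise(text, 0);
    let after_euro = skip_one_char_bytewise(text, after_a);
    // Both positions are valid places to resume char-by-char iteration.
    assert!(text.is_char_boundary(after_a));
    assert!(text.is_char_boundary(after_euro));
    println!("{after_a} {after_euro}");
}
```

With the unsafe `next_byte` API, the caller takes on the equivalent of the `while` loop here: nothing else may observe the position until the loop has finished.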


@@ -42,8 +42,8 @@ impl<'a> Lexer<'a> {
if !has_escape {
return;
}
-self.escaped_strings.insert(self.current.token.start, s);
-self.current.token.escaped = true;
+self.escaped_strings.insert(self.token.start, s);
+self.token.escaped = true;
}
pub(crate) fn get_string(&self, token: Token) -> &'a str {
@@ -51,7 +51,7 @@ impl<'a> Lexer<'a> {
return self.escaped_strings[&token.start];
}
-let raw = &self.source[token.start as usize..token.end as usize];
+let raw = &self.source.whole()[token.start as usize..token.end as usize];
match token.kind {
Kind::Str => {
&raw[1..raw.len() - 1] // omit surrounding quotes


@@ -15,14 +15,14 @@ impl<'a> AutoCow<'a> {
AutoCow { start, value: None }
}
-// Push a char that matches lexer.current.chars().next()
+// Push a char that matches `lexer.next_char()`.
pub fn push_matching(&mut self, c: char) {
if let Some(text) = &mut self.value {
text.push(c);
}
}
-// Push a different character than lexer.current.chars().next().
+// Push a different character than `lexer.next_char()`.
// force_allocation_without_current_ascii_char must be called before this.
pub fn push_different(&mut self, c: char) {
debug_assert!(self.value.is_some());


@@ -47,7 +47,7 @@ impl<'a> Lexer<'a> {
/// Re-tokenize the current `}` token for `TemplateSubstitutionTail`
/// See Section 12, the parser needs to re-tokenize on `TemplateSubstitutionTail`,
pub(crate) fn next_template_substitution_tail(&mut self) -> Token {
-self.current.token.start = self.offset() - 1;
+self.token.start = self.offset() - 1;
let kind = self.read_template_literal(Kind::TemplateMiddle, Kind::TemplateTail);
self.lookahead.clear();
self.finish_next(kind)
@@ -63,16 +63,15 @@ impl<'a> Lexer<'a> {
if !has_escape {
return;
}
-self.escaped_templates
-.insert(self.current.token.start, is_valid_escape_sequence.then(|| s));
-self.current.token.escaped = true;
+self.escaped_templates.insert(self.token.start, is_valid_escape_sequence.then(|| s));
+self.token.escaped = true;
}
pub(crate) fn get_template_string(&self, token: Token) -> Option<&'a str> {
if token.escaped {
return self.escaped_templates[&token.start];
}
-let raw = &self.source[token.start as usize..token.end as usize];
+let raw = &self.source.whole()[token.start as usize..token.end as usize];
Some(match token.kind {
Kind::NoSubstitutionTemplate | Kind::TemplateTail => {
&raw[1..raw.len() - 1] // omit surrounding quotes or leading "}" and trailing "`"


@@ -8,8 +8,8 @@ impl<'a> Lexer<'a> {
Kind::ShiftLeftEq => 3,
_ => unreachable!(),
};
-self.current.token.start = self.offset() - offset;
-self.current.chars = self.source[self.current.token.start as usize + 1..].chars();
+self.token.start = self.offset() - offset;
+self.source.back(offset as usize - 1);
let kind = Kind::LAngle;
self.lookahead.clear();
self.finish_next(kind)


@@ -29,15 +29,13 @@ impl<'a> Lexer<'a> {
}
c if is_irregular_whitespace(c) => {
self.consume_char();
-self.trivia_builder
-.add_irregular_whitespace(self.current.token.start, self.offset());
+self.trivia_builder.add_irregular_whitespace(self.token.start, self.offset());
Kind::Skip
}
c if is_irregular_line_terminator(c) => {
self.consume_char();
-self.current.token.is_on_new_line = true;
-self.trivia_builder
-.add_irregular_whitespace(self.current.token.start, self.offset());
+self.token.is_on_new_line = true;
+self.trivia_builder.add_irregular_whitespace(self.token.start, self.offset());
Kind::Skip
}
_ => {