## [0.16.3] - 2024-07-02
### Features
- b257d53 linter: Support report
`@typescript-eslint/consistent-type-imports` (#3895) (mysteryven)
### Bug Fixes
- 23038ad codegen: Print `TSFunctionType` inside `TSTypeAssertion`
(#3999) (Boshen)
- d995f94 semantic: Resolve reference incorrectly when a parameter
references a parameter that hasn't been defined yet (#4004) (Dunqing)
- bdee156 transformer/typescript: `declare class` incorrectly preserved
as runtime class (#3997) (Dunqing)
- a50ce3d transformer/typescript: Missing initializer for class
constructor arguments with `private` and `protected` modifier (#3996)
(Dunqing)
### Refactor
- 0fe22a8 ast: Reorder fields to reflect their visit order. (#3994)
(rzvxa)
- d0eac46 parser: Use function instead of trait to parse normal lists
(#4003) (Boshen)
Co-authored-by: Boshen <Boshen@users.noreply.github.com>
## [0.16.2] - 2024-06-30
### Features
- dc6d45e ast,codegen: Add `TSParenthesizedType` and print type
parentheses correctly (#3979) (Boshen)
- 63f36da parser: Parse modifiers with `parse_modifiers` (take 2)
(#3977) (DonIsaac)
### Bug Fixes
- dac617d codegen: Print some missing typescript attributes (#3980)
(Boshen)
- bd1141d isolated-declarations: If declarations is referenced in
`declare global` then keep it (#3982) (Dunqing)
### Performance
- b234ddd semantic: Only check for jsdoc if jsdoc building is enabled
(Boshen)
- 1eac3d2 semantic: Use `Atom<'a>` for `Reference`s (#3972) (Don Isaac)
- 0c81fbe syntax: Use `NonZeroU32` for `SymbolId` and `ReferenceId`
(#3970) (Boshen)
### Refactor
- 5845057 transformer: Pass in symbols and scopes (#3978) (Boshen)
Co-authored-by: Boshen <Boshen@users.noreply.github.com>
## [0.16.1] - 2024-06-29
### Features
- 7b38bde parser: Parse modifiers with `parse_modifiers` (#3948)
(DonIsaac)
- f64ad4b semantic: Make jsdoc building optional (turned off by default)
(#3955) (Boshen)
### Bug Fixes
- 51e54f9 codegen: Should print `TSModuleDeclarationKind` instead of
just `module` (#3957) (Dunqing)
- 31e4c3b isolated-declarations: `declare global {}` should be kept even
if it is not exported (#3956) (Dunqing)
### Refactor
- 2705df9 linter: Improve diagnostic labeling (#3960) (DonIsaac)
- 15ec254 semantic: Remove the unused `Semantic::build2` function
(Boshen)
Co-authored-by: Boshen <Boshen@users.noreply.github.com>
Since this is a temporary solution in the time that we are waiting for the `#[span]` hint, And there are already other workarounds used in our `ast_codegen` I propose removing it right away - sorry in my opinion adding it in the first place was a mistake - in favor of adding an edge case in the codegen. It is better to do the refactoring in the codegen instead of the production code which people may depend on.
`Atom` is just a wrapper around `&str`, so better not to pass `&Atom` to functions, as that's a double-reference. Prefer `Atom` or `&str` instead to avoid indirection.
Follow-on after #3296.
Make parsing binary/octal/hex numeric literals a little more efficient.
These changes all rely on that we know more than the compiler does -
that strings passed to these `parse_*` functions can only contain a
certain set of characters.
## What This PR Does
- perf(lexer): use bit shifting when parsing hex, octal, and binary
integers instead of `mul_add`-ing on `f64`s. Check out the difference in
assembly generated [here](https://godbolt.org/z/zMEKaeYzh)
- perf(lexer): skip redundant utf8 check when parsing BigInts
- refactor(lexer): remove `unsafe` usage (as per @overlookmotel's
request
[here](https://github.com/oxc-project/oxc/pull/3283#issuecomment-2111814598))
- test(lexer): add numeric parsing unit tests
I don't expect this PR to have a large performance improvement, since
the most common case (`Kind::Decimal`) is not affected. We could do
this, however, by splitting `Kind::Decimal` into `Kind::DecimalFloat`
and `Kind::DecimalInt` when the lexer encounters a `.`
## What This PR Does
Updates numeric literal token lexing to record when separator characters
(`_`) are found in a new `Token` flag. This then gets passed to
`parse_int` and `parse_float`, removing the need for a second `_` check
in those two functions.
When run locally, I see no change to lexer benchmarks and minor
improvements to codegen benchmarks. For some reason, semantic and source
map benches seem to be doing slightly worse.
Note that I attempted to implement this with `bitflags!` (making
`escaped` and `is_on_newline` flags as well) and this caused performance
degradation. My best guess is that it turned reads on these flags from a
`mov` to a `mov` + a binary and.
---------
Co-authored-by: Boshen <boshenc@gmail.com>
part of #3213
We should only have one diagnostic struct instead 353 copies of them, so we don't end up choking LLVM with 50k lines of the same code due to monomorphization.
If the proposed approach is good, then I'll start writing a codemod to turn all the existing structs to plain functions.
---
Background:
Using `--timings`, we see `oxc_linter` is slow on codegen (the purple part).

The crate currently contains 353 miette errors. [cargo-llvm-lines](https://github.com/dtolnay/cargo-llvm-lines) displays
```
cargo llvm-lines -p oxc_linter --lib --release
Lines Copies Function name
----- ------ -------------
830350 33438 (TOTAL)
29252 (3.5%, 3.5%) 808 (2.4%, 2.4%) <alloc::boxed::Box<T,A> as core::ops::drop::Drop>::drop
23298 (2.8%, 6.3%) 353 (1.1%, 3.5%) miette::eyreish::error::object_downcast
19062 (2.3%, 8.6%) 706 (2.1%, 5.6%) core::error::Error::type_id
12610 (1.5%, 10.1%) 65 (0.2%, 5.8%) alloc::raw_vec::RawVec<T,A>::grow_amortized
12002 (1.4%, 11.6%) 706 (2.1%, 7.9%) miette::eyreish::ptr::Own<T>::boxed
9215 (1.1%, 12.7%) 115 (0.3%, 8.2%) core::iter::traits::iterator::Iterator::try_fold
9150 (1.1%, 13.8%) 1 (0.0%, 8.2%) oxc_linter::rules::RuleEnum::read_json
8825 (1.1%, 14.9%) 353 (1.1%, 9.3%) <miette::eyreish::error::ErrorImpl<E> as core::error::Error>::source
8822 (1.1%, 15.9%) 353 (1.1%, 10.3%) miette::eyreish::error::<impl miette::eyreish::Report>::construct
8119 (1.0%, 16.9%) 353 (1.1%, 11.4%) miette::eyreish::error::object_ref
8119 (1.0%, 17.9%) 353 (1.1%, 12.5%) miette::eyreish::error::object_ref_stderr
7413 (0.9%, 18.8%) 353 (1.1%, 13.5%) <miette::eyreish::error::ErrorImpl<E> as core::fmt::Display>::fmt
7413 (0.9%, 19.7%) 353 (1.1%, 14.6%) miette::eyreish::ptr::Own<T>::new
6669 (0.8%, 20.5%) 39 (0.1%, 14.7%) alloc::raw_vec::RawVec<T,A>::try_allocate_in
6173 (0.7%, 21.2%) 353 (1.1%, 15.7%) miette::eyreish::error::<impl miette::eyreish::Report>::from_std
6027 (0.7%, 21.9%) 70 (0.2%, 16.0%) <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
6001 (0.7%, 22.7%) 353 (1.1%, 17.0%) miette::eyreish::error::object_drop
6001 (0.7%, 23.4%) 353 (1.1%, 18.1%) miette::eyreish::error::object_drop_front
5648 (0.7%, 24.1%) 353 (1.1%, 19.1%) <miette::eyreish::error::ErrorImpl<E> as core::fmt::Debug>::fmt
```
It's totalling more than 50k llvm lines, and is putting pressure on rustc codegen (the purple part on `oxc_linter` in the image above.
---
It's pretty obvious by looking at https://github.com/zkat/miette/blob/main/src/eyreish/error.rs, the generics can expand out to lots of code.
OK, this is a big one...
I have done this as part of work on Traversable AST, but I believe it
has wider benefits, so thought better to spin it off into its own PR.
## What this PR does
This PR squashes all nested AST enum types (#2685).
e.g.: Previously:
```rs
pub enum Statement<'a> {
BlockStatement(Box<'a, BlockStatement<'a>>),
/* ...other Statement variants... */
Declaration(Declaration<'a>),
}
pub enum Declaration<'a> {
VariableDeclaration(Box<'a, VariableDeclaration<'a>>),
/* ...other Declaration variants... */
}
```
After this PR:
```rs
#[repr(C, u8)]
pub enum Statement<'a> {
BlockStatement(Box<'a, BlockStatement<'a>>) = 0,
/* ...other Statement variants... */
VariableDeclaration(Box<'a, VariableDeclaration<'a>>) = 32,
/* ...other Declaration variants... */
}
#[repr(C, u8)]
pub enum Declaration<'a> {
VariableDeclaration(Box<'a, VariableDeclaration<'a>>) = 32,
/* ...other Declaration variants... */
}
```
All `Declaration`'s variants are combined into `Statement`, but
`Declaration` type still exists.
As both types are `#[repr(C, u8)]`, and the discriminants are aligned, a
`Declaration` can be transmuted to a `Statement` at zero cost.
This is the same thing as #2847, but here applied to *all* nested enums
in the AST, and with improved helper methods.
No enums increase in size, and a few get smaller. Indirection is reduced
for some types (this removes multiple levels of boxing).
## Why?
1. It is a prerequisite for Traversable AST (#2987).
2. It would help a lot with AST Transfer (#2409) - it solves the only
remaining blocker for this.
3. It is a step closer to making the whole AST `#[repr(C)]`.
## Why is it a good thing for the AST to be `#[repr(C)]`?
Oxc's direction appears to be increasingly to build up control over the
fundamental primitives we use, in order to unlock performance and
features. We have our own allocator, our own custom implementations for
`Box` and `Vec`, our own `IndexVec` (TBC). The AST is the central
building block of Oxc, and taking control of its memory layout feels
like a step in this same direction.
Oxc has a major advantage over other similar libraries in that it keeps
all the AST data in an arena. This opens the door to treating the AST
either as Rust types or as *pure data* (just bytes). That data can be
moved around and manipulated beyond what Rust natively allows.
However, to enable that, the types need to be well-specified, with
completely stable layouts. `#[repr(C)]` is the only tool Rust provides
to do this.
Once the types are `#[repr(C)]`, various features become possible:
1. Cheap transfer of the AST across boundaries without ser/deser - the
property used by AST Transfer.
2. Having multiple versions of the AST (standard, read-only,
traversable), and these AST representations can be converted to one
other at zero cost via transmute - the property used by Traversable AST
scheme.
3. Caching AST data on disk (#3079) or transferring across network.
4. Stuff we haven't thought of yet!
Allowing the AST to be treated as pure data will likely unlock other
"next level" features further down the track (caching for "edge
bundling" comes to mind).
## The problem with `#[repr(C)]`
It's not *required* to squash nested enums to make the AST `#[repr(C)]`.
But the problem with `#[repr(C)]` is that it disables some compiler
optimizations. Without `#[repr(C)]`, the compiler squashes enums itself
in some cases (which is how `Statement` is currently 16 bytes). But
making the types `#[repr(C)]` as they are currently disables this
optimization.
So this PR essentially makes explicit what the compiler is already doing
- and in fact goes a bit further with the optimization than the compiler
is able to, in squashing 3 or 4 layers of nested enums (the compiler
only does up to 2 layers).
## Implementation
One enum "inheriting" variants from another is implemented with
`inherit_variants!` macro.
```rs
inherit_variants! {
#[repr(C, u8)]
pub enum Statement<'a> {
BlockStatement(Box<'a, BlockStatement<'a>>),
/* ...other Statement variants... */
// `Declaration` variants added here by `inherit_variants!` macro
@inherit Declaration
// `ModuleDeclaration` variants added here by `inherit_variants!` macro
@inherit ModuleDeclaration
}
}
```
The macro is *fairly* lightweight, and I think the above is quite easy
to understand. No proc macros.
The macro also implements utility methods for converting between enums
e.g. `Statement::as_declaration`. These methods are all zero-cost
(essentially transmutes).
New patterns for dealing with nested enums are introduced:
Creation:
```rs
// Old
let stmt = Statement::Declaration(Declaration::VariableDeclaration(var_decl));
// New
let stmt = Statement::VariableDeclaration(var_decl);
```
Conversion:
```rs
// Old
let stmt = Statement::Declaration(decl);
// New
let stmt = Statement::from(decl);
```
Testing:
```rs
// Old
if matches!(stmt, Statement::Declaration(_)) { }
if matches!(stmt, Statement::ModuleDeclaration(m) if m.is_import()) { }
// New
if stmt.is_declaration() { }
if matches!(stmt, Statement::ImportDeclaration(_)) { }
```
Branching:
```rs
// Old
if let Statement::Declaration(decl) = &stmt { decl.do_stuff() };
// New
if let Some(decl) = stmt.as_declaration() { decl.do_stuff() };
```
Matching:
```rs
// Old
match stmt {
Statement::Declaration(decl) => visitor.visit(decl),
}
// New (exhaustive match)
match stmt {
match_declaration!(Statement) => visitor.visit(stmt.to_declaration()),
}
// New (alternative)
match stmt {
_ if stmt.is_declaration() => visitor.visit(stmt.to_declaration()),
}
```
New syntax has pluses and minuses vs the old. `match` syntax is worse,
but when working with a deeply nested enum, the code is much nicer -
it's shorter and easier to read.
This PR removes 200 lines from the linter with changes like this:
https://github.com/oxc-project/oxc/pull/3115/files#diff-dc417ff57352da6727a760ec6dee22de6816f8231fb69dbef1bf05d478699103L92-R95
```diff
- let AssignmentTarget::SimpleAssignmentTarget(simple_assignment_target) =
- &assignment_expr.left
- else {
- return;
- };
- let SimpleAssignmentTarget::AssignmentTargetIdentifier(ident) =
- simple_assignment_target
+ let AssignmentTarget::AssignmentTargetIdentifier(ident) = &assignment_expr.left
else {
return;
};
```
We used to export `static_assertions` as part of the `oxc_index`. It
would've made sense back when it was only a vessel for exporting other
crates - although even then it wouldn't make much sense other than being
convenient - Now with it turning into a port of `index_vec` and
potentially getting bigger as the result of specific needs of the
project; It makes much more sense to stop exporting it from `oxc_index`
and use the crate directly in places that used to use what `oxc_index`
were exporting.
PS: we may want to follow up this with an `oxc_asset` crate containing
our own set of assertion tools which would also export
`static_assertions`.
Pure refactor. This change does nothing except makes it more consistent
with other types which are also just a wrapper around `Span` e.g.
`NullLiteral` and `TSThisType`.
Box all enum variants for JSX types (`JSXAttributeName`,
`JSXAttributeValue`, `JSXChild`, `JSXElementName`,
`JSXMemberExpressionObject`). Part of #3047.
I'm not sure how to interpret the benchmark results. As I said on #3047:
> I imagine it may cost a little in performance in the parser due to
extra calls to `alloc`, but in return traversing the AST should be
cheaper, as the data is more compact, so less cache misses.
Sure enough, there is a small impact (1%) on the 2 parser benchmarks for
JSX files. However, the other benchmarks have too much noise in them to
see whether this is repaid in a speed up on transformer etc, especially
as the transformer benchmarks also include parsing.
What do you think @Boshen?
Part of #3047.
As with #3058, it's hard to interpret the benchmark results here. But in
this case I think it's easier to see from "first principles" that this
should be an improvement - `ImportSpecifier` is pretty massive (80
bytes) vs `ImportDefaultSpecifier` (40 bytes), and the latter (e.g.
`import React from 'react'`) is common in JS code.
## Why
Due to the usage of `&'alloc mut T` in `oxc_allocator::Box`, and
`bumpalo::collections::Vec` in `oxc_allocator::Vec`, ast types are
currently invariant over their allocator lifetime `'a`. This prevents
`ouroboros` from generating `borrow_*` on ast type fields, leading to
the unfriendly `with_*` api:
c250b288ef/crates/oxc_parser/examples/multi-thread.rs (L82-L84)
## How
- For `oxc_allocator::Vec`, switch to `allocator_api2::vec::Vec`, which
has a covariant relationship with the allocator lifetime.
- For `oxc_allocator::Box`, use `std::ptr::NonNull` which is
specifically designed to be covariant. I don't use
`allocator_api2::boxed::Box` because it holds the allocator for
dropping, so the size is bigger.
## Downside
Now that `oxc_allocator::Box` uses the unsafe `NonNull`. It has to be a
private field to be safe. This make it impossible to do `Box(....)`
pattern matching.
Should be merged after #2829, Tried a few times to get it done with
graphite stacking but found no success. I guess it either doesn't work
with forks or It is just a skill issue since I'm not familiar with it.
closes: #2829closes: #2830
---------
Co-authored-by: Dmytro Maretskyi <maretskii@gmail.com>
* move `visit` and `visit_mut` modules to a super module called `visit`
* add `walk_mut` module containing walk functions
* update `enter_node` and `leave_node` events to not pass a reference in the `VisitMut` trait
* add `AstType`, a non-referencing version of `AstKind` to use with `VisitMut` trait
* update the `VisitMut` trait's usages.
Speed up lexing JSX identifier continuations (i.e. after `-`), by
searching for end of identifier byte-by-byte.
Change does not register on benchmarks, only because benchmarks don't
contain any `<Foo-Bar />` identifiers, so don't exercise this code path.
This PR merges the previous confusing features `serde` and `wasm` into a
single `serialize` feature.
We'll eventually do serialize + type information for both wasm and napi
targets.
`oxc_macros` is removed from `oxc_ast`'s dependency because it requires
`syn` and friends, which goes against our policy ["Third-party
dependencies should be
minimal."](https://oxc-project.github.io/docs/contribute/rules.html#development-policy)
This is one point where Babel and TSESLint diverge. For linter purposes
TSESLint structure makes more sense and that the reason of
https://github.com/typescript-eslint/typescript-eslint/issues/4130
The remaining `is_export` was creating redundant information and made
prettier (and the WIP oxc/prettier) print the AST of `export import X =
Y` as `export export import X = Y`.
Preparatory step for #2620.
This PR purely changes names of types and methods:
* `CompactString` -> `CompactStr`
* `Atom::to_compact_string` -> `to_compact_str`
* `Atom::into_compact_string` -> `into_compact_str`
Have split this into a separate PR as the diff is large, but it does absolutely nothing but renaming (I've checked the whole diff twice, so feel free not to check it again!). This should make it easier to see the content of the substantive change in #2620.
Implements
https://github.com/typescript-eslint/typescript-eslint/issues/2998
The copy of props feels wrong, but could not get it working otherwise
with the box and borrow things 😅
Also I found that TSImportType was missing some entries for visitors and
codegen.
In the case of codegen I'm not really understand the need as all the
types seems to be dismissed?
Change behavior of `byte_search!` macro, to make it easier to understand and use:
1. `handle_match` removed. Macro instead evaluates to the first matching byte.
2. `handle_eof` does not return from enclosing function.
3. Alter syntax to make clear that `continue_if` and `handle_eof` are not closures, so can use `return` statements in them.
These changes enabled by #2552.
This PR greatly simplifies the `byte_search!` macro.
Mainly removing `cold_branch()` from the "not enough bytes remaining for a batch" branch, which allows refactoring so that `handle_match` and `continue_if` don't need to be repeated twice.
Result for performance is inconsistent - a little better on some benchmarks, a little worse on others. But not by significant amounts either way. In my view, the benefit of making the macro simpler outweighs a small speed loss anyway.
Speed up lexing template strings.
This was the last use of `AutoCow` remaining in the lexer, and it's now removed.
Implementation is quite complex, to avoid repeatedly branching on whether an unescaped string is required or not (the way `AutoCow` did). I tried to simplify it down to a single function, but this hurt performance significantly.
Benchmarks do not show much movement, but I believe that's because there aren't many template strings in the benchmarks. Where there are template strings, I believe this speeds up lexing them significantly.
Simplify lexing JSX string attributes. As the search is purely for 1
byte value (the closing quote), and so doesn't require a byte table, use
`memchr`.
This change doesn't really register on benchmarks, but it's one step
closer to removing `AutoCow`, and transitioning all the searches in the
lexer to byte-by-byte.
This gets all the new TS types working to the same level TS output was
before and fixes a bunch of other codegen
---------
Co-authored-by: Boshen <boshenc@gmail.com>
1. Remove the check implementation of the parser
2. Implement it to semantic checker
3. Support typescript's check for duplicate class elements
Support checking for duplicate class elements in semantic checker is
easier to support typescript checking rules.
#2439 made using `continue_if` in `byte_search!` macro safe, as it no longer continues the main loop after a match, so no danger of reading out of bounds if `continue_if` code fast-forwards the current position.
This follow-on PR removes the unsafe blocks, and uses that fast-forward ability in a couple of places.
Refactor `byte_search!` macro to move logic out of the main loop. This ensures the compiler unrolls the loop.
This speeds up lexing single-line comments by 20%-25% on the benchmarks which contain enough comments for the change to register. Presumably the loop wasn't unrolled previously.
The code required to do this is a little odd. It adds an extra `loop {}` which always exits on the first turn (so not really a useful loop), but is required to be able to use `break` to exit that "loop", making 2 different paths for (1) matching byte found and (2) `for` loop completed without finding any match.
This is only way I could find to produce this behavior without using a macro. Is there a more "normal" way to get the same logic?
Catch all illegal UTF-8 bytes with the `UER` byte handler.
From https://datatracker.ietf.org/doc/html/rfc3629:
> The octet values C0, C1, F5 to FF never appear.
This change *should* make no difference at all, as a valid `&str` may not contain any of these byte values anyway. But it's possible if user has e.g. created the string with `str::from_utf8_unchecked` and not obeyed the safety contraints. This will at least contain the damage if that's happened, and panic rather than lead to UB. And since we're already catching other error conditions, may as well catch them all.
Consume multi-line comments faster.
* Initially search for `*/`, `\r`, `\n` or `0xE2` (first byte of
irregular line breaks).
* Once a line break is found, switch to faster search which only looks
for `*/`, as it's not relevant whether there are more line breaks or
not.
Using `memchr` for the 2nd simpler search, as it's efficient for a
search with only one "needle".
Initializing `memchr::memmem::Finder` is fairly expensive, and tried
numerous ways to handle it. This is most performant way I could find.
Any ideas how to avoid re-creating it for each Lexer pass? (it can't be
a `static` as `Finder::new` is not a const function, and `lazy_static!`
is too costly)
Lex string literals as bytes, using same techniques as for identifiers.
Handling escapes could be optimized a bit more, and maybe I'll return to that, but as escapes are fairly rare, it wouldn't be the biggest gain.
This was a bit of a whoopsie in last batch of PRs. This assertion shouldn't be there, because all reads are now via `source.position().read()`, so this assertion says "you can only read some byte values".
Only reason it didn't blow up conformance tests is that they run in release mode.
Sorry. Please merge soon as you can and cover my shame!
This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new `Source` / `SourcePosition` APIs.
Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent.
I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance.
### Techniques
This implementation uses a few different strategies for performance:
* Search byte-by-byte, not char-by-char.
* Process batches of 32 bytes at a time to reduce bounds checks.
* Mark uncommon paths `#[cold]`.
### Structure
The implementation is built in 3 layers:
1. ASCII characters only.
2. ASCII and Unicode characters.
3. `\` escape sequences (and all the above).
`identifier_name_handler` starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt.
I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised.
That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier *starts* with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a `#[cold]` boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric.
### ASCII search macro
The main ASCII search is implemented as a macro. I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked `#[inline]` or `#[inline(always)]`. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro.
This macro, and the `ByteMatchTable` types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple.
Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD.
### Bye bye `AutoCow`
`AutoCow` is removed. Instead, a string-builder is only created if it's needed, when a `\` escape is first encountered. The string builder is also more efficient than `AutoCow` was, as it copies bytes in chunks, rather than 1-by-1.
This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.
Make `Source::set_position` a safe function.
This addresses a shortcoming of #2288.
Instead of requiring caller of `Source::set_position` to guarantee that the `SourcePosition` is created from this `Source`, the preceding PRs enforce this guarantee at the type level.
`Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway.
So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
Introduce invariant that only a single `lexer::Source` can exist on a thread at one time.
This is a preparatory step for #2341.
2 notes:
Restriction is only 1 x `ParserImpl` / `Lexer` / `Source` on 1 *thread* at a time, not globally. So this does not prevent parsing multiple files simultaneously on different threads.
Restriction does not apply to public type `Parser`, only `ParserImpl`. `ParserImpl`s are not created in created in `Parser::new`, but instead in `Parser::parse`, where they're created and then immediately consumed. So the end user is also free to create multiple `Parser` instances (if they want to for some reason) on the same thread.
Split parser into public interface `Parser` and internal implementation `ParserImpl`.
This involves no changes to public API.
This change is a bit annoying, but justification is that it's required for #2341, which I believe to be very worthwhile.
The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code.
All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.
This PR replaces the `Chars` iterator in the lexer with a new structure
`Source`.
## What it does
`Source` holds the source text, and allows:
* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used
later to rewind to that position, without having to clone the entire
`Source` struct.
`Source` has the same invariants as `Chars` - cursor must always be
positioned on a UTF-8 character boundary (i.e. not in the middle of a
multi-byte Unicode character).
However, unsafe APIs are provided to allow a caller to temporarily break
that invariant, as long as they satisfy it again before they pass
control back to safe code. This will be useful for processing batches of
bytes.
## Why
I envisage most of the Lexer migrating to byte-by-byte iteration, and I
believe it'll make a significant impact on performance.
It will allow efficiently processing batches of bytes (e.g. consuming
identifiers or whitespace) without the overhead of calculating code
points for every character. It should also make all the many `peek()`,
`next_char()` and `next_eq()` calls faster.
`Source` is also more performant than `Chars` in itself. This wasn't my
intent, but seems to be a pleasant side-effect of it being less opaque
to the compiler than `Chars`, so it can apply more optimizations.
In addition, because checkpoints don't need to store the entire `Source`
struct, but only a `SourcePosition` (8 bytes), was able to reduce the
size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both
`Copy`.
## Notes on implementation
`Source` is heavily based on Rust's `std::str::Chars` and
`std::slice::Iter` iterators and I've copied the code/concepts from them
as much as possible.
As it's a low-level primitive, it uses raw pointers and contains a *lot*
of unsafe code. I *think* I've crossed the T's and dotted the I's, and
I've commented the code extensively, but I'd appreciate a close review
if anyone has time.
I've split it into 2 commits.
* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer
needed, and replaces `lexer.current.token` with `lexer.token`
everywhere.
Hopefully looking just at the 1st commit will reduce the noise and make
it easier to review.
### `SourcePosition`
There is one annoyance with the API which I haven't been able solve:
`SourcePosition` is a wrapper around a pointer, which can only be
created from the current position of `Source`. Due to the invariant
mentioned above, therefore `SourcePosition` is always in bounds of the
source text, and points to a UTF-8 character boundary. So `Source` can
be rewound to a `SourcePosition` cheaply, without any checks. I had
originally envisaged `Source::set_position` being a safe function, as
`SourcePosition` enforces the necessary invariants itself.
The fly in the ointment is that a `SourcePosition` could theoretically
have been created from *another* `Source`. If that was the case, it
would be out of bounds, and it would be instant UB. Consequently,
`Source::set_position` has to be an unsafe function.
This feels rather ridiculous. *Of course* the parser won't create 2
Lexers at the same time. But still it's *possible*, so I think better to
take the strict approach and make it unsafe until can find a way to
statically prove the safety by some other means. Any ideas?
## Oddity in the benchmarks
There's something really odd going on with the semantic benchmark for
`pdf.mjs`.
While I was developing this, small and seemingly irrelevant changes
would flip that benchmark from +0.5% or so to -4%, and then another
small change would flip it back.
What I don't understand is that parsing happens outside of the
measurement loop in the semantic benchmark, so the parser shouldn't have
*any* effect either way on semantic's benchmarks.
If CodSpeed's flame graph is to be believed, most of the negative effect
appears to be a large Vec reallocation happening somewhere in semantic.
I've ruled out a few things: The AST produced by the parser for
`pdf.mjs` after this PR is identical to what it was before. And
semantic's `nodes` and `scopes` Vecs are same length as they were
before. Nothing seems to have changed!
I really am at a loss to explain it. Have you seen anything like this
before?
One possibility is a fault in my unsafe code which is manifesting only
with `pdf.mjs`, and it's triggering UB, which I guess could explain the
weird effects. I'm running the parser on `pdf.mjs` in Miri now and will
see if it finds anything (Miri doesn't find any problem running the
tests). It's been running for over an hour now. Hopefully it'll be done
by morning!
I feel like this shouldn't merged until that question is resolved, so
marking this as draft in the meantime.
This PR solves the problem of lexer byte handlers all being called
`core::ops::function::FnOnce::call_once` in the flame graphs on
CodSpeed, by defining them as named functions instead of closures.
Pure refactor, no substantive changes.
This fixes a mistake I made in #2237.
I was confused by the `!(...)` wrapping of the preceding `if` test and
missed that there are definitely 2 chars to consume, so can use
`consume_char()` instead of `next_char()`. This makes no difference to
behavior, but it follows the convention to always prefer
`consume_char()` when possible.
I've also refactored the code which confused me, so hopefully others
won't be confused too!
This PR replaces most usages of `lexer.current.chars.next()` with
`lexer.consume_char()`, or a new function `lexer.next_char()`.
This is a preparatory step towards replacing the `Chars` iterator with
something more flexible which can also consume bytes (not `char`s), and
this PR was intended as pure refactor. But surprised to see there is a
small performance bump (no idea why!).
There's an additional benefit: Using `consume_char()` everywhere where
we believe there's definitely a char there to be consumed will make
logic errors produce a panic, rather than silently outputting garbage.
This adds a separate byte handler to the lexer for byte values which
should never be encountered:
1. UTF-8 continuation bytes (i.e. middle of a multi-byte UTF-8 byte
sequence).
2. Bytes values which are illegal in valid UTF-8 strings.
At present, this function is impossible to reach, because
`std::str::Chars` ensures the next byte is always the *start* of a valid
UTF-8 byte sequence. But later changes I intend introducing unsafe code
will make it possible (but highly undesirable!). In the meantime, I
don't think it does any harm to handle this case.
This PR has a large diff, but it contains no substantive changes
whatsoever. It purely breaks up the lexer into multiple smaller files.
I've been working quite intensively on the lexer over past few weeks,
but still have been finding it hard to make sense of, due to most of the
logic currently being contained in [a single 1800-line
file](018675ceb1/crates/oxc_parser/src/lexer/mod.rs).
I feel that breaking it up into multiple files makes it much easier to
navigate and understand.
An additional benefit is that many functions can have their visibility
reduced to module scope, so sub-systems for e.g. lexing numbers have
fewer exposed functions. This makes it clearer what the entry points
are, and makes it harder to make mistakes when working on the lexer.
I intend to later make changes to the lexer for performance which will
introduce unsafe code. Keeping that unsafe code encapsulated in modules
will make it more viable to validate the workings of that code, and
avoid accidental UB.
There is one downside to this change. Previously
[`lexer/mod.rs`](018675ceb1/crates/oxc_parser/src/lexer/mod.rs)
was laid out in same order as the JS spec. If you were trying to
validate the lexer against the spec, this would make it easier. However,
as OXC's parser is fairly mature at this point, and I imagine most
spec-compliance issues have been flushed out by now, in my opinion this
advantage is less compelling than it probably used to be. So in my view
it's outweighed by the benefit of more readable code.
Reviewing this could be a bit of a battle due to the size of the diff. I
do have further changes I'd like to make, but I've intentionally kept
this PR as 100% just:
1. Moving code around.
2. Reducing visibility of functions to module/super scope where that's
possible to do without changing anything else.
Aside from that, not even a single comment has changed.
If you're willing to trust me on that promise, I think it can be merged
without poring through it line by line.
A faster way to calculate offset in the lexer.
This only moves the needle because it's on the hottest path in the lexer
- `Lexer::offset` is called for every token in `Lexer::read_next_token`.
All the ASCII `ByteHandler`s are unsafe to call. I forgot to mark them
as unsafe when making that change.
This PR fixes that, and will make it harder for someone to accidentally
call one of them without considering the safety invariants.
Counter-intuitively, it seems that *increasing* the size of `Token`
improves performance slightly.
This appears to be because when `Token` is 16 bytes, copying `Token` is
a single 16-byte load/store. At present, it's 12 bytes which requires an
8-byte load/store + a 4-byte load/store.
https://godbolt.org/z/KPYsn3ab7
This suggests that either:
1. #2010 could be reverted at no cost, and the overhead of the hash
table removed.
or:
2. We need to get `Token` down to 8 bytes!
I have an idea how to *maybe* do (2), so I'd suggest leaving it as is
for now until I've been able to research that.
NB I also tried putting `#[repr(align(16))]` on `Token` so that copying
uses aligned loads/stores. That [hurt the benchmarks very
slightly](https://codspeed.io/overlookmotel/oxc/branches/lexer-pad-token),
though it might produce a gain on architectures where unaligned loads
are more expensive (ARM64 I think?). But I can't test that theory, so
have left it out.
This change makes little difference in itself, but moving the check into
the lexer will allow some optimizations in lexer using unsafe code which
depend on this invariant.
Maximum length of source parser can accept is limited on 32-bit systems
to `isize::MAX` (i.e. `i32::MAX` not `u32::MAX`) because Rust [limits
the size of
allocations](https://doc.rust-lang.org/std/alloc/struct.Layout.html#method.from_size_align)
to `isize::MAX`.
This PR takes that constraint into account when calculating
`Parser::MAX_LEN`.
It also speeds up the `overlong_source` test so it runs in under 500ms
(previously it took ~4 secs on a M1 Macbook Pro).
This PR adds benchmarks for the lexer. I'm doing some work on optimizing
the lexer and I thought it'd be useful to see the effects of changes in
isolation, separate from the parser.
These benchmarks may not be ideal to keep long-term, but for now it'd be
useful.
In order to do so, it's necessary for `oxc_parser` crate to expose the
lexer, but have done that without adding it to the docs, and using an
alias `__lexer`.
Small optimization to the lexer.
Whitespace, line breaks, and comments are all skipped by
`read_next_token()`.
At present there's a different `Kind` for each, and `read_next_token()`
decides whether to skip with `matches!(kind, Kind::WhiteSpace |
Kind::NewLine | Kind::Comment | Kind::MultiLineComment)`.
These `Kind`s are used for no other purpose, so there seems little
reason to differentiate them.
This PR combines them all into `Kind::Skip`, so then the test of whether
to skip is reduced to `kind == Kind::Skip`.
Only produces ~0.3% performance bump on parser benchmarks. But, why
not?...
As discussed on #2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).
This PR achieves the same objective of #2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.
Before #2046:
```rs
const SPS: ByteHandler = |lexer| {
lexer.consume_char();
Kind::WhiteSpace
};
```
After this PR:
```rs
ascii_byte_handler!(SPS(lexer) {
lexer.consume_char();
Kind::WhiteSpace
});
```
i.e. The body of the handlers are unchanged from how they were before
https://github.com/oxc-project/oxc/pull/2046.
This expands to:
```rs
const SPS: ByteHandler = |lexer| {
unsafe {
let s = lexer.current.chars.as_str();
assert_unchecked!(!s.is_empty());
assert_unchecked!(s.as_bytes()[0] < 128);
}
lexer.consume_char();
Kind::WhiteSpace
};
```
But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by #2046 is unnecessary,
and can be removed again.
The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).
I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:
1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.
What do you think? Does this strike a reasonable balance between
readability and safety?
In the lexer, most `BYTE_HANDLER`s immediately consume the current char
with `lexer.consume_char()`.
Byte handlers are only called if there's a certain value (or range of
values) for the next char. This is their entire purpose. So in all cases
we know for sure that we're not at EOF, and that the next char is a
single-byte ASCII character.
The compiler, however, doesn't seem to be able to "see through" the
`BYTE_HANDLERS[byte](self)` call and understand these invariants. So it
produces very verbose ASM for `lexer.consume_char()`.
This PR replaces `lexer.consume_char()` in the byte handlers with an
unsafe `lexer.consume_ascii_char()` which skips on to next char with a
single `inc` instruction.
The difference in codegen can be seen here:
https://godbolt.org/z/1ha3cr9W5 (compare the 2 x
`core::ops::function::FnOnce::call_once` handlers).
Downside is that this does introduce a lot of unsafe blocks, but in my
opinion they're all pretty trivial to validate.
---------
Co-authored-by: Boshen <boshenc@gmail.com>
2 related changes to lexer's `read_next_token()`:
1. Hint to branch predictor that unicode identifiers and non-standard
whitespace are rare by marking that branch `#[cold]`.
2. The branch is on whether next character is ASCII or not. This check
only requires reading 1 byte, as ASCII characters are always single byte
in UTF8. So only do the work of getting a `char` in the cold path, once
it's established that character is not ASCII and this work is required.
This PR removes some code in parsing regexp flags which is extraneous:
```rs
if !ch.is_ascii_lowercase() {
self.error(diagnostics::RegExpFlag(ch, self.current_offset()));
continue;
}
```
Which is followed by:
```rs
let flag = if let Ok(flag) = RegExpFlags::try_from(ch) {
flag
} else {
self.error(diagnostics::RegExpFlag(ch, self.current_offset()));
continue;
};
```
`!ch.is_ascii_lowercase()` is equivalent to `ch < 'a' || ch > 'z'`. The
compiler implements `RegExpFlags::try_from(ch)` as `ch < 'd' || ch >
'y'` and then a jump table. So `ch.is_ascii_lowercase()` does nothing
that `RegExpFlags::try_from(ch)` doesn't do already.
https://godbolt.org/z/51GPPY9nx
(this PR built on top of #2007 for ease)
Part of #1880
`Token` size is reduced from 32 to 16 bytes by changing the previous
token value `Option<&'a str>` to a u32 index handle.
It would be nice if this handle is eliminated entirely because
the normal case for a string is always
`&source_text[token.span.start.token.span.end]`
Unfortunately, JavaScript allows escaped characters to appear in
identifiers, strings and templates. These strings need to be unescaped
for equality checks, i.e. `"\a" === "a"`.
This leads us to adding a `escaped_strings[]` vec for storing these
unescaped and allocated
strings.
Performance regression for adding this vec should be minimal because
escaped strings are rare.
Background Reading:
* https://floooh.github.io/2018/06/17/handles-vs-pointers.html
This PR is part of #1880.
`Token` size is reduced from 48 to 40 bytes.
To reconstruct the regex pattern and flags within the parser , the regex
string is
re-parsed from the end by reading all valid flags.
In order to make things work nicely, the lexer will no longer recover
from a invalid regex.
This PR partially fixes#1803 and is part of #1880.
BigInt is removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
`Token` is now also `Copy`, which removes all the `clone` and `drop`
calls.
This yields 5% performance improvement for the parser.
Parser incorrectly identifies string literals as directives if they
follow after `import`s, `export`s, or decorators.
In all of these cases, `'use strict'` produces a directive in the AST,
where it should be parsed as an `ExpressionStatement` containing a
`StringLiteral`:
```js
import x from 'foo';
'use strict';
```
```js
export {x};
'use strict';
```
```js
@foo
'use strict';
```
[Playground](https://oxc-project.github.io/oxc/playground/?code=3YCAAIC0gICAgICAgIC0G8rnONK89ITJ3zrK%2FUP7OmSZPgHQzStr3yMtwFTU%2BD1WPt09JgqZJLoYooydbGsM5vGcf34BnIA%3D)
This PR should fix that.
I'm not sure about the decorator case, though. I assume it's not a
directive. But is prefixing a string literal with a decorator even legal
syntax anyway?
And a side nit: If I'm reading it right, I don't think the `continue`
statement in the decorator arm of the match does anything. Do I have
that right?
Last question: Where does one go about putting a test? I guess these
silly cases aren't covered by Babel etc's tests.
---------
Co-authored-by: Boshen <boshenc@gmail.com>
`Token` and `Span` both represent `start` and `end` as `u32`.
This limits size of source which can be parsed to `u32::MAX`.
19577709db/crates/oxc_span/src/span.rs (L14-L20)
However, this constraint is currently not enforced.
In a release build, code will not panic on arithmetic overflow, so
`start`/`end` could wrap around back to zero if source is 4 GiB or more.
That'd produce nonsense spans. But worse, the lexer relies in some
places on `self.current.token.start` being correct, so if the value
wrapped around, possibly it'd keep rewinding to the start of the source
and lexing it again, causing an infinite loop.
In worst case, if for some reason an application's public API used OXC's
parser with user-supplied source code (parser-as-a-service!), this could
be exploited for denial of service.
This PR adds an assertion to catch this at the start of parsing instead.
This does add an extra instruction, but I imagine the effect will be
negligible compared to the work required to parse the code.
Parser, trivias and trivias_builder were edited to get all whitespaces.
Now Trivias struct store comments and whitespaces Vec. After that, i
will implement the no-irregular-whitespace rule.
P.S.: There isn't a way to implement this feature without lose a little
bit of performance, comparing with my last PR #1819 to minimax this
trouble instead of store the irregular whitespace as Span it was stored
as u32, i removed a map iterator and removed too a unused function. If
you have a suggestion about it pls give me a feedback.
Just removes a couple of lines of redundant code from the lexer.
A note on the 2nd one:
```rs
let mut builder = AutoCow::new(lexer);
let c = lexer.consume_char();
builder.push_matching(c);
```
`push_matching()` is a no-op unless
`force_allocation_without_current_ascii_char()` has already been called.
Here the `AutoCow` has just been freshly created, so we know it hasn't.