overlookmotel cdef41d552
refactor(parser): lexer replace Chars with Source (#2288)
This PR replaces the `Chars` iterator in the lexer with a new structure
`Source`.

## What it does

`Source` holds the source text, and allows:

* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used
later to rewind to that position, without having to clone the entire
`Source` struct.

`Source` has the same invariants as `Chars` - cursor must always be
positioned on a UTF-8 character boundary (i.e. not in the middle of a
multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break
that invariant, as long as they satisfy it again before they pass
control back to safe code. This will be useful for processing batches of
bytes.
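
As a rough illustration of the API shape described above, here is a minimal sketch. It is not the PR's implementation: it tracks a byte offset rather than raw pointers, its rewind is bounds-checked, and every method name other than `set_position` is an assumption made for the example.

```rust
/// Hypothetical, simplified stand-in for the PR's `Source`.
/// It stores a byte offset instead of the raw pointers the real code uses.
pub struct Source<'a> {
    text: &'a str,
    /// Byte offset of the cursor. Invariant: always on a UTF-8 char boundary.
    pos: usize,
}

/// Cheap, `Copy` handle to a cursor position (an offset in this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SourcePosition(usize);

impl<'a> Source<'a> {
    pub fn new(text: &'a str) -> Self {
        Self { text, pos: 0 }
    }

    /// Consume and return the next char, as `Chars::next` would.
    pub fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }

    /// Look at the next byte without consuming it.
    pub fn peek_byte(&self) -> Option<u8> {
        self.text.as_bytes().get(self.pos).copied()
    }

    /// Consume the next byte only if it is ASCII. An ASCII byte is a whole
    /// character, so the cursor stays on a char boundary.
    pub fn next_ascii_byte(&mut self) -> Option<u8> {
        let b = self.peek_byte()?;
        if b.is_ascii() {
            self.pos += 1;
            Some(b)
        } else {
            None
        }
    }

    /// Record the current position so the cursor can be rewound later.
    pub fn position(&self) -> SourcePosition {
        SourcePosition(self.pos)
    }

    /// Rewind to a recorded position. This sketch checks the position and
    /// panics if it is invalid; the PR's version is unchecked.
    pub fn set_position(&mut self, pos: SourcePosition) {
        assert!(self.text.is_char_boundary(pos.0));
        self.pos = pos.0;
    }

    /// The text that has not been consumed yet.
    pub fn remaining(&self) -> &'a str {
        &self.text[self.pos..]
    }
}
```

For example, after consuming `a` from `"aé!"`, taking a position, and consuming `é` with `next_char()`, calling `set_position` with the saved position leaves `"é!"` remaining.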

## Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I
believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming
identifiers or whitespace) without the overhead of calculating code
points for every character. It should also make all the many `peek()`,
`next_char()` and `next_eq()` calls faster.
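
To make the batch idea concrete, here is a small sketch of consuming a run of ASCII identifier bytes without decoding any code points. The function names and the exact set of accepted bytes are assumptions for illustration, not the lexer's real code.

```rust
/// Length in bytes of the leading run of ASCII identifier-part characters.
/// The scan stops at the first non-ASCII byte, where a caller could fall
/// back to char-by-char decoding to handle Unicode identifiers.
pub fn ascii_identifier_part_len(source: &str) -> usize {
    source
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_' || *b == b'$')
        .count()
}

/// Split `source` into the leading ASCII identifier run and the rest.
pub fn take_ascii_identifier_part(source: &str) -> (&str, &str) {
    // Every counted byte is ASCII, so the split index is a char boundary
    // and `split_at` cannot panic here.
    source.split_at(ascii_identifier_part_len(source))
}
```

With this sketch, `take_ascii_identifier_part("foo_bar1 = 2")` splits into `"foo_bar1"` and `" = 2"`, and the whole run is consumed by comparing bytes only.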

`Source` is also more performant than `Chars` in itself. This wasn't my
intent, but seems to be a pleasant side-effect of it being less opaque
to the compiler than `Chars`, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire `Source`
struct, but only a `SourcePosition` (8 bytes), I was able to reduce the
size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both
`Copy`.
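
The size difference can be sketched as follows. Both checkpoint structs here are hypothetical stand-ins (the real `LexerCheckpoint` holds more fields), and the 8-byte figure assumes a 64-bit target.

```rust
use std::mem::size_of;
use std::str::Chars;

/// Hypothetical "before" checkpoint: it has to hold a clone of the whole
/// `Chars` iterator, which is `Clone` but not `Copy`.
#[derive(Clone)]
pub struct CharsCheckpoint<'a> {
    pub chars: Chars<'a>,
}

/// Hypothetical "after" checkpoint: a single pointer-sized position.
#[derive(Clone, Copy)]
pub struct PositionCheckpoint {
    pub position: *const u8,
}

/// Returns `(size of the Chars-based checkpoint, size of the position-based one)`.
pub fn checkpoint_sizes() -> (usize, usize) {
    (
        size_of::<CharsCheckpoint<'static>>(),
        size_of::<PositionCheckpoint>(),
    )
}

/// Compiles only because `PositionCheckpoint` is `Copy`: the value is used twice.
pub fn copy_twice(c: PositionCheckpoint) -> (PositionCheckpoint, PositionCheckpoint) {
    (c, c)
}
```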

## Notes on implementation

`Source` is heavily based on Rust's `std::str::Chars` and
`std::slice::Iter` iterators and I've copied the code/concepts from them
as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a *lot*
of unsafe code. I *think* I've crossed the T's and dotted the I's, and
I've commented the code extensively, but I'd appreciate a close review
if anyone has time.

I've split it into 2 commits.

* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer
needed, and replaces `lexer.current.token` with `lexer.token`
everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make
it easier to review.

### `SourcePosition`

There is one annoyance with the API which I haven't been able to solve:

`SourcePosition` is a wrapper around a pointer, which can only be
created from the current position of `Source`. Due to the invariant
mentioned above, a `SourcePosition` is therefore always in bounds of the
source text, and points to a UTF-8 character boundary. So `Source` can
be rewound to a `SourcePosition` cheaply, without any checks. I had
originally envisaged `Source::set_position` being a safe function, as
`SourcePosition` enforces the necessary invariants itself.

The fly in the ointment is that a `SourcePosition` could theoretically
have been created from *another* `Source`. If that was the case, it
would be out of bounds, and it would be instant UB. Consequently,
`Source::set_position` has to be an unsafe function.
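
The problem can be sketched with a pointer-based cursor. This is a simplified illustration, not the PR's code: the field layout, the lifetime parameter on `SourcePosition`, and the helper methods are assumptions.

```rust
use std::marker::PhantomData;

/// Hypothetical pointer-based cursor over a `&str`.
pub struct Source<'a> {
    start: *const u8,
    end: *const u8,
    /// Invariant: within `start..=end` and on a UTF-8 char boundary.
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

/// A bare pointer: it does not record which `Source` it came from.
#[derive(Clone, Copy)]
pub struct SourcePosition<'a> {
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

impl<'a> Source<'a> {
    pub fn new(text: &'a str) -> Self {
        let start = text.as_ptr();
        // SAFETY: `start + len` is one past the end of the same allocation.
        let end = unsafe { start.add(text.len()) };
        Self { start, end, ptr: start, _marker: PhantomData }
    }

    pub fn position(&self) -> SourcePosition<'a> {
        SourcePosition { ptr: self.ptr, _marker: PhantomData }
    }

    /// # Safety
    /// `pos` must have been created from *this* `Source`. A position taken
    /// from another `Source` points into a different string, so the pointer
    /// would be out of bounds here and later reads would be UB.
    pub unsafe fn set_position(&mut self, pos: SourcePosition<'a>) {
        debug_assert!(pos.ptr >= self.start && pos.ptr <= self.end);
        self.ptr = pos.ptr;
    }

    pub fn remaining(&self) -> &'a str {
        // SAFETY: `ptr` is in bounds and on a char boundary (struct invariant),
        // so the bytes from `ptr` to `end` are valid UTF-8.
        unsafe {
            let len = self.end as usize - self.ptr as usize;
            std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.ptr, len))
        }
    }

    pub fn next_char(&mut self) -> Option<char> {
        let c = self.remaining().chars().next()?;
        // SAFETY: `c` was decoded from the remaining text, so advancing by
        // its length stays in bounds and lands on a char boundary.
        self.ptr = unsafe { self.ptr.add(c.len_utf8()) };
        Some(c)
    }
}
```

Nothing in the type of `SourcePosition` ties it to one particular `Source`, which is why the caller has to uphold that contract and the function cannot be safe as written.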

This feels rather ridiculous. *Of course* the parser won't create 2
Lexers at the same time. But still it's *possible*, so I think it's better
to take the strict approach and make it unsafe until we can find a way to
statically prove the safety by some other means. Any ideas?

## Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for
`pdf.mjs`.

While I was developing this, small and seemingly irrelevant changes
would flip that benchmark from +0.5% or so to -4%, and then another
small change would flip it back.

What I don't understand is that parsing happens outside of the
measurement loop in the semantic benchmark, so the parser shouldn't have
*any* effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect
appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for
`pdf.mjs` after this PR is identical to what it was before. And
semantic's `nodes` and `scopes` Vecs are the same length as they were
before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this
before?

One possibility is a fault in my unsafe code which is manifesting only
with `pdf.mjs`, and it's triggering UB, which I guess could explain the
weird effects. I'm running the parser on `pdf.mjs` in Miri now and will
see if it finds anything (Miri doesn't find any problem running the
tests). It's been running for over an hour now. Hopefully it'll be done
by morning!

I feel like this shouldn't be merged until that question is resolved, so
marking this as draft in the meantime.
2024-02-05 13:51:46 +00:00
.cargo ci: disable native build 2023-11-20 17:45:51 +08:00
.github chore(renovate): ignore miette and ureq 2024-02-05 14:14:29 +08:00
.vscode chore: add some useful informantion log (#1912) 2024-01-06 22:30:01 +08:00
crates refactor(parser): lexer replace Chars with Source (#2288) 2024-02-05 13:51:46 +00:00
editors/vscode Release oxlint and vscode extension v0.2.7 2024-02-03 21:21:23 +08:00
fuzz chore(fuzz): add a timeout command 2024-02-05 14:41:14 +08:00
napi/parser feat: setup wasm parser for npm (#2221) 2024-01-30 21:40:10 +08:00
npm Release oxlint and vscode extension v0.2.7 2024-02-03 21:21:23 +08:00
tasks feat(semantic): report parameter related errors for setter/getter (#2316) 2024-02-05 17:38:43 +08:00
wasm/parser Release @oxc-parser/wasm v0.0.5 2024-02-05 21:10:11 +08:00
website chore(deps): update website npm packages (#2303) 2024-02-05 13:11:40 +08:00
.git-blame-ignore-revs chore: update .git-blame-ignore-revs 2023-07-28 13:57:29 +08:00
.gitignore chore: ignore git submodules 2024-02-04 16:15:01 +08:00
.ignore chore: add just watch command for overcoming cargo-watch being slow 2023-05-16 13:22:42 +08:00
.taplo.toml feat: Release resolver with NAPI (#1212) 2023-11-10 15:25:17 +00:00
.typos.toml refactor(parser): split lexer into multiple files (#2228) 2024-01-31 11:43:53 +08:00
Cargo.lock chore(deps): update rust crates (#2302) 2024-02-05 14:36:53 +08:00
Cargo.toml chore(clippy): disable nursery group rules (#2319) 2024-02-05 18:43:15 +08:00
CONTRIBUTING.md chore(CONTRIBUTING): use the website content 2023-12-16 21:02:52 +08:00
deny.toml ci: add cargo deny 2023-04-22 22:35:19 +08:00
justfile chore: manually clone git modules instead of using submodules (#2274) 2024-02-02 11:56:18 +00:00
LICENSE Change license holder to @boshen 2023-11-10 14:26:11 +08:00
MAINTENANCE.md Publish crates v0.6.0 2024-02-03 22:35:30 +08:00
README.md Update README 2024-01-19 18:39:07 +08:00
rust-toolchain.toml chore: upgrade rustc toolchain to stable 1.75.0 (#1853) 2023-12-29 12:20:51 +08:00
rustfmt.toml chore(rustfmt): disable all unstable format options 2023-07-27 13:11:46 +08:00
THIRD-PARTY-LICENSE chore: update readme 2023-12-06 19:03:57 +08:00

Oxc

The Oxidation Compiler is creating a collection of high-performance tools for JavaScript and TypeScript.

Oxc is building a parser, linter, formatter, transpiler, minifier, resolver ... all written in Rust.

💡 Philosophy

This project shares the same philosophies as Biome and Ruff.

  1. JavaScript tooling could be rewritten in a more performant language.
  2. An integrated toolchain can tap into efficiencies that are not available to a disparate set of tools.

Quick Start

The linter is ready to catch mistakes for you. It comes with over 60 default rules and no configuration is required.

To start using it, install oxlint or run it via npx:

npx oxlint@latest

To give you an idea of its capabilities, here is an example from the vscode repository, which finishes linting 4000+ files in 0.5 seconds.

Performance

  • The parser aims to be the fastest Rust-based ready-for-production parser.
  • The linter is more than 50 times faster than ESLint, and scales with the number of CPU cores.

⌨️ Programming Usage

Rust

Individual crates are published; you may use them to build your own JavaScript tools.

  • The umbrella crate oxc exports all public crates from this repository.
  • The AST and parser crates oxc_ast and oxc_parser are production ready.
  • See crates/*/examples for example usage.

While Rust has gained a reputation for its comparatively slower compilation speed, we have dedicated significant effort to fine-tune the Rust compilation speed. Our aim is to minimize any impact on your development workflow, ensuring that developing your own Oxc based tools remains a smooth and efficient experience.

This is demonstrated by our CI runs, where warm runs complete in 5 minutes.

Node.js


🎯 Tools

🔸 AST and Parser

Oxc maintains its own AST and parser, which is by far the fastest and most conformant JavaScript and TypeScript (including JSX and TSX) parser written in Rust.

As the parser often represents a key performance bottleneck in JavaScript tooling, any minor improvements can have a cascading effect on our downstream tools. By developing our parser, we have the opportunity to explore and implement well-researched performance techniques.

While many existing JavaScript tools rely on estree as their AST specification, a notable drawback is its abundance of ambiguous nodes. This ambiguity often leads to confusion during development with estree.

The Oxc AST differs slightly from the estree AST by removing ambiguous nodes and introducing distinct types. For example, instead of using a generic estree Identifier, the Oxc AST provides specific types such as BindingIdentifier, IdentifierReference, and IdentifierName. This clear distinction greatly enhances the development experience by aligning more closely with the ECMAScript specification.
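
A minimal sketch of why distinct identifier types help: in `const a = b.c;`, `a` is a binding, `b` is a reference, and `c` is a plain property name. The struct fields and the `declare` helper below are simplified assumptions, not the definitions in `oxc_ast`.

```rust
/// Introduces a new name into a scope, e.g. `a` in `const a = b.c;`.
pub struct BindingIdentifier {
    pub name: String,
}

/// Refers to an existing binding, e.g. `b` in `const a = b.c;`.
pub struct IdentifierReference {
    pub name: String,
}

/// A name that is neither a binding nor a reference, e.g. `c` in `b.c`.
pub struct IdentifierName {
    pub name: String,
}

/// Hypothetical helper: because it accepts only `BindingIdentifier`, passing
/// a reference or a property name is a compile-time error rather than a
/// runtime check on a generic `Identifier` node.
pub fn declare(scope: &mut Vec<String>, id: &BindingIdentifier) {
    scope.push(id.name.clone());
}
```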

🏆 Parser Performance

Our benchmark reveals that the Oxc parser surpasses the speed of the swc parser by approximately 2 times and the Biome parser by 3 times.

How is it so fast?
  • AST is allocated in a memory arena (bumpalo) for fast AST memory allocation and deallocation.
  • Short strings are inlined by CompactString.
  • No other heap allocations are done except the above two.
  • Scope binding, symbol resolution and some syntax errors are not handled in the parser; they are delegated to the semantic analyzer.

🔸 Linter

The linter embraces convention over configuration, eliminating the need for extensive configuration and plugin setup. Unlike other linters like ESLint, which often require intricate configurations and plugin installations (e.g. @typescript-eslint), our linter only requires a single command that you can immediately run on your codebase:

npx oxlint@latest

🏆 Linter Performance

The linter is 50 - 100 times faster than ESLint depending on the number of rules and number of CPU cores used. It completes in less than a second for most codebases with a few hundred files and completes in a few seconds for larger monorepos. See bench-javascript-linter for details.

As an upside, the binary is approximately 5MB, whereas ESLint and its associated plugin dependencies can easily exceed 100MB.

You may also download the linter binary from the latest release tag as a standalone binary. This lets you run the linter without a Node.js installation in your CI.

How is it so fast?
  • Oxc parser is used.
  • AST visit is a fast operation due to linear memory scan from the memory arena.
  • Files are linted in a multi-threaded environment, so scales with the total number of CPU cores.
  • Every single lint rule is tuned for performance.
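
The multi-threaded point can be sketched with the standard library alone. This is an illustration under assumptions, not how oxlint schedules its work: `lint_file` is a toy stand-in for running lint rules, and the files are simply split into one chunk per available core.

```rust
use std::thread;

/// Hypothetical stand-in for linting one file: counts `debugger` occurrences.
fn lint_file(source: &str) -> usize {
    source.matches("debugger").count()
}

/// Lint all files in parallel and return the total number of findings.
pub fn lint_all(files: &[&str]) -> usize {
    // One worker per available core, falling back to a single worker.
    let workers = thread::available_parallelism().map_or(1, |n| n.get());
    // `chunks` panics on a chunk size of zero, so clamp it to at least 1.
    let chunk_size = files.len().div_ceil(workers).max(1);

    thread::scope(|s| {
        // Spawn one scoped thread per chunk; scoped threads may borrow `files`.
        let handles: Vec<_> = files
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|f| lint_file(f)).sum::<usize>()))
            .collect();

        // Wait for every worker and add up the per-chunk totals.
        handles
            .into_iter()
            .map(|h| h.join().expect("lint worker panicked"))
            .sum::<usize>()
    })
}
```

Each file is linted independently, so the work divides cleanly across threads and no locking is needed until the totals are combined.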

🔸 Resolver

Module resolution plays a crucial role in JavaScript tooling, especially for tasks like multi-file analysis or bundling. However, it can often become a performance bottleneck. To address this, we are actively working on porting enhanced-resolve.

The resolver is production-ready and is currently being used in Rspack. Usage and examples can be found in its own repository.

🔸 Transpiler

A transpiler is responsible for turning higher versions of ECMAScript to a lower version that can be used in older browsers. We are currently focusing on an esnext to es2015 transpiler. See the umbrella issue for details.

🔸 Minifier

JavaScript minification plays a crucial role in optimizing website performance as it reduces the amount of data sent to users, resulting in faster page loads. This holds tremendous economic value, particularly for e-commerce websites, where every second can equate to millions of dollars.

However, existing minifiers typically require a trade-off between compression quality and speed. You have to choose between the slowest for the best compression or the fastest for less compression. But what if we could develop a faster minifier without compromising on compression?

We are actively working on a prototype that aims to achieve this goal, by porting all test cases from well-known minifiers such as google-closure-compiler, terser, esbuild, and tdewolff-minify.

Preliminary results indicate that we are on track to achieve our objectives. With the Oxc minifier, you can expect faster minification times without sacrificing compression quality.

🔸 Formatter

While prettier has established itself as the de facto code formatter for JavaScript, there is a significant demand in the developer community for a less opinionated alternative. Recognizing this need, our ambition is to undertake research and development to create a new JavaScript formatter that offers increased flexibility and customization options.

The prototype is currently a work in progress.


✍️ Contribute

See CONTRIBUTING.md for guidance.

Check out some of the good first issues or ask us on Discord.

If you are unable to contribute by code, you can still participate by:

📚 Learning Resources

🤝 Credits

This project was incubated with the assistance of these exceptional mentors and their projects:

Special thanks go to

And also

📖 License

Oxc is free and open-source software licensed under the MIT License.

Oxc partially copies code from the following projects. Their licenses are listed in Third-party library licenses.

| Project | License |
| --- | --- |
| eslint/eslint | MIT |
| typescript-eslint/typescript-eslint | MIT |
| import-js/eslint-plugin-import | MIT |
| jest-community/eslint-plugin-jest | MIT |
| microsoft/TypeScript | Apache 2.0 |
| biomejs/biome | MIT |
| mozilla-spidermonkey/jsparagus | MIT Apache 2.0 |
| prettier/prettier | MIT |
| acorn | MIT |
| zkat/miette | Apache 2.0 |
| sindresorhus/globals | MIT |
| terser | BSD |
| evanw/esbuild | MIT |
| google/closure-compiler | Apache 2.0 |
| tdewolff/minify | MIT |