Fixes #2258
### Overview
- Re-implemented the config parser to use `serde::Deserialize`
- In order to benefit from it as much as possible, avoided implementing
custom deserializers and tried to use attributes as much as possible
- This required some changes to the caller signatures...
➕
- Fixed a bug where abbreviations like `"rule-name": 1` were not supported
- Fixed settings that should have been located in `settings.react` but
were not
Make `Source::set_position` a safe function.
This addresses a shortcoming of #2288.
Instead of requiring the caller of `Source::set_position` to guarantee that the `SourcePosition` was created from this `Source`, the preceding PRs enforce this guarantee at the type level.
`Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway.
So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
Introduce invariant that only a single `lexer::Source` can exist on a thread at one time.
This is a preparatory step for #2341.
Two notes:
1. The restriction is only one `ParserImpl` / `Lexer` / `Source` per *thread* at a time, not globally. So this does not prevent parsing multiple files simultaneously on different threads.
2. The restriction does not apply to the public type `Parser`, only `ParserImpl`. `ParserImpl`s are not created in `Parser::new`, but instead in `Parser::parse`, where they're created and then immediately consumed. So the end user is also free to create multiple `Parser` instances (if they want to for some reason) on the same thread.
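For illustration, here is a minimal sketch of how a one-per-thread invariant like this could be enforced. It is not the code from this PR: `UniqueToken`, `TOKEN_TAKEN` and the simplified `Source` are hypothetical names, and the sketch assumes a thread-local flag is an acceptable way to track whether a `Source` is alive.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

thread_local! {
    // Hypothetical flag: `true` while a `UniqueToken` is alive on this thread.
    static TOKEN_TAKEN: Cell<bool> = Cell::new(false);
}

/// Hypothetical proof that no other `Source` exists on the current thread.
/// `PhantomData<*const ()>` makes the token `!Send`, so it cannot move to another thread.
pub struct UniqueToken {
    _not_send: PhantomData<*const ()>,
}

impl UniqueToken {
    /// Returns `None` if a token is already alive on this thread.
    pub fn try_new() -> Option<UniqueToken> {
        TOKEN_TAKEN.with(|taken| {
            if taken.replace(true) {
                None
            } else {
                Some(UniqueToken { _not_send: PhantomData })
            }
        })
    }
}

impl Drop for UniqueToken {
    fn drop(&mut self) {
        // Free the slot so the next parse on this thread can proceed.
        TOKEN_TAKEN.with(|taken| taken.set(false));
    }
}

/// Simplified stand-in for `lexer::Source`. It owns the token for its whole lifetime,
/// so at most one `Source` can exist per thread.
pub struct Source<'a> {
    text: &'a str,
    _token: UniqueToken,
}

impl<'a> Source<'a> {
    pub fn new(text: &'a str, token: UniqueToken) -> Self {
        Source { text, _token: token }
    }

    pub fn text(&self) -> &'a str {
        self.text
    }
}
```

Because the flag is thread-local, a second thread can obtain its own token while the first one is still held, which matches note 1 above.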
Running latest on one of my projects these warnings jumped out at me
because they were "anonymous" vs the others.
This PR just adds the usual rule-name prefix to the errors where it was
missing
Split parser into public interface `Parser` and internal implementation `ParserImpl`.
This involves no changes to public API.
This change is a bit annoying, but justification is that it's required for #2341, which I believe to be very worthwhile.
The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code.
All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.
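As a rough sketch of the shape of the split (not the actual contents of `lib.rs`): the public `Parser` only stores its inputs, and the internal `ParserImpl` is built and consumed inside `parse`. The `preserve_parens` builder method, the `usize` return value and the default for `allow_return_outside_function` are assumptions made for the example. Only the `preserve_parens: true` default comes from the description above.

```rust
/// Sketch of an options type with its defaults spelled out in one place.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ParserOptions {
    pub allow_return_outside_function: bool,
    pub preserve_parens: bool,
}

impl Default for ParserOptions {
    fn default() -> Self {
        Self {
            // Assumption: this default is not stated in the description; `false` is a guess.
            allow_return_outside_function: false,
            // Stated in the description: `preserve_parens` defaults to `true`.
            preserve_parens: true,
        }
    }
}

/// Public interface: cheap to construct, holds only the inputs.
pub struct Parser<'a> {
    source_text: &'a str,
    options: ParserOptions,
}

impl<'a> Parser<'a> {
    pub fn new(source_text: &'a str) -> Self {
        Self { source_text, options: ParserOptions::default() }
    }

    /// Hypothetical builder method for overriding one option.
    pub fn preserve_parens(mut self, preserve: bool) -> Self {
        self.options.preserve_parens = preserve;
        self
    }

    /// The internal parser is created here and consumed immediately.
    pub fn parse(self) -> usize {
        ParserImpl::new(self.source_text, self.options).parse()
    }
}

/// Internal implementation, never exposed to the end user.
struct ParserImpl<'a> {
    source_text: &'a str,
    options: ParserOptions,
}

impl<'a> ParserImpl<'a> {
    fn new(source_text: &'a str, options: ParserOptions) -> Self {
        Self { source_text, options }
    }

    /// Stand-in for real parsing: counts the `(` bytes that would be kept
    /// as parenthesized-expression nodes.
    fn parse(self) -> usize {
        if self.options.preserve_parens {
            self.source_text.bytes().filter(|&b| b == b'(').count()
        } else {
            0
        }
    }
}
```

With this shape, `Parser::new("(a + (b))").parse()` goes through a short-lived `ParserImpl` without the caller ever naming that type.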
This PR replaces the `Chars` iterator in the lexer with a new structure
`Source`.
## What it does
`Source` holds the source text, and allows:
* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used
later to rewind to that position, without having to clone the entire
`Source` struct.
`Source` has the same invariants as `Chars` - cursor must always be
positioned on a UTF-8 character boundary (i.e. not in the middle of a
multi-byte Unicode character).
However, unsafe APIs are provided to allow a caller to temporarily break
that invariant, as long as they satisfy it again before they pass
control back to safe code. This will be useful for processing batches of
bytes.
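To make the API shape concrete, here is a deliberately simplified, fully safe sketch. It is not the implementation in this PR: it tracks a byte offset instead of raw pointers, `set_position` checks the invariant at runtime, and the method names other than `set_position` are assumptions for the example.

```rust
/// Opaque position in the source. In this sketch it is a byte offset, not a pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourcePosition(usize);

pub struct Source<'a> {
    text: &'a str,
    /// Invariant: always on a UTF-8 character boundary.
    offset: usize,
}

impl<'a> Source<'a> {
    pub fn new(text: &'a str) -> Self {
        Self { text, offset: 0 }
    }

    /// Cheap checkpoint: copies one integer rather than the whole struct.
    pub fn position(&self) -> SourcePosition {
        SourcePosition(self.offset)
    }

    /// Rewind to an earlier position. The sketch validates the position here.
    pub fn set_position(&mut self, pos: SourcePosition) {
        assert!(self.text.is_char_boundary(pos.0), "position must be a char boundary");
        self.offset = pos.0;
    }

    /// Char-by-char iteration, as `Chars` did.
    pub fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.offset..].chars().next()?;
        self.offset += c.len_utf8();
        Some(c)
    }

    /// Look at the next byte without consuming it.
    pub fn peek_byte(&self) -> Option<u8> {
        self.text.as_bytes().get(self.offset).copied()
    }

    /// Byte-wise iteration that keeps the invariant: only ASCII bytes are consumed.
    pub fn next_ascii_byte(&mut self) -> Option<u8> {
        let byte = self.peek_byte()?;
        if byte.is_ascii() {
            self.offset += 1;
            Some(byte)
        } else {
            None
        }
    }

    pub fn remaining(&self) -> &'a str {
        &self.text[self.offset..]
    }
}
```

A caller can take a `position()`, read ahead with either `next_char` or `next_ascii_byte`, and rewind with `set_position` if the lookahead did not pan out.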
## Why
I envisage most of the Lexer migrating to byte-by-byte iteration, and I
believe it'll make a significant impact on performance.
It will allow efficiently processing batches of bytes (e.g. consuming
identifiers or whitespace) without the overhead of calculating code
points for every character. It should also make all the many `peek()`,
`next_char()` and `next_eq()` calls faster.
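A small sketch of the kind of batch consumption meant here, assuming an ASCII-only fast path (real identifiers can contain Unicode, which would need a slower fallback). The function names are hypothetical.

```rust
/// Length in bytes of the run of ASCII identifier characters at the start of `text`.
/// No code points are decoded: each byte is tested on its own.
pub fn ascii_identifier_run_len(text: &str) -> usize {
    text.bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_' || *b == b'$')
        .count()
}

/// Split `text` into the leading identifier run and the rest.
/// Every consumed byte is ASCII, so the split point is always a UTF-8 char boundary.
pub fn split_identifier_run(text: &str) -> (&str, &str) {
    text.split_at(ascii_identifier_run_len(text))
}
```

For example, `split_identifier_run("foo_bar$1 = 2")` stops at the first space, and on `"naïve"` it stops in front of the non-ASCII `ï`, where a Unicode-aware path would have to take over.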
`Source` is also more performant than `Chars` in itself. This wasn't my
intent, but seems to be a pleasant side-effect of it being less opaque
to the compiler than `Chars`, so it can apply more optimizations.
In addition, because checkpoints don't need to store the entire `Source`
struct, but only a `SourcePosition` (8 bytes), I was able to reduce the
size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both
`Copy`.
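The size argument can be checked with a toy comparison. The two checkpoint structs below are hypothetical and much smaller than the real `LexerCheckpoint`. They only contrast storing a whole `Chars` iterator with storing a pointer-sized position.

```rust
use std::mem::size_of;
use std::str::Chars;

/// Pointer-sized position, as described above.
#[derive(Clone, Copy)]
pub struct SourcePosition {
    pub ptr: *const u8,
}

/// Hypothetical "before" checkpoint: stores the whole iterator. `Chars` is not `Copy`.
#[derive(Clone)]
pub struct OldCheckpoint<'a> {
    pub chars: Chars<'a>,
    pub token_start: u32,
}

/// Hypothetical "after" checkpoint: stores only a position, so it can be `Copy`.
#[derive(Clone, Copy)]
pub struct NewCheckpoint {
    pub position: SourcePosition,
    pub token_start: u32,
}

/// Compile-time check that a type is `Copy`.
pub fn assert_copy<T: Copy>() {}

/// Returns `(old_size, new_size)` in bytes for the current target.
pub fn checkpoint_sizes() -> (usize, usize) {
    assert_copy::<NewCheckpoint>();
    (size_of::<OldCheckpoint<'static>>(), size_of::<NewCheckpoint>())
}
```

The exact numbers depend on the target, so the sketch only relies on the new checkpoint being smaller than the old one.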
## Notes on implementation
`Source` is heavily based on Rust's `std::str::Chars` and
`std::slice::Iter` iterators and I've copied the code/concepts from them
as much as possible.
As it's a low-level primitive, it uses raw pointers and contains a *lot*
of unsafe code. I *think* I've crossed the T's and dotted the I's, and
I've commented the code extensively, but I'd appreciate a close review
if anyone has time.
I've split it into 2 commits.
* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer
needed, and replaces `lexer.current.token` with `lexer.token`
everywhere.
Hopefully looking just at the 1st commit will reduce the noise and make
it easier to review.
### `SourcePosition`
There is one annoyance with the API which I haven't been able to solve:
`SourcePosition` is a wrapper around a pointer, which can only be
created from the current position of `Source`. Due to the invariant
mentioned above, `SourcePosition` is therefore always in bounds of the
source text, and points to a UTF-8 character boundary. So `Source` can
be rewound to a `SourcePosition` cheaply, without any checks. I had
originally envisaged `Source::set_position` being a safe function, as
`SourcePosition` enforces the necessary invariants itself.
The fly in the ointment is that a `SourcePosition` could theoretically
have been created from *another* `Source`. If that was the case, it
would be out of bounds, and it would be instant UB. Consequently,
`Source::set_position` has to be an unsafe function.
This feels rather ridiculous. *Of course* the parser won't create 2
Lexers at the same time. But still it's *possible*, so I think it's better to
take the strict approach and make it unsafe until we can find a way to
statically prove the safety by some other means. Any ideas?
## Oddity in the benchmarks
There's something really odd going on with the semantic benchmark for
`pdf.mjs`.
While I was developing this, small and seemingly irrelevant changes
would flip that benchmark from +0.5% or so to -4%, and then another
small change would flip it back.
What I don't understand is that parsing happens outside of the
measurement loop in the semantic benchmark, so the parser shouldn't have
*any* effect either way on semantic's benchmarks.
If CodSpeed's flame graph is to be believed, most of the negative effect
appears to be a large Vec reallocation happening somewhere in semantic.
I've ruled out a few things: The AST produced by the parser for
`pdf.mjs` after this PR is identical to what it was before. And
semantic's `nodes` and `scopes` Vecs are same length as they were
before. Nothing seems to have changed!
I really am at a loss to explain it. Have you seen anything like this
before?
One possibility is a fault in my unsafe code which is manifesting only
with `pdf.mjs`, and it's triggering UB, which I guess could explain the
weird effects. I'm running the parser on `pdf.mjs` in Miri now and will
see if it finds anything (Miri doesn't find any problem running the
tests). It's been running for over an hour now. Hopefully it'll be done
by morning!
I feel like this shouldn't be merged until that question is resolved, so
I'm marking this as draft in the meantime.
Following #2297, this adds another benchmark.
This one is from radix-ui website. I've chosen this particular file
because it differs from the other benchmark sources in 3 ways:
1. JSX not TSX (despite the file extension).
2. Contains no logic, only JSX component hierarchy, and content text.
3. Very small (60 LOC).
The last is particularly important, I think. Often developers will be
working on small files (single component per file convention). And some
possible directions for the parser (SIMD etc) involve optimizing chewing
through chunks of text, with a de-opt at the end to process the final
batch of bytes. If that imposes a penalty on short files, this benchmark
will surface it.
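To show what that de-opt looks like, here is a scalar sketch of batch processing with a tail. It assumes a batch width of 8 bytes and counts ASCII whitespace as a stand-in for real lexing work. A SIMD version would test a whole batch at once instead of looping over it.

```rust
/// Assumed batch width for the sketch; real SIMD code would use 16, 32 or more.
const BATCH: usize = 8;

/// Count ASCII whitespace bytes, processing full batches first and the tail last.
pub fn count_ascii_whitespace(bytes: &[u8]) -> usize {
    let mut chunks = bytes.chunks_exact(BATCH);
    let mut count = 0;

    // Fast path: every chunk here has exactly `BATCH` bytes.
    for chunk in &mut chunks {
        count += chunk.iter().filter(|b| b.is_ascii_whitespace()).count();
    }

    // Slow path (the "de-opt"): fewer than `BATCH` bytes remain.
    count += chunks.remainder().iter().filter(|b| b.is_ascii_whitespace()).count();
    count
}
```

For an input shorter than `BATCH`, the fast path never runs and all the work lands in the tail, which is the case a very small benchmark file exercises proportionally more often.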
---------
Co-authored-by: Boshen <boshenc@gmail.com>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| [codecov/codecov-action](https://togithub.com/codecov/codecov-action)
| action | major | `v3` -> `v4` |
---
### Release Notes
<details>
<summary>codecov/codecov-action (codecov/codecov-action)</summary>
### [`v4`](https://togithub.com/codecov/codecov-action/compare/v3...v4)
[Compare
Source](https://togithub.com/codecov/codecov-action/compare/v3...v4)
</details>
---
### Configuration
📅 **Schedule**: Branch creation - "before 8am on monday" in timezone
Asia/Shanghai, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/oxc-project/oxc).
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This PR solves the problem of lexer byte handlers all being called
`core::ops::function::FnOnce::call_once` in the flame graphs on
CodSpeed, by defining them as named functions instead of closures.
Pure refactor, no substantive changes.
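A minimal sketch of the pattern, with hypothetical handler names and a much simpler signature than the real lexer's. Each table entry is a named `fn`, so a profiler can attribute time to `handle_digit` and friends rather than to an anonymous closure.

```rust
/// Simplified handler signature for the sketch: takes the byte, returns a label.
type ByteHandler = fn(u8) -> &'static str;

// Named functions show up under their own names in profiles.
fn handle_whitespace(_byte: u8) -> &'static str {
    "whitespace"
}

fn handle_digit(_byte: u8) -> &'static str {
    "number"
}

fn handle_other(_byte: u8) -> &'static str {
    "other"
}

/// Build a 256-entry dispatch table indexed by byte value.
pub fn build_table() -> [ByteHandler; 256] {
    let mut table: [ByteHandler; 256] = [handle_other as ByteHandler; 256];
    for &byte in b" \t\n\r" {
        table[byte as usize] = handle_whitespace;
    }
    for byte in b'0'..=b'9' {
        table[byte as usize] = handle_digit;
    }
    table
}

/// Dispatch one byte through the table.
pub fn classify(table: &[ByteHandler; 256], byte: u8) -> &'static str {
    table[byte as usize](byte)
}
```

Behaviour is the same as with closures coerced to function pointers. Only the symbol names seen by tooling change.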
closes #1803
This string is currently unsafe, but I want to get Miri working before
introducing more changes.
I want to progress from a memory leak, to unsafe code, and then to safe code.
It's harder to do all of these steps in one go.