As discussed on #2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).
This PR achieves the same objective of #2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.
Before #2046:
```rs
const SPS: ByteHandler = |lexer| {
lexer.consume_char();
Kind::WhiteSpace
};
```
After this PR:
```rs
ascii_byte_handler!(SPS(lexer) {
lexer.consume_char();
Kind::WhiteSpace
});
```
i.e. The body of the handlers are unchanged from how they were before
https://github.com/oxc-project/oxc/pull/2046.
This expands to:
```rs
const SPS: ByteHandler = |lexer| {
unsafe {
let s = lexer.current.chars.as_str();
assert_unchecked!(!s.is_empty());
assert_unchecked!(s.as_bytes()[0] < 128);
}
lexer.consume_char();
Kind::WhiteSpace
};
```
But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by #2046 is unnecessary,
and can be removed again.
The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).
I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:
1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.
What do you think? Does this strike a reasonable balance between
readability and safety?
For a various reasons:
This features bloats the code size.
We have many tools for profiling in Rust (as compared to ESLint where the feature came from),
so a built-in feature is not really needed anymore.
ESLint needed `--timings` because it needs to monitor plugins.
We control all our code so we don't need this.
Adds a new linter rule generation task, `just new-react-perf-rule`, for
incorporating rules from
[eslint-plugin-react-perf](https://github.com/cvazac/eslint-plugin-react-perf)
into oxc.
Since this library has its own testing utilities and only 4 rules, I
didn't bother writing code to port over test cases. If we deem this
requisite I'll add this to the rulegen task.
In the lexer, most `BYTE_HANDLER`s immediately consume the current char
with `lexer.consume_char()`.
Byte handlers are only called if there's a certain value (or range of
values) for the next char. This is their entire purpose. So in all cases
we know for sure that we're not at EOF, and that the next char is a
single-byte ASCII character.
The compiler, however, doesn't seem to be able to "see through" the
`BYTE_HANDLERS[byte](self)` call and understand these invariants. So it
produces very verbose ASM for `lexer.consume_char()`.
This PR replaces `lexer.consume_char()` in the byte handlers with an
unsafe `lexer.consume_ascii_char()` which skips on to next char with a
single `inc` instruction.
The difference in codegen can be seen here:
https://godbolt.org/z/1ha3cr9W5 (compare the 2 x
`core::ops::function::FnOnce::call_once` handlers).
Downside is that this does introduce a lot of unsafe blocks, but in my
opinion they're all pretty trivial to validate.
---------
Co-authored-by: Boshen <boshenc@gmail.com>
this should probably be squashed, i'm not familiar with code commit
norms here. we can expand a bunch of the commutative compressions later.
---------
Co-authored-by: Boshen <boshenc@gmail.com>
2 related changes to lexer's `read_next_token()`:
1. Hint to branch predictor that unicode identifiers and non-standard
whitespace are rare by marking that branch `#[cold]`.
2. The branch is on whether next character is ASCII or not. This check
only requires reading 1 byte, as ASCII characters are always single byte
in UTF8. So only do the work of getting a `char` in the cold path, once
it's established that character is not ASCII and this work is required.
This PR removes some code in parsing regexp flags which is extraneous:
```rs
if !ch.is_ascii_lowercase() {
self.error(diagnostics::RegExpFlag(ch, self.current_offset()));
continue;
}
```
Which is followed by:
```rs
let flag = if let Ok(flag) = RegExpFlags::try_from(ch) {
flag
} else {
self.error(diagnostics::RegExpFlag(ch, self.current_offset()));
continue;
};
```
`!ch.is_ascii_lowercase()` is equivalent to `ch < 'a' || ch > 'z'`. The
compiler implements `RegExpFlags::try_from(ch)` as `ch < 'd' || ch >
'y'` and then a jump table. So `ch.is_ascii_lowercase()` does nothing
that `RegExpFlags::try_from(ch)` doesn't do already.
https://godbolt.org/z/51GPPY9nx
(this PR built on top of #2007 for ease)