This PR adds benchmarks for the lexer. I'm doing some work on optimizing
the lexer and I thought it'd be useful to see the effects of changes in
isolation, separate from the parser.
These benchmarks may not be ideal to keep long-term, but for now it'd be
useful.
In order to do so, it's necessary for `oxc_parser` crate to expose the
lexer, but have done that without adding it to the docs, and using an
alias `__lexer`.
Small optimization to the lexer.
Whitespace, line breaks, and comments are all skipped by
`read_next_token()`.
At present there's a different `Kind` for each, and `read_next_token()`
decides whether to skip with `matches!(kind, Kind::WhiteSpace |
Kind::NewLine | Kind::Comment | Kind::MultiLineComment)`.
These `Kind`s are used for no other purpose, so there seems little
reason to differentiate them.
This PR combines them all into `Kind::Skip`, so then the test of whether
to skip is reduced to `kind == Kind::Skip`.
Only produces ~0.3% performance bump on parser benchmarks. But, why
not?...
As discussed on #2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).
This PR achieves the same objective of #2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.
Before #2046:
```rs
const SPS: ByteHandler = |lexer| {
lexer.consume_char();
Kind::WhiteSpace
};
```
After this PR:
```rs
ascii_byte_handler!(SPS(lexer) {
lexer.consume_char();
Kind::WhiteSpace
});
```
i.e. The body of the handlers are unchanged from how they were before
https://github.com/oxc-project/oxc/pull/2046.
This expands to:
```rs
const SPS: ByteHandler = |lexer| {
unsafe {
let s = lexer.current.chars.as_str();
assert_unchecked!(!s.is_empty());
assert_unchecked!(s.as_bytes()[0] < 128);
}
lexer.consume_char();
Kind::WhiteSpace
};
```
But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by #2046 is unnecessary,
and can be removed again.
The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).
I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:
1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.
What do you think? Does this strike a reasonable balance between
readability and safety?