Split parser into public interface `Parser` and internal implementation `ParserImpl`.
This involves no changes to public API.
This change is a bit annoying, but justification is that it's required for #2341, which I believe to be very worthwhile.
The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code.
All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.
This change makes little difference in itself, but moving the check into
the lexer will allow some optimizations in lexer using unsafe code which
depend on this invariant.
Maximum length of source parser can accept is limited on 32-bit systems
to `isize::MAX` (i.e. `i32::MAX` not `u32::MAX`) because Rust [limits
the size of
allocations](https://doc.rust-lang.org/std/alloc/struct.Layout.html#method.from_size_align)
to `isize::MAX`.
This PR takes that constraint into account when calculating
`Parser::MAX_LEN`.
It also speeds up the `overlong_source` test so it runs in under 500ms
(previously it took ~4 secs on a M1 Macbook Pro).
This PR adds benchmarks for the lexer. I'm doing some work on optimizing
the lexer and I thought it'd be useful to see the effects of changes in
isolation, separate from the parser.
These benchmarks may not be ideal to keep long-term, but for now it'd be
useful.
In order to do so, it's necessary for `oxc_parser` crate to expose the
lexer, but have done that without adding it to the docs, and using an
alias `__lexer`.
Part of #1880
`Token` size is reduced from 32 to 16 bytes by changing the previous
token value `Option<&'a str>` to a u32 index handle.
It would be nice if this handle is eliminated entirely because
the normal case for a string is always
`&source_text[token.span.start.token.span.end]`
Unfortunately, JavaScript allows escaped characters to appear in
identifiers, strings and templates. These strings need to be unescaped
for equality checks, i.e. `"\a" === "a"`.
This leads us to adding a `escaped_strings[]` vec for storing these
unescaped and allocated
strings.
Performance regression for adding this vec should be minimal because
escaped strings are rare.
Background Reading:
* https://floooh.github.io/2018/06/17/handles-vs-pointers.html
Parser incorrectly identifies string literals as directives if they
follow after `import`s, `export`s, or decorators.
In all of these cases, `'use strict'` produces a directive in the AST,
where it should be parsed as an `ExpressionStatement` containing a
`StringLiteral`:
```js
import x from 'foo';
'use strict';
```
```js
export {x};
'use strict';
```
```js
@foo
'use strict';
```
[Playground](https://oxc-project.github.io/oxc/playground/?code=3YCAAIC0gICAgICAgIC0G8rnONK89ITJ3zrK%2FUP7OmSZPgHQzStr3yMtwFTU%2BD1WPt09JgqZJLoYooydbGsM5vGcf34BnIA%3D)
This PR should fix that.
I'm not sure about the decorator case, though. I assume it's not a
directive. But is prefixing a string literal with a decorator even legal
syntax anyway?
And a side nit: If I'm reading it right, I don't think the `continue`
statement in the decorator arm of the match does anything. Do I have
that right?
Last question: Where does one go about putting a test? I guess these
silly cases aren't covered by Babel etc's tests.
---------
Co-authored-by: Boshen <boshenc@gmail.com>
`Token` and `Span` both represent `start` and `end` as `u32`.
This limits size of source which can be parsed to `u32::MAX`.
19577709db/crates/oxc_span/src/span.rs (L14-L20)
However, this constraint is currently not enforced.
In a release build, code will not panic on arithmetic overflow, so
`start`/`end` could wrap around back to zero if source is 4 GiB or more.
That'd produce nonsense spans. But worse, the lexer relies in some
places on `self.current.token.start` being correct, so if the value
wrapped around, possibly it'd keep rewinding to the start of the source
and lexing it again, causing an infinite loop.
In worst case, if for some reason an application's public API used OXC's
parser with user-supplied source code (parser-as-a-service!), this could
be exploited for denial of service.
This PR adds an assertion to catch this at the start of parsing instead.
This does add an extra instruction, but I imagine the effect will be
negligible compared to the work required to parse the code.
* fix(oxc_parser): fix panic when EOF on a multi-byte character
relates #232
* fix(parser): fix panic on multi-byte char in private identifer
relates #232