Commit graph

47 commits

Author SHA1 Message Date
Boshen
28eeee0f71 fix(parser): fix asi error diagnostic pointing at invalid text causing crash (#4163) 2024-07-10 14:45:10 +00:00
rzvxa
b936162093 refactor(ast/ast_builder)!: shorter allocator utility method names. (#4122)
This PR serves two purposes. First, it lowers the number of characters we have to type for a simple operation such as wrapping an expression in a vector. Second, it follows the generated names more closely: nowhere else in the builder do we have `new_xxx`; we always say `xxx`, since a builder always constructs something.

```
new_vec -> vec
new_vec_single -> vec1*
new_vec_from_iter -> vec_from_iter
new_vec_with_capacity -> vec_with_capacity
new_str -> str
new_atom -> atom
```

`*` This one is the main motivation behind this PR: it saves 10 characters!
2024-07-09 12:16:38 +00:00
Boshen
243c9f35b0 refactor(parser): use function instead of trait to parse list with rest element (#4028)
closes #3887
2024-07-02 13:43:14 +00:00
Boshen
1dacb1fc5b
refactor(parser): use function instead of trait to parse delimited lists (#4014)
relates #3887

The rest of the list parsing trait implementations involve ... parsing
`rest`, which I'll refactor in another PR.
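
As a rough illustration of the trait-to-function direction named in the commit title (not the actual `oxc_parser` code), a delimited list can be parsed by one generic function that takes the element parser as a closure, so each list kind no longer needs its own trait implementation. `Cursor`, `parse_delimited_list`, and the string tokens are hypothetical stand-ins.

```rust
/// Hypothetical token cursor; a stand-in for the real parser state.
struct Cursor<'a> {
    tokens: &'a [&'a str],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(tokens: &'a [&'a str]) -> Self {
        Self { tokens, pos: 0 }
    }

    /// Look at the next token without consuming it.
    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    /// Consume and return the next token.
    fn bump(&mut self) -> Option<&'a str> {
        let token = self.peek()?;
        self.pos += 1;
        Some(token)
    }

    /// Consume `expected` if it is the next token.
    fn eat(&mut self, expected: &str) -> bool {
        if self.peek() == Some(expected) {
            self.pos += 1;
            true
        } else {
            false
        }
    }
}

/// One generic function instead of a trait implemented once per list kind:
/// the caller supplies the delimiters and an element parser.
fn parse_delimited_list<'a, T>(
    cursor: &mut Cursor<'a>,
    open: &str,
    separator: &str,
    close: &str,
    mut parse_element: impl FnMut(&mut Cursor<'a>) -> Option<T>,
) -> Option<Vec<T>> {
    if !cursor.eat(open) {
        return None;
    }
    let mut items = Vec::new();
    // Empty list: `open` immediately followed by `close`.
    if cursor.eat(close) {
        return Some(items);
    }
    loop {
        items.push(parse_element(cursor)?);
        if cursor.eat(close) {
            return Some(items);
        }
        // Anything other than a separator here is a syntax error.
        if !cursor.eat(separator) {
            return None;
        }
    }
}
```

A call site then passes a closure such as `|c| c.bump()?.parse::<i32>().ok()` rather than implementing a trait for a dedicated list type.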
2024-07-02 14:47:56 +08:00
Boshen
d0eac46fc8 refactor(parser): use function instead of trait to parse normal lists (#4003)
To reduce boilerplate and code noise.

relates #3887
2024-07-01 15:57:36 +00:00
Boshen
a471e62e2d refactor(parser): clean up try_parse (#3925) 2024-06-26 11:18:02 +00:00
Boshen
3db2553dc2 refactor(parser): improve parsing of TypeScript type arguments (#3923) 2024-06-26 07:16:18 +00:00
Boshen
4bf405ddfc perf(parser): add a few more inline hints to cursor functions (#3894) 2024-06-25 06:00:46 +00:00
Boshen
1e802c71d5
refactor(parser): clean up ParserState (#3345) 2024-05-19 01:30:16 +08:00
Boshen
b27a905958
refactor(parser): simplify Context passing (#3266) 2024-05-14 12:22:27 +08:00
Boshen
2064ae9e0a refactor(parser,diagnostic): one diagnostic struct to eliminate monomorphization of generic types (#3214)
part of #3213

We should only have one diagnostic struct instead of 353 copies of them, so we don't end up choking LLVM with 50k lines of the same code due to monomorphization.

If the proposed approach is good, then I'll start writing a codemod to turn all the existing structs to plain functions.

---

Background:

Using `--timings`, we see `oxc_linter` is slow on codegen (the purple part).

![image](https://github.com/zkat/miette/assets/1430279/c1df4f7d-90ef-4c0f-9956-2ec3194db7ca)

The crate currently contains 353 miette errors. [cargo-llvm-lines](https://github.com/dtolnay/cargo-llvm-lines) displays

```
cargo llvm-lines -p oxc_linter --lib --release

  Lines                 Copies               Function name
  -----                 ------               -------------
  830350                33438                (TOTAL)
   29252 (3.5%,  3.5%)    808 (2.4%,  2.4%)  <alloc::boxed::Box<T,A> as core::ops::drop::Drop>::drop
   23298 (2.8%,  6.3%)    353 (1.1%,  3.5%)  miette::eyreish::error::object_downcast
   19062 (2.3%,  8.6%)    706 (2.1%,  5.6%)  core::error::Error::type_id
   12610 (1.5%, 10.1%)     65 (0.2%,  5.8%)  alloc::raw_vec::RawVec<T,A>::grow_amortized
   12002 (1.4%, 11.6%)    706 (2.1%,  7.9%)  miette::eyreish::ptr::Own<T>::boxed
    9215 (1.1%, 12.7%)    115 (0.3%,  8.2%)  core::iter::traits::iterator::Iterator::try_fold
    9150 (1.1%, 13.8%)      1 (0.0%,  8.2%)  oxc_linter::rules::RuleEnum::read_json
    8825 (1.1%, 14.9%)    353 (1.1%,  9.3%)  <miette::eyreish::error::ErrorImpl<E> as core::error::Error>::source
    8822 (1.1%, 15.9%)    353 (1.1%, 10.3%)  miette::eyreish::error::<impl miette::eyreish::Report>::construct
    8119 (1.0%, 16.9%)    353 (1.1%, 11.4%)  miette::eyreish::error::object_ref
    8119 (1.0%, 17.9%)    353 (1.1%, 12.5%)  miette::eyreish::error::object_ref_stderr
    7413 (0.9%, 18.8%)    353 (1.1%, 13.5%)  <miette::eyreish::error::ErrorImpl<E> as core::fmt::Display>::fmt
    7413 (0.9%, 19.7%)    353 (1.1%, 14.6%)  miette::eyreish::ptr::Own<T>::new
    6669 (0.8%, 20.5%)     39 (0.1%, 14.7%)  alloc::raw_vec::RawVec<T,A>::try_allocate_in
    6173 (0.7%, 21.2%)    353 (1.1%, 15.7%)  miette::eyreish::error::<impl miette::eyreish::Report>::from_std
    6027 (0.7%, 21.9%)     70 (0.2%, 16.0%)  <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
    6001 (0.7%, 22.7%)    353 (1.1%, 17.0%)  miette::eyreish::error::object_drop
    6001 (0.7%, 23.4%)    353 (1.1%, 18.1%)  miette::eyreish::error::object_drop_front
    5648 (0.7%, 24.1%)    353 (1.1%, 19.1%)  <miette::eyreish::error::ErrorImpl<E> as core::fmt::Debug>::fmt
```

This totals more than 50k LLVM lines and puts pressure on rustc codegen (the purple part on `oxc_linter` in the image above).

---

Looking at https://github.com/zkat/miette/blob/main/src/eyreish/error.rs, it's pretty obvious that the generics can expand out to lots of code.
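
A minimal sketch of the proposed direction, under assumed names: `Diagnostic`, `report`, `report_generic`, `unexpected_token`, and `unterminated_string` are all invented for illustration and are not the real `oxc` or `miette` API. A function generic over the error type is compiled once per concrete type, whereas a function over a single struct is compiled once in total, with each former error struct reduced to a plain constructor function.

```rust
use std::fmt;

/// The single concrete diagnostic type (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    message: String,
    /// Byte offsets into the source text.
    span: (u32, u32),
}

impl fmt::Display for Diagnostic {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} at {}..{}", self.message, self.span.0, self.span.1)
    }
}

impl std::error::Error for Diagnostic {}

/// Generic over `E`: the compiler emits one copy per concrete error type,
/// so hundreds of error structs mean hundreds of copies.
fn report_generic<E: std::error::Error>(error: &E) -> String {
    format!("error: {}", error)
}

/// Takes the one concrete type: only a single copy is generated.
fn report(diagnostic: &Diagnostic) -> String {
    format!("error: {}", diagnostic)
}

// Each former error struct becomes a plain function that builds a `Diagnostic`.
fn unexpected_token(span: (u32, u32)) -> Diagnostic {
    Diagnostic { message: "Unexpected token".to_string(), span }
}

fn unterminated_string(span: (u32, u32)) -> Diagnostic {
    Diagnostic { message: "Unterminated string".to_string(), span }
}
```

With this shape, adding a new diagnostic adds one small function rather than a new type that every generic helper must be instantiated for.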
2024-05-11 04:56:22 +00:00
Dunqing
0ba7778e5e
fix(parser): correctly parse cls.fn<C> = x (#3208)
close: #3206
2024-05-09 10:23:45 +08:00
Boshen
504698ab4a
chore: guard against unsafe code as much as possible. 2024-04-03 19:35:07 +08:00
Boshen
cda9c93436
fix(parser): improve lexing of jsx identifier to fix duplicated comments after jsx name (#2687) 2024-03-12 15:51:51 +08:00
Boshen
8a73d18fcf
chore(parser): make sure all span.end >= span.start (#2681)
closes #2679
2024-03-11 19:49:51 +08:00
Boshen
bf42158ad7
perf(parser): inline end_span and parse_identifier_kind which are on the hot path (#2612) 2024-03-05 15:39:53 +08:00
overlookmotel
f3470163d9
refactor(parser): make Source::set_position safe (#2341)
Make `Source::set_position` a safe function.

This addresses a shortcoming of #2288.

Instead of requiring the caller of `Source::set_position` to guarantee that the `SourcePosition` is created from this `Source`, the preceding PRs enforce this guarantee at the type level.

`Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway.

So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
2024-02-08 14:56:26 +08:00
overlookmotel
0bdecb5043
refactor(parser): wrapper type for parser (#2339)
Split parser into public interface `Parser` and internal implementation `ParserImpl`.

This involves no changes to public API.

This change is a bit annoying, but the justification is that it's required for #2341, which I believe to be very worthwhile.

The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code.

All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.
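
A simplified sketch of the split described above, not the code from `oxc_parser/src/lib.rs`: the field layout, the builder-style setter, and the placeholder `parse` body are assumptions. Only the type names and the `preserve_parens` default of `true` come from the message; the default shown for `allow_return_outside_function` is a guess.

```rust
/// Options gathered into one struct so the defaults are visible in one place.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ParserOptions {
    allow_return_outside_function: bool,
    preserve_parens: bool,
}

impl Default for ParserOptions {
    fn default() -> Self {
        Self {
            // Assumption: the message does not state this default.
            allow_return_outside_function: false,
            // Stated in the message: `preserve_parens` defaults to `true`.
            preserve_parens: true,
        }
    }
}

/// Public interface: holds the inputs and exposes builder-style setters.
struct Parser<'a> {
    source_text: &'a str,
    options: ParserOptions,
}

impl<'a> Parser<'a> {
    fn new(source_text: &'a str) -> Self {
        Self { source_text, options: ParserOptions::default() }
    }

    fn preserve_parens(mut self, preserve: bool) -> Self {
        self.options.preserve_parens = preserve;
        self
    }

    /// The internal implementation is constructed only at parse time.
    fn parse(self) -> usize {
        ParserImpl::new(self.source_text, self.options).parse()
    }
}

/// Internal implementation: never exposed to callers.
struct ParserImpl<'a> {
    source_text: &'a str,
    options: ParserOptions,
}

impl<'a> ParserImpl<'a> {
    fn new(source_text: &'a str, options: ParserOptions) -> Self {
        Self { source_text, options }
    }

    /// Placeholder for real parsing: reports how many `(` would be kept.
    fn parse(self) -> usize {
        if self.options.preserve_parens {
            self.source_text.matches('(').count()
        } else {
            0
        }
    }
}
```

Callers only ever see `Parser`, so the internals can change shape without touching the public API.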
2024-02-07 23:22:08 +08:00
overlookmotel
cdef41d552
refactor(parser): lexer replace Chars with Source (#2288)
This PR replaces the `Chars` iterator in the lexer with a new structure
`Source`.

## What it does

`Source` holds the source text, and allows:

* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used
later to rewind to that position, without having to clone the entire
`Source` struct.

`Source` has the same invariants as `Chars` - cursor must always be
positioned on a UTF-8 character boundary (i.e. not in the middle of a
multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break
that invariant, as long as they satisfy it again before they pass
control back to safe code. This will be useful for processing batches of
bytes.
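
The behaviour listed above can be sketched in safe Rust, with one loud caveat: the real `Source` uses raw pointers and unsafe code, while this stand-in stores a byte offset and checks it. The method names `next_char`, `peek_byte`, `next_ascii_byte`, `position`, and `remaining` are assumptions for illustration; only `Source`, `SourcePosition`, and `set_position` appear in this message. The sketch also omits the unsafe escape hatches for temporarily breaking the boundary invariant.

```rust
/// Opaque position handle. The real `SourcePosition` wraps a pointer; this
/// sketch stores a byte offset so that it needs no unsafe code.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SourcePosition(usize);

/// Simplified stand-in for `Source`: the text plus a byte cursor that
/// always rests on a UTF-8 character boundary.
struct Source<'a> {
    text: &'a str,
    offset: usize,
}

impl<'a> Source<'a> {
    fn new(text: &'a str) -> Self {
        Self { text, offset: 0 }
    }

    /// Char-by-char iteration, as `Chars` provided.
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.offset..].chars().next()?;
        self.offset += c.len_utf8();
        Some(c)
    }

    /// Look at the next byte without consuming it.
    fn peek_byte(&self) -> Option<u8> {
        self.text.as_bytes().get(self.offset).copied()
    }

    /// Byte-by-byte iteration, restricted here to ASCII so the cursor
    /// cannot land inside a multi-byte character.
    fn next_ascii_byte(&mut self) -> Option<u8> {
        let byte = self.peek_byte()?;
        if byte.is_ascii() {
            self.offset += 1;
            Some(byte)
        } else {
            None
        }
    }

    /// Cheap handle for the current position; no clone of `Source` needed.
    fn position(&self) -> SourcePosition {
        SourcePosition(self.offset)
    }

    /// Rewind to a saved position. The check below is what a pointer-based
    /// version would skip, relying on its invariants instead.
    fn set_position(&mut self, position: SourcePosition) {
        assert!(
            self.text.is_char_boundary(position.0),
            "position must be in bounds and on a UTF-8 character boundary"
        );
        self.offset = position.0;
    }

    /// Text that has not been consumed yet.
    fn remaining(&self) -> &'a str {
        &self.text[self.offset..]
    }
}
```

Saving `source.position()` before a speculative parse and passing it back to `set_position` afterwards rewinds the cursor without copying the whole struct.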

## Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I
believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming
identifiers or whitespace) without the overhead of calculating code
points for every character. It should also make all the many `peek()`,
`next_char()` and `next_eq()` calls faster.

`Source` is also more performant than `Chars` in itself. This wasn't my
intent, but seems to be a pleasant side-effect of it being less opaque
to the compiler than `Chars`, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire `Source`
struct, but only a `SourcePosition` (8 bytes), I was able to reduce the
size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both
`Copy`.

## Notes on implementation

`Source` is heavily based on Rust's `std::str::Chars` and
`std::slice::Iter` iterators and I've copied the code/concepts from them
as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a *lot*
of unsafe code. I *think* I've crossed the T's and dotted the I's, and
I've commented the code extensively, but I'd appreciate a close review
if anyone has time.

I've split it into 2 commits.

* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer
needed, and replaces `lexer.current.token` with `lexer.token`
everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make
it easier to review.

### `SourcePosition`

There is one annoyance with the API which I haven't been able to solve:

`SourcePosition` is a wrapper around a pointer, which can only be
created from the current position of `Source`. Due to the invariant
mentioned above, `SourcePosition` is therefore always in bounds of the
source text, and points to a UTF-8 character boundary. So `Source` can
be rewound to a `SourcePosition` cheaply, without any checks. I had
originally envisaged `Source::set_position` being a safe function, as
`SourcePosition` enforces the necessary invariants itself.

The fly in the ointment is that a `SourcePosition` could theoretically
have been created from *another* `Source`. If that was the case, it
would be out of bounds, and it would be instant UB. Consequently,
`Source::set_position` has to be an unsafe function.

This feels rather ridiculous. *Of course* the parser won't create 2
Lexers at the same time. But still it's *possible*, so I think it's better
to take the strict approach and make it unsafe until I can find a way to
statically prove the safety by some other means. Any ideas?

## Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for
`pdf.mjs`.

While I was developing this, small and seemingly irrelevant changes
would flip that benchmark from +0.5% or so to -4%, and then another
small change would flip it back.

What I don't understand is that parsing happens outside of the
measurement loop in the semantic benchmark, so the parser shouldn't have
*any* effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect
appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for
`pdf.mjs` after this PR is identical to what it was before. And
semantic's `nodes` and `scopes` Vecs are the same length as they were
before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this
before?

One possibility is a fault in my unsafe code which is manifesting only
with `pdf.mjs`, and it's triggering UB, which I guess could explain the
weird effects. I'm running the parser on `pdf.mjs` in Miri now and will
see if it finds anything (Miri doesn't find any problem running the
tests). It's been running for over an hour now. Hopefully it'll be done
by morning!

I feel like this shouldn't be merged until that question is resolved, so
marking this as draft in the meantime.
2024-02-05 13:51:46 +00:00
Boshen
aa91fde1d9
refactor(parser): only allocate for escaped template strings (#2005) 2024-01-12 18:56:36 +08:00
overlookmotel
c7316856db
refactor(parser): reduce work parsing regexps (#1999)
#1926 produced a small performance regression because when parsing a
regexp, some work is repeated.
2024-01-12 11:36:30 +08:00
Boshen
4706765d2a
refactor(parser): reduce Token size from 32 to 16 bytes (#1962)
Part of #1880

`Token` size is reduced from 32 to 16 bytes by changing the previous
token value `Option<&'a str>` to a u32 index handle.

It would be nice if this handle were eliminated entirely, because
the normal case for a string is always
`&source_text[token.span.start..token.span.end]`

Unfortunately, JavaScript allows escaped characters to appear in
identifiers, strings and templates. These strings need to be unescaped
for equality checks, i.e. `"\a" === "a"`.

This leads us to adding an `escaped_strings[]` vec for storing these
unescaped and allocated strings.

Performance regression for adding this vec should be minimal because
escaped strings are rare.
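
A sketch of the handle idea under assumed names and layout: the field names, the `Lexer` shape, `save_escaped`, `token_str`, and the convention that handle `0` means "not escaped" are all illustrative, not the actual `oxc_parser` definitions. The sizes in the comments hold on 64-bit targets, where `Option<&str>` is a 16-byte fat pointer.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

/// Before: the token carries an optional string slice.
/// 16 (slice) + 8 (span) + 1 (kind) bytes, padded to 32 on 64-bit targets.
#[derive(Clone, Copy)]
struct OldToken<'a> {
    kind: u8,
    span: Span,
    value: Option<&'a str>,
}

/// After: the slice is replaced by a `u32` handle.
/// 8 (span) + 4 (handle) + 1 (kind) bytes, padded to 16.
/// Assumed convention: `0` means "not escaped"; any other value is
/// `index + 1` into `Lexer::escaped_strings`.
#[derive(Clone, Copy)]
struct Token {
    kind: u8,
    span: Span,
    escaped_handle: u32,
}

struct Lexer<'a> {
    source_text: &'a str,
    /// Only the rare escaped strings are allocated and stored here.
    escaped_strings: Vec<String>,
}

impl<'a> Lexer<'a> {
    /// Store an unescaped string and return its non-zero handle.
    fn save_escaped(&mut self, unescaped: String) -> u32 {
        self.escaped_strings.push(unescaped);
        self.escaped_strings.len() as u32
    }

    /// Resolve a token to its string value: a slice of the source in the
    /// common case, the stored unescaped string otherwise.
    fn token_str(&self, token: Token) -> &str {
        match token.escaped_handle {
            0 => &self.source_text[token.span.start as usize..token.span.end as usize],
            handle => &self.escaped_strings[handle as usize - 1],
        }
    }
}
```

For the source text `"\a" "a"`, the first string would get a handle pointing at a stored `a`, and the second would keep handle `0` and resolve straight from its span, so both compare equal.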

Background Reading:

* https://floooh.github.io/2018/06/17/handles-vs-pointers.html
2024-01-09 15:17:02 +08:00
Boshen
7eb2573178
refactor(parser): parse BigInt lazily (#1924)
This PR partially fixes #1803 and is part of #1880.

BigInt is removed from the `Token` value, so that the token size can be
reduced once we have removed all the variants.

`Token` is now also `Copy`, which removes all the `clone` and `drop`
calls.

This yields a 5% performance improvement for the parser.
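
A minimal sketch of lazy parsing, with assumed names throughout (`Kind`, the `Token` fields, and `parse_bigint` are hypothetical). The token keeps only plain data, which is what lets it be `Copy`, and the value is computed from the source text on demand. `u128` stands in for an arbitrary-precision integer type, and only decimal literals are handled.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Number,
    BigInt,
}

/// Only plain-old-data fields, so the token can be `Copy`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Token {
    kind: Kind,
    start: u32,
    end: u32,
}

/// Parse the literal's value on demand from the source text.
/// Returns `None` for non-BigInt tokens, bad spans, or malformed digits.
fn parse_bigint(source_text: &str, token: Token) -> Option<u128> {
    if token.kind != Kind::BigInt {
        return None;
    }
    let raw = source_text.get(token.start as usize..token.end as usize)?;
    // Drop the trailing `n` and any numeric separators before parsing.
    let digits = raw.strip_suffix('n')?.replace('_', "");
    digits.parse().ok()
}
```

The lexer then only records where the literal is; the cost of building the number is paid once, and only if something asks for the value.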
2024-01-08 12:37:20 +08:00
Boshen
4886d408eb
chore(clippy): enable undocumented_unsafe_blocks 2023-10-16 15:18:14 +08:00
Boshen
6428139b76
fix(parser): fix re_lex_jsx_identifier not omitting whitespaces
closes #518
2023-07-05 12:53:21 +08:00
Boshen
ad2835f11b
chore(rustfmt): run cargo fmt 2023-05-21 11:52:26 +08:00
Boshen
7f93e58f10
chore: remove all #[must_use] 2023-05-11 21:08:00 +08:00
Boshen
cd276c2850
feat: add oxc_span crate (#323) 2023-04-27 21:51:15 +08:00
Boshen
ca0e80691c
refactor(oxc_parser): remove unused re_lex_as_typescript_r_angle 2023-04-16 12:15:49 +08:00
Boshen
b11f774c41 refactor(oxc_parser): clean up doc 2023-04-01 19:03:33 +08:00
Boshen
d917348f9b refactor(ast,parser): move parsing context from ast to parser 2023-04-01 18:01:33 +08:00
Boshen
d4ff0bb40e refactor(oxc_parser): parser and lexer do not need to share the errors vec 2023-04-01 15:59:42 +08:00
Boshen
174330561c
fix(parser): fix panic on multi-byte characters (#233)
* fix(oxc_parser): fix panic when EOF on a multi-byte character

relates #232

* fix(parser): fix panic on multi-byte char in private identifier

relates #232
2023-04-01 13:34:18 +08:00
Boshen
2fe8fba5b6
refactor(lexer): make TokenValue 8 bytes smaller by changing RegExp.pattern to &'a str (#175) 2023-03-13 23:20:52 +08:00
Boshen
f36e3301fd
refactor(lexer): change TokenValue::String(Atom) to TokenValue::String(&str) (#174) 2023-03-13 09:33:44 +08:00
Boshen
605684f4c0
fix: fix clippy warnings 2023-03-12 21:53:08 +08:00
Boshen
66207e74a4
refactor(lexer): remove LexerContext::JsxChild (#172) 2023-03-12 20:19:51 +08:00
Boshen
4d32bfb55e
refactor: remove all declarations of const fn, which is useless for us 2023-03-07 21:29:47 +08:00
Ye Yangchen
0bf8f817f5 feat(oxc_parser): Port isStartOfDeclaration from tsc 2023-02-27 12:27:44 +08:00
Boshen
4f4a9802b7 refactor(diagnostics,parser): move diagnostics to parser 2023-02-22 19:23:01 +08:00
Boshen
5390d3e6b4 refactor(diagnostic): change Err type to miette::Error
This is the prerequisite for breaking up the large Diagnostic enum.
2023-02-22 11:08:21 +08:00
Boshen
4c6407b152 refactor(ast): s/node/span
This corrects the jargon for span. The term `node` came from `estree`,
which is a bit misleading here in Rust.

closes #9
2023-02-21 19:17:49 +08:00
Boshen
a733856536 refactor(ast,parser): use u32 for node spans
The next PR will fix the jargon where Node = Span.

relates to #9
2023-02-21 16:02:23 +08:00
Boshen
d57ab2f088 refactor(ast,parser): remove Node::ctx
This is adding too many bytes to the AST
2023-02-21 13:11:58 +08:00
Boshen
0bbbc7768f perf(oxc_parser): use u8 for offset 2023-02-21 13:11:58 +08:00
Boshen
85955d7147
refactor(parser): clean up some lexer code 2023-02-12 21:34:19 +08:00
Boshen
1fdc635638 feat(parser): add parser 2023-02-11 05:26:49 -08:00