Commit graph

268 commits

Author SHA1 Message Date
Dunqing
70295a5552
feat(ast): update arrow_expression to arrow_function_expression (#2496) 2024-02-25 14:39:34 +00:00
Boshen
7a796c4b5f
feat(ast): add TSModuleDeclaration.kind (#2487)
closes #2395
2024-02-24 17:09:31 +08:00
Boshen
5212f7b51e
fix(parser): fix missing end span from TSTypeAliasDeclaration (#2485)
closes #2483
2024-02-24 16:51:00 +08:00
Boshen
1634586934 refactor(ast): s/TSTypeOperatorType/TSTypeOperator to align with estree 2024-02-21 22:25:04 +08:00
Boshen
9087f71765 refactor(ast): s/TSThisKeyword/TSThisType to align with estree 2024-02-21 22:25:04 +08:00
Boshen
d08abc638e refactor(ast): s/NumberLiteral/NumericLiteral to align with estree 2024-02-21 21:41:08 +08:00
Boshen
35608c8eb1
chore: fix all docs 2024-02-21 18:06:37 +08:00
Andrew McClenaghan
6b3b260dcc
feat(Codegen): Improve codegen (#2460)
This gets all the new TS types working to the same level TS output was
before and fixes a bunch of other codegen

---------

Co-authored-by: Boshen <boshenc@gmail.com>
2024-02-21 14:41:57 +08:00
Dunqing
197fa16613
feat(semantic): add check for duplicate class elements in checker (#2455)
1. Remove the check's implementation from the parser
2. Implement it in the semantic checker
3. Support TypeScript's check for duplicate class elements

Checking for duplicate class elements in the semantic checker makes it
easier to support TypeScript's checking rules.
2024-02-21 14:10:19 +08:00
overlookmotel
a78303d5a6
refactor(parser): continue_if in byte_search macro not unsafe (#2440)
#2439 made using `continue_if` in `byte_search!` macro safe, as it no longer continues the main loop after a match, so no danger of reading out of bounds if `continue_if` code fast-forwards the current position.

This follow-on PR removes the unsafe blocks, and uses that fast-forward ability in a couple of places.
2024-02-20 10:45:31 +08:00
overlookmotel
a5a3c695f7
refactor(parser): correct comment (#2441)
Just correcting a typo in a comment, and moving comment to a better
place.
2024-02-20 10:43:12 +08:00
overlookmotel
996a9d27eb
perf(parser): byte_search macro always unroll main loop (#2439)
Refactor `byte_search!` macro to move logic out of the main loop. This ensures the compiler unrolls the loop.

This speeds up lexing single-line comments by 20%-25% on the benchmarks which contain enough comments for the change to register. Presumably the loop wasn't unrolled previously.

The code required to do this is a little odd. It adds an extra `loop {}` which always exits on the first turn (so not really a useful loop), but is required to be able to use `break` to exit that "loop", making 2 different paths for (1) matching byte found and (2) `for` loop completed without finding any match.

This is the only way I could find to produce this behavior without using a macro. Is there a more "normal" way to get the same logic?
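
A minimal sketch of the "loop that is not really a loop" shape described above, assuming a plain function rather than the real `byte_search!` macro. `BATCH_SIZE` and `find_in_batch` are hypothetical names for illustration and are not taken from the PR.

```rust
/// Number of bytes checked per batch (hypothetical value, for illustration).
const BATCH_SIZE: usize = 8;

/// Search one fixed-size batch for `needle`, returning its index if present.
fn find_in_batch(batch: &[u8; BATCH_SIZE], needle: u8) -> Option<usize> {
    // Not a real loop: it always exits on its first turn. It exists only so
    // that `break` can carry a value out along two different exit paths.
    let found = 'search: loop {
        for i in 0..BATCH_SIZE {
            if batch[i] == needle {
                // Path 1: matching byte found.
                break 'search Some(i);
            }
        }
        // Path 2: the `for` loop completed without finding any match.
        break 'search None;
    };
    found
}
```

Since Rust 1.65 a labeled block (`'search: { ... break 'search value; }`) can express the same two exit paths without the dummy `loop`. Whether that changes how the compiler unrolls the inner loop is not something this sketch measures.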
2024-02-20 10:39:52 +08:00
Dunqing
60db720fa6
feat(parser): parse import attributes in TSImportType (#2436)
close: #2394 

64d2eeea7b/src/compiler/types.ts (L2177-L2185)

The corresponding test cases were skipped, so I manually added some
cases to misc

f5db48237f/tasks/coverage/src/typescript.rs (L118-L121)
2024-02-19 12:26:42 +08:00
Dunqing
3cbe786b18
refactor(ast): update TSImportType parameter to argument (#2429)
In TypeScript it's named `argument`, so we should keep it consistent.

64d2eeea7b/src/compiler/types.ts (L2180)
2024-02-19 10:29:24 +08:00
overlookmotel
90f9266d00
chore(deps): update bumpalo crate (#2417)
Latest version of `bumpalo` includes a couple of performance fixes for
`String` (e.g. https://github.com/fitzgen/bumpalo/pull/229) which may
help the parser a little.
2024-02-18 11:49:31 +08:00
overlookmotel
cc2ddbee77
refactor(parser): catch all illegal UTF-8 bytes (#2415)
Catch all illegal UTF-8 bytes with the `UER` byte handler.

From https://datatracker.ietf.org/doc/html/rfc3629:

> The octet values C0, C1, F5 to FF never appear.

This change *should* make no difference at all, as a valid `&str` may not contain any of these byte values anyway. But it's possible if the user has e.g. created the string with `str::from_utf8_unchecked` and not obeyed the safety constraints. This will at least contain the damage if that's happened, and panic rather than lead to UB. And since we're already catching other error conditions, may as well catch them all.
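
For illustration, the byte values quoted from RFC 3629 can be written as a single match. This is only a sketch: `is_never_valid_in_utf8` and `first_illegal_byte` are hypothetical helpers, not the `UER` byte handler itself.

```rust
/// Returns `true` for byte values that never appear in well-formed UTF-8
/// (RFC 3629: "The octet values C0, C1, F5 to FF never appear").
const fn is_never_valid_in_utf8(byte: u8) -> bool {
    matches!(byte, 0xC0 | 0xC1 | 0xF5..=0xFF)
}

/// Index of the first such byte in `bytes`, if any.
fn first_illegal_byte(bytes: &[u8]) -> Option<usize> {
    bytes.iter().position(|&b| is_never_valid_in_utf8(b))
}
```

A valid `&str` never trips this check, which is why the handler should be unreachable unless a safety contract was broken upstream.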
2024-02-16 20:49:01 +08:00
Dunqing
73e116e8a1
fix(parser): incorrect parsing of class accessor property name (#2386) 2024-02-11 22:57:13 +08:00
overlookmotel
383f5b3081
perf(parser): consume multi-line comments faster (#2377)
Consume multi-line comments faster.

* Initially search for `*/`, `\r`, `\n` or `0xE2` (first byte of
irregular line breaks).
* Once a line break is found, switch to faster search which only looks
for `*/`, as it's not relevant whether there are more line breaks or
not.

Using `memchr` for the 2nd simpler search, as it's efficient for a
search with only one "needle".

Initializing `memchr::memmem::Finder` is fairly expensive, and I tried
numerous ways to handle it. This is the most performant way I could find.
Any ideas how to avoid re-creating it for each Lexer pass? (it can't be
a `static` as `Finder::new` is not a const function, and `lazy_static!`
is too costly)
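
The two-phase search can be sketched as follows. This is a simplified, std-only illustration: it uses `slice::windows` where the PR uses `memchr`, it takes the comment body after the opening `/*`, and the function names are hypothetical.

```rust
/// Scan a comment body (the bytes after `/*`).
/// Returns `(end, saw_line_break)` where `end` is the index just past the
/// closing `*/`, or `None` if the comment is unterminated.
fn skip_multi_line_comment(body: &[u8]) -> Option<(usize, bool)> {
    let mut i = 0;
    // Phase 1: look for `*/`, `\r`, `\n`, or 0xE2 (first byte of LS / PS).
    while i < body.len() {
        let is_line_break = match body[i] {
            b'*' if body.get(i + 1) == Some(&b'/') => return Some((i + 2, false)),
            b'\r' | b'\n' => true,
            // U+2028 is E2 80 A8 and U+2029 is E2 80 A9 in UTF-8.
            0xE2 => {
                let rest = &body[i + 1..];
                rest.starts_with(&[0x80, 0xA8]) || rest.starts_with(&[0x80, 0xA9])
            }
            _ => false,
        };
        if is_line_break {
            // Phase 2: one line break is enough, so only `*/` matters now.
            return find_comment_end(&body[i..]).map(|end| (i + end, true));
        }
        i += 1;
    }
    None
}

/// Simple single-needle search for `*/`; returns the index just past it.
fn find_comment_end(bytes: &[u8]) -> Option<usize> {
    bytes
        .windows(2)
        .position(|w| w[0] == b'*' && w[1] == b'/')
        .map(|p| p + 2)
}
```

The second phase has a single needle, which is the case where a dedicated substring searcher such as `memchr::memmem` would be expected to pay off.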
2024-02-11 12:43:14 +08:00
Boshen
ef336cb66b
feat(parser): recover from async x [newline] => x (#2375)
```javascript
async x
=> x
```

Babel recovers and displays "No line break is allowed before '=>'".
2024-02-10 11:19:08 +08:00
overlookmotel
c4fa738312
perf(parser): consume single-line comments faster (#2374)
Use `byte_search!` macro to consume single-line comments.

Would be a lot simpler if it didn't have to deal with irregular line breaks. Damn you Unicode!
2024-02-10 11:02:30 +08:00
overlookmotel
b29719d2df
refactor(parser): add methods to Source + SourcePosition (#2373)
Preparatory step for #2374.
2024-02-10 10:57:33 +08:00
overlookmotel
79ae9a9b2c
refactor(parser): extend byte_search macro (#2372)
Preparatory step for #2374.
2024-02-10 10:52:59 +08:00
overlookmotel
0be8397c77
perf(parser): optimize lexing strings (#2366)
Optimize lexing strings a bit.
2024-02-09 23:52:45 +08:00
overlookmotel
c0d1d6b08a
perf(parser): lex strings as bytes (#2357)
Lex string literals as bytes, using same techniques as for identifiers.

Handling escapes could be optimized a bit more, and maybe I'll return to that, but as escapes are fairly rare, it wouldn't be the biggest gain.
2024-02-09 21:00:27 +08:00
overlookmotel
2f6cf73d51
fix(parser): remove erroneous debug assertion (#2356)
This was a bit of a whoopsie in last batch of PRs. This assertion shouldn't be there, because all reads are now via `source.position().read()`, so this assertion says "you can only read some byte values".

Only reason it didn't blow up conformance tests is that they run in release mode.

Sorry. Please merge as soon as you can and cover my shame!
2024-02-09 20:55:12 +08:00
overlookmotel
8376f15b9a
perf(parser): eat whitespace after line break (#2353)
Uses the `byte_search!` macro introduced in #2352 to consume whitespace after a line break.
2024-02-09 12:02:51 +08:00
overlookmotel
d3a59f27f7
perf(parser): lex identifiers as bytes not chars (#2352)
This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new `Source` / `SourcePosition` APIs.

Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent.

I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance.

### Techniques

This implementation uses a few different strategies for performance:

* Search byte-by-byte, not char-by-char.
* Process batches of 32 bytes at a time to reduce bounds checks.
* Mark uncommon paths `#[cold]`.
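
As a rough sketch of how these three techniques might combine, assuming a lookup table over byte values and a slice-based API: the names `ASCII_ID_CONTINUE`, `ascii_identifier_len` and `scan_tail` are hypothetical, and the real implementation works through `Source` / `SourcePosition` and a macro instead.

```rust
/// Bytes are processed in batches of this size.
const BATCH_SIZE: usize = 32;

/// Build a 256-entry table: `true` for ASCII bytes that may continue an
/// identifier (`a-z`, `A-Z`, `0-9`, `_`, `$`).
const fn build_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 256 {
        let byte = i as u8;
        table[i] = byte.is_ascii_alphanumeric() || byte == b'_' || byte == b'$';
        i += 1;
    }
    table
}

static ASCII_ID_CONTINUE: [bool; 256] = build_table();

/// Length of the leading run of ASCII identifier bytes in `source`.
fn ascii_identifier_len(source: &[u8]) -> usize {
    let mut pos = 0;
    // Fast path: one slice bounds check per 32 bytes, then a
    // byte-by-byte table lookup within the batch.
    while source.len() - pos >= BATCH_SIZE {
        let batch = &source[pos..pos + BATCH_SIZE];
        for (i, &byte) in batch.iter().enumerate() {
            if !ASCII_ID_CONTINUE[byte as usize] {
                return pos + i;
            }
        }
        pos += BATCH_SIZE;
    }
    scan_tail(source, pos)
}

/// Uncommon path: fewer than `BATCH_SIZE` bytes remain.
#[cold]
fn scan_tail(source: &[u8], start: usize) -> usize {
    source[start..]
        .iter()
        .position(|&b| !ASCII_ID_CONTINUE[b as usize])
        .map_or(source.len(), |i| start + i)
}
```

Any non-ASCII byte or `\` ends the run here, which is where a real lexer would drop down to its slower Unicode or escape-handling layer.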

### Structure

The implementation is built in 3 layers:

1. ASCII characters only.
2. ASCII and Unicode characters.
3. `\` escape sequences (and all the above).

`identifier_name_handler` starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt.

I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised.

That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier *starts* with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a `#[cold]` boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric.

### ASCII search macro

The main ASCII search is implemented as a macro. I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked `#[inline]` or `#[inline(always)]`. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro.

This macro, and the `ByteMatchTable` types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple.

Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD.

### Bye bye `AutoCow`

`AutoCow` is removed. Instead, a string-builder is only created if it's needed, when a `\` escape is first encountered. The string builder is also more efficient than `AutoCow` was, as it copies bytes in chunks, rather than 1-by-1.

This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.
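
The "only build a string when an escape is first seen" idea can be illustrated with `std::borrow::Cow`. This is a toy sketch under a simplifying assumption: a backslash just makes the following character literal, which is not how JavaScript escapes work, and `unescape` is a hypothetical function rather than the lexer's string builder.

```rust
use std::borrow::Cow;

/// Borrow `raw` unchanged if it has no `\`; otherwise build an owned copy,
/// copying the text between escapes in chunks rather than char-by-char.
fn unescape(raw: &str) -> Cow<'_, str> {
    // Common case: no escape at all, so no allocation happens.
    let first = match raw.find('\\') {
        Some(index) => index,
        None => return Cow::Borrowed(raw),
    };

    // First escape found: only now create the string builder, and copy
    // everything before the escape in one chunk.
    let mut built = String::with_capacity(raw.len());
    built.push_str(&raw[..first]);

    let mut rest = &raw[first..];
    while let Some(index) = rest.find('\\') {
        built.push_str(&rest[..index]);
        // Toy rule (assumption): the char after `\` is taken literally.
        let mut after = rest[index + 1..].chars();
        if let Some(escaped) = after.next() {
            built.push(escaped);
        }
        rest = after.as_str();
    }
    built.push_str(rest);
    Cow::Owned(built)
}
```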
2024-02-09 12:01:30 +08:00
overlookmotel
6910e4f71b
refactor(parser): macro for ASCII identifier byte handlers (#2351)
Add a macro for ASCII identifier byte handlers.

This is a preparatory step towards #2352.
2024-02-09 11:55:35 +08:00
overlookmotel
6f597b18bc
refactor(parser): all pointer manipulation through SourcePosition (#2350)
A safer and faster interface for reading source text using pointers than `*ptr`.
2024-02-09 10:26:51 +08:00
overlookmotel
185b3dbcc3
refactor(parser): fix outdated comment (#2344)
Just fixes an outdated comment.
2024-02-08 19:47:33 +08:00
overlookmotel
f3470163d9
refactor(parser): make Source::set_position safe (#2341)
Make `Source::set_position` a safe function.

This addresses a shortcoming of #2288.

Instead of requiring caller of `Source::set_position` to guarantee that the `SourcePosition` is created from this `Source`, the preceding PRs enforce this guarantee at the type level.

`Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway.

So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
2024-02-08 14:56:26 +08:00
overlookmotel
aef593fb50
parser(refactor): promise only one Source on a thread at a time (#2340)
Introduce invariant that only a single `lexer::Source` can exist on a thread at one time.

This is a preparatory step for #2341.

2 notes:

Restriction is only 1 x `ParserImpl` / `Lexer` / `Source` on 1 *thread* at a time, not globally. So this does not prevent parsing multiple files simultaneously on different threads.

Restriction does not apply to public type `Parser`, only `ParserImpl`. `ParserImpl`s are not created in `Parser::new`, but instead in `Parser::parse`, where they're created and then immediately consumed. So the end user is also free to create multiple `Parser` instances (if they want to for some reason) on the same thread.
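
One way to picture a "one per thread at a time" invariant is a guard token backed by a thread-local flag. This is a runtime-checked sketch with hypothetical names (`SOURCE_EXISTS`, `UniqueToken`); it is an assumption for illustration and not necessarily how the PR enforces the invariant.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

thread_local! {
    /// `true` while a token is alive on the current thread.
    static SOURCE_EXISTS: Cell<bool> = Cell::new(false);
}

/// At most one `UniqueToken` can exist per thread at any time.
struct UniqueToken {
    // The raw pointer marker makes the token `!Send`, so it cannot move
    // to another thread and release the wrong thread's flag.
    _not_send: PhantomData<*const ()>,
}

impl UniqueToken {
    /// Returns `None` if a token already exists on this thread.
    fn try_new() -> Option<Self> {
        SOURCE_EXISTS.with(|flag| {
            if flag.replace(true) {
                None
            } else {
                Some(UniqueToken { _not_send: PhantomData })
            }
        })
    }
}

impl Drop for UniqueToken {
    fn drop(&mut self) {
        // Releasing the token allows a new one on this thread.
        SOURCE_EXISTS.with(|flag| flag.set(false));
    }
}
```

Because the flag is thread-local, holding a token on one thread does not stop another thread from acquiring its own, matching the first note above.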
2024-02-08 14:51:17 +08:00
overlookmotel
0bdecb5043
refactor(parser): wrapper type for parser (#2339)
Split parser into public interface `Parser` and internal implementation `ParserImpl`.

This involves no changes to public API.

This change is a bit annoying, but justification is that it's required for #2341, which I believe to be very worthwhile.

The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code.

All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.
2024-02-07 23:22:08 +08:00
overlookmotel
cdef41d552
refactor(parser): lexer replace Chars with Source (#2288)
This PR replaces the `Chars` iterator in the lexer with a new structure
`Source`.

## What it does

`Source` holds the source text, and allows:

* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used
later to rewind to that position, without having to clone the entire
`Source` struct.

`Source` has the same invariants as `Chars` - cursor must always be
positioned on a UTF-8 character boundary (i.e. not in the middle of a
multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break
that invariant, as long as they satisfy it again before they pass
control back to safe code. This will be useful for processing batches of
bytes.
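
A safe, index-based sketch of the interface described above. The real `Source` uses raw pointers and unsafe code; the byte offset, the `peek_byte` method and the runtime assertion in `set_position` are assumptions made to keep this example safe and self-contained.

```rust
/// Cheap, copyable marker for a position in the source text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SourcePosition(usize);

/// Holds the source text plus a cursor that always sits on a UTF-8
/// character boundary.
struct Source<'a> {
    text: &'a str,
    offset: usize,
}

impl<'a> Source<'a> {
    fn new(text: &'a str) -> Self {
        Self { text, offset: 0 }
    }

    /// Current position, usable later to rewind without cloning `Source`.
    fn position(&self) -> SourcePosition {
        SourcePosition(self.offset)
    }

    /// Rewind (or advance) to a previously obtained position.
    fn set_position(&mut self, pos: SourcePosition) {
        // Checked at runtime in this sketch; also rejects out-of-bounds.
        assert!(self.text.is_char_boundary(pos.0), "not a char boundary");
        self.offset = pos.0;
    }

    /// Look at the next byte without consuming it.
    fn peek_byte(&self) -> Option<u8> {
        self.text.as_bytes().get(self.offset).copied()
    }

    /// Consume the next char, keeping the cursor on a char boundary.
    fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.offset..].chars().next()?;
        self.offset += c.len_utf8();
        Some(c)
    }
}
```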

## Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I
believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming
identifiers or whitespace) without the overhead of calculating code
points for every character. It should also make all the many `peek()`,
`next_char()` and `next_eq()` calls faster.

`Source` is also more performant than `Chars` in itself. This wasn't my
intent, but seems to be a pleasant side-effect of it being less opaque
to the compiler than `Chars`, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire `Source`
struct, but only a `SourcePosition` (8 bytes), I was able to reduce the
size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both
`Copy`.

## Notes on implementation

`Source` is heavily based on Rust's `std::str::Chars` and
`std::slice::Iter` iterators and I've copied the code/concepts from them
as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a *lot*
of unsafe code. I *think* I've crossed the T's and dotted the I's, and
I've commented the code extensively, but I'd appreciate a close review
if anyone has time.

I've split it into 2 commits.

* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer
needed, and replaces `lexer.current.token` with `lexer.token`
everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make
it easier to review.

### `SourcePosition`

There is one annoyance with the API which I haven't been able to solve:

`SourcePosition` is a wrapper around a pointer, which can only be
created from the current position of `Source`. Due to the invariant
mentioned above, `SourcePosition` is always in bounds of the
source text, and points to a UTF-8 character boundary. So `Source` can
be rewound to a `SourcePosition` cheaply, without any checks. I had
originally envisaged `Source::set_position` being a safe function, as
`SourcePosition` enforces the necessary invariants itself.

The fly in the ointment is that a `SourcePosition` could theoretically
have been created from *another* `Source`. If that was the case, it
would be out of bounds, and it would be instant UB. Consequently,
`Source::set_position` has to be an unsafe function.

This feels rather ridiculous. *Of course* the parser won't create 2
Lexers at the same time. But still it's *possible*, so I think it's better to
take the strict approach and make it unsafe until we can find a way to
statically prove the safety by some other means. Any ideas?

## Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for
`pdf.mjs`.

While I was developing this, small and seemingly irrelevant changes
would flip that benchmark from +0.5% or so to -4%, and then another
small change would flip it back.

What I don't understand is that parsing happens outside of the
measurement loop in the semantic benchmark, so the parser shouldn't have
*any* effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect
appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for
`pdf.mjs` after this PR is identical to what it was before. And
semantic's `nodes` and `scopes` Vecs are same length as they were
before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this
before?

One possibility is a fault in my unsafe code which is manifesting only
with `pdf.mjs`, and it's triggering UB, which I guess could explain the
weird effects. I'm running the parser on `pdf.mjs` in Miri now and will
see if it finds anything (Miri doesn't find any problem running the
tests). It's been running for over an hour now. Hopefully it'll be done
by morning!

I feel like this shouldn't be merged until that question is resolved, so
marking this as draft in the meantime.
2024-02-05 13:51:46 +00:00
Dunqing
a3570d41f0
feat(semantic): report parameter related errors for setter/getter (#2316) 2024-02-05 17:38:43 +08:00
overlookmotel
9811c3a2c3
refactor(parser): name byte handler functions (#2301)
This PR solves the problem of lexer byte handlers all being called
`core::ops::function::FnOnce::call_once` in the flame graphs on
CodSpeed, by defining them as named functions instead of closures.

Pure refactor, no substantive changes.
2024-02-05 13:06:09 +08:00
Boshen
1822cfe18d
refactor(ast): fix BigInt memory leak by removing it (#2293)
relates

We'll need to evaluate the value by other means.
2024-02-04 16:47:00 +08:00
Dunqing
2578bb3d64
feat(ast): remove generator property from ArrowFunction (#2260)
ArrowFunction doesn't support generator.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/function*
2024-02-02 04:01:19 +00:00
Dunqing
165f948227
feat(ast): remove expression property from Function (#2247) 2024-02-01 15:23:27 +08:00
Boshen
2beacd3f4d
fix(lexer): correct the span for irregular whitespaces (#2245)
closes #2236
2024-02-01 14:18:47 +08:00
overlookmotel
d0d708295b
refactor(parser): consume chars when parsing surrogate pair escape (#2243)
This fixes a mistake I made in #2237.

I was confused by the `!(...)` wrapping of the preceding `if` test and
missed that there are definitely 2 chars to consume, so can use
`consume_char()` instead of `next_char()`. This makes no difference to
behavior, but it follows the convention to always prefer
`consume_char()` when possible.

I've also refactored the code which confused me, so hopefully others
won't be confused too!
2024-02-01 11:34:26 +08:00
overlookmotel
622a2c37fa
refactor(lexer): don't use lexer.current.chars directly (#2237)
This PR replaces most usages of `lexer.current.chars.next()` with
`lexer.consume_char()`, or a new function `lexer.next_char()`.

This is a preparatory step towards replacing the `Chars` iterator with
something more flexible which can also consume bytes (not `char`s), and
this PR was intended as a pure refactor. But I was surprised to see there is a
small performance bump (no idea why!).

There's an additional benefit: Using `consume_char()` everywhere where
we believe there's definitely a char there to be consumed will make
logic errors produce a panic, rather than silently outputting garbage.
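
The difference between the two calls might look like this. It is a minimal sketch: the method bodies are assumptions based on the description above, not copied from the lexer.

```rust
use std::str::Chars;

struct Lexer<'a> {
    chars: Chars<'a>,
}

impl<'a> Lexer<'a> {
    fn new(source: &'a str) -> Self {
        Self { chars: source.chars() }
    }

    /// Use when the end of the source is a legitimate possibility.
    fn next_char(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// Use when the caller has already established that a char is present.
    /// A logic error then panics instead of silently producing garbage.
    fn consume_char(&mut self) -> char {
        self.chars
            .next()
            .expect("consume_char called with no char left to consume")
    }
}
```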
2024-01-31 21:35:46 +08:00
overlookmotel
5279e8955f
refactor(parser): byte handler for illegal bytes (#2229)
This adds a separate byte handler to the lexer for byte values which
should never be encountered:

1. UTF-8 continuation bytes (i.e. middle of a multi-byte UTF-8 byte
sequence).
2. Byte values which are illegal in valid UTF-8 strings.

At present, this function is impossible to reach, because
`std::str::Chars` ensures the next byte is always the *start* of a valid
UTF-8 byte sequence. But later changes I intend to make, which introduce unsafe code,
will make it possible (but highly undesirable!). In the meantime, I
don't think it does any harm to handle this case.
2024-01-31 18:57:47 +08:00
overlookmotel
3d79d77b40
refactor(parser): split lexer into multiple files (#2228)
This PR has a large diff, but it contains no substantive changes
whatsoever. It purely breaks up the lexer into multiple smaller files.

I've been working quite intensively on the lexer over past few weeks,
but still have been finding it hard to make sense of, due to most of the
logic currently being contained in [a single 1800-line
file](018675ceb1/crates/oxc_parser/src/lexer/mod.rs).

I feel that breaking it up into multiple files makes it much easier to
navigate and understand.

An additional benefit is that many functions can have their visibility
reduced to module scope, so sub-systems for e.g. lexing numbers have
fewer exposed functions. This makes it clearer what the entry points
are, and makes it harder to make mistakes when working on the lexer.

I intend to later make changes to the lexer for performance which will
introduce unsafe code. Keeping that unsafe code encapsulated in modules
will make it more viable to validate the workings of that code, and
avoid accidental UB.

There is one downside to this change. Previously
[`lexer/mod.rs`](018675ceb1/crates/oxc_parser/src/lexer/mod.rs)
was laid out in same order as the JS spec. If you were trying to
validate the lexer against the spec, this would make it easier. However,
as OXC's parser is fairly mature at this point, and I imagine most
spec-compliance issues have been flushed out by now, in my opinion this
advantage is less compelling than it probably used to be. So in my view
it's outweighed by the benefit of more readable code.

Reviewing this could be a bit of a battle due to the size of the diff. I
do have further changes I'd like to make, but I've intentionally kept
this PR as 100% just:

1. Moving code around.
2. Reducing visibility of functions to module/super scope where that's
possible to do without changing anything else.

Aside from that, not even a single comment has changed.

If you're willing to trust me on that promise, I think it can be merged
without poring through it line by line.
2024-01-31 11:43:53 +08:00
overlookmotel
81e33a3701
perf(parser): faster offset calculation (#2215)
A faster way to calculate offset in the lexer.

This only moves the needle because it's on the hottest path in the lexer:
`Lexer::offset` is called for every token in `Lexer::read_next_token`.
2024-01-30 18:49:31 +08:00
overlookmotel
51ac392ae4
refactor(parser): mark ByteHandlers unsafe (#2212)
All the ASCII `ByteHandler`s are unsafe to call. I forgot to mark them
as unsafe when making that change.

This PR fixes that, and will make it harder for someone to accidentally
call one of them without considering the safety invariants.
2024-01-30 12:23:35 +08:00
overlookmotel
20679d1e1e
perf(parser): pad Token to 16 bytes (#2211)
Counter-intuitively, it seems that *increasing* the size of `Token`
improves performance slightly.

This appears to be because when `Token` is 16 bytes, copying `Token` is
a single 16-byte load/store. At present, it's 12 bytes which requires an
8-byte load/store + a 4-byte load/store.

https://godbolt.org/z/KPYsn3ab7
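
The size difference can be checked at compile time. The field layout below is hypothetical (the message does not list `Token`'s fields), and `#[repr(C)]` is used only to make the sizes in this sketch deterministic.

```rust
use std::mem::size_of;

/// 4 + 4 + 1 + 1 = 10 bytes of fields, rounded up to 12 by 4-byte alignment.
#[repr(C)]
#[derive(Clone, Copy)]
struct Token12 {
    start: u32,
    end: u32,
    kind: u8,
    is_on_new_line: bool,
}

/// Same fields plus explicit padding, bringing the size to 16 bytes.
#[repr(C)]
#[derive(Clone, Copy)]
struct Token16 {
    start: u32,
    end: u32,
    kind: u8,
    is_on_new_line: bool,
    _padding: [u8; 6],
}

// Compile-time checks: the build fails if either size is not as expected.
const _: () = assert!(size_of::<Token12>() == 12);
const _: () = assert!(size_of::<Token16>() == 16);
```

Whether a copy compiles to one 16-byte load/store depends on the target and optimization level, so the assembly in the link above remains the evidence for that part.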

This suggests that either:

1. #2010 could be reverted at no cost, and the overhead of the hash
table removed.
or:
2. We need to get `Token` down to 8 bytes!

I have an idea how to *maybe* do (2), so I'd suggest leaving it as is
for now until I've been able to research that.

NB I also tried putting `#[repr(align(16))]` on `Token` so that copying
uses aligned loads/stores. That [hurt the benchmarks very
slightly](https://codspeed.io/overlookmotel/oxc/branches/lexer-pad-token),
though it might produce a gain on architectures where unaligned loads
are more expensive (ARM64 I think?). But I can't test that theory, so
have left it out.
2024-01-30 11:47:26 +08:00
overlookmotel
872d751a18
refactor(parser): re-order match branches (#2209)
Just a tiny bit of code tidying.
2024-01-30 00:53:56 +08:00
overlookmotel
71898ffdd5
refactor(parser): move source length check into lexer (#2206)
This change makes little difference in itself, but moving the check into
the lexer will allow some optimizations in lexer using unsafe code which
depend on this invariant.
2024-01-29 22:29:02 +08:00
overlookmotel
e123be0a00
fix(parser): correct MAX_LEN for 32-bit systems (#2204)
Maximum length of source parser can accept is limited on 32-bit systems
to `isize::MAX` (i.e. `i32::MAX` not `u32::MAX`) because Rust [limits
the size of
allocations](https://doc.rust-lang.org/std/alloc/struct.Layout.html#method.from_size_align)
to `isize::MAX`.

This PR takes that constraint into account when calculating
`Parser::MAX_LEN`.
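
The calculation might be sketched as below. It assumes the limit on 64-bit systems is `u32::MAX`, as the message implies; `MAX_LEN` here is a free constant and `check_source_len` is a hypothetical helper, neither being the actual `Parser::MAX_LEN` definition.

```rust
/// Maximum accepted source length in bytes.
/// On 64-bit targets the limit is `u32::MAX` (assumed from the message).
/// On 32-bit targets Rust caps allocations at `isize::MAX`, which is smaller.
const MAX_LEN: usize = if std::mem::size_of::<usize>() >= 8 {
    u32::MAX as usize
} else {
    isize::MAX as usize
};

/// Reject source text that is longer than `MAX_LEN`.
fn check_source_len(source: &str) -> Result<(), String> {
    if source.len() > MAX_LEN {
        Err(format!("source length exceeds {} bytes", MAX_LEN))
    } else {
        Ok(())
    }
}
```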

It also speeds up the `overlong_source` test so it runs in under 500ms
(previously it took ~4 secs on an M1 MacBook Pro).
2024-01-29 21:45:45 +08:00