dan/oxc - BGit

mirror of https://github.com/danbulant/oxc synced 2026-05-24 20:32:10 +00:00

Author	SHA1	Message	Date
overlookmotel	18cff6aab8	refactor(parser): remove start params for `byte_search` macro arms (#2553 ) Simplify `byte_search` macro a bit more.	2024-03-01 21:15:27 +08:00
overlookmotel	34ecdd58d8	refactor(parser): simplify `byte_search` macro (#2552 ) This PR greatly simplifies the `byte_search!` macro. Mainly removing `cold_branch()` from the "not enough bytes remaining for a batch" branch, which allows refactoring so that `handle_match` and `continue_if` don't need to be repeated twice. Result for performance is inconsistent - a little better on some benchmarks, a little worse on others. But not by significant amounts either way. In my view, the benefit of making the macro simpler outweighs a small speed loss anyway.	2024-03-01 21:07:39 +08:00
overlookmotel	ddccaa1af9	refactor(parser): remove unsafe code in lexer (#2549 ) Same as #2527. Just remove some unnecessary unsafe code, no substantive changes.	2024-02-29 15:00:08 +00:00
overlookmotel	5a13714a18	perf(parser): faster lexing template strings (#2541 ) Speed up lexing template strings. This was the last use of `AutoCow` remaining in the lexer, and it's now removed. Implementation is quite complex, to avoid repeatedly branching on whether an unescaped string is required or not (the way `AutoCow` did). I tried to simplify it down to a single function, but this hurt performance significantly. Benchmarks do not show much movement, but I believe that's because there aren't many template strings in the benchmarks. Where there are template strings, I believe this speeds up lexing them significantly.	2024-02-29 13:28:30 +08:00
overlookmotel	9d7ea6b3f0	refactor(parser): single function for all string slicing (#2540 ) Pure refactor. Move all string-slicing in `lexer::Source` into a single function.	2024-02-29 13:22:55 +08:00
Boshen	3efbbb2e1f	feat(ast): add "abstract" type to `MethodDefinition` and `PropertyDefinition` (#2536 ) closes #2532 ``` pub enum PropertyDefinitionType { PropertyDefinition, TSAbstractPropertyDefinition, } pub enum MethodDefinitionType { MethodDefinition, TSAbstractMethodDefinition, } ```	2024-02-28 17:33:11 +08:00
overlookmotel	24ded3cb15	perf(parser): lex JSX strings with `memchr` (#2528 ) Simplify lexing JSX string attributes. As the search is purely for 1 byte value (the closing quote), and so doesn't require a byte table, use `memchr`. This change doesn't really register on benchmarks, but it's one step closer to removing `AutoCow`, and transitioning all the searches in the lexer to byte-by-byte.	2024-02-28 14:39:23 +08:00
overlookmotel	0ddfc856d2	refactor(parser): remove unsafe code (#2527 ) Remove some unnecessary unsafe code.	2024-02-27 20:28:21 +08:00
Boshen	46e779194a	chore: fix clippy warnings (#2519 )	2024-02-26 23:55:18 +08:00
Boshen	351a0572be	chore(parser): print both AST and errors in examples/parser	2024-02-26 23:20:46 +08:00
Boshen	be6b8b7ce6	[BREAKING CHANGE] Change `Atom` to `Atom<'a>` to make it safe (#2497 ) Part of #2295 This PR splits the `Atom` type into `Atom<'a>` and `CompactString`. All the AST node strings now use `Atom<'a>` instead of `Atom` to signify it belongs to the arena. It is now up to the user to select which form of the string to use. This PR essentially removes the really unsafe code `93742f89e9/crates/oxc_span/src/atom.rs (L98-L107)` which can lead to ![image](https://github.com/oxc-project/oxc/assets/1430279/8c513c4f-19b0-4b63-b61c-e07c187c95b5)	2024-02-26 19:34:40 +08:00
Boshen	4fabe66621	Publish crates v0.8.0	2024-02-26 19:01:51 +08:00
Dunqing	70295a5552	feat(ast): update arrow_expression to arrow_function_expression (#2496 )	2024-02-25 14:39:34 +00:00
Boshen	7a796c4b5f	feat(ast): add `TSModuleDeclaration.kind` (#2487 ) closes #2395	2024-02-24 17:09:31 +08:00
Boshen	5212f7b51e	fix(parser): fix missing end span from `TSTypeAliasDeclaration` (#2485 ) closes #2483	2024-02-24 16:51:00 +08:00
Boshen	1634586934	refactor(ast): s/TSTypeOperatorType/TSTypeOperator to align with estree	2024-02-21 22:25:04 +08:00
Boshen	9087f71765	refactor(ast): s/TSThisKeyword/TSThisType to align with estree	2024-02-21 22:25:04 +08:00
Boshen	d08abc638e	refactor(ast): s/NumberLiteral/NumericLiteral to align with estree	2024-02-21 21:41:08 +08:00
Boshen	35608c8eb1	chore: fix all docs	2024-02-21 18:06:37 +08:00
Andrew McClenaghan	6b3b260dcc	feat(Codegen): Improve codegen (#2460 ) This gets all the new TS types working to the same level TS output was before and fixes a bunch of other codegen --------- Co-authored-by: Boshen <boshenc@gmail.com>	2024-02-21 14:41:57 +08:00
Dunqing	197fa16613	feat(semantic): add check for duplicate class elements in checker (#2455 ) 1. Remove the check implementation from the parser 2. Implement it in the semantic checker 3. Support TypeScript's check for duplicate class elements Checking for duplicate class elements in the semantic checker makes it easier to support TypeScript's checking rules.	2024-02-21 14:10:19 +08:00
Boshen	a2c173de57	refactor: remove `panic!` from examples (#2454 ) relates #2308	2024-02-20 16:18:39 +08:00
overlookmotel	a78303d5a6	refactor(parser): `continue_if` in `byte_search` macro not unsafe (#2440 ) #2439 made using `continue_if` in `byte_search!` macro safe, as it no longer continues the main loop after a match, so no danger of reading out of bounds if `continue_if` code fast-forwards the current position. This follow-on PR removes the unsafe blocks, and uses that fast-forward ability in a couple of places.	2024-02-20 10:45:31 +08:00
overlookmotel	a5a3c695f7	refactor(parser): correct comment (#2441 ) Just correcting a typo in a comment, and moving comment to a better place.	2024-02-20 10:43:12 +08:00
overlookmotel	996a9d27eb	perf(parser): `byte_search` macro always unroll main loop (#2439 ) Refactor `byte_search!` macro to move logic out of the main loop. This ensures the compiler unrolls the loop. This speeds up lexing single-line comments by 20%-25% on the benchmarks which contain enough comments for the change to register. Presumably the loop wasn't unrolled previously. The code required to do this is a little odd. It adds an extra `loop {}` which always exits on the first turn (so not really a useful loop), but is required to be able to use `break` to exit that "loop", making 2 different paths for (1) matching byte found and (2) `for` loop completed without finding any match. This is only way I could find to produce this behavior without using a macro. Is there a more "normal" way to get the same logic?	2024-02-20 10:39:52 +08:00
Dunqing	60db720fa6	feat(parser): parse import attributes in TSImportType (#2436 ) close: #2394 `64d2eeea7b/src/compiler/types.ts (L2177-L2185)` The corresponding test cases were skipped, so I manually added some cases to misc `f5db48237f/tasks/coverage/src/typescript.rs (L118-L121)`	2024-02-19 12:26:42 +08:00
Dunqing	3cbe786b18	refactor(ast): update TSImportType parameter to argument (#2429 ) In typescript it's named argument, so we should keep it consistent `64d2eeea7b/src/compiler/types.ts (L2180)`	2024-02-19 10:29:24 +08:00
overlookmotel	90f9266d00	chore(deps): update `bumpalo` crate (#2417 ) Latest version of `bumpalo` includes a couple of performance fixes for `String` (e.g. https://github.com/fitzgen/bumpalo/pull/229) which may help the parser a little.	2024-02-18 11:49:31 +08:00
overlookmotel	cc2ddbee77	refactor(parser): catch all illegal UTF-8 bytes (#2415 ) Catch all illegal UTF-8 bytes with the `UER` byte handler. From https://datatracker.ietf.org/doc/html/rfc3629: > The octet values C0, C1, F5 to FF never appear. This change should make no difference at all, as a valid `&str` may not contain any of these byte values anyway. But it's possible if the user has e.g. created the string with `str::from_utf8_unchecked` and not obeyed the safety constraints. This will at least contain the damage if that's happened, and panic rather than lead to UB. And since we're already catching other error conditions, may as well catch them all.	2024-02-16 20:49:01 +08:00
Dunqing	73e116e8a1	fix(parser): incorrect parsing of class accessor property name (#2386 )	2024-02-11 22:57:13 +08:00
overlookmotel	383f5b3081	perf(parser): consume multi-line comments faster (#2377 ) Consume multi-line comments faster. * Initially search for `/`, `\r`, `\n` or `0xE2` (first byte of irregular line breaks). Once a line break is found, switch to faster search which only looks for `*/`, as it's not relevant whether there are more line breaks or not. Using `memchr` for the 2nd simpler search, as it's efficient for a search with only one "needle". Initializing `memchr::memmem::Finder` is fairly expensive, and tried numerous ways to handle it. This is most performant way I could find. Any ideas how to avoid re-creating it for each Lexer pass? (it can't be a `static` as `Finder::new` is not a const function, and `lazy_static!` is too costly)	2024-02-11 12:43:14 +08:00
Boshen	ef336cb66b	feat(parser): recover from `async x [newline] => x` (#2375 ) ```javascript async x => x ``` Babel recovers and displays "No line break is allowed before '=>'".	2024-02-10 11:19:08 +08:00
overlookmotel	c4fa738312	perf(parser): consume single-line comments faster (#2374 ) Use `byte_search!` macro to consume single-line comments. Would be a lot simpler if didn't have to deal with irregular line breaks. Damn you Unicode!	2024-02-10 11:02:30 +08:00
overlookmotel	b29719d2df	refactor(parser): add methods to `Source` + `SourcePosition` (#2373 ) Preparatory step for #2374.	2024-02-10 10:57:33 +08:00
overlookmotel	79ae9a9b2c	refactor(parser): extend `byte_search` macro (#2372 ) Preparatory step for #2374.	2024-02-10 10:52:59 +08:00
overlookmotel	0be8397c77	perf(parser): optimize lexing strings (#2366 ) Optimize lexing strings a bit.	2024-02-09 23:52:45 +08:00
Boshen	d6d921ea1f	Publish crates v0.7.0	2024-02-09 23:01:12 +08:00
overlookmotel	c0d1d6b08a	perf(parser): lex strings as bytes (#2357 ) Lex string literals as bytes, using same techniques as for identifiers. Handling escapes could be optimized a bit more, and maybe I'll return to that, but as escapes are fairly rare, it wouldn't be the biggest gain.	2024-02-09 21:00:27 +08:00
overlookmotel	2f6cf73d51	fix(parser): remove erroneous debug assertion (#2356 ) This was a bit of a whoopsie in last batch of PRs. This assertion shouldn't be there, because all reads are now via `source.position().read()`, so this assertion says "you can only read some byte values". Only reason it didn't blow up conformance tests is that they run in release mode. Sorry. Please merge soon as you can and cover my shame!	2024-02-09 20:55:12 +08:00
overlookmotel	8376f15b9a	perf(parser): eat whitespace after line break (#2353 ) Uses the `byte_search!` macro introduced in #2352 to consume whitespace after a line break.	2024-02-09 12:02:51 +08:00
overlookmotel	d3a59f27f7	perf(parser): lex identifiers as bytes not chars (#2352 ) This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new `Source` / `SourcePosition` APIs. Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent. I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance. ### Techniques This implementation uses a few different strategies for performance: * Search byte-by-byte, not char-by-char. * Process batches of 32 bytes at a time to reduce bounds checks. * Mark uncommon paths `#[cold]`. ### Structure The implementation is built in 3 layers: 1. ASCII characters only. 2. ASCII and Unicode characters. 3. `\` escape sequences (and all the above). `identifier_name_handler` starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt. I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised. That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier starts with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a `#[cold]` boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric. ### ASCII search macro The main ASCII search is implemented as a macro. 
I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked `#[inline]` or `#[inline(always)]`. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro. This macro, and the `ByteMatchTable` types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple. Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD. ### Bye bye `AutoCow` `AutoCow` is removed. Instead, a string-builder is only created if it's needed, when a `\` escape is first encountered. The string builder is also more efficient than `AutoCow` was, as it copies bytes in chunks, rather than 1-by-1. This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.	2024-02-09 12:01:30 +08:00
overlookmotel	6910e4f71b	refactor(parser): macro for ASCII identifier byte handlers (#2351 ) Add a macro for ASCII identifier byte handlers. This is a preparatory step towards #2352.	2024-02-09 11:55:35 +08:00
overlookmotel	6f597b18bc	refactor(parser): all pointer manipulation through `SourcePosition` (#2350 ) A safer and faster interface for reading source text using pointers than `*ptr`.	2024-02-09 10:26:51 +08:00
overlookmotel	185b3dbcc3	refactor(parser): fix outdated comment (#2344 ) Just fixes an outdated comment.	2024-02-08 19:47:33 +08:00
overlookmotel	f3470163d9	refactor(parser): make `Source::set_position` safe (#2341 ) Make `Source::set_position` a safe function. This addresses a shortcoming of #2288. Instead of requiring caller of `Source::set_position` to guarantee that the `SourcePosition` is created from this `Source`, the preceding PRs enforce this guarantee at the type level. `Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a lot of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway. So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.	2024-02-08 14:56:26 +08:00
overlookmotel	aef593fb50	parser(refactor): promise only one `Source` on a thread at a time (#2340 ) Introduce invariant that only a single `lexer::Source` can exist on a thread at one time. This is a preparatory step for #2341. 2 notes: Restriction is only 1 x `ParserImpl` / `Lexer` / `Source` on 1 thread at a time, not globally. So this does not prevent parsing multiple files simultaneously on different threads. Restriction does not apply to public type `Parser`, only `ParserImpl`. `ParserImpl`s are not created in `Parser::new`, but instead in `Parser::parse`, where they're created and then immediately consumed. So the end user is also free to create multiple `Parser` instances (if they want to for some reason) on the same thread.	2024-02-08 14:51:17 +08:00
overlookmotel	0bdecb5043	refactor(parser): wrapper type for parser (#2339 ) Split parser into public interface `Parser` and internal implementation `ParserImpl`. This involves no changes to public API. This change is a bit annoying, but justification is that it's required for #2341, which I believe to be very worthwhile. The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code. All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.	2024-02-07 23:22:08 +08:00
overlookmotel	cdef41d552	refactor(parser): lexer replace `Chars` with `Source` (#2288 ) This PR replaces the `Chars` iterator in the lexer with a new structure `Source`. ## What it does `Source` holds the source text, and allows: * Iterating through source text char-by-char (same as `Chars` did). * Iterating byte-by-byte. * Getting a `SourcePosition` for current position, which can be used later to rewind to that position, without having to clone the entire `Source` struct. `Source` has the same invariants as `Chars` - cursor must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character). However, unsafe APIs are provided to allow a caller to temporarily break that invariant, as long as they satisfy it again before they pass control back to safe code. This will be useful for processing batches of bytes. ## Why I envisage most of the Lexer migrating to byte-by-byte iteration, and I believe it'll make a significant impact on performance. It will allow efficiently processing batches of bytes (e.g. consuming identifiers or whitespace) without the overhead of calculating code points for every character. It should also make all the many `peek()`, `next_char()` and `next_eq()` calls faster. `Source` is also more performant than `Chars` in itself. This wasn't my intent, but seems to be a pleasant side-effect of it being less opaque to the compiler than `Chars`, so it can apply more optimizations. In addition, because checkpoints don't need to store the entire `Source` struct, but only a `SourcePosition` (8 bytes), was able to reduce the size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both `Copy`. ## Notes on implementation `Source` is heavily based on Rust's `std::str::Chars` and `std::slice::Iter` iterators and I've copied the code/concepts from them as much as possible. As it's a low-level primitive, it uses raw pointers and contains a lot of unsafe code. 
I think I've crossed the T's and dotted the I's, and I've commented the code extensively, but I'd appreciate a close review if anyone has time. I've split it into 2 commits. * First commit is all the substantive changes. * 2nd commit just does away with `lexer.current` which is no longer needed, and replaces `lexer.current.token` with `lexer.token` everywhere. Hopefully looking just at the 1st commit will reduce the noise and make it easier to review. ### `SourcePosition` There is one annoyance with the API which I haven't been able to solve: `SourcePosition` is a wrapper around a pointer, which can only be created from the current position of `Source`. Due to the invariant mentioned above, therefore `SourcePosition` is always in bounds of the source text, and points to a UTF-8 character boundary. So `Source` can be rewound to a `SourcePosition` cheaply, without any checks. I had originally envisaged `Source::set_position` being a safe function, as `SourcePosition` enforces the necessary invariants itself. The fly in the ointment is that a `SourcePosition` could theoretically have been created from another `Source`. If that was the case, it would be out of bounds, and it would be instant UB. Consequently, `Source::set_position` has to be an unsafe function. This feels rather ridiculous. Of course the parser won't create 2 Lexers at the same time. But still it's possible, so I think better to take the strict approach and make it unsafe until we can find a way to statically prove the safety by some other means. Any ideas? ## Oddity in the benchmarks There's something really odd going on with the semantic benchmark for `pdf.mjs`. While I was developing this, small and seemingly irrelevant changes would flip that benchmark from +0.5% or so to -4%, and then another small change would flip it back. 
What I don't understand is that parsing happens outside of the measurement loop in the semantic benchmark, so the parser shouldn't have any effect either way on semantic's benchmarks. If CodSpeed's flame graph is to be believed, most of the negative effect appears to be a large Vec reallocation happening somewhere in semantic. I've ruled out a few things: The AST produced by the parser for `pdf.mjs` after this PR is identical to what it was before. And semantic's `nodes` and `scopes` Vecs are same length as they were before. Nothing seems to have changed! I really am at a loss to explain it. Have you seen anything like this before? One possibility is a fault in my unsafe code which is manifesting only with `pdf.mjs`, and it's triggering UB, which I guess could explain the weird effects. I'm running the parser on `pdf.mjs` in Miri now and will see if it finds anything (Miri doesn't find any problem running the tests). It's been running for over an hour now. Hopefully it'll be done by morning! I feel like this shouldn't be merged until that question is resolved, so marking this as draft in the meantime.	2024-02-05 13:51:46 +00:00
Dunqing	a3570d41f0	feat(semantic): report parameter related errors for setter/getter (#2316 )	2024-02-05 17:38:43 +08:00
overlookmotel	9811c3a2c3	refactor(parser): name byte handler functions (#2301 ) This PR solves the problem of lexer byte handlers all being called `core::ops::function::FnOnce::call_once` in the flame graphs on CodSpeed, by defining them as named functions instead of closures. Pure refactor, no substantive changes.	2024-02-05 13:06:09 +08:00
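
Several of the commits above (#2352, #2372, #2439) describe searching source text byte-by-byte, in batches of 32 bytes, against a byte lookup table. The following is only a rough sketch of that general idea in safe Rust, not the actual `byte_search!` macro or `ByteMatchTable` type from the repository. The names `IDENT_TABLE`, `BATCH_SIZE`, `build_ident_table` and `ascii_ident_len` are hypothetical, and the real implementation uses raw pointers via `SourcePosition` rather than slices.

```rust
/// Builds a 256-entry lookup table: `true` for bytes that may appear in an
/// ASCII identifier. Hypothetical stand-in for a byte match table.
const fn build_ident_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        table[i] = b.is_ascii_alphanumeric() || b == b'_' || b == b'$';
        i += 1;
    }
    table
}

static IDENT_TABLE: [bool; 256] = build_ident_table();

/// Batch size taken from the description in #2352.
const BATCH_SIZE: usize = 32;

/// Returns the length of the leading run of ASCII identifier bytes.
fn ascii_ident_len(bytes: &[u8]) -> usize {
    let mut pos = 0;
    // Fast path: take whole batches, so the length check happens once per
    // batch instead of once per byte.
    while let Some(batch) = bytes.get(pos..pos + BATCH_SIZE) {
        if let Some(offset) = batch.iter().position(|&b| !IDENT_TABLE[b as usize]) {
            return pos + offset;
        }
        pos += BATCH_SIZE;
    }
    // Slow path: fewer than BATCH_SIZE bytes remain.
    let tail = &bytes[pos..];
    let offset = tail
        .iter()
        .position(|&b| !IDENT_TABLE[b as usize])
        .unwrap_or(tail.len());
    pos + offset
}
```

Calling `ascii_ident_len("some_identifier$1 = 42;".as_bytes())` stops at the first space. Because every byte before the returned offset is ASCII, the offset always falls on a UTF-8 character boundary, so slicing the original `&str` at that offset is safe.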

329 commits