dan/oxc

mirror of https://github.com/danbulant/oxc synced 2026-05-25 12:51:57 +00:00

Author	SHA1	Message	Date
overlookmotel	0bdecb5043	refactor(parser): wrapper type for parser (#2339 ) Split parser into public interface `Parser` and internal implementation `ParserImpl`. This involves no changes to public API. This change is a bit annoying, but justification is that it's required for #2341, which I believe to be very worthwhile. The `ParserOptions` type also makes it a bit clearer what the defaults for `allow_return_outside_function` and `preserve_parens` are. It came as a surprise to me that `preserve_parens` defaults to `true`, and this refactor makes that a bit more obvious when reading the code. All the real changes are in [oxc_parser/src/lib.rs](https://github.com/oxc-project/oxc/pull/2339/files#diff-8e59dfd35fc50b6ac9a9ccd991e25c8b5d30826e006d565a2e01f3d15dc5f7cb). The rest of the diff is basically replacing `Parser` with `ParserImpl` everywhere else.	2024-02-07 23:22:08 +08:00
overlookmotel	cdef41d552	refactor(parser): lexer replace `Chars` with `Source` (#2288) This PR replaces the `Chars` iterator in the lexer with a new structure `Source`. ## What it does `Source` holds the source text, and allows: * Iterating through source text char-by-char (same as `Chars` did). * Iterating byte-by-byte. * Getting a `SourcePosition` for current position, which can be used later to rewind to that position, without having to clone the entire `Source` struct. `Source` has the same invariants as `Chars` - cursor must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character). However, unsafe APIs are provided to allow a caller to temporarily break that invariant, as long as they satisfy it again before they pass control back to safe code. This will be useful for processing batches of bytes. ## Why I envisage most of the Lexer migrating to byte-by-byte iteration, and I believe it'll make a significant impact on performance. It will allow efficiently processing batches of bytes (e.g. consuming identifiers or whitespace) without the overhead of calculating code points for every character. It should also make all the many `peek()`, `next_char()` and `next_eq()` calls faster. `Source` is also more performant than `Chars` in itself. This wasn't my intent, but seems to be a pleasant side-effect of it being less opaque to the compiler than `Chars`, so it can apply more optimizations. In addition, because checkpoints don't need to store the entire `Source` struct, but only a `SourcePosition` (8 bytes), I was able to reduce the size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both `Copy`. ## Notes on implementation `Source` is heavily based on Rust's `std::str::Chars` and `std::slice::Iter` iterators and I've copied the code/concepts from them as much as possible. As it's a low-level primitive, it uses raw pointers and contains a lot of unsafe code. 
I think I've crossed the T's and dotted the I's, and I've commented the code extensively, but I'd appreciate a close review if anyone has time. I've split it into 2 commits. * First commit is all the substantive changes. * 2nd commit just does away with `lexer.current` which is no longer needed, and replaces `lexer.current.token` with `lexer.token` everywhere. Hopefully looking just at the 1st commit will reduce the noise and make it easier to review. ### `SourcePosition` There is one annoyance with the API which I haven't been able to solve: `SourcePosition` is a wrapper around a pointer, which can only be created from the current position of `Source`. Due to the invariant mentioned above, `SourcePosition` is always in bounds of the source text, and points to a UTF-8 character boundary. So `Source` can be rewound to a `SourcePosition` cheaply, without any checks. I had originally envisaged `Source::set_position` being a safe function, as `SourcePosition` enforces the necessary invariants itself. The fly in the ointment is that a `SourcePosition` could theoretically have been created from another `Source`. If that was the case, it would be out of bounds, and it would be instant UB. Consequently, `Source::set_position` has to be an unsafe function. This feels rather ridiculous. Of course the parser won't create 2 Lexers at the same time. But still it's possible, so I think it's better to take the strict approach and make it unsafe until we can find a way to statically prove the safety by some other means. Any ideas? ## Oddity in the benchmarks There's something really odd going on with the semantic benchmark for `pdf.mjs`. While I was developing this, small and seemingly irrelevant changes would flip that benchmark from +0.5% or so to -4%, and then another small change would flip it back. 
What I don't understand is that parsing happens outside of the measurement loop in the semantic benchmark, so the parser shouldn't have any effect either way on semantic's benchmarks. If CodSpeed's flame graph is to be believed, most of the negative effect appears to be a large Vec reallocation happening somewhere in semantic. I've ruled out a few things: The AST produced by the parser for `pdf.mjs` after this PR is identical to what it was before. And semantic's `nodes` and `scopes` Vecs are the same length as they were before. Nothing seems to have changed! I really am at a loss to explain it. Have you seen anything like this before? One possibility is a fault in my unsafe code which is manifesting only with `pdf.mjs`, and it's triggering UB, which I guess could explain the weird effects. I'm running the parser on `pdf.mjs` in Miri now and will see if it finds anything (Miri doesn't find any problem running the tests). It's been running for over an hour now. Hopefully it'll be done by morning! I feel like this shouldn't be merged until that question is resolved, so marking this as draft in the meantime.	2024-02-05 13:51:46 +00:00
Dunqing	a3570d41f0	feat(semantic): report parameter related errors for setter/getter (#2316 )	2024-02-05 17:38:43 +08:00
overlookmotel	9811c3a2c3	refactor(parser): name byte handler functions (#2301 ) This PR solves the problem of lexer byte handlers all being called `core::ops::function::FnOnce::call_once` in the flame graphs on CodSpeed, by defining them as named functions instead of closures. Pure refactor, no substantive changes.	2024-02-05 13:06:09 +08:00
Boshen	1822cfe18d	refactor(ast): fix BigInt memory leak by removing it (#2293 ) relates We'll need to evaluate the value by other means.	2024-02-04 16:47:00 +08:00
Boshen	d2b304b1f8	Publish crates v0.6.0	2024-02-03 22:35:30 +08:00
Boshen	6849c047ef	chore(parser): add visitor example (#2271 ) closes #2256	2024-02-02 17:08:00 +08:00
Dunqing	2578bb3d64	feat(ast): remove generator property from ArrowFunction (#2260 ) ArrowFunction doesn't support generator. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/function*	2024-02-02 04:01:19 +00:00
Dunqing	165f948227	feat(ast): remove expression property from Function (#2247 )	2024-02-01 15:23:27 +08:00
Boshen	2beacd3f4d	fix(lexer): correct the span for irregular whitespaces (#2245 ) closes #2236	2024-02-01 14:18:47 +08:00
overlookmotel	d0d708295b	refactor(parser): consume chars when parsing surrogate pair escape (#2243 ) This fixes a mistake I made in #2237. I was confused by the `!(...)` wrapping of the preceding `if` test and missed that there are definitely 2 chars to consume, so can use `consume_char()` instead of `next_char()`. This makes no difference to behavior, but it follows the convention to always prefer `consume_char()` when possible. I've also refactored the code which confused me, so hopefully others won't be confused too!	2024-02-01 11:34:26 +08:00
overlookmotel	622a2c37fa	refactor(lexer): don't use `lexer.current.chars` directly (#2237 ) This PR replaces most usages of `lexer.current.chars.next()` with `lexer.consume_char()`, or a new function `lexer.next_char()`. This is a preparatory step towards replacing the `Chars` iterator with something more flexible which can also consume bytes (not `char`s), and this PR was intended as pure refactor. But surprised to see there is a small performance bump (no idea why!). There's an additional benefit: Using `consume_char()` everywhere where we believe there's definitely a char there to be consumed will make logic errors produce a panic, rather than silently outputting garbage.	2024-01-31 21:35:46 +08:00
overlookmotel	5279e8955f	refactor(parser): byte handler for illegal bytes (#2229) This adds a separate byte handler to the lexer for byte values which should never be encountered: 1. UTF-8 continuation bytes (i.e. middle of a multi-byte UTF-8 byte sequence). 2. Byte values which are illegal in valid UTF-8 strings. At present, this function is impossible to reach, because `std::str::Chars` ensures the next byte is always the start of a valid UTF-8 byte sequence. But later changes I intend to make, which introduce unsafe code, will make it possible (but highly undesirable!). In the meantime, I don't think it does any harm to handle this case.	2024-01-31 18:57:47 +08:00
overlookmotel	3d79d77b40	refactor(parser): split lexer into multiple files (#2228 ) This PR has a large diff, but it contains no substantive changes whatsoever. It purely breaks up the lexer into multiple smaller files. I've been working quite intensively on the lexer over past few weeks, but still have been finding it hard to make sense of, due to most of the logic currently being contained in [a single 1800-line file](`018675ceb1/crates/oxc_parser/src/lexer/mod.rs`). I feel that breaking it up into multiple files makes it much easier to navigate and understand. An additional benefit is that many functions can have their visibility reduced to module scope, so sub-systems for e.g. lexing numbers have fewer exposed functions. This makes it clearer what the entry points are, and makes it harder to make mistakes when working on the lexer. I intend to later make changes to the lexer for performance which will introduce unsafe code. Keeping that unsafe code encapsulated in modules will make it more viable to validate the workings of that code, and avoid accidental UB. There is one downside to this change. Previously [`lexer/mod.rs`](`018675ceb1/crates/oxc_parser/src/lexer/mod.rs`) was laid out in same order as the JS spec. If you were trying to validate the lexer against the spec, this would make it easier. However, as OXC's parser is fairly mature at this point, and I imagine most spec-compliance issues have been flushed out by now, in my opinion this advantage is less compelling than it probably used to be. So in my view it's outweighed by the benefit of more readable code. Reviewing this could be a bit of a battle due to the size of the diff. I do have further changes I'd like to make, but I've intentionally kept this PR as 100% just: 1. Moving code around. 2. Reducing visibility of functions to module/super scope where that's possible to do without changing anything else. Aside from that, not even a single comment has changed. 
If you're willing to trust me on that promise, I think it can be merged without poring through it line by line.	2024-01-31 11:43:53 +08:00
overlookmotel	81e33a3701	perf(parser): faster offset calculation (#2215 ) A faster way to calculate offset in the lexer. This only moves the needle because it's on the hottest path in the lexer - `Lexer::offset` is called for every token in `Lexer::read_next_token`.	2024-01-30 18:49:31 +08:00
overlookmotel	51ac392ae4	refactor(parser): mark `ByteHandler`s unsafe (#2212 ) All the ASCII `ByteHandler`s are unsafe to call. I forgot to mark them as unsafe when making that change. This PR fixes that, and will make it harder for someone to accidentally call one of them without considering the safety invariants.	2024-01-30 12:23:35 +08:00
overlookmotel	20679d1e1e	perf(parser): pad `Token` to 16 bytes (#2211 ) Counter-intuitively, it seems that increasing the size of `Token` improves performance slightly. This appears to be because when `Token` is 16 bytes, copying `Token` is a single 16-byte load/store. At present, it's 12 bytes which requires an 8-byte load/store + a 4-byte load/store. https://godbolt.org/z/KPYsn3ab7 This suggests that either: 1. #2010 could be reverted at no cost, and the overhead of the hash table removed. or: 2. We need to get `Token` down to 8 bytes! I have an idea how to maybe do (2), so I'd suggest leaving it as is for now until I've been able to research that. NB I also tried putting `#[repr(align(16))]` on `Token` so that copying uses aligned loads/stores. That [hurt the benchmarks very slightly](https://codspeed.io/overlookmotel/oxc/branches/lexer-pad-token), though it might produce a gain on architectures where unaligned loads are more expensive (ARM64 I think?). But I can't test that theory, so have left it out.	2024-01-30 11:47:26 +08:00
overlookmotel	872d751a18	refactor(parser): re-order match branches (#2209 ) Just a tiny bit of code tidying.	2024-01-30 00:53:56 +08:00
overlookmotel	71898ffdd5	refactor(parser): move source length check into lexer (#2206 ) This change makes little difference in itself, but moving the check into the lexer will allow some optimizations in lexer using unsafe code which depend on this invariant.	2024-01-29 22:29:02 +08:00
overlookmotel	e123be0a00	fix(parser): correct MAX_LEN for 32-bit systems (#2204) The maximum length of source the parser can accept is limited on 32-bit systems to `isize::MAX` (i.e. `i32::MAX` not `u32::MAX`) because Rust [limits the size of allocations](https://doc.rust-lang.org/std/alloc/struct.Layout.html#method.from_size_align) to `isize::MAX`. This PR takes that constraint into account when calculating `Parser::MAX_LEN`. It also speeds up the `overlong_source` test so it runs in under 500ms (previously it took ~4 secs on an M1 MacBook Pro).	2024-01-29 21:45:45 +08:00
Dunqing	ea8cc98c34	fix(ast): AccessorProperty is missing decorators (#2176)	2024-01-26 15:43:05 +08:00
renovate[bot]	0be0a234ce	chore(deps): update cargo (#2138 )	2024-01-23 14:48:04 +08:00
overlookmotel	bc7ea0bedb	refactor(parser): make `is_identifier` methods consistent	2024-01-23 11:05:17 +08:00
Dunqing	766ca63aa0	refactor(ast): rename RestElement to BindingRestElement (#2116 ) close: #2115	2024-01-22 14:28:35 +08:00
overlookmotel	36c718ee82	feat(tasks): benchmarks for lexer (#2101 ) This PR adds benchmarks for the lexer. I'm doing some work on optimizing the lexer and I thought it'd be useful to see the effects of changes in isolation, separate from the parser. These benchmarks may not be ideal to keep long-term, but for now it'd be useful. In order to do so, it's necessary for `oxc_parser` crate to expose the lexer, but have done that without adding it to the docs, and using an alias `__lexer`.	2024-01-21 14:32:50 +00:00
Boshen	59e29f286a	chore(parser): explain the reason for omitting "}" and ">" in jsx text lexer (#2097 ) closes #2094	2024-01-20 23:03:44 +08:00
Boshen	3f2b48f1a9	refactor(parser): remove useless string builder from jsx text lexer (#2096 ) relates #2094	2024-01-20 22:34:57 +08:00
Boshen	2f5afff9bd	fix(parser): fix crash on TSTemplateLiteralType in function return position (#2089 ) ``` interface Helpers { inspect(): `~~~~\n${string}\n~~~~`; } ```	2024-01-19 23:14:05 +08:00
overlookmotel	0e32618664	refactor(parser): combine token kinds for skipped tokens (#2072) Small optimization to the lexer. Whitespace, line breaks, and comments are all skipped by `read_next_token()`. At present there's a different `Kind` for each, and `read_next_token()` decides whether to skip with `matches!(kind, Kind::WhiteSpace | Kind::NewLine | Kind::Comment | Kind::MultiLineComment)`. These `Kind`s are used for no other purpose, so there seems little reason to differentiate them. This PR combines them all into `Kind::Skip`, so then the test of whether to skip is reduced to `kind == Kind::Skip`. Only produces ~0.3% performance bump on parser benchmarks. But, why not?...	2024-01-18 21:14:12 +08:00
overlookmotel	8d5f5b8a49	refactor(parser): macro for ASCII byte handlers (#2066) As discussed on #2046, it wasn't ideal to have `unsafe { lexer.consume_ascii_char() }` in every byte handler. It also wasn't great to have a safe function `consume_ascii_char()` which could cause UB if called incorrectly (so wasn't really safe at all). This PR achieves the same objective as #2046, but using a macro to define byte handlers for ASCII chars, which builds in the assertion that next char is guaranteed to be ASCII. Before #2046: ```rs const SPS: ByteHandler = |lexer| { lexer.consume_char(); Kind::WhiteSpace }; ``` After this PR: ```rs ascii_byte_handler!(SPS(lexer) { lexer.consume_char(); Kind::WhiteSpace }); ``` i.e. The bodies of the handlers are unchanged from how they were before https://github.com/oxc-project/oxc/pull/2046. This expands to: ```rs const SPS: ByteHandler = |lexer| { unsafe { let s = lexer.current.chars.as_str(); assert_unchecked!(!s.is_empty()); assert_unchecked!(s.as_bytes()[0] < 128); } lexer.consume_char(); Kind::WhiteSpace }; ``` But due to the assertions the macro inserts, `consume_char()` is now optimized for ASCII characters, and reduces to a single instruction. So the `consume_ascii_char()` function introduced by #2046 is unnecessary, and can be removed again. The "boundary of unsafe" is moved to a new function `handle_byte()` which `read_next_token()` calls. `read_next_token()` is responsible for upholding the safety invariants, which include ensuring that `ascii_byte_handler!()` macro is not being misused (that last part is strictly speaking a bit of a cheat, but...). I am not a fan of macros, as they're not great for readability. But in this case I don't think it's too bad, because: 1. The macro is well-documented. 2. It's not too clever (only one syntax is accepted). 3. It's used repetitively in a clear pattern, and once you've understood one, you understand them all. What do you think? 
Does this strike a reasonable balance between readability and safety?	2024-01-17 15:29:15 +08:00
overlookmotel	408acb90e6	refactor(parser): lexer handle unicode without branch (#2039 ) As suggested by @strager in https://github.com/oxc-project/oxc/pull/2025#pullrequestreview-1820273832, this PR adds `BYTE_HANDLERS` for first bytes of unicode characters. This removes a branch from `read_next_token()` and produces a +1% speed-up on parser benchmarks.	2024-01-16 13:14:22 +08:00
overlookmotel	66a7a68f9f	perf(parser): lexer byte handlers consume ASCII chars faster (#2046 ) In the lexer, most `BYTE_HANDLER`s immediately consume the current char with `lexer.consume_char()`. Byte handlers are only called if there's a certain value (or range of values) for the next char. This is their entire purpose. So in all cases we know for sure that we're not at EOF, and that the next char is a single-byte ASCII character. The compiler, however, doesn't seem to be able to "see through" the `BYTE_HANDLERS[byte](self)` call and understand these invariants. So it produces very verbose ASM for `lexer.consume_char()`. This PR replaces `lexer.consume_char()` in the byte handlers with an unsafe `lexer.consume_ascii_char()` which skips on to next char with a single `inc` instruction. The difference in codegen can be seen here: https://godbolt.org/z/1ha3cr9W5 (compare the 2 x `core::ops::function::FnOnce::call_once` handlers). Downside is that this does introduce a lot of unsafe blocks, but in my opinion they're all pretty trivial to validate. --------- Co-authored-by: Boshen <boshenc@gmail.com>	2024-01-16 12:31:45 +08:00
Boshen	09c7570560	ci: use miri to detect memory leak for the parser (#2037 ) We'll merge this and then eventually turn it on as a nightly check, it's a manual run for now.	2024-01-15 15:11:02 +00:00
overlookmotel	b4d76f0b0d	refactor(parser): remove noop code (#2028 ) This PR removes some code from the lexer which doesn't do anything.	2024-01-14 23:48:35 +08:00
overlookmotel	60a927d8f5	perf(parser): lexer match byte not char (#2025 ) 2 related changes to lexer's `read_next_token()`: 1. Hint to branch predictor that unicode identifiers and non-standard whitespace are rare by marking that branch `#[cold]`. 2. The branch is on whether next character is ASCII or not. This check only requires reading 1 byte, as ASCII characters are always single byte in UTF8. So only do the work of getting a `char` in the cold path, once it's established that character is not ASCII and this work is required.	2024-01-14 18:50:11 +08:00
Boshen	1886a5b838	perf(parser): reduce `Token` size from 16 to 12 bytes (#2010) I also had to change how the strings for private identifiers are built, otherwise they will always be allocated.	2024-01-13 12:42:39 +08:00
overlookmotel	6996948825	refactor(parser): remove extraneous code from regex parsing (#2008) This PR removes some code in parsing regexp flags which is extraneous: ```rs if !ch.is_ascii_lowercase() { self.error(diagnostics::RegExpFlag(ch, self.current_offset())); continue; } ``` Which is followed by: ```rs let flag = if let Ok(flag) = RegExpFlags::try_from(ch) { flag } else { self.error(diagnostics::RegExpFlag(ch, self.current_offset())); continue; }; ``` `!ch.is_ascii_lowercase()` is equivalent to `ch < 'a' || ch > 'z'`. The compiler implements `RegExpFlags::try_from(ch)` as `ch < 'd' || ch > 'y'` and then a jump table. So `ch.is_ascii_lowercase()` does nothing that `RegExpFlags::try_from(ch)` doesn't do already. https://godbolt.org/z/51GPPY9nx (this PR built on top of #2007 for ease)	2024-01-13 02:34:05 +00:00
overlookmotel	712e99cf9b	fix(parser): restore regex flag parsing (#2007 ) As discussed in https://github.com/oxc-project/oxc/pull/1999#issuecomment-1888916383, this PR restores some of regex parsing behavior to as it was prior to #1926.	2024-01-13 03:19:33 +08:00
Boshen	61f37ea973	Publish crates v0.5.0	2024-01-12 23:33:47 +08:00
Boshen	aa91fde1d9	refactor(parser): only allocate for escaped template strings (#2005 )	2024-01-12 18:56:36 +08:00
Boshen	38f86b0cac	refactor(parser): remove string builder from number parsing (#2002 ) The builder was used to build an allocated string for numbers with underscores, this is no longer required because it is now allocated on demand. `0d77e1e788/crates/oxc_parser/src/lexer/number.rs (L32)`	2024-01-12 17:01:51 +08:00
overlookmotel	c7316856db	refactor(parser): reduce work parsing regexps (#1999 ) #1926 produced a small performance regression because when parsing a regexp, some work is repeated.	2024-01-12 11:36:30 +08:00
Boshen	4706765d2a	refactor(parser): reduce `Token` size from 32 to 16 bytes (#1962) Part of #1880 `Token` size is reduced from 32 to 16 bytes by changing the previous token value `Option<&'a str>` to a u32 index handle. It would be nice if this handle were eliminated entirely, because the normal case for a string is always `&source_text[token.span.start..token.span.end]`. Unfortunately, JavaScript allows escaped characters to appear in identifiers, strings and templates. These strings need to be unescaped for equality checks, i.e. `"\a" === "a"`. This leads us to adding an `escaped_strings[]` vec for storing these unescaped and allocated strings. Performance regression for adding this vec should be minimal because escaped strings are rare. Background Reading: * https://floooh.github.io/2018/06/17/handles-vs-pointers.html	2024-01-09 15:17:02 +08:00
Boshen	6e0bd52af1	refactor(parser): remove TokenValue::Number from Token (#1945 ) This PR is part of #1880. Token size is reduced from 40 to 32 bytes.	2024-01-08 16:29:03 +08:00
Dunqing	b50c5ec623	fix(parser): unexpected ts type annotation in get/set (#1942 ) fix: https://github.com/oxc-project/oxc/issues/1939	2024-01-08 15:07:43 +08:00
Boshen	08438e04ba	refactor(parser): remove TokenValue::RegExp from `Token` (#1926) This PR is part of #1880. `Token` size is reduced from 48 to 40 bytes. To reconstruct the regex pattern and flags within the parser, the regex string is re-parsed from the end by reading all valid flags. In order to make things work nicely, the lexer will no longer recover from an invalid regex.	2024-01-08 13:48:52 +08:00
Boshen	7eb2573178	refactor(parser): parse BigInt lazily (#1924) This PR partially fixes #1803 and is part of #1880. BigInt is removed from the `Token` value, so that the token size can be reduced once we have removed all the variants. `Token` is now also `Copy`, which removes all the `clone` and `drop` calls. This yields a 5% performance improvement for the parser.	2024-01-08 12:37:20 +08:00
overlookmotel	eb2966c512	fix(parser): fix incorrectly identified directives (#1885 ) Parser incorrectly identifies string literals as directives if they follow after `import`s, `export`s, or decorators. In all of these cases, `'use strict'` produces a directive in the AST, where it should be parsed as an `ExpressionStatement` containing a `StringLiteral`: ```js import x from 'foo'; 'use strict'; ``` ```js export {x}; 'use strict'; ``` ```js @foo 'use strict'; ``` [Playground](https://oxc-project.github.io/oxc/playground/?code=3YCAAIC0gICAgICAgIC0G8rnONK89ITJ3zrK%2FUP7OmSZPgHQzStr3yMtwFTU%2BD1WPt09JgqZJLoYooydbGsM5vGcf34BnIA%3D) This PR should fix that. I'm not sure about the decorator case, though. I assume it's not a directive. But is prefixing a string literal with a decorator even legal syntax anyway? And a side nit: If I'm reading it right, I don't think the `continue` statement in the decorator arm of the match does anything. Do I have that right? Last question: Where does one go about putting a test? I guess these silly cases aren't covered by Babel etc's tests. --------- Co-authored-by: Boshen <boshenc@gmail.com>	2024-01-04 13:39:15 +00:00
Dunqing	c3090c2c70	fix(parser): terminate parsing if an EmptyParenthesizedExpression error occurs (#1874 ) close: https://github.com/oxc-project/oxc/issues/1870#issue-2061901976	2024-01-03 11:34:14 +08:00
overlookmotel	62bc8c5cea	fix(parser): error on source larger than 4 GiB (#1860 ) `Token` and `Span` both represent `start` and `end` as `u32`. This limits size of source which can be parsed to `u32::MAX`. `19577709db/crates/oxc_span/src/span.rs (L14-L20)` However, this constraint is currently not enforced. In a release build, code will not panic on arithmetic overflow, so `start`/`end` could wrap around back to zero if source is 4 GiB or more. That'd produce nonsense spans. But worse, the lexer relies in some places on `self.current.token.start` being correct, so if the value wrapped around, possibly it'd keep rewinding to the start of the source and lexing it again, causing an infinite loop. In worst case, if for some reason an application's public API used OXC's parser with user-supplied source code (parser-as-a-service!), this could be exploited for denial of service. This PR adds an assertion to catch this at the start of parsing instead. This does add an extra instruction, but I imagine the effect will be negligible compared to the work required to parse the code.	2024-01-02 11:05:28 +08:00
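
Note on the last entry above (#1860) and on #2204: the two commits together describe a length limit checked once before parsing, so that `u32` span offsets can never wrap. The following is only a minimal sketch of that idea under stated assumptions. The names `MAX_LEN`, `OverlongSource` and `check_len` are hypothetical and chosen for illustration; this is not the code from `oxc_parser`.

```rust
/// Largest source length (in bytes) this sketch accepts.
/// Assumption for illustration: offsets are stored as `u32`, and on 32-bit
/// targets Rust additionally caps allocations at `isize::MAX` bytes.
pub const MAX_LEN: usize = if usize::BITS >= 64 {
    // 64-bit target: the `u32` offsets are the tighter limit.
    u32::MAX as usize
} else {
    // 32-bit target: the allocation limit is the tighter one.
    isize::MAX as usize
};

/// Hypothetical error type reporting the rejected length.
#[derive(Debug, PartialEq, Eq)]
pub struct OverlongSource {
    pub len: usize,
}

/// Validate a source length once, up front, and return it as a `u32`.
/// After this check, every offset into the source fits in a `u32`.
pub fn check_len(len: usize) -> Result<u32, OverlongSource> {
    if len > MAX_LEN {
        Err(OverlongSource { len })
    } else {
        // Cannot truncate: `len <= MAX_LEN <= u32::MAX`.
        Ok(len as u32)
    }
}
```

Without such a check, an offset one past `u32::MAX` wraps to zero in a release build (`u32::MAX.wrapping_add(1) == 0`), which is the nonsense-span scenario the commit message describes. Calling `check_len(source.len())` before lexing turns that silent wrap into an explicit error.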

283 commits