Commit graph

217 commits

Author SHA1 Message Date
overlookmotel
bc7ea0bedb refactor(parser): make is_identifier methods consistent 2024-01-23 11:05:17 +08:00
Dunqing
766ca63aa0
refactor(ast): rename RestElement to BindingRestElement (#2116)
close: #2115
2024-01-22 14:28:35 +08:00
overlookmotel
36c718ee82
feat(tasks): benchmarks for lexer (#2101)
This PR adds benchmarks for the lexer. I'm doing some work on optimizing
the lexer and I thought it'd be useful to see the effects of changes in
isolation, separate from the parser.

These benchmarks may not be ideal to keep long-term, but for now it'd be
useful.

In order to do so, it's necessary for the `oxc_parser` crate to expose the
lexer, but I have done that without adding it to the docs, and using an
alias `__lexer`.
2024-01-21 14:32:50 +00:00
Boshen
59e29f286a
chore(parser): explain the reason for omitting "}" and ">" in jsx text lexer (#2097)
closes #2094
2024-01-20 23:03:44 +08:00
Boshen
3f2b48f1a9
refactor(parser): remove useless string builder from jsx text lexer (#2096)
relates #2094
2024-01-20 22:34:57 +08:00
Boshen
2f5afff9bd
fix(parser): fix crash on TSTemplateLiteralType in function return position (#2089)
```
interface Helpers {
  inspect(): `~~~~\n${string}\n~~~~`;
}
```
2024-01-19 23:14:05 +08:00
overlookmotel
0e32618664
refactor(parser): combine token kinds for skipped tokens (#2072)
Small optimization to the lexer.

Whitespace, line breaks, and comments are all skipped by
`read_next_token()`.

At present there's a different `Kind` for each, and `read_next_token()`
decides whether to skip with `matches!(kind, Kind::WhiteSpace |
Kind::NewLine | Kind::Comment | Kind::MultiLineComment)`.

These `Kind`s are used for no other purpose, so there seems little
reason to differentiate them.

This PR combines them all into `Kind::Skip`, so then the test of whether
to skip is reduced to `kind == Kind::Skip`.

Only produces ~0.3% performance bump on parser benchmarks. But, why
not?...
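
A minimal sketch of the idea, not the actual oxc lexer. The toy `Kind` enum, `read_raw_token()` and `count_idents()` below are hypothetical and only recognise ASCII whitespace and runs of other bytes; the point is just that a single skippable kind turns the skip test into one equality comparison.

```rust
// Hypothetical, heavily simplified token kinds; not the real oxc `Kind` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Skip,  // whitespace, line breaks and comments would all map to this
    Ident, // any run of non-whitespace bytes in this toy lexer
    Eof,
}

// Lexes one raw token starting at `*pos`, including skippable ones.
fn read_raw_token(bytes: &[u8], pos: &mut usize) -> Kind {
    match bytes.get(*pos) {
        None => Kind::Eof,
        Some(b) if b.is_ascii_whitespace() => {
            *pos += 1;
            Kind::Skip
        }
        Some(_) => {
            while *pos < bytes.len() && !bytes[*pos].is_ascii_whitespace() {
                *pos += 1;
            }
            Kind::Ident
        }
    }
}

// With a single skippable kind, the skip test is one equality comparison.
fn read_next_token(bytes: &[u8], pos: &mut usize) -> Kind {
    loop {
        let kind = read_raw_token(bytes, pos);
        if kind != Kind::Skip {
            return kind;
        }
    }
}

// Counts the non-skipped tokens before EOF.
fn count_idents(src: &str) -> usize {
    let bytes = src.as_bytes();
    let mut pos = 0;
    let mut count = 0;
    while read_next_token(bytes, &mut pos) == Kind::Ident {
        count += 1;
    }
    count
}
```

Calling `count_idents("let  x\n= 1")` walks the whole input through `read_next_token()`, which never returns `Kind::Skip` to its caller.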
2024-01-18 21:14:12 +08:00
overlookmotel
8d5f5b8a49
refactor(parser): macro for ASCII byte handlers (#2066)
As discussed on #2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).

This PR achieves the same objective of #2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.

Before #2046:

```rs
const SPS: ByteHandler = |lexer| {
  lexer.consume_char();
  Kind::WhiteSpace
};
```

After this PR:

```rs
ascii_byte_handler!(SPS(lexer) {
  lexer.consume_char();
  Kind::WhiteSpace
});
```

i.e. the bodies of the handlers are unchanged from how they were before
https://github.com/oxc-project/oxc/pull/2046.

This expands to:

```rs
const SPS: ByteHandler = |lexer| {
  unsafe {
    let s = lexer.current.chars.as_str();
    assert_unchecked!(!s.is_empty());
    assert_unchecked!(s.as_bytes()[0] < 128);
  }
  lexer.consume_char();
  Kind::WhiteSpace
};
```

But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by #2046 is unnecessary,
and can be removed again.

The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).

I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:

1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.

What do you think? Does this strike a reasonable balance between
readability and safety?
2024-01-17 15:29:15 +08:00
overlookmotel
408acb90e6
refactor(parser): lexer handle unicode without branch (#2039)
As suggested by @strager in
https://github.com/oxc-project/oxc/pull/2025#pullrequestreview-1820273832,
this PR adds `BYTE_HANDLERS` for first bytes of unicode characters.

This removes a branch from `read_next_token()` and produces a +1%
speed-up on parser benchmarks.
2024-01-16 13:14:22 +08:00
overlookmotel
66a7a68f9f
perf(parser): lexer byte handlers consume ASCII chars faster (#2046)
In the lexer, most `BYTE_HANDLER`s immediately consume the current char
with `lexer.consume_char()`.

Byte handlers are only called if there's a certain value (or range of
values) for the next char. This is their entire purpose. So in all cases
we know for sure that we're not at EOF, and that the next char is a
single-byte ASCII character.

The compiler, however, doesn't seem to be able to "see through" the
`BYTE_HANDLERS[byte](self)` call and understand these invariants. So it
produces very verbose ASM for `lexer.consume_char()`.

This PR replaces `lexer.consume_char()` in the byte handlers with an
unsafe `lexer.consume_ascii_char()` which skips on to next char with a
single `inc` instruction.

The difference in codegen can be seen here:
https://godbolt.org/z/1ha3cr9W5 (compare the 2 x
`core::ops::function::FnOnce::call_once` handlers).

Downside is that this does introduce a lot of unsafe blocks, but in my
opinion they're all pretty trivial to validate.

---------

Co-authored-by: Boshen <boshenc@gmail.com>
2024-01-16 12:31:45 +08:00
Boshen
09c7570560
ci: use miri to detect memory leak for the parser (#2037)
We'll merge this and eventually turn it on as a nightly check; it's
a manual run for now.
2024-01-15 15:11:02 +00:00
overlookmotel
b4d76f0b0d
refactor(parser): remove noop code (#2028)
This PR removes some code from the lexer which doesn't do anything.
2024-01-14 23:48:35 +08:00
overlookmotel
60a927d8f5
perf(parser): lexer match byte not char (#2025)
2 related changes to lexer's `read_next_token()`:

1. Hint to branch predictor that unicode identifiers and non-standard
whitespace are rare by marking that branch `#[cold]`.

2. The branch is on whether next character is ASCII or not. This check
only requires reading 1 byte, as ASCII characters are always single byte
in UTF8. So only do the work of getting a `char` in the cold path, once
it's established that character is not ASCII and this work is required.
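
A rough sketch of both points, assuming a hypothetical helper `next_char_len()` rather than the real `read_next_token()`: the hot path inspects a single byte, and the full `char` decode lives in a separate function marked `#[cold]`.

```rust
// Cold path: only reached for non-ASCII input, so decoding a full `char` here is fine.
#[cold]
fn non_ascii_len(rest: &str) -> usize {
    rest.chars().next().map_or(0, char::len_utf8)
}

// Returns the byte length of the next character in `rest` (0 at end of input).
// ASCII characters are always a single byte in UTF-8, so the common case
// needs to read only one byte and never builds a `char`.
fn next_char_len(rest: &str) -> usize {
    match rest.as_bytes().first() {
        None => 0,
        Some(&byte) if byte < 128 => 1,
        Some(_) => non_ascii_len(rest),
    }
}
```

The names `next_char_len` and `non_ascii_len` are illustrative only; in the PR the cold branch handles unicode identifiers and non-standard whitespace.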
2024-01-14 18:50:11 +08:00
Boshen
1886a5b838
perf(parser): reduce Token size from 16 to 12 bytes (#2010)
I also had to change how the strings for private identifiers are built,
otherwise they will always be allocated.
2024-01-13 12:42:39 +08:00
overlookmotel
6996948825
refactor(parser): remove extraneous code from regex parsing (#2008)
This PR removes some code in parsing regexp flags which is extraneous:

```rs
if !ch.is_ascii_lowercase() {
  self.error(diagnostics::RegExpFlag(ch, self.current_offset()));
  continue;
}
```

Which is followed by:

```rs
let flag = if let Ok(flag) = RegExpFlags::try_from(ch) {
  flag
} else {
  self.error(diagnostics::RegExpFlag(ch, self.current_offset()));
  continue;
};
```

`!ch.is_ascii_lowercase()` is equivalent to `ch < 'a' || ch > 'z'`. The
compiler implements `RegExpFlags::try_from(ch)` as `ch < 'd' || ch >
'y'` and then a jump table. So `ch.is_ascii_lowercase()` does nothing
that `RegExpFlags::try_from(ch)` doesn't do already.

https://godbolt.org/z/51GPPY9nx

(this PR built on top of #2007 for ease)
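
The redundancy can also be checked by brute force. This is only a sketch: `Flag` is a hypothetical stand-in for `RegExpFlags`, and the set of accepted flag characters is an assumption chosen to be consistent with the `'d'..='y'` range mentioned above.

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for `RegExpFlags`; one variant per accepted flag character.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Flag {
    D,
    G,
    I,
    M,
    S,
    U,
    V,
    Y,
}

impl TryFrom<char> for Flag {
    type Error = char;

    fn try_from(ch: char) -> Result<Self, Self::Error> {
        match ch {
            'd' => Ok(Flag::D),
            'g' => Ok(Flag::G),
            'i' => Ok(Flag::I),
            'm' => Ok(Flag::M),
            's' => Ok(Flag::S),
            'u' => Ok(Flag::U),
            'v' => Ok(Flag::V),
            'y' => Ok(Flag::Y),
            _ => Err(ch),
        }
    }
}

// True if every char rejected by the lowercase pre-check is also rejected
// by `try_from`, i.e. the pre-check never changes the outcome.
fn precheck_is_redundant() -> bool {
    (0..=u32::from(char::MAX))
        .filter_map(char::from_u32)
        .filter(|ch| !ch.is_ascii_lowercase())
        .all(|ch| Flag::try_from(ch).is_err())
}
```

Because both checks report the same error, dropping the first one leaves the behaviour unchanged for every possible `char`.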
2024-01-13 02:34:05 +00:00
overlookmotel
712e99cf9b
fix(parser): restore regex flag parsing (#2007)
As discussed in
https://github.com/oxc-project/oxc/pull/1999#issuecomment-1888916383,
this PR restores some of the regex parsing behavior to how it was prior to
#1926.
2024-01-13 03:19:33 +08:00
Boshen
aa91fde1d9
refactor(parser): only allocate for escaped template strings (#2005) 2024-01-12 18:56:36 +08:00
Boshen
38f86b0cac
refactor(parser): remove string builder from number parsing (#2002)
The builder was used to build an allocated string for numbers with
underscores, this is no longer required because it is now allocated on
demand.


0d77e1e788/crates/oxc_parser/src/lexer/number.rs (L32)
2024-01-12 17:01:51 +08:00
overlookmotel
c7316856db
refactor(parser): reduce work parsing regexps (#1999)
#1926 produced a small performance regression because when parsing a
regexp, some work is repeated.
2024-01-12 11:36:30 +08:00
Boshen
4706765d2a
refactor(parser): reduce Token size from 32 to 16 bytes (#1962)
Part of #1880

`Token` size is reduced from 32 to 16 bytes by changing the previous
token value `Option<&'a str>` to a u32 index handle.

It would be nice if this handle could be eliminated entirely, because
the normal case for a string is always
`&source_text[token.span.start..token.span.end]`.

Unfortunately, JavaScript allows escaped characters to appear in
identifiers, strings and templates. These strings need to be unescaped
for equality checks, i.e. `"\a" === "a"`.

This leads us to adding an `escaped_strings[]` vec for storing these
unescaped and allocated strings.

Performance regression for adding this vec should be minimal because
escaped strings are rare.

Background Reading:

* https://floooh.github.io/2018/06/17/handles-vs-pointers.html
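
A minimal sketch of the handle approach described above, under stated assumptions: the `Token` layout, the `NO_ESCAPE` sentinel and the method names are all hypothetical and are not taken from the oxc source.

```rust
// Hypothetical token layout: a span plus a u32 handle instead of `Option<&str>`.
#[derive(Debug, Clone, Copy)]
struct Token {
    start: u32,
    end: u32,
    // Index into `escaped_strings`, or `NO_ESCAPE` for the common case.
    escaped: u32,
}

// Sentinel meaning "the value is just the source slice" (an assumption of this sketch).
const NO_ESCAPE: u32 = u32::MAX;

struct Lexer<'a> {
    source_text: &'a str,
    // Only tokens containing escapes allocate an entry here.
    escaped_strings: Vec<String>,
}

impl<'a> Lexer<'a> {
    fn new(source_text: &'a str) -> Self {
        Lexer { source_text, escaped_strings: Vec::new() }
    }

    // Common case: no allocation, the value is read straight from the source.
    fn plain_token(&self, start: u32, end: u32) -> Token {
        Token { start, end, escaped: NO_ESCAPE }
    }

    // Rare case: store the unescaped string and hand back its index.
    fn escaped_token(&mut self, start: u32, end: u32, unescaped: String) -> Token {
        let escaped = self.escaped_strings.len() as u32;
        self.escaped_strings.push(unescaped);
        Token { start, end, escaped }
    }

    // Resolves a token to its string value via the span or the handle.
    fn token_value(&self, token: Token) -> &str {
        if token.escaped == NO_ESCAPE {
            &self.source_text[token.start as usize..token.end as usize]
        } else {
            self.escaped_strings[token.escaped as usize].as_str()
        }
    }
}
```

The token itself stays `Copy` and holds no borrowed data; the cost of the extra vec is paid only when an escape sequence is actually present.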
2024-01-09 15:17:02 +08:00
Boshen
6e0bd52af1
refactor(parser): remove TokenValue::Number from Token (#1945)
This PR is part of #1880.

Token size is reduced from 40 to 32 bytes.
2024-01-08 16:29:03 +08:00
Dunqing
b50c5ec623
fix(parser): unexpected ts type annotation in get/set (#1942)
fix: https://github.com/oxc-project/oxc/issues/1939
2024-01-08 15:07:43 +08:00
Boshen
08438e04ba
refactor(parser): remove TokenValue::RegExp from Token (#1926)
This PR is part of #1880.

`Token` size is reduced from 48 to 40 bytes.

To reconstruct the regex pattern and flags within the parser, the regex
string is re-parsed from the end by reading all valid flags.

In order to make things work nicely, the lexer will no longer recover
from an invalid regex.
2024-01-08 13:48:52 +08:00
Boshen
7eb2573178
refactor(parser): parse BigInt lazily (#1924)
This PR partially fixes #1803 and is part of #1880.

BigInt is removed from the `Token` value, so that the token size can be
reduced once we have removed all the variants.

`Token` is now also `Copy`, which removes all the `clone` and `drop`
calls.

This yields 5% performance improvement for the parser.
2024-01-08 12:37:20 +08:00
overlookmotel
eb2966c512
fix(parser): fix incorrectly identified directives (#1885)
Parser incorrectly identifies string literals as directives if they
follow after `import`s, `export`s, or decorators.

In all of these cases, `'use strict'` produces a directive in the AST,
where it should be parsed as an `ExpressionStatement` containing a
`StringLiteral`:

```js
import x from 'foo';
'use strict';
```

```js
export {x};
'use strict';
```

```js
@foo
'use strict';
```


[Playground](https://oxc-project.github.io/oxc/playground/?code=3YCAAIC0gICAgICAgIC0G8rnONK89ITJ3zrK%2FUP7OmSZPgHQzStr3yMtwFTU%2BD1WPt09JgqZJLoYooydbGsM5vGcf34BnIA%3D)

This PR should fix that.

I'm not sure about the decorator case, though. I assume it's not a
directive. But is prefixing a string literal with a decorator even legal
syntax anyway?

And a side nit: If I'm reading it right, I don't think the `continue`
statement in the decorator arm of the match does anything. Do I have
that right?

Last question: Where does one go about putting a test? I guess these
silly cases aren't covered by Babel etc's tests.

---------

Co-authored-by: Boshen <boshenc@gmail.com>
2024-01-04 13:39:15 +00:00
Dunqing
c3090c2c70
fix(parser): terminate parsing if an EmptyParenthesizedExpression error occurs (#1874)
close: https://github.com/oxc-project/oxc/issues/1870#issue-2061901976
2024-01-03 11:34:14 +08:00
overlookmotel
62bc8c5cea
fix(parser): error on source larger than 4 GiB (#1860)
`Token` and `Span` both represent `start` and `end` as `u32`.

This limits size of source which can be parsed to `u32::MAX`.


19577709db/crates/oxc_span/src/span.rs (L14-L20)

However, this constraint is currently not enforced.

In a release build, code will not panic on arithmetic overflow, so
`start`/`end` could wrap around back to zero if source is 4 GiB or more.

That'd produce nonsense spans. But worse, the lexer relies in some
places on `self.current.token.start` being correct, so if the value
wrapped around, possibly it'd keep rewinding to the start of the source
and lexing it again, causing an infinite loop.

In the worst case, if for some reason an application's public API used OXC's
parser with user-supplied source code (parser-as-a-service!), this could
be exploited for denial of service.

This PR adds an assertion to catch this at the start of parsing instead.

This does add an extra instruction, but I imagine the effect will be
negligible compared to the work required to parse the code.
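
A small sketch of the failure mode and the guard. The function names are hypothetical, and where the PR uses an assertion this sketch returns a `Result` so the check can be exercised without panicking.

```rust
// Offsets are stored as `u32`, so the largest representable source length is `u32::MAX`.
fn source_fits_in_u32(len: u64) -> bool {
    len <= u64::from(u32::MAX)
}

// What unchecked offset arithmetic does in a release build: it wraps around
// instead of panicking, which is how a span could jump back to zero.
fn advance_offset_wrapping(offset: u32, by: u32) -> u32 {
    offset.wrapping_add(by)
}

// Hypothetical up-front guard run once before parsing starts.
fn check_source(source: &str) -> Result<(), String> {
    let len = source.len() as u64;
    if source_fits_in_u32(len) {
        Ok(())
    } else {
        Err(format!("source is {len} bytes, which does not fit in u32 offsets"))
    }
}
```

The guard is a single comparison on the source length, so its cost is independent of how much code there is to parse.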
2024-01-02 11:05:28 +08:00
Deivid Almeida
c1cfd1759e
feat(linter): no-irregular-whitespace rule (#1835)
The parser, trivias and trivias_builder were edited to collect all whitespace.
The Trivias struct now stores a whitespaces Vec alongside comments. After this,
I will implement the no-irregular-whitespace rule.

P.S.: There isn't a way to implement this feature without losing a little
bit of performance. Compared with my last PR, #1819, to minimize the cost
the irregular whitespace is now stored as u32 instead of Span, and I removed
a map iterator and an unused function. If you have any suggestions, please
give me feedback.
2023-12-31 12:05:38 +08:00
IWANABETHATGUY
4bbc977971
chore: upgrade rustc toolchain to stable 1.75.0 (#1853)
ref: https://blog.rust-lang.org/2023/12/28/Rust-1.75.0.html
2023-12-29 12:20:51 +08:00
overlookmotel
19577709db
Remove redundant code from lexer (#1850)
Just removes a couple of lines of redundant code from the lexer.

A note on the 2nd one:

```rs
let mut builder = AutoCow::new(lexer);
let c = lexer.consume_char();
builder.push_matching(c);
```

`push_matching()` is a no-op unless
`force_allocation_without_current_ascii_char()` has already been called.
Here the `AutoCow` has just been freshly created, so we know it hasn't.
2023-12-29 10:07:21 +08:00
overlookmotel
1feec95a94
fix(parser) fix typo in expecting_directives variable name (#1801)
Renames `expecting_diretives` to `expecting_directives` to fix the spelling.
2023-12-24 16:51:02 +00:00
magic-akari
5b2696b711
refactor(parser): report this parameter error (#1788)
- follow up: #1728
2023-12-23 22:09:14 +08:00
Boshen
2b4d1bf142
fix(parser): await in jsx expression
closes #1740
2023-12-19 20:23:16 +08:00
magic-akari
a2858ed452
refactor(ast): introduce ThisParameter (#1728)
Most TypeScript types can be eliminated during the code generation phase
by not printing the corresponding AST nodes.
The changes in this PR enable applying a similar technique to the `this`
parameter.
2023-12-19 13:20:33 +08:00
Boshen
19e77b0af3
fix(parser): false postive for "Missing initializer in const declaration" in declare + namespace (#1724)
closes #1723
2023-12-18 17:03:42 +08:00
Boshen
8edcab82f2
chore(lexer): document the accessor keyword 2023-12-14 12:55:55 +08:00
Boshen
1554f7c0d2
feat(parsr): parse let.a = 1 with error recovery (#1587) 2023-11-29 23:21:39 +08:00
Boshen
9842be4461
refactor(parser): remove duplicated code 2023-11-29 18:23:32 +08:00
Boshen
6670d94708
chore(rust): remove unnecessary clippy::non_upper_case_globals (#1557) 2023-11-27 14:31:38 +08:00
magic-akari
9ff0ffcc6f
feat(ast): implement new proposal-import-attributes (#1476)
- [Import Attributes](https://tc39.es/proposal-import-attributes)
2023-11-25 15:56:09 +08:00
Boshen
567c6ed757
feat(prettier): print directives (#1497) 2023-11-22 19:39:25 +08:00
JonaAnders
08164b0e18
refactor(parser) Updated comments mentioning the ecma specification section 12.x (#1496)
The ECMA specification seems to have added the "Tokens" section to the
specification as 12.6. This pushed all the other sections down,
resulting in e.g. the former 12.6 now being 12.7. Comments in the parser
mention this part of the specification, so all the mentions of section
12.6+ are now outdated. This pull request tries to fix that by
updating all the comments.
2023-11-22 19:29:04 +08:00
Boshen
07b010912a
feat(parser): add preserve_parens option (default: true) (#1474)
closes #1461
2023-11-21 11:16:30 +08:00
magic-akari
a7e0706dbc
fix(parser): correct import_kind of TSImportEqualsDeclaration (#1449) 2023-11-20 16:57:38 +08:00
Boshen
0218ae8641
feat(prettier): print leading comments with newlines (#1434) 2023-11-19 22:46:55 +08:00
Jon Surrell
cb804d3cd2
Add base to AST BigintLiteral (#1416) 2023-11-19 11:11:19 +08:00
magic-akari
445352991f
fix(parser): Fix type import (#1291)
- fix: #1288 
- fix: #1289
2023-11-14 15:17:58 +08:00
magic-akari
9c0aafcd1c
fix(parser): Disallow ReservedWord in NamedExports (#1230)
- fix: #1222

---------

Co-authored-by: Boshen <boshenc@gmail.com>
2023-11-12 10:52:02 +00:00
magic-akari
8afb81aa34
fix(parser): ASI of async class member (#1214)
Co-authored-by: Boshen <boshenc@gmail.com>
2023-11-10 16:21:51 +00:00
Boshen
a455c81db6
fix(linter): revert changes to JSX attribute strings (#1101) 2023-10-30 15:26:04 +08:00