`LineOffsetTables` records mappings from byte offset to line and column numbers (with column number in UTF-16 characters).
Most lines do not contain any Unicode characters, and for these lines there is an exact correspondence between number of bytes from start of line and UTF-16 column number, so no column lookup table is required.
Reduce the data stored for each line from 32 bytes to 8 bytes by storing column offset lookup tables for the rare lines which do contain Unicode chars separately.
Additionally, store column lookup tables as a `Box<[u32]>` instead of `Vec<u32>` to reduce the size of `ColumnOffsets` by 8 bytes.
Oxc have a limit on size of source files of 4 GiB, so `u32` is sufficient to hold line and column offsets. Use `u32` for these values in `LineOffsetTable`, which reduces size of the type by 8 bytes.
The ast span is not ordering at rolldown, eg the module original ast is
`a,b,c`, after mutate could be `b,c,a`. So here revert changes from
[here](https://github.com/oxc-project/oxc/pull/2728).
The sourcemap implement port from
[rust-sourcemap](https://github.com/getsentry/rust-sourcemap), but has
some different with it.
- Encode sourcemap at parallel, including quote `sourceContent` and
encode token to `vlq` mappings.
- Avoid `Sourcemap` some methods overhead, like `SourceMap::tokens()`
caused extra overhead at common cases. Here using `SourceViewToken` to
instead of it.
This PR optimizes the `update_generated_line_and_column` function used in generating source maps.
The main change is that `generated_column` only depends on the last line, so just spin through bytes to find the last line break, and then only convert last line UTF-8 to UTF-16. There's also a fast path for when last line is ASCII, to avoid iterating over the last line twice in that common case.
Speed up building the line tables for source maps, using same kind of techniques as have been using in the lexer:
* Iterate byte-by-byte not char-by-char (`chars` iterator is slow).
* Fast path for ASCII (common case).
Fix creating a sourcemap mapping when last byte of output in `\r`. Currently this panics because it assumes there's another byte after it when checking for `\n`, and reads out of bounds.
Fix source mapping of Window-style line breaks in presence of Unicode chars.
`content.chars().nth(i + 1)` gets the `i + 1`th *char*, but `i` is a byte offset not a char offset.
The replacement `content.as_bytes().get(i + 1)` gets the `i + 1`th *byte*, and should also be faster as doesn't require iterating through `chars` again.
#2565 added source map support in codegen. But there was a bug in creating the line offset tables for Unicode. This PR fixes that.
This function could probably be made more efficient, but I think this at least makes it correct.