mathpix-markdown-it 2.0.38 → 2.0.40
- package/README.md +3 -0
- package/doc/changelog.md +128 -0
- package/es5/browser/auto-render.js +1 -1
- package/es5/bundle.js +4 -4
- package/es5/index.js +4 -4
- package/lib/components/mathpix-markdown/index.js +2 -1
- package/lib/components/mathpix-markdown/index.js.map +1 -1
- package/lib/index.d.ts +2 -1
- package/lib/index.js +3 -1
- package/lib/index.js.map +1 -1
- package/lib/markdown/common/consts.d.ts +5 -0
- package/lib/markdown/common/consts.js +17 -5
- package/lib/markdown/common/consts.js.map +1 -1
- package/lib/markdown/common/convert-math-to-html.d.ts +10 -0
- package/lib/markdown/common/convert-math-to-html.js +163 -41
- package/lib/markdown/common/convert-math-to-html.js.map +1 -1
- package/lib/markdown/common/labels.d.ts +9 -1
- package/lib/markdown/common/labels.js +82 -37
- package/lib/markdown/common/labels.js.map +1 -1
- package/lib/markdown/common/reset-mmd-state.d.ts +4 -0
- package/lib/markdown/common/reset-mmd-state.js +30 -0
- package/lib/markdown/common/reset-mmd-state.js.map +1 -0
- package/lib/markdown/common.d.ts +3 -0
- package/lib/markdown/common.js +34 -23
- package/lib/markdown/common.js.map +1 -1
- package/lib/markdown/highlight/highlight-math-token.js +1 -0
- package/lib/markdown/highlight/highlight-math-token.js.map +1 -1
- package/lib/markdown/index.js +22 -8
- package/lib/markdown/index.js.map +1 -1
- package/lib/markdown/mathpix-markdown-plugins.js +21 -1
- package/lib/markdown/mathpix-markdown-plugins.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/common.d.ts +15 -1
- package/lib/markdown/md-block-rule/begin-tabular/common.js +57 -11
- package/lib/markdown/md-block-rule/begin-tabular/common.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/index.d.ts +3 -0
- package/lib/markdown/md-block-rule/begin-tabular/index.js +79 -20
- package/lib/markdown/md-block-rule/begin-tabular/index.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/multi-column-row.d.ts +3 -1
- package/lib/markdown/md-block-rule/begin-tabular/multi-column-row.js +15 -9
- package/lib/markdown/md-block-rule/begin-tabular/multi-column-row.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/parse-tabular.d.ts +2 -1
- package/lib/markdown/md-block-rule/begin-tabular/parse-tabular.js +177 -73
- package/lib/markdown/md-block-rule/begin-tabular/parse-tabular.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/sub-cell.d.ts +1 -0
- package/lib/markdown/md-block-rule/begin-tabular/sub-cell.js +11 -23
- package/lib/markdown/md-block-rule/begin-tabular/sub-cell.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/sub-code.d.ts +0 -6
- package/lib/markdown/md-block-rule/begin-tabular/sub-code.js +10 -21
- package/lib/markdown/md-block-rule/begin-tabular/sub-code.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/sub-math.d.ts +13 -5
- package/lib/markdown/md-block-rule/begin-tabular/sub-math.js +132 -93
- package/lib/markdown/md-block-rule/begin-tabular/sub-math.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/sub-tabular.d.ts +3 -1
- package/lib/markdown/md-block-rule/begin-tabular/sub-tabular.js +44 -36
- package/lib/markdown/md-block-rule/begin-tabular/sub-tabular.js.map +1 -1
- package/lib/markdown/md-block-rule/begin-tabular/tabular-td.d.ts +11 -3
- package/lib/markdown/md-block-rule/begin-tabular/tabular-td.js +207 -57
- package/lib/markdown/md-block-rule/begin-tabular/tabular-td.js.map +1 -1
- package/lib/markdown/md-block-rule/mmd-html-block.js +11 -2
- package/lib/markdown/md-block-rule/mmd-html-block.js.map +1 -1
- package/lib/markdown/md-core-rules/set-positions.js +90 -15
- package/lib/markdown/md-core-rules/set-positions.js.map +1 -1
- package/lib/markdown/md-inline-rule/core-inline.js +41 -9
- package/lib/markdown/md-inline-rule/core-inline.js.map +1 -1
- package/lib/markdown/md-inline-rule/tabular.js +5 -2
- package/lib/markdown/md-inline-rule/tabular.js.map +1 -1
- package/lib/markdown/md-latex-footnotes/block-rule.js +72 -3
- package/lib/markdown/md-latex-footnotes/block-rule.js.map +1 -1
- package/lib/markdown/md-latex-lists-env/re-level.js +39 -22
- package/lib/markdown/md-latex-lists-env/re-level.js.map +1 -1
- package/lib/markdown/md-renderer-rules/render-tabular.js +115 -36
- package/lib/markdown/md-renderer-rules/render-tabular.js.map +1 -1
- package/lib/markdown/md-svg-to-base64/base64.js +8 -8
- package/lib/markdown/md-svg-to-base64/base64.js.map +1 -1
- package/lib/markdown/md-theorem/block-rule.js +10 -6
- package/lib/markdown/md-theorem/block-rule.js.map +1 -1
- package/lib/markdown/mdPluginRaw.js +24 -3
- package/lib/markdown/mdPluginRaw.js.map +1 -1
- package/lib/markdown/mdPluginTOC.js +30 -4
- package/lib/markdown/mdPluginTOC.js.map +1 -1
- package/lib/markdown/mdPluginTableTabular.js +46 -1
- package/lib/markdown/mdPluginTableTabular.js.map +1 -1
- package/lib/markdown/utils.js +3 -0
- package/lib/markdown/utils.js.map +1 -1
- package/lib/mathjax/index.js +3 -3
- package/lib/mathjax/index.js.map +1 -1
- package/lib/mathpix-markdown-model/index.d.ts +4 -0
- package/lib/mathpix-markdown-model/index.js +2 -1
- package/lib/mathpix-markdown-model/index.js.map +1 -1
- package/package.json +1 -1
- package/pr-specs/2026-04-global-state-cleanup-and-perf.md +212 -0
- package/pr-specs/2026-04-optimize-tabular-parsing.md +211 -0
- package/pr-specs/2026-05-footnote-perf-and-parser-invariants.md +246 -0
- package/pr-specs/2026-05-tabular-vertical-align-bracket.md +270 -0
- package/lib/markdown/mdPluginSeparateForBlock.d.ts +0 -2
- package/lib/markdown/mdPluginSeparateForBlock.js +0 -209
- package/lib/markdown/mdPluginSeparateForBlock.js.map +0 -1
@@ -0,0 +1,212 @@
# PR: Global state cleanup and performance improvements

Status: Implemented
Owner: @OlgaRedozubova

---

## Context

An audit of the codebase revealed multiple module-level mutable state variables that accumulate data across documents in long-lived processes. Several hot-path data structures used `Array` + `findIndex()` for O(n) lookups where an O(1) `Map` lookup would do, and the `getInlineCodeListFromString` result was scanned with `.find()` once per character, which is O(n × m).

Two additional memory-retention bugs were uncovered during this work:

- `mdPluginTOC` stored the parse `state` in a module-level `gstate` variable so the TOC render rule could reach the top-level token list. That reference was never cleared and kept the entire token tree pinned across unrelated parses until the next `md.use()` call. On tabular-heavy documents this alone retained hundreds of megabytes.

- `coreInline` rebound `state.env = Object.assign({}, ...)` inside the inline loop. That desynchronized `state.env` from the env reference that the caller of `md.render(src, env)` still held, so parse-time mutations (including our TOC / cache bookkeeping) became invisible to render rules that received the original env.

---

## Goal

Fix memory leaks from module-level state, improve lookup performance for tabular-related data structures, add per-parse cleanup guarantees for reused md instances, and preserve the caller's env reference contract between parse and render.

---

## Non-Goals

- Highlight dedup optimization (the order-dependent behavior is pre-existing; changing it risks regressions without dedicated test coverage).
- Migrating `mathTable` / `subTabular` / `extractedCodeBlocks` to `state.env` (they now work correctly with the two-hook scheme: `reset_tabular_state` at the start of parse plus `cleanup_tabular_state` at the end).
- Browser-only concerns (click handlers, setInterval, context menu listeners).

---

## Current Behavior (before)

- `Clear*` functions run once at `md.use()` time. If the md instance is reused for multiple documents, module-level state accumulates.
- `mdPluginTOC` pins every parsed token tree on module-level `gstate` indefinitely.
- `coreInline` replaces `state.env` with a fresh object per inline token, breaking the shared-env contract.
- `diagboxTable` has no cleanup function at all, so it grows without bound.
- `subTabular`, `extractedCodeBlocks` use `Array` + `findIndex()`: O(n) per lookup.
- `findEndMarker()` calls `.find()` on the inline code list for every character position: O(n × m).
- `labelsList` uses `Array` for all lookups: O(n) per `find`/`findIndex`.
- `makeTagRegexes` creates 6 new `RegExp` objects per HTML block.
- `SetItemizeLevelTokens` clones the entire `md.options` object on every call.

---

## Desired Behavior (after)

- Every `md.parse()` begins with a `reset_tabular_state` core-ruler hook that clears module-level tabular state. A second `cleanup_tabular_state` hook runs at the end of the core pipeline and drops parse-only caches (`subTabular`, `mathTable`, `extractedCodeBlocks`, `diagboxTable`, and the column-style intern cache) so they're not retained through render.
- `mdPluginTOC` stores the token list on `state.env[TOC_ENV_KEY]`, so the reference is released together with env when the parse ends. The TOC render rule reads the list from env instead of the removed `gstate`.
- `coreInline` mutates `state.env` in place for the inline pass and derives a private `inlineEnv` for the nested `inline.parse()` call. The caller's env binding stays intact for downstream render rules.
- All lookup data structures use `Map` for O(1) access.
- Inline code position check is O(1) via `Set<number>`.
- Tag-regex objects are cached and reused; `g`-flag regexes are matched with `.match()` to avoid `lastIndex` corruption across calls.
- `SetItemizeLevelTokens` saves/restores only `outMath` with `try/finally`.
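
The `lastIndex` hazard behind the tag-regex bullet can be reproduced in isolation. The following is a minimal sketch, not the package's code: `tagRegexCacheSketch` and `getOpenTagRegex` are hypothetical names, and the pattern is illustrative only.

```typescript
// Hypothetical cache keyed by tag name; a stand-in for the spec's tagRegexCache.
const tagRegexCacheSketch = new Map<string, RegExp>();

function getOpenTagRegex(tag: string): RegExp {
  let cached = tagRegexCacheSketch.get(tag);
  if (!cached) {
    cached = new RegExp(`<${tag}\\b`, "g"); // the g flag makes lastIndex stateful
    tagRegexCacheSketch.set(tag, cached);
  }
  return cached;
}

const sampleHtml = "<div>one</div>";
const divRe = getOpenTagRegex("div");

// .test() on a reused g-regex resumes from lastIndex, so the same input
// matches on the first call and fails on the second.
const firstTest = divRe.test(sampleHtml);
const secondTest = divRe.test(sampleHtml);

// String.prototype.match() with a g-regex always scans from index 0.
const firstMatch = sampleHtml.match(divRe) !== null;
const secondMatch = sampleHtml.match(divRe) !== null;

console.log({ firstTest, secondTest, firstMatch, secondMatch });
```

Once a regex object is cached and shared, any `.test()` or `.exec()` call on it carries position state between callers, which is why caching and the switch to `.match()` go together.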

---

## Constraints / Invariants

- `reset_tabular_state` and `cleanup_tabular_state` both respect `renderElement.startLine`: partial renders skip cleanup so the enclosing re-parse still sees the cached state.
- `labelsByKey` / `labelsByUuid` survive `cleanup_tabular_state` because labels are read by render rules for `\ref{}` / `\eqref{}` resolution.
- Highlight rendering files (`render-rule-highlights.ts`, `common.ts`) are NOT modified; they were reverted to master to avoid behavioural regression.
- The `labelsList` export is kept for deep-import backward compatibility. It is deprecated and exposed as a `Proxy` that returns a version-cached snapshot of `labelsByKey.values()`; the snapshot is rebuilt only when `addIntoLabelsList` / `clearLabelsList` bumps the version. It supports `.length`, iteration, and Array methods. Mutations (`.push`, index assignment) target the Proxy's throwaway target array and are effectively ignored.
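
One way to picture the `labelsList` compatibility shim is the sketch below. It is an assumption about the shape, not the shipped implementation: every name ending in `Sketch`, the `labelsVersion` counter, and the choice of `get` / `has` traps are hypothetical.

```typescript
interface LabelSketch { key: string; id: string; }

const labelsByKeySketch = new Map<string, LabelSketch>();
let labelsVersion = 0;    // bumped on every add / clear
let snapshotVersion = -1; // version the cached snapshot was built from
let cachedSnapshot: LabelSketch[] = [];

function addLabelSketch(label: LabelSketch): void {
  labelsByKeySketch.set(label.key, label);
  labelsVersion++;
}

function clearLabelsSketch(): void {
  labelsByKeySketch.clear();
  labelsVersion++;
}

// Rebuild the array view only when the Map changed since the last read.
function currentSnapshot(): LabelSketch[] {
  if (snapshotVersion !== labelsVersion) {
    cachedSnapshot = Array.from(labelsByKeySketch.values());
    snapshotVersion = labelsVersion;
  }
  return cachedSnapshot;
}

// Reads (get / has) are answered from the snapshot. There is no set trap,
// so writes fall through to the empty target array and never reach the Map.
const labelsListSketch: LabelSketch[] = new Proxy<LabelSketch[]>([], {
  get: (_target, prop) => Reflect.get(currentSnapshot(), prop),
  has: (_target, prop) => Reflect.has(currentSnapshot(), prop),
});

addLabelSketch({ key: "eq:1", id: "a" });
addLabelSketch({ key: "eq:2", id: "b" });
const keysSeen = labelsListSketch.map((label) => label.key).join(",");
labelsListSketch.push({ key: "ignored", id: "z" }); // lands on the throwaway target
const lengthAfterPush = labelsListSketch.length;
```

The `has` trap matters because Array methods such as `map` and `forEach` check index presence on the receiver before reading each element.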

### Per-parse cross-plugin state reset

Sub-plugins (TOC, theorem, labels, footnotes, list-env, text counters) each maintain module-level state that was previously cleared only at `md.use` time or inside the `initMathpixMarkdown.parse` / `renderer.render` wrappers. Direct callers of `markdownIt().use(mathpixMarkdownPlugin)` that reuse one md instance across documents saw drift:

- TOC slug registry (`slugsTocItems`) accumulated → second parse of the same document produced `#introduction-2`, `#introduction-3`, …
- `theoremEnvironments` / `environmentsCounter` / `counterProof` kept stale theorem numbers across documents.
- `labelsByKey` / `labelsByUuid` still held the prior document's `\label{}` entries, so `\ref{}` could resolve to the wrong block.
- `mmd_footnotes_list` carried old footnote items.
- `itemizeLevelTokens` held parsed token trees from a prior `\renewcommand{\labelitemi}`, visible as stale marker tokens in the next document.
- `resetTextCounter` / size counter / `MathJax.Reset()` (equation numbering) were also wrapper-only.
- `ParseErrorList` was the worst offender: `ClearParseErrorList()` was defined but never called anywhere, so tabular parse errors grew monotonically.

A new core-ruler hook `reset_mmd_global_state` (registered `before('normalize')` from `mathpixMarkdownPlugin`) calls all the reset functions at the start of every `md.parse()`. It respects `renderElement.startLine` so partial re-renders don't tear down the enclosing parse's cross-reference state. The same implementation is exported as `resetMmdGlobalState()` from the package root so one-shot converters (e.g. DOCX export) can release module-level state immediately after render without waiting for the next parse.
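
The slug drift described above, and the effect of a reset hook that runs first, can be shown with a toy model. This sketch does not use markdown-it: `MiniParserSketch`, `uniqueSlugSketch`, and `slugCountsSketch` are hypothetical stand-ins for the core ruler and the module-level slug registry.

```typescript
// Module-level registry, standing in for a slug table such as slugsTocItems.
const slugCountsSketch = new Map<string, number>();

function uniqueSlugSketch(title: string): string {
  const base = title.trim().toLowerCase().replace(/\s+/g, "-");
  const seen = slugCountsSketch.get(base) || 0;
  slugCountsSketch.set(base, seen + 1);
  return seen === 0 ? base : `${base}-${seen + 1}`;
}

type CoreRuleSketch = (src: string, out: string[]) => void;

// Tiny stand-in for a core ruler: every rule runs, in order, on every parse.
class MiniParserSketch {
  private rules: CoreRuleSketch[] = [];
  addFirst(rule: CoreRuleSketch): void { this.rules.unshift(rule); }
  addLast(rule: CoreRuleSketch): void { this.rules.push(rule); }
  parse(src: string): string[] {
    const out: string[] = [];
    for (const rule of this.rules) rule(src, out);
    return out;
  }
}

const headingRule: CoreRuleSketch = (src, out) => {
  for (const line of src.split("\n")) {
    if (line.startsWith("# ")) out.push(uniqueSlugSketch(line.slice(2)));
  }
};

const sampleDoc = "# Introduction\ntext";

// Without a reset hook, the second parse of the same document drifts.
const leaky = new MiniParserSketch();
leaky.addLast(headingRule);
const leakyFirst = leaky.parse(sampleDoc);
const leakySecond = leaky.parse(sampleDoc);

// With a reset hook registered ahead of every other rule, each parse starts clean.
slugCountsSketch.clear();
const fixed = new MiniParserSketch();
fixed.addLast(headingRule);
fixed.addFirst(() => slugCountsSketch.clear());
const fixedFirst = fixed.parse(sampleDoc);
const fixedSecond = fixed.parse(sampleDoc);
```

In this model `leakySecond` contains `introduction-2`, which mirrors the `#introduction-2` symptom, while both `fixed` parses produce `introduction`.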

### Additional parse-only retention fixes

- `cleanup_math_cache` core-ruler hook (pushed, end of pipeline) clears `state.env.__mathpix`. Previously the per-parse math dedup cache was only initialized, never released, so MathJax html/svg strings for every unique expression stayed on env until the caller dropped it.
- `mdPluginTOC.grab_state` stashes `state.tokens` on `state.env[TOC_ENV_KEY]` only when the document actually used `[[toc]]`, detected by a one-pass scan of inline-token children for `toc_body`. Documents without `[[toc]]` no longer retain the whole token tree on env.

### `markdownToHtmlPipelineSegments` balance fix

The segments renderer tracked a single `pendingCloseTag` + `pendingLevel` pair, so a nested same-type same-level `_open` terminated the segment mid-block. Concrete case: md-theorem wraps an inner `paragraph_open` at level 0 inside the outer `paragraph_open` with class `theorem_block`. The first `paragraph_close` closed the segment prematurely, leaving `<div><div>...</div>` in one segment and `</div></div>...` in the next. Added a `pendingDepth` counter so nested opens of the same type at the same level increment depth; the segment closes only when depth drops back to zero. Regression test in `tests/_html-segments.js` covers 38 scenarios across all block rules from `mmdRules.ts`.
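
The depth-counter idea can be sketched on a simplified token stream. This is an illustration under assumptions, not the renderer's code: `TokSketch` carries only `type` and `level`, and `splitSegmentsSketch` is a hypothetical function name.

```typescript
interface TokSketch { type: string; level: number; }

// Split a flat token list into top-level segments. A segment opened by X_open
// closes at the matching X_close on the same level, counting nested opens of
// the same type and level so the first inner close does not end it early.
function splitSegmentsSketch(tokens: TokSketch[]): TokSketch[][] {
  const segments: TokSketch[][] = [];
  let current: TokSketch[] = [];
  let pendingCloseTag: string | null = null;
  let pendingOpenTag = "";
  let pendingLevel = -1;
  let pendingDepth = 0;

  for (const tok of tokens) {
    current.push(tok);
    if (pendingCloseTag === null) {
      if (tok.type.endsWith("_open")) {
        pendingOpenTag = tok.type;
        pendingCloseTag = tok.type.replace(/_open$/, "_close");
        pendingLevel = tok.level;
        pendingDepth = 1;
      } else {
        segments.push(current); // standalone token forms its own segment
        current = [];
      }
      continue;
    }
    if (tok.level !== pendingLevel) continue;
    if (tok.type === pendingOpenTag) {
      pendingDepth++;
    } else if (tok.type === pendingCloseTag) {
      pendingDepth--;
      if (pendingDepth === 0) {
        segments.push(current);
        current = [];
        pendingCloseTag = null;
      }
    }
  }
  if (current.length > 0) segments.push(current);
  return segments;
}

const theoremLike: TokSketch[] = [
  { type: "paragraph_open", level: 0 },  // outer theorem block
  { type: "paragraph_open", level: 0 },  // inner paragraph at the same level
  { type: "inline", level: 1 },
  { type: "paragraph_close", level: 0 },
  { type: "paragraph_close", level: 0 },
  { type: "paragraph_open", level: 0 },  // next, unrelated paragraph
  { type: "inline", level: 1 },
  { type: "paragraph_close", level: 0 },
];
const segmentLengths = splitSegmentsSketch(theoremLike).map((s) => s.length);
```

With a single pending pair and no depth counter, the first `paragraph_close` would end the first segment after four tokens and leave an unbalanced close for the next one.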

---

## Done When

- [x] Per-parse `reset_tabular_state` hook clears all tabular module-level state at the start of parse
- [x] Post-parse `cleanup_tabular_state` hook drops parse-only caches (`subTabular`, `mathTable`, `extractedCodeBlocks`, `diagboxTable`, column-style intern cache) at the end of the core pipeline
- [x] `mdPluginTOC` stores the token list on `state.env[TOC_ENV_KEY]`; module-level `gstate` removed
- [x] `coreInline` mutates `state.env` in place instead of rebinding it
- [x] `diagboxTable` has `ClearDiagboxTable()` + `diagboxById` reverse Map
- [x] `subTabular` converted from Array to Map
- [x] `extractedCodeBlocks` converted from Array to Map
- [x] `labelsByKey` + `labelsByUuid` Map indexes for O(1) lookups
- [x] `buildInlineCodePositionSet()` → `Set<number>` for O(1) position check
- [x] `tagRegexCache` memoization + `.test()` → `.match()` fix for g-regex `lastIndex`
- [x] `utf8Encode` uses `parts[]` + `join()` instead of `+=`
- [x] `SetItemizeLevelTokens` saves/restores only `outMath` with `try/finally`
- [x] All 3,286 tests pass

---

## Architecture

### Two-hook cleanup scheme

```
core.ruler.before('normalize', 'reset_tabular_state', resetHook)
  // clears state before parsing starts (defensive: also catches the case
  // where the previous parse threw)
core.ruler.push('cleanup_tabular_state', cleanupHook)
  // drops parse-only caches after the last core rule; they're never read
  // during render
```

### TOC token-list handoff

Old (leaked entire token tree):

```ts
let gstate; // module scope
md.core.ruler.push('grab_state', (state) => { gstate = state; });
// ...
const dataToc = getTocList(0, gstate.tokens, ...);
```

New (per-parse scope):

```ts
const TOC_ENV_KEY = '__mathpix_toc_tokens';
md.core.ruler.push('grab_state', (state) => {
  if (!state.env) state.env = {};
  state.env[TOC_ENV_KEY] = state.tokens;
});
// ...
const allTokens = (env && env[TOC_ENV_KEY]) || tokens || [];
const dataToc = getTocList(0, allTokens, ...);
```

### coreInline env preservation

Old (broke caller env):

```ts
state.env = Object.assign({}, {...state.env}, {
  currentTag, ...envToInline,
});
state.md.inline.parse(token.content, state.md, state.env, token.children);
```

New (preserves caller env binding):

```ts
const inlineEnv = Object.assign({}, state.env, {currentTag}, envToInline);
state.env.currentTag = currentTag;
if (envToInline && typeof envToInline === 'object') {
  Object.assign(state.env, envToInline);
}
state.md.inline.parse(token.content, state.md, inlineEnv, token.children);
```

The same pattern is applied in the deeper recursive walker `walkInlineInTokens` (used by `footnote_latex` / `tabular` deep-walk): it now also builds a private `inlineEnv` per token and mutates `state.env` in place, rather than rebinding it.
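
Why rebinding hides parse-time writes from the caller can be seen without markdown-it at all. The sketch below uses hypothetical names (`EnvSketch`, `StateSketch`, `tocSeen`) and models only the object-identity aspect of the change.

```typescript
interface EnvSketch { [key: string]: unknown; }
interface StateSketch { env: EnvSketch; }

// Old pattern: state.env is rebound to a new object, so later writes land on
// an object the caller of render(src, env) never sees.
function inlinePassRebinding(state: StateSketch): void {
  state.env = Object.assign({}, state.env, { currentTag: "p" });
  state.env.tocSeen = true;
}

// New pattern: a private copy for the nested call, in-place writes for
// bookkeeping that render rules must be able to read from the caller's env.
function inlinePassInPlace(state: StateSketch): EnvSketch {
  const inlineEnv = Object.assign({}, state.env, { currentTag: "p" });
  state.env.currentTag = "p";
  state.env.tocSeen = true;
  return inlineEnv;
}

const callerEnvOld: EnvSketch = {};
inlinePassRebinding({ env: callerEnvOld });

const callerEnvNew: EnvSketch = {};
const privateInlineEnv = inlinePassInPlace({ env: callerEnvNew });

console.log(callerEnvOld.tocSeen, callerEnvNew.tocSeen);
```

After the first call the caller's object is still empty; after the second it carries `tocSeen`, while the nested pass received a separate object.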

---

## Memory impact

Benchmark document: 16 MB MMD with 13,713 tabular blocks, ~479K `<td>` cells, and ~49K inline math expressions.

### Full SVG+HTML render

| Stage                     |  Before |  After |               Δ |
|---------------------------|--------:|-------:|----------------:|
| Peak heap (html held)     | 2597 MB | 778 MB | −1819 MB (−70%) |
| Heap after releasing html | 1887 MB |  68 MB | −1819 MB (−96%) |

The bulk of the reduction comes from the TOC `gstate` and `coreInline` env fixes: without them the token tree stayed pinned across parses and dominated the retained heap. The `cleanup_tabular_state` hook removes the remaining ~45 MB of parse-only caches that used to survive into the render phase.

---

## Files Changed

| File | Change |
|------|--------|
| `src/markdown/mdPluginTableTabular.ts` | `clearTabularState()` + two core-ruler hooks: `reset_tabular_state` (before normalize) and `cleanup_tabular_state` (push, end of pipeline) |
| `src/markdown/mdPluginTOC.ts` | `TOC_ENV_KEY` on `state.env` replaces module-level `gstate`; render rule reads from env |
| `src/markdown/md-inline-rule/core-inline.ts` | In-place `state.env` mutation, derived `inlineEnv` for nested `inline.parse` |
| `src/markdown/md-block-rule/begin-tabular/sub-cell.ts` | `ClearDiagboxTable()`, `diagboxById` reverse Map, `buildInlineCodePositionSet` in `extractNextBraceContent` |
| `src/markdown/md-block-rule/begin-tabular/sub-tabular.ts` | Array → Map, all `findIndex` → `.get()` |
| `src/markdown/md-block-rule/begin-tabular/sub-code.ts` | Array → Map |
| `src/markdown/md-block-rule/begin-tabular/sub-math.ts` | Single `tail` slice in `getMathTableContent` |
| `src/markdown/common.ts` | `buildInlineCodePositionSet()`, `findEndMarker` uses `Set.has()` |
| `src/markdown/common/labels.ts` | `labelsByKey` + `labelsByUuid` Map indexes |
| `src/markdown/md-block-rule/mmd-html-block.ts` | `tagRegexCache` memoization, `.test()` → `.match()` for g-regex |
| `src/markdown/md-svg-to-base64/base64.ts` | `parts[]` + `join()` in `utf8Encode` |
| `src/markdown/md-latex-lists-env/re-level.ts` | Save/restore only `outMath` with `try/finally` + `beginCacheBypass`/`endCacheBypass` |
| `src/markdown/mdPluginSeparateForBlock.ts` | Removed: dead code, never imported anywhere |

---

## Testing

- All 3,286 tests pass
- Per-parse cleanup verified: 100 sequential parses on the same md instance show no memory growth
- Highlight rendering files reverted to master, so their behaviour is unchanged by this PR
@@ -0,0 +1,211 @@
# PR: Optimize tabular parsing performance

Status: Implemented
Owner: @OlgaRedozubova

---

## Context

Parsing and rendering large documents with many `\begin{tabular}` blocks was slow and memory-hungry. The benchmark document (a 16 MB MMD with 13,713 tabular blocks, ~479K `<td>` cells, and ~49K inline math expressions) showed a peak heap of 2.6 GB for the full SVG/HTML render path.

Profiling revealed three classes of waste:

1. **Algorithmic**: `getSubMath()` rebuilt the placeholder string on every math expression found (O(N×M)); several lookup tables used `Array + findIndex()`; MathJax was invoked for every math token, including 67% duplicates.
2. **Per-token allocation**: CSS border strings, `attrs` arrays, and close-token objects were allocated fresh for every `<td>`/`<tr>`/`<table>` even though the underlying content was identical across thousands of tokens.
3. **Speculative output**: `renderInlineTokenBlock` always built all five outputs (HTML, TSV, CSV, tableMd, tableSmoothed) regardless of which one the caller actually consumed; `token.mathEquation` was stored on every math token even when the caller only walked the token tree.

---

## Goal

Reduce parse time and memory usage for documents with many tabular environments and repeated math expressions. Avoid building per-token data that the caller will not read.

---

## Non-Goals

- Persistent cross-parse MathJax cache (decided against: complexity outweighs benefit for the supported use cases; the cache is per-parse via `state.env`).
- Migrating all module-level state to `state.env` (tracked in the sibling `global-state-cleanup-and-perf.md` spec).
- Changing the markdown-it Token shape (kept for downstream compatibility).

---

## Done When

- [x] `getSubMath` is iterative single-pass
- [x] `mathTable` is `Map<string, string>`
- [x] Per-parse math cache in `state.env.__mathpix` deduplicates identical math
- [x] Cache bypass for `SetItemizeLevelTokens` forDocx mutation
- [x] Accessibility `mjx-mml-*` IDs remain unique on cache hits
- [x] `outMath.skipMathToHtml` option: skips SVG serialization and `token.mathEquation` storage for token-only callers
- [x] Border-style strings pre-interned; composed cell style deduplicated per parse
- [x] Shared attrs arrays for `td_open` / `tr_open` / `table_open` / `tbody_open` with clone-on-write
- [x] Frozen singleton close-tokens for `td_close` / `tr_close` / `table_close`
- [x] `StatePushTabulars` skips `content` / `children = []` assignments on marker tokens
- [x] `res.concat` → in-place `res.push` inside the tabular construction loop
- [x] `renderInlineTokenBlock` and `renderNonTableTokenIntoCell` build each output only when the caller requested it via `options.outMath.include_*`
- [x] HTML-visual attrs (style, `_empty` class, `table_tabular` class, `border-*` tr style) skipped when the caller sets `forMD` or `forLatex`
- [x] `token.mathData.svg` retained only when highlights are active
- [x] Empty `labels` object is returned as `null` from MathJax `OuterData` (no empty `{}` per math token)
- [x] All 3,286 tests pass
- [x] New unit tests cover: `getSubMath` edge cases, math-cache dedup/bypass/isolation, accessibility ID uniqueness, cell-attrs sharing, round-trip placeholder restoration

---

## Architecture

### Iterative getSubMath (sub-math.ts)

Single-pass scan with a local `new RegExp(RE_MATH_OPEN.source, 'g')`:
- `RE_MATH_OPEN` is stored as a literal without the `/g` flag (immutable template)
- Each call creates a local copy, so the scan is reentrant-safe
- The optional `startPos: number = 0` parameter is preserved for signature compatibility with the earlier recursive implementation; the scan seeks to it via `re.lastIndex = startPos` before the loop starts
- `getEndMarker()` uses capture groups for eqref/ref detection (no substring matching)
- `shouldSkipDollar()` validates `$`/`$$` edge cases
- `mathTablePush()` accepts both `(id, content)` and `({id, content})` for backward compatibility
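
The scan shape described in this list (immutable template, per-call `g` copy, `lastIndex` seek, single output pass) can be sketched as follows. The pattern, placeholder format, and all names ending in `Sketch` are assumptions for illustration; the real `getSubMath` handles many more delimiters and edge cases.

```typescript
// Template kept without the g flag, so it never carries lastIndex state.
const RE_OPEN_SKETCH = /\\\(|\$/;

interface SubMathResultSketch { text: string; table: Map<string, string>; }

function getSubMathSketch(src: string, startPos: number = 0): SubMathResultSketch {
  const re = new RegExp(RE_OPEN_SKETCH.source, "g"); // per-call copy: reentrant
  re.lastIndex = startPos;                           // seek before the scan loop
  const table = new Map<string, string>();
  const parts: string[] = [];
  let copiedUpTo = 0;
  let m: RegExpExecArray | null;

  while ((m = re.exec(src)) !== null) {
    const open = m[0];
    const close = open === "$" ? "$" : "\\)";
    const end = src.indexOf(close, m.index + open.length);
    if (end === -1) break; // unterminated math: leave the rest untouched
    const id = `<<math${table.size}>>`;
    table.set(id, src.slice(m.index, end + close.length));
    parts.push(src.slice(copiedUpTo, m.index), id);
    copiedUpTo = end + close.length;
    re.lastIndex = copiedUpTo; // resume right after the closing marker
  }
  parts.push(src.slice(copiedUpTo));
  // One join at the end instead of rebuilding the string per expression.
  return { text: parts.join(""), table };
}

const scanned = getSubMathSketch("a $x$ b \\(y\\) c");
const scannedFromOffset = getSubMathSketch("$a$ $b$", 3);
```

Because each call owns its regex copy, a nested or repeated call cannot disturb the position of an outer scan.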

### colsToFixWidth: Array → Set (parse-tabular.ts)

The per-table set of column indices that need fixed width was tracked as a `number[]` with `.includes()` + `.push()` for dedup, which is O(N²) in cell count for wide tables. It is now a local `Set<number>` with `.add()` (O(1) dedup) and is converted back to a plain array once, at the `tableOpen.meta.colsToFixWidth` assignment, so downstream consumers (`shouldRewriteColSpec`) keep their existing array input shape. This also removes four identical `if-includes-push` fragments.

### Dead split/join round-trips removed (common.ts)

`getColumnLines` and `getColumnAlign` ended with `.split('').join('')`, an identity operation that allocates a per-call character array for nothing. The neighbouring `.split('').join(' ')` (which inserts spaces between characters) is not a no-op and is preserved.

### Per-parse math cache (convert-math-to-html.ts)

```
state.env.__mathpix = {
  inlineCache: Map<math, CachedResult>,   // inline_math tokens
  displayCache: Map<math, CachedResult>,  // display_math tokens
  cacheBypass: number,                    // >0 disables cache
}
```

Initialized by the `init_math_cache` core-ruler hook. On a cache hit with accessibility enabled, the original `mjx-assistive-mml` id is replaced with a fresh one from `MathJax.nextAssistiveId()`.

Cache hits mark the returned `TypesetResult` with `_labelsRegistered: true` so `convertMathToHtml` skips the label-registration loop: the `state.md.inline.parse` for every `\label{}` tag and the `addIntoLabelsList` side-effect already ran on the first miss and are idempotent for the same key+content. `idLabels` is still computed from `Object.keys(token.labels)` so downstream ref/eqref resolution is unaffected.
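
A reduced model of the per-parse cache, including the bypass counter and the end-of-parse cleanup, might look like the sketch below. `expensiveTypesetSketch` is a stand-in for the MathJax call, and every function name here is hypothetical; only the `__mathpix` field names come from the structure above.

```typescript
interface TypesetSketch { html: string; }
interface MathCacheSketch {
  inlineCache: Map<string, TypesetSketch>;
  displayCache: Map<string, TypesetSketch>;
  cacheBypass: number; // > 0 disables the cache
}
interface ParseEnvSketch { __mathpix?: MathCacheSketch; }

let typesetCalls = 0;
// Stand-in for the expensive typesetting step.
function expensiveTypesetSketch(math: string, display: boolean): TypesetSketch {
  typesetCalls++;
  return { html: display ? `<div>${math}</div>` : `<span>${math}</span>` };
}

function initMathCacheSketch(env: ParseEnvSketch): void {
  env.__mathpix = { inlineCache: new Map(), displayCache: new Map(), cacheBypass: 0 };
}

function typesetCachedSketch(env: ParseEnvSketch, math: string, display: boolean): TypesetSketch {
  const store = env.__mathpix;
  if (!store || store.cacheBypass > 0) return expensiveTypesetSketch(math, display);
  // Inline and display results differ, so they live in separate maps.
  const cache = display ? store.displayCache : store.inlineCache;
  let hit = cache.get(math);
  if (!hit) {
    hit = expensiveTypesetSketch(math, display);
    cache.set(math, hit);
  }
  return hit;
}

// End-of-parse hook: drop the cache so cached strings do not outlive the parse.
function cleanupMathCacheSketch(env: ParseEnvSketch): void {
  delete env.__mathpix;
}

const parseEnv: ParseEnvSketch = {};
initMathCacheSketch(parseEnv);
typesetCachedSketch(parseEnv, "x^2", false);
typesetCachedSketch(parseEnv, "x^2", false); // served from inlineCache
const callsAfterDedup = typesetCalls;
```

A counter rather than a boolean lets bypass regions nest: each region increments on entry and decrements on exit, and the cache resumes only when the count returns to zero.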

### `outMath.skipMathToHtml` (convert-math-to-html.ts)

When set, `applyTypesetResultToToken` does not copy the serialized math HTML to `token.mathEquation`, and `typesetMathForToken` path 4 forces `outMath.include_svg = false` for the MathJax call so the SVG string is never built. Other MathJax outputs (mathml_word, asciimath, metrics) still populate when the caller enabled them.

### Pre-interned border styles + per-parse style intern (tabular-td.ts)

`verticalCellLine` / `horizontalCellLine` switch on the line token and return one of 16 module-level border strings (solid/double/dashed/none × 4 sides), so the per-call template-literal allocation is gone. The composed cell style (`composeCellStyle`) is interned in a per-parse `columnStyleCache` so repeated cells share a single string instance.

### Shared attrs for structural tabular tokens

`getSharedCellAttrs(style, isEmpty, skipVisual)`, `getSharedTableOpenAttrs(extraClass?, skipVisual)`, `getSharedTbodyOpenAttrs(numCol)`, and `getSharedTrOpenAttrs(skipVisual)` return read-only attrs arrays cached per-parse by key. The shared arrays carry the non-enumerable `Symbol.for('mathpix.tabular.attrsShared')` marker so mutation sites (`tokenAttrSet` in the tabular renderer, `addAttributesToParentTokenByType` in utils) detach a private clone before writing.
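
The clone-on-write contract for shared attrs can be sketched in a few lines. Only the `Symbol.for('mathpix.tabular.attrsShared')` key is taken from the text above; `getSharedAttrsSketch`, `setAttrSketch`, and the cache keyed by style string are hypothetical simplifications.

```typescript
type AttrsSketch = [string, string][];

const ATTRS_SHARED_SKETCH = Symbol.for("mathpix.tabular.attrsShared");
const sharedAttrsCacheSketch = new Map<string, AttrsSketch>();

// Return one cached attrs array per distinct style, tagged as shared.
function getSharedAttrsSketch(style: string): AttrsSketch {
  let attrs = sharedAttrsCacheSketch.get(style);
  if (!attrs) {
    attrs = [["style", style]];
    Object.defineProperty(attrs, ATTRS_SHARED_SKETCH, { value: true, enumerable: false });
    sharedAttrsCacheSketch.set(style, attrs);
  }
  return attrs;
}

interface CellTokenSketch { attrs: AttrsSketch; }

// Mutation site: detach a private clone before writing to a shared array.
function setAttrSketch(token: CellTokenSketch, name: string, value: string): void {
  if ((token.attrs as any)[ATTRS_SHARED_SKETCH]) {
    // map() builds a fresh array that does not carry the marker property.
    token.attrs = token.attrs.map(([k, v]) => [k, v] as [string, string]);
  }
  const existing = token.attrs.find(([k]) => k === name);
  if (existing) existing[1] = value;
  else token.attrs.push([name, value]);
}

const cellA: CellTokenSketch = { attrs: getSharedAttrsSketch("text-align: left") };
const cellB: CellTokenSketch = { attrs: getSharedAttrsSketch("text-align: left") };
const sharedBeforeWrite = cellA.attrs === cellB.attrs;
setAttrSketch(cellA, "class", "highlight");
```

Cells that are never mutated keep pointing at the same array, so the cost of a private copy is paid only by the tokens that actually need one.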

### Frozen close-token singletons

`td_close`, `tr_close`, `table_close`, and `tbody_close` (non-forLatex only) have no variable data. A single `Object.freeze`d instance per kind is exported and pushed everywhere these markers appear. Under `forLatex`, `tbody_close` carries a per-table `latex` payload and is allocated per-instance.

`StatePushTabulars` no longer assigns `content` / `children = []` onto marker tokens: those fields are never read on open/close markers, and the assignment would throw on the frozen close singletons. The multi-column branch of `parse-tabular.ts` also pushes `SHARED_TD_CLOSE` directly instead of allocating a fresh `{token:'td_close', ...}` object per cell.

`addStyle` / `addHLineIntoStyle` (`tabular-td.ts`) check the input attrs for the `attrsSharedMarker` symbol and clone before mutating so that callers which pass in a shared-attrs array do not corrupt the cached object.

### Output gating in renderInlineTokenBlock / renderNonTableTokenIntoCell

The tabular renderer computes flags via a shared `computeOutputGates(options)` helper and only populates the outputs the caller asked for:

```ts
const { needHtml, needTsv, needCsv, needMd, needSmoothed } = computeOutputGates(options);
// needHtml:        !forMD && include_table_html !== false
// needTsv/needCsv: include_tsv / include_csv
// needMd:          forMD || include_table_markdown
// needSmoothed:    forPptx
```

`renderInlineTokenBlock` and `renderNonTableTokenIntoCell` both use the same helper so their gating cannot drift. Every `result += ...`, `arr*.push(...)`, `cellMd += ...`, and `formatTsvCell` / `formatCsvCell` call is gated on the corresponding flag.

Leaf-token handling still invokes `slf.renderInline([token], options, env)` even when `needHtml` is false: the `latex_list_item_open` render rule sets `token.meta.itemizeLevel` as a side effect, which `handleListTokensForCellMarkdown` reads to emit list markers.
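
Read literally, the gate comments above suggest a helper along the following lines. This is a sketch under assumptions: the option shape (`forMD` and `forPptx` at the top level, `include_*` under `outMath`) is inferred from this spec, and `computeOutputGatesSketch` is a hypothetical name rather than the shipped helper.

```typescript
interface GateOptionsSketch {
  forMD?: boolean;
  forPptx?: boolean;
  outMath?: {
    include_table_html?: boolean;
    include_tsv?: boolean;
    include_csv?: boolean;
    include_table_markdown?: boolean;
  };
}

interface OutputGatesSketch {
  needHtml: boolean;
  needTsv: boolean;
  needCsv: boolean;
  needMd: boolean;
  needSmoothed: boolean;
}

// Compute every flag once, so both render paths share identical gating.
function computeOutputGatesSketch(options: GateOptionsSketch): OutputGatesSketch {
  const outMath = options.outMath || {};
  return {
    needHtml: !options.forMD && outMath.include_table_html !== false,
    needTsv: !!outMath.include_tsv,
    needCsv: !!outMath.include_csv,
    needMd: !!options.forMD || !!outMath.include_table_markdown,
    needSmoothed: !!options.forPptx,
  };
}

// Usage: build only the outputs whose gate is open.
function renderCellSketch(text: string, options: GateOptionsSketch): { html?: string; tsv?: string } {
  const gates = computeOutputGatesSketch(options);
  const out: { html?: string; tsv?: string } = {};
  if (gates.needHtml) out.html = `<td>${text}</td>`;
  if (gates.needTsv) out.tsv = text.replace(/\t/g, " ");
  return out;
}

const defaultGates = computeOutputGatesSketch({});
const markdownGates = computeOutputGatesSketch({ forMD: true });
```

Note that HTML is on by default (`include_table_html !== false`), while the other outputs are opt-in.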
|
|
125
|
+
|
|
126
|
+
### HTML-visual attrs skipped for forMD / forLatex
|
|
127
|
+
|
|
128
|
+
`td_open.style`, `td_open.class='_empty'`, `tr_open.style`, `table_open.class='tabular'`, and the `table_tabular` class + text-align style on the wrapping `paragraph_open` are HTML/CSS-only. When the caller sets `forMD` or `forLatex`, `AddTd` / `AddTdSubTable` / `getMultiColumnMultiRow` / `StatePushParagraphOpen` skip these. Colspan / rowspan on multicol/multirow cells and `paragraph_open.data-align` (for forLatex) are preserved.
### mathData.svg only retained under highlights

`applyTypesetResultToToken` shallow-clones `mathData` without the `svg` field unless `options.highlights?.length` is truthy. The SVG is still available on `token.mathEquation` (the HTML render rule path); only the duplicate copy on `mathData` is dropped. The highlight render path re-populates `mathData.svg` in `convertMathToHtmlWithHighlight`. Invariant: the `highlights` length is read at parse time; mutating `options.highlights` between parse and render does not retroactively re-populate `mathData.svg`.
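A minimal sketch of the shallow clone that drops `svg`, assuming a simplified `MathData` shape. `cloneMathDataSketch` is a hypothetical name, not the library's `applyTypesetResultToToken`.

```ts
// Assumed, simplified shape; the real mathData carries more fields.
interface MathData {
  svg?: string;
  mathml?: string;
  latex?: string;
}

// Keep `svg` only when highlights were requested at parse time.
function cloneMathDataSketch(
  mathData: MathData,
  options: { highlights?: unknown[] }
): MathData {
  if (options.highlights?.length) return { ...mathData };
  // Rest destructuring copies every field except `svg`.
  const { svg: _svg, ...rest } = mathData;
  return rest;
}

const data: MathData = { svg: "<svg>...</svg>", mathml: "<math>...</math>" };
console.log("svg" in cloneMathDataSketch(data, {}));                   // false
console.log("svg" in cloneMathDataSketch(data, { highlights: [{}] })); // true
```

Because the decision is taken once when the clone is made, later changes to `options.highlights` have no effect on the stored copy, which matches the invariant stated above.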
### Hot-path outMath spread memoization (skipMathToHtml path)

`typesetMathForToken` under `skipMathToHtml: true` forces `include_svg: false` for the MathJax call. The spread `{ ...options.outMath, include_svg: false }` would run per math token (~49K times on the benchmark). A `WeakMap<outMath, clone>` cache memoizes the spread: the first token pays the allocation and the rest reuse the cached clone. GC-friendly: when the parse's `options.outMath` is released, the cache entry goes with it.
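The memoization pattern described here can be sketched in a few lines. The `OutMath` shape and the names `noSvgCache` / `outMathWithoutSvg` are assumptions for illustration, not identifiers from the library.

```ts
// Assumed option bag; only `include_svg` matters for this sketch.
interface OutMath {
  include_svg?: boolean;
  [key: string]: unknown;
}

// Keys are held weakly, so an entry is collectable once its outMath object is.
const noSvgCache = new WeakMap<OutMath, OutMath>();

// Return the same `{ ...outMath, include_svg: false }` clone for the same input.
function outMathWithoutSvg(outMath: OutMath): OutMath {
  let clone = noSvgCache.get(outMath);
  if (!clone) {
    clone = { ...outMath, include_svg: false };
    noSvgCache.set(outMath, clone);
  }
  return clone;
}

const outMath = { include_svg: true, include_mathml: true };
const first = outMathWithoutSvg(outMath);
const second = outMathWithoutSvg(outMath);
console.log(first === second);   // true: the spread ran once
console.log(first.include_svg);  // false
console.log(outMath.include_svg); // true: the caller's object is untouched
```

One trade-off of this sketch is that the cached clone is a snapshot: if the caller mutates `outMath` after the first lookup, the clone does not pick up the change.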
### OuterData empty-labels guard (mathjax/index.ts)

When the current equation has no labels, `res.labels` is set to `null` instead of `{...emptyObject}` so each math token does not carry an empty `{}` allocation.
### res.concat → res.push in tabular construction

Tabular token arrays are assembled in a hot loop. `res = res.concat(data.res)` allocated a new array on every iteration; replaced with `for (const t of data.res) res.push(t)` to mutate in place.

---
## Public API changes

| Option | Type | Default | Effect |
|--------|------|--------:|--------|
| `outMath.skipMathToHtml` | boolean | `false` | When `true`, skips SVG serialization and `token.mathEquation` storage; other MathJax outputs still respect their own `include_*` flags. Intended for callers that walk tokens directly and never read the serialized math HTML. |

No other options are introduced. Existing flags (`options.forMD`, `options.forLatex`, `options.forDocx`, `options.forPptx`, `outMath.include_tsv` / `include_csv` / `include_table_html` / `include_table_markdown`, `options.highlights`) drive the output-gating decisions described above.

---
## Memory impact

Benchmark document: 16 MB MMD with 13,713 tabular blocks, ~479K `<td>` cells, and ~49K inline math expressions.
### Full SVG/HTML render path

| Stage | Before | After | Δ |
|-------------------------|--------:|-------:|-------------:|
| Peak heap (html held) | 2597 MB | 778 MB | −1819 (−70%) |
| Heap after drop html | 1887 MB | 68 MB | −1819 (−96%) |
| Parse time | 17.9 s | 14.6 s | −18% |
| HTML output size | 355 MB | 355 MB | 0 |
### Token-only path (`forMD: true`, `outMath.skipMathToHtml: true`)

| Stage | Before | After | Δ |
|-------------------------|--------:|-------:|-------------:|
| Peak heap | 2597 MB | 443 MB | −2154 (−83%) |
| Heap after drop output | 1887 MB | 81 MB | −1806 (−96%) |
| Parse time | 17.9 s | 20.5 s | +15% |
| Serialized output size | 355 MB | 165 MB | −190 |

Parse time on the token-only path includes more bookkeeping but skips SVG serialization; wall-clock time depends heavily on which MathJax outputs the caller enables.

---
## Files Changed

| File | Change |
|------|--------|
| `src/markdown/md-block-rule/begin-tabular/sub-math.ts` | Iterative `getSubMath`; `mathTable` Array→Map; `mathTablePush` overload; `getEndMarker` with capture groups |
| `src/markdown/md-block-rule/begin-tabular/tabular-td.ts` | Pre-interned border strings; `columnStyleCache` per-parse intern; `getSharedCellAttrs` / `getSharedTableOpenAttrs` / `getSharedTrOpenAttrs` / `getSharedTbodyOpenAttrs` with clone-on-write marker; frozen `SHARED_TD_CLOSE` / `SHARED_TR_CLOSE` / `SHARED_TABLE_CLOSE`; `skipVisual` pathway in `AddTd` / `AddTdSubTable` |
| `src/markdown/md-block-rule/begin-tabular/parse-tabular.ts` | Derives `skipVisual = forMD \|\| forLatex`; gates style-only calls (`addHLineIntoStyle`, shared attrs helpers); `res.concat` → `res.push` in the construction loop |
| `src/markdown/md-block-rule/begin-tabular/multi-column-row.ts` | `getMultiColumnMultiRow` honors `skipVisual`; colspan/rowspan always preserved, style/width skipped under the flag |
| `src/markdown/md-block-rule/begin-tabular/index.ts` | `StatePushTabulars` skips `content` / `children = []` on open/close markers; `StatePushParagraphOpen` gates class/style on `forMD \|\| forLatex`, preserves `data-align` for forLatex |
| `src/markdown/md-inline-rule/tabular.ts` | Subtable `table_open` uses `getSharedTableOpenAttrs('subtable')` instead of mutating shared attrs |
| `src/markdown/md-renderer-rules/render-tabular.ts` | Clone-on-write `tokenAttrSet`; per-output `needHtml` / `needTsv` / `needCsv` / `needMd` / `needSmoothed` gates throughout `renderInlineTokenBlock` and `renderNonTableTokenIntoCell`; leaf-token `renderInline` retained for list-marker side effect |
| `src/markdown/common/convert-math-to-html.ts` | Per-parse cache in `state.env.__mathpix`; `initMathCache`; `beginCacheBypass` / `endCacheBypass`; accessibility ID replacement; `outMath.skipMathToHtml` path that disables SVG serialization and `mathEquation` storage; `mathData.svg` gated on `options.highlights` |
| `src/markdown/mdPluginRaw.ts` | Registers `init_math_cache` core-ruler hook |
| `src/markdown/md-latex-lists-env/re-level.ts` | `beginCacheBypass` / `endCacheBypass` around forDocx mutation; `try/finally` for `outMath` restoration |
| `src/markdown/utils.ts` | Clone-on-write in `addAttributesToParentTokenByType` (shared attrs marker) |
| `src/mathjax/index.ts` | `OuterData` returns `null` for empty `labels` instead of an empty `{}` clone |
| `tests/_sub-math.js` | Unit tests for `getSubMath` edge cases |
| `tests/_typeset-cache.js` | Tests for dedup, env isolation, mutation protection, accessibility IDs, bypass counter, forDocx integration, mathData svg gating |
| `tests/_accessibility.js` | Tests for unique IDs with cache, aria-labelledby consistency |

---
## Testing

- All 3,286 tests pass
- `outMath.skipMathToHtml` is opt-in; existing callers see identical output
- Shared attrs + close-token singletons verified correct under highlight/diagbox/forDocx paths (tests exercise cloning)
- MD list markers in tabular cells verified: leaf-token `renderInline` is still invoked so the `latex_list_item_open` render rule can set `token.meta.itemizeLevel`