otomate 0.2.1 → 0.3.1

@@ -0,0 +1,476 @@
1
+ # HTML → docx → HTML, with tracked changes
2
+
3
+ A complete, end-to-end walkthrough of converting HTML to a Word document with
4
+ tracked changes, reimporting it, accepting the revisions, and confirming the
5
+ round-trip is lossless. Every step here is backed by a real test at
6
+ `packages/otomate/src/__tests__/e2e-tracked-changes.test.ts` — if you copy the
7
+ snippets verbatim they will run.
8
+
9
+ > **Prerequisites.** Node ≥ 20 **or any modern browser** (Deno, Bun, workers,
10
+ > edge runtimes also work — the library uses the Web Crypto API internally
11
+ > and has no Node-specific imports). `otomate` installed. All `writeDocx` /
12
+ > `readDocx` / `writeDiffDocx` calls are **async** — always `await` them.
13
+ > `readHtml` / `writeHtml` / `renderDiffHtml` / `diff` are synchronous.
14
+
15
+ ---
16
+
17
+ ## Overview — the 7 stages
18
+
19
+ ```
20
+ HTML → Stage 1 → UDM (oldTree)
21
+ ↓ edit
22
+ HTML' → Stage 2 → UDM (newTree)
23
+
24
+ diff(old, new) → Stage 3 → DiffResult + annotated HTML preview
25
+
26
+ writeDiffDocx → Stage 4 → .docx with <w:ins> / <w:del>
27
+
28
+ readDocx → Stage 5 → UDM + tracked-changes HTML render
29
+ ↓ "accept all"
30
+ writeHtml → Stage 6 → plain HTML, revisions applied
31
+
32
+ assert equal → Stage 7 → writeHtml(newTree) // lossless round-trip
33
+ ```
34
+
35
+ The invariant that proves everything works: **the output of stage 6 must be
36
+ byte-for-byte identical to `writeHtml(newTree)`**. If that equality holds, the
37
+ diff you computed in stage 3 survived the docx round-trip intact, and "accept
38
+ all revisions" recovered exactly the edit you wanted.
39
+
40
+ ---
41
+
42
+ ## Stage 1 — Generate rich HTML and parse it
43
+
44
+ ```typescript
45
+ import { readHtml } from "otomate";
46
+
47
+ const originalHtml = `<style>
48
+ .title { color: #1e3a5f; font-family: Georgia; font-size: 24pt; }
49
+ .callout { background-color: #fff8e1; border: 1pt solid #fbbf24; }
50
+ .critical { color: #dc2626; font-weight: bold; }
51
+ </style>
52
+ <h1 class="title">Q1 2024 Product Roadmap</h1>
53
+ <p>Welcome to the <strong>first quarter</strong> roadmap covering
54
+ our <em>strategic priorities</em>.</p>
55
+ <h2>Initiatives</h2>
56
+ <ul>
57
+ <li>
58
+ <p>Infrastructure improvements</p>
59
+ <ul>
60
+ <li>
61
+ <p>Database migration</p>
62
+ <blockquote>
63
+ <p>The migration must complete before <u>March 15th</u>.</p>
64
+ </blockquote>
65
+ </li>
66
+ </ul>
67
+ </li>
68
+ </ul>
69
+ <table>
70
+ <thead><tr><th>Metric</th><th>Q4</th><th>Q1 Target</th></tr></thead>
71
+ <tbody><tr><td>Revenue</td><td>$1.2M</td><td>$1.5M</td></tr></tbody>
72
+ </table>
73
+ <div class="callout">
74
+ <p class="critical">Risks identified:</p>
75
+ <ul><li><p>Integration delays</p></li></ul>
76
+ </div>`;
77
+
78
+ const oldTree = readHtml(originalHtml);
79
+ ```
80
+
81
+ **What's in `oldTree`:**
82
+
83
+ - The UDM tree — a `root` with block children (`h1`, `p`, `h2`, `ul`, `table`, `div`).
84
+ - `oldTree.data.css.classRules` — auto-extracted from the inline `<style>` block (no need to pass `options.css` explicitly; it merges with whatever you do pass).
85
+ - `classes: string[]` on every element that had an HTML `class` attribute.
86
+ - Nesting depth ≥ 5 (the blockquote path alone goes `ul → li → ul → li → blockquote → p → text`).
87
+
88
+ ---
89
+
90
+ ## Stage 2 — Modify the document
91
+
92
+ Generate an edited version. For the diff engine to exercise every code path
93
+ you want a mix of edit kinds:
94
+
95
+ | Kind | Effect in docx output |
96
+ |---|---|
97
+ | Text change | `<w:del>` + `<w:ins>` on a per-character/word run |
98
+ | Root-level paragraph insertion | `<w:ins>` wrapping each run of the new paragraph |
99
+ | Root-level node deletion | `<w:del>` wrapping the deleted block's text |
100
+
101
+ ```typescript
102
+ let modifiedHtml = originalHtml;
103
+ modifiedHtml = modifiedHtml.replace(
104
+ "Q1 2024 Product Roadmap",
105
+ "Q1 2024 Product & Engineering Roadmap",
106
+ );
107
+ modifiedHtml = modifiedHtml.replace("March 15th", "March 30th");
108
+ modifiedHtml = modifiedHtml.replace("$1.5M", "$1.8M");
109
+ modifiedHtml = modifiedHtml.replace(
110
+ "</h1>",
111
+ `</h1>\n<p class="summary">This quarter focuses on scale and accessibility.</p>`,
112
+ );
113
+ // Delete the entire callout div (a root-level node)
114
+ modifiedHtml = modifiedHtml.replace(/<div class="callout">[\s\S]*?<\/div>/, "");
115
+
116
+ const newTree = readHtml(modifiedHtml);
117
+ ```
118
+
119
+ > ⚠️ **Only root-level deletes render as `<w:del>` in the docx.**
120
+ > `writeDiffDocx` renders deletes via a Pass 2 that filters for
121
+ > `op.path.length === 1`. A delete nested inside a list or table will survive
122
+ > in the snapshot (so the round-trip still works) but **will not show up as a
123
+ > revision mark** in Word. See the troubleshooting section below.
124
+
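The root-level filter described in the warning above can be checked before you write the file. The sketch below assumes each entry in `delta.operations` exposes a `type` string and a `path` array of child indexes; that shape is inferred from this guide rather than taken from the library's type declarations, and `splitDeletesByDepth` is a hypothetical helper, not part of `otomate`.

```typescript
// Assumed minimal shape of a diff operation (hypothetical; check the real types).
interface OpLike {
  type: string;
  path: number[];
}

// Separate deletes that should render as <w:del> (root-level, path length 1)
// from nested deletes that would get no revision mark.
function splitDeletesByDepth(ops: OpLike[]): { visible: OpLike[]; hidden: OpLike[] } {
  const visible: OpLike[] = [];
  const hidden: OpLike[] = [];
  for (const op of ops) {
    if (op.type !== "delete") continue;
    (op.path.length === 1 ? visible : hidden).push(op);
  }
  return { visible, hidden };
}

// Illustrative data only: one root-level delete, one nested delete, one insert.
const sampleOps: OpLike[] = [
  { type: "delete", path: [5] },
  { type: "delete", path: [3, 0, 1] },
  { type: "insert", path: [1] },
];
const deleteSplit = splitDeletesByDepth(sampleOps);
console.log(deleteSplit.hidden.length, "nested delete(s) would get no revision mark");
```

If `hidden` is non-empty for your real delta, you know in advance which deletions Word will apply silently.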
125
+ ---
126
+
127
+ ## Stage 3 — Diff and render an annotated HTML preview
128
+
129
+ ```typescript
130
+ import { diff, renderDiffHtml } from "otomate";
131
+
132
+ const delta = diff(oldTree, newTree);
133
+ console.log(delta.stats);
134
+ // { nodesAdded: 1, nodesDeleted: 1, nodesMoved: 0, nodesModified: 0, textChanges: 3 }
135
+
136
+ // The signature is (oldTree, newTree, delta) — NOT (delta, oldTree, newTree).
137
+ // See troubleshooting #1 below.
138
+ const diffHtml = renderDiffHtml(oldTree, newTree, delta);
139
+ ```
140
+
141
+ `diffHtml` is a standalone HTML string with `<ins>` and `<del>` markers plus
142
+ `data-diff="insert|delete|update"` attributes. It's the canonical preview
143
+ format — use it to show reviewers what will change before you commit the
144
+ edit to a Word document.
145
+
146
+ **Accepted options:** `{ insClass, delClass, modClass, moveClass, inlineStyles, side }`. `side: "old" | "new" | "merged"` controls whether you render the old view, the new view, or a combined view with both insertions and deletions visible. The default is `"merged"` which is what you want for a diff preview.
147
+
148
+ ---
149
+
150
+ ## Stage 4 — Write a tracked-changes `.docx`
151
+
152
+ ```typescript
153
+ import { writeDiffDocx } from "otomate";
154
+ import { writeFileSync } from "node:fs";
155
+
156
+ const trackedBuf = await writeDiffDocx(newTree, delta, {
157
+ author: "Jane Editor",
158
+ date: "2024-04-07T12:00:00Z",
159
+ });
160
+ writeFileSync("roadmap-tracked.docx", trackedBuf);
161
+ ```
162
+
163
+ The resulting file opens in Microsoft Word with all changes as revisions. Word's
164
+ **Review → Accept All** and **Review → Reject All** buttons work correctly
165
+ because each `<w:ins>` / `<w:del>` has a unique `w:id` and carries the author
166
+ and date you supplied.
167
+
168
+ **What the writer does internally:**
169
+
170
+ 1. Builds lookup maps of insert/delete/updateText operations indexed by tree path.
171
+ 2. Renders `newTree` with diff-aware converters that wrap inserted runs in `<w:ins>` and split text changes into `<w:del>` (old) + `<w:ins>` (new) segments.
172
+ 3. **Pass 2** splices root-level deleted blocks into their approximate old position, each wrapped in `<w:del>` with the deleted block's text content extracted recursively.
173
+ 4. Embeds the UDM snapshot (bound to the `document.xml` hash) so `readDocx` can round-trip perfectly.
174
+
175
+ **Verify the XML actually contains tracked changes:**
176
+
177
+ ```typescript
178
+ import { extractDocx } from "otomate";
179
+
180
+ const parts = await extractDocx(trackedBuf);
181
+ const docXml = parts.document;
182
+
183
+ // Both revision types must be present.
184
+ if (!docXml.includes("<w:ins")) throw new Error("no insertions emitted");
185
+ if (!docXml.includes("<w:del")) throw new Error("no deletions emitted");
186
+
187
+ // Every revision id must be unique.
188
+ const ids = [...docXml.matchAll(/<w:(?:ins|del)\s+w:id="(\d+)"/g)].map(m => m[1]);
189
+ if (new Set(ids).size !== ids.length) throw new Error("duplicate revision ids");
190
+ ```
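Since Accept All / Reject All depend on each revision carrying an id, author, and date, it can also help to list that metadata explicitly. This is a regex-based sketch that works on the XML string only; `listRevisionMarks` is a hypothetical helper, and a real XML parser would be more robust if attribute values ever contain `>`.

```typescript
interface RevisionMark {
  kind: "ins" | "del";
  id: string | undefined;
  author: string | undefined;
  date: string | undefined;
}

// Collect w:id, w:author and w:date from every <w:ins> / <w:del> start tag.
function listRevisionMarks(xml: string): RevisionMark[] {
  const marks: RevisionMark[] = [];
  // \b stops longer tag names such as <w:delText> from matching.
  for (const tag of xml.matchAll(/<w:(ins|del)\b([^>]*)>/g)) {
    const attrs = tag[2] ?? "";
    const attr = (name: string): string | undefined =>
      new RegExp(`\\bw:${name}="([^"]*)"`).exec(attrs)?.[1];
    marks.push({
      kind: tag[1] as "ins" | "del",
      id: attr("id"),
      author: attr("author"),
      date: attr("date"),
    });
  }
  return marks;
}

// Illustrative XML fragment, not real library output.
const sampleXml =
  '<w:ins w:id="1" w:author="Jane Editor" w:date="2024-04-07T12:00:00Z"><w:r/></w:ins>' +
  '<w:del w:id="2" w:author="Jane Editor" w:date="2024-04-07T12:00:00Z">' +
  "<w:r><w:delText>x</w:delText></w:r></w:del>";
const revisionMarks = listRevisionMarks(sampleXml);
const missingAuthor = revisionMarks.filter((mark) => mark.author === undefined);
```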
191
+
192
+ ---
193
+
194
+ ## Stage 5 — Re-import the `.docx` and render tracked HTML
195
+
196
+ ```typescript
197
+ import { readDocx } from "otomate";
198
+
199
+ const rereadTree = await readDocx(trackedBuf);
200
+ ```
201
+
202
+ `readDocx` sees `word/otomate-udm.json` inside the ZIP, validates its
203
+ `__docHash` against the current `document.xml`, and (because we haven't
204
+ touched the file) returns `newTree` back from the snapshot — lossless.
205
+
206
+ To get **HTML that includes the rendered tracked changes** (a view showing
207
+ both the inserted and the deleted content, so a reviewer can see what will
208
+ change), feed `rereadTree` back through `renderDiffHtml` with the
209
+ **same delta** you computed in stage 3:
210
+
211
+ ```typescript
212
+ const rereadTrackedHtml = renderDiffHtml(oldTree, rereadTree, delta);
213
+ // Contains <ins> for insertions, <del> for deletions, data-diff-* markers.
214
+ ```
215
+
216
+ You need both `oldTree` and the delta because the OOXML path loses the
217
+ diff-operation metadata — `<w:ins>` / `<w:del>` tell Word what to render but
218
+ don't encode the structural tree mapping the diff engine needs to reproduce
219
+ the preview. Keep the `delta` object around if you want to re-render the
220
+ tracked-changes view later.
221
+
222
+ ---
223
+
224
+ ## Stage 6 — Accept all changes
225
+
226
+ Accepting all revisions is the same as taking the "new" side of the diff,
227
+ which is exactly what `rereadTree` already is (it came from the snapshot of
228
+ `newTree`). So "accept all" is just:
229
+
230
+ ```typescript
231
+ import { writeHtml } from "otomate";
232
+
233
+ const acceptedHtml = writeHtml(rereadTree);
234
+ ```
235
+
236
+ No `<ins>`, no `<del>`, no `data-diff-*` attributes — just clean HTML with
237
+ the edits applied.
238
+
239
+ If you need to **reject** all changes instead, serialize `oldTree`:
240
+
241
+ ```typescript
242
+ const rejectedHtml = writeHtml(oldTree); // equivalent to "reject all"
243
+ ```
244
+
245
+ For mixed accept/reject (per-revision decisions) you'd need to replay the
246
+ diff operations selectively against `oldTree`. The library doesn't currently
247
+ ship an `applyDiff(tree, opsToApply)` helper; if you need this, walk
248
+ `delta.operations` yourself and reconstruct the tree by picking which ops to
249
+ include.
250
+
251
+ ---
252
+
253
+ ## Stage 7 — Prove the round-trip is lossless
254
+
255
+ ```typescript
256
+ import assert from "node:assert/strict";
257
+
258
+ const expectedAcceptedHtml = writeHtml(newTree);
259
+ assert.equal(
260
+ acceptedHtml,
261
+ expectedAcceptedHtml,
262
+ "round-trip broke: accepted HTML does not match the expected modified HTML",
263
+ );
264
+ ```
265
+
266
+ This is the strongest possible assertion: byte-for-byte equality between
267
+ the accepted state (stage 6) and the direct serialization of `newTree`.
268
+ If the equality holds, the snapshot path (ZIP packing, snapshot hash
269
+ validation, HTML serialization) is lossless end to end. It does not exercise
270
+ the OOXML fallback parser, which troubleshooting #5 describes as lossier.
271
+
272
+ **You can also spot-check individual changes:**
273
+
274
+ ```typescript
275
+ // Every insertion made it into the output.
276
+ assert.ok(acceptedHtml.includes("Engineering Roadmap")); // heading edit
277
+ assert.ok(acceptedHtml.includes("March 30th")); // blockquote edit
278
+ assert.ok(acceptedHtml.includes("$1.8M")); // table edit
279
+ assert.ok(acceptedHtml.includes("focuses on scale")); // new paragraph
280
+
281
+ // Every deletion was applied.
282
+ assert.ok(!acceptedHtml.includes("March 15th"));
283
+ assert.ok(!acceptedHtml.includes("$1.5M"));
284
+ assert.ok(!acceptedHtml.includes("Risks identified")); // deleted callout
285
+ ```
286
+
287
+ ---
288
+
289
+ ## Troubleshooting
290
+
291
+ Every item in this section is a trap we hit while building the test that
292
+ this guide is based on, **or** a subtle pitfall you will hit the first time
293
+ you integrate the library. Read the whole list before writing any code.
294
+
295
+ ### 1. `diffResult.operations is not iterable` / `Cannot read property 'type' of undefined` inside `renderDiffHtml`
296
+
297
+ **Cause.** Wrong argument order on `renderDiffHtml`. The correct signature is:
298
+
299
+ ```typescript
300
+ renderDiffHtml(oldTree, newTree, diffResult, options?)
301
+ ```
302
+
303
+ It is **not** `(diffResult, oldTree, newTree)`. An easy way to remember:
304
+ `renderDiffHtml` parallels `diff` in that the trees come first.
305
+
306
+ **Fix.** Pass the arguments in the order `(oldTree, newTree, delta)`.
307
+
308
+ ### 2. Deleted content is missing from the generated docx
309
+
310
+ **Symptom.** The diff engine reports a delete (visible in `delta.stats.nodesDeleted`), but Word opens the file with no `<w:del>` anywhere and the deleted content simply isn't there.
311
+
312
+ **Cause.** `writeDiffDocx` only renders **root-level** deletes as `<w:del>`. Its pass 2 filters for `op.path.length === 1`, so a delete nested inside a list item, table cell, or blockquote will not produce any OOXML revision mark. The tree snapshot still contains the correct state, so `readDocx` will round-trip correctly, but Word won't show a strike-through for that deletion.
313
+
314
+ **Fix.** If you need a deletion to show up as a tracked change in Word, restructure the edit so the deleted node is a direct child of `root`. If that's not possible (e.g. deleting a single list item), you have two options:
315
+
316
+ - Accept the limitation — the change still applies when the user clicks "Accept All", just without a visible revision mark for that specific delete.
317
+ - Mutate via an insert of an empty node at the same position plus text replacement, turning the delete into an `updateText` op which *is* rendered as `<w:del>` + `<w:ins>` inside the paragraph.
318
+
319
+ ### 3. Tracked changes show up in the `.docx` but not in my re-rendered HTML preview
320
+
321
+ **Symptom.** Stage 4's XML contains `<w:ins>` and `<w:del>`, but stage 5's `renderDiffHtml` output has no `<ins>` or `<del>` markers.
322
+
323
+ **Cause.** You passed the wrong tree or delta to `renderDiffHtml`. The function needs **the original `oldTree`**, **the re-imported tree** (`rereadTree`, which equals `newTree` via the snapshot), and **the original delta** you computed in stage 3. Passing `oldTree, newTree, delta` renders correctly. Passing `rereadTree, rereadTree, delta` won't — there's nothing to compare against.
324
+
325
+ **Fix.** Keep `oldTree` and `delta` in scope through the whole pipeline. Don't try to "rediscover" the diff from `rereadTree` alone — the snapshot path gives you a clean `newTree`, not the edit history.
326
+
327
+ ### 4. "File is corrupt" / "Word found unreadable content" dialog
328
+
329
+ If Word complains when opening your `.docx`, one of these usually matches:
330
+
331
+ | Symptom in the dialog | Cause | Fix |
332
+ |---|---|---|
333
+ | Word offers to "repair" the document | Schema-order violation inside `<w:rPr>` or `<w:pPr>`, or empty `<w:tr>` with no `<w:tc>`, or missing `<w:tblGrid>`, or invalid content in text | All of these are fixed inside the library. If you see one, update to the latest version. |
334
+ | "The name in the end tag... must match the start tag" | The text you fed in contained raw control characters in the `\x00`–`\x1F` range that aren't valid in XML 1.0 (every one except tab, line feed, and carriage return) | `esc()` strips these automatically via `sanitizeText`, so you shouldn't hit this in normal use; if you're passing binary data as text, clean it first |
335
+ | "Cannot find a part of the document" | You edited the `.docx` ZIP by hand and forgot to update `[Content_Types].xml` | Let the library produce the ZIP; don't unzip/rezip manually |
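If you do need to clean text yourself before handing it to the library, the following is a minimal sketch. It is not the library's `sanitizeText`; `stripXmlControlChars` is a hypothetical name. It relies on one fact from the XML 1.0 specification: within the C0 range, only tab, line feed, and carriage return are legal characters.

```typescript
// XML 1.0 permits only tab (\x09), line feed (\x0A) and carriage return (\x0D)
// from the C0 control range; every other code point below \x20 is illegal.
function stripXmlControlChars(text: string): string {
  return text.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, "");
}

// NUL and unit-separator are removed; the tab is kept.
const cleanedText = stripXmlControlChars("a\x00b\tc\x1Fd");
```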
336
+
337
+ ### 5. `acceptedHtml` doesn't match `writeHtml(newTree)` exactly
338
+
339
+ **Symptom.** Stage 7's `assert.equal` fails. The two strings differ in whitespace, attribute order, or similar cosmetic details.
340
+
341
+ **Cause.** Something mutated the tree between stages. Common culprits:
342
+
343
+ - You modified `newTree` after computing `delta` — now `newTree` and the snapshot disagree.
344
+ - You wrote the docx, opened it in Word, saved it, then read it back — the snapshot hash is invalidated and `readDocx` falls back to OOXML parsing, which is lossier than the snapshot path. You'll see `div`/`figure` flattened into paragraphs, `data.html.*` attributes stripped, and custom marks dropped.
345
+ - You passed a different diff to `renderDiffHtml` in stage 5 than you used in stage 4.
346
+
347
+ **Fix.** Treat `oldTree`, `newTree`, and `delta` as immutable once computed. If you must edit in Word before round-tripping back, expect lossy results — the snapshot is only valid for documents otomate wrote and nothing has touched since.
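When the stage 7 assertion fails on long HTML strings, the assertion message alone can be hard to read. A small helper like the one below (hypothetical, not part of `otomate`) reports where the two strings first diverge, which usually makes it clear whether the difference is cosmetic or structural.

```typescript
// Return a short description of the first position where two strings differ,
// or null when they are identical.
function firstDifference(a: string, b: string, context = 20): string | null {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  if (i === limit && a.length === b.length) return null; // identical
  const start = Math.max(0, i - context);
  const left = JSON.stringify(a.slice(start, i + context));
  const right = JSON.stringify(b.slice(start, i + context));
  return `index ${i}: ${left} vs ${right}`;
}

// Illustrative strings; in the guide's flow you would pass
// acceptedHtml and writeHtml(newTree).
const divergence = firstDifference("<p>March 30th</p>", "<p>March 15th</p>");
```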
348
+
349
+ ### 6. `readHtml` returns a tree with no CSS rules even though my HTML has a `<style>` block
350
+
351
+ **Symptom.** `(tree.data as any)?.css` is `undefined` after calling `readHtml(htmlWithStyle)`.
352
+
353
+ **Cause.** One of:
354
+
355
+ - Your `<style>` element has no class selectors or element selectors — e.g. it only has `@media` queries, pseudo-classes (`:hover`), attribute selectors, or `@keyframes`. The CSS parser only extracts simple class (`.foo`) and element (`h1`) selectors; everything else is silently skipped. This is by design — OOXML has no way to express `:hover` anyway.
356
+ - You're on an older version of `@otomate/html` that doesn't auto-extract inline `<style>` blocks. Before version 0.2, you had to pass the CSS as a string via `readHtml(html, { css: "..." })`. Current versions auto-extract.
357
+
358
+ **Fix.** Check the selectors you're using. If they're all simple class or element selectors, update to a version with auto-extraction. If you need `:hover`-style selectors for some reason, strip them to plain `.class` selectors before passing to `readHtml`.
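To audit a stylesheet before parsing, you can classify its selectors yourself. The regular expression below is only an approximation of "simple class or element selector" as described above; the library's parser may accept or reject forms this sketch does not anticipate (compound selectors such as `p.note`, for example, are treated as unsupported here). `partitionSelectors` is a hypothetical helper.

```typescript
// Matches a lone class selector (.foo) or a lone element selector (h1).
const SIMPLE_SELECTOR = /^(?:\.[A-Za-z_][\w-]*|[A-Za-z][A-Za-z0-9]*)$/;

// Split selectors into those that look extractable and those likely to be skipped.
function partitionSelectors(selectors: string[]): { kept: string[]; skipped: string[] } {
  const kept: string[] = [];
  const skipped: string[] = [];
  for (const raw of selectors) {
    const selector = raw.trim();
    (SIMPLE_SELECTOR.test(selector) ? kept : skipped).push(selector);
  }
  return { kept, skipped };
}

const selectorAudit = partitionSelectors([".title", "h1", "a:hover", "[data-x]"]);
```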
359
+
360
+ ### 7. Custom CSS class names produce weird-looking Word styles
361
+
362
+ **Symptom.** A class named `"my-fancy class!"` shows up as `"my-fancyclass"` in the generated Word document, and style IDs longer than 31 characters are truncated.
363
+
364
+ **Cause.** Style IDs in OOXML are restricted to `[A-Za-z0-9_\-:]` and a maximum of 31 characters per ECMA-376 §17.7.4.9. The library sanitizes CSS class names (via `sanitizeStyleId`) when mapping them to Word style IDs: disallowed characters are stripped, and names longer than 31 characters are truncated.
365
+
366
+ **Fix.** If you want 1:1 fidelity between CSS class names and Word style IDs, keep your class names alphanumeric (plus `_`, `-`, `:`) and ≤ 31 characters. If you can't, the sanitized name is what you'll see — but the styling will still apply, because the library generates a unique style element per sanitized ID.
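To predict what a class name will turn into, you can apply the two rules stated above yourself. This is a sketch of those rules only, not the library's `sanitizeStyleId`; the name `sketchSanitizeStyleId` is hypothetical. Under the quoted character set a hyphen is kept, while spaces and punctuation such as `!` are dropped.

```typescript
// Strip everything outside [A-Za-z0-9_\-:], then cap the length at 31 characters.
function sketchSanitizeStyleId(className: string): string {
  return className.replace(/[^A-Za-z0-9_\-:]/g, "").slice(0, 31);
}

const sanitizedShort = sketchSanitizeStyleId("my-fancy class!");
const sanitizedLong = sketchSanitizeStyleId("x".repeat(40));
```

One thing to watch under these rules: two distinct class names can sanitize to the same ID (for example `a b` and `ab`), so it may be worth checking your class list for such collisions.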
367
+
368
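+ ### 8. `await` was skipped and `writeFileSync` received a `Promise`
369
+
370
+ **Symptom.** `writeFileSync` throws `TypeError [ERR_INVALID_ARG_TYPE]` because its `data` argument is a `Promise`. On Node versions before 14, which coerced unsupported input to a string, the `.docx` file instead contains the literal text `[object Promise]`.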
371
+
372
+ **Cause.** You forgot to `await` a call to `writeDocx`, `writeDiffDocx`, or `readDocx`. Those three functions are async. `readHtml`, `writeHtml`, `renderDiffHtml`, and `diff` are synchronous. **Always check the return type.**
373
+
374
+ **Fix.**
375
+
376
+ ```typescript
377
+ // Wrong
378
+ writeFileSync("out.docx", writeDocx(tree));
379
+
380
+ // Right
381
+ writeFileSync("out.docx", await writeDocx(tree));
382
+ ```
383
+
384
+ ### 9. `readDocx(filePath)` throws `buffer.slice is not a function`
385
+
386
+ **Symptom.** You passed a string file path to `readDocx` and got a runtime error about `.slice` or `.byteLength`.
387
+
388
+ **Cause.** `readDocx` expects an `ArrayBuffer` or `Uint8Array`, **not** a filesystem path. It has no filesystem access — it's a pure buffer parser so it works the same in Node and the browser.
389
+
390
+ **Fix.**
391
+
392
+ ```typescript
393
+ import { readFileSync } from "node:fs";
394
+ import { readDocx } from "otomate";
395
+
396
+ const buf = readFileSync("input.docx"); // Buffer (subclass of Uint8Array)
397
+ const tree = await readDocx(buf);
398
+ ```
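If callers of your own code might pass the wrong thing, a small guard in front of `readDocx` gives a clearer error than a failure deep inside the parser. `toBytes` below is a hypothetical helper that only normalizes the two input types named above.

```typescript
// Normalize the accepted binary inputs to a Uint8Array, or fail early.
function toBytes(input: unknown): Uint8Array {
  if (input instanceof Uint8Array) return input; // includes Node's Buffer
  if (input instanceof ArrayBuffer) return new Uint8Array(input);
  throw new TypeError(`expected ArrayBuffer or Uint8Array, got ${typeof input}`);
}

const fromArrayBuffer = toBytes(new ArrayBuffer(4));

// A file path is rejected with a readable message instead of a .slice error.
let pathRejected = false;
try {
  toBytes("input.docx");
} catch (err) {
  pathRejected = err instanceof TypeError;
}
```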
399
+
400
+ ### 10. Running the library in a browser — things that used to break but no longer do
401
+
402
+ As of `@otomate/docx@0.3.1` / `otomate@0.3.1`, the library has **zero Node-only imports**. Snapshot hashing goes through `globalThis.crypto.subtle.digest` (Web Crypto API) and the base64 fallback in the ZIP reader uses `globalThis.btoa`. Both are available in Node ≥ 20 and every modern browser. If you're on an older version and see `createHash is not a function` or `Buffer is not defined` at load time or runtime, upgrade to ≥ 0.3.1.
403
+
404
+ If you're still seeing a `tsc --noEmit` complaint about `Cannot find name 'Buffer'` or `Cannot find module 'node:crypto'` in your own downstream code (not from inside the library), that's a tsconfig issue in your project — add `@types/node` to your devDeps if you actually use Node globals, or remove stale references if you don't.
405
+
406
+
407
+
408
+ ### 11. Hyperlinks I added programmatically collide with existing hyperlinks on round-trip
409
+
410
+ **Symptom.** You read a `.docx`, added hyperlinks to the tree, and wrote it back — now some hyperlinks point to the wrong URL.
411
+
412
+ **Cause.** OOXML hyperlinks are keyed by `r:id` into `word/_rels/document.xml.rels`. If the input file already had hyperlinks using `rId100`, `rId101`, etc., the writer needs to allocate fresh rIds that don't collide. The library handles this automatically via `nextRIdFor` which scans the existing rels for the highest used rId and seeds past it — **but only if `tree.data.docx.relationships` is preserved on the input tree**. If you built the tree from scratch or stripped `data.docx`, you might clash.
413
+
414
+ **Fix.** When round-tripping, don't strip `tree.data.docx`. When building from scratch, don't worry — there are no pre-existing rIds to collide with.
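The allocation strategy described above (find the highest used rId and continue past it) can be sketched as follows. This illustrates the idea only and is not the library's `nextRIdFor`; the function name and the assumption that ids look like `rId<number>` are mine.

```typescript
// Return an rId one past the highest numeric rId already in use.
function nextFreeRId(existingIds: Iterable<string>): string {
  let highest = 0;
  for (const id of existingIds) {
    const match = /^rId(\d+)$/.exec(id);
    if (match) highest = Math.max(highest, Number(match[1]));
  }
  return `rId${highest + 1}`;
}

// Ids that do not follow the rId<number> pattern are ignored.
const freshRId = nextFreeRId(["rId1", "rId100", "rId101", "custom"]);
```

With an empty relationship list the sketch starts at `rId1`, which matches the point above that a tree built from scratch has nothing to collide with.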
415
+
416
+ ### 12. Numbered lists in the output all continue from the previous list's counter
417
+
418
+ **Symptom.** You have two `<ol>` elements and the second one starts at "4" instead of "1".
419
+
420
+ **Cause.** Earlier versions of the library shared `numId="2"` across all ordered lists, so Word treated them as one continuous sequence. Current versions allocate a fresh `w:numId` per top-level list (via `allocNumId`) and inject matching `<w:num>` entries into `numbering.xml`, so each list gets its own counter.
421
+
422
+ **Fix.** Update to the latest version. If you're stuck on an old one, insert an explicit `start: 1` into the list node.
423
+
424
+ ### 13. Diff engine drops changes deep inside a large subtree
425
+
426
+ **Symptom.** Editing a few words inside a paragraph nested six levels deep shows no change in the diff result.
427
+
428
+ **Cause.** The diff engine uses a Dice-coefficient threshold (default `0.5`) for bottom-up matching — subtrees that are more than 50% similar are considered "the same" and their differences are merged into a single `updateText` operation rather than generating insert/delete pairs. If the similarity falls just above the threshold, small edits can get lost in the noise.
429
+
430
+ **Fix.** Pass `diff(oldTree, newTree, { diceThreshold: 0.3 })` to make the matcher more sensitive. Lower values catch more fine-grained changes at the cost of more total operations.
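For intuition about what the threshold is compared against, here is the generic Dice coefficient on two sets: twice the number of shared items divided by the combined size. How the diff engine derives the sets from two subtrees is not specified in this guide, so treat the node labels below as purely illustrative.

```typescript
// Dice coefficient: 2 * |A ∩ B| / (|A| + |B|), in the range 0..1.
function diceCoefficient<T>(a: Set<T>, b: Set<T>): number {
  if (a.size === 0 && b.size === 0) return 1; // two empty sets are identical
  let shared = 0;
  for (const item of a) if (b.has(item)) shared++;
  return (2 * shared) / (a.size + b.size);
}

// Three of four labels are shared: (2 * 3) / (4 + 4) = 0.75.
const similarity = diceCoefficient(
  new Set(["p", "li", "ul", "text"]),
  new Set(["p", "li", "ul", "table"]),
);
```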
431
+
432
+ ### 14. Running the e2e test locally — stale `dist/` trap
433
+
434
+ **Symptom.** A test that imports from `otomate` (the umbrella package) fails with an assertion that looks wrong, e.g. "inline `<style>` should have been extracted into data.css", even though the source code clearly does the extraction.
435
+
436
+ **Cause.** The umbrella package imports from `@otomate/html`, `@otomate/docx`, etc. via their **built `dist/` directories**, not their source. If you made changes to a sub-package's source but haven't rebuilt it, the umbrella will still use the stale dist.
437
+
438
+ **Fix.** Before running tests that exercise the umbrella, rebuild the sub-packages:
439
+
440
+ ```bash
441
+ pnpm --filter @otomate/core --filter @otomate/diff --filter @otomate/css-docx \
442
+ --filter @otomate/inject --filter @otomate/html --filter @otomate/docx \
443
+ build
444
+ ```
445
+
446
+ Or use `tsx` to import directly from the sub-package sources if you're iterating rapidly.
447
+
448
+ ---
449
+
450
+ ## Running the test that backs this guide
451
+
452
+ The canonical reference implementation of this flow is a real test file:
453
+
454
+ ```bash
455
+ cd packages/otomate
456
+ pnpm test # runs every test including e2e-tracked-changes.test.ts
457
+ ```
458
+
459
+ Or to run just the e2e test:
460
+
461
+ ```bash
462
+ cd packages/otomate
463
+ node --test --import tsx src/__tests__/e2e-tracked-changes.test.ts
464
+ ```
465
+
466
+ The test exercises every stage in this guide with explicit assertions, so if
467
+ the library regresses on any part of the pipeline, that test will fail with
468
+ a message like `[stage 4] document.xml is missing any <w:del> elements` —
469
+ pinpointing exactly where the breakage is.
470
+
471
+ ## See also
472
+
473
+ - **`SKILL.md`** (next to this file) — condensed entry-point reference and pattern recipes
474
+ - **`README.md`** (repo root) — architecture overview and diff algorithm details
475
+ - **`packages/docx/src/__tests__/realworld.test.ts`** — end-to-end tests against a dozen real-world HTML fixtures
476
+ - **ECMA-376 §17.13.5** — the OOXML tracked-changes schema (`<w:ins>`, `<w:del>`, `<w:moveFrom>`, etc.) if you want to understand the output format at the byte level
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "otomate",
3
- "version": "0.2.1",
3
+ "version": "0.3.1",
4
4
  "description": "Universal document diffing library — structure-aware, string-level, multi-format",
5
5
  "type": "module",
6
6
  "main": "./dist/otomate.umd.cjs",
@@ -15,11 +15,14 @@
15
15
  },
16
16
  "files": [
17
17
  "dist",
18
- "README.md"
18
+ "README.md",
19
+ "SKILL.md",
20
+ "guides"
19
21
  ],
20
22
  "scripts": {
21
23
  "build": "vite build && tsc --emitDeclarationOnly",
22
- "typecheck": "tsc --noEmit"
24
+ "typecheck": "tsc --noEmit",
25
+ "test": "node --test --import tsx src/__tests__/*.test.ts"
23
26
  },
24
27
  "devDependencies": {
25
28
  "@otomate/core": "workspace:*",
@@ -28,6 +31,8 @@
28
31
  "@otomate/docx": "workspace:*",
29
32
  "@otomate/css-docx": "workspace:*",
30
33
  "@otomate/inject": "workspace:*",
34
+ "@types/node": "^25.5.2",
35
+ "tsx": "^4.19.0",
31
36
  "typescript": "^5.7.0",
32
37
  "vite": "^6.0.0"
33
38
  },