otomate 0.2.0 → 0.3.0
- package/README.md +8 -3
- package/SKILL.md +368 -0
- package/dist/index.d.ts +1 -1
- package/dist/index.d.ts.map +1 -1
- package/dist/otomate.js +7453 -6938
- package/dist/otomate.js.map +1 -1
- package/dist/otomate.umd.cjs +61 -43
- package/dist/otomate.umd.cjs.map +1 -1
- package/guides/html-to-docx-and-back.md +484 -0
- package/package.json +8 -3
|
@@ -0,0 +1,484 @@
|
|
|
1
|
+
# HTML → docx → HTML, with tracked changes
|
|
2
|
+
|
|
3
|
+
A complete, end-to-end walkthrough of converting HTML to a Word document with
|
|
4
|
+
tracked changes, reimporting it, accepting the revisions, and confirming the
|
|
5
|
+
round-trip is lossless. Every step here is backed by a real test at
|
|
6
|
+
`packages/otomate/src/__tests__/e2e-tracked-changes.test.ts` — if you copy the
|
|
7
|
+
snippets verbatim they will run.
|
|
8
|
+
|
|
9
|
+
> **Prerequisites.** Node ≥ 20, `otomate` installed. All `writeDocx` /
|
|
10
|
+
> `readDocx` / `writeDiffDocx` calls are **async** — always `await` them.
|
|
11
|
+
> `readHtml` / `writeHtml` / `renderDiffHtml` / `diff` are synchronous.
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## Overview — the 7 stages
|
|
16
|
+
|
|
17
|
+
```
|
|
18
|
+
HTML → Stage 1 → UDM (oldTree)
|
|
19
|
+
↓ edit
|
|
20
|
+
HTML' → Stage 2 → UDM (newTree)
|
|
21
|
+
↓
|
|
22
|
+
diff(old, new) → Stage 3 → DiffResult + annotated HTML preview
|
|
23
|
+
↓
|
|
24
|
+
writeDiffDocx → Stage 4 → .docx with <w:ins> / <w:del>
|
|
25
|
+
↓
|
|
26
|
+
readDocx → Stage 5 → UDM + tracked-changes HTML render
|
|
27
|
+
↓ "accept all"
|
|
28
|
+
writeHtml → Stage 6 → plain HTML, revisions applied
|
|
29
|
+
↓
|
|
30
|
+
assert equal → Stage 7 → writeHtml(newTree) // lossless round-trip
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
The invariant that proves everything works: **the output of stage 6 must be
|
|
34
|
+
byte-for-byte identical to `writeHtml(newTree)`**. If that equality holds, the
|
|
35
|
+
diff you computed in stage 3 survived the docx round-trip intact, and "accept
|
|
36
|
+
all revisions" recovered exactly the edit you wanted.
|
|
37
|
+
|
|
38
|
+
---
|
|
39
|
+
|
|
40
|
+
## Stage 1 — Generate rich HTML and parse it
|
|
41
|
+
|
|
42
|
+
```typescript
|
|
43
|
+
import { readHtml } from "otomate";
|
|
44
|
+
|
|
45
|
+
const originalHtml = `<style>
|
|
46
|
+
.title { color: #1e3a5f; font-family: Georgia; font-size: 24pt; }
|
|
47
|
+
.callout { background-color: #fff8e1; border: 1pt solid #fbbf24; }
|
|
48
|
+
.critical { color: #dc2626; font-weight: bold; }
|
|
49
|
+
</style>
|
|
50
|
+
<h1 class="title">Q1 2024 Product Roadmap</h1>
|
|
51
|
+
<p>Welcome to the <strong>first quarter</strong> roadmap covering
|
|
52
|
+
our <em>strategic priorities</em>.</p>
|
|
53
|
+
<h2>Initiatives</h2>
|
|
54
|
+
<ul>
|
|
55
|
+
<li>
|
|
56
|
+
<p>Infrastructure improvements</p>
|
|
57
|
+
<ul>
|
|
58
|
+
<li>
|
|
59
|
+
<p>Database migration</p>
|
|
60
|
+
<blockquote>
|
|
61
|
+
<p>The migration must complete before <u>March 15th</u>.</p>
|
|
62
|
+
</blockquote>
|
|
63
|
+
</li>
|
|
64
|
+
</ul>
|
|
65
|
+
</li>
|
|
66
|
+
</ul>
|
|
67
|
+
<table>
|
|
68
|
+
<thead><tr><th>Metric</th><th>Q4</th><th>Q1 Target</th></tr></thead>
|
|
69
|
+
<tbody><tr><td>Revenue</td><td>$1.2M</td><td>$1.5M</td></tr></tbody>
|
|
70
|
+
</table>
|
|
71
|
+
<div class="callout">
|
|
72
|
+
<p class="critical">Risks identified:</p>
|
|
73
|
+
<ul><li><p>Integration delays</p></li></ul>
|
|
74
|
+
</div>`;
|
|
75
|
+
|
|
76
|
+
const oldTree = readHtml(originalHtml);
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
**What's in `oldTree`:**
|
|
80
|
+
|
|
81
|
+
- The UDM tree — a `root` with block children (`h1`, `p`, `h2`, `ul`, `table`, `div`).
|
|
82
|
+
- `oldTree.data.css.classRules` — auto-extracted from the inline `<style>` block (no need to pass `options.css` explicitly; it merges with whatever you do pass).
|
|
83
|
+
- `classes: string[]` on every element that had an HTML `class` attribute.
|
|
84
|
+
- Nesting depth ≥ 5 (the blockquote path alone goes `ul → li → ul → li → blockquote → p → text`).
|
|
85
|
+
|
|
86
|
+
---
|
|
87
|
+
|
|
88
|
+
## Stage 2 — Modify the document
|
|
89
|
+
|
|
90
|
+
Generate an edited version. For the diff engine to exercise every code path
|
|
91
|
+
you want a mix of edit kinds:
|
|
92
|
+
|
|
93
|
+
| Kind | Effect in docx output |
|
|
94
|
+
|---|---|
|
|
95
|
+
| Text change | `<w:del>` + `<w:ins>` on a per-character/word run |
|
|
96
|
+
| Root-level paragraph insertion | `<w:ins>` wrapping each run of the new paragraph |
|
|
97
|
+
| Root-level node deletion | `<w:del>` wrapping the deleted block's text |
|
|
98
|
+
|
|
99
|
+
```typescript
|
|
100
|
+
let modifiedHtml = originalHtml;
|
|
101
|
+
modifiedHtml = modifiedHtml.replace(
|
|
102
|
+
"Q1 2024 Product Roadmap",
|
|
103
|
+
"Q1 2024 Product & Engineering Roadmap",
|
|
104
|
+
);
|
|
105
|
+
modifiedHtml = modifiedHtml.replace("March 15th", "March 30th");
|
|
106
|
+
modifiedHtml = modifiedHtml.replace("$1.5M", "$1.8M");
|
|
107
|
+
modifiedHtml = modifiedHtml.replace(
|
|
108
|
+
"</h1>",
|
|
109
|
+
`</h1>\n<p class="summary">This quarter focuses on scale and accessibility.</p>`,
|
|
110
|
+
);
|
|
111
|
+
// Delete the entire callout div (a root-level node)
|
|
112
|
+
modifiedHtml = modifiedHtml.replace(/<div class="callout">[\s\S]*?<\/div>/, "");
|
|
113
|
+
|
|
114
|
+
const newTree = readHtml(modifiedHtml);
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
> ⚠️ **Only root-level deletes render as `<w:del>` in the docx.**
|
|
118
|
+
> `writeDiffDocx` renders deletes via a Pass 2 that filters for
|
|
119
|
+
> `op.path.length === 1`. A delete nested inside a list or table will survive
|
|
120
|
+
> in the snapshot (so the round-trip still works) but **will not show up as a
|
|
121
|
+
> revision mark** in Word. See the troubleshooting section below.
|
|
122
|
+
|
|
123
|
+
---
|
|
124
|
+
|
|
125
|
+
## Stage 3 — Diff and render an annotated HTML preview
|
|
126
|
+
|
|
127
|
+
```typescript
|
|
128
|
+
import { diff, renderDiffHtml } from "otomate";
|
|
129
|
+
|
|
130
|
+
const delta = diff(oldTree, newTree);
|
|
131
|
+
console.log(delta.stats);
|
|
132
|
+
// { nodesAdded: 1, nodesDeleted: 1, nodesMoved: 0, nodesModified: 0, textChanges: 3 }
|
|
133
|
+
|
|
134
|
+
// The signature is (oldTree, newTree, delta) — NOT (delta, oldTree, newTree).
|
|
135
|
+
// See troubleshooting #1 below.
|
|
136
|
+
const diffHtml = renderDiffHtml(oldTree, newTree, delta);
|
|
137
|
+
```
|
|
138
|
+
|
|
139
|
+
`diffHtml` is a standalone HTML string with `<ins>` and `<del>` markers plus
|
|
140
|
+
`data-diff="insert|delete|update"` attributes. It's the canonical preview
|
|
141
|
+
format — use it to show reviewers what will change before you commit the
|
|
142
|
+
edit to a Word document.
|
|
143
|
+
|
|
144
|
+
**Accepted options:** `{ insClass, delClass, modClass, moveClass, inlineStyles, side }`. `side: "old" | "new" | "merged"` controls whether you render the old view, the new view, or a combined view with both insertions and deletions visible. The default is `"merged"` which is what you want for a diff preview.
|
|
145
|
+
|
|
146
|
+
---
|
|
147
|
+
|
|
148
|
+
## Stage 4 — Write a tracked-changes `.docx`
|
|
149
|
+
|
|
150
|
+
```typescript
|
|
151
|
+
import { writeDiffDocx } from "otomate";
|
|
152
|
+
import { writeFileSync } from "node:fs";
|
|
153
|
+
|
|
154
|
+
const trackedBuf = await writeDiffDocx(newTree, delta, {
|
|
155
|
+
author: "Jane Editor",
|
|
156
|
+
date: "2024-04-07T12:00:00Z",
|
|
157
|
+
});
|
|
158
|
+
writeFileSync("roadmap-tracked.docx", trackedBuf);
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
The resulting file opens in Microsoft Word with all changes as revisions. Word's
|
|
162
|
+
**Review → Accept All** and **Review → Reject All** buttons work correctly
|
|
163
|
+
because each `<w:ins>` / `<w:del>` has a unique `w:id` and carries the author
|
|
164
|
+
and date you supplied.
|
|
165
|
+
|
|
166
|
+
**What the writer does internally:**
|
|
167
|
+
|
|
168
|
+
1. Builds lookup maps of insert/delete/updateText operations indexed by tree path.
|
|
169
|
+
2. Renders `newTree` with diff-aware converters that wrap inserted runs in `<w:ins>` and split text changes into `<w:del>` (old) + `<w:ins>` (new) segments.
|
|
170
|
+
3. **Pass 2** splices root-level deleted blocks into their approximate old position, each wrapped in `<w:del>` with the deleted block's text content extracted recursively.
|
|
171
|
+
4. Embeds the UDM snapshot (bound to the `document.xml` hash) so `readDocx` can round-trip perfectly.
|
|
172
|
+
|
|
173
|
+
**Verify the XML actually contains tracked changes:**
|
|
174
|
+
|
|
175
|
+
```typescript
|
|
176
|
+
import { extractDocx } from "otomate";
|
|
177
|
+
|
|
178
|
+
const parts = await extractDocx(trackedBuf);
|
|
179
|
+
const docXml = parts.document;
|
|
180
|
+
|
|
181
|
+
// Both revision types must be present.
|
|
182
|
+
if (!docXml.includes("<w:ins")) throw new Error("no insertions emitted");
|
|
183
|
+
if (!docXml.includes("<w:del")) throw new Error("no deletions emitted");
|
|
184
|
+
|
|
185
|
+
// Every revision id must be unique.
|
|
186
|
+
const ids = [...docXml.matchAll(/<w:(?:ins|del)\s+w:id="(\d+)"/g)].map(m => m[1]);
|
|
187
|
+
if (new Set(ids).size !== ids.length) throw new Error("duplicate revision ids");
|
|
188
|
+
```
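As a follow-up to the checks above, here is a minimal sketch of reading the author and date off each revision mark. It runs on a hand-written XML fragment rather than real library output. `listRevisions` and `readAttr` are hypothetical helpers, not part of otomate; `w:id`, `w:author`, and `w:date` are the standard OOXML attributes on `<w:ins>` / `<w:del>`.

```typescript
// Hypothetical helpers, not part of otomate: list revision marks in document.xml.
interface RevisionMark {
  kind: "ins" | "del";
  id: string;
  author: string | null;
  date: string | null;
}

// Read one attribute value out of a start tag, or null if it is absent.
function readAttr(tag: string, attrName: string): string | null {
  const m = new RegExp("\\s" + attrName + '="([^"]*)"').exec(tag);
  return m ? m[1] : null;
}

function listRevisions(docXml: string): RevisionMark[] {
  const marks: RevisionMark[] = [];
  // Match only the opening tags <w:ins ...> and <w:del ...>;
  // the required whitespace keeps <w:delText> from matching.
  const tagRe = /<w:(ins|del)\s[^>]*>/g;
  let m: RegExpExecArray | null;
  while ((m = tagRe.exec(docXml)) !== null) {
    const tag = m[0];
    marks.push({
      kind: m[1] as "ins" | "del",
      id: readAttr(tag, "w:id") || "",
      author: readAttr(tag, "w:author"),
      date: readAttr(tag, "w:date"),
    });
  }
  return marks;
}

// Hand-written sample fragment (not real library output).
const sampleXml =
  '<w:p><w:del w:id="1" w:author="Jane Editor" w:date="2024-04-07T12:00:00Z">' +
  "<w:r><w:delText>15th</w:delText></w:r></w:del>" +
  '<w:ins w:id="2" w:author="Jane Editor" w:date="2024-04-07T12:00:00Z">' +
  "<w:r><w:t>30th</w:t></w:r></w:ins></w:p>";

const revisionMarks = listRevisions(sampleXml);
console.log(revisionMarks);
```

Against a real file you would pass the `docXml` string from the previous snippet instead of `sampleXml`.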
|
|
189
|
+
|
|
190
|
+
---
|
|
191
|
+
|
|
192
|
+
## Stage 5 — Re-import the `.docx` and render tracked HTML
|
|
193
|
+
|
|
194
|
+
```typescript
|
|
195
|
+
import { readDocx } from "otomate";
|
|
196
|
+
|
|
197
|
+
const rereadTree = await readDocx(trackedBuf);
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
`readDocx` sees `word/otomate-udm.json` inside the ZIP, validates its
|
|
201
|
+
`__docHash` against the current `document.xml`, and (because we haven't
|
|
202
|
+
touched the file) returns `newTree` back from the snapshot — lossless.
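
The hash binding described above can be pictured with a small, dependency-free sketch. Everything here is an assumption made for illustration: the snapshot shape, the helper names, and the FNV-1a hash, which stands in for whatever hash the library actually computes.

```typescript
// Illustration only: the real snapshot format and hash are internal to otomate.
// FNV-1a (32-bit, over UTF-16 code units) is a stand-in with no dependencies.
function fnv1aHex(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, kept within 32 bits.
    h = (h + ((h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24))) >>> 0;
  }
  return h.toString(16);
}

interface SnapshotLike<T> {
  docHash: string; // hash of document.xml at write time
  tree: T;         // the tree that was written
}

function makeSnapshot<T>(documentXml: string, tree: T): SnapshotLike<T> {
  return { docHash: fnv1aHex(documentXml), tree: tree };
}

// Returns the stored tree only when document.xml is unchanged;
// null means "fall back to parsing the OOXML".
function treeFromSnapshot<T>(documentXml: string, snap: SnapshotLike<T>): T | null {
  return fnv1aHex(documentXml) === snap.docHash ? snap.tree : null;
}

const writtenXml = "<w:document><w:body><w:p/></w:body></w:document>";
const snapshot = makeSnapshot(writtenXml, { type: "root", children: [] as string[] });

// Untouched file: the snapshot is trusted.
const untouchedTree = treeFromSnapshot(writtenXml, snapshot);
// File re-saved by another tool: the XML differs, so the snapshot is rejected.
const resavedTree = treeFromSnapshot(writtenXml.replace("<w:p/>", "<w:p></w:p>"), snapshot);
console.log(untouchedTree !== null, resavedTree === null);
```

The point of the design is that any edit to `document.xml` outside the library, however small, invalidates the snapshot rather than silently returning a stale tree.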
|
|
203
|
+
|
|
204
|
+
To get **HTML that renders the tracked changes**, i.e., a view
|
|
205
|
+
showing both the inserted and deleted content so a reviewer can see what
|
|
206
|
+
will change — feed `rereadTree` back through `renderDiffHtml` with the
|
|
207
|
+
**same delta** you computed in stage 3:
|
|
208
|
+
|
|
209
|
+
```typescript
|
|
210
|
+
const rereadTrackedHtml = renderDiffHtml(oldTree, rereadTree, delta);
|
|
211
|
+
// Contains <ins> for insertions, <del> for deletions, data-diff-* markers.
|
|
212
|
+
```
|
|
213
|
+
|
|
214
|
+
You need both `oldTree` and the delta because the OOXML path loses the
|
|
215
|
+
diff-operation metadata — `<w:ins>` / `<w:del>` tell Word what to render but
|
|
216
|
+
don't encode the structural tree mapping the diff engine needs to reproduce
|
|
217
|
+
the preview. Keep the `delta` object around if you want to re-render the
|
|
218
|
+
tracked-changes view later.
|
|
219
|
+
|
|
220
|
+
---
|
|
221
|
+
|
|
222
|
+
## Stage 6 — Accept all changes
|
|
223
|
+
|
|
224
|
+
Accepting all revisions is the same as taking the "new" side of the diff,
|
|
225
|
+
which is exactly what `rereadTree` already is (it came from the snapshot of
|
|
226
|
+
`newTree`). So "accept all" is just:
|
|
227
|
+
|
|
228
|
+
```typescript
|
|
229
|
+
import { writeHtml } from "otomate";
|
|
230
|
+
|
|
231
|
+
const acceptedHtml = writeHtml(rereadTree);
|
|
232
|
+
```
|
|
233
|
+
|
|
234
|
+
No `<ins>`, no `<del>`, no `data-diff-*` attributes — just clean HTML with
|
|
235
|
+
the edits applied.
|
|
236
|
+
|
|
237
|
+
If you need to **reject** all changes instead, serialize `oldTree`:
|
|
238
|
+
|
|
239
|
+
```typescript
|
|
240
|
+
const rejectedHtml = writeHtml(oldTree); // equivalent to "reject all"
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
For mixed accept/reject (per-revision decisions) you'd need to replay the
|
|
244
|
+
diff operations selectively against `oldTree`. The library doesn't currently
|
|
245
|
+
ship an `applyDiff(tree, opsToApply)` helper; if you need this, walk
|
|
246
|
+
`delta.operations` yourself and reconstruct the tree by picking which ops to
|
|
247
|
+
include.
|
|
248
|
+
|
|
249
|
+
---
|
|
250
|
+
|
|
251
|
+
## Stage 7 — Prove the round-trip is lossless
|
|
252
|
+
|
|
253
|
+
```typescript
|
|
254
|
+
import assert from "node:assert/strict";
|
|
255
|
+
|
|
256
|
+
const expectedAcceptedHtml = writeHtml(newTree);
|
|
257
|
+
assert.equal(
|
|
258
|
+
acceptedHtml,
|
|
259
|
+
expectedAcceptedHtml,
|
|
260
|
+
"round-trip broke: accepted HTML does not match the expected modified HTML",
|
|
261
|
+
);
|
|
262
|
+
```
|
|
263
|
+
|
|
264
|
+
This is the strongest possible assertion: byte-for-byte equality between
|
|
265
|
+
the accepted state (stage 6) and the direct serialization of `newTree`.
|
|
266
|
+
If the equality holds, every part of the pipeline — diff computation,
|
|
267
|
+
docx tracked-changes serialization, ZIP packing, OOXML re-parsing, snapshot
|
|
268
|
+
validation, HTML serialization — is lossless end to end.
|
|
269
|
+
|
|
270
|
+
**You can also spot-check individual changes:**
|
|
271
|
+
|
|
272
|
+
```typescript
|
|
273
|
+
// Every insertion made it into the output.
|
|
274
|
+
assert.ok(acceptedHtml.includes("Engineering Roadmap")); // heading edit
|
|
275
|
+
assert.ok(acceptedHtml.includes("March 30th")); // blockquote edit
|
|
276
|
+
assert.ok(acceptedHtml.includes("$1.8M")); // table edit
|
|
277
|
+
assert.ok(acceptedHtml.includes("focuses on scale")); // new paragraph
|
|
278
|
+
|
|
279
|
+
// Every deletion was applied.
|
|
280
|
+
assert.ok(!acceptedHtml.includes("March 15th"));
|
|
281
|
+
assert.ok(!acceptedHtml.includes("$1.5M"));
|
|
282
|
+
assert.ok(!acceptedHtml.includes("Risks identified")); // deleted callout
|
|
283
|
+
```
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
287
|
+
## Troubleshooting
|
|
288
|
+
|
|
289
|
+
Every item in this section is a trap we hit while building the test that
|
|
290
|
+
this guide is based on, **or** a subtle pitfall you will hit the first time
|
|
291
|
+
you integrate the library. Read the whole list before writing any code.
|
|
292
|
+
|
|
293
|
+
### 1. `diffResult.operations is not iterable` / `Cannot read property 'type' of undefined` inside `renderDiffHtml`
|
|
294
|
+
|
|
295
|
+
**Cause.** Wrong argument order on `renderDiffHtml`. The correct signature is:
|
|
296
|
+
|
|
297
|
+
```typescript
|
|
298
|
+
renderDiffHtml(oldTree, newTree, diffResult, options?)
|
|
299
|
+
```
|
|
300
|
+
|
|
301
|
+
It is **not** `(diffResult, oldTree, newTree)`. An easy way to remember:
|
|
302
|
+
`renderDiffHtml` parallels `diff` in that the trees come first.
|
|
303
|
+
|
|
304
|
+
**Fix.** Pass the arguments in the order `(oldTree, newTree, delta)`.
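
If the call site is far from where the delta is computed, a small runtime guard can catch a swapped argument early. The sketch below is hypothetical: `looksLikeDiffResult` and `checkedRender` are not library functions, and the only assumption about the diff result is that it has an `operations` array, as the error message above implies. A stand-in renderer is used so the sketch runs without the library.

```typescript
// Assumed shape: only the `operations` array shown in this guide is relied on.
interface DiffResultLike {
  operations: unknown[];
}

function looksLikeDiffResult(value: unknown): value is DiffResultLike {
  return (
    typeof value === "object" &&
    value !== null &&
    Array.isArray((value as { operations?: unknown }).operations)
  );
}

// Hypothetical wrapper: verify the third argument before delegating to a renderer.
function checkedRender<T>(
  oldTree: T,
  newTree: T,
  delta: unknown,
  render: (oldTree: T, newTree: T, delta: DiffResultLike) => string,
): string {
  if (!looksLikeDiffResult(delta)) {
    throw new TypeError("third argument must be the diff result: (oldTree, newTree, delta)");
  }
  return render(oldTree, newTree, delta);
}

// Stand-in renderer so the sketch runs without the library.
const fakeRender = (_o: object, _n: object, d: DiffResultLike): string =>
  d.operations.length + " ops";

const treeA: object = { type: "root" };
const treeB: object = { type: "root" };
const renderedOk = checkedRender<object>(treeA, treeB, { operations: [] }, fakeRender);
console.log(renderedOk);
```

In real code you would pass `renderDiffHtml` in place of `fakeRender`.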
|
|
305
|
+
|
|
306
|
+
### 2. Deleted content is missing from the generated docx
|
|
307
|
+
|
|
308
|
+
**Symptom.** The diff engine reports a delete (visible in `delta.stats.nodesDeleted`), but Word opens the file with no `<w:del>` anywhere and the deleted content simply isn't there.
|
|
309
|
+
|
|
310
|
+
**Cause.** `writeDiffDocx` only renders **root-level** deletes as `<w:del>`. Its pass 2 filters for `op.path.length === 1`, so a delete nested inside a list item, table cell, or blockquote will not produce any OOXML revision mark. The tree snapshot still contains the correct state, so `readDocx` will round-trip correctly, but Word won't show a strike-through for that deletion.
|
|
311
|
+
|
|
312
|
+
**Fix.** If you need a deletion to show up as a tracked change in Word, restructure the edit so the deleted node is a direct child of `root`. If that's not possible (e.g. deleting a single list item), you have two options:
|
|
313
|
+
|
|
314
|
+
- Accept the limitation — the change still applies when the user clicks "Accept All", just without a visible revision mark for that specific delete.
|
|
315
|
+
- Mutate via an insert of an empty node at the same position plus text replacement, turning the delete into an `updateText` op which *is* rendered as `<w:del>` + `<w:ins>` inside the paragraph.
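
To find out ahead of time which deletes will be applied without a revision mark, you can partition the operations by path depth. This is a sketch under assumptions: the operation shape (`type`, `path`) and the `"delete"` type name are inferred from this guide, not taken from the library's type definitions, and `splitDeletes` is a hypothetical helper.

```typescript
// Assumed minimal shape of a diff operation; the real type in otomate may differ.
interface DiffOpLike {
  type: string;   // e.g. "insert", "delete", "updateText" (names assumed)
  path: number[]; // child indices from the root down to the node
}

// Split delete ops into those the docx writer would mark (root-level)
// and those that would be applied silently (nested).
function splitDeletes(ops: DiffOpLike[]): { marked: DiffOpLike[]; silent: DiffOpLike[] } {
  const marked: DiffOpLike[] = [];
  const silent: DiffOpLike[] = [];
  for (const op of ops) {
    if (op.type !== "delete") continue;
    (op.path.length === 1 ? marked : silent).push(op);
  }
  return { marked, silent };
}

// Made-up operations for illustration.
const exampleOps: DiffOpLike[] = [
  { type: "delete", path: [5] },       // root-level block
  { type: "delete", path: [3, 0, 1] }, // nested list item
  { type: "updateText", path: [0, 0] },
];

const deleteSplit = splitDeletes(exampleOps);
if (deleteSplit.silent.length > 0) {
  console.warn(deleteSplit.silent.length + " nested delete(s) will have no revision mark in Word");
}
```

With a real delta you would call `splitDeletes(delta.operations)` before `writeDiffDocx`, assuming the operations expose those two fields.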
|
|
316
|
+
|
|
317
|
+
### 3. Tracked changes show up in the `.docx` but not in my re-rendered HTML preview
|
|
318
|
+
|
|
319
|
+
**Symptom.** Stage 4's XML contains `<w:ins>` and `<w:del>`, but stage 5's `renderDiffHtml` output has no `<ins>` or `<del>` markers.
|
|
320
|
+
|
|
321
|
+
**Cause.** You passed the wrong tree or delta to `renderDiffHtml`. The function needs **the original `oldTree`**, **the re-imported tree** (`rereadTree`, which equals `newTree` via the snapshot), and **the original delta** you computed in stage 3. Passing `oldTree, newTree, delta` renders correctly. Passing `rereadTree, rereadTree, delta` won't — there's nothing to compare against.
|
|
322
|
+
|
|
323
|
+
**Fix.** Keep `oldTree` and `delta` in scope through the whole pipeline. Don't try to "rediscover" the diff from `rereadTree` alone — the snapshot path gives you a clean `newTree`, not the edit history.
|
|
324
|
+
|
|
325
|
+
### 4. "File is corrupt" / "Word found unreadable content" dialog
|
|
326
|
+
|
|
327
|
+
If Word complains when opening your `.docx`, one of these usually matches:
|
|
328
|
+
|
|
329
|
+
| Symptom in the dialog | Cause | Fix |
|
|
330
|
+
|---|---|---|
|
|
331
|
+
| Word offers to "repair" the document | Schema-order violation inside `<w:rPr>` or `<w:pPr>`, or empty `<w:tr>` with no `<w:tc>`, or missing `<w:tblGrid>`, or invalid content in text | All of these are fixed inside the library. If you see one, update to the latest version. |
|
|
332
|
+
| "The name in the end tag... must match the start tag" | The text you fed in contained raw control characters in the range `\x00` to `\x1F` (other than tab, line feed, and carriage return) that aren't valid in XML 1.0 | `esc()` strips these automatically via `sanitizeText`, so you shouldn't hit this in normal use; if you're passing binary data as text, clean it first |
|
|
333
|
+
| "Cannot find a part of the document" | You edited the `.docx` ZIP by hand and forgot to update `[Content_Types].xml` | Let the library produce the ZIP; don't unzip/rezip manually |
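
For the control-character row, a minimal pre-cleaning sketch is shown below, in case you want to scrub text before it reaches the library. `stripInvalidXmlControlChars` is a hypothetical helper; the library's own `sanitizeText` may behave differently. XML 1.0 permits tab, line feed, and carriage return, so those three are kept.

```typescript
// Hypothetical pre-cleaning helper; otomate's own sanitizeText may behave differently.
// XML 1.0 allows tab (0x09), line feed (0x0A) and carriage return (0x0D),
// but no other character below 0x20.
function stripInvalidXmlControlChars(text: string): string {
  return text.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, "");
}

const dirtyText = "Revenue\x00 up\x07\t12%\n";
const cleanText = stripInvalidXmlControlChars(dirtyText);
console.log(JSON.stringify(cleanText));
```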
|
|
334
|
+
|
|
335
|
+
### 5. `acceptedHtml` doesn't match `writeHtml(newTree)` exactly
|
|
336
|
+
|
|
337
|
+
**Symptom.** Stage 7's `assert.equal` fails. The two strings differ in whitespace, attribute order, or similar cosmetic details.
|
|
338
|
+
|
|
339
|
+
**Cause.** Something mutated the tree between stages. Common culprits:
|
|
340
|
+
|
|
341
|
+
- You modified `newTree` after computing `delta` — now `newTree` and the snapshot disagree.
|
|
342
|
+
- You wrote the docx, opened it in Word, saved it, then read it back — the snapshot hash is invalidated and `readDocx` falls back to OOXML parsing, which is lossier than the snapshot path. You'll see `div`/`figure` flattened into paragraphs, `data.html.*` attributes stripped, and custom marks dropped.
|
|
343
|
+
- You passed a different diff to `renderDiffHtml` in stage 5 than you used in stage 4.
|
|
344
|
+
|
|
345
|
+
**Fix.** Treat `oldTree`, `newTree`, and `delta` as immutable once computed. If you must edit in Word before round-tripping back, expect lossy results — the snapshot is only valid for documents otomate wrote and nothing has touched since.
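
When the equality check fails on a long document, the assertion output can be hard to read. A small helper that reports the first diverging index narrows it down. `firstDifference` is a hypothetical debugging aid, not something the library ships.

```typescript
// Hypothetical debugging helper: report where two strings first diverge.
function firstDifference(actual: string, expected: string, context = 20): string | null {
  const limit = Math.min(actual.length, expected.length);
  let i = 0;
  while (i < limit && actual.charCodeAt(i) === expected.charCodeAt(i)) i++;
  if (i === limit && actual.length === expected.length) return null; // identical
  const from = Math.max(0, i - context);
  return (
    "first difference at index " + i +
    "\n  actual:   " + JSON.stringify(actual.slice(from, i + context)) +
    "\n  expected: " + JSON.stringify(expected.slice(from, i + context))
  );
}

// Example: a doubled space sneaks into one side.
const diffReport = firstDifference("<p>March 30th</p>", "<p>March  30th</p>");
console.log(diffReport);
```

In stage 7 you would call it as `firstDifference(acceptedHtml, expectedAcceptedHtml)` before asserting.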
|
|
346
|
+
|
|
347
|
+
### 6. `readHtml` returns a tree with no CSS rules even though my HTML has a `<style>` block
|
|
348
|
+
|
|
349
|
+
**Symptom.** `(tree.data as any)?.css` is `undefined` after calling `readHtml(htmlWithStyle)`.
|
|
350
|
+
|
|
351
|
+
**Cause.** One of:
|
|
352
|
+
|
|
353
|
+
- Your `<style>` element has no class selectors or element selectors — e.g. it only has `@media` queries, pseudo-classes (`:hover`), attribute selectors, or `@keyframes`. The CSS parser only extracts simple class (`.foo`) and element (`h1`) selectors; everything else is silently skipped. This is by design — OOXML has no way to express `:hover` anyway.
|
|
354
|
+
- You're on an older version of `@otomate/html` that doesn't auto-extract inline `<style>` blocks. Before version 0.2, you had to pass the CSS as a string via `readHtml(html, { css: "..." })`. Current versions auto-extract.
|
|
355
|
+
|
|
356
|
+
**Fix.** Check the selectors you're using. If they're all simple class or element selectors, update to a version with auto-extraction. If you need `:hover`-style selectors for some reason, strip them to plain `.class` selectors before passing to `readHtml`.
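
To check a stylesheet's selectors before handing it to `readHtml`, a filter along these lines may help. It is a sketch of the rule as described above (a lone class or a lone element name), not the library's parser, and `isSimpleSelector` is a hypothetical name; compound forms such as `p.note` are rejected here, which may be stricter than what the library does.

```typescript
// Hypothetical filter mirroring the rule described above; not the library's parser.
// Accepts a single class selector (.foo) or a single element selector (h1).
function isSimpleSelector(selector: string): boolean {
  return /^(\.[A-Za-z_][\w-]*|[A-Za-z][A-Za-z0-9]*)$/.test(selector.trim());
}

const candidateSelectors = [".title", "h1", "a:hover", "[data-x]", ".a .b", "p.note"];
const keptSelectors: string[] = [];
const skippedSelectors: string[] = [];
for (const s of candidateSelectors) {
  (isSimpleSelector(s) ? keptSelectors : skippedSelectors).push(s);
}
console.log({ keptSelectors, skippedSelectors });
```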
|
|
357
|
+
|
|
358
|
+
### 7. Custom CSS class names produce weird-looking Word styles
|
|
359
|
+
|
|
360
|
+
**Symptom.** A class named `"my-fancy class!"` shows up as `"my-fancyclass"` in the generated Word document, and Word style IDs truncate at 31 characters.
|
|
361
|
+
|
|
362
|
+
**Cause.** Style IDs in OOXML are restricted to `[A-Za-z0-9_\-:]` and a maximum of 31 characters per ECMA-376 §17.7.4.9. The library sanitizes CSS class names (via `sanitizeStyleId`) when mapping them to Word style IDs: disallowed characters are stripped, and names longer than 31 characters are truncated.
|
|
363
|
+
|
|
364
|
+
**Fix.** If you want 1:1 fidelity between CSS class names and Word style IDs, keep your class names alphanumeric (plus `_`, `-`, `:`) and ≤ 31 characters. If you can't, the sanitized name is what you'll see — but the styling will still apply, because the library generates a unique style element per sanitized ID.
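
To predict what a class name will turn into, the rule described above can be approximated in a few lines. This is a sketch of the stated rule (strip anything outside `[A-Za-z0-9_\-:]`, then cut at 31 characters); the library's `sanitizeStyleId` may differ in detail, for instance in how it resolves two classes that sanitize to the same ID.

```typescript
// Sketch of the sanitizing rule as described above; the library's
// sanitizeStyleId may differ in detail.
function sanitizeStyleIdSketch(className: string): string {
  return className.replace(/[^A-Za-z0-9_\-:]/g, "").slice(0, 31);
}

const shortId = sanitizeStyleIdSketch("callout box #1");
const longId = sanitizeStyleIdSketch("averylongclassnamethatkeepsgoingandgoingforever");
console.log(shortId, longId.length);
```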
|
|
365
|
+
|
|
366
|
+
### 8. `await` was skipped and `writeFileSync` wrote a `Promise` literal
|
|
367
|
+
|
|
368
|
+
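**Symptom.** Your `.docx` file contains the literal text `[object Promise]` instead of actual bytes, or, on current Node versions, `writeFileSync` throws `ERR_INVALID_ARG_TYPE` because it received a `Promise`.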
|
|
369
|
+
|
|
370
|
+
**Cause.** You forgot to `await` a call to `writeDocx`, `writeDiffDocx`, or `readDocx`. Those three functions are async. `readHtml`, `writeHtml`, `renderDiffHtml`, and `diff` are synchronous. **Always check the return type.**
|
|
371
|
+
|
|
372
|
+
**Fix.**
|
|
373
|
+
|
|
374
|
+
```typescript
|
|
375
|
+
// Wrong
|
|
376
|
+
writeFileSync("out.docx", writeDocx(tree));
|
|
377
|
+
|
|
378
|
+
// Right
|
|
379
|
+
writeFileSync("out.docx", await writeDocx(tree));
|
|
380
|
+
```
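
If you would rather fail with a clear message than debug a corrupt file, a guard along these lines can sit in front of every write. `assertBytes` is a hypothetical helper, and the sketch assumes the async writers resolve to a `Uint8Array` or `ArrayBuffer`; a stand-in async function is used so the example runs without the library.

```typescript
// Hypothetical guard: fail loudly if a Promise (or anything else) sneaks in
// where bytes are expected.
function assertBytes(value: unknown, label: string): Uint8Array {
  if (value instanceof Uint8Array) return value; // Node's Buffer passes too
  if (value instanceof ArrayBuffer) return new Uint8Array(value);
  const isThenable =
    typeof value === "object" &&
    value !== null &&
    typeof (value as { then?: unknown }).then === "function";
  throw new TypeError(
    label + (isThenable ? " is a Promise; did you forget to await?" : " is not a byte buffer"),
  );
}

// Stand-in for an async writer such as writeDocx (this one just resolves to 2 bytes).
function fakeWriteDocx(): Promise<Uint8Array> {
  return Promise.resolve(new Uint8Array([0x50, 0x4b]));
}

let guardMessage = "";
try {
  assertBytes(fakeWriteDocx(), "docx buffer"); // missing await
} catch (err) {
  guardMessage = (err as Error).message;
}
console.log(guardMessage);
```

In real code the call would look like `writeFileSync("out.docx", assertBytes(await writeDocx(tree), "docx buffer"))`.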
|
|
381
|
+
|
|
382
|
+
### 9. `readDocx(filePath)` throws `buffer.slice is not a function`
|
|
383
|
+
|
|
384
|
+
**Symptom.** You passed a string file path to `readDocx` and got a runtime error about `.slice` or `.byteLength`.
|
|
385
|
+
|
|
386
|
+
**Cause.** `readDocx` expects an `ArrayBuffer` or `Uint8Array`, **not** a filesystem path. It has no filesystem access — it's a pure buffer parser so it works the same in Node and the browser.
|
|
387
|
+
|
|
388
|
+
**Fix.**
|
|
389
|
+
|
|
390
|
+
```typescript
|
|
391
|
+
import { readFileSync } from "node:fs";
|
|
392
|
+
import { readDocx } from "otomate";
|
|
393
|
+
|
|
394
|
+
const buf = readFileSync("input.docx"); // Buffer (subclass of Uint8Array)
|
|
395
|
+
const tree = await readDocx(buf);
|
|
396
|
+
```
|
|
397
|
+
|
|
398
|
+
### 10. Tests pass via `tsx` but `tsc --noEmit` fails with `Cannot find name 'Buffer'` or `Cannot find module 'node:crypto'`
|
|
399
|
+
|
|
400
|
+
**Symptom.** Running the package's test suite works fine (because it uses `tsx`), but running `tsc --noEmit` over a file that imports from `@otomate/docx` complains about missing Node globals.
|
|
401
|
+
|
|
402
|
+
**Cause.** The `@otomate/docx` package uses `node:crypto` and `Buffer` for the snapshot hash and base64 fallback. These types come from `@types/node`, which `tsc` only loads if your `tsconfig.json` includes `"types": ["node"]` in `compilerOptions`, or your code does a triple-slash reference. `tsx` transpiles without type-checking, so the missing declarations never surface when the tests run.
|
|
403
|
+
|
|
404
|
+
**Fix.** In your consuming project's `tsconfig.json`:
|
|
405
|
+
|
|
406
|
+
```json
|
|
407
|
+
{
|
|
408
|
+
"compilerOptions": {
|
|
409
|
+
"types": ["node"]
|
|
410
|
+
}
|
|
411
|
+
}
|
|
412
|
+
```
|
|
413
|
+
|
|
414
|
+
Also install `@types/node` as a dev dependency.
|
|
415
|
+
|
|
416
|
+
### 11. Hyperlinks I added programmatically collide with existing hyperlinks on round-trip
|
|
417
|
+
|
|
418
|
+
**Symptom.** You read a `.docx`, added hyperlinks to the tree, and wrote it back — now some hyperlinks point to the wrong URL.
|
|
419
|
+
|
|
420
|
+
**Cause.** OOXML hyperlinks are keyed by `r:id` into `word/_rels/document.xml.rels`. If the input file already had hyperlinks using `rId100`, `rId101`, etc., the writer needs to allocate fresh rIds that don't collide. The library handles this automatically via `nextRIdFor` which scans the existing rels for the highest used rId and seeds past it — **but only if `tree.data.docx.relationships` is preserved on the input tree**. If you built the tree from scratch or stripped `data.docx`, you might clash.
|
|
421
|
+
|
|
422
|
+
**Fix.** When round-tripping, don't strip `tree.data.docx`. When building from scratch, don't worry — there are no pre-existing rIds to collide with.
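
The "scan for the highest used rId and seed past it" idea can be sketched as follows. This is an illustration of the approach only: `nextFreeRId` is a hypothetical function, and the real `nextRIdFor` may read different data or handle non-numeric ids differently.

```typescript
// Sketch only: the real nextRIdFor in otomate may scan different data.
// Given the relationship ids already present, return the next free "rIdN".
function nextFreeRId(existingIds: string[]): string {
  let highest = 0;
  for (const id of existingIds) {
    const m = /^rId(\d+)$/.exec(id);
    // Ids that do not follow the rId<number> pattern cannot collide, so skip them.
    if (m) highest = Math.max(highest, parseInt(m[1], 10));
  }
  return "rId" + (highest + 1);
}

const afterExisting = nextFreeRId(["rId1", "rId100", "rId101", "customRel"]);
const fromScratch = nextFreeRId([]);
console.log(afterExisting, fromScratch);
```

This also shows why stripping the relationship data is risky: with an empty list the allocator starts again at the bottom, where the input file's ids already live.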
|
|
423
|
+
|
|
424
|
+
### 12. Numbered lists in the output all continue from the previous list's counter
|
|
425
|
+
|
|
426
|
+
**Symptom.** You have two `<ol>` elements and the second one starts at "4" instead of "1".
|
|
427
|
+
|
|
428
|
+
**Cause.** Earlier versions of the library shared `numId="2"` across all ordered lists, so Word treated them as one continuous sequence. Current versions allocate a fresh `w:numId` per top-level list (via `allocNumId`) and inject matching `<w:num>` entries into `numbering.xml`, so each list gets its own counter.
|
|
429
|
+
|
|
430
|
+
**Fix.** Update to the latest version. If you're stuck on an old one, insert an explicit `start: 1` into the list node.
|
|
431
|
+
|
|
432
|
+
### 13. Diff engine drops changes deep inside a large subtree
|
|
433
|
+
|
|
434
|
+
**Symptom.** Editing a few words inside a paragraph nested six levels deep shows no change in the diff result.
|
|
435
|
+
|
|
436
|
+
**Cause.** The diff engine uses a Dice-coefficient threshold (default `0.5`) for bottom-up matching — subtrees that are more than 50% similar are considered "the same" and their differences are merged into a single `updateText` operation rather than generating insert/delete pairs. If the similarity falls just above the threshold, small edits can get lost in the noise.
|
|
437
|
+
|
|
438
|
+
**Fix.** Pass `diff(oldTree, newTree, { diceThreshold: 0.3 })` to make the matcher more sensitive. Lower values catch more fine-grained changes at the cost of more total operations.
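
For intuition about what the threshold is compared against, here is the textbook Sørensen-Dice coefficient computed over word sets. What the library actually compares (words, characters, or node hashes) is not stated in this guide, so treat `diceCoefficient` and `uniqueWords` as hypothetical helpers that illustrate the formula, not the engine.

```typescript
// Textbook Sørensen-Dice coefficient over word sets: 2 * |A ∩ B| / (|A| + |B|).
// What otomate actually compares is not specified in this guide.
function uniqueWords(text: string): string[] {
  const out: string[] = [];
  for (const w of text.toLowerCase().split(/\s+/)) {
    if (w !== "" && out.indexOf(w) === -1) out.push(w);
  }
  return out;
}

function diceCoefficient(a: string, b: string): number {
  const wordsA = uniqueWords(a);
  const wordsB = uniqueWords(b);
  if (wordsA.length === 0 && wordsB.length === 0) return 1; // two empty texts are identical
  let shared = 0;
  for (const w of wordsA) {
    if (wordsB.indexOf(w) !== -1) shared++;
  }
  return (2 * shared) / (wordsA.length + wordsB.length);
}

const textBefore = "The migration must complete before March 15th";
const textAfter = "The migration must complete before March 30th";
const similarity = diceCoefficient(textBefore, textAfter);
console.log(similarity.toFixed(3));
```

Six of the seven words on each side are shared, so the score is 12/14, well above either `0.5` or `0.3`.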
|
|
439
|
+
|
|
440
|
+
### 14. Running the e2e test locally — stale `dist/` trap
|
|
441
|
+
|
|
442
|
+
**Symptom.** The test imports from `otomate` (the umbrella) and fails with an assertion that looks wrong, for example "inline `<style>` should have been extracted into data.css", even though the source code clearly does the extraction.
|
|
443
|
+
|
|
444
|
+
**Cause.** The umbrella package imports from `@otomate/html`, `@otomate/docx`, etc. via their **built `dist/` directories**, not their source. If you made changes to a sub-package's source but haven't rebuilt it, the umbrella will still use the stale dist.
|
|
445
|
+
|
|
446
|
+
**Fix.** Before running tests that exercise the umbrella, rebuild the sub-packages:
|
|
447
|
+
|
|
448
|
+
```bash
|
|
449
|
+
pnpm --filter @otomate/core --filter @otomate/diff --filter @otomate/css-docx \
|
|
450
|
+
--filter @otomate/inject --filter @otomate/html --filter @otomate/docx \
|
|
451
|
+
build
|
|
452
|
+
```
|
|
453
|
+
|
|
454
|
+
Or use `tsx` to import directly from the sub-package sources if you're iterating rapidly.
|
|
455
|
+
|
|
456
|
+
---
|
|
457
|
+
|
|
458
|
+
## Running the test that backs this guide
|
|
459
|
+
|
|
460
|
+
The canonical reference implementation of this flow is a real test file:
|
|
461
|
+
|
|
462
|
+
```bash
|
|
463
|
+
cd packages/otomate
|
|
464
|
+
pnpm test # runs every test including e2e-tracked-changes.test.ts
|
|
465
|
+
```
|
|
466
|
+
|
|
467
|
+
Or to run just the e2e test:
|
|
468
|
+
|
|
469
|
+
```bash
|
|
470
|
+
cd packages/otomate
|
|
471
|
+
node --test --import tsx src/__tests__/e2e-tracked-changes.test.ts
|
|
472
|
+
```
|
|
473
|
+
|
|
474
|
+
The test exercises every stage in this guide with explicit assertions, so if
|
|
475
|
+
the library regresses on any part of the pipeline, that test will fail with
|
|
476
|
+
a message like `[stage 4] document.xml is missing any <w:del> elements` —
|
|
477
|
+
pinpointing exactly where the breakage is.
|
|
478
|
+
|
|
479
|
+
## See also
|
|
480
|
+
|
|
481
|
+
- **`SKILL.md`** (next to this file) — condensed entry-point reference and pattern recipes
|
|
482
|
+
- **`README.md`** (repo root) — architecture overview and diff algorithm details
|
|
483
|
+
- **`packages/docx/src/__tests__/realworld.test.ts`** — end-to-end tests against a dozen real-world HTML fixtures
|
|
484
|
+
- **ECMA-376 §17.13.5** — the OOXML tracked-changes schema (`<w:ins>`, `<w:del>`, `<w:moveFrom>`, etc.) if you want to understand the output format at the byte level
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "otomate",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.3.0",
|
|
4
4
|
"description": "Universal document diffing library — structure-aware, string-level, multi-format",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"main": "./dist/otomate.umd.cjs",
|
|
@@ -15,11 +15,14 @@
|
|
|
15
15
|
},
|
|
16
16
|
"files": [
|
|
17
17
|
"dist",
|
|
18
|
-
"README.md"
|
|
18
|
+
"README.md",
|
|
19
|
+
"SKILL.md",
|
|
20
|
+
"guides"
|
|
19
21
|
],
|
|
20
22
|
"scripts": {
|
|
21
23
|
"build": "vite build && tsc --emitDeclarationOnly",
|
|
22
|
-
"typecheck": "tsc --noEmit"
|
|
24
|
+
"typecheck": "tsc --noEmit",
|
|
25
|
+
"test": "node --test --import tsx src/__tests__/*.test.ts"
|
|
23
26
|
},
|
|
24
27
|
"devDependencies": {
|
|
25
28
|
"@otomate/core": "workspace:*",
|
|
@@ -28,6 +31,8 @@
|
|
|
28
31
|
"@otomate/docx": "workspace:*",
|
|
29
32
|
"@otomate/css-docx": "workspace:*",
|
|
30
33
|
"@otomate/inject": "workspace:*",
|
|
34
|
+
"@types/node": "^25.5.2",
|
|
35
|
+
"tsx": "^4.19.0",
|
|
31
36
|
"typescript": "^5.7.0",
|
|
32
37
|
"vite": "^6.0.0"
|
|
33
38
|
},
|