otomate 0.0.8 → 0.0.10

package/README.md ADDED
@@ -0,0 +1,298 @@
1
+ # Otomate
2
+
3
+ **Universal document diffing library** — structure-aware, string-level, multi-format.
4
+
5
+ otomate computes differences between documents and combines three properties that existing libraries rarely offer together:
6
+
7
+ 1. **Structure tracking** — tree-based diffing that understands document hierarchy, not just flat text
8
+ 2. **String tracking** — character and word-level granularity within text nodes via Myers algorithm
9
+ 3. **Lossless multi-format conversion** — HTML, docx, JSON with round-trip fidelity
10
+
11
+ ## Architecture
12
+
13
+ ```
14
+ HTML ──→ ┌─────────────────────┐ ──→ HTML
15
+          │                     │
16
+ docx ──→ │ Universal Document  │ ──→ docx
17
+          │     Model (UDM)     │
18
+ CSS  ──→ │                     │ ──→ JSON
19
+          └─────────┬───────────┘
20
+
21
+               ┌────▼────┐
22
+               │  Diff   │  GumTree-inspired
23
+               │ Engine  │  3-phase algorithm
24
+               └────┬────┘
25
+
26
+      DiffResult (typed operations)
27
+ ```
28
+
29
+ ### Universal Document Model (UDM)
30
+
31
+ A JSON-serializable AST that any document format converts to/from.
32
+
33
+ - **ProseMirror-style marks** — `Text("hello", marks: [{type: "strong"}])` instead of nested `Strong > Text("hello")`. Changing bold→italic is one mark update, not a delete+insert+reparent.
34
+ - **`classes` as first-class field** — HTML classes, docx styles, and Markdown attributes all map to `classes: string[]` on any node. Classes affect the content hash and show up in diffs.
35
+ - **Format-specific metadata** — `data.html`, `data.docx` namespaces preserve round-trip fidelity per format.
36
+ - **Lossless docx round-trip** — embedded `otomate-udm.json` inside the docx ZIP enables perfect reconstruction when both writer and reader are otomate.
37
+
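To make the marks model concrete, here is a minimal sketch of a text node carrying marks, with a bold-to-italic change expressed as a single mark swap. The `Mark` and `TextNode` shapes and the `swapMark` helper are assumptions for illustration only; the actual `@otomate/core` types may look different.

```typescript
// Hypothetical shapes for illustration; not the real @otomate/core types.
type Mark = { type: string };
type TextNode = { type: "text"; value: string; marks: Mark[]; classes: string[] };

// Replace one mark type with another without touching the text or the tree shape.
function swapMark(node: TextNode, from: string, to: string): TextNode {
  return {
    ...node,
    marks: node.marks.map((m) => (m.type === from ? { ...m, type: to } : m)),
  };
}

const bold: TextNode = { type: "text", value: "hello", marks: [{ type: "strong" }], classes: [] };
const italic = swapMark(bold, "strong", "emphasis");

console.log(JSON.stringify(italic.marks)); // [{"type":"emphasis"}]
```

With a nested `Strong > Text` representation the same change would remove one wrapper node, add another, and re-parent the text; with marks, the text node keeps its identity and position.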
38
+ ### Diff Algorithm
39
+
40
+ Three-phase GumTree-inspired approach:
41
+
42
+ | Phase | Algorithm | What it catches |
43
+ |-------|-----------|----------------|
44
+ | 1. Top-down | FNV-1a content hash matching | Identical subtrees (unchanged content) |
45
+ | 2. Bottom-up | Dice coefficient + inner matching | Modified nodes, moves, reorders |
46
+ | 3. Text diff | Myers O(ND) | Word/character-level changes within text nodes |
47
+
48
+ Output: serializable `DiffResult` with typed operations — `insert`, `delete`, `move`, `update`, `updateText`.
49
+
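The first two phases can be illustrated with a small, self-contained sketch: a 32-bit FNV-1a content hash over a subtree, and a Dice coefficient computed over descendant hashes. The `TreeNode` shape, the layout of the hash key, and the use of descendant hashes as the similarity criterion are assumptions made for this example; otomate's own hashing and matching may differ.

```typescript
// Hypothetical minimal node shape for illustration; not the real UDM type.
type TreeNode = { type: string; value?: string; classes?: string[]; children?: TreeNode[] };

// 32-bit FNV-1a over the UTF-8 bytes of a string.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  new TextEncoder().encode(input).forEach((byte) => {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  });
  return hash >>> 0;
}

// Content hash of a subtree: node type, classes, text value, then child hashes in order.
function hashNode(node: TreeNode): number {
  const childHashes = (node.children ?? []).map((child) => hashNode(child));
  const key = [node.type, (node.classes ?? []).join(" "), node.value ?? "", ...childHashes].join("|");
  return fnv1a32(key);
}

// Hashes of all descendants, excluding the node itself.
function descendantHashes(node: TreeNode, out: number[] = []): number[] {
  for (const child of node.children ?? []) {
    out.push(hashNode(child));
    descendantHashes(child, out);
  }
  return out;
}

// Dice coefficient over descendant hashes: 2 * |common| / (|a| + |b|).
function dice(a: TreeNode, b: TreeNode): number {
  const da = descendantHashes(a);
  const db = descendantHashes(b);
  if (da.length + db.length === 0) return 0; // two leaves give no evidence either way
  const remaining = new Map<number, number>();
  db.forEach((h) => remaining.set(h, (remaining.get(h) ?? 0) + 1));
  let common = 0;
  da.forEach((h) => {
    const count = remaining.get(h) ?? 0;
    if (count > 0) {
      common++;
      remaining.set(h, count - 1);
    }
  });
  return (2 * common) / (da.length + db.length);
}

// Helper for building a paragraph of plain text nodes.
const p = (...words: string[]): TreeNode => ({
  type: "paragraph",
  children: words.map((value) => ({ type: "text", value })),
});

const before = p("a", "b", "c");
const after = p("a", "b", "d");
console.log(hashNode(before) === hashNode(p("a", "b", "c"))); // true: identical subtrees hash equally
console.log(dice(before, after)); // 2 of 3 children shared on each side, so 4 / 6
```

Equal hashes let phase 1 pair unchanged subtrees cheaply; a high Dice score lets phase 2 pair containers that were edited but still share most of their content.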
50
+ ## Packages
51
+
52
+ | Package | Description |
53
+ |---------|-------------|
54
+ | `@otomate/core` | UDM types, node builders, traversal, FNV-1a hashing |
55
+ | `@otomate/diff` | Tree diff engine + Myers string-level diffing |
56
+ | `@otomate/html` | HTML ↔ UDM adapter (via hast) |
57
+ | `@otomate/docx` | docx ↔ UDM adapter (via jszip + fast-xml-parser) |
58
+ | `@otomate/css-docx` | CSS properties → OOXML style mapping |
59
+ | `@otomate/ui` | Web UI for testing conversions and diffs |
60
+
61
+ ## Installation
62
+
63
+ ### npm / pnpm (Node.js or bundler)
64
+
65
+ ```bash
66
+ npm install otomate
67
+ ```
68
+
69
+ ```typescript
70
+ import { readHtml, writeDocx, diff } from "otomate";
71
+ ```
72
+
73
+ ### CDN (browser, no build step)
74
+
75
+ ```html
76
+ <script src="https://cdn.jsdelivr.net/npm/otomate/dist/otomate.umd.cjs"></script>
77
+ <script>
78
+   const tree = otomate.readHtml("<p>Hello <strong>world</strong></p>");
79
+   const html = otomate.writeHtml(tree);
80
+   console.log(tree); // UDM tree
81
+ </script>
82
+ ```
83
+
84
+ Or with ES modules:
85
+
86
+ ```html
87
+ <script type="module">
88
+   import { readHtml, writeDocx, diff } from "https://cdn.jsdelivr.net/npm/otomate/dist/otomate.js";
89
+
90
+   const tree = readHtml("<h1>Hello</h1><p>World</p>");
91
+   const docxBuffer = await writeDocx(tree);
92
+ </script>
93
+ ```
94
+
95
+ ### Individual packages (tree-shakeable)
96
+
97
+ ```bash
98
+ npm install @otomate/core @otomate/diff @otomate/html @otomate/docx
99
+ ```
100
+
101
+ ## Quick Start
102
+
103
+ ### Parse HTML to UDM
104
+
105
+ ```typescript
106
+ import { readHtml, writeHtml } from "@otomate/html";
107
+
108
+ const tree = readHtml(`
109
+ <h1>Hello World</h1>
110
+ <p>This is <strong>bold</strong> and <em>italic</em> text.</p>
111
+ `);
112
+
113
+ // tree is a UDM Root node:
114
+ // root
115
+ //   heading depth=1
116
+ //     text "Hello World"
117
+ //   paragraph
118
+ //     text "This is "
119
+ //     text "bold" [strong]
120
+ //     text " and "
121
+ //     text "italic" [emphasis]
122
+ //     text " text."
123
+ ```
124
+
125
+ ### Parse with CSS styling
126
+
127
+ ```typescript
128
+ const tree = readHtml(html, {
129
+   css: `
130
+     h1 { font-size: 28pt; color: #1e3a5f; }
131
+     .intro { font-style: italic; color: #374151; }
132
+     .highlighted { background-color: #dbeafe; }
133
+   `
134
+ });
135
+ // CSS rules stored on tree.data.css — used by docx writer for Word styles
136
+ ```
137
+
138
+ ### Convert to docx
139
+
140
+ ```typescript
141
+ import * as fs from "node:fs";
+ import { writeDocx } from "@otomate/docx";
142
+
143
+ const buffer = await writeDocx(tree);
144
+ // buffer is a valid .docx file with:
145
+ // - CSS-derived Word styles (fonts, colors, sizes, backgrounds)
146
+ // - Embedded otomate-udm.json for lossless round-trip
147
+ fs.writeFileSync("output.docx", buffer);
148
+ ```
149
+
150
+ ### Lossless round-trip
151
+
152
+ ```typescript
153
+ import { readDocx } from "@otomate/docx";
154
+
155
+ const tree2 = await readDocx(buffer);
156
+ // tree2 is identical to tree — all classes, hierarchy, marks preserved
157
+ // via embedded otomate-udm.json (invisible to Word, used by otomate)
158
+ ```
159
+
160
+ ### Diff two documents
161
+
162
+ ```typescript
163
+ import { diff } from "@otomate/diff";
164
+
165
+ const result = diff(oldTree, newTree);
166
+ // result.operations: [
167
+ //   { type: "updateText", path: [0,0], changes: [
167
+ //     { type: "equal", value: "Hello " },
168
+ //     { type: "delete", value: "World" },
169
+ //     { type: "insert", value: "Everyone" }
170
+ //   ]},
171
+ //   { type: "insert", path: [2], node: { type: "paragraph", ... } },
172
+ //   { type: "move", from: [1], to: [3] }
174
+ // ]
175
+ ```
176
+
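The `changes` array of the `updateText` operation above can be reproduced with a small word-level diff. The sketch below is an assumption-laden illustration: it uses a plain longest-common-subsequence table rather than the Myers O(ND) algorithm that otomate names, and `tokenize`, `diffWords`, and `applySide` are hypothetical helpers, not part of `@otomate/diff`.

```typescript
// Change shape copied from the example output above.
type TextChange = { type: "equal" | "delete" | "insert"; value: string };

// Split into runs of whitespace and non-whitespace so that joining the tokens restores the input.
function tokenize(text: string): string[] {
  return text.match(/\s+|\S+/g) ?? [];
}

// Word-level diff using an LCS table (a simpler stand-in for Myers).
function diffWords(oldStr: string, newStr: string): TextChange[] {
  const a = tokenize(oldStr);
  const b = tokenize(newStr);
  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const out: TextChange[] = [];
  const push = (type: TextChange["type"], value: string): void => {
    const last = out[out.length - 1];
    if (last && last.type === type) last.value += value; // merge adjacent runs of the same type
    else out.push({ type, value });
  };
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      push("equal", a[i]);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      push("delete", a[i]);
      i++;
    } else {
      push("insert", b[j]);
      j++;
    }
  }
  while (i < a.length) push("delete", a[i++]);
  while (j < b.length) push("insert", b[j++]);
  return out;
}

// Rebuild either side of the diff from the change list.
function applySide(changes: TextChange[], side: "old" | "new"): string {
  const skip = side === "old" ? "insert" : "delete";
  return changes.filter((c) => c.type !== skip).map((c) => c.value).join("");
}

const changes = diffWords("Hello World", "Hello Everyone");
console.log(JSON.stringify(changes));
console.log(applySide(changes, "old")); // Hello World
console.log(applySide(changes, "new")); // Hello Everyone
```

Because every token ends up in exactly one change, dropping the `insert` entries yields the old text and dropping the `delete` entries yields the new text, which is a handy sanity check for any consumer of `updateText` operations.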
177
+ ### Cross-format diff
178
+
179
+ ```typescript
180
+ // Diff HTML against a docx — both parsed to UDM first
181
+ const htmlTree = readHtml(htmlString);
182
+ const docxTree = await readDocx(docxBuffer);
183
+ const result = diff(htmlTree, docxTree);
184
+ // Structural comparison regardless of source format
185
+ ```
186
+
187
+ ## Web UI
188
+
189
+ A built-in test interface for interactive conversion and diffing:
190
+
191
+ ```bash
192
+ cd packages/ui
193
+ pnpm dev
194
+ # Opens at http://localhost:5555
195
+ ```
196
+
197
+ **Four panels:**
198
+ - **Input** — HTML editor or docx file upload
199
+ - **CSS Stylesheet** — element and class rules (applied to preview and docx export)
200
+ - **UDM Tree** — live parsed document structure
201
+ - **Output** — HTML preview, HTML source, or UDM JSON
202
+
203
+ **Features:**
204
+ - Live CSS → preview (edit a rule, see it instantly)
205
+ - Export .docx with CSS-derived Word styles
206
+ - Re-import .docx with lossless round-trip
207
+ - Diff two documents with typed operation view
208
+
209
+ ## Node Types
210
+
211
+ ### Block nodes
212
+
213
+ | UDM Type | HTML | docx | Description |
214
+ |----------|------|------|-------------|
215
+ | `root` | `<body>` | document | Document root |
216
+ | `paragraph` | `<p>` | `w:p` | Paragraph |
217
+ | `heading` | `<h1>`–`<h6>` | `w:pStyle Heading1-6` | Heading (depth 1-6) |
218
+ | `blockquote` | `<blockquote>` | indented `w:p` + left border | Block quote |
219
+ | `list` | `<ul>`/`<ol>` | `w:numPr` | List (ordered/unordered) |
220
+ | `listItem` | `<li>` | list paragraph | List item |
221
+ | `codeBlock` | `<pre><code>` | `w:pStyle Code` | Code block with optional lang |
222
+ | `table` | `<table>` | `w:tbl` | Table |
223
+ | `tableRow` | `<tr>` | `w:tr` | Table row |
224
+ | `tableCell` | `<td>`/`<th>` | `w:tc` | Table cell |
225
+ | `thematicBreak` | `<hr>` | bottom border | Horizontal rule |
226
+ | `div` | `<div>` | transparent container | Generic container |
227
+ | `image` | `<img>` | `w:drawing` | Image |
228
+
229
+ ### Inline marks
230
+
231
+ | Mark | HTML | docx | Description |
232
+ |------|------|------|-------------|
233
+ | `strong` | `<strong>`/`<b>` | `w:b` | Bold |
234
+ | `emphasis` | `<em>`/`<i>` | `w:i` | Italic |
235
+ | `underline` | `<u>` | `w:u` | Underline |
236
+ | `strikethrough` | `<del>`/`<s>` | `w:strike` | Strikethrough |
237
+ | `superscript` | `<sup>` | `w:vertAlign superscript` | Superscript |
238
+ | `subscript` | `<sub>` | `w:vertAlign subscript` | Subscript |
239
+ | `code` | `<code>` | monospace font | Inline code |
240
+ | `link` | `<a>` | `w:hyperlink` | Hyperlink (with URL) |
241
+ | `highlight` | `<mark>` | `w:highlight` | Highlighted text |
242
+
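As an illustration of how the marks in this table could be written back to HTML, the sketch below wraps a text value in one element per mark. The `Mark` shape, the `href` attribute on link marks, and the `renderText` helper are assumptions for this example; the tag names are the first HTML form listed for each mark in the table, and the actual `@otomate/html` writer goes through hast and may behave differently.

```typescript
// Hypothetical mark shape; `href` on link marks is an assumed attribute name.
type Mark = { type: string; href?: string };

// Tag names taken from the first HTML form listed for each mark in the table above.
const MARK_TAGS: Record<string, string> = {
  strong: "strong",
  emphasis: "em",
  underline: "u",
  strikethrough: "del",
  superscript: "sup",
  subscript: "sub",
  code: "code",
  highlight: "mark",
};

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap the text in one element per mark; the first mark becomes the outermost element.
function renderText(value: string, marks: Mark[]): string {
  return marks.reduceRight((inner, mark) => {
    if (mark.type === "link") return `<a href="${escapeHtml(mark.href ?? "")}">${inner}</a>`;
    const tag = MARK_TAGS[mark.type];
    return tag ? `<${tag}>${inner}</${tag}>` : inner; // unknown marks are dropped in this sketch
  }, escapeHtml(value));
}

console.log(renderText("bold", [{ type: "strong" }])); // <strong>bold</strong>
console.log(renderText("a < b", [{ type: "emphasis" }, { type: "code" }])); // <em><code>a &lt; b</code></em>
```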
243
+ ## CSS → docx Style Mapping
244
+
245
+ When CSS rules are provided, otomate generates real Word styles with OOXML formatting:
246
+
247
+ | CSS Property | OOXML | Notes |
248
+ |---|---|---|
249
+ | `font-family` | `w:rFonts` | First font in stack |
250
+ | `font-size` | `w:sz` | Converted to half-points |
251
+ | `font-weight: bold` | `w:b` | ≥700 or "bold" |
252
+ | `font-style: italic` | `w:i` | |
253
+ | `color` | `w:color` | Hex, rgb(), named colors |
254
+ | `background-color` | `w:shd` (paragraph) + `w:shd` (run) | |
255
+ | `text-decoration: underline` | `w:u` | |
256
+ | `text-decoration: line-through` | `w:strike` | |
257
+ | `text-align` | `w:jc` | left, center, right, justify→both |
258
+ | `margin-top/bottom` | `w:spacing before/after` | Converted to twips |
259
+ | `margin-left` | `w:ind left` | Converted to twips |
260
+ | `line-height` | `w:spacing line` | |
261
+
262
+ Container CSS (on `div` elements) cascades to all child paragraphs, headings, and list items.
263
+
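The unit conversions named in the table follow from standard definitions: `w:sz` is measured in half-points, a twip is 1/20 of a point, and CSS defines 96 px and 72 pt per inch. A minimal sketch, assuming only `pt`, `px`, and `in` inputs; the helper names are hypothetical and not part of `@otomate/css-docx`:

```typescript
// Convert a CSS length to points. Only pt, px, and in are handled in this sketch.
function cssLengthToPoints(value: string): number {
  const match = /^(-?\d*\.?\d+)(pt|px|in)$/.exec(value.trim());
  if (!match) throw new Error(`Unsupported CSS length: ${value}`);
  const amount = parseFloat(match[1]);
  switch (match[2]) {
    case "pt":
      return amount;
    case "px":
      return amount * 0.75; // 96 px per inch, 72 pt per inch
    default:
      return amount * 72; // "in"
  }
}

// font-size -> w:sz (half-points)
const toHalfPoints = (css: string): number => Math.round(cssLengthToPoints(css) * 2);

// margins -> w:spacing / w:ind (twips, 1/20 of a point)
const toTwips = (css: string): number => Math.round(cssLengthToPoints(css) * 20);

// text-align -> w:jc, with justify mapped to "both" as listed in the table.
const TEXT_ALIGN_TO_JC: Record<string, string> = {
  left: "left",
  center: "center",
  right: "right",
  justify: "both",
};

console.log(toHalfPoints("28pt")); // 56
console.log(toTwips("16px")); // 240
console.log(toTwips("1in")); // 1440
console.log(TEXT_ALIGN_TO_JC["justify"]); // both
```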
264
+ ## Lossless Round-Trip Strategy
265
+
266
+ otomate uses a **two-tier** system for format conversion:
267
+
268
+ **Tier 1 — Universal structure** (diffable across formats):
269
+ Text, paragraphs, headings, lists, tables, images, bold, italic, links, classes.
270
+
271
+ **Tier 2 — Format-specific metadata** (preserved per format):
272
+ - `data.html` — id, data-* attributes, style, aria-*, ol type
273
+ - `data.docx` — spacing, indent, alignment, fonts, colors, raw XML parts
274
+
275
+ **otomate-to-otomate round-trip:**
276
+ The docx writer embeds `word/otomate-udm.json` (the full UDM tree) and `word/otomate-css.json` (CSS rules) inside the ZIP. Word ignores these files. When otomate reads the docx back, it finds the snapshot and reconstructs the tree perfectly — all classes, hierarchy, marks, and CSS rules intact.
277
+
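The reader side of this strategy can be sketched as "prefer the snapshot, otherwise parse the OOXML". In the example below a `Map` stands in for the unzipped docx entries, `parseOoxml` is a placeholder for the real parser, and `word/document.xml` is assumed to be the main document part; none of these names come from `@otomate/docx`.

```typescript
// Stand-in for an unzipped docx: entry path -> file contents. A real reader would use a ZIP library.
type ZipEntries = Map<string, string>;
type UdmRoot = { type: "root"; children: unknown[] };

// Snapshot path as described in the paragraph above.
const SNAPSHOT_PATH = "word/otomate-udm.json";

function readTree(
  entries: ZipEntries,
  parseOoxml: (xml: string) => UdmRoot, // hypothetical fallback parser
): { tree: UdmRoot; lossless: boolean } {
  const snapshot = entries.get(SNAPSHOT_PATH);
  if (snapshot !== undefined) {
    // Written by otomate: restore the exact tree from the embedded JSON.
    return { tree: JSON.parse(snapshot) as UdmRoot, lossless: true };
  }
  const xml = entries.get("word/document.xml");
  if (xml === undefined) throw new Error("Not a docx: word/document.xml is missing");
  // Written by another tool: fall back to structural parsing.
  return { tree: parseOoxml(xml), lossless: false };
}

// Placeholder parser that returns an empty root; a real one would walk the w:p / w:tbl elements.
const emptyParser = (_xml: string): UdmRoot => ({ type: "root", children: [] });

const fromOtomate: ZipEntries = new Map([
  ["word/document.xml", "<w:document/>"],
  [SNAPSHOT_PATH, JSON.stringify({ type: "root", children: [{ type: "paragraph" }] })],
]);
const fromWord: ZipEntries = new Map([["word/document.xml", "<w:document/>"]]);

console.log(readTree(fromOtomate, emptyParser).lossless); // true
console.log(readTree(fromWord, emptyParser).lossless); // false
```

One consequence of this design is that a document edited in Word after export may no longer match its embedded snapshot, so a production reader would need some way to detect a stale snapshot; how otomate handles that case is not stated in this README.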
278
+ ## Development
279
+
280
+ ```bash
281
+ # Install dependencies
282
+ pnpm install
283
+
284
+ # Build all packages
285
+ pnpm -r build
286
+
287
+ # Run tests
288
+ pnpm -r test
289
+
290
+ # Start the UI dev server
291
+ cd packages/ui && pnpm dev
292
+ ```
293
+
294
+ ## License
295
+
296
+ [Business Source License 1.1](LICENSE) — free for non-production use. Production use permitted except for offering otomate as a competitive hosted/embedded service. Converts to MIT on 2030-04-03.
297
+
298
+ All dependencies are MIT or Apache-2.0 licensed.