npm - otomate - Versions diffs - 0.0.8 → 0.0.10 - Mend

otomate 0.0.8 → 0.0.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/README.md ADDED

@@ -0,0 +1,298 @@
+# Otomate
+**Universal document diffing library** — structure-aware, string-level, multi-format.
+otomate evaluates differences across documents with three core properties that existing libraries lack:
+1. **Structure tracking** — tree-based diffing that understands document hierarchy, not just flat text
+2. **String tracking** — character and word-level granularity within text nodes via Myers algorithm
+3. **Lossless multi-format conversion** — HTML, docx, JSON with round-trip fidelity
+## Architecture
+```
+HTML ──→ ┌─────────────────────┐ ──→ HTML
+         │                     │
+docx ──→ │  Universal Document │ ──→ docx
+         │     Model (UDM)     │
+ CSS ──→ │                     │ ──→ JSON
+         └─────────┬───────────┘
+                   │
+              ┌────▼─────┐
+              │  Diff    │  GumTree-inspired
+              │  Engine  │  3-phase algorithm
+              └────┬─────┘
+                   │
+           DiffResult (typed operations)
+```
+### Universal Document Model (UDM)
+A JSON-serializable AST that any document format converts to/from.
+- **ProseMirror-style marks** — `Text("hello", marks: [{type: "strong"}])` instead of nested `Strong > Text("hello")`. Changing bold→italic is one mark update, not a delete+insert+reparent.
+- **`classes` as first-class field** — HTML classes, docx styles, and Markdown attributes all map to `classes: string[]` on any node. Classes affect the content hash and show up in diffs.
+- **Format-specific metadata** — `data.html`, `data.docx` namespaces preserve round-trip fidelity per format.
+- **Lossless docx round-trip** — embedded `otomate-udm.json` inside the docx ZIP enables perfect reconstruction when both writer and reader are otomate.
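To make the mark model concrete, here is a minimal TypeScript sketch of a text node that carries marks. The `Mark` and `TextNode` interfaces and the `swapMark` helper are hypothetical names chosen for illustration. They are not taken from `@otomate/core`, whose actual types may differ.

```typescript
// Hypothetical shapes for illustration; the real @otomate/core types may differ.
interface Mark {
  type: string;
}

interface TextNode {
  type: "text";
  value: string;
  marks: Mark[];
  classes: string[];
}

// Replace one mark type with another without touching the node's position in the tree.
function swapMark(node: TextNode, from: string, to: string): TextNode {
  return {
    ...node,
    marks: node.marks.map((m) => (m.type === from ? { ...m, type: to } : m)),
  };
}

const bold: TextNode = { type: "text", value: "hello", marks: [{ type: "strong" }], classes: [] };
const italic = swapMark(bold, "strong", "emphasis");
console.log(JSON.stringify(italic.marks)); // [{"type":"emphasis"}]
```

Because the text node stays where it is, a differ working on this shape only has to report a change in the `marks` array rather than a delete, an insert, and a reparent.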
+### Diff Algorithm
+Three-phase GumTree-inspired approach:
+| Phase | Algorithm | What it catches |
+|-------|-----------|----------------|
+| 1. Top-down | FNV-1a content hash matching | Identical subtrees (unchanged content) |
+| 2. Bottom-up | Dice coefficient + inner matching | Modified nodes, moves, reorders |
+| 3. Text diff | Myers O(ND) | Word/character-level changes within text nodes |
+Output: serializable `DiffResult` with typed operations — `insert`, `delete`, `move`, `update`, `updateText`.
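Phases 1 and 2 rest on two small primitives, sketched below under stated assumptions. `fnv1a32` is the standard 32-bit FNV-1a hash, and `diceCoefficient` is the usual set-similarity formula 2·|A∩B| / (|A|+|B|). Both function names are hypothetical, and how otomate serializes a subtree before hashing or picks the items to compare is not described in the README, so plain strings and string sets stand in for them here.

```typescript
// 32-bit FNV-1a. Assumption: input is folded in one UTF-16 code unit at a time,
// which matches the byte-wise reference algorithm only for ASCII strings.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Dice coefficient over two sets: 1 means identical, 0 means disjoint.
function diceCoefficient(a: Set<string>, b: Set<string>): number {
  if (a.size + b.size === 0) return 1; // two empty sets are treated as identical
  let common = 0;
  a.forEach((item) => {
    if (b.has(item)) common++;
  });
  return (2 * common) / (a.size + b.size);
}

console.log(fnv1a32("a").toString(16)); // e40c292c
console.log(diceCoefficient(new Set(["p1", "p2", "p3"]), new Set(["p2", "p3", "p4"])));
```

In the second call two of the three items are shared, so the score is 2·2/6, roughly 0.67. A GumTree-style matcher would compare such a score against a threshold to decide whether two unmatched nodes correspond.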
+## Packages
+| Package | Description |
+|---------|-------------|
+| `@otomate/core` | UDM types, node builders, traversal, FNV-1a hashing |
+| `@otomate/diff` | Tree diff engine + Myers string-level diffing |
+| `@otomate/html` | HTML ↔ UDM adapter (via hast) |
+| `@otomate/docx` | docx ↔ UDM adapter (via jszip + fast-xml-parser) |
+| `@otomate/css-docx` | CSS properties → OOXML style mapping |
+| `@otomate/ui` | Web UI for testing conversions and diffs |
+## Installation
+### npm / pnpm (Node.js or bundler)
+```bash
+npm install otomate
+```
+```typescript
+import { readHtml, writeDocx, diff } from "otomate";
+```
+### CDN (browser, no build step)
+```html
+<script src="https://cdn.jsdelivr.net/npm/otomate/dist/otomate.umd.cjs"></script>
+<script>
+  const tree = otomate.readHtml("<p>Hello <strong>world</strong></p>");
+  console.log(tree); // UDM tree
+  const html = otomate.writeHtml(tree);
+  console.log(html); // HTML serialized back from the UDM tree
+</script>
+```
+Or with ES modules:
+```html
+<script type="module">
+  import { readHtml, writeDocx, diff } from "https://cdn.jsdelivr.net/npm/otomate/dist/otomate.js";
+  const tree = readHtml("<h1>Hello</h1><p>World</p>");
+  const docxBuffer = await writeDocx(tree);
+</script>
+```
+### Individual packages (tree-shakeable)
+```bash
+npm install @otomate/core @otomate/diff @otomate/html @otomate/docx
+```
+## Quick Start
+### Parse HTML to UDM
+```typescript
+import { readHtml } from "@otomate/html";
+const tree = readHtml(`
+  <h1>Hello World</h1>
+  <p>This is <strong>bold</strong> and <em>italic</em> text.</p>
+`);
+// tree is a UDM Root node:
+// root
+//   heading depth=1
+//     text "Hello World"
+//   paragraph
+//     text "This is "
+//     text "bold" [strong]
+//     text " and "
+//     text "italic" [emphasis]
+//     text " text."
+```
+### Parse with CSS styling
+```typescript
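+import { readHtml } from "@otomate/html";
+// Illustrative input; any HTML string works here
+const html = `<h1>Title</h1><p class="intro">Welcome</p>`;
+const tree = readHtml(html, {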
+  css: `
+    h1 { font-size: 28pt; color: #1e3a5f; }
+    .intro { font-style: italic; color: #374151; }
+    .highlighted { background-color: #dbeafe; }
+  `
+});
+// CSS rules stored on tree.data.css — used by docx writer for Word styles
+```
+### Convert to docx
+```typescript
+import * as fs from "node:fs";
+import { writeDocx } from "@otomate/docx";
+const buffer = await writeDocx(tree);
+// buffer is a valid .docx file with:
+// - CSS-derived Word styles (fonts, colors, sizes, backgrounds)
+// - Embedded otomate-udm.json for lossless round-trip
+fs.writeFileSync("output.docx", buffer);
+```
+### Lossless round-trip
+```typescript
+import { readDocx } from "@otomate/docx";
+const tree2 = await readDocx(buffer);
+// tree2 is identical to tree — all classes, hierarchy, marks preserved
+// via embedded otomate-udm.json (invisible to Word, used by otomate)
+```
+### Diff two documents
+```typescript
+import { diff } from "@otomate/diff";
+// oldTree and newTree are UDM Root nodes, e.g. produced by readHtml or readDocx
+const result = diff(oldTree, newTree);
+// result.operations: [
+//   { type: "updateText", path: [0,0], changes: [
+//     { type: "equal", value: "Hello " },
+//     { type: "delete", value: "World" },
+//     { type: "insert", value: "Everyone" }
+//   ]},
+//   { type: "insert", path: [2], node: { type: "paragraph", ... } },
+//   { type: "move", from: [1], to: [3] }
+// ]
+```
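The `changes` array of an `updateText` operation can be read as an edit script. As a small sketch, the helpers below rebuild the old and new text from it. The `TextChange` type and both helper names are hypothetical; they mirror only the `{ type, value }` shape shown in the example above and are not part of the documented API.

```typescript
// Hypothetical type mirroring the shape of the `changes` entries in the example.
type TextChange = { type: "equal" | "delete" | "insert"; value: string };

// The old text is everything that was kept or deleted.
function oldText(changes: TextChange[]): string {
  return changes
    .filter((c) => c.type !== "insert")
    .map((c) => c.value)
    .join("");
}

// The new text is everything that was kept or inserted.
function newText(changes: TextChange[]): string {
  return changes
    .filter((c) => c.type !== "delete")
    .map((c) => c.value)
    .join("");
}

const changes: TextChange[] = [
  { type: "equal", value: "Hello " },
  { type: "delete", value: "World" },
  { type: "insert", value: "Everyone" },
];

console.log(oldText(changes)); // Hello World
console.log(newText(changes)); // Hello Everyone
```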
+### Cross-format diff
+```typescript
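+import { readHtml } from "@otomate/html";
+import { readDocx } from "@otomate/docx";
+import { diff } from "@otomate/diff";
+// Diff HTML against a docx: both are parsed to UDM first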
+const htmlTree = readHtml(htmlString);
+const docxTree = await readDocx(docxBuffer);
+const result = diff(htmlTree, docxTree);
+// Structural comparison regardless of source format
+```
+## Web UI
+A built-in test interface for interactive conversion and diffing:
+```bash
+cd packages/ui
+pnpm dev
+# Opens at http://localhost:5555
+```
+**Four panels:**
+- **Input** — HTML editor or docx file upload
+- **CSS Stylesheet** — element and class rules (applied to preview and docx export)
+- **UDM Tree** — live parsed document structure
+- **Output** — HTML preview, HTML source, or UDM JSON
+**Features:**
+- Live CSS → preview (edit a rule, see it instantly)
+- Export .docx with CSS-derived Word styles
+- Re-import .docx with lossless round-trip
+- Diff two documents with typed operation view
+## Node Types
+### Block nodes
+| UDM Type | HTML | docx | Description |
+|----------|------|------|-------------|
+| `root` | `<body>` | document | Document root |
+| `paragraph` | `<p>` | `w:p` | Paragraph |
+| `heading` | `<h1>`–`<h6>` | `w:pStyle Heading1-6` | Heading (depth 1-6) |
+| `blockquote` | `<blockquote>` | indented `w:p` + left border | Block quote |
+| `list` | `<ul>`/`<ol>` | `w:numPr` | List (ordered/unordered) |
+| `listItem` | `<li>` | list paragraph | List item |
+| `codeBlock` | `<pre><code>` | `w:pStyle Code` | Code block with optional lang |
+| `table` | `<table>` | `w:tbl` | Table |
+| `tableRow` | `<tr>` | `w:tr` | Table row |
+| `tableCell` | `<td>`/`<th>` | `w:tc` | Table cell |
+| `thematicBreak` | `<hr>` | bottom border | Horizontal rule |
+| `div` | `<div>` | transparent container | Generic container |
+| `image` | `<img>` | `w:drawing` | Image |
+### Inline marks
+| Mark | HTML | docx | Description |
+|------|------|------|-------------|
+| `strong` | `<strong>`/`<b>` | `w:b` | Bold |
+| `emphasis` | `<em>`/`<i>` | `w:i` | Italic |
+| `underline` | `<u>` | `w:u` | Underline |
+| `strikethrough` | `<del>`/`<s>` | `w:strike` | Strikethrough |
+| `superscript` | `<sup>` | `w:vertAlign superscript` | Superscript |
+| `subscript` | `<sub>` | `w:vertAlign subscript` | Subscript |
+| `code` | `<code>` | monospace font | Inline code |
+| `link` | `<a>` | `w:hyperlink` | Hyperlink (with URL) |
+| `highlight` | `<mark>` | `w:highlight` | Highlighted text |
+## CSS → docx Style Mapping
+When CSS rules are provided, otomate generates real Word styles with OOXML formatting:
+| CSS Property | OOXML | Notes |
+|---|---|---|
+| `font-family` | `w:rFonts` | First font in stack |
+| `font-size` | `w:sz` | Converted to half-points |
+| `font-weight: bold` | `w:b` | ≥700 or "bold" |
+| `font-style: italic` | `w:i` | |
+| `color` | `w:color` | Hex, rgb(), named colors |
+| `background-color` | `w:shd` (paragraph) + `w:shd` (run) | |
+| `text-decoration: underline` | `w:u` | |
+| `text-decoration: line-through` | `w:strike` | |
+| `text-align` | `w:jc` | left, center, right, justify→both |
+| `margin-top/bottom` | `w:spacing before/after` | Converted to twips |
+| `margin-left` | `w:ind left` | Converted to twips |
+| `line-height` | `w:spacing line` | |
+Container CSS (on `div` elements) cascades to all child paragraphs, headings, and list items.
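The half-point and twip conversions in the table come down to simple arithmetic: one point is two half-points and twenty twips. The sketch below works through it, assuming lengths arrive as strings in `pt` or `px` and that pixels convert at the CSS reference of 96 px per inch (1px = 0.75pt). The function names are hypothetical and do not come from `@otomate/css-docx`, which may handle more units.

```typescript
// Assumptions: CSS lengths arrive as strings like "28pt" or "16px",
// and px are converted at 96 px per inch, i.e. 1px = 0.75pt.
function toPoints(length: string): number {
  const match = /^(-?\d*\.?\d+)(pt|px)$/.exec(length.trim());
  if (!match) throw new Error(`Unsupported CSS length: ${length}`);
  const value = parseFloat(match[1]);
  return match[2] === "px" ? value * 0.75 : value;
}

// w:sz is expressed in half-points.
function toHalfPoints(length: string): number {
  return Math.round(toPoints(length) * 2);
}

// w:spacing and w:ind are expressed in twips (twentieths of a point).
function toTwips(length: string): number {
  return Math.round(toPoints(length) * 20);
}

console.log(toHalfPoints("28pt")); // 56
console.log(toHalfPoints("16px")); // 24
console.log(toTwips("12pt")); // 240
```

So the `h1 { font-size: 28pt; }` rule from the earlier example would correspond to a `w:sz` value of 56 under these assumptions.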
+## Lossless Round-Trip Strategy
+otomate uses a **two-tier** system for format conversion:
+**Tier 1 — Universal structure** (diffable across formats):
+Text, paragraphs, headings, lists, tables, images, bold, italic, links, classes.
+**Tier 2 — Format-specific metadata** (preserved per format):
+- `data.html` — id, data-* attributes, style, aria-*, ol type
+- `data.docx` — spacing, indent, alignment, fonts, colors, raw XML parts
+**otomate-to-otomate round-trip:**
+The docx writer embeds `word/otomate-udm.json` (the full UDM tree) and `word/otomate-css.json` (CSS rules) inside the ZIP. Word ignores these files. When otomate reads the docx back, it finds the snapshot and reconstructs the tree perfectly — all classes, hierarchy, marks, and CSS rules intact.
+## Development
+```bash
+# Install dependencies
+pnpm install
+# Build all packages
+pnpm -r build
+# Run tests
+pnpm -r test
+# Start the UI dev server
+cd packages/ui && pnpm dev
+```
+## License
+[Business Source License 1.1](LICENSE): free for non-production use. Production use is also permitted, except for offering otomate as a competing hosted or embedded service. The license converts to MIT on 2030-04-03.
+All dependencies are MIT or Apache-2.0 licensed.