@uniweb/semantic-parser 1.0.3 → 1.0.8

package/AGENTS.md ADDED
@@ -0,0 +1,136 @@
+ # AGENTS.md
+
+ This file provides guidance for AI assistants working with code in this repository.
+
+ ## Project Overview
+
+ This is a semantic parser for ProseMirror/TipTap content structures. It transforms rich text editor content into structured, semantic groups that web components can consume. The parser bridges the gap between natural content writing and component-based web development.
+
+ ## Development Commands
+
+ ```bash
+ # Run all tests
+ npm test
+
+ # Run tests with JSON report output
+ npm run test-report
+
+ # Run a specific test file
+ npx jest tests/parser.test.js
+
+ # Run tests in watch mode
+ npx jest --watch
+
+ # Run a specific test by name
+ npx jest -t "handles simple document structure"
+ ```
+
+ ## Architecture
+
+ ### Three-Stage Processing Pipeline
+
+ The parser processes content through three distinct stages, each building on the previous:
+
+ 1. **Sequence Processing** (`src/processors/sequence.js`): Flattens the ProseMirror document tree into a linear sequence of semantic elements (headings, paragraphs, images, lists, etc.)
+
+ 2. **Groups Processing** (`src/processors/groups.js`): Transforms the sequence into semantic groups with identified main content and items. Supports two grouping modes:
+    - Heading-based grouping (default)
+    - Divider-based grouping (when horizontal rules are present)
+
+ 3. **ByType Processing** (`src/processors/byType.js`): Organizes elements by type with positional context, enabling type-specific queries
+
+ The main entry point (`src/index.js`) returns all three views:
+ ```js
+ import { parseContent } from './src/index.js';
+
+ const result = parseContent(doc);
+ // {
+ //   raw: doc,        // Original ProseMirror document
+ //   sequence: [...], // Flat sequence of elements
+ //   groups: {...},   // Semantic groups with main/items
+ //   byType: {...}    // Elements organized by type
+ // }
+ ```
+
+ ### Content Group Structure
+
+ Groups follow a specific structure defined in `processGroupContent()`:
+
+ ```js
+ {
+   header: {
+     pretitle: '',   // H3 before main title
+     title: '',      // Main heading (H1 or H2)
+     subtitle: ''    // Heading after main title
+   },
+   body: {
+     imgs: [],
+     icons: [],
+     videos: [],
+     paragraphs: [],
+     links: [],
+     lists: [],
+     buttons: [],
+     properties: [],
+     propertyBlocks: [],
+     cards: [],
+     headings: []
+   },
+   banner: null,     // Image with banner role or image before heading
+   metadata: {
+     level: null,    // Heading level that started this group
+     contentTypes: Set()
+   }
+ }
+ ```
+
+ ### Main Content Identification
+
+ The `identifyMainContent()` function (src/processors/groups.js:282) determines if the first group should be treated as main content:
+ - Single group is always main content
+ - First group must have lower heading level than second group
+ - Divider mode affects main content identification
+
+ ### Special Element Detection
+
+ The sequence processor identifies several special element types by inspecting paragraph content:
+ - **Links**: Paragraphs containing only a single link mark
+ - **Images**: Paragraphs with single image (role: 'image' or 'banner')
+ - **Icons**: Paragraphs with single image (role: 'icon')
+ - **Buttons**: Paragraphs with single text node having button mark
+ - **Videos**: Paragraphs with single image (role: 'video')
+
+ These are extracted into dedicated element types for easier downstream processing.
+
+ ### List Processing
+
+ Lists maintain hierarchy through nested structure. The `processListItems()` function in sequence.js handles nested lists, while `processListContent()` in groups.js applies full group content processing to each list item, allowing lists to contain rich content (images, paragraphs, nested lists, etc.).
+
+ ## Content Writing Conventions
+
+ The parser implements the semantic conventions documented in `docs/guide.md`. Key patterns:
+
+ - **Pretitle Pattern**: Any heading followed by a more important heading (e.g., H3→H1, H2→H1, H6→H5, etc.)
+ - **Banner Pattern**: Image (with banner role or followed by heading) at start of first group
+ - **Divider Mode**: Presence of any `horizontalRule` switches entire document to divider-based grouping
+ - **Heading Groups**: Consecutive headings with increasing levels are consumed together
+ - **Main Content**: First group is main if it's the only group OR has lower heading level than second group
+ - **Body Headings**: Headings that overflow the header slots (title, subtitle, subtitle2) are automatically collected in `body.headings`
+
+ ## Testing Structure
+
+ Tests are organized by processor:
+ - `tests/parser.test.js` - Integration tests
+ - `tests/processors/sequence.test.js` - Sequence processing
+ - `tests/processors/groups.test.js` - Groups processing
+ - `tests/processors/byType.test.js` - ByType processing
+ - `tests/utils/role.test.js` - Role utilities
+ - `tests/fixtures/` - Shared test documents
+
+ ## Important Implementation Notes
+
+ - The parser never modifies the original ProseMirror document
+ - Text content can include inline HTML for formatting (bold → `<strong>`, italic → `<em>`, links → `<a>`)
+ - The `processors_old/` directory contains legacy implementations - do not modify
+ - Context information in byType includes position, previous/next elements, and nearest heading
+ - Group splitting logic differs significantly between heading mode and divider mode
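The main-content rule described in the added AGENTS.md can be sketched in a few lines. This is only an illustration, not the package's `identifyMainContent()`: the helper name `firstGroupIsMain` is hypothetical, it assumes each group exposes `metadata.level` as a number (1 for H1, 2 for H2, and so on), it reads "lower heading level" as "numerically smaller level", and it ignores divider mode entirely because the diff does not say how that mode changes the decision.

```javascript
// Hypothetical helper for illustration; not part of @uniweb/semantic-parser.
// Assumption: groups[i].metadata.level is 1 for H1, 2 for H2, etc.
function firstGroupIsMain(groups) {
  if (groups.length === 0) return false;
  // A single group is always treated as main content.
  if (groups.length === 1) return true;

  const first = groups[0].metadata.level;
  const second = groups[1].metadata.level;

  // Assumption: without both levels the rule cannot be applied.
  if (first == null || second == null) return false;

  // "Lower heading level" is read here as a numerically smaller level,
  // e.g. an H1 group followed by an H2 group.
  return first < second;
}

const h1ThenH2 = [{ metadata: { level: 1 } }, { metadata: { level: 2 } }];
const twoH2s = [{ metadata: { level: 2 } }, { metadata: { level: 2 } }];

console.log(firstGroupIsMain(h1ThenH2)); // true
console.log(firstGroupIsMain(twoH2s)); // false
```

Under these assumptions, two sibling H2 groups produce no main content and both would be treated as items.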
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@uniweb/semantic-parser",
-   "version": "1.0.3",
+   "version": "1.0.8",
    "description": "Semantic parser for ProseMirror/TipTap content structures",
    "type": "module",
    "main": "./src/index.js",
package/src/processors/byType.js CHANGED
@@ -44,7 +44,8 @@ function processByType(sequence) {
      break;

    case "image": {
-     const role = element.role || "content";
+     // Support both attrs.role and top-level role for backwards compatibility
+     const role = element.attrs?.role || element.role || "content";
      if (!collections.images[role]) {
        collections.images[role] = [];
      }
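To make the fallback order in this hunk concrete, here is a minimal standalone sketch. The function name `resolveImageRole` and the sample elements are hypothetical. Only the lookup expression is taken from the diff.

```javascript
// Hypothetical standalone copy of the role lookup shown in the hunk above.
function resolveImageRole(element) {
  // Prefer the wrapped shape (attrs.role), then the legacy top-level role,
  // then fall back to "content".
  return element.attrs?.role || element.role || "content";
}

// Wrapped shape: role lives under attrs.
const wrapped = resolveImageRole({ type: "image", attrs: { role: "banner" } });
// Legacy shape: role spread at the top level of the element.
const legacy = resolveImageRole({ type: "image", role: "icon" });
// Neither present: the default applies.
const fallback = resolveImageRole({ type: "image", attrs: {} });

console.log(wrapped, legacy, fallback); // banner icon content
```

If an element carried both a top-level `role` and `attrs.role`, this expression would pick `attrs.role`.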
package/src/processors/groups.js CHANGED
@@ -320,9 +320,11 @@ function identifyMainContent(groups) {

  function processInlineElements(children, body) {
    children.forEach((item) => {
-     //Handle icons only for now
      if (item.type === "icon") {
        body.icons.push(item.attrs);
+     } else if (item.type === "link") {
+       // Handle inline links extracted from paragraph text nodes
+       body.links.push(item.attrs);
      }
    });
  }
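The effect of this hunk on a group's `body` can be sketched as follows. The function below mirrors the patched loop under a hypothetical name (`collectInline`), and the sample children, including their `attrs` values, are made up for illustration.

```javascript
// Hypothetical mirror of the patched loop, for illustration only.
function collectInline(children, body) {
  children.forEach((item) => {
    if (item.type === "icon") {
      body.icons.push(item.attrs);
    } else if (item.type === "link") {
      // New in this change: inline links are collected alongside icons.
      body.links.push(item.attrs);
    }
    // Any other inline type is ignored by this function.
  });
}

const body = { icons: [], links: [] };
collectInline(
  [
    { type: "icon", attrs: { name: "star" } }, // made-up icon attrs
    { type: "link", attrs: { href: "https://example.com", label: "Docs" } },
    { type: "math-inline" }, // not collected here
  ],
  body
);

console.log(body.icons.length, body.links.length); // 1 1
```

Before the change, the same input would have left `body.links` empty, since only icons were handled.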
package/src/processors/sequence.js CHANGED
@@ -102,10 +102,10 @@ function createSequenceElement(node, options = {}) {
        attrs: parseImgBlock(attrs),
      };
    case "image":
-     // Standard ProseMirror image node - spread attrs at top level
+     // Standard ProseMirror image node - wrap attrs like ImageBlock
      return {
        type: "image",
-       ...attrs,
+       attrs: attrs || {},
      };
    case "Video":
      return {
@@ -302,6 +302,18 @@ function processInlineElements(content) {
      });
    } else if (item.type === "math-inline") {
      items.push(item);
+   } else if (item.type === "text" && item.marks) {
+     // Extract links from text nodes with link marks
+     const linkMark = item.marks.find((m) => m.type === "link");
+     if (linkMark) {
+       items.push({
+         type: "link",
+         attrs: {
+           href: linkMark.attrs?.href,
+           label: item.text || "",
+         },
+       });
+     }
    }
  }
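The link extraction added in this hunk can be tried in isolation with a small sketch. The wrapper name `extractLinks` and the sample paragraph content are hypothetical. The branch body follows the added lines, and the input uses the usual ProseMirror JSON shape for text nodes with marks.

```javascript
// Hypothetical wrapper around the added branch, for illustration only.
function extractLinks(content) {
  const items = [];
  for (const item of content) {
    if (item.type === "text" && item.marks) {
      // Find the first link mark on this text node, if any.
      const linkMark = item.marks.find((m) => m.type === "link");
      if (linkMark) {
        items.push({
          type: "link",
          attrs: {
            href: linkMark.attrs?.href,
            label: item.text || "",
          },
        });
      }
    }
  }
  return items;
}

// Made-up paragraph content: plain text, a linked word, and a bold word.
const paragraphContent = [
  { type: "text", text: "Read the " },
  { type: "text", text: "guide", marks: [{ type: "link", attrs: { href: "/guide" } }] },
  { type: "text", text: " carefully", marks: [{ type: "bold" }] },
];

const links = extractLinks(paragraphContent);
console.log(links.length, links[0].attrs.href, links[0].attrs.label); // 1 /guide guide
```

Two properties of this sketch are worth noting: a link mark without `attrs` yields an entry whose `href` is `undefined`, and a link that spans several text nodes (for example, when part of it is bold) yields one entry per node.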