@uniweb/semantic-parser 1.0.7 → 1.0.8
- package/AGENTS.md +136 -0
- package/package.json +1 -1
package/AGENTS.md ADDED
@@ -0,0 +1,136 @@

# AGENTS.md

This file provides guidance for AI assistants working with code in this repository.

## Project Overview

This is a semantic parser for ProseMirror/TipTap content structures. It transforms rich text editor content into structured, semantic groups that web components can consume. The parser bridges the gap between natural content writing and component-based web development.

## Development Commands

```bash
# Run all tests
npm test

# Run tests with JSON report output
npm run test-report

# Run a specific test file
npx jest tests/parser.test.js

# Run tests in watch mode
npx jest --watch

# Run a specific test by name
npx jest -t "handles simple document structure"
```

## Architecture

### Three-Stage Processing Pipeline

The parser processes content through three distinct stages, each building on the previous:

1. **Sequence Processing** (`src/processors/sequence.js`): Flattens the ProseMirror document tree into a linear sequence of semantic elements (headings, paragraphs, images, lists, etc.)

2. **Groups Processing** (`src/processors/groups.js`): Transforms the sequence into semantic groups with identified main content and items. Supports two grouping modes:
   - Heading-based grouping (default)
   - Divider-based grouping (when horizontal rules are present)

3. **ByType Processing** (`src/processors/byType.js`): Organizes elements by type with positional context, enabling type-specific queries

The main entry point (`src/index.js`) returns the original document together with all three views:
```js
import { parseContent } from './src/index.js';

const result = parseContent(doc);
// {
//   raw: doc,           // Original ProseMirror document
//   sequence: [...],    // Flat sequence of elements
//   groups: {...},      // Semantic groups with main/items
//   byType: {...}       // Elements organized by type
// }
```
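
To make the first stage concrete, here is a minimal sketch of flattening a ProseMirror JSON document into a linear sequence. It is not the code in `src/processors/sequence.js`: the helper names (`textOf`, `flattenToSequence`) and the output element shape (`type`, `level`, `content`) are assumptions made for illustration, and only top-level blocks are handled.

```javascript
// Hypothetical sketch, NOT the implementation in src/processors/sequence.js.

// Collects the plain text of a ProseMirror JSON node and its descendants.
function textOf(node) {
  if (node.type === 'text') return node.text ?? '';
  return (node.content ?? []).map(textOf).join('');
}

// Flattens the top-level blocks of a document into a linear list.
// The element shape used here is an assumption for illustration.
function flattenToSequence(doc) {
  const sequence = [];
  for (const node of doc.content ?? []) {
    switch (node.type) {
      case 'heading':
        sequence.push({
          type: 'heading',
          level: node.attrs?.level ?? 1,
          content: textOf(node),
        });
        break;
      case 'paragraph':
        sequence.push({ type: 'paragraph', content: textOf(node) });
        break;
      case 'horizontalRule':
        sequence.push({ type: 'divider' });
        break;
      default:
        // Other node types are passed through by name only in this sketch.
        sequence.push({ type: node.type });
    }
  }
  return sequence;
}

const sampleDoc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Welcome' }] },
    {
      type: 'paragraph',
      content: [
        { type: 'text', text: 'Hello ' },
        { type: 'text', text: 'world' },
      ],
    },
    { type: 'horizontalRule' },
    { type: 'heading', attrs: { level: 2 }, content: [{ type: 'text', text: 'Details' }] },
  ],
};

const sampleSequence = flattenToSequence(sampleDoc);
```

The later stages would then consume a list like `sampleSequence` rather than the nested document tree.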

### Content Group Structure

Groups follow a specific structure defined in `processGroupContent()`:

```js
{
  header: {
    pretitle: '',   // Heading before a more important main title (e.g. H3 before H1)
    title: '',      // Main heading (H1 or H2)
    subtitle: ''    // Heading after main title
  },
  body: {
    imgs: [],
    icons: [],
    videos: [],
    paragraphs: [],
    links: [],
    lists: [],
    buttons: [],
    properties: [],
    propertyBlocks: [],
    cards: [],
    headings: []
  },
  banner: null,     // Image with banner role or image before heading
  metadata: {
    level: null,    // Heading level that started this group
    contentTypes: new Set()
  }
}
```
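
As a rough illustration of how a component might consume this shape, the sketch below builds an empty group and lists which body collections hold content. Both helpers (`createEmptyGroup`, `nonEmptyBodyKeys`) are hypothetical and are not part of the package; only the field names come from the structure above.

```javascript
// Hypothetical helper: builds an empty group with the documented shape.
function createEmptyGroup() {
  return {
    header: { pretitle: '', title: '', subtitle: '' },
    body: {
      imgs: [], icons: [], videos: [], paragraphs: [], links: [], lists: [],
      buttons: [], properties: [], propertyBlocks: [], cards: [], headings: [],
    },
    banner: null,
    metadata: { level: null, contentTypes: new Set() },
  };
}

// Hypothetical consumer: names of the body collections that are not empty.
function nonEmptyBodyKeys(group) {
  return Object.keys(group.body).filter((key) => group.body[key].length > 0);
}

const exampleGroup = createEmptyGroup();
exampleGroup.header.title = 'Welcome';
exampleGroup.metadata.level = 1;
exampleGroup.body.paragraphs.push('Intro text');
exampleGroup.metadata.contentTypes.add('paragraph');

const filledKeys = nonEmptyBodyKeys(exampleGroup);
```

A component can use a check like this to decide which parts of its layout to render for a given group.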

### Main Content Identification

The `identifyMainContent()` function (`src/processors/groups.js:282`) determines whether the first group should be treated as main content:
- A single group is always main content
- With multiple groups, the first group must have a lower heading level than the second group (e.g. H1 before H2)
- Divider mode affects main content identification
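
The first two rules can be sketched as a small predicate. This is an illustration of the documented rules only, not the function at `src/processors/groups.js:282`: the name `firstGroupIsMain` is hypothetical, divider mode is deliberately left out, and treating a missing heading level as "not main" is an assumption.

```javascript
// Hypothetical sketch of the heading-mode rules; divider mode is not modeled.
function firstGroupIsMain(groups) {
  if (groups.length === 0) return false;
  // A single group is always main content.
  if (groups.length === 1) return true;

  const firstLevel = groups[0].metadata.level;
  const secondLevel = groups[1].metadata.level;

  // Assumption: without heading levels on both groups there is no basis
  // for comparison, so the first group is not treated as main.
  if (firstLevel == null || secondLevel == null) return false;

  // A lower level number means a more important heading (H1 < H2).
  return firstLevel < secondLevel;
}

const h1ThenH2 = [{ metadata: { level: 1 } }, { metadata: { level: 2 } }];
const h2ThenH2 = [{ metadata: { level: 2 } }, { metadata: { level: 2 } }];

const mainForH1ThenH2 = firstGroupIsMain(h1ThenH2);
const mainForH2ThenH2 = firstGroupIsMain(h2ThenH2);
```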

### Special Element Detection

The sequence processor identifies several special element types by inspecting paragraph content:
- **Links**: Paragraphs containing only a single link mark
- **Images**: Paragraphs with single image (role: 'image' or 'banner')
- **Icons**: Paragraphs with single image (role: 'icon')
- **Buttons**: Paragraphs with single text node having button mark
- **Videos**: Paragraphs with single image (role: 'video')

These are extracted into dedicated element types for easier downstream processing.
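
The detection rules above could be sketched as a classifier over a paragraph's ProseMirror JSON. This is a hedged illustration rather than the parser's actual logic: `classifyParagraph` is a hypothetical name, the mark names `link` and `button` and the `attrs.role` field are assumed from the descriptions above, and defaulting a missing role to `'image'` is an assumption.

```javascript
// Hypothetical sketch: classifies a paragraph by its single child, if any.
function classifyParagraph(paragraph) {
  const children = paragraph.content ?? [];
  // Only paragraphs with exactly one child qualify as special elements.
  if (children.length !== 1) return 'paragraph';

  const [child] = children;

  if (child.type === 'image') {
    // Assumption: an image without an explicit role counts as a plain image.
    const role = child.attrs?.role ?? 'image';
    if (role === 'icon') return 'icon';
    if (role === 'video') return 'video';
    if (role === 'image' || role === 'banner') return 'image';
    return 'paragraph';
  }

  if (child.type === 'text') {
    const markTypes = (child.marks ?? []).map((mark) => mark.type);
    if (markTypes.includes('button')) return 'button';
    if (markTypes.includes('link')) return 'link';
  }

  return 'paragraph';
}

const linkParagraph = {
  type: 'paragraph',
  content: [
    { type: 'text', text: 'Docs', marks: [{ type: 'link', attrs: { href: 'https://example.com' } }] },
  ],
};
const iconParagraph = {
  type: 'paragraph',
  content: [{ type: 'image', attrs: { src: 'star.svg', role: 'icon' } }],
};

const linkKind = classifyParagraph(linkParagraph);
const iconKind = classifyParagraph(iconParagraph);
```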

### List Processing

Lists maintain hierarchy through nested structure. The `processListItems()` function in `sequence.js` handles nested lists, while `processListContent()` in `groups.js` applies full group content processing to each list item, allowing lists to contain rich content (images, paragraphs, nested lists, etc.).
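
A minimal sketch of the recursive idea, assuming TipTap-style node names (`bulletList`, `orderedList`, `listItem`). The function `sketchListItems` and its output shape are hypothetical; the real `processListItems()` and `processListContent()` may differ in both structure and detail.

```javascript
// Collects the plain text of a ProseMirror JSON node and its descendants.
function plainText(node) {
  if (node.type === 'text') return node.text ?? '';
  return (node.content ?? []).map(plainText).join('');
}

// Assumption: these are the list node names used by the editor schema.
const LIST_TYPES = new Set(['bulletList', 'orderedList']);

// Hypothetical sketch: converts a list node into nested plain objects,
// recursing into any list found inside a list item.
function sketchListItems(listNode) {
  return (listNode.content ?? []).map((item) => {
    const children = item.content ?? [];
    return {
      paragraphs: children.filter((child) => child.type === 'paragraph').map(plainText),
      lists: children.filter((child) => LIST_TYPES.has(child.type)).map(sketchListItems),
    };
  });
}

const nestedList = {
  type: 'bulletList',
  content: [
    {
      type: 'listItem',
      content: [
        { type: 'paragraph', content: [{ type: 'text', text: 'Fruit' }] },
        {
          type: 'bulletList',
          content: [
            {
              type: 'listItem',
              content: [{ type: 'paragraph', content: [{ type: 'text', text: 'Apple' }] }],
            },
          ],
        },
      ],
    },
  ],
};

const listItems = sketchListItems(nestedList);
```

Each item keeps its own `lists` array, so the hierarchy survives however deep the nesting goes.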

## Content Writing Conventions

The parser implements the semantic conventions documented in `docs/guide.md`. Key patterns:

- **Pretitle Pattern**: Any heading followed by a more important heading (e.g., H3→H1, H2→H1, H6→H5, etc.)
- **Banner Pattern**: Image (with banner role or followed by heading) at start of first group
- **Divider Mode**: Presence of any `horizontalRule` switches entire document to divider-based grouping
- **Heading Groups**: Consecutive headings with increasing levels are consumed together
- **Main Content**: First group is main if it's the only group OR has lower heading level than second group
- **Body Headings**: Headings that overflow the header slots (title, subtitle, subtitle2) are automatically collected in `body.headings`
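
Two of these patterns, divider mode and the pretitle pattern, reduce to simple checks, sketched below. The function names and the sequence element shape (`type: 'divider'`, `type: 'heading'` with a numeric `level`) are assumptions for illustration and may not match the parser's internal representation.

```javascript
// Hypothetical check: any divider in the sequence selects divider-based grouping.
function usesDividerMode(sequence) {
  return sequence.some((element) => element.type === 'divider');
}

// Hypothetical check: a heading is a pretitle when the next element is a
// more important heading, i.e. one with a smaller level number.
function isPretitle(current, next) {
  return (
    current?.type === 'heading' &&
    next?.type === 'heading' &&
    next.level < current.level
  );
}

const conventionSample = [
  { type: 'heading', level: 3, content: 'New' },
  { type: 'heading', level: 1, content: 'Product launch' },
  { type: 'paragraph', content: 'Details follow.' },
];

const sampleIsDividerMode = usesDividerMode(conventionSample);
const firstIsPretitle = isPretitle(conventionSample[0], conventionSample[1]);
```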

## Testing Structure

Tests are organized by processor:
- `tests/parser.test.js` - Integration tests
- `tests/processors/sequence.test.js` - Sequence processing
- `tests/processors/groups.test.js` - Groups processing
- `tests/processors/byType.test.js` - ByType processing
- `tests/utils/role.test.js` - Role utilities
- `tests/fixtures/` - Shared test documents

## Important Implementation Notes

- The parser never modifies the original ProseMirror document
- Text content can include inline HTML for formatting (bold → `<strong>`, italic → `<em>`, links → `<a>`)
- The `processors_old/` directory contains legacy implementations - do not modify
- Context information in byType includes position, previous/next elements, and nearest heading
- Group splitting logic differs significantly between heading mode and divider mode
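
The inline HTML note can be illustrated with a small sketch that wraps a text node according to its marks. The mark names `bold`, `italic`, and `link` follow TipTap's usual naming and are an assumption here, as are the helper names and the nesting order; the parser's own output may differ.

```javascript
// Escapes characters that would otherwise be read as HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical sketch: renders one text node, wrapping it once per mark.
// Marks are applied in array order, so later marks end up outermost.
function renderInline(node) {
  let html = escapeHtml(node.text ?? '');
  for (const mark of node.marks ?? []) {
    if (mark.type === 'bold') {
      html = `<strong>${html}</strong>`;
    } else if (mark.type === 'italic') {
      html = `<em>${html}</em>`;
    } else if (mark.type === 'link') {
      html = `<a href="${escapeHtml(mark.attrs?.href ?? '')}">${html}</a>`;
    }
  }
  return html;
}

const boldLinkNode = {
  type: 'text',
  text: 'Read more',
  marks: [{ type: 'bold' }, { type: 'link', attrs: { href: 'https://example.com' } }],
};

const boldLinkHtml = renderInline(boldLinkNode);
```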