@uniweb/semantic-parser 1.0.0

@@ -0,0 +1,179 @@
1
+ # Content Grouping Logic
2
+
3
+ This document outlines how the `processGroups` function interprets flat arrays of content (Headings, Paragraphs, etc.) and organizes them into semantic **Main Content** and **List Items**.
4
+
5
+ ## The Core Challenge
6
+
7
+ The parser must distinguish between two visually similar but semantically different patterns:
8
+
9
+ 1. **Subtitles:** A smaller heading that belongs to the main title (Merge).
10
+
11
+ 2. **List Items:** A smaller heading that starts a new list item (Split).
12
+
13
+ ## The Logic (Heuristics)
14
+
15
+ To make this decision, the parser looks ahead at the structure:
16
+
17
+ 1. **Sibling Boundary:** If we are at Level X and encounter another Level X, it is always a sibling. We **Split**.
18
+
19
+ 2. **Peer Detection:** If we are stepping down (H1 → H2), we check if that H2 has a "peer" (another H2) later in the section.
20
+
21
+ 3. **Leaf vs. Branch:**
22
+
23
+ - **Leaf:** A heading with no sub-headings underneath it.
24
+
25
+ - **Branch:** A heading with sub-headings (e.g., H2 followed by H3 dates).
26
+
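The three heuristics above can be sketched as small look-ahead helpers over the flat node array. This is a minimal illustration only, not the package's actual `processGroups` code: the `DocNode` type and the names `headingLevel`, `isBranch`, and `hasPeer` are hypothetical, and the sketch assumes a section ends at the first shallower heading.

```typescript
// Hypothetical node shape, modeled on the heading/paragraph objects shown in this document.
type DocNode = {
  type: string;
  attrs?: { level?: number };
  content?: { type: string; text?: string }[];
};

// Heading level of a node, or null for non-headings (and for missing nodes).
function headingLevel(node: DocNode | undefined): number | null {
  if (!node || node.type !== "heading") return null;
  return node.attrs?.level ?? null;
}

// Branch: the first heading that follows is deeper, so it sits underneath this one.
// Leaf: the next heading is at the same or a shallower level, or there is none.
function isBranch(nodes: DocNode[], index: number): boolean {
  const level = headingLevel(nodes[index]);
  if (level === null) return false;
  for (let i = index + 1; i < nodes.length; i++) {
    const next = headingLevel(nodes[i]);
    if (next !== null) return next > level;
  }
  return false;
}

// Peer: another heading of the same level appears later,
// before a shallower heading closes the section (an assumption of this sketch).
function hasPeer(nodes: DocNode[], index: number): boolean {
  const level = headingLevel(nodes[index]);
  if (level === null) return false;
  for (let i = index + 1; i < nodes.length; i++) {
    const next = headingLevel(nodes[i]);
    if (next === null) continue;
    if (next < level) return false;
    if (next === level) return true;
  }
  return false;
}
```

Under these definitions, an H2 followed by an H3 is a Branch, and an H2 followed directly by another H2 is a Leaf that has a peer.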
27
+ ## Supported Patterns & Behavior
28
+
29
+ ### 1. The "Resume" Pattern (Items)
30
+
31
+ - **Structure:** `H1` → `H2 (Branch)` → `H2 (Branch)`
32
+
33
+ - **Use Case:** Academic Experience, Work History.
34
+
35
+ - **Behavior:** The parser sees the first H2 has a peer. Both are "Branches" (have children).
36
+
37
+ - **Result:** **Split**. The H1 becomes Main; the H2s become separate Items.
38
+
39
+ **Input Data Structure:**
40
+
41
+ ```
42
+ {
43
+ "type": "doc",
44
+ "content": [
45
+ { "type": "heading", "attrs": { "level": 1 }, "content": [{ "type": "text", "text": "Academic Experience" }] },
46
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Ph.D. in CS" }] },
47
+ { "type": "heading", "attrs": { "level": 3 }, "content": [{ "type": "text", "text": "2014-2018" }] },
48
+ { "type": "paragraph", "content": [{ "type": "text", "text": "MIT" }] },
49
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Masters in Data" }] },
50
+ { "type": "heading", "attrs": { "level": 3 }, "content": [{ "type": "text", "text": "2012-2014" }] },
51
+ { "type": "paragraph", "content": [{ "type": "text", "text": "Berkeley" }] }
52
+ ]
53
+ }
54
+ ```
55
+
56
+ **Parsed Output:**
57
+
58
+ ```
59
+ [Main] title: Academic Experience
60
+
61
+ [Item] title: Ph.D. in CS
62
+ subtitle: 2014-2018
63
+
64
+ [Item] title: Masters in Data
65
+ subtitle: 2012-2014
66
+ ```
67
+
68
+ ### 2. The "Standard" Pattern (Leaf Items)
69
+
70
+ - **Structure:** `H1` → `H2 (Leaf)` → `H2 (Leaf)`
71
+
72
+ - **Use Case:** Features list, standard sections.
73
+
74
+ - **Behavior:** Even though the H2s are leaves (no children), the parser detects a peer (another H2).
75
+
76
+ - **Result:** **Split**. Sibling detection forces them into separate items.
77
+
78
+ **Input Data Structure:**
79
+
80
+ ```
81
+ {
82
+ "type": "doc",
83
+ "content": [
84
+ { "type": "heading", "attrs": { "level": 1 }, "content": [{ "type": "text", "text": "Features" }] },
85
+ { "type": "paragraph", "content": [{ "type": "text", "text": "Our main features." }] },
86
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Feature One" }] },
87
+ { "type": "paragraph", "content": [{ "type": "text", "text": "First feature description." }] },
88
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Feature Two" }] },
89
+ { "type": "paragraph", "content": [{ "type": "text", "text": "Second feature description." }] }
90
+ ]
91
+ }
92
+ ```
93
+
94
+ **Parsed Output:**
95
+
96
+ ```
97
+ [Main] title: Features
98
+ body: Our main features.
99
+
100
+ [Item] title: Feature One
101
+ body: First feature description.
102
+
103
+ [Item] title: Feature Two
104
+ body: Second feature description.
105
+ ```
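The split shown above can be sketched as a single pass that opens a new group at every heading of the item level. This is an illustrative sketch rather than the package's implementation: `DocNode`, `Groups`, and `splitAtLevel` are hypothetical names, and the item level is passed in by the caller instead of being detected.

```typescript
// Hypothetical node shape, modeled on the heading/paragraph objects shown in this document.
type DocNode = {
  type: string;
  attrs?: { level?: number };
  content?: { type: string; text?: string }[];
};

// Main holds everything before the first item heading; each item is its own node list.
type Groups = { main: DocNode[]; items: DocNode[][] };

function splitAtLevel(nodes: DocNode[], itemLevel: number): Groups {
  const groups: Groups = { main: [], items: [] };
  let current: DocNode[] = groups.main;
  for (const node of nodes) {
    const startsItem =
      node.type === "heading" && node.attrs?.level === itemLevel;
    if (startsItem) {
      // Sibling boundary: a heading at the item level always opens a new item.
      current = [];
      groups.items.push(current);
    }
    current.push(node);
  }
  return groups;
}
```

Applied to the "Features" input with `itemLevel` set to 2, this yields a Main group of two nodes (the H1 and its paragraph) and two Items of two nodes each.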
106
+
107
+ ### 3. The "Hybrid" Pattern (Intro Subtitle + Items)
108
+
109
+ - **Structure:** `H1` → `H2 (Leaf)` → `H2 (Branch)`
110
+
111
+ - **Use Case:** A section with a summary heading before the list starts.
112
+
113
+ - **Behavior:** The parser compares the first H2 (Leaf) against the next H2 (Branch). Because their shapes differ, the Leaf is read as a subtitle of the H1 rather than as a list item.
114
+
115
+ - **Result:** **Merge then Split**. The first H2 merges into Main. The second H2 starts the first Item.
116
+
117
+ **Input Data Structure:**
118
+
119
+ ```
120
+ {
121
+ "type": "doc",
122
+ "content": [
123
+ { "type": "heading", "attrs": { "level": 1 }, "content": [{ "type": "text", "text": "Work History" }] },
124
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "A summary of my roles." }] },
125
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Google" }] },
126
+ { "type": "heading", "attrs": { "level": 3 }, "content": [{ "type": "text", "text": "2020-Present" }] },
127
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Facebook" }] },
128
+ { "type": "heading", "attrs": { "level": 3 }, "content": [{ "type": "text", "text": "2018-2020" }] }
129
+ ]
130
+ }
131
+ ```
132
+
133
+ **Parsed Output:**
134
+
135
+ ```
136
+ [Main] title: Work History
137
+ subtitle: A summary of my roles.
138
+
139
+ [Item] title: Google
140
+ subtitle: 2020-Present
141
+
142
+ [Item] title: Facebook
143
+ subtitle: 2018-2020
144
+ ```
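One way to express the merge-or-split decision across the three patterns so far is sketched below. It is an assumption-laden illustration, not the package's code: the names `nextPeerIndex` and `decide` are hypothetical, and the rule "merge when there is no peer" is inferred from the patterns in this document rather than stated by it.

```typescript
// Hypothetical node shape, modeled on the heading/paragraph objects shown in this document.
type DocNode = {
  type: string;
  attrs?: { level?: number };
  content?: { type: string; text?: string }[];
};

function headingLevel(node: DocNode | undefined): number | null {
  if (!node || node.type !== "heading") return null;
  return node.attrs?.level ?? null;
}

// Branch if the first heading that follows is deeper; Leaf otherwise.
function isBranch(nodes: DocNode[], index: number): boolean {
  const level = headingLevel(nodes[index]);
  if (level === null) return false;
  for (let i = index + 1; i < nodes.length; i++) {
    const next = headingLevel(nodes[i]);
    if (next !== null) return next > level;
  }
  return false;
}

// Index of the next heading at the same level, or -1 if a shallower heading
// (or the end of the array) comes first.
function nextPeerIndex(nodes: DocNode[], index: number): number {
  const level = headingLevel(nodes[index]);
  if (level === null) return -1;
  for (let i = index + 1; i < nodes.length; i++) {
    const next = headingLevel(nodes[i]);
    if (next === null) continue;
    if (next < level) return -1;
    if (next === level) return i;
  }
  return -1;
}

// Decide whether a sub-heading merges into Main (subtitle) or starts an Item.
function decide(nodes: DocNode[], index: number): "merge" | "split" {
  const peer = nextPeerIndex(nodes, index);
  if (peer === -1) return "merge"; // assumption: no peer means subtitle
  const leafThenBranch = !isBranch(nodes, index) && isBranch(nodes, peer);
  return leafThenBranch ? "merge" : "split"; // Hybrid merges; Resume and Standard split
}
```

With the "Work History" input, `decide` returns `"merge"` for the summary H2 and `"split"` for "Google", which matches the parsed output above.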
145
+
146
+ ### 4. The "Deep Header" Pattern
147
+
148
+ - **Structure:** `H3` → `H1` → `H2` → `H3`
149
+
150
+ - **Use Case:** Complex Hero sections with pre-titles and multiple subtitles.
151
+
152
+ - **Behavior:** The headings are strictly sequential or hierarchical components of a single block.
153
+
154
+ - **Result:** **Merge**. Treats the hierarchy as a single deep header block; no Items are produced.
155
+
156
+ **Input Data Structure:**
157
+
158
+ ```
159
+ {
160
+ "type": "doc",
161
+ "content": [
162
+ { "type": "heading", "attrs": { "level": 3 }, "content": [{ "type": "text", "text": "WELCOME" }] },
163
+ { "type": "heading", "attrs": { "level": 1 }, "content": [{ "type": "text", "text": "Main Title" }] },
164
+ { "type": "heading", "attrs": { "level": 2 }, "content": [{ "type": "text", "text": "Subtitle" }] },
165
+ { "type": "heading", "attrs": { "level": 3 }, "content": [{ "type": "text", "text": "Subsubtitle" }] },
166
+ { "type": "paragraph", "content": [{ "type": "text", "text": "Content." }] }
167
+ ]
168
+ }
169
+
170
+ ```
171
+
172
+ **Parsed Output:**
173
+
174
+ ```
175
+ [Main] title: Main Title
176
+ pretitle: WELCOME
177
+ subtitle: Subtitle
178
+ subtitle2: Subsubtitle
179
+ ```
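The role assignment in this output can be sketched as follows. The field names `pretitle`, `title`, `subtitle`, and `subtitle2` come from the parsed output above; everything else (`DocNode`, `HeaderRoles`, `assignHeaderRoles`, and the rule that a first heading deeper than the second is a pre-title) is a hypothetical illustration that assumes all headings belong to one header block.

```typescript
// Hypothetical node shape, modeled on the heading/paragraph objects shown in this document.
type DocNode = {
  type: string;
  attrs?: { level?: number };
  content?: { type: string; text?: string }[];
};

type HeaderRoles = {
  pretitle: string | null;
  title: string | null;
  subtitle: string | null;
  subtitle2: string | null;
};

// Concatenate the text of a node's inline children.
function textOf(node: DocNode): string {
  return (node.content ?? []).map((c) => c.text ?? "").join("");
}

function assignHeaderRoles(nodes: DocNode[]): HeaderRoles {
  const headings = nodes
    .filter((n) => n.type === "heading")
    .map((n) => ({ level: n.attrs?.level ?? 0, text: textOf(n) }));

  const roles: HeaderRoles = {
    pretitle: null,
    title: null,
    subtitle: null,
    subtitle2: null,
  };

  // Assumption: a first heading that is deeper than the second one is a pre-title.
  let start = 0;
  const first = headings[0];
  const second = headings[1];
  if (first && second && first.level > second.level) {
    roles.pretitle = first.text;
    start = 1;
  }

  // The remaining headings fill title, subtitle, subtitle2 in document order.
  roles.title = headings[start]?.text ?? null;
  roles.subtitle = headings[start + 1]?.text ?? null;
  roles.subtitle2 = headings[start + 2]?.text ?? null;
  return roles;
}
```

For the input above, this returns `WELCOME` as the pre-title, `Main Title` as the title, and the two remaining headings as `subtitle` and `subtitle2`.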