npm - @uniweb/semantic-parser - Versions diffs - 1.0.7 → 1.0.9 - Mend

@uniweb/semantic-parser 1.0.7 → 1.0.9

Files changed (10)

package/AGENTS.md +136 -0
package/README.md +52 -104
package/docs/api.md +38 -40
package/docs/mapping-patterns.md +47 -47
package/docs/text-component-reference.md +3 -3
package/package.json +1 -1
package/src/index.js +5 -7
package/src/mappers/extractors.js +113 -120
package/src/processors/groups.js +96 -25
package/src/processors/byType.js +0 -130

package/AGENTS.md ADDED

@@ -0,0 +1,136 @@
+# AGENTS.md
+This file provides guidance for AI assistants working with code in this repository.
+## Project Overview
+This is a semantic parser for ProseMirror/TipTap content structures. It transforms rich text editor content into structured, semantic groups that web components can consume. The parser bridges the gap between natural content writing and component-based web development.
+## Development Commands
+```bash
+# Run all tests
+npm test
+# Run tests with JSON report output
+npm run test-report
+# Run a specific test file
+npx jest tests/parser.test.js
+# Run tests in watch mode
+npx jest --watch
+# Run a specific test by name
+npx jest -t "handles simple document structure"
+```
+## Architecture
+### Three-Stage Processing Pipeline
+The parser processes content through three distinct stages, each building on the previous:
+1. **Sequence Processing** (`src/processors/sequence.js`): Flattens the ProseMirror document tree into a linear sequence of semantic elements (headings, paragraphs, images, lists, etc.)
+2. **Groups Processing** (`src/processors/groups.js`): Transforms the sequence into semantic groups with identified main content and items. Supports two grouping modes:
+   - Heading-based grouping (default)
+   - Divider-based grouping (when horizontal rules are present)
+3. **ByType Processing** (`src/processors/byType.js`): Organizes elements by type with positional context, enabling type-specific queries
+The main entry point (`src/index.js`) returns all three views:
+```js
+import { parseContent } from './src/index.js';
+const result = parseContent(doc);
+// {
+//   raw: doc,        // Original ProseMirror document
+//   sequence: [...], // Flat sequence of elements
+//   groups: {...},   // Semantic groups with main/items
+//   byType: {...}    // Elements organized by type
+// }
+```
+### Content Group Structure
+Groups follow a specific structure defined in `processGroupContent()`:
+```js
+{
+  header: {
+    pretitle: '',  // H3 before main title
+    title: '',     // Main heading (H1 or H2)
+    subtitle: ''   // Heading after main title
+  },
+  body: {
+    imgs: [],
+    icons: [],
+    videos: [],
+    paragraphs: [],
+    links: [],
+    lists: [],
+    buttons: [],
+    properties: [],
+    propertyBlocks: [],
+    cards: [],
+    headings: []
+  },
+  banner: null,    // Image with banner role or image before heading
+  metadata: {
+    level: null,   // Heading level that started this group
+    contentTypes: Set()
+  }
+}
+```
+### Main Content Identification
+The `identifyMainContent()` function (src/processors/groups.js:282) determines if the first group should be treated as main content:
+- Single group is always main content
+- First group must have lower heading level than second group
+- Divider mode affects main content identification
+### Special Element Detection
+The sequence processor identifies several special element types by inspecting paragraph content:
+- **Links**: Paragraphs containing only a single link mark
+- **Images**: Paragraphs with single image (role: 'image' or 'banner')
+- **Icons**: Paragraphs with single image (role: 'icon')
+- **Buttons**: Paragraphs with single text node having button mark
+- **Videos**: Paragraphs with single image (role: 'video')
+These are extracted into dedicated element types for easier downstream processing.
+### List Processing
+Lists maintain hierarchy through nested structure. The `processListItems()` function in sequence.js handles nested lists, while `processListContent()` in groups.js applies full group content processing to each list item, allowing lists to contain rich content (images, paragraphs, nested lists, etc.).
+## Content Writing Conventions
+The parser implements the semantic conventions documented in `docs/guide.md`. Key patterns:
+- **Pretitle Pattern**: Any heading followed by a more important heading (e.g., H3→H1, H2→H1, H6→H5, etc.)
+- **Banner Pattern**: Image (with banner role or followed by heading) at start of first group
+- **Divider Mode**: Presence of any `horizontalRule` switches entire document to divider-based grouping
+- **Heading Groups**: Consecutive headings with increasing levels are consumed together
+- **Main Content**: First group is main if it's the only group OR has lower heading level than second group
+- **Body Headings**: Headings that overflow the header slots (title, subtitle, subtitle2) are automatically collected in `body.headings`
+## Testing Structure
+Tests are organized by processor:
+- `tests/parser.test.js` - Integration tests
+- `tests/processors/sequence.test.js` - Sequence processing
+- `tests/processors/groups.test.js` - Groups processing
+- `tests/processors/byType.test.js` - ByType processing
+- `tests/utils/role.test.js` - Role utilities
+- `tests/fixtures/` - Shared test documents
+## Important Implementation Notes
+- The parser never modifies the original ProseMirror document
+- Text content can include inline HTML for formatting (bold → `<strong>`, italic → `<em>`, links → `<a>`)
+- The `processors_old/` directory contains legacy implementations - do not modify
+- Context information in byType includes position, previous/next elements, and nearest heading
+- Group splitting logic differs significantly between heading mode and divider mode
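
The main-content rule stated in the added AGENTS.md (a single group is always main content; otherwise the first group needs a lower heading level than the second) can be sketched roughly as below. This is an illustration only, not the package's `identifyMainContent()`. The function name `isFirstGroupMain` and the `{ level }` group shape are assumptions, and divider mode is left out.

```javascript
// Hypothetical sketch of the main-content rule; not the package's implementation.
// Each group is assumed to expose the numeric heading level that started it.
function isFirstGroupMain(groups) {
  if (groups.length === 0) return false;
  if (groups.length === 1) return true; // a single group is always main content
  const [first, second] = groups;
  if (typeof first.level !== "number" || typeof second.level !== "number") {
    return false; // this sketch gives up when a level is missing
  }
  // A lower level number is a more important heading (H1 outranks H2).
  return first.level < second.level;
}

console.log(isFirstGroupMain([{ level: 1 }]));               // true
console.log(isFirstGroupMain([{ level: 1 }, { level: 2 }])); // true
console.log(isFirstGroupMain([{ level: 2 }, { level: 2 }])); // false
```

How the real function treats divider mode is not shown in this diff, so the sketch makes no claim about it.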

package/README.md CHANGED

@@ -4,11 +4,10 @@ A semantic parser for ProseMirror/TipTap content structures that helps bridge th
 ## What it Does
-The parser transforms rich text editor content (ProseMirror/TipTap) into structured, semantic groups that web components can easily consume. It provides three complementary views of your content:
+The parser transforms rich text editor content (ProseMirror/TipTap) into structured, semantic groups that web components can easily consume. It provides two complementary views of your content:
-1. **Sequence**: A flat, ordered list of all content elements
-2. **Groups**: Content organized into semantic sections with identified main content
-3. **ByType**: Elements categorized by type for easy filtering and queries
+1. **Sequence**: An ordered list of all content elements (for rendering in document order)
+2. **Groups**: Content organized into semantic sections (main content + items)
 ## Installation
@@ -41,16 +40,16 @@ const doc = {
 const result = parseContent(doc);
 // Access different views
-console.log(result.sequence);  // Flat array of elements
-console.log(result.groups);    // Semantic groups with main/items
-console.log(result.byType);    // Elements organized by type
+console.log(result.sequence);  // Ordered array of elements
+console.log(result.title);     // Main content fields at top level
+console.log(result.items);     // Additional content groups
 ```
 ## Output Structure
 ### Sequence View
-A flat array of semantic elements preserving document order:
+An ordered array of semantic elements preserving document order:
 ```js
 result.sequence = [
@@ -59,72 +58,37 @@ result.sequence = [
 ]
 ```
-### Groups View
+### Content Structure
-Content organized into semantic groups:
+Main content fields are at the top level. The `items` array contains additional content groups (e.g., H3 sections), each with the same field structure:
 ```js
-result.groups = {
-  main: {
-    header: {
-      pretitle: "",           // H3 before main title
-      title: "Welcome",       // Main heading
-      subtitle: ""            // Heading after main title
-    },
-    body: {
-      paragraphs: ["Get started today."],
-      imgs: [],
-      videos: [],
-      links: [],
-      lists: [],
-      // ... more content types
-    },
-    banner: null,             // Optional banner image
-    metadata: { level: 1 }
-  },
-  items: [],                  // Additional content groups
-  metadata: {
-    dividerMode: false,       // Using dividers vs headings
-    groups: 0
-  }
-}
-```
-### ByType View
+result = {
+  // Main content fields
+  pretitle: "",             // H3 before main title
+  title: "Welcome",         // Main heading (H1)
+  subtitle: "",             // H2 after main title
+  paragraphs: ["Get started today."],
+  imgs: [],
+  videos: [],
+  links: [],
+  lists: [],
+  icons: [],
+  buttons: [],
+  banner: null,             // Optional banner image
+  // ... more content types
+  // Additional content groups (H3 sections)
+  items: [
+    { title: "Feature 1", paragraphs: [...], links: [...] },
+    { title: "Feature 2", paragraphs: [...], links: [...] }
+  ],
-Elements organized by type with context:
+  // Ordered sequence for document-order rendering
+  sequence: [...],
-```js
-result.byType = {
-  headings: [
-    {
-      type: "heading",
-      level: 1,
-      content: "Welcome",
-      context: {
-        position: 0,
-        previousElement: null,
-        nextElement: { type: "paragraph", ... },
-        nearestHeading: null
-      }
-    }
-  ],
-  paragraphs: [ /* ... */ ],
-  images: {
-    background: [],
-    content: [],
-    gallery: [],
-    icon: []
-  },
-  lists: [],
-  metadata: {
-    totalElements: 2,
-    dominantType: "paragraph",
-    hasMedia: false
-  },
-  // Helper methods
-  getHeadingsByLevel(level),
-  getElementsByHeadingContext(filter)
+  // Original document
+  raw: { type: "doc", content: [...] }
 }
 ```
@@ -133,45 +97,29 @@ result.byType = {
 ### Extracting Main Content
 ```js
-const { groups } = parseContent(doc);
+const content = parseContent(doc);
-const title = groups.main.header.title;
-const description = groups.main.body.paragraphs.join(" ");
-const image = groups.main.banner?.url;
+const title = content.title;
+const description = content.paragraphs.join(" ");
+const image = content.banner?.url;
 ```
 ### Processing Content Sections
 ```js
-const { groups } = parseContent(doc);
+const content = parseContent(doc);
 // Main content
-console.log("Main:", groups.main.header.title);
+console.log("Title:", content.title);
+console.log("Description:", content.paragraphs);
-// Additional sections
-groups.items.forEach(item => {
-  console.log("Section:", item.header.title);
-  console.log("Content:", item.body.paragraphs);
+// Additional sections (H3 groups)
+content.items.forEach(item => {
+  console.log("Section:", item.title);
+  console.log("Content:", item.paragraphs);
 });
 ```
-### Finding Specific Elements
-```js
-const { byType } = parseContent(doc);
-// Get all H2 headings
-const subheadings = byType.getHeadingsByLevel(2);
-// Get all background images
-const backgrounds = byType.images.background;
-// Get content under specific headings
-const features = byType.getElementsByHeadingContext(
-  h => h.content.includes("Features")
-);
-```
 ### Sequential Processing
 ```js
@@ -203,17 +151,17 @@ Automatically transform content based on field types with context-aware behavior
 ```js
 const schema = {
   title: {
-    path: "groups.main.header.title",
+    path: "title",
     type: "plaintext",  // Auto-strips <strong>, <em>, etc.
     maxLength: 60       // Auto-truncates intelligently
   },
   excerpt: {
-    path: "groups.main.body.paragraphs",
+    path: "paragraphs",
     type: "excerpt",    // Auto-creates excerpt from paragraphs
     maxLength: 150
   },
   image: {
-    path: "groups.main.body.imgs[0].url",
+    path: "imgs[0].url",
     type: "image",
     defaultValue: "/placeholder.jpg"
   }
@@ -259,15 +207,15 @@ Define custom mappings using schemas:
 ```js
 const schema = {
-  brand: "groups.main.header.pretitle",
-  title: "groups.main.header.title",
-  subtitle: "groups.main.header.subtitle",
+  brand: "pretitle",
+  title: "title",
+  subtitle: "subtitle",
   image: {
-    path: "groups.main.body.imgs[0].url",
+    path: "imgs[0].url",
     defaultValue: "/placeholder.jpg"
   },
   actions: {
-    path: "groups.main.body.links",
+    path: "links",
     transform: links => links.map(l => ({ label: l.label, type: "primary" }))
   }
 };
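
The README schema hunks above replace nested paths such as `groups.main.header.title` with flat ones such as `title`. As a rough illustration of that rename, the helper below strips the old prefixes from a path string. The name `toFlatPath` is hypothetical and not part of the package, and the prefix list covers only the nested locations that appear in this diff.

```javascript
// Hypothetical migration helper; not part of @uniweb/semantic-parser.
// Prefixes are checked in order, so the more specific ones come first.
const OLD_PREFIXES = [
  "groups.main.header.", // pretitle, title, subtitle
  "groups.main.body.",   // paragraphs, imgs, links, ...
  "groups.main.",        // banner sat directly under main
];

function toFlatPath(path) {
  for (const prefix of OLD_PREFIXES) {
    if (path.startsWith(prefix)) return path.slice(prefix.length);
  }
  return path; // already flat, or a path this sketch does not know about
}

console.log(toFlatPath("groups.main.header.title"));     // "title"
console.log(toFlatPath("groups.main.body.imgs[0].url")); // "imgs[0].url"
console.log(toFlatPath("links"));                        // "links"
```

A string rewrite like this only helps with schema paths; code that destructures `groups` directly still has to be updated by hand.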

package/docs/api.md CHANGED

@@ -118,51 +118,49 @@ A flat array of semantic elements extracted from the document tree.
 ### `groups`
-Content organized into semantic groups with identified main content and items.
+Content organized into semantic groups with identified main content and items. The structure is flat - header and body fields are merged at the top level.
 ```js
 {
   main: {
-    header: {
-      pretitle: "PRETITLE TEXT",  // H3 before main title
-      title: "Main Title",         // First heading in group
-      subtitle: "Subtitle"         // Second heading in group
-    },
-    body: {
-      paragraphs: ["paragraph text", ...],
-      imgs: [
-        { url: "...", caption: "...", alt: "..." }
-      ],
-      icons: ["<svg>...</svg>", ...],
-      videos: [
-        { src: "...", caption: "...", alt: "..." }
-      ],
-      links: [
-        { href: "...", label: "..." }
-      ],
-      lists: [
-        [/* processed list items */]
-      ],
-      buttons: [
-        { content: "...", attrs: {...} }
-      ],
-      properties: [],       // Code block content
-      propertyBlocks: [],   // Array of code blocks
-      cards: [],            // Not yet implemented
-      headings: []          // Used in list items
-    },
+    // Header fields (flat)
+    pretitle: "PRETITLE TEXT",    // H3 before main title
+    title: "Main Title",           // First heading in group
+    subtitle: "Subtitle",          // Second heading in group
+    // Body fields (flat)
+    paragraphs: ["paragraph text", ...],
+    imgs: [
+      { url: "...", caption: "...", alt: "..." }
+    ],
+    icons: ["<svg>...</svg>", ...],
+    videos: [
+      { src: "...", caption: "...", alt: "..." }
+    ],
+    links: [
+      { href: "...", label: "..." }
+    ],
+    lists: [
+      [/* processed list items */]
+    ],
+    buttons: [
+      { content: "...", attrs: {...} }
+    ],
+    properties: [],       // Code block content
+    propertyBlocks: [],   // Array of code blocks
+    cards: [],            // Not yet implemented
+    headings: [],         // Used in list items
+    // Banner (flat)
     banner: {
       url: "path/to/banner.jpg",
       caption: "Banner caption",
       alt: "Banner alt text"
-    } | null,
-    metadata: {
-      level: 1,             // Heading level that started this group
-      contentTypes: {}      // Set of content types in group
-    }
+    } | null
   },
   items: [
-    // Array of groups with same structure as main
+    // Array of groups with same flat structure as main
+    // { title, pretitle, subtitle, paragraphs, imgs, ... }
   ],
   metadata: {
     dividerMode: false,     // Whether dividers were used for grouping
@@ -268,14 +266,14 @@ const result = parseContent(doc);
 ```js
 const { groups } = parseContent(doc);
-// Access main content
-console.log(groups.main.header.title);
-console.log(groups.main.body.paragraphs);
+// Access main content (flat structure)
+console.log(groups.main.title);
+console.log(groups.main.paragraphs);
 // Iterate through content items
 groups.items.forEach(item => {
-  console.log(item.header.title);
-  console.log(item.body.paragraphs);
+  console.log(item.title);
+  console.log(item.paragraphs);
 });
 ```
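
The api.md change above describes header and body fields being merged at the top level of each group, with the per-group `metadata` block removed. A minimal sketch of that reshaping, assuming an old-style nested group as input, might look like this. The function name `flattenGroup` is hypothetical, and this is not the package's own code.

```javascript
// Hypothetical sketch: reshape an old nested group into the flat form.
// Per-group metadata is dropped, mirroring its removal in the api.md hunk.
function flattenGroup(group) {
  const { header = {}, body = {}, banner = null } = group;
  return { ...header, ...body, banner };
}

const nestedGroup = {
  header: { pretitle: "", title: "Main Title", subtitle: "Subtitle" },
  body: { paragraphs: ["paragraph text"], links: [] },
  banner: null,
  metadata: { level: 1 },
};

const flatGroup = flattenGroup(nestedGroup);
console.log(flatGroup.title);      // "Main Title"
console.log(flatGroup.paragraphs); // [ 'paragraph text' ]
```

If a header key and a body key ever shared a name, the body value would win in this sketch because it is spread last.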