npm - structurecc - Versions diffs - 1.0.5 → 2.0.0 - Mend

structurecc 1.0.5 → 2.0.0

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between versions as they appear in their public registries.

Files changed (13)

package/README.md +154 -67
package/agents/structurecc-classifier.md +135 -0
package/agents/structurecc-extract-chart.md +302 -0
package/agents/structurecc-extract-diagram.md +343 -0
package/agents/structurecc-extract-generic.md +248 -0
package/agents/structurecc-extract-heatmap.md +322 -0
package/agents/structurecc-extract-multipanel.md +310 -0
package/agents/structurecc-extract-table.md +231 -0
package/agents/structurecc-verifier.md +265 -0
package/bin/install.js +82 -18
package/commands/structure/structure.md +434 -112
package/package.json +9 -5
package/agents/structurecc-extractor.md +0 -70

package/commands/structure/structure.md CHANGED

@@ -1,31 +1,55 @@
 ---
 name: structure
-description: Extract structured data from PDFs and Word docs using AI agent swarms
+description: Extract structured data from PDFs and Word docs using AI agent swarms with verbatim accuracy
 arguments:
   - name: path
     description: Path to document (PDF, DOCX, or image)
     required: true
 ---
-# /structure - Agentic Document Extraction
+# /structure - Agentic Document Extraction v2.0
-Turn complex documents into structured markdown using parallel AI subagents.
+Turn complex documents into structured JSON + markdown using a 3-phase pipeline with verification.
-## Overview
+## Pipeline Overview
-1. Extract all images from the document
-2. Spawn ONE subagent PER IMAGE (all in parallel)
-3. Each agent analyzes its image and writes structured markdown
-4. Combine into final STRUCTURED.md
+```
+Image → [Phase 1: Classify] → [Phase 2: Extract] → [Phase 3: Verify] → Output
+                                     ↑__________REVISION LOOP__________↓
+```
+1. **Classify** - Identify visual element type (table, chart, heatmap, diagram, etc.)
+2. **Extract** - Use specialized extractor for that type with verbatim accuracy
+3. **Verify** - Score extraction quality, trigger revision if < 0.9
-## Step 1: Setup
+## Step 1: Setup Output Directory
 Create output directory next to the document:
 ```
 <document_name>_extracted/
-├── images/           # Extracted visuals
-├── elements/         # Per-element markdown
-└── STRUCTURED.md     # Final output
+├── images/                    # Raw extracted images
+├── classifications/           # Phase 1: type detection results
+├── extractions/              # Phase 2: JSON extractions
+├── verifications/            # Phase 3: quality scores
+├── elements/                 # Final markdown per element
+├── STRUCTURED.md             # Combined markdown output
+└── extraction_report.json    # Quality metrics summary
+```
+```python
+import os
+from pathlib import Path
+from datetime import datetime
+doc_path = "<document_path>"
+doc_name = Path(doc_path).stem
+output_dir = Path(doc_path).parent / f"{doc_name}_extracted"
+# Create all subdirectories
+for subdir in ["images", "classifications", "extractions", "verifications", "elements"]:
+    (output_dir / subdir).mkdir(parents=True, exist_ok=True)
+print(f"Output directory: {output_dir}")
 ```
 ## Step 2: Extract Images
@@ -34,12 +58,11 @@ Create output directory next to the document:
 ```python
 import fitz
-import os
+import json
 pdf_path = "<document_path>"
-output_dir = "<output_dir>"
-images_dir = os.path.join(output_dir, "images")
-os.makedirs(images_dir, exist_ok=True)
+output_dir = Path("<output_dir>")
+images_dir = output_dir / "images"
 doc = fitz.open(pdf_path)
 extracted = []
@@ -53,11 +76,22 @@ for page_num in range(len(doc)):
             pix = fitz.Pixmap(fitz.csRGB, pix)
         img_name = f"p{page_num + 1}_img{img_idx + 1}.png"
-        pix.save(os.path.join(images_dir, img_name))
-        extracted.append({"path": os.path.join(images_dir, img_name), "page": page_num + 1, "name": img_name})
+        img_path = images_dir / img_name
+        pix.save(str(img_path))
+        extracted.append({
+            "id": f"element_{len(extracted) + 1:03d}",
+            "path": str(img_path),
+            "page": page_num + 1,
+            "name": img_name
+        })
         pix = None
 doc.close()
+# Save image manifest
+with open(output_dir / "image_manifest.json", "w") as f:
+    json.dump(extracted, f, indent=2)
 print(f"Extracted {len(extracted)} images")
 ```
@@ -65,164 +99,434 @@ print(f"Extracted {len(extracted)} images")
 ```python
 from zipfile import ZipFile
-import os
 docx_path = "<document_path>"
-output_dir = "<output_dir>"
-images_dir = os.path.join(output_dir, "images")
-os.makedirs(images_dir, exist_ok=True)
+output_dir = Path("<output_dir>")
+images_dir = output_dir / "images"
 extracted = []
 with ZipFile(docx_path, 'r') as z:
     for f in z.namelist():
         if f.startswith('word/media/'):
             name = os.path.basename(f)
-            path = os.path.join(images_dir, name)
+            path = images_dir / name
             with z.open(f) as src, open(path, 'wb') as dst:
                 dst.write(src.read())
-            extracted.append({"path": path, "name": name})
+            extracted.append({
+                "id": f"element_{len(extracted) + 1:03d}",
+                "path": str(path),
+                "name": name
+            })
+# Save image manifest
+with open(output_dir / "image_manifest.json", "w") as f:
+    json.dump(extracted, f, indent=2)
 print(f"Extracted {len(extracted)} images")
 ```
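Note on the DOCX snippet above: the diff removes `import os` while `os.path.basename(f)` remains, and the new block imports neither `json` nor `Path`, so it appears to run only when an earlier block in the same session has already supplied those names. A self-contained sketch of the same step, using only the standard library, might look like the following. The function name `extract_docx_images` is hypothetical and not part of the package; the manifest fields and file name are taken from the diff.

```python
import json
from pathlib import Path, PurePosixPath
from zipfile import ZipFile


def extract_docx_images(docx_path, output_dir):
    """Copy every file under word/media/ into <output_dir>/images and write a manifest."""
    output_dir = Path(output_dir)
    images_dir = output_dir / "images"
    images_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    with ZipFile(docx_path, "r") as z:
        for member in z.namelist():
            # Skip anything outside word/media/ and any directory entry
            if not member.startswith("word/media/") or member.endswith("/"):
                continue
            # ZIP member names always use forward slashes
            name = PurePosixPath(member).name
            target = images_dir / name
            target.write_bytes(z.read(member))
            extracted.append({
                "id": f"element_{len(extracted) + 1:03d}",
                "path": str(target),
                "name": name,
            })

    # Same manifest file name as in the diff
    with open(output_dir / "image_manifest.json", "w") as f:
        json.dump(extracted, f, indent=2)
    return extracted
```

Creating `images/` inside the function keeps the step independent of the Step 1 setup block; whether the package intends that coupling is not stated in the diff.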
-**For standalone images** - Just process directly.
-Also extract main text:
-- PDF: `page.get_text()` for each page
-- DOCX: `textutil -convert txt "<path>" -stdout`
-## Step 3: Spawn Agent Swarm
+## Step 3: Phase 1 - Classification (Parallel)
-**CRITICAL:** Launch ALL agents in ONE message with MULTIPLE Task tool calls.
+**CRITICAL:** Launch ALL classification agents in ONE message.
-For EACH extracted image:
+For EACH extracted image, spawn a classification agent:
 ```
 Task(
   subagent_type: "general-purpose",
-  description: "Extract element [N]",
+  model: "haiku",  # Fast classification
+  description: "Classify element [N]",
   prompt: """
-You are extracting structured data from a document image.
+You are a visual element classifier. Read the agent instructions from:
+~/.claude/agents/structurecc-classifier.md
 **Image:** <full_path_to_image>
-**Source:** Page <N> of <document_name>
-**Output:** Write to <output_dir>/elements/element_<N>.md
+**Element ID:** <element_id>
+**Output:** Write JSON to <output_dir>/classifications/<element_id>_class.json
+Analyze the image and output the classification JSON as specified in the agent file.
+"""
+)
+```
+Launch 10 images = 10 Task calls in ONE message. They run in parallel.
-## Instructions
+## Step 4: Phase 2 - Specialized Extraction (Parallel)
-1. Read the image carefully
-2. Identify what it contains (table, figure, chart, heatmap, diagram, etc.)
-3. Extract ALL visible data - be exhaustive
-4. Structure as clean markdown
+After classifications complete, read each classification file and dispatch to the correct extractor.
-## Output Format
+**Extractor Routing:**
-Write this to the output file:
+| Classification | Extractor Agent |
+|---------------|-----------------|
+| `table_simple`, `table_complex` | `structurecc-extract-table.md` |
+| `chart_*` (all chart types) | `structurecc-extract-chart.md` |
+| `heatmap` | `structurecc-extract-heatmap.md` |
+| `diagram_*` (all diagram types) | `structurecc-extract-diagram.md` |
+| `multi_panel` | `structurecc-extract-multipanel.md` |
+| Everything else | `structurecc-extract-generic.md` |
-```markdown
-# [Descriptive Title]
+For EACH element, spawn the appropriate extractor:
-**Type:** [table/figure/chart/heatmap/diagram/other]
-**Source:** Page [N], [document name]
+```
+Task(
+  subagent_type: "general-purpose",
+  model: "opus",  # Best quality for extraction
+  description: "Extract element [N]",
+  prompt: """
+You are extracting structured data from a visual element.
+Read the agent instructions from:
+~/.claude/agents/<appropriate_extractor>.md
-## Content
+**Image:** <full_path_to_image>
+**Element ID:** <element_id>
+**Classification:** <classification_type>
+**Source:** Page <N> of <document_name>
+**Output:** Write JSON to <output_dir>/extractions/<element_id>.json
-[For tables: markdown table with all data]
-[For figures: detailed description + all visible text/labels]
-[For charts: data points, axes, trends]
-[For heatmaps: labels, color scale, patterns]
-[For diagrams: components, relationships, flow]
+Follow the extractor instructions EXACTLY. Output ONLY valid JSON.
+Remember: VERBATIM extraction only. Copy text exactly as shown.
+"""
+)
+```
-## Labels & Text
+Launch ALL extractions in ONE message for parallel processing.
-[Every piece of text visible, verbatim]
+## Step 5: Phase 3 - Verification (Parallel)
-## Notes
+After extractions complete, verify each extraction:
-[Confidence level, unclear items marked with [?]]
 ```
+Task(
+  subagent_type: "general-purpose",
+  model: "sonnet",  # Good balance for verification
+  description: "Verify element [N]",
+  prompt: """
+You are verifying an extraction against its source image.
+Read the agent instructions from:
+~/.claude/agents/structurecc-verifier.md
-Be thorough. Extract every data point.
+**Source Image:** <full_path_to_image>
+**Extraction JSON:** <output_dir>/extractions/<element_id>.json
+**Element ID:** <element_id>
+**Output:** Write JSON to <output_dir>/verifications/<element_id>_verify.json
+Compare the extraction to the source image and produce a verification report.
 """
 )
 ```
-Launch 10 images = 10 Task calls in ONE message. They run in parallel.
+Launch ALL verifications in ONE message.
-## Step 4: Extract Main Text
+## Step 6: Revision Loop
-Save document text to `elements/main_text.md`:
+After verifications complete, check for failures:
-```markdown
-# Main Document Text
-**Source:** [document name]
+```python
+import json
+from pathlib import Path
+output_dir = Path("<output_dir>")
+verifications_dir = output_dir / "verifications"
+needs_revision = []
+passed = []
+needs_human_review = []
+for verify_file in verifications_dir.glob("*_verify.json"):
+    with open(verify_file) as f:
+        result = json.load(f)
+    element_id = result["element_id"]
+    if result.get("needs_human_review"):
+        needs_human_review.append(element_id)
+    elif not result["pass"]:
+        revision_num = result.get("revision_feedback", {}).get("revision_number", 0)
+        if revision_num < 2:
+            needs_revision.append({
+                "element_id": element_id,
+                "feedback": result["revision_feedback"],
+                "score": result["scores"]["overall"]
+            })
+        else:
+            needs_human_review.append(element_id)
+    else:
+        passed.append(element_id)
+print(f"Passed: {len(passed)}")
+print(f"Needs revision: {len(needs_revision)}")
+print(f"Needs human review: {len(needs_human_review)}")
+```
----
+For elements needing revision, re-run extraction with specific feedback:
-[Full text extracted from document, preserving structure]
 ```
+Task(
+  subagent_type: "general-purpose",
+  model: "opus",
+  description: "Re-extract element [N] (revision)",
+  prompt: """
+REVISION EXTRACTION - Previous extraction failed verification.
-## Step 5: Combine Results
+Read the agent instructions from:
+~/.claude/agents/<appropriate_extractor>.md
-After all agents complete, read all `elements/*.md` files and create:
+**Image:** <full_path_to_image>
+**Element ID:** <element_id>
+**Previous Score:** <score>
+**Output:** Write JSON to <output_dir>/extractions/<element_id>.json
-**STRUCTURED.md:**
+SPECIFIC FIXES REQUIRED:
+<list_specific_fixes_from_revision_feedback>
-```markdown
-# [Document Name] - Structured Extraction
+Focus on fixing these specific issues while preserving correct sections.
+"""
+)
+```
-**Original:** [filename]
-**Extracted:** [date/time]
-**Elements:** [N] visual elements processed
+After re-extraction, re-verify. Max 2 revision attempts per element.
----
+## Step 7: Generate Markdown Elements
-## Main Text
+After all verifications pass (or reach human review), convert JSON extractions to markdown:
-[Content from main_text.md]
+```python
+import json
+from pathlib import Path
----
+output_dir = Path("<output_dir>")
+extractions_dir = output_dir / "extractions"
+elements_dir = output_dir / "elements"
-## Visual Elements
+for extract_file in extractions_dir.glob("*.json"):
+    element_id = extract_file.stem
-### Element 1
-[Content from element_1.md]
+    with open(extract_file) as f:
+        extraction = json.load(f)
-### Element 2
-[Content from element_2.md]
+    # Convert to markdown based on type
+    md_content = json_to_markdown(extraction)
-[... continue for all elements ...]
+    with open(elements_dir / f"{element_id}.md", "w") as f:
+        f.write(md_content)
+```
----
+**Markdown conversion function:**
-## Extraction Summary
+```python
+def json_to_markdown(extraction: dict) -> str:
+    """Convert JSON extraction to clean markdown."""
+    ext_type = extraction.get("extraction_type")
+    if ext_type == "table":
+        return table_to_markdown(extraction)
+    elif ext_type == "chart":
+        return chart_to_markdown(extraction)
+    elif ext_type == "heatmap":
+        return heatmap_to_markdown(extraction)
+    elif ext_type == "diagram":
+        return diagram_to_markdown(extraction)
+    elif ext_type == "multi_panel":
+        return multipanel_to_markdown(extraction)
+    else:
+        return generic_to_markdown(extraction)
+def table_to_markdown(ext: dict) -> str:
+    md = []
+    meta = ext.get("table_metadata", {})
+    md.append(f"# {meta.get('title', 'Table')}")
+    md.append(f"\n**Type:** Table")
+    md.append(f"**Source:** Page {meta.get('source_page', '?')}")
+    if meta.get("caption"):
+        md.append(f"\n> {meta['caption']}")
+    md.append("\n## Data\n")
+    md.append(ext.get("markdown_table", ""))
+    if meta.get("footnotes"):
+        md.append("\n## Footnotes\n")
+        for fn in meta["footnotes"]:
+            md.append(f"- {fn}")
+    return "\n".join(md)
+def chart_to_markdown(ext: dict) -> str:
+    md = []
+    meta = ext.get("chart_metadata", {})
+    md.append(f"# {meta.get('title', 'Chart')}")
+    md.append(f"\n**Type:** {ext.get('chart_type', 'Chart')}")
+    md.append(f"**Source:** Page {meta.get('source_page', '?')}")
+    axes = ext.get("axes", {})
+    md.append("\n## Axes\n")
+    if axes.get("x"):
+        md.append(f"- **X-axis:** {axes['x'].get('label', 'unlabeled')}")
+        md.append(f"  - Range: {axes['x'].get('min')} to {axes['x'].get('max')}")
+    if axes.get("y"):
+        md.append(f"- **Y-axis:** {axes['y'].get('label', 'unlabeled')}")
+        md.append(f"  - Range: {axes['y'].get('min')} to {axes['y'].get('max')}")
+    legend = ext.get("legend", {})
+    if legend.get("entries"):
+        md.append("\n## Legend\n")
+        for entry in legend["entries"]:
+            style = entry.get("line_style") or entry.get("style", "")
+            md.append(f"- **{entry['label']}**: {entry.get('color', '')} {style}")
+    stats = ext.get("statistical_annotations", [])
+    if stats:
+        md.append("\n## Statistical Annotations\n")
+        for stat in stats:
+            md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
+    risk = ext.get("risk_table", {})
+    if risk.get("present"):
+        md.append("\n## Risk Table\n")
+        headers = risk.get("headers", [])
+        md.append("| " + " | ".join(headers) + " |")
+        md.append("| " + " | ".join(["---"] * len(headers)) + " |")
+        for row in risk.get("rows", []):
+            values = [row.get("group", "")] + row.get("values", [])
+            md.append("| " + " | ".join(values) + " |")
+    return "\n".join(md)
+```
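`json_to_markdown` above dispatches to `heatmap_to_markdown`, `diagram_to_markdown`, `multipanel_to_markdown`, and `generic_to_markdown`, but the diff shows definitions only for the table and chart converters. As a stand-in, a minimal fallback could simply list the top-level fields of an extraction. This is a sketch under the assumption that nothing is known about the JSON beyond the `extraction_type` key read above; it is not the package's implementation.

```python
import json


def generic_to_markdown(ext: dict) -> str:
    """Fallback renderer: one bullet per top-level field of the extraction."""
    # Derive a readable title from the type, e.g. "multi_panel" -> "Multi Panel"
    title = str(ext.get("extraction_type") or "element").replace("_", " ").title()
    lines = [f"# {title}", ""]
    for key, value in ext.items():
        # Nested structures are serialized as compact JSON so nothing is dropped
        if isinstance(value, (dict, list)):
            value = json.dumps(value, ensure_ascii=False)
        lines.append(f"- **{key}:** {value}")
    return "\n".join(lines)
```

Because it loses no fields, the same function could temporarily back the other missing converters until type-specific layouts exist.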
+## Step 8: Generate Combined STRUCTURED.md
+```python
+from pathlib import Path
+from datetime import datetime
+output_dir = Path("<output_dir>")
+elements_dir = output_dir / "elements"
+doc_name = "<document_name>"
+# Read all element files in order
+element_files = sorted(elements_dir.glob("element_*.md"))
+sections = []
+sections.append(f"# {doc_name} - Structured Extraction")
+sections.append(f"\n**Original:** {doc_name}")
+sections.append(f"**Extracted:** {datetime.now().isoformat()}")
+sections.append(f"**Elements:** {len(element_files)} visual elements processed")
+sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification)")
+sections.append("\n---\n")
+# Add each element
+for i, elem_file in enumerate(element_files, 1):
+    with open(elem_file) as f:
+        content = f.read()
+    sections.append(f"## Element {i}")
+    sections.append(content)
+    sections.append("\n---\n")
+# Write combined file
+with open(output_dir / "STRUCTURED.md", "w") as f:
+    f.write("\n".join(sections))
+```
-| # | Type | Source | Status |
-|---|------|--------|--------|
-| 1 | Table | Page 2 | ✓ |
-| 2 | Figure | Page 3 | ✓ |
-| ... | ... | ... | ... |
+## Step 9: Generate Quality Report
+```python
+import json
+from pathlib import Path
+output_dir = Path("<output_dir>")
+verifications_dir = output_dir / "verifications"
+report = {
+    "document": "<document_name>",
+    "timestamp": datetime.now().isoformat(),
+    "pipeline_version": "2.0.0",
+    "elements_total": 0,
+    "elements_passed": 0,
+    "elements_revised": 0,
+    "elements_human_review": 0,
+    "average_quality_score": 0.0,
+    "element_details": []
+}
+scores = []
+for verify_file in sorted(verifications_dir.glob("*_verify.json")):
+    with open(verify_file) as f:
+        result = json.load(f)
+    report["elements_total"] += 1
+    detail = {
+        "element_id": result["element_id"],
+        "scores": result["scores"],
+        "status": "passed" if result["pass"] else "failed",
+        "issues_count": len(result.get("issues", []))
+    }
+    if result.get("needs_human_review"):
+        detail["status"] = "human_review"
+        report["elements_human_review"] += 1
+    elif result["pass"]:
+        report["elements_passed"] += 1
+    if result.get("revision_feedback", {}).get("revision_number", 0) > 0:
+        report["elements_revised"] += 1
+    scores.append(result["scores"]["overall"])
+    report["element_details"].append(detail)
+report["average_quality_score"] = sum(scores) / len(scores) if scores else 0
+# Write report
+with open(output_dir / "extraction_report.json", "w") as f:
+    json.dump(report, f, indent=2)
+print(f"\nQuality Report:")
+print(f"  Total elements: {report['elements_total']}")
+print(f"  Passed: {report['elements_passed']}")
+print(f"  Revised: {report['elements_revised']}")
+print(f"  Human review: {report['elements_human_review']}")
+print(f"  Average score: {report['average_quality_score']:.2f}")
 ```
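The Step 9 block above calls `datetime.now()` but imports only `json` and `Path`, so run on its own it would need `from datetime import datetime`. The same aggregation can be sketched as a self-contained function. `build_report` is a hypothetical name; the verification fields it reads (`element_id`, `scores.overall`, `pass`, `needs_human_review`, `revision_feedback.revision_number`, `issues`) are the ones the diff's code reads.

```python
import json
from datetime import datetime
from pathlib import Path


def build_report(verifications_dir, document):
    """Summarize every *_verify.json file in verifications_dir."""
    report = {
        "document": document,
        "timestamp": datetime.now().isoformat(),
        "pipeline_version": "2.0.0",
        "elements_total": 0,
        "elements_passed": 0,
        "elements_revised": 0,
        "elements_human_review": 0,
        "average_quality_score": 0.0,
        "element_details": [],
    }
    scores = []
    for verify_file in sorted(Path(verifications_dir).glob("*_verify.json")):
        result = json.loads(verify_file.read_text())
        report["elements_total"] += 1

        # Human review takes precedence over pass/fail, as in the diff
        if result.get("needs_human_review"):
            status = "human_review"
            report["elements_human_review"] += 1
        elif result["pass"]:
            status = "passed"
            report["elements_passed"] += 1
        else:
            status = "failed"

        # Count elements that went through at least one revision
        if (result.get("revision_feedback") or {}).get("revision_number", 0) > 0:
            report["elements_revised"] += 1

        scores.append(result["scores"]["overall"])
        report["element_details"].append({
            "element_id": result["element_id"],
            "scores": result["scores"],
            "status": status,
            "issues_count": len(result.get("issues", [])),
        })

    report["average_quality_score"] = sum(scores) / len(scores) if scores else 0.0
    return report
```

Returning the dict instead of writing it leaves the caller free to dump it to `extraction_report.json` and print the summary, as the original block does.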
-## Step 6: Display Results
+## Step 10: Display Results
 ```
-╔═══════════════════════════════════════════════════════════╗
-║  EXTRACTION COMPLETE                                      ║
-╠═══════════════════════════════════════════════════════════╣
-║                                                           ║
-║  Document:  [name]                                        ║
-║  Output:    [path]_extracted/                             ║
-║                                                           ║
-║  Extracted: [N] visual elements                           ║
-║                                                           ║
-║  Files:                                                   ║
-║    images/         [N] extracted images                   ║
-║    elements/       [N] element markdown files             ║
-║    STRUCTURED.md   Combined output                        ║
-║                                                           ║
-╚═══════════════════════════════════════════════════════════╝
+╔═══════════════════════════════════════════════════════════════════════════════╗
+║  EXTRACTION COMPLETE                                                          ║
+╠═══════════════════════════════════════════════════════════════════════════════╣
+║                                                                               ║
+║  Document:  [name]                                                            ║
+║  Output:    [path]_extracted/                                                 ║
+║  Pipeline:  structurecc v2.0 (3-phase with verification)                      ║
+║                                                                               ║
+║  QUALITY SUMMARY                                                              ║
+║  ──────────────────────────────────────────────────────                       ║
+║  Total elements:     [N]                                                      ║
+║  Passed (≥0.90):     [N] ✓                                                    ║
+║  Revised:            [N] ↻                                                    ║
+║  Human review:       [N] ⚠                                                    ║
+║  Average score:      [0.XX]                                                   ║
+║                                                                               ║
+║  FILES                                                                        ║
+║  ──────────────────────────────────────────────────────                       ║
+║  images/              [N] extracted images                                    ║
+║  classifications/     [N] type classifications                                ║
+║  extractions/         [N] JSON extractions                                    ║
+║  verifications/       [N] quality verifications                               ║
+║  elements/            [N] markdown files                                      ║
+║  STRUCTURED.md        Combined output                                         ║
+║  extraction_report.json  Quality metrics                                      ║
+║                                                                               ║
+╚═══════════════════════════════════════════════════════════════════════════════╝
 ```
 Then open: `open "<output_dir>/STRUCTURED.md"`
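The extractor routing table in Step 4 of this hunk is described in prose only. One way to express it in code is an ordered list of glob patterns with a fallback, as in the sketch below. The names `ROUTES`, `FALLBACK`, and `route_extractor` are hypothetical, and concrete labels such as `chart_bar` are illustrative: the diff names only `table_simple`, `table_complex`, `heatmap`, `multi_panel`, and the `chart_*` and `diagram_*` families.

```python
from fnmatch import fnmatchcase

# (classification pattern, extractor agent file), in routing-table order
ROUTES = [
    ("table_simple", "structurecc-extract-table.md"),
    ("table_complex", "structurecc-extract-table.md"),
    ("chart_*", "structurecc-extract-chart.md"),
    ("heatmap", "structurecc-extract-heatmap.md"),
    ("diagram_*", "structurecc-extract-diagram.md"),
    ("multi_panel", "structurecc-extract-multipanel.md"),
]
FALLBACK = "structurecc-extract-generic.md"


def route_extractor(classification: str) -> str:
    """Return the extractor agent file for a classification label."""
    for pattern, extractor in ROUTES:
        # fnmatchcase is case-sensitive on every platform
        if fnmatchcase(classification, pattern):
            return extractor
    return FALLBACK


print(route_extractor("chart_bar"))  # hypothetical label in the chart_* family
print(route_extractor("photo"))      # unknown label, falls back to generic
```

Keeping the table as data rather than an `if`/`elif` chain means a new element type needs only one added row.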
@@ -237,6 +541,24 @@ pip3 install PyMuPDF --quiet
 ## Tips
 - Use opus model for best extraction quality on complex visuals
-- Each image = one agent = one API call
-- Agents run in parallel for speed
-- Check individual element files if one extraction looks wrong
+- Classification uses haiku for speed (it's just triage)
+- Verification uses sonnet for good balance
+- Each phase runs in parallel for speed
+- Max 2 revision attempts before human review
+- Check `extraction_report.json` for quality metrics
+- Individual JSON extractions preserved for programmatic use
+## Troubleshooting
+**Low quality scores:**
+- Check source image quality
+- Complex tables may need human review
+- Handwritten text is challenging
+**Revision loop stuck:**
+- After 2 revisions, element goes to human review
+- Check verifications/ for specific issues
+**Missing elements:**
+- Some PDFs render text as images - check image count
+- Very small images may be logos/icons (expected)
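The rule behind "Revision loop stuck" (at most two revisions, then human review) is the Step 6 branching shown earlier in the diff. Isolated as a pure function it is easier to check by hand. `triage` and `MAX_REVISIONS` are hypothetical names; the field names are those read by the Step 6 code.

```python
MAX_REVISIONS = 2  # "Max 2 revision attempts per element" in the diff


def triage(result: dict) -> str:
    """Classify one verification result as passed, needs_revision, or human_review."""
    # An explicit human-review flag wins over everything else
    if result.get("needs_human_review"):
        return "human_review"
    if result.get("pass"):
        return "passed"
    # Failed: retry while under the revision cap, otherwise escalate
    revision_num = (result.get("revision_feedback") or {}).get("revision_number", 0)
    return "needs_revision" if revision_num < MAX_REVISIONS else "human_review"


print(triage({"pass": False, "revision_feedback": {"revision_number": 1}}))
print(triage({"pass": False, "revision_feedback": {"revision_number": 2}}))
```

An element therefore cannot loop indefinitely as long as each re-extraction increments `revision_number` in its verification file; whether the verifier agent does so is defined in `structurecc-verifier.md`, which this hunk does not show.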