npm - structurecc - Versions diffs - 1.0.4 → 2.0.0 - Mend

structurecc 1.0.4 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

package/README.md +154 -67
package/agents/structurecc-classifier.md +135 -0
package/agents/structurecc-extract-chart.md +302 -0
package/agents/structurecc-extract-diagram.md +343 -0
package/agents/structurecc-extract-generic.md +248 -0
package/agents/structurecc-extract-heatmap.md +322 -0
package/agents/structurecc-extract-multipanel.md +310 -0
package/agents/structurecc-extract-table.md +231 -0
package/agents/structurecc-verifier.md +265 -0
package/bin/install.js +96 -38
package/commands/structure/structure.md +434 -112
package/package.json +9 -5
package/agents/structurecc-extractor.md +0 -70

package/README.md CHANGED

@@ -1,4 +1,4 @@
-<h1 align="center">STRUCTURE</h1>
+<h1 align="center">STRUCTURE v2.0</h1>
 <p align="center">
 <strong>Landing AI charges $500/month for agentic document structuring.<br>This is free.</strong>
@@ -19,29 +19,84 @@
 ---
+## What's New in v2.0
+**3-Phase Pipeline with Quality Verification**
+```
+Image → [Classify] → [Extract] → [Verify] → Output
+                         ↑_______↻_______↓
+```
+| Phase | Agent | Purpose |
+|-------|-------|---------|
+| 1. Classify | `structurecc-classifier` | Fast triage to route to correct extractor |
+| 2. Extract | 6 specialized extractors | Type-specific verbatim extraction |
+| 3. Verify | `structurecc-verifier` | Quality scoring with auto-revision |
+**Verbatim Extraction** - Text is copied EXACTLY as shown. No paraphrasing, no "cleanup."
+**Quality Scoring** - Each extraction gets a 0.0-1.0 score. Failures auto-retry up to 2x.
+**Specialized Extractors** - Tables, charts, heatmaps, diagrams each get dedicated agents.
+---
 ## The Problem
 You have a 50-page PDF with figures, tables, and charts. You need that data.
 **Manual approach:** Screenshot each figure. Transcribe tables cell by cell. Spend hours on one document.
-**With structurecc:** One command. Walk away. Come back to perfectly structured markdown.
+**With structurecc:** One command. Walk away. Come back to perfectly structured markdown with quality verification.
 ```
 /structure paper.pdf
 ```
-Spawns parallel AI agents. Each agent analyzes one visual element. All run simultaneously. Done in minutes, not hours.
+Spawns parallel AI agents. Each agent analyzes one visual element. All run simultaneously. Quality verified. Done in minutes, not hours.
 ---
-## What is this?
+## Specialized Extractors
-Give it a document. It extracts every image. Spawns one AI agent per image. Each agent exhaustively analyzes its element—tables become markdown tables, figures get descriptions, charts get data points extracted.
+| Extractor | Handles |
+|-----------|---------|
+| `structurecc-extract-table` | Tables with cell-by-cell accuracy, merged cells, footnotes |
+| `structurecc-extract-chart` | Kaplan-Meier, bar, line, scatter, forest plots with axes, legends, data |
+| `structurecc-extract-heatmap` | Expression heatmaps, correlation matrices with full label extraction |
+| `structurecc-extract-diagram` | CONSORT flows, timelines, network diagrams with all node text |
+| `structurecc-extract-multipanel` | Multi-panel figures (A, B, C, D) with per-panel extraction |
+| `structurecc-extract-generic` | Photographs, schematics, equations, other visuals |
-Runs inside **[Claude Code](https://docs.anthropic.com/en/docs/claude-code)** (Anthropic's terminal assistant). One command. ~$0.50-$5 per document.
+---
-Like [Landing AI's Agentic Document Extraction](https://landing.ai/agentic-document-extraction), but running locally via Claude Code.
+## Quality Verification
+Every extraction is verified against the source image:
+```json
+{
+  "scores": {
+    "completeness": 0.95,
+    "accuracy": 0.92,
+    "verbatim_compliance": 0.88,
+    "structure_correctness": 0.97,
+    "overall": 0.93
+  },
+  "pass": true,
+  "threshold": 0.90
+}
+```
+| Score | Meaning |
+|-------|---------|
+| **completeness** | Was every visible element captured? |
+| **accuracy** | Are values (numbers, stats) correct? |
+| **verbatim_compliance** | Was text copied exactly as shown? |
+| **structure_correctness** | Is the JSON structure valid? |
+**Auto-revision:** If score < 0.90, extraction is re-run with specific feedback. Max 2 attempts.
 ---
@@ -88,10 +143,6 @@ Copy this command and paste it into your terminal:
 npm install -g @anthropic-ai/claude-code
 ```
-<p align="center">
-<img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
-</p>
 Wait for it to finish.
 ---
@@ -104,11 +155,7 @@ Copy and run this:
 npx structurecc
 ```
-<p align="center">
-<img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
-</p>
-You will see a STRUCTURE banner. That means it worked. You only do this once.
+You will see a STRUCTURE banner and 8 agents being installed. You only do this once.
 ---
@@ -116,17 +163,9 @@ You will see a STRUCTURE banner. That means it worked. You only do this once.
 Create a folder with your document:
-<p align="center">
-<img src="assets/screenshots/step2.png" alt="Folder structure" width="380">
-</p>
 ```
 documents/
-├── document.pdf          ← your PDF, DOCX, or image
-└── images/               ← extracted images go here (created automatically)
-    ├── figure_1.png
-    ├── table_2.png
-    └── chart_3.png
+└── document.pdf          ← your PDF, DOCX, or image
 ```
 **Put your document in a folder. That's it.**
@@ -142,10 +181,6 @@ cd ~/Desktop/documents
 claude
 ```
-<p align="center">
-<img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
-</p>
 **Windows users:** Replace `~/Desktop/documents` with your actual path, like `C:\Users\YourName\Desktop\documents`
 The first time you run `claude`, it will ask for your API key. Paste it in.
@@ -160,32 +195,53 @@ Now you are inside Claude Code. Type this command:
 /structure document.pdf
 ```
-<p align="center">
-<img src="assets/screenshots/step3.png" alt="Run /structure" width="500">
-</p>
 **Important:** The `/structure` command only works inside Claude Code. If you type it in your regular terminal, it will not work.
 structurecc will:
 1. Extract every image from your document
-2. Spawn one agent per image (all running in parallel)
-3. Each agent exhaustively analyzes its visual element
-4. Combine everything into `STRUCTURED.md`
+2. Classify each image (table, chart, heatmap, diagram, etc.)
+3. Spawn specialized extractors in parallel
+4. Verify each extraction against the source
+5. Auto-revise failed extractions
+6. Combine everything into `STRUCTURED.md`
 ---
 ## What You Get
-A comprehensive markdown file with every visual element extracted:
+A comprehensive output directory with full traceability:
 ```
 document_extracted/
-├── images/           # All extracted visuals
-├── elements/         # One markdown file per element
-│   ├── element_1.md  # Table fully extracted
-│   ├── element_2.md  # Figure analyzed
+├── images/                    # All extracted visuals
+├── classifications/           # Phase 1: type detection
+│   ├── element_001_class.json
+│   └── ...
+├── extractions/              # Phase 2: JSON extractions
+│   ├── element_001.json
+│   └── ...
+├── verifications/            # Phase 3: quality scores
+│   ├── element_001_verify.json
 │   └── ...
-└── STRUCTURED.md     # Everything combined
+├── elements/                 # Markdown per element
+│   ├── element_001.md
+│   └── ...
+├── STRUCTURED.md             # Combined output
+└── extraction_report.json    # Quality metrics summary
+```
+### Quality Report
+```json
+{
+  "document": "clinical_trial.pdf",
+  "pipeline_version": "2.0.0",
+  "elements_total": 15,
+  "elements_passed": 13,
+  "elements_revised": 2,
+  "elements_human_review": 0,
+  "average_quality_score": 0.92
+}
 ```
 ### Example: Table Extraction
@@ -196,40 +252,52 @@ document_extracted/
 **Type:** Table
 **Source:** Page 3, clinical_trial.pdf
-## Content
+## Data
-| Group | N | Age (mean±SD) | Male (%) |
-|-------|---|---------------|----------|
-| Treatment | 245 | 54.3±12.1 | 58.4 |
-| Placebo | 248 | 53.8±11.9 | 56.9 |
-| p-value | - | 0.67 | 0.73 |
+| Characteristic | Treatment (n=245) | Placebo (n=248) | P-value |
+|---|---|---|---|
+| Age, years | 54.3 ± 12.1 | 53.8 ± 11.9 | 0.67 |
+| Male (%) | 58.4 | 56.9 | 0.73 |
+| BMI (kg/m²) | 28.7 ± 4.2 | 28.4 ± 4.1 | 0.42 |
-## Notes
-- Confidence level: High
+## Footnotes
 - * Missing data excluded from analysis
+- † Adjusted for baseline
 ```
-### Example: Chart Analysis
+### Example: Kaplan-Meier Extraction
 ```markdown
 # Kaplan-Meier Survival Curves
-**Type:** Chart (Line/Survival)
+**Type:** kaplan_meier
 **Source:** Page 7, clinical_trial.pdf
-## Content
+## Axes
+- **X-axis:** Time (Days) Since HSV Diagnosis
+  - Range: 0 to 7000
+- **Y-axis:** Cumulative Risk of Dementia
+  - Range: 0 to 0.6
+## Legend
+- **HSV: Dementia Risk**: purple solid
+- **Control: Dementia Risk**: dark blue solid
+- **HSV: Dementia Risk 95% CI**: light purple shaded area
+- **Control: Dementia Risk 95% CI**: light orange shaded area
-Survival curves comparing treatment (blue) vs placebo (red) over 24 months.
+## Statistical Annotations
-Key data points:
-- 12-month survival: Treatment 0.89, Placebo 0.78
-- 24-month survival: Treatment 0.76, Placebo 0.61
-- Log-rank p = 0.003
+- p_value: < 0.001
+- hazard_ratio: 1.52 (95% CI: 1.38-1.68)
-## Labels & Annotations
-- Y-axis: "Survival Probability"
-- X-axis: "Time (months)"
-- Legend: "Treatment (n=245)", "Placebo (n=248)"
+## Risk Table
+| Time (days) | 0 | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 |
+|---|---|---|---|---|---|---|---|---|
+| HSV | 8,362 | 7,891 | 6,543 | 5,102 | 3,876 | 2,654 | 1,432 | 521 |
+| Control | 41,810 | 39,765 | 33,421 | 26,543 | 19,876 | 13,543 | 7,654 | 2,876 |
 ```
 ---
@@ -238,11 +306,14 @@ Key data points:
 | Document | Elements | ~Cost |
 |----------|----------|-------|
-| Simple paper | 5-10 | $0.50-$1 |
-| Full paper | 15-25 | $2-$4 |
-| Dense report | 40+ | $5-$10 |
+| Simple paper | 5-10 | $1-$2 |
+| Full paper | 15-25 | $3-$6 |
+| Dense report | 40+ | $8-$15 |
-Uses Claude's multimodal vision. Works best with **Opus 4.5** for complex tables and charts.
+Uses Claude's multimodal vision with model-appropriate routing:
+- **Haiku** for classification (fast, cheap)
+- **Opus** for extraction (highest quality)
+- **Sonnet** for verification (balanced)
 ---
@@ -268,6 +339,10 @@ You typed `/structure` in your regular terminal. You need to type it inside Clau
 Make sure your PDF contains actual images, not just text. Some PDFs render everything as text.
+**Low quality scores**
+Check `verifications/` for specific issues. Complex tables or poor image quality may need human review.
 **Claude Code asks for an API key**
 Either get an API key at [console.anthropic.com](https://console.anthropic.com/), or subscribe to Claude Pro/Max at [claude.ai](https://claude.ai/).
@@ -282,6 +357,18 @@ npx structurecc --uninstall
 ---
+## Upgrade from v1.x
+Just run the installer again:
+```bash
+npx structurecc
+```
+The installer automatically removes the old `structurecc-extractor` and installs the new 8-agent pipeline.
+---
 ## License
 MIT
@@ -289,5 +376,5 @@ MIT
 ---
 <p align="center">
-<strong>Unstructured in. Structured out.</strong>
+<strong>Verbatim in. Quality verified out.</strong>
 </p>

package/agents/structurecc-classifier.md ADDED

@@ -0,0 +1,135 @@
+---
+name: structurecc-classifier
+description: Phase 1 - Classify visual elements for specialized extraction
+---
+# Visual Element Classifier
+You are a rapid visual classifier. Your ONLY job is to identify what type of visual element an image contains so the correct specialized extractor can be dispatched.
+## Classification Task
+Given an image, output a JSON classification. Nothing else.
+## Classification Types
+| Type | Description |
+|------|-------------|
+| `table_simple` | Standard grid table with clear rows/columns, no merged cells |
+| `table_complex` | Table with merged cells, nested headers, or irregular structure |
+| `chart_kaplan_meier` | Survival curves / time-to-event plots with step functions |
+| `chart_bar` | Bar charts (horizontal or vertical), grouped or stacked |
+| `chart_line` | Line graphs showing trends over continuous x-axis |
+| `chart_scatter` | Scatter plots with individual data points |
+| `chart_box` | Box plots / whisker plots showing distributions |
+| `chart_pie` | Pie charts or donut charts |
+| `chart_area` | Area charts (filled line charts) |
+| `chart_forest` | Forest plots (meta-analysis results) |
+| `chart_volcano` | Volcano plots (differential expression) |
+| `heatmap` | Color-coded matrix (correlation, expression, etc.) |
+| `diagram_flowchart` | Process flows with boxes and arrows |
+| `diagram_timeline` | Temporal sequences, study timelines, CONSORT diagrams |
+| `diagram_network` | Network graphs, pathway diagrams, interaction maps |
+| `diagram_schematic` | Technical schematics, anatomical diagrams |
+| `diagram_venn` | Venn diagrams showing set overlaps |
+| `multi_panel` | Composite figure with labeled panels (A, B, C, D) |
+| `photograph` | Real-world photographs, microscopy images, scans |
+| `equation` | Mathematical equations, formulas |
+| `text_block` | Text-heavy image, caption, or label |
+| `unknown` | Cannot confidently classify |
+## Output Format
+Return ONLY valid JSON:
+```json
+{
+  "classification": {
+    "primary_type": "chart_kaplan_meier",
+    "confidence": 0.95,
+    "secondary_type": null,
+    "is_multi_panel": false,
+    "panel_count": 1,
+    "contains_table": false,
+    "contains_text_annotations": true
+  },
+  "routing": {
+    "extractor": "structurecc-extract-chart",
+    "extraction_hints": ["survival_curve", "two_groups", "has_risk_table"]
+  }
+}
+```
+## Field Definitions
+### classification
+- `primary_type`: Main visual type from the table above
+- `confidence`: 0.0-1.0 confidence in classification
+- `secondary_type`: If image contains a secondary element (e.g., chart with embedded table)
+- `is_multi_panel`: True if figure has labeled sub-panels (A, B, C...)
+- `panel_count`: Number of panels if multi-panel
+- `contains_table`: True if any tabular data is present
+- `contains_text_annotations`: True if significant text labels/annotations present
+### routing
+- `extractor`: Which specialized extractor to use
+- `extraction_hints`: List of specific features to watch for
+## Extractor Routing
+| Primary Type | Extractor |
+|--------------|-----------|
+| `table_simple`, `table_complex` | `structurecc-extract-table` |
+| `chart_*` | `structurecc-extract-chart` |
+| `heatmap` | `structurecc-extract-heatmap` |
+| `diagram_*` | `structurecc-extract-diagram` |
+| `multi_panel` | `structurecc-extract-multipanel` |
+| `photograph`, `equation`, `text_block`, `unknown` | `structurecc-extract-generic` |
+## Rules
+1. **Be fast** - This is a triage step, not deep analysis
+2. **Be decisive** - Pick the best match, use confidence to express uncertainty
+3. **Detect multi-panel** - If you see A, B, C, D labels, set `is_multi_panel: true`
+4. **Note secondary elements** - Charts often have risk tables, legends, etc.
+5. **Output ONLY JSON** - No explanations, no markdown, just the JSON object
+## Examples
+**Kaplan-Meier curve with risk table below:**
+```json
+{
+  "classification": {
+    "primary_type": "chart_kaplan_meier",
+    "confidence": 0.98,
+    "secondary_type": "table_simple",
+    "is_multi_panel": false,
+    "panel_count": 1,
+    "contains_table": true,
+    "contains_text_annotations": true
+  },
+  "routing": {
+    "extractor": "structurecc-extract-chart",
+    "extraction_hints": ["survival_curve", "has_risk_table", "has_confidence_intervals"]
+  }
+}
+```
+**Four-panel figure with A=bar chart, B=heatmap, C=box plot, D=table:**
+```json
+{
+  "classification": {
+    "primary_type": "multi_panel",
+    "confidence": 0.99,
+    "secondary_type": null,
+    "is_multi_panel": true,
+    "panel_count": 4,
+    "contains_table": true,
+    "contains_text_annotations": true
+  },
+  "routing": {
+    "extractor": "structurecc-extract-multipanel",
+    "extraction_hints": ["panel_A_bar", "panel_B_heatmap", "panel_C_boxplot", "panel_D_table"]
+  }
+}
+```
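
The "Extractor Routing" table in the classifier file above maps each `primary_type` to an extractor, with `chart_*` and `diagram_*` acting as prefix wildcards. A minimal Python sketch of that lookup follows, assuming the routing is a pure function of `primary_type`. The helper names `route` and `routing_is_consistent` are hypothetical, and sending unlisted types to the generic extractor is an assumption rather than documented behavior.

```python
import json

# Exact matches taken from the routing table.
EXACT_ROUTES = {
    "table_simple": "structurecc-extract-table",
    "table_complex": "structurecc-extract-table",
    "heatmap": "structurecc-extract-heatmap",
    "multi_panel": "structurecc-extract-multipanel",
}

# Wildcard rows (`chart_*`, `diagram_*`) expressed as prefixes.
PREFIX_ROUTES = {
    "chart_": "structurecc-extract-chart",
    "diagram_": "structurecc-extract-diagram",
}

# photograph, equation, text_block and unknown go to the generic extractor;
# routing any other unlisted type there as well is an assumption.
GENERIC = "structurecc-extract-generic"


def route(primary_type: str) -> str:
    """Return the extractor name for a classifier primary_type."""
    if primary_type in EXACT_ROUTES:
        return EXACT_ROUTES[primary_type]
    for prefix, extractor in PREFIX_ROUTES.items():
        if primary_type.startswith(prefix):
            return extractor
    return GENERIC


def routing_is_consistent(classifier_output: str) -> bool:
    """Check that the classifier's own routing agrees with the table."""
    data = json.loads(classifier_output)
    expected = route(data["classification"]["primary_type"])
    return data["routing"]["extractor"] == expected


# Trimmed version of the Kaplan-Meier example from the agent file.
sample = json.dumps({
    "classification": {"primary_type": "chart_kaplan_meier", "confidence": 0.98},
    "routing": {"extractor": "structurecc-extract-chart"},
})

print(route("chart_kaplan_meier"))    # prints: structurecc-extract-chart
print(routing_is_consistent(sample))  # prints: True
```

A check like `routing_is_consistent` could catch a classifier response whose `routing.extractor` disagrees with its own `primary_type` before an extractor is dispatched.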