npm - structurecc - Versions diffs - 2.1.0 → 3.0.1 - Mend

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.

Files changed (18)

package/.claude-plugin/plugin.json +8 -0
package/README.md +156 -66
package/commands/structure-batch.md +276 -0
package/commands/structure.md +326 -0
package/install.js +89 -0
package/package.json +23 -22
package/prompts/chunk-extractor.md +289 -0
package/LICENSE +0 -21
package/agents/structurecc-classifier.md +0 -135
package/agents/structurecc-extract-chart.md +0 -354
package/agents/structurecc-extract-diagram.md +0 -343
package/agents/structurecc-extract-generic.md +0 -248
package/agents/structurecc-extract-heatmap.md +0 -322
package/agents/structurecc-extract-multipanel.md +0 -386
package/agents/structurecc-extract-table.md +0 -231
package/agents/structurecc-verifier.md +0 -265
package/bin/install.js +0 -186
package/commands/structure/structure.md +0 -931

package/.claude-plugin/plugin.json ADDED

@@ -0,0 +1,8 @@
+{
+  "name": "structurecc",
+  "version": "3.0.0",
+  "description": "Extract structured data from documents using Claude vision and parallel Task agents",
+  "author": {
+    "name": "UTMB Diagnostic Center"
+  }
+}

package/README.md CHANGED

@@ -1,106 +1,196 @@
-<h1 align="center">STRUCTURE</h1>
+# structurecc
-<p align="center">
-<strong>Extract structured data from PDFs, Word docs, and images using Claude Code.</strong>
-</p>
+Document Structure Extraction for Claude Code
-<p align="center">
-<a href="https://www.npmjs.com/package/structurecc"><img src="https://img.shields.io/npm/v/structurecc.svg" alt="npm version"></a>
-<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
-</p>
+Extract structured data from PDFs, Word documents, and images using Claude's native vision capabilities and parallel Task agents.
-<p align="center">
-<img src="assets/terminal.png" alt="structurecc" width="550">
-</p>
+## Installation
----
-## Requirements
-- **Node.js** - [nodejs.org](https://nodejs.org/)
-- **Claude Code** - Requires API key or Pro/Max subscription
+```bash
+npx structurecc
+```
----
+This installs the plugin to `~/.claude/plugins/structurecc/`.
-## Install
+## Usage
-### Step 1: Install Claude Code
+### Single Document
 ```bash
-npm install -g @anthropic-ai/claude-code
+/structure document.pdf
+/structure lab_image.png
+/structure report.docx
 ```
-<p align="center">
-<img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
-</p>
-### Step 2: Install structurecc
+### Batch Processing
 ```bash
-npx structurecc
+/structure:batch ./documents/
+/structure:batch ./patient_files/ --output ./extracted/
 ```
-<p align="center">
-<img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
-</p>
+## Supported Formats
-### Step 3: Start Claude Code
+| Format | Extension | Notes |
+|--------|-----------|-------|
+| PDF | `.pdf` | Multi-page supported, chunked for large documents |
+| Word | `.docx`, `.doc` | Text and embedded images extracted |
+| Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` | Single-page extraction |
-Navigate to your document folder and run:
+## Output
+For each document, structurecc generates:
-```bash
-cd ~/Desktop/documents
-claude
+```
+document_extracted/
+├── chunks/              # Individual chunk extractions (for debugging)
+├── structure.json       # Complete structured extraction
+└── STRUCTURE.md         # Human-readable markdown summary
 ```
-<p align="center">
-<img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
-</p>
+### structure.json
+```json
+{
+  "source": "/path/to/document.pdf",
+  "extracted": "2026-01-30T14:30:22Z",
+  "pages": [
+    {
+      "page": 1,
+      "elements": [
+        {
+          "id": "element_1",
+          "type": "table",
+          "title": "Table 1. Lab Results",
+          "data": {
+            "headers": ["Test", "Result", "Units", "Reference"],
+            "rows": [
+              ["Glucose", "126", "mg/dL", "70-100"]
+            ]
+          },
+          "confidence": 0.98
+        }
+      ]
+    }
+  ],
+  "summary": {
+    "total_pages": 5,
+    "tables": 3,
+    "figures": 4,
+    "equations": 1,
+    "average_confidence": 0.94
+  }
+}
+```
-### Step 4: Run structure
+## Architecture
-Inside Claude Code:
+structurecc uses a chunk-based parallel processing approach:
+1. **Document Analysis** - Determine page count and split into chunks (5 pages each)
+2. **Parallel Extraction** - Launch one Task agent per chunk for parallel processing
+3. **Chunk Merge** - Combine chunk results with page offset correction
+4. **Output Generation** - Create JSON and Markdown outputs
 ```
-/structure document.pdf
+Document (20 pages)
+       │
+       ├── Chunk 1 (Pages 1-5)  → Agent 1
+       ├── Chunk 2 (Pages 6-10) → Agent 2
+       ├── Chunk 3 (Pages 11-15)→ Agent 3
+       └── Chunk 4 (Pages 16-20)→ Agent 4
+               │
+               ▼
+         Merged Output
 ```
-<p align="center">
-<img src="assets/screenshots/step3.png" alt="Run /structure" width="520">
-</p>
+This approach:
+- Maximizes throughput via parallel processing
+- Preserves context within chunks (figures and captions stay together)
+- Uses Claude's native vision (no external APIs)
+- Each agent has 200K context for thorough extraction
-Supports **PDF**, **DOCX**, **PNG**, **JPG**, and **TIFF**.
+## Element Types
----
+### Tables
-## Output
+Extracted with:
+- Headers and all rows
+- Cell values with exact formatting
+- Flags (H, L, *, †)
+- Footnotes
+- Merged cell information
-```
-document_extracted/
-├── images/              # Extracted visuals
-├── elements/            # Markdown per element
-└── STRUCTURED.md        # Combined output
-```
+### Figures
----
+Supports various figure types:
+- **Charts/Graphs**: Line, bar, scatter, pie with data series and axes
+- **Scientific Images**: Western blots, gels, micrographs
+- **Diagrams**: Flowcharts, illustrations, photographs
-## Troubleshooting
+Each figure includes:
+- Title and caption
+- Data points (when visible)
+- Axis labels and ranges
+- Annotations and legends
-| Issue | Solution |
-|-------|----------|
-| `npm: command not found` | Install Node.js from [nodejs.org](https://nodejs.org/) |
-| `/structure: No such file` | Run `claude` first, then type `/structure` inside Claude Code |
-| No images found | PDF may be text-only with no embedded images |
+### Equations
----
+Extracted as:
+- LaTeX representation
+- Plain text fallback
+- Variable definitions
-## Uninstall
+### Text Blocks
-```bash
-npx structurecc --uninstall
-```
+Captured with:
+- Full content
+- Type (header, paragraph, caption, footnote)
+- Formatting information
+## Confidence Scores
+Every element includes a confidence score (0.0-1.0):
+| Score | Meaning |
+|-------|---------|
+| 0.95-1.00 | Crystal clear extraction |
+| 0.85-0.94 | Clear with minor uncertainty |
+| 0.70-0.84 | Readable but some ambiguity |
+| < 0.70 | Needs manual verification |
+Low confidence items are flagged in the output for review.
+## Use Cases
+- **Medical Lab Results**: Extract patient data from PDF reports
+- **Research Papers**: Structure tables and figures from publications
+- **Scientific Images**: Transcribe gel/blot data for documentation
+- **Patient Records**: Batch process document folders
+- **Data Digitization**: Convert scanned documents to structured data
+## Requirements
+- Claude Code CLI
+- No external dependencies (uses Claude's native capabilities)
+## How It Works
+structurecc leverages Claude's multimodal capabilities:
+1. **Claude Vision**: Reads PDFs and images natively without OCR
+2. **Parallel Agents**: Task tool spawns chunk agents for parallel processing
+3. **Structured Output**: JSON schema ensures consistent, parseable output
+4. **Markdown Summary**: Human-readable format for quick review
+No web searches, no external APIs, no Python dependencies. Just Claude + document = structured data.
+## Limitations
----
+- Very large documents (100+ pages) may require multiple runs
+- Handwritten content has lower accuracy than printed text
+- Low-resolution images may have reduced confidence scores
+- Complex nested tables may require manual verification
 ## License
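
Note (not part of either package version): the Architecture section of the new README describes splitting a document into 5-page chunks and merging chunk results "with page offset correction". A minimal sketch of that arithmetic follows. The function names `planChunks` and `mergeChunks` are hypothetical, and the sketch assumes each chunk agent numbers pages from 1 within its own chunk; the package itself drives this through prompts rather than code like this.

```javascript
// Chunk size taken from the README ("5 pages each"); everything else is assumed.
const CHUNK_SIZE = 5;

// Split a page count into inclusive, 1-based page ranges of at most `size` pages.
function planChunks(totalPages, size = CHUNK_SIZE) {
  const chunks = [];
  for (let start = 1; start <= totalPages; start += size) {
    chunks.push({ start, end: Math.min(start + size - 1, totalPages) });
  }
  return chunks;
}

// Merge per-chunk results into one page list.
// Assumption: every chunk result reports pages numbered from 1, so the
// chunk's first document page minus one is added back as the offset.
function mergeChunks(chunks, chunkResults) {
  const pages = [];
  chunkResults.forEach((result, i) => {
    const offset = chunks[i].start - 1;
    for (const p of result.pages) {
      pages.push({ ...p, page: p.page + offset });
    }
  });
  return pages;
}

// The 20-page example from the README diagram yields four chunks.
console.log(planChunks(20));
```

With a 12-page document the last chunk covers only pages 11-12, so the merge step cannot assume every chunk is full.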

package/commands/structure-batch.md ADDED

@@ -0,0 +1,276 @@
+---
+description: Batch extract structured data from multiple documents in a directory
+argument-hint: <directory>
+---
+# Batch Document Structure Extraction
+You are extracting structured data from multiple documents in a directory.
+## Input
+**Directory path:** $ARGUMENTS
+Parse arguments:
+- First argument: Directory path (required)
+- `--output <dir>`: Custom output directory (optional, defaults to source directory)
+- `--pattern <glob>`: File pattern to match (optional, defaults to all supported types)
+## Workflow
+### Step 1: Discover Documents
+Use Glob to find all supported documents in the directory:
+```
+Supported patterns:
+- *.pdf
+- *.docx
+- *.doc
+- *.png
+- *.jpg
+- *.jpeg
+- *.tiff
+- *.bmp
+```
+List all discovered documents:
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│  BATCH EXTRACTION                                                    │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│  Directory: {path}                                                   │
+│                                                                      │
+│  Documents Found: {count}                                            │
+│    1. document1.pdf (15 pages)                                       │
+│    2. lab_results.png (1 image)                                      │
+│    3. report.docx (8 pages)                                          │
+│    ...                                                               │
+│                                                                      │
+│  Estimated chunks: {total_chunks}                                    │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+### Step 2: Create Batch Output Directory
+```
+{output_dir}/
+├── batch_summary.json       # Summary of all extractions
+├── batch_summary.md         # Human-readable summary
+├── document1_extracted/     # Per-document outputs
+│   ├── structure.json
+│   └── STRUCTURE.md
+├── lab_results_extracted/
+│   ├── structure.json
+│   └── STRUCTURE.md
+└── report_extracted/
+    ├── structure.json
+    └── STRUCTURE.md
+```
+### Step 3: Process Documents
+For each document, invoke the /structure command:
+**Option A: Sequential Processing (safer, easier to track)**
+Process one document at a time, reporting progress:
+```
+Processing document 1/3: document1.pdf
+  Launching 3 chunk agents...
+  Chunks complete. Merging...
+  Done. Tables: 5, Figures: 3
+Processing document 2/3: lab_results.png
+  Processing as single chunk...
+  Done. Tables: 1, Figures: 0
+Processing document 3/3: report.docx
+  Launching 2 chunk agents...
+  Chunks complete. Merging...
+  Done. Tables: 2, Figures: 4
+```
+**Option B: Parallel Documents (faster for many small docs)**
+For directories with many single-page documents (like images), launch multiple document extractions in parallel using Task agents.
+Choose based on:
+- <5 documents or large PDFs: Sequential
+- >5 small documents (images, short PDFs): Parallel
+### Step 4: Generate Batch Summary
+**batch_summary.json:**
+```json
+{
+  "batch": {
+    "directory": "/path/to/documents",
+    "processed": "2026-01-30T14:30:22Z",
+    "document_count": 3,
+    "total_pages": 24,
+    "total_elements": 45
+  },
+  "documents": [
+    {
+      "filename": "document1.pdf",
+      "pages": 15,
+      "tables": 5,
+      "figures": 3,
+      "equations": 1,
+      "average_confidence": 0.94,
+      "output": "document1_extracted/structure.json"
+    },
+    {
+      "filename": "lab_results.png",
+      "pages": 1,
+      "tables": 1,
+      "figures": 0,
+      "equations": 0,
+      "average_confidence": 0.97,
+      "output": "lab_results_extracted/structure.json"
+    },
+    {
+      "filename": "report.docx",
+      "pages": 8,
+      "tables": 2,
+      "figures": 4,
+      "equations": 0,
+      "average_confidence": 0.91,
+      "output": "report_extracted/structure.json"
+    }
+  ],
+  "summary": {
+    "total_tables": 8,
+    "total_figures": 7,
+    "total_equations": 1,
+    "overall_confidence": 0.94,
+    "low_confidence_items": 2
+  }
+}
+```
+**batch_summary.md:**
+```markdown
+# Batch Extraction Summary
+**Directory:** {path}
+**Processed:** {timestamp}
+**Documents:** {count}
+---
+## Overview
+| Metric | Count |
+|--------|-------|
+| Total Documents | 3 |
+| Total Pages | 24 |
+| Total Tables | 8 |
+| Total Figures | 7 |
+| Total Equations | 1 |
+| Overall Confidence | 94% |
+---
+## Document Details
+### 1. document1.pdf
+- **Pages:** 15
+- **Tables:** 5
+- **Figures:** 3
+- **Confidence:** 94%
+- **Output:** [document1_extracted/](document1_extracted/)
+### 2. lab_results.png
+- **Pages:** 1
+- **Tables:** 1
+- **Figures:** 0
+- **Confidence:** 97%
+- **Output:** [lab_results_extracted/](lab_results_extracted/)
+### 3. report.docx
+- **Pages:** 8
+- **Tables:** 2
+- **Figures:** 4
+- **Confidence:** 91%
+- **Output:** [report_extracted/](report_extracted/)
+---
+## Low Confidence Items
+Items with confidence < 80% requiring review:
+1. **document1.pdf, Page 7, Table 3** (72%)
+   - Reason: Partially obscured by watermark
+2. **report.docx, Page 5, Figure 2** (68%)
+   - Reason: Low resolution image
+---
+## Files Generated
+```
+{output_dir}/
+├── batch_summary.json
+├── batch_summary.md
+├── document1_extracted/
+│   ├── structure.json
+│   └── STRUCTURE.md
+├── lab_results_extracted/
+│   ├── structure.json
+│   └── STRUCTURE.md
+└── report_extracted/
+    ├── structure.json
+    └── STRUCTURE.md
+```
+```
+### Step 5: Display Completion
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│  BATCH EXTRACTION COMPLETE                                           │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│  Documents Processed: {count}                                        │
+│  Total Pages: {pages}                                                │
+│  Total Elements: {elements}                                          │
+│                                                                      │
+│  Summary:                                                            │
+│    Tables: {count}                                                   │
+│    Figures: {count}                                                  │
+│    Equations: {count}                                                │
+│                                                                      │
+│  Overall Confidence: {score}%                                        │
+│  Items Needing Review: {count}                                       │
+│                                                                      │
+│  Output:                                                             │
+│    {output_dir}/batch_summary.json                                   │
+│    {output_dir}/batch_summary.md                                     │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+## Error Handling
+- If directory doesn't exist: Report error
+- If no supported documents found: Report and suggest checking patterns
+- If individual document fails: Log error, continue with others, report in summary
+- If output directory cannot be created: Report error
+## Notes
+- Each document is processed independently using /structure
+- Batch processing is ideal for patient files, lab results, research papers
+- Low confidence items are aggregated for easy review
+- All outputs are self-contained and can be accessed individually
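
Note (not part of either package version): Step 4 of the batch command asks for totals across documents. The sketch below shows one way the `documents` array of `batch_summary.json` could be rolled up. `summarizeBatch` and the `documents_below_threshold` field are hypothetical names. Computing overall confidence as an unweighted mean of the per-document averages is an assumption; the command file does not say how `overall_confidence` is derived. The review threshold is a parameter because the README table flags scores below 0.70 while the `batch_summary.md` template says below 80%.

```javascript
// Roll per-document entries up into batch totals.
// reviewThreshold defaults to 0.70 (the README table); this default is an assumption.
function summarizeBatch(documents, reviewThreshold = 0.7) {
  const sum = (key) => documents.reduce((acc, d) => acc + d[key], 0);
  // Assumption: unweighted mean of per-document averages, rounded to 2 decimals.
  const mean =
    documents.length === 0 ? 0 : sum("average_confidence") / documents.length;
  return {
    document_count: documents.length,
    total_pages: sum("pages"),
    total_tables: sum("tables"),
    total_figures: sum("figures"),
    total_equations: sum("equations"),
    overall_confidence: Math.round(mean * 100) / 100,
    // Document-level flagging only; element-level review items need the
    // per-element scores from each structure.json.
    documents_below_threshold: documents
      .filter((d) => d.average_confidence < reviewThreshold)
      .map((d) => d.filename),
  };
}

// Sample values copied from the batch_summary.json example in the diff.
const sampleDocuments = [
  { filename: "document1.pdf", pages: 15, tables: 5, figures: 3, equations: 1, average_confidence: 0.94 },
  { filename: "lab_results.png", pages: 1, tables: 1, figures: 0, equations: 0, average_confidence: 0.97 },
  { filename: "report.docx", pages: 8, tables: 2, figures: 4, equations: 0, average_confidence: 0.91 },
];
console.log(summarizeBatch(sampleDocuments));
```

On the sample data the totals agree with the example in the command file (24 pages, 8 tables, 7 figures, 1 equation).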