structurecc 2.1.0 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,331 @@
---
name: structure
description: Extract structured data from documents using Claude vision and parallel chunk agents
argument-hint: <path> [--output dir]
allowed-tools: Read, Write, Task, Glob, Bash
model: opus
---

<command-name>structure</command-name>

# Document Structure Extraction

You are extracting structured data from a document using Claude's native vision capabilities and parallel Task agents.

## Input

**Document path:** $ARGUMENTS

## Workflow

### Step 1: Validate Input

1. Check if the file exists at the provided path
2. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
3. If the path is invalid, inform the user and stop

27
### Step 2: Determine Processing Strategy

Based on document type:

**For PDFs:**
- Use the Read tool to read the PDF (Claude can read PDFs natively)
- Determine total page count from the read
- If ≤10 pages: Process as single chunk
- If >10 pages: Split into chunks of 5 pages each

**For Images (PNG, JPG, TIFF, BMP):**
- Process as single chunk (one image = one chunk)

40
**For Word Documents (.docx):**
- Use the Read tool to read the .docx file directly (Claude can read these)
- If direct reading fails, fall back to extracting the text with Bash (macOS): `textutil -convert txt "<path>" -stdout`
- Process as single chunk unless very large

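The page-chunking rule above amounts to simple arithmetic; a minimal sketch (the `plan_chunks` helper is hypothetical, not part of the command):

```python
def plan_chunks(total_pages: int, max_single: int = 10, chunk_size: int = 5):
    """Split a page count into inclusive (start, end) ranges per the strategy above."""
    if total_pages <= max_single:
        return [(1, total_pages)]  # small documents stay as a single chunk
    return [(start, min(start + chunk_size - 1, total_pages))
            for start in range(1, total_pages + 1, chunk_size)]
```

For example, `plan_chunks(12)` yields `[(1, 5), (6, 10), (11, 12)]`: the last chunk is allowed to be short rather than padding past the end of the document.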
45
### Step 3: Create Output Directory

Create output directory structure:
```
<source_dir>/<filename>_extracted/
├── chunks/           # Individual chunk JSON files
├── structure.json    # Final merged JSON
└── STRUCTURE.md      # Human-readable markdown
```

55
### Step 4: Launch Chunk Agents (Parallel)

For each chunk, launch a Task agent with subagent_type="general-purpose":

**CRITICAL: Launch ALL chunk agents in a SINGLE message with multiple Task tool calls for true parallelism.**

Each agent receives:
1. The document path
2. Its assigned page range (e.g., pages 1-5)
3. The chunk extractor prompt (embedded below)
4. The output path for its chunk JSON

67
**Chunk Extractor Prompt for Agents:**

```
You are extracting structured data from a document chunk.

## Your Assignment
- Document: {document_path}
- Pages: {start_page} to {end_page} (or "all" for images/small docs)
- Output: Write JSON to {output_path}

## Protocol

### Step 1: SCAN
Read the document using the Read tool. Look at ALL pages in your assigned range.
List every element you see:
- Tables (data tables, comparison tables, any tabular data)
- Figures (charts, graphs, images, diagrams, gels, blots, plots)
- Text blocks (paragraphs, headers, captions)
- Equations (mathematical formulas)
- Lists (bulleted, numbered)

### Step 2: TRANSCRIBE
For EACH element, transcribe EXACTLY what you see:
- Every label, number, unit, symbol
- Every axis label, legend entry, annotation
- Every cell value, header, footnote
- Every variable, subscript, superscript

Rules:
- Use [unclear] for illegible content
- Use [ambiguous: A|B] for uncertain readings (e.g., "0" vs "O")
- Preserve EXACT formatting (1,234.56 not 1234.56)
- Include ALL flags, markers, asterisks, annotations
- Transcribe units exactly as shown (mg/dL, not mg per dL)

### Step 3: STRUCTURE
Output as JSON with this schema:

{
  "chunk": {
    "pages": [start_page, end_page],
    "document": "filename"
  },
  "elements": [
    {
      "id": "element_1",
      "page": N,
      "type": "table|figure|text|equation|list",
      "title": "Title if visible, null otherwise",
      "caption": "Caption if visible, null otherwise",
      "data": {
        // Type-specific structured data - see below
      },
      "raw_transcription": "Exact text as seen in document",
      "position": "top|middle|bottom of page",
      "confidence": 0.0-1.0,
      "notes": "Any observations about quality, legibility, etc."
    }
  ]
}

### Type-Specific Data Structures

**For Tables:**
{
  "headers": ["Col1", "Col2", "Col3"],
  "rows": [
    ["value1", "value2", "value3"],
    ["value4", "value5", {"value": "value6", "flag": "H"}]
  ],
  "footnotes": ["* p < 0.05", "** p < 0.01"]
}

**For Figures (charts/graphs):**
{
  "figure_type": "bar_chart|line_graph|scatter_plot|pie_chart|other",
  "description": "Brief description of what the figure shows",
  "axes": {
    "x": {"label": "Time (hours)", "range": [0, 24]},
    "y": {"label": "Concentration (ng/mL)", "range": [0, 100]}
  },
  "data_series": [
    {"name": "Control", "color": "blue", "values": [[0, 10], [6, 45], [12, 80]]},
    {"name": "Treatment", "color": "red", "values": [[0, 12], [6, 60], [12, 95]]}
  ],
  "legend": ["Control", "Treatment"],
  "annotations": ["Arrow pointing to peak at t=12h"]
}

**For Figures (gels/blots):**
{
  "figure_type": "western_blot|gel_electrophoresis|other",
  "description": "Western blot of protein X expression",
  "lanes": [
    {"position": 1, "label": "Ladder", "bands": [{"position": "250 kDa"}, {"position": "150 kDa"}]},
    {"position": 2, "label": "Control", "bands": [{"position": "~50 kDa", "intensity": "strong"}]},
    {"position": 3, "label": "Sample 1", "bands": [{"position": "~50 kDa", "intensity": "weak"}]}
  ],
  "loading_control": "Beta-actin at ~42 kDa",
  "annotations": ["Arrow indicates target protein"]
}

**For Figures (generic images/diagrams):**
{
  "figure_type": "diagram|photograph|illustration|other",
  "description": "Detailed description of what the image shows",
  "visible_labels": ["Label 1", "Label 2"],
  "annotations": ["Arrow pointing to structure X"],
  "colors_used": ["Red indicates inflammation", "Blue shows normal tissue"]
}

**For Equations:**
{
  "latex": "E = mc^2",
  "variables": {
    "E": "energy",
    "m": "mass",
    "c": "speed of light"
  }
}

**For Text Blocks:**
{
  "content": "The full text content",
  "type": "paragraph|header|caption|footnote"
}

**For Lists:**
{
  "list_type": "bulleted|numbered",
  "items": ["Item 1", "Item 2", "Item 3"]
}

## Rules
1. Extract EXACTLY what's visible - never fabricate data
2. If a figure has a caption nearby, include it
3. Link figures to their captions within your chunk
4. Low confidence (<0.8) = flag for review in notes
5. For tables with merged cells, represent structure accurately
6. Include ALL visible data points, not just a sample

## Output
Write your JSON to: {output_path}
```

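Before merging, the orchestrator can sanity-check each chunk file against the element schema above. A minimal sketch, assuming only the fields the merge step actually relies on (the `validate_chunk` helper itself is hypothetical):

```python
# Keys the merge step reads from every element (taken from the schema above)
REQUIRED_ELEMENT_KEYS = {"id", "page", "type", "raw_transcription", "confidence"}

def validate_chunk(chunk: dict) -> list[str]:
    """Return a list of problems; an empty list means the chunk looks mergeable."""
    if "chunk" not in chunk or "elements" not in chunk:
        return ["missing top-level 'chunk' or 'elements' key"]
    problems = []
    for elem in chunk["elements"]:
        missing = REQUIRED_ELEMENT_KEYS - elem.keys()
        if missing:
            problems.append(f"{elem.get('id', '?')}: missing {sorted(missing)}")
    return problems
```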
212
### Step 5: Wait for Chunk Agents

After launching all chunk agents, wait for them to complete.
Each agent will write its chunk JSON to the chunks/ directory.

### Step 6: Merge Chunks

Once all chunks are complete:

1. Read all chunk JSON files from the chunks/ directory
2. Merge into a single structure, renumbering elements globally:

224
```python
import json
from datetime import datetime, timezone

def merge_chunks(chunk_files, document_path):
    result = {
        "source": document_path,
        "extracted": datetime.now(timezone.utc).isoformat(),
        "pages": []
    }

    all_elements = []
    element_counter = 1

    for chunk_file in sorted(chunk_files):
        with open(chunk_file) as f:
            chunk_data = json.load(f)
        for element in chunk_data["elements"]:
            # Renumber elements globally across chunks
            element["id"] = f"element_{element_counter}"
            element_counter += 1
            all_elements.append(element)

    # Group by page
    pages = {}
    for elem in all_elements:
        page_num = elem["page"]
        if page_num not in pages:
            pages[page_num] = {"page": page_num, "elements": []}
        pages[page_num]["elements"].append(elem)

    result["pages"] = [pages[p] for p in sorted(pages)]

    # Calculate summary
    result["summary"] = {
        "total_pages": len(result["pages"]),
        "tables": sum(1 for p in result["pages"] for e in p["elements"] if e["type"] == "table"),
        "figures": sum(1 for p in result["pages"] for e in p["elements"] if e["type"] == "figure"),
        "equations": sum(1 for p in result["pages"] for e in p["elements"] if e["type"] == "equation"),
        "average_confidence": (
            sum(e.get("confidence", 0) for e in all_elements) / len(all_elements)
            if all_elements else None
        )
    }

    return result
```

3. Write the merged result to structure.json

267
### Step 7: Generate Markdown Summary

Create STRUCTURE.md with a human-readable format:

```markdown
# Document Extraction

**Source:** {document_path}
**Extracted:** {timestamp}
**Pages:** {total} | **Tables:** {count} | **Figures:** {count} | **Equations:** {count}
**Average Confidence:** {score}

---

## Page {N}

### {Element Title or "Element {id}"}

{Markdown representation of the element}

- For tables: Render as markdown table
- For figures: Description + key data points
- For equations: LaTeX in $$ blocks
- For text: The content

**Confidence:** {score}

---
```

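Rendering a table element into the summary can be sketched like this. The `table_to_markdown` helper is hypothetical; it assumes the table schema from the chunk extractor prompt, where a cell is either a plain value or an object carrying a flag.

```python
def table_to_markdown(headers, rows):
    """Render an extracted table element as a markdown table."""
    def cell(c):
        # Flagged cells arrive as {"value": ..., "flag": ...}; keep the flag visible
        return f'{c["value"]} [{c["flag"]}]' if isinstance(c, dict) else str(c)
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "|".join(" --- " for _ in headers) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(cell(c) for c in row) + " |")
    return "\n".join(lines)
```

For example, a lab-result row flagged "H" renders as `| Glucose | 180 [H] |`, so out-of-range markers survive into the human-readable summary.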
297
### Step 8: Display Completion

```
┌──────────────────────────────────────────────────────────────────────┐
│ EXTRACTION COMPLETE                                                  │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│ Source: {document_path}                                              │
│ Pages: {total} | Tables: {count} | Figures: {count}                  │
│ Average Confidence: {score}                                          │
│                                                                      │
│ Output:                                                              │
│   {output_dir}/structure.json                                        │
│   {output_dir}/STRUCTURE.md                                          │
│                                                                      │
│ Low Confidence Items: {count}                                        │
│   {List any elements with confidence < 0.8}                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

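Collecting the low-confidence list for the display above is a single pass over the merged structure. A sketch (hypothetical helper; assumes the merged shape produced in Step 6):

```python
def low_confidence_items(merged: dict, threshold: float = 0.8):
    """Return (id, confidence) pairs for elements flagged for review."""
    return [(e["id"], e["confidence"])
            for page in merged["pages"] for e in page["elements"]
            if e["confidence"] < threshold]
```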
318
## Error Handling

- If the document cannot be read: report the error and suggest checking the file format
- If a chunk agent fails: report which chunk failed and continue with the others
- If the merge fails: keep the individual chunk files and report the merge error
- If the output directory cannot be created: fall back to the source document's directory

## Notes

- This command uses Claude's native vision - no external APIs needed
- Parallel chunk processing maximizes throughput
- Each chunk agent has a 200K-token context window - plenty for 5 pages
- Chunks preserve figure-caption relationships (usually within the same chunk)
- Edge cases (e.g., a figure on page 5 with its caption on page 6) are rare but detectable

package/install.js ADDED
@@ -0,0 +1,93 @@
#!/usr/bin/env node

const fs = require('fs');
const path = require('path');
const os = require('os');

const PLUGIN_NAME = 'structurecc';
const DEST_DIR = path.join(os.homedir(), '.claude', 'plugins', PLUGIN_NAME);

// Files to copy
const FILES = [
  '.claude-plugin/plugin.json',
  'commands/structure.md',
  'commands/structure-batch.md',
  'prompts/chunk-extractor.md',
  'README.md'
];

function copyFile(src, dest) {
  const destDir = path.dirname(dest);
  if (!fs.existsSync(destDir)) {
    fs.mkdirSync(destDir, { recursive: true });
  }
  fs.copyFileSync(src, dest);
}

function install() {
  console.log(`
┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│ ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗│
│ ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗
│ ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝
│ ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗
│ ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║
│ ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝
│                                                                      │
│ Document Structure Extraction | Claude Code Plugin | v3.0            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
`);

  const sourceDir = __dirname;

  console.log(`Installing to: ${DEST_DIR}\n`);

  // Create destination directory
  if (!fs.existsSync(DEST_DIR)) {
    fs.mkdirSync(DEST_DIR, { recursive: true });
  }

  // Copy each file
  let copied = 0;
  for (const file of FILES) {
    const src = path.join(sourceDir, file);
    const dest = path.join(DEST_DIR, file);

    if (fs.existsSync(src)) {
      copyFile(src, dest);
      console.log(`  ✓ ${file}`);
      copied++;
    } else {
      console.log(`  ✗ ${file} (not found)`);
    }
  }

  console.log(`
────────────────────────────────────────────────────────────────────────

Installation complete! ${copied}/${FILES.length} files installed.

USAGE:
  /structure <path>        Extract structured data from a document
  /structure:batch <dir>   Process multiple documents

SUPPORTED FORMATS:
  - PDF documents
  - Word documents (.docx)
  - Images (PNG, JPG, TIFF)

EXAMPLE:
  /structure ~/Documents/lab_report.pdf
  /structure ~/Pictures/gel_image.png
  /structure:batch ~/Documents/patient_files/

Output: JSON + Markdown in same directory as source document

────────────────────────────────────────────────────────────────────────
`);
}

// Run installation
install();
package/package.json CHANGED
@@ -1,33 +1,34 @@
  {
    "name": "structurecc",
-   "version": "2.1.0",
-   "description": "Extract structured data from PDFs, Word docs, and images using Claude Code.",
+   "version": "3.0.0",
+   "description": "Claude Code plugin for extracting structured data from documents using native vision and parallel Task agents",
+   "author": "UTMB Diagnostic Center",
+   "license": "MIT",
+   "bin": {
+     "structurecc": "./install.js"
+   },
+   "files": [
+     "install.js",
+     ".claude-plugin/",
+     "commands/",
+     "prompts/",
+     "README.md"
+   ],
    "keywords": [
+     "claude-code",
+     "plugin",
      "document-extraction",
      "pdf",
-     "structure",
-     "claude-code",
-     "llm",
-     "multimodal",
-     "tables",
-     "figures",
-     "charts",
-     "markdown",
-     "ai-agents"
+     "vision",
+     "structured-data",
+     "pharmacogenomics",
+     "medical-documents"
    ],
-   "author": "James Weatherhead",
-   "license": "MIT",
    "repository": {
      "type": "git",
      "url": "https://github.com/JamesWeatherhead/structure"
    },
-   "homepage": "https://github.com/JamesWeatherhead/structure#readme",
-   "bin": {
-     "structurecc": "./bin/install.js"
-   },
-   "files": [
-     "bin/",
-     "commands/",
-     "agents/"
-   ]
+   "engines": {
+     "node": ">=16.0.0"
+   }
  }