
structurecc 3.1.0 → 3.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2)

package/commands/structure.md +190 -21
package/package.json +1 -1

package/commands/structure.md CHANGED

@@ -1,6 +1,6 @@
 ---
 description: Extract structured data from documents (PDF, DOCX, images) using Claude vision
-argument-hint: <path>
+argument-hint: <path> [--verbose]
 ---
 # Document Structure Extraction
@@ -11,13 +11,23 @@ You are extracting structured data from a document using Claude's native vision
 **Document path:** $ARGUMENTS
+## Flags
+- `--verbose` - Keep all intermediate files (chunks/, pages/, debug logs). Default behavior is clean output only.
+Parse the arguments to detect `--verbose`:
+- If `--verbose` is present anywhere in $ARGUMENTS, set VERBOSE_MODE=true
+- Remove `--verbose` from the path to get the actual document path
+- Default: VERBOSE_MODE=false (clean output)
 ## Workflow
 ### Step 1: Validate Input
-1. Check if the file exists at the provided path
-2. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
-3. If the path is invalid, inform the user and stop
+1. Parse arguments for `--verbose` flag
+2. Check if the file exists at the provided path
+3. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
+4. If the path is invalid, inform the user and stop
 ### Step 2: Determine Processing Strategy
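Note on the `--verbose` parsing added in this hunk: the rules (detect the flag anywhere in `$ARGUMENTS`, strip it, default to clean output) could be sketched as below. This is only an illustration, not code shipped in the package; the function name `parse_arguments` is hypothetical, and it assumes POSIX-style quoting of the argument string.

```python
import shlex


def parse_arguments(raw: str) -> tuple[str, bool]:
    """Split a raw argument string into (document_path, verbose_mode)."""
    # shlex honours quotes, so "'my scan.pdf' --verbose" yields two tokens.
    tokens = shlex.split(raw)
    # The flag may appear anywhere in the arguments.
    verbose = "--verbose" in tokens
    # Everything that is not the flag is treated as the document path;
    # rejoining keeps unquoted paths that contain spaces intact.
    path = " ".join(t for t in tokens if t != "--verbose")
    return path, verbose
```

For example, `parse_arguments("report.pdf --verbose")` would give `("report.pdf", True)`, while a bare path leaves verbose mode off.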
@@ -40,13 +50,28 @@ Based on document type:
 ### Step 3: Create Output Directory
 Create output directory structure:
+**Default (clean output):**
+```
+<source_dir>/<filename>_extracted/
+├── structure.json    # Final merged JSON (machine-readable)
+├── STRUCTURE.md      # Human-readable markdown summary
+└── images/           # Extracted figures (if any)
+```
+**With --verbose (keeps intermediates):**
 ```
 <source_dir>/<filename>_extracted/
-├── chunks/           # Individual chunk JSON files
 ├── structure.json    # Final merged JSON
-└── STRUCTURE.md      # Human-readable markdown
+├── STRUCTURE.md      # Human-readable markdown
+├── images/           # Extracted figures (if any)
+├── chunks/           # Individual chunk JSON files
+├── pages/            # Per-page PNG images (if generated)
+└── debug/            # Processing logs
 ```
+During processing, create a temporary `_processing/` subdirectory for intermediate files. This will be cleaned up at the end (unless --verbose).
 ### Step 4: Launch Chunk Agents (Parallel)
 For each chunk, launch a Task agent with subagent_type="general-purpose":
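Note on the output layout added in this hunk: a minimal sketch of how the `<source_dir>/<filename>_extracted/` directory and its temporary `_processing/` subdirectory might be created. The helper name `prepare_output_dirs` is hypothetical, and the sketch assumes `<filename>` means the file name without its extension, which the diff does not state.

```python
from pathlib import Path


def prepare_output_dirs(document_path: str) -> tuple[Path, Path]:
    """Create the output and temporary processing directories."""
    source = Path(document_path)
    # Assumption: <filename> is the stem, e.g. paper.pdf -> paper_extracted/
    output_dir = source.parent / f"{source.stem}_extracted"
    processing_dir = output_dir / "_processing"
    # Chunk agents write their JSON into _processing/chunks/.
    (processing_dir / "chunks").mkdir(parents=True, exist_ok=True)
    return output_dir, processing_dir
```

Creating `_processing/chunks/` with `parents=True` also creates the output directory itself, so a single call covers both levels.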
@@ -57,7 +82,7 @@ Each agent receives:
 1. The document path
 2. Their assigned page range (e.g., pages 1-5)
 3. The chunk extractor prompt (embedded below)
-4. Output path for their chunk JSON
+4. Output path for their chunk JSON (write to `_processing/chunks/` subdirectory)
 **Chunk Extractor Prompt for Agents:**
@@ -69,6 +94,12 @@ You are extracting structured data from a document chunk.
 - Pages: {start_page} to {end_page} (or "all" for images/small docs)
 - Output: Write JSON to {output_path}
+## Core Principle
+**Figures are data, not decorations.**
+A chart is a visual table. A graph is data in disguise. When you see axes and curves, you see data waiting to be extracted - not an image to describe.
 ## Protocol
 ### Step 1: SCAN
@@ -76,6 +107,7 @@ Read the document using the Read tool. Look at ALL pages in your assigned range.
 List every element you see:
 - Tables (data tables, comparison tables, any tabular data)
 - Figures (charts, graphs, images, diagrams, gels, blots, plots)
+  **Figures are data, not decorations. Read them like you read tables.**
 - Text blocks (paragraphs, headers, captions)
 - Equations (mathematical formulas)
 - Lists (bulleted, numbered)
@@ -94,6 +126,30 @@ Rules:
 - Include ALL flags, markers, asterisks, annotations
 - Transcribe units exactly as shown (mg/dL, not mg per dL)
+### Step 2B: EXTRACT FIGURE DATA
+**Figures are data, not decorations.**
+For every figure with axes:
+1. **Read the axes** (your coordinate system)
+   - X-axis: label, range, units
+   - Y-axis: label, range, units
+2. **Read the data points** (estimate from visual position)
+   - Where does each line/bar/point fall on the grid?
+   - For step functions: each step is a data point
+   - For curves: sample ~10 points across the range
+   - Format: [[x1, y1], [x2, y2], ...]
+3. **Read the legend** (what each series represents)
+   - Color, line style, label
+4. **Multi-panel = multiple extractions**
+   - Panel A, B, C, D each get FULL extraction
+A chart without data_series values is like a table without cell values - incomplete.
 ### Step 3: STRUCTURE
 Output as JSON with this schema:
@@ -133,19 +189,35 @@ Output as JSON with this schema:
 }
 **For Figures (charts/graphs):**
+Figures are data, not decorations. Extract the actual values.
 {
-  "figure_type": "bar_chart|line_graph|scatter_plot|pie_chart|other",
-  "description": "Brief description of what the figure shows",
+  "figure_type": "bar_chart|line_graph|scatter_plot|kaplan_meier|forest_plot|other",
+  "description": "Brief description (SECONDARY to actual data)",
   "axes": {
-    "x": {"label": "Time (hours)", "range": [0, 24]},
-    "y": {"label": "Concentration (ng/mL)", "range": [0, 100]}
+    "x": {"label": "Time (days)", "range": [0, 7000], "unit": "days"},
+    "y": {"label": "Cumulative Risk", "range": [0, 0.6], "unit": null}
   },
-  "data_series": [
-    {"name": "Control", "color": "blue", "values": [[0, 10], [6, 45], [12, 80]]},
-    {"name": "Treatment", "color": "red", "values": [[0, 12], [6, 60], [12, 95]]}
+  "data_series": [  // REQUIRED - the actual data, not optional
+    {
+      "name": "HSV Group",
+      "color": "purple",
+      "line_type": "step",
+      "values": [[0, 0], [1000, 0.05], [2000, 0.08], [3000, 0.12], [4000, 0.16], [5000, 0.22], [6000, 0.32], [7000, 0.55]]
+    },
+    {
+      "name": "Control Group",
+      "color": "blue",
+      "line_type": "step",
+      "values": [[0, 0], [1000, 0.01], [2000, 0.02], [3000, 0.03], [4000, 0.04], [5000, 0.05], [6000, 0.06], [7000, 0.07]]
+    }
+  ],
+  "legend": ["HSV Group", "Control Group", "HSV 95% CI", "Control 95% CI"],
+  "confidence_bands": [
+    {"series": "HSV Group", "color": "purple shaded", "description": "95% CI"}
   ],
-  "legend": ["Control", "Treatment"],
-  "annotations": ["Arrow pointing to peak at t=12h"]
+  "annotations": []
 }
 **For Figures (gels/blots):**
@@ -199,6 +271,11 @@ Output as JSON with this schema:
 4. Low confidence (<0.8) = flag for review in notes
 5. For tables with merged cells, represent structure accurately
 6. Include ALL visible data points, not just a sample
+7. FIGURES ARE DATA, NOT DECORATIONS
+   - A chart/graph without data_series is INCOMPLETE
+   - Estimate numeric values from visual position on the grid
+   - Multi-panel figures need FULL extraction for EACH panel
+   - If you can see axes and curves, you can extract data
 ## Output
 Write your JSON to: {output_path}
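Note on rule 7 added in this hunk: the requirement that a chart without `data_series` is incomplete can be checked mechanically after a chunk agent writes its JSON. The sketch below is an assumption about how such a check might look, not part of the package; `figure_problems` is a hypothetical name, and the field names follow the chart/graph schema shown earlier in the diff.

```python
def figure_problems(figure: dict) -> list[str]:
    """Return a list of problems found in one chart/graph figure record."""
    problems = []
    series = figure.get("data_series") or []
    if not series:
        problems.append("data_series is missing or empty")

    # Fall back to unbounded ranges when an axis range is not given.
    unbounded = [float("-inf"), float("inf")]
    axes = figure.get("axes") or {}
    x_lo, x_hi = (axes.get("x") or {}).get("range", unbounded)
    y_lo, y_hi = (axes.get("y") or {}).get("range", unbounded)

    for entry in series:
        name = entry.get("name")
        values = entry.get("values") or []
        if not values:
            problems.append(f"series {name!r} has no values")
        for x, y in values:
            # Estimated points should still fall inside the stated axes.
            if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi):
                problems.append(f"series {name!r} point [{x}, {y}] is outside the axis ranges")
    return problems
```

An empty result means the record satisfies these two checks only; it says nothing about how accurate the visually estimated values are.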
@@ -207,13 +284,13 @@ Write your JSON to: {output_path}
 ### Step 5: Wait for Chunk Agents
 After launching all chunk agents, wait for them to complete.
-Each agent will write their chunk JSON to the chunks/ directory.
+Each agent will write their chunk JSON to the `_processing/chunks/` directory.
 ### Step 6: Merge Chunks
 Once all chunks are complete:
-1. Read all chunk JSON files from chunks/ directory
+1. Read all chunk JSON files from `_processing/chunks/` directory
 2. Merge into single structure with page offset correction:
 ```python
@@ -289,8 +366,72 @@ Create STRUCTURE.md with human-readable format:
 ---
 ```
-### Step 8: Display Completion
+### Step 8: Clean Up Intermediate Files
+After generating the final outputs, clean up intermediate files **unless --verbose flag was provided**.
+**Default behavior (VERBOSE_MODE=false):**
+1. Move any extracted images from `_processing/` to `images/` directory
+2. Delete the entire `_processing/` directory and its contents:
+   - `_processing/chunks/` - intermediate chunk JSON files
+   - `_processing/pages/` - per-page images (if generated)
+   - `_processing/debug/` - any debug logs
+```bash
+# Move images if they exist
+if [ -d "_processing/images" ]; then
+    mv _processing/images ./images
+fi
+# Remove processing directory
+rm -rf _processing/
+```
+**Verbose behavior (VERBOSE_MODE=true):**
+1. Move intermediate files to permanent locations:
+   - `_processing/chunks/` → `chunks/`
+   - `_processing/pages/` → `pages/`
+   - `_processing/images/` → `images/`
+   - `_processing/debug/` → `debug/`
+2. Keep all files for debugging/inspection
+```bash
+# Move to permanent locations
+mv _processing/chunks ./chunks 2>/dev/null
+mv _processing/pages ./pages 2>/dev/null
+mv _processing/images ./images 2>/dev/null
+mv _processing/debug ./debug 2>/dev/null
+# Remove empty processing directory
+rmdir _processing 2>/dev/null
+```
+**Final output structure:**
+Default (clean):
+```
+<filename>_extracted/
+├── structure.json    # 28 KB - complete machine-readable data
+├── STRUCTURE.md      # 12 KB - human-readable summary
+└── images/           # Extracted figures (if any)
+```
+With --verbose:
+```
+<filename>_extracted/
+├── structure.json
+├── STRUCTURE.md
+├── images/
+├── chunks/           # Per-chunk JSON files
+├── pages/            # Per-page images
+└── debug/            # Processing logs
+```
+### Step 9: Display Completion
+**Default (clean) completion message:**
 ```
 ┌──────────────────────────────────────────────────────────────────────┐
 │  EXTRACTION COMPLETE                                                 │
@@ -300,9 +441,34 @@ Create STRUCTURE.md with human-readable format:
 │  Pages: {total} | Tables: {count} | Figures: {count}                 │
 │  Average Confidence: {score}                                         │
 │                                                                      │
+│  Output (2 files):                                                   │
+│    {output_dir}/structure.json    (machine-readable)                 │
+│    {output_dir}/STRUCTURE.md      (human-readable)                   │
+│                                                                      │
+│  Low Confidence Items: {count}                                       │
+│  {List any elements with confidence < 0.8}                           │
+│                                                                      │
+│  Tip: Use --verbose to keep intermediate files                       │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+**Verbose completion message:**
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│  EXTRACTION COMPLETE (verbose mode)                                  │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│  Source: {filename}                                                  │
+│  Pages: {total} | Tables: {count} | Figures: {count}                 │
+│  Average Confidence: {score}                                         │
+│                                                                      │
 │  Output:                                                             │
-│    {output_dir}/structure.json                                       │
-│    {output_dir}/STRUCTURE.md                                         │
+│    {output_dir}/structure.json    (final merged JSON)                │
+│    {output_dir}/STRUCTURE.md      (human-readable summary)           │
+│    {output_dir}/chunks/           (intermediate chunk files)         │
+│    {output_dir}/pages/            (per-page images)                  │
+│    {output_dir}/images/           (extracted figures)                │
 │                                                                      │
 │  Low Confidence Items: {count}                                       │
 │  {List any elements with confidence < 0.8}                           │
@@ -324,3 +490,6 @@ Create STRUCTURE.md with human-readable format:
 - Each chunk agent has 200K context - plenty for 5 pages
 - Chunks preserve figure-caption relationships (usually within same chunk)
 - Edge cases (figure on page 5, caption on page 6) are rare but detectable
+- **Default output is clean** (~40 KB total): structure.json + STRUCTURE.md + images/
+- Use `--verbose` to keep all intermediate files for debugging
+- Intermediate files are processed in `_processing/` directory during extraction
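
Note on the Step 8 cleanup added above: the two bash snippets could be expressed as one function, sketched here in Python under stated assumptions. `finalize_output` is a hypothetical name. Unlike the `mv` commands in the diff, this version merges into a destination directory that already exists rather than nesting the moved directory inside it, which is a deliberate deviation and not the package's behaviour.

```python
import shutil
from pathlib import Path


def finalize_output(output_dir: Path, verbose: bool) -> None:
    """Promote wanted intermediates out of _processing/, then remove it."""
    processing = output_dir / "_processing"
    if not processing.is_dir():
        return
    # Default mode keeps only images; verbose keeps every intermediate.
    keep = ("chunks", "pages", "images", "debug") if verbose else ("images",)
    for name in keep:
        src = processing / name
        if not src.is_dir():
            continue
        dest = output_dir / name
        dest.mkdir(exist_ok=True)
        for item in src.iterdir():
            shutil.move(str(item), str(dest / item.name))
        src.rmdir()  # now empty
    if verbose:
        # Mirror `rmdir`: only remove _processing/ if nothing is left in it.
        try:
            processing.rmdir()
        except OSError:
            pass
    else:
        # Mirror `rm -rf`: discard all remaining intermediates.
        shutil.rmtree(processing)
```

Calling `finalize_output(output_dir, verbose=False)` should leave only `structure.json`, `STRUCTURE.md`, and `images/` (if any figures were extracted), matching the default layout described in the diff.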

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "structurecc",
-  "version": "3.1.0",
+  "version": "3.3.0",
   "description": "Claude Code plugin for extracting structured data from documents using native vision and parallel Task agents",
   "author": "UTMB Diagnostic Center",
   "license": "MIT",