structurecc 3.1.0 → 3.3.0

Files changed (2)
  1. package/commands/structure.md +190 -21
  2. package/package.json +1 -1
@@ -1,6 +1,6 @@
  ---
  description: Extract structured data from documents (PDF, DOCX, images) using Claude vision
- argument-hint: <path>
+ argument-hint: <path> [--verbose]
  ---

  # Document Structure Extraction
@@ -11,13 +11,23 @@ You are extracting structured data from a document using Claude's native vision

  **Document path:** $ARGUMENTS

+ ## Flags
+
+ - `--verbose` - Keep all intermediate files (chunks/, pages/, debug logs). Default behavior is clean output only.
+
+ Parse the arguments to detect `--verbose`:
+ - If `--verbose` is present anywhere in $ARGUMENTS, set VERBOSE_MODE=true
+ - Remove `--verbose` from the path to get the actual document path
+ - Default: VERBOSE_MODE=false (clean output)
+
  ## Workflow

  ### Step 1: Validate Input

- 1. Check if the file exists at the provided path
- 2. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
- 3. If the path is invalid, inform the user and stop
+ 1. Parse arguments for `--verbose` flag
+ 2. Check if the file exists at the provided path
+ 3. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
+ 4. If the path is invalid, inform the user and stop

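Reviewer note: the flag handling described in the new `## Flags` section and in Step 1 might look like the following in Python. This is only a sketch, not code from the package. The function name `parse_arguments` is hypothetical, and it assumes the raw argument string uses POSIX-style quoting for paths that contain spaces.

```python
import shlex


def parse_arguments(raw: str) -> tuple[str, bool]:
    """Split a raw argument string into (document_path, verbose_mode).

    Hypothetical helper: the diff only says that `--verbose` may appear
    anywhere in the arguments and must be removed from the path.
    """
    tokens = shlex.split(raw)  # assumption: POSIX-style quoting
    verbose = "--verbose" in tokens
    # Everything that is not the flag is treated as part of the path.
    path_tokens = [token for token in tokens if token != "--verbose"]
    return " ".join(path_tokens), verbose
```

With this sketch, `parse_arguments("report.pdf --verbose")` returns `("report.pdf", True)`, and the flag is detected whether it comes before or after the path.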
  ### Step 2: Determine Processing Strategy

@@ -40,13 +50,28 @@ Based on document type:
  ### Step 3: Create Output Directory

  Create output directory structure:
+
+ **Default (clean output):**
+ ```
+ <source_dir>/<filename>_extracted/
+ ├── structure.json # Final merged JSON (machine-readable)
+ ├── STRUCTURE.md # Human-readable markdown summary
+ └── images/ # Extracted figures (if any)
+ ```
+
+ **With --verbose (keeps intermediates):**
  ```
  <source_dir>/<filename>_extracted/
- ├── chunks/ # Individual chunk JSON files
  ├── structure.json # Final merged JSON
- └── STRUCTURE.md # Human-readable markdown
+ ├── STRUCTURE.md # Human-readable markdown
+ ├── images/ # Extracted figures (if any)
+ ├── chunks/ # Individual chunk JSON files
+ ├── pages/ # Per-page PNG images (if generated)
+ └── debug/ # Processing logs
  ```

+ During processing, create a temporary `_processing/` subdirectory for intermediate files. This will be cleaned up at the end (unless --verbose).
+
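Reviewer note: one possible reading of the layout above, sketched in Python. The helper `prepare_output_dirs` is hypothetical. The diff does not say whether `<filename>` keeps the file extension, so this sketch assumes it is the name without the extension.

```python
from pathlib import Path


def prepare_output_dirs(document: Path) -> tuple[Path, Path]:
    """Create <source_dir>/<filename>_extracted/ and its _processing/ staging area.

    Assumption: <filename> is the document name without its extension.
    """
    output_dir = document.parent / f"{document.stem}_extracted"
    processing_dir = output_dir / "_processing"
    # Chunk agents write their JSON under _processing/chunks/.
    (processing_dir / "chunks").mkdir(parents=True, exist_ok=True)
    return output_dir, processing_dir
```

Staging everything under a single `_processing/` directory means the default cleanup only has to remove one directory tree once the final files are written.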
  ### Step 4: Launch Chunk Agents (Parallel)

  For each chunk, launch a Task agent with subagent_type="general-purpose":
@@ -57,7 +82,7 @@ Each agent receives:
  1. The document path
  2. Their assigned page range (e.g., pages 1-5)
  3. The chunk extractor prompt (embedded below)
- 4. Output path for their chunk JSON
+ 4. Output path for their chunk JSON (write to `_processing/chunks/` subdirectory)

  **Chunk Extractor Prompt for Agents:**

@@ -69,6 +94,12 @@ You are extracting structured data from a document chunk.
  - Pages: {start_page} to {end_page} (or "all" for images/small docs)
  - Output: Write JSON to {output_path}

+ ## Core Principle
+
+ **Figures are data, not decorations.**
+
+ A chart is a visual table. A graph is data in disguise. When you see axes and curves, you see data waiting to be extracted - not an image to describe.
+
  ## Protocol

  ### Step 1: SCAN
@@ -76,6 +107,7 @@ Read the document using the Read tool. Look at ALL pages in your assigned range.
  List every element you see:
  - Tables (data tables, comparison tables, any tabular data)
  - Figures (charts, graphs, images, diagrams, gels, blots, plots)
+ **Figures are data, not decorations. Read them like you read tables.**
  - Text blocks (paragraphs, headers, captions)
  - Equations (mathematical formulas)
  - Lists (bulleted, numbered)
@@ -94,6 +126,30 @@ Rules:
  - Include ALL flags, markers, asterisks, annotations
  - Transcribe units exactly as shown (mg/dL, not mg per dL)

+ ### Step 2B: EXTRACT FIGURE DATA
+
+ **Figures are data, not decorations.**
+
+ For every figure with axes:
+
+ 1. **Read the axes** (your coordinate system)
+ - X-axis: label, range, units
+ - Y-axis: label, range, units
+
+ 2. **Read the data points** (estimate from visual position)
+ - Where does each line/bar/point fall on the grid?
+ - For step functions: each step is a data point
+ - For curves: sample ~10 points across the range
+ - Format: [[x1, y1], [x2, y2], ...]
+
+ 3. **Read the legend** (what each series represents)
+ - Color, line style, label
+
+ 4. **Multi-panel = multiple extractions**
+ - Panel A, B, C, D each get FULL extraction
+
+ A chart without data_series values is like a table without cell values - incomplete.
+
  ### Step 3: STRUCTURE
  Output as JSON with this schema:

@@ -133,19 +189,35 @@ Output as JSON with this schema:
  }

  **For Figures (charts/graphs):**
+
+ Figures are data, not decorations. Extract the actual values.
+
  {
- "figure_type": "bar_chart|line_graph|scatter_plot|pie_chart|other",
- "description": "Brief description of what the figure shows",
+ "figure_type": "bar_chart|line_graph|scatter_plot|kaplan_meier|forest_plot|other",
+ "description": "Brief description (SECONDARY to actual data)",
  "axes": {
- "x": {"label": "Time (hours)", "range": [0, 24]},
- "y": {"label": "Concentration (ng/mL)", "range": [0, 100]}
+ "x": {"label": "Time (days)", "range": [0, 7000], "unit": "days"},
+ "y": {"label": "Cumulative Risk", "range": [0, 0.6], "unit": null}
  },
- "data_series": [
- {"name": "Control", "color": "blue", "values": [[0, 10], [6, 45], [12, 80]]},
- {"name": "Treatment", "color": "red", "values": [[0, 12], [6, 60], [12, 95]]}
+ "data_series": [ // REQUIRED - the actual data, not optional
+ {
+ "name": "HSV Group",
+ "color": "purple",
+ "line_type": "step",
+ "values": [[0, 0], [1000, 0.05], [2000, 0.08], [3000, 0.12], [4000, 0.16], [5000, 0.22], [6000, 0.32], [7000, 0.55]]
+ },
+ {
+ "name": "Control Group",
+ "color": "blue",
+ "line_type": "step",
+ "values": [[0, 0], [1000, 0.01], [2000, 0.02], [3000, 0.03], [4000, 0.04], [5000, 0.05], [6000, 0.06], [7000, 0.07]]
+ }
+ ],
+ "legend": ["HSV Group", "Control Group", "HSV 95% CI", "Control 95% CI"],
+ "confidence_bands": [
+ {"series": "HSV Group", "color": "purple shaded", "description": "95% CI"}
  ],
- "legend": ["Control", "Treatment"],
- "annotations": ["Arrow pointing to peak at t=12h"]
+ "annotations": []
  }

  **For Figures (gels/blots):**
@@ -199,6 +271,11 @@ Output as JSON with this schema:
  4. Low confidence (<0.8) = flag for review in notes
  5. For tables with merged cells, represent structure accurately
  6. Include ALL visible data points, not just a sample
+ 7. FIGURES ARE DATA, NOT DECORATIONS
+ - A chart/graph without data_series is INCOMPLETE
+ - Estimate numeric values from visual position on the grid
+ - Multi-panel figures need FULL extraction for EACH panel
+ - If you can see axes and curves, you can extract data

  ## Output
  Write your JSON to: {output_path}
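Reviewer note: the new rule that a chart without `data_series` is incomplete could be checked mechanically once a chunk JSON has been written. The sketch below is not part of the package. It uses the field names from the figure schema shown in this diff (`axes`, `range`, `data_series`, `values`), while the function `figure_problems` and the axis-range check are assumptions added for illustration.

```python
def figure_problems(figure: dict) -> list[str]:
    """Return human-readable problems found in one extracted figure record."""
    problems: list[str] = []
    axes = figure.get("axes") or {}
    series_list = figure.get("data_series") or []
    if not series_list:
        problems.append("data_series is missing or empty")
    for series in series_list:
        name = series.get("name", "<unnamed>")
        values = series.get("values") or []
        if not values:
            problems.append(f"series {name!r} has no values")
        for point in values:
            # Each point is expected in the [x, y] form from the schema.
            if not (isinstance(point, (list, tuple)) and len(point) == 2):
                problems.append(f"series {name!r}: malformed point {point!r}")
                continue
            for axis_key, coord in zip(("x", "y"), point):
                axis_range = (axes.get(axis_key) or {}).get("range")
                if not isinstance(coord, (int, float)):
                    problems.append(f"series {name!r}: non-numeric {axis_key} value {coord!r}")
                elif axis_range and not axis_range[0] <= coord <= axis_range[1]:
                    # Assumption: estimated points should fall inside the stated axis range.
                    problems.append(f"series {name!r}: {axis_key}={coord} outside {axis_range}")
    return problems
```

An empty list means the figure passes these checks. Any returned messages could be written to the chunk's notes so that the figure is flagged for review.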
@@ -207,13 +284,13 @@ Write your JSON to: {output_path}
  ### Step 5: Wait for Chunk Agents

  After launching all chunk agents, wait for them to complete.
- Each agent will write their chunk JSON to the chunks/ directory.
+ Each agent will write their chunk JSON to the `_processing/chunks/` directory.

  ### Step 6: Merge Chunks

  Once all chunks are complete:

- 1. Read all chunk JSON files from chunks/ directory
+ 1. Read all chunk JSON files from `_processing/chunks/` directory
  2. Merge into single structure with page offset correction:

  ```python
@@ -289,8 +366,72 @@ Create STRUCTURE.md with human-readable format:
  ---
  ```

- ### Step 8: Display Completion
+ ### Step 8: Clean Up Intermediate Files
+
+ After generating the final outputs, clean up intermediate files **unless --verbose flag was provided**.
+
+ **Default behavior (VERBOSE_MODE=false):**
+
+ 1. Move any extracted images from `_processing/` to `images/` directory
+ 2. Delete the entire `_processing/` directory and its contents:
+ - `_processing/chunks/` - intermediate chunk JSON files
+ - `_processing/pages/` - per-page images (if generated)
+ - `_processing/debug/` - any debug logs
+
+ ```bash
+ # Move images if they exist
+ if [ -d "_processing/images" ]; then
+ mv _processing/images ./images
+ fi

+ # Remove processing directory
+ rm -rf _processing/
+ ```
+
+ **Verbose behavior (VERBOSE_MODE=true):**
+
+ 1. Move intermediate files to permanent locations:
+ - `_processing/chunks/` → `chunks/`
+ - `_processing/pages/` → `pages/`
+ - `_processing/images/` → `images/`
+ - `_processing/debug/` → `debug/`
+ 2. Keep all files for debugging/inspection
+
+ ```bash
+ # Move to permanent locations
+ mv _processing/chunks ./chunks 2>/dev/null
+ mv _processing/pages ./pages 2>/dev/null
+ mv _processing/images ./images 2>/dev/null
+ mv _processing/debug ./debug 2>/dev/null
+
+ # Remove empty processing directory
+ rmdir _processing 2>/dev/null
+ ```
+
+ **Final output structure:**
+
+ Default (clean):
+ ```
+ <filename>_extracted/
+ ├── structure.json # 28 KB - complete machine-readable data
+ ├── STRUCTURE.md # 12 KB - human-readable summary
+ └── images/ # Extracted figures (if any)
+ ```
+
+ With --verbose:
+ ```
+ <filename>_extracted/
+ ├── structure.json
+ ├── STRUCTURE.md
+ ├── images/
+ ├── chunks/ # Per-chunk JSON files
+ ├── pages/ # Per-page images
+ └── debug/ # Processing logs
+ ```
+
+ ### Step 9: Display Completion
+
+ **Default (clean) completion message:**
  ```
  ┌──────────────────────────────────────────────────────────────────────┐
  │ EXTRACTION COMPLETE │
@@ -300,9 +441,34 @@ Create STRUCTURE.md with human-readable format:
  │ Pages: {total} | Tables: {count} | Figures: {count} │
  │ Average Confidence: {score} │
  │ │
+ │ Output (2 files): │
+ │ {output_dir}/structure.json (machine-readable) │
+ │ {output_dir}/STRUCTURE.md (human-readable) │
+ │ │
+ │ Low Confidence Items: {count} │
+ │ {List any elements with confidence < 0.8} │
+ │ │
+ │ Tip: Use --verbose to keep intermediate files │
+ │ │
+ └──────────────────────────────────────────────────────────────────────┘
+ ```
+
+ **Verbose completion message:**
+ ```
+ ┌──────────────────────────────────────────────────────────────────────┐
+ │ EXTRACTION COMPLETE (verbose mode) │
+ ├──────────────────────────────────────────────────────────────────────┤
+ │ │
+ │ Source: {filename} │
+ │ Pages: {total} | Tables: {count} | Figures: {count} │
+ │ Average Confidence: {score} │
+ │ │
  │ Output: │
- │ {output_dir}/structure.json
- │ {output_dir}/STRUCTURE.md
+ │ {output_dir}/structure.json (final merged JSON)
+ │ {output_dir}/STRUCTURE.md (human-readable summary)
+ │ {output_dir}/chunks/ (intermediate chunk files) │
+ │ {output_dir}/pages/ (per-page images) │
+ │ {output_dir}/images/ (extracted figures) │
  │ │
  │ Low Confidence Items: {count} │
  │ {List any elements with confidence < 0.8} │
@@ -324,3 +490,6 @@ Create STRUCTURE.md with human-readable format:
  - Each chunk agent has 200K context - plenty for 5 pages
  - Chunks preserve figure-caption relationships (usually within same chunk)
  - Edge cases (figure on page 5, caption on page 6) are rare but detectable
+ - **Default output is clean** (~40 KB total): structure.json + STRUCTURE.md + images/
+ - Use `--verbose` to keep all intermediate files for debugging
+ - Intermediate files are processed in `_processing/` directory during extraction
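Reviewer note: the Step 8 cleanup added in this release is written as two bash snippets. A rough Python equivalent is sketched below for readers who want to see both modes in one place. The function `finalize_output` is hypothetical and not part of the package. Unlike the `mv` commands in the diff, it skips a move when the target directory already exists, which is an assumption made to avoid nesting directories.

```python
import shutil
from pathlib import Path

# Intermediate directories that are kept only in verbose mode.
INTERMEDIATE_DIRS = ("chunks", "pages", "debug")


def finalize_output(output_dir: Path, verbose: bool) -> None:
    """Promote results out of _processing/ and clean up, following Step 8."""
    processing = output_dir / "_processing"
    if not processing.is_dir():
        return
    # images/ is always kept; the intermediates only with --verbose.
    keep = ("images",) + (INTERMEDIATE_DIRS if verbose else ())
    for name in keep:
        source = processing / name
        target = output_dir / name
        if source.is_dir() and not target.exists():
            shutil.move(str(source), str(target))
    if verbose:
        # Mirror `rmdir`: remove the staging directory only if it is empty.
        try:
            processing.rmdir()
        except OSError:
            pass
    else:
        # Mirror `rm -rf _processing/`.
        shutil.rmtree(processing)
```

In verbose mode the staging directory is removed only when empty, so anything that could not be moved stays on disk for inspection.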
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "structurecc",
- "version": "3.1.0",
+ "version": "3.3.0",
  "description": "Claude Code plugin for extracting structured data from documents using native vision and parallel Task agents",
  "author": "UTMB Diagnostic Center",
  "license": "MIT",