structurecc 3.1.0 → 3.3.0
- package/commands/structure.md +190 -21
- package/package.json +1 -1
package/commands/structure.md
CHANGED
@@ -1,6 +1,6 @@
 ---
 description: Extract structured data from documents (PDF, DOCX, images) using Claude vision
-argument-hint: <path>
+argument-hint: <path> [--verbose]
 ---
 
 # Document Structure Extraction
@@ -11,13 +11,23 @@ You are extracting structured data from a document using Claude's native vision
 
 **Document path:** $ARGUMENTS
 
+## Flags
+
+- `--verbose` - Keep all intermediate files (chunks/, pages/, debug logs). Default behavior is clean output only.
+
+Parse the arguments to detect `--verbose`:
+- If `--verbose` is present anywhere in $ARGUMENTS, set VERBOSE_MODE=true
+- Remove `--verbose` from the path to get the actual document path
+- Default: VERBOSE_MODE=false (clean output)
+
 ## Workflow
 
 ### Step 1: Validate Input
 
-1.
-2.
-3.
+1. Parse arguments for `--verbose` flag
+2. Check if the file exists at the provided path
+3. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
+4. If the path is invalid, inform the user and stop
 
 ### Step 2: Determine Processing Strategy
 
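The flag handling and input validation added in this hunk could be sketched in Python roughly as follows. This is an illustration only, not code from the package: the function names `parse_arguments` and `detect_file_type` and the extension map are assumptions, and the sketch assumes POSIX-style quoting in `$ARGUMENTS`.

```python
import shlex
from pathlib import Path

# Illustrative extension map (an assumption); the command text only lists
# "PDF, DOCX, PNG, JPG, TIFF, etc."
KNOWN_TYPES = {
    ".pdf": "pdf", ".docx": "docx", ".png": "image",
    ".jpg": "image", ".jpeg": "image", ".tif": "image", ".tiff": "image",
}


def parse_arguments(raw):
    """Split the raw argument string into (document_path, verbose_mode)."""
    tokens = shlex.split(raw)
    verbose = "--verbose" in tokens  # the flag may appear anywhere
    path_tokens = [t for t in tokens if t != "--verbose"]
    return " ".join(path_tokens), verbose


def detect_file_type(path_str):
    """Return a coarse file type, or raise if the path is not a file."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No document found at: {path}")
    return KNOWN_TYPES.get(path.suffix.lower(), "unknown")
```

For example, `parse_arguments("--verbose 'my report.pdf'")` yields the path `my report.pdf` with verbose mode on, which matches the "present anywhere" rule in the hunk.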
@@ -40,13 +50,28 @@ Based on document type:
 ### Step 3: Create Output Directory
 
 Create output directory structure:
+
+**Default (clean output):**
+```
+<source_dir>/<filename>_extracted/
+├── structure.json # Final merged JSON (machine-readable)
+├── STRUCTURE.md # Human-readable markdown summary
+└── images/ # Extracted figures (if any)
+```
+
+**With --verbose (keeps intermediates):**
 ```
 <source_dir>/<filename>_extracted/
-├── chunks/ # Individual chunk JSON files
 ├── structure.json # Final merged JSON
-
+├── STRUCTURE.md # Human-readable markdown
+├── images/ # Extracted figures (if any)
+├── chunks/ # Individual chunk JSON files
+├── pages/ # Per-page PNG images (if generated)
+└── debug/ # Processing logs
 ```
 
+During processing, create a temporary `_processing/` subdirectory for intermediate files. This will be cleaned up at the end (unless --verbose).
+
 ### Step 4: Launch Chunk Agents (Parallel)
 
 For each chunk, launch a Task agent with subagent_type="general-purpose":
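The Step 3 layout in the hunk above could be set up along these lines. This is a sketch under assumptions: the helper name `prepare_output_dirs` is hypothetical, `<filename>` is taken to mean the file name without its extension, and the `_processing/` subdirectory names are borrowed from the Step 8 cleanup list later in this diff.

```python
from pathlib import Path


def prepare_output_dirs(document):
    """Create <source_dir>/<filename>_extracted/ plus a temporary _processing/ tree."""
    document = Path(document)
    # Assumption: <filename> means the stem, e.g. paper.pdf -> paper_extracted/
    output_dir = document.parent / f"{document.stem}_extracted"
    processing = output_dir / "_processing"
    # Intermediate files live under _processing/ until the cleanup step runs
    for sub in ("chunks", "pages", "images", "debug"):
        (processing / sub).mkdir(parents=True, exist_ok=True)
    return output_dir, processing
```

Creating everything under `_processing/` first means the final directory contents depend only on what the cleanup step chooses to keep.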
@@ -57,7 +82,7 @@ Each agent receives:
 1. The document path
 2. Their assigned page range (e.g., pages 1-5)
 3. The chunk extractor prompt (embedded below)
-4. Output path for their chunk JSON
+4. Output path for their chunk JSON (write to `_processing/chunks/` subdirectory)
 
 **Chunk Extractor Prompt for Agents:**
 
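The per-agent inputs listed in this hunk (a page range and an output path under `_processing/chunks/`) could be planned as in the sketch below. The chunk size of 5 pages follows the "pages 1-5" example and the notes at the end of the file; the function name `plan_chunks` and the `chunk_NNN_NNN.json` file naming are assumptions, since the diff does not show how chunk files are named.

```python
from pathlib import Path


def plan_chunks(total_pages, processing_dir, pages_per_chunk=5):
    """Return one dict per chunk with its 1-based page range and output path."""
    chunks = []
    for start in range(1, total_pages + 1, pages_per_chunk):
        end = min(start + pages_per_chunk - 1, total_pages)
        # Hypothetical naming scheme; the real command may name files differently
        name = f"chunk_{start:03d}_{end:03d}.json"
        chunks.append({
            "start_page": start,
            "end_page": end,
            "output_path": str(Path(processing_dir) / "chunks" / name),
        })
    return chunks
```

A 12-page document would produce three chunks covering pages 1-5, 6-10, and 11-12.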
@@ -69,6 +94,12 @@ You are extracting structured data from a document chunk.
 - Pages: {start_page} to {end_page} (or "all" for images/small docs)
 - Output: Write JSON to {output_path}
 
+## Core Principle
+
+**Figures are data, not decorations.**
+
+A chart is a visual table. A graph is data in disguise. When you see axes and curves, you see data waiting to be extracted - not an image to describe.
+
 ## Protocol
 
 ### Step 1: SCAN
@@ -76,6 +107,7 @@ Read the document using the Read tool. Look at ALL pages in your assigned range.
 List every element you see:
 - Tables (data tables, comparison tables, any tabular data)
 - Figures (charts, graphs, images, diagrams, gels, blots, plots)
+  **Figures are data, not decorations. Read them like you read tables.**
 - Text blocks (paragraphs, headers, captions)
 - Equations (mathematical formulas)
 - Lists (bulleted, numbered)
@@ -94,6 +126,30 @@ Rules:
 - Include ALL flags, markers, asterisks, annotations
 - Transcribe units exactly as shown (mg/dL, not mg per dL)
 
+### Step 2B: EXTRACT FIGURE DATA
+
+**Figures are data, not decorations.**
+
+For every figure with axes:
+
+1. **Read the axes** (your coordinate system)
+   - X-axis: label, range, units
+   - Y-axis: label, range, units
+
+2. **Read the data points** (estimate from visual position)
+   - Where does each line/bar/point fall on the grid?
+   - For step functions: each step is a data point
+   - For curves: sample ~10 points across the range
+   - Format: [[x1, y1], [x2, y2], ...]
+
+3. **Read the legend** (what each series represents)
+   - Color, line style, label
+
+4. **Multi-panel = multiple extractions**
+   - Panel A, B, C, D each get FULL extraction
+
+A chart without data_series values is like a table without cell values - incomplete.
+
 ### Step 3: STRUCTURE
 Output as JSON with this schema:
 
@@ -133,19 +189,35 @@ Output as JSON with this schema:
 }
 
 **For Figures (charts/graphs):**
+
+Figures are data, not decorations. Extract the actual values.
+
 {
-  "figure_type": "bar_chart|line_graph|scatter_plot|
-  "description": "Brief description
+  "figure_type": "bar_chart|line_graph|scatter_plot|kaplan_meier|forest_plot|other",
+  "description": "Brief description (SECONDARY to actual data)",
   "axes": {
-    "x": {"label": "Time (
-    "y": {"label": "
+    "x": {"label": "Time (days)", "range": [0, 7000], "unit": "days"},
+    "y": {"label": "Cumulative Risk", "range": [0, 0.6], "unit": null}
   },
-  "data_series": [
-    {
-
+  "data_series": [ // REQUIRED - the actual data, not optional
+    {
+      "name": "HSV Group",
+      "color": "purple",
+      "line_type": "step",
+      "values": [[0, 0], [1000, 0.05], [2000, 0.08], [3000, 0.12], [4000, 0.16], [5000, 0.22], [6000, 0.32], [7000, 0.55]]
+    },
+    {
+      "name": "Control Group",
+      "color": "blue",
+      "line_type": "step",
+      "values": [[0, 0], [1000, 0.01], [2000, 0.02], [3000, 0.03], [4000, 0.04], [5000, 0.05], [6000, 0.06], [7000, 0.07]]
+    }
+  ],
+  "legend": ["HSV Group", "Control Group", "HSV 95% CI", "Control 95% CI"],
+  "confidence_bands": [
+    {"series": "HSV Group", "color": "purple shaded", "description": "95% CI"}
   ],
-  "
-  "annotations": ["Arrow pointing to peak at t=12h"]
+  "annotations": []
 }
 
 **For Figures (gels/blots):**
@@ -199,6 +271,11 @@ Output as JSON with this schema:
 4. Low confidence (<0.8) = flag for review in notes
 5. For tables with merged cells, represent structure accurately
 6. Include ALL visible data points, not just a sample
+7. FIGURES ARE DATA, NOT DECORATIONS
+   - A chart/graph without data_series is INCOMPLETE
+   - Estimate numeric values from visual position on the grid
+   - Multi-panel figures need FULL extraction for EACH panel
+   - If you can see axes and curves, you can extract data
 
 ## Output
 Write your JSON to: {output_path}
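Rule 7 above treats a chart or graph without `data_series` as incomplete. A check of that rule against the charts/graphs schema shown earlier in this diff might look like the sketch below. The function name `figure_problems` is hypothetical, the package is not shown to run any such check, and the sketch does not apply to the gels/blots schema.

```python
def figure_problems(figure):
    """Return a list of human-readable problems for one chart/graph figure dict."""
    problems = []
    series = figure.get("data_series") or []
    if not series:
        problems.append("missing data_series")
    for entry in series:
        name = entry.get("name", "<unnamed>")
        values = entry.get("values") or []
        if not values:
            problems.append(f"{name}: no values")
        elif not all(isinstance(p, (list, tuple)) and len(p) == 2 for p in values):
            # The schema formats points as [[x1, y1], [x2, y2], ...]
            problems.append(f"{name}: values are not [x, y] pairs")
    axes = figure.get("axes") or {}
    for axis in ("x", "y"):
        if axis not in axes:
            problems.append(f"missing {axis} axis")
    return problems
```

An empty result means the figure carries both axes and at least one series of `[x, y]` pairs; anything else could be flagged for review alongside low-confidence items.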
@@ -207,13 +284,13 @@ Write your JSON to: {output_path}
 ### Step 5: Wait for Chunk Agents
 
 After launching all chunk agents, wait for them to complete.
-Each agent will write their chunk JSON to the chunks
+Each agent will write their chunk JSON to the `_processing/chunks/` directory.
 
 ### Step 6: Merge Chunks
 
 Once all chunks are complete:
 
-1. Read all chunk JSON files from chunks
+1. Read all chunk JSON files from `_processing/chunks/` directory
 2. Merge into single structure with page offset correction:
 
 ```python
@@ -289,8 +366,72 @@ Create STRUCTURE.md with human-readable format:
 ---
 ```
 
-### Step 8:
+### Step 8: Clean Up Intermediate Files
+
+After generating the final outputs, clean up intermediate files **unless --verbose flag was provided**.
+
+**Default behavior (VERBOSE_MODE=false):**
+
+1. Move any extracted images from `_processing/` to `images/` directory
+2. Delete the entire `_processing/` directory and its contents:
+   - `_processing/chunks/` - intermediate chunk JSON files
+   - `_processing/pages/` - per-page images (if generated)
+   - `_processing/debug/` - any debug logs
+
+```bash
+# Move images if they exist
+if [ -d "_processing/images" ]; then
+  mv _processing/images ./images
+fi
 
+# Remove processing directory
+rm -rf _processing/
+```
+
+**Verbose behavior (VERBOSE_MODE=true):**
+
+1. Move intermediate files to permanent locations:
+   - `_processing/chunks/` → `chunks/`
+   - `_processing/pages/` → `pages/`
+   - `_processing/images/` → `images/`
+   - `_processing/debug/` → `debug/`
+2. Keep all files for debugging/inspection
+
+```bash
+# Move to permanent locations
+mv _processing/chunks ./chunks 2>/dev/null
+mv _processing/pages ./pages 2>/dev/null
+mv _processing/images ./images 2>/dev/null
+mv _processing/debug ./debug 2>/dev/null
+
+# Remove empty processing directory
+rmdir _processing 2>/dev/null
+```
+
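The two cleanup branches above could also be expressed as a single Python helper, sketched here under assumptions: the name `finalize_output` is hypothetical, and unlike the `mv` commands in the diff it moves directory contents one entry at a time. That choice avoids a known `mv` behavior where moving a directory onto an existing directory (for example an already-created `images/`) nests it as `images/images`.

```python
import shutil
from pathlib import Path


def finalize_output(output_dir, verbose):
    """Promote kept intermediates out of _processing/, then remove it."""
    output_dir = Path(output_dir)
    processing = output_dir / "_processing"
    if not processing.is_dir():
        return
    # Default mode keeps only images; verbose mode keeps everything
    keep = ("chunks", "pages", "images", "debug") if verbose else ("images",)
    for name in keep:
        src = processing / name
        if not src.is_dir():
            continue
        dest = output_dir / name
        dest.mkdir(exist_ok=True)
        for item in list(src.iterdir()):
            shutil.move(str(item), str(dest / item.name))
        src.rmdir()  # src is empty after the moves
    if verbose:
        try:
            processing.rmdir()  # like rmdir: only succeeds if nothing is left
        except OSError:
            pass
    else:
        shutil.rmtree(processing)  # like rm -rf _processing/
```

In default mode anything not explicitly kept is deleted with the `_processing/` tree; in verbose mode unexpected leftovers are preserved because the directory is only removed when empty.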
+**Final output structure:**
+
+Default (clean):
+```
+<filename>_extracted/
+├── structure.json # 28 KB - complete machine-readable data
+├── STRUCTURE.md # 12 KB - human-readable summary
+└── images/ # Extracted figures (if any)
+```
+
+With --verbose:
+```
+<filename>_extracted/
+├── structure.json
+├── STRUCTURE.md
+├── images/
+├── chunks/ # Per-chunk JSON files
+├── pages/ # Per-page images
+└── debug/ # Processing logs
+```
+
+### Step 9: Display Completion
+
+**Default (clean) completion message:**
 ```
 ┌──────────────────────────────────────────────────────────────────────┐
 │ EXTRACTION COMPLETE │
@@ -300,9 +441,34 @@ Create STRUCTURE.md with human-readable format:
 │ Pages: {total} | Tables: {count} | Figures: {count} │
 │ Average Confidence: {score} │
 │ │
+│ Output (2 files): │
+│ {output_dir}/structure.json (machine-readable) │
+│ {output_dir}/STRUCTURE.md (human-readable) │
+│ │
+│ Low Confidence Items: {count} │
+│ {List any elements with confidence < 0.8} │
+│ │
+│ Tip: Use --verbose to keep intermediate files │
+│ │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+**Verbose completion message:**
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ EXTRACTION COMPLETE (verbose mode) │
+├──────────────────────────────────────────────────────────────────────┤
+│ │
+│ Source: {filename} │
+│ Pages: {total} | Tables: {count} | Figures: {count} │
+│ Average Confidence: {score} │
+│ │
 │ Output: │
-│ {output_dir}/structure.json
-│ {output_dir}/STRUCTURE.md
+│ {output_dir}/structure.json (final merged JSON) │
+│ {output_dir}/STRUCTURE.md (human-readable summary) │
+│ {output_dir}/chunks/ (intermediate chunk files) │
+│ {output_dir}/pages/ (per-page images) │
+│ {output_dir}/images/ (extracted figures) │
 │ │
 │ Low Confidence Items: {count} │
 │ {List any elements with confidence < 0.8} │
@@ -324,3 +490,6 @@ Create STRUCTURE.md with human-readable format:
 - Each chunk agent has 200K context - plenty for 5 pages
 - Chunks preserve figure-caption relationships (usually within same chunk)
 - Edge cases (figure on page 5, caption on page 6) are rare but detectable
+- **Default output is clean** (~40 KB total): structure.json + STRUCTURE.md + images/
+- Use `--verbose` to keep all intermediate files for debugging
+- Intermediate files are processed in `_processing/` directory during extraction
package/package.json
CHANGED