structurecc 3.1.0 → 3.2.0
- package/commands/structure.md +129 -12
- package/package.json +1 -1
package/commands/structure.md
CHANGED
@@ -1,6 +1,6 @@
 ---
 description: Extract structured data from documents (PDF, DOCX, images) using Claude vision
-argument-hint: <path>
+argument-hint: <path> [--verbose]
 ---
 
 # Document Structure Extraction
@@ -11,13 +11,23 @@ You are extracting structured data from a document using Claude's native vision
 
 **Document path:** $ARGUMENTS
 
+## Flags
+
+- `--verbose` - Keep all intermediate files (chunks/, pages/, debug logs). Default behavior is clean output only.
+
+Parse the arguments to detect `--verbose`:
+- If `--verbose` is present anywhere in $ARGUMENTS, set VERBOSE_MODE=true
+- Remove `--verbose` from the path to get the actual document path
+- Default: VERBOSE_MODE=false (clean output)
+
 ## Workflow
 
 ### Step 1: Validate Input
 
-1.
-2.
-3.
+1. Parse arguments for `--verbose` flag
+2. Check if the file exists at the provided path
+3. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
+4. If the path is invalid, inform the user and stop
 
 ### Step 2: Determine Processing Strategy
 
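The flag handling added in this hunk (detect `--verbose` anywhere in the arguments, then strip it to recover the document path) could be sketched roughly as follows. This is only an illustration: the function name `parse_arguments` is hypothetical, and the use of shell-style quoting for paths with spaces is an assumption, since the command file does not say how quoted paths are handled.

```python
from __future__ import annotations

import shlex


def parse_arguments(raw_args: str) -> tuple[str, bool]:
    """Split a raw argument string into (document_path, verbose_mode).

    Hypothetical helper: the diffed file describes this behavior in prose only.
    """
    # Assumption: arguments follow shell-style quoting, so a quoted path
    # containing spaces stays together as a single token.
    tokens = shlex.split(raw_args)
    verbose = "--verbose" in tokens
    # Everything that is not the flag is treated as part of the path.
    path_tokens = [token for token in tokens if token != "--verbose"]
    return " ".join(path_tokens), verbose
```

For example, `parse_arguments("--verbose 'my report.pdf'")` yields the path `my report.pdf` with verbose mode on, matching the rule that the flag may appear anywhere in the arguments.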
@@ -40,13 +50,28 @@ Based on document type:
 ### Step 3: Create Output Directory
 
 Create output directory structure:
+
+**Default (clean output):**
+```
+<source_dir>/<filename>_extracted/
+├── structure.json # Final merged JSON (machine-readable)
+├── STRUCTURE.md # Human-readable markdown summary
+└── images/ # Extracted figures (if any)
+```
+
+**With --verbose (keeps intermediates):**
 ```
 <source_dir>/<filename>_extracted/
-├── chunks/ # Individual chunk JSON files
 ├── structure.json # Final merged JSON
-
+├── STRUCTURE.md # Human-readable markdown
+├── images/ # Extracted figures (if any)
+├── chunks/ # Individual chunk JSON files
+├── pages/ # Per-page PNG images (if generated)
+└── debug/ # Processing logs
 ```
 
+During processing, create a temporary `_processing/` subdirectory for intermediate files. This will be cleaned up at the end (unless --verbose).
+
 ### Step 4: Launch Chunk Agents (Parallel)
 
 For each chunk, launch a Task agent with subagent_type="general-purpose":
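The directory layout introduced in Step 3 of this hunk could be set up along these lines. This is a minimal sketch under stated assumptions: `prepare_output_dirs` is a hypothetical name, and deriving `<filename>` from the file stem (the name without its extension) is a guess, because the diff does not say whether the extension is kept.

```python
from __future__ import annotations

from pathlib import Path


def prepare_output_dirs(document_path: str) -> tuple[Path, Path]:
    """Create <source_dir>/<filename>_extracted/ plus its temporary _processing/ area.

    Hypothetical helper; returns (output_dir, processing_dir).
    """
    source = Path(document_path)
    # Assumption: <filename> means the stem, e.g. report.pdf -> report_extracted.
    output_dir = source.parent / f"{source.stem}_extracted"
    processing_dir = output_dir / "_processing"
    # Chunk agents write their JSON here during extraction.
    (processing_dir / "chunks").mkdir(parents=True, exist_ok=True)
    return output_dir, processing_dir
```

Creating `_processing/chunks/` up front means every chunk agent can be handed an output path that already exists, whichever mode is active.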
@@ -57,7 +82,7 @@ Each agent receives:
 1. The document path
 2. Their assigned page range (e.g., pages 1-5)
 3. The chunk extractor prompt (embedded below)
-4. Output path for their chunk JSON
+4. Output path for their chunk JSON (write to `_processing/chunks/` subdirectory)
 
 **Chunk Extractor Prompt for Agents:**
 
@@ -207,13 +232,13 @@ Write your JSON to: {output_path}
 ### Step 5: Wait for Chunk Agents
 
 After launching all chunk agents, wait for them to complete.
-Each agent will write their chunk JSON to the chunks
+Each agent will write their chunk JSON to the `_processing/chunks/` directory.
 
 ### Step 6: Merge Chunks
 
 Once all chunks are complete:
 
-1. Read all chunk JSON files from chunks
+1. Read all chunk JSON files from `_processing/chunks/` directory
 2. Merge into single structure with page offset correction:
 
 ```python
@@ -289,8 +314,72 @@ Create STRUCTURE.md with human-readable format:
 ---
 ```
 
-### Step 8:
+### Step 8: Clean Up Intermediate Files
+
+After generating the final outputs, clean up intermediate files **unless --verbose flag was provided**.
+
+**Default behavior (VERBOSE_MODE=false):**
+
+1. Move any extracted images from `_processing/` to `images/` directory
+2. Delete the entire `_processing/` directory and its contents:
+- `_processing/chunks/` - intermediate chunk JSON files
+- `_processing/pages/` - per-page images (if generated)
+- `_processing/debug/` - any debug logs
+
+```bash
+# Move images if they exist
+if [ -d "_processing/images" ]; then
+mv _processing/images ./images
+fi
+
+# Remove processing directory
+rm -rf _processing/
+```
+
+**Verbose behavior (VERBOSE_MODE=true):**
+
+1. Move intermediate files to permanent locations:
+- `_processing/chunks/` → `chunks/`
+- `_processing/pages/` → `pages/`
+- `_processing/images/` → `images/`
+- `_processing/debug/` → `debug/`
+2. Keep all files for debugging/inspection
+
+```bash
+# Move to permanent locations
+mv _processing/chunks ./chunks 2>/dev/null
+mv _processing/pages ./pages 2>/dev/null
+mv _processing/images ./images 2>/dev/null
+mv _processing/debug ./debug 2>/dev/null
+
+# Remove empty processing directory
+rmdir _processing 2>/dev/null
+```
+
+**Final output structure:**
 
+Default (clean):
+```
+<filename>_extracted/
+├── structure.json # 28 KB - complete machine-readable data
+├── STRUCTURE.md # 12 KB - human-readable summary
+└── images/ # Extracted figures (if any)
+```
+
+With --verbose:
+```
+<filename>_extracted/
+├── structure.json
+├── STRUCTURE.md
+├── images/
+├── chunks/ # Per-chunk JSON files
+├── pages/ # Per-page images
+└── debug/ # Processing logs
+```
+
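The two cleanup modes that Step 8 describes with `mv`, `rm -rf`, and `rmdir` could be combined into one routine, sketched here in Python. The names `finalize_output` and `_move_dir` are hypothetical, and merging into an already existing destination directory is an assumption that goes beyond what the shell snippets specify.

```python
import shutil
from pathlib import Path

# Intermediate directories that survive only in verbose mode.
INTERMEDIATE_DIRS = ("chunks", "pages", "debug")


def _move_dir(src: Path, dest: Path) -> None:
    """Move the contents of src into dest, then remove the emptied src."""
    if not src.is_dir():
        return
    dest.mkdir(parents=True, exist_ok=True)
    # Materialize the listing first so moving entries cannot disturb iteration.
    for child in list(src.iterdir()):
        shutil.move(str(child), str(dest / child.name))
    src.rmdir()


def finalize_output(output_dir: Path, verbose: bool) -> None:
    """Apply the Step 8 cleanup to output_dir/_processing (hypothetical helper)."""
    processing = output_dir / "_processing"
    if not processing.is_dir():
        return
    # Extracted images are kept in both modes.
    keep = ("images",) + (INTERMEDIATE_DIRS if verbose else ())
    for name in keep:
        _move_dir(processing / name, output_dir / name)
    if verbose:
        # Mirror `rmdir`: only remove the directory if nothing is left in it.
        try:
            processing.rmdir()
        except OSError:
            pass
    else:
        # Mirror `rm -rf`: discard every remaining intermediate file.
        shutil.rmtree(processing)
```

Moving `images/` before deleting `_processing/` is the ordering that matters in default mode; reversing the two steps would discard the extracted figures.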
+### Step 9: Display Completion
+
+**Default (clean) completion message:**
 ```
 ┌──────────────────────────────────────────────────────────────────────┐
 │ EXTRACTION COMPLETE │
@@ -300,9 +389,34 @@ Create STRUCTURE.md with human-readable format:
 │ Pages: {total} | Tables: {count} | Figures: {count} │
 │ Average Confidence: {score} │
 │ │
+│ Output (2 files): │
+│ {output_dir}/structure.json (machine-readable) │
+│ {output_dir}/STRUCTURE.md (human-readable) │
+│ │
+│ Low Confidence Items: {count} │
+│ {List any elements with confidence < 0.8} │
+│ │
+│ Tip: Use --verbose to keep intermediate files │
+│ │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+**Verbose completion message:**
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ EXTRACTION COMPLETE (verbose mode) │
+├──────────────────────────────────────────────────────────────────────┤
+│ │
+│ Source: {filename} │
+│ Pages: {total} | Tables: {count} | Figures: {count} │
+│ Average Confidence: {score} │
+│ │
 │ Output: │
-│ {output_dir}/structure.json
-│ {output_dir}/STRUCTURE.md
+│ {output_dir}/structure.json (final merged JSON) │
+│ {output_dir}/STRUCTURE.md (human-readable summary) │
+│ {output_dir}/chunks/ (intermediate chunk files) │
+│ {output_dir}/pages/ (per-page images) │
+│ {output_dir}/images/ (extracted figures) │
 │ │
 │ Low Confidence Items: {count} │
 │ {List any elements with confidence < 0.8} │
@@ -324,3 +438,6 @@ Create STRUCTURE.md with human-readable format:
 - Each chunk agent has 200K context - plenty for 5 pages
 - Chunks preserve figure-caption relationships (usually within same chunk)
 - Edge cases (figure on page 5, caption on page 6) are rare but detectable
+- **Default output is clean** (~40 KB total): structure.json + STRUCTURE.md + images/
+- Use `--verbose` to keep all intermediate files for debugging
+- Intermediate files are processed in `_processing/` directory during extraction
package/package.json
CHANGED