structurecc 2.0.5 → 3.0.0

@@ -0,0 +1,27 @@
+ {
+   "name": "structurecc",
+   "version": "1.0.0",
+   "description": "Extract structured data from documents using Claude vision and parallel Task agents",
+   "author": "UTMB Diagnostic Center",
+   "skills": [
+     {
+       "name": "structure",
+       "description": "Extract structured data from a document (PDF, DOCX, image)",
+       "command": "structure",
+       "file": "commands/structure.md",
+       "argumentHint": "<path> [--output dir]",
+       "userInvocable": true
+     },
+     {
+       "name": "structure-batch",
+       "description": "Extract structured data from multiple documents in a directory",
+       "command": "structure:batch",
+       "file": "commands/structure-batch.md",
+       "argumentHint": "<directory> [--output dir]",
+       "userInvocable": true
+     }
+   ],
+   "prompts": {
+     "chunk-extractor": "prompts/chunk-extractor.md"
+   }
+ }
package/README.md CHANGED
@@ -1,106 +1,196 @@
- <h1 align="center">STRUCTURE</h1>
+ # structurecc
 
- <p align="center">
- <strong>Extract structured data from PDFs, Word docs, and images using Claude Code.</strong>
- </p>
+ Document Structure Extraction for Claude Code
 
- <p align="center">
- <a href="https://www.npmjs.com/package/structurecc"><img src="https://img.shields.io/npm/v/structurecc.svg" alt="npm version"></a>
- <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
- </p>
+ Extract structured data from PDFs, Word documents, and images using Claude's native vision capabilities and parallel Task agents.
 
- <p align="center">
- <img src="assets/terminal.png" alt="structurecc" width="550">
- </p>
+ ## Installation
 
- ---
-
- ## Requirements
-
- - **Node.js** - [nodejs.org](https://nodejs.org/)
- - **Claude Code** - Requires API key or Pro/Max subscription
+ ```bash
+ npx structurecc
+ ```
 
- ---
+ This installs the plugin to `~/.claude/plugins/structurecc/`.
 
- ## Install
+ ## Usage
 
- ### Step 1: Install Claude Code
+ ### Single Document
 
  ```bash
- npm install -g @anthropic-ai/claude-code
+ /structure document.pdf
+ /structure lab_image.png
+ /structure report.docx
  ```
 
- <p align="center">
- <img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
- </p>
-
- ### Step 2: Install structurecc
+ ### Batch Processing
 
  ```bash
- npx structurecc
+ /structure:batch ./documents/
+ /structure:batch ./patient_files/ --output ./extracted/
  ```
 
- <p align="center">
- <img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
- </p>
+ ## Supported Formats
 
- ### Step 3: Start Claude Code
+ | Format | Extension | Notes |
+ |--------|-----------|-------|
+ | PDF | `.pdf` | Multi-page supported, chunked for large documents |
+ | Word | `.docx`, `.doc` | Text and embedded images extracted |
+ | Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` | Single-page extraction |
 
- Navigate to your document folder and run:
+ ## Output
+
+ For each document, structurecc generates:
 
- ```bash
- cd ~/Desktop/documents
- claude
+ ```
+ document_extracted/
+ ├── chunks/ # Individual chunk extractions (for debugging)
+ ├── structure.json # Complete structured extraction
+ └── STRUCTURE.md # Human-readable markdown summary
  ```
 
- <p align="center">
- <img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
- </p>
+ ### structure.json
+
+ ```json
+ {
+   "source": "/path/to/document.pdf",
+   "extracted": "2026-01-30T14:30:22Z",
+   "pages": [
+     {
+       "page": 1,
+       "elements": [
+         {
+           "id": "element_1",
+           "type": "table",
+           "title": "Table 1. Lab Results",
+           "data": {
+             "headers": ["Test", "Result", "Units", "Reference"],
+             "rows": [
+               ["Glucose", "126", "mg/dL", "70-100"]
+             ]
+           },
+           "confidence": 0.98
+         }
+       ]
+     }
+   ],
+   "summary": {
+     "total_pages": 5,
+     "tables": 3,
+     "figures": 4,
+     "equations": 1,
+     "average_confidence": 0.94
+   }
+ }
+ ```
 
- ### Step 4: Run structure
+ ## Architecture
 
- Inside Claude Code:
+ structurecc uses a chunk-based parallel processing approach:
+
+ 1. **Document Analysis** - Determine page count and split into chunks (5 pages each)
+ 2. **Parallel Extraction** - Launch one Task agent per chunk for parallel processing
+ 3. **Chunk Merge** - Combine chunk results with page offset correction
+ 4. **Output Generation** - Create JSON and Markdown outputs
 
  ```
- /structure document.pdf
+ Document (20 pages)
+
+ ├── Chunk 1 (Pages 1-5) → Agent 1
+ ├── Chunk 2 (Pages 6-10) → Agent 2
+ ├── Chunk 3 (Pages 11-15)→ Agent 3
+ └── Chunk 4 (Pages 16-20)→ Agent 4
+
+
+ Merged Output
  ```
 
- <p align="center">
- <img src="assets/screenshots/step3.png" alt="Run /structure" width="520">
- </p>
+ This approach:
+ - Maximizes throughput via parallel processing
+ - Preserves context within chunks (figures and captions stay together)
+ - Uses Claude's native vision (no external APIs)
+ - Each agent has 200K context for thorough extraction
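
The chunk split and page-offset merge described in the Architecture section above could look roughly like the following. This is an illustrative Python sketch, not code from the package: `plan_chunks` and `merge_chunks` are hypothetical names, and it assumes each chunk agent numbers its pages from 1 within its own chunk and returns them in the `pages` shape shown in the `structure.json` example.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 5  # pages per chunk, as stated in the README


def plan_chunks(total_pages: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return inclusive (first_page, last_page) ranges covering the document."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def merge_chunks(chunks: List[Tuple[int, int]], results: List[List[Dict]]) -> List[Dict]:
    """Combine per-chunk page lists, shifting chunk-local page numbers.

    Assumption: every agent reports pages starting at 1 for its own chunk,
    so the offset to add is (first page of the chunk - 1).
    """
    merged = []
    for (first_page, _last_page), pages in zip(chunks, results):
        offset = first_page - 1
        for page in pages:
            merged.append({**page, "page": page["page"] + offset})
    return sorted(merged, key=lambda p: p["page"])
```

For the 20-page example in the diagram, `plan_chunks(20)` yields four ranges, one per agent.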
 
- Supports **PDF**, **DOCX**, **PNG**, **JPG**, and **TIFF**.
+ ## Element Types
 
- ---
+ ### Tables
 
- ## Output
+ Extracted with:
+ - Headers and all rows
+ - Cell values with exact formatting
+ - Flags (H, L, *, †)
+ - Footnotes
+ - Merged cell information
 
- ```
- document_extracted/
- ├── images/ # Extracted visuals
- ├── elements/ # Markdown per element
- └── STRUCTURED.md # Combined output
- ```
+ ### Figures
 
- ---
+ Supports various figure types:
+ - **Charts/Graphs**: Line, bar, scatter, pie with data series and axes
+ - **Scientific Images**: Western blots, gels, micrographs
+ - **Diagrams**: Flowcharts, illustrations, photographs
 
- ## Troubleshooting
+ Each figure includes:
+ - Title and caption
+ - Data points (when visible)
+ - Axis labels and ranges
+ - Annotations and legends
 
- | Issue | Solution |
- |-------|----------|
- | `npm: command not found` | Install Node.js from [nodejs.org](https://nodejs.org/) |
- | `/structure: No such file` | Run `claude` first, then type `/structure` inside Claude Code |
- | No images found | PDF may be text-only with no embedded images |
+ ### Equations
 
- ---
+ Extracted as:
+ - LaTeX representation
+ - Plain text fallback
+ - Variable definitions
 
- ## Uninstall
+ ### Text Blocks
 
- ```bash
- npx structurecc --uninstall
- ```
+ Captured with:
+ - Full content
+ - Type (header, paragraph, caption, footnote)
+ - Formatting information
+
+ ## Confidence Scores
+
+ Every element includes a confidence score (0.0-1.0):
+
+ | Score | Meaning |
+ |-------|---------|
+ | 0.95-1.00 | Crystal clear extraction |
+ | 0.85-0.94 | Clear with minor uncertainty |
+ | 0.70-0.84 | Readable but some ambiguity |
+ | < 0.70 | Needs manual verification |
+
+ Low confidence items are flagged in the output for review.
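
One way to read the score table above is as a set of lower-bound thresholds. The Python sketch below is illustrative only: `confidence_band` and `needs_review` are hypothetical helpers, and treating each band's lower bound as the cut-off is an assumption, since the table leaves small gaps (for example 0.945). The review threshold is a parameter because the batch summary template in this same release uses 80% rather than 0.70.

```python
from typing import Dict, List


def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to the README's band description."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be within 0.0-1.0, got {score}")
    if score >= 0.95:
        return "Crystal clear extraction"
    if score >= 0.85:
        return "Clear with minor uncertainty"
    if score >= 0.70:
        return "Readable but some ambiguity"
    return "Needs manual verification"


def needs_review(elements: List[Dict], threshold: float = 0.70) -> List[Dict]:
    """Return the elements whose confidence falls below the review threshold."""
    return [e for e in elements if e["confidence"] < threshold]
```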
+
+ ## Use Cases
+
+ - **Medical Lab Results**: Extract patient data from PDF reports
+ - **Research Papers**: Structure tables and figures from publications
+ - **Scientific Images**: Transcribe gel/blot data for documentation
+ - **Patient Records**: Batch process document folders
+ - **Data Digitization**: Convert scanned documents to structured data
+
+ ## Requirements
+
+ - Claude Code CLI
+ - No external dependencies (uses Claude's native capabilities)
+
+ ## How It Works
+
+ structurecc leverages Claude's multimodal capabilities:
+
+ 1. **Claude Vision**: Reads PDFs and images natively without OCR
+ 2. **Parallel Agents**: Task tool spawns chunk agents for parallel processing
+ 3. **Structured Output**: JSON schema ensures consistent, parseable output
+ 4. **Markdown Summary**: Human-readable format for quick review
+
+ No web searches, no external APIs, no Python dependencies. Just Claude + document = structured data.
+
+ ## Limitations
 
- ---
+ - Very large documents (100+ pages) may require multiple runs
+ - Handwritten content has lower accuracy than printed text
+ - Low-resolution images may have reduced confidence scores
+ - Complex nested tables may require manual verification
 
  ## License
 
@@ -0,0 +1,281 @@
+ ---
+ name: structure-batch
+ description: Extract structured data from multiple documents in a directory
+ argument-hint: <directory> [--output dir] [--pattern *.pdf]
+ allowed-tools: Read, Write, Task, Glob, Bash
+ model: opus
+ ---
+
+ <command-name>structure:batch</command-name>
+
+ # Batch Document Structure Extraction
+
+ You are extracting structured data from multiple documents in a directory.
+
+ ## Input
+
+ **Directory path:** $ARGUMENTS
+
+ Parse arguments:
+ - First argument: Directory path (required)
+ - `--output <dir>`: Custom output directory (optional, defaults to source directory)
+ - `--pattern <glob>`: File pattern to match (optional, defaults to all supported types)
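
The argument rules above are instructions for the model rather than code, but they can be illustrated with a small parser. The Python sketch below is a hedged illustration using the standard `argparse` and `shlex` modules; `parse_batch_arguments` is a hypothetical name, and it assumes `$ARGUMENTS` arrives as one shell-style string.

```python
import argparse
import shlex


def parse_batch_arguments(raw: str) -> argparse.Namespace:
    """Parse a '$ARGUMENTS'-style string into directory, output, and pattern."""
    parser = argparse.ArgumentParser(prog="structure:batch")
    parser.add_argument("directory", help="directory containing the documents")
    parser.add_argument("--output", default=None, help="custom output directory")
    parser.add_argument("--pattern", default=None, help="glob pattern to match")
    args = parser.parse_args(shlex.split(raw))
    if args.output is None:
        # Documented default: write results next to the source documents.
        args.output = args.directory
    return args
```

A `pattern` of `None` stands for "all supported types", matching the documented default.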
+
+ ## Workflow
+
+ ### Step 1: Discover Documents
+
+ Use Glob to find all supported documents in the directory:
+
+ ```
+ Supported patterns:
+ - *.pdf
+ - *.docx
+ - *.doc
+ - *.png
+ - *.jpg
+ - *.jpeg
+ - *.tiff
+ - *.bmp
+ ```
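
The discovery step relies on Claude Code's Glob tool. As a stand-in, the Python sketch below shows the same filtering with `pathlib`. It is illustrative only: `discover_documents` is a hypothetical name, and the non-recursive search and case-insensitive extension match are assumptions the command file does not state.

```python
from pathlib import Path
from typing import List, Optional

# Extensions taken from the supported patterns listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def discover_documents(directory: str, pattern: Optional[str] = None) -> List[Path]:
    """Return supported documents directly inside `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {directory}")
    candidates = root.glob(pattern) if pattern else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```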
+
+ List all discovered documents:
+
+ ```
+ ┌──────────────────────────────────────────────────────────────────────┐
+ │ BATCH EXTRACTION │
+ ├──────────────────────────────────────────────────────────────────────┤
+ │ │
+ │ Directory: {path} │
+ │ │
+ │ Documents Found: {count} │
+ │ 1. document1.pdf (15 pages) │
+ │ 2. lab_results.png (1 image) │
+ │ 3. report.docx (8 pages) │
+ │ ... │
+ │ │
+ │ Estimated chunks: {total_chunks} │
+ │ │
+ └──────────────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Step 2: Create Batch Output Directory
+
+ ```
+ {output_dir}/
+ ├── batch_summary.json # Summary of all extractions
+ ├── batch_summary.md # Human-readable summary
+ ├── document1_extracted/ # Per-document outputs
+ │   ├── structure.json
+ │   └── STRUCTURE.md
+ ├── lab_results_extracted/
+ │   ├── structure.json
+ │   └── STRUCTURE.md
+ └── report_extracted/
+     ├── structure.json
+     └── STRUCTURE.md
+ ```
+
+ ### Step 3: Process Documents
+
+ For each document, invoke the /structure command:
+
+ **Option A: Sequential Processing (safer, easier to track)**
+ Process one document at a time, reporting progress:
+
+ ```
+ Processing document 1/3: document1.pdf
+ Launching 3 chunk agents...
+ Chunks complete. Merging...
+ Done. Tables: 5, Figures: 3
+
+ Processing document 2/3: lab_results.png
+ Processing as single chunk...
+ Done. Tables: 1, Figures: 0
+
+ Processing document 3/3: report.docx
+ Launching 2 chunk agents...
+ Chunks complete. Merging...
+ Done. Tables: 2, Figures: 4
+ ```
+
+ **Option B: Parallel Documents (faster for many small docs)**
+ For directories with many single-page documents (like images), launch multiple document extractions in parallel using Task agents.
+
+ Choose based on:
+ - <5 documents or large PDFs: Sequential
+ - >5 small documents (images, short PDFs): Parallel
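
The selection rule above can be written as a small decision function. The Python sketch below is only an illustration: `choose_mode` is a hypothetical name, and the case of exactly five documents is not specified in the command file, so the sketch falls back to the sequential option that the text calls safer.

```python
def choose_mode(document_count: int, all_small: bool) -> str:
    """Pick 'parallel' only for more than five small documents.

    `all_small` means every document is an image or a short PDF.
    Exactly five documents is unspecified in the source; this sketch
    treats it as sequential.
    """
    if document_count > 5 and all_small:
        return "parallel"
    return "sequential"
```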
+
+ ### Step 4: Generate Batch Summary
+
+ **batch_summary.json:**
+
+ ```json
+ {
+   "batch": {
+     "directory": "/path/to/documents",
+     "processed": "2026-01-30T14:30:22Z",
+     "document_count": 3,
+     "total_pages": 24,
+     "total_elements": 45
+   },
+   "documents": [
+     {
+       "filename": "document1.pdf",
+       "pages": 15,
+       "tables": 5,
+       "figures": 3,
+       "equations": 1,
+       "average_confidence": 0.94,
+       "output": "document1_extracted/structure.json"
+     },
+     {
+       "filename": "lab_results.png",
+       "pages": 1,
+       "tables": 1,
+       "figures": 0,
+       "equations": 0,
+       "average_confidence": 0.97,
+       "output": "lab_results_extracted/structure.json"
+     },
+     {
+       "filename": "report.docx",
+       "pages": 8,
+       "tables": 2,
+       "figures": 4,
+       "equations": 0,
+       "average_confidence": 0.91,
+       "output": "report_extracted/structure.json"
+     }
+   ],
+   "summary": {
+     "total_tables": 8,
+     "total_figures": 7,
+     "total_equations": 1,
+     "overall_confidence": 0.94,
+     "low_confidence_items": 2
+   }
+ }
+ ```
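
The totals in the example above can be derived from the per-document entries. The Python sketch below shows one way to do that. It is illustrative only: `summarize_batch` is a hypothetical name, and computing `overall_confidence` as the unweighted mean of the per-document averages is an assumption. The command file does not define the formula, although the example values are consistent with it. `low_confidence_items` is omitted because it cannot be derived from these fields.

```python
from typing import Dict, List


def summarize_batch(documents: List[Dict]) -> Dict:
    """Aggregate per-document counts into batch-level totals."""
    confidences = [d["average_confidence"] for d in documents]
    return {
        "total_pages": sum(d["pages"] for d in documents),
        "total_tables": sum(d["tables"] for d in documents),
        "total_figures": sum(d["figures"] for d in documents),
        "total_equations": sum(d["equations"] for d in documents),
        # Assumed formula: plain mean of the per-document averages.
        "overall_confidence": (
            round(sum(confidences) / len(confidences), 2) if confidences else None
        ),
    }
```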
+
+ **batch_summary.md:**
+
+ ```markdown
+ # Batch Extraction Summary
+
+ **Directory:** {path}
+ **Processed:** {timestamp}
+ **Documents:** {count}
+
+ ---
+
+ ## Overview
+
+ | Metric | Count |
+ |--------|-------|
+ | Total Documents | 3 |
+ | Total Pages | 24 |
+ | Total Tables | 8 |
+ | Total Figures | 7 |
+ | Total Equations | 1 |
+ | Overall Confidence | 94% |
+
+ ---
+
+ ## Document Details
+
+ ### 1. document1.pdf
+
+ - **Pages:** 15
+ - **Tables:** 5
+ - **Figures:** 3
+ - **Confidence:** 94%
+ - **Output:** [document1_extracted/](document1_extracted/)
+
+ ### 2. lab_results.png
+
+ - **Pages:** 1
+ - **Tables:** 1
+ - **Figures:** 0
+ - **Confidence:** 97%
+ - **Output:** [lab_results_extracted/](lab_results_extracted/)
+
+ ### 3. report.docx
+
+ - **Pages:** 8
+ - **Tables:** 2
+ - **Figures:** 4
+ - **Confidence:** 91%
+ - **Output:** [report_extracted/](report_extracted/)
+
+ ---
+
+ ## Low Confidence Items
+
+ Items with confidence < 80% requiring review:
+
+ 1. **document1.pdf, Page 7, Table 3** (72%)
+ - Reason: Partially obscured by watermark
+
+ 2. **report.docx, Page 5, Figure 2** (68%)
+ - Reason: Low resolution image
+
+ ---
+
+ ## Files Generated
+
+ ```
+ {output_dir}/
+ ├── batch_summary.json
+ ├── batch_summary.md
+ ├── document1_extracted/
+ │   ├── structure.json
+ │   └── STRUCTURE.md
+ ├── lab_results_extracted/
+ │   ├── structure.json
+ │   └── STRUCTURE.md
+ └── report_extracted/
+     ├── structure.json
+     └── STRUCTURE.md
+ ```
+ ```
+
+ ### Step 5: Display Completion
+
+ ```
+ ┌──────────────────────────────────────────────────────────────────────┐
+ │ BATCH EXTRACTION COMPLETE │
+ ├──────────────────────────────────────────────────────────────────────┤
+ │ │
+ │ Documents Processed: {count} │
+ │ Total Pages: {pages} │
+ │ Total Elements: {elements} │
+ │ │
+ │ Summary: │
+ │ Tables: {count} │
+ │ Figures: {count} │
+ │ Equations: {count} │
+ │ │
+ │ Overall Confidence: {score}% │
+ │ Items Needing Review: {count} │
+ │ │
+ │ Output: │
+ │ {output_dir}/batch_summary.json │
+ │ {output_dir}/batch_summary.md │
+ │ │
+ └──────────────────────────────────────────────────────────────────────┘
+ ```
+
+ ## Error Handling
+
+ - If directory doesn't exist: Report error
+ - If no supported documents found: Report and suggest checking patterns
+ - If individual document fails: Log error, continue with others, report in summary
+ - If output directory cannot be created: Report error
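
The "log error, continue with others" rule above amounts to isolating failures per document. The Python sketch below illustrates that pattern under stated assumptions: `process_batch` and the `extract` callable are hypothetical, and the `errors` list is an invented field that does not appear in the `batch_summary.json` example.

```python
from typing import Callable, Dict, Iterable, List


def process_batch(paths: Iterable[str], extract: Callable[[str], Dict]) -> Dict[str, List]:
    """Run `extract` on each path, recording failures without stopping."""
    succeeded: List[Dict] = []
    failed: List[Dict] = []
    for path in paths:
        try:
            succeeded.append({"filename": path, **extract(path)})
        except Exception as exc:  # one bad document must not abort the batch
            failed.append({"filename": path, "error": str(exc)})
    # "errors" is a hypothetical key used here to report failures in the summary.
    return {"documents": succeeded, "errors": failed}
```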
+
+ ## Notes
+
+ - Each document is processed independently using /structure
+ - Batch processing is ideal for patient files, lab results, research papers
+ - Low confidence items are aggregated for easy review
+ - All outputs are self-contained and can be accessed individually