structurecc 2.1.0 → 3.0.1

@@ -0,0 +1,8 @@
1
+ {
2
+ "name": "structurecc",
3
+ "version": "3.0.0",
4
+ "description": "Extract structured data from documents using Claude vision and parallel Task agents",
5
+ "author": {
6
+ "name": "UTMB Diagnostic Center"
7
+ }
8
+ }
package/README.md CHANGED
@@ -1,106 +1,196 @@
1
- <h1 align="center">STRUCTURE</h1>
1
+ # structurecc
2
2
 
3
- <p align="center">
4
- <strong>Extract structured data from PDFs, Word docs, and images using Claude Code.</strong>
5
- </p>
3
+ Document Structure Extraction for Claude Code
6
4
 
7
- <p align="center">
8
- <a href="https://www.npmjs.com/package/structurecc"><img src="https://img.shields.io/npm/v/structurecc.svg" alt="npm version"></a>
9
- <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
10
- </p>
5
+ Extract structured data from PDFs, Word documents, and images using Claude's native vision capabilities and parallel Task agents.
11
6
 
12
- <p align="center">
13
- <img src="assets/terminal.png" alt="structurecc" width="550">
14
- </p>
7
+ ## Installation
15
8
 
16
- ---
17
-
18
- ## Requirements
19
-
20
- - **Node.js** - [nodejs.org](https://nodejs.org/)
21
- - **Claude Code** - Requires API key or Pro/Max subscription
9
+ ```bash
10
+ npx structurecc
11
+ ```
22
12
 
23
- ---
13
+ This installs the plugin to `~/.claude/plugins/structurecc/`.
24
14
 
25
- ## Install
15
+ ## Usage
26
16
 
27
- ### Step 1: Install Claude Code
17
+ ### Single Document
28
18
 
29
19
  ```bash
30
- npm install -g @anthropic-ai/claude-code
20
+ /structure document.pdf
21
+ /structure lab_image.png
22
+ /structure report.docx
31
23
  ```
32
24
 
33
- <p align="center">
34
- <img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
35
- </p>
36
-
37
- ### Step 2: Install structurecc
25
+ ### Batch Processing
38
26
 
39
27
  ```bash
40
- npx structurecc
28
+ /structure:batch ./documents/
29
+ /structure:batch ./patient_files/ --output ./extracted/
41
30
  ```
42
31
 
43
- <p align="center">
44
- <img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
45
- </p>
32
+ ## Supported Formats
46
33
 
47
- ### Step 3: Start Claude Code
34
+ | Format | Extension | Notes |
35
+ |--------|-----------|-------|
36
+ | PDF | `.pdf` | Multi-page supported, chunked for large documents |
37
+ | Word | `.docx`, `.doc` | Text and embedded images extracted |
38
+ | Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` | Single-page extraction |
48
39
 
49
- Navigate to your document folder and run:
40
+ ## Output
41
+
42
+ For each document, structurecc generates:
50
43
 
51
- ```bash
52
- cd ~/Desktop/documents
53
- claude
44
+ ```
45
+ document_extracted/
46
+ ├── chunks/ # Individual chunk extractions (for debugging)
47
+ ├── structure.json # Complete structured extraction
48
+ └── STRUCTURE.md # Human-readable markdown summary
54
49
  ```
55
50
 
56
- <p align="center">
57
- <img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
58
- </p>
51
+ ### structure.json
52
+
53
+ ```json
54
+ {
55
+ "source": "/path/to/document.pdf",
56
+ "extracted": "2026-01-30T14:30:22Z",
57
+ "pages": [
58
+ {
59
+ "page": 1,
60
+ "elements": [
61
+ {
62
+ "id": "element_1",
63
+ "type": "table",
64
+ "title": "Table 1. Lab Results",
65
+ "data": {
66
+ "headers": ["Test", "Result", "Units", "Reference"],
67
+ "rows": [
68
+ ["Glucose", "126", "mg/dL", "70-100"]
69
+ ]
70
+ },
71
+ "confidence": 0.98
72
+ }
73
+ ]
74
+ }
75
+ ],
76
+ "summary": {
77
+ "total_pages": 5,
78
+ "tables": 3,
79
+ "figures": 4,
80
+ "equations": 1,
81
+ "average_confidence": 0.94
82
+ }
83
+ }
84
+ ```
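The `summary` block in the example above can in principle be derived from the `pages` array. The following is a minimal sketch of that derivation, not code shipped in the package. It assumes the element `type` strings are `"table"`, `"figure"`, and `"equation"` (only `"table"` appears in the example) and that `total_pages` equals the number of entries in `pages`; the function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(structure: dict) -> dict:
    """Recompute a summary block from the pages/elements of a structure.json dict."""
    pages = structure.get("pages", [])
    # Flatten every element on every page into one list.
    elements = [el for page in pages for el in page.get("elements", [])]
    counts = Counter(el.get("type") for el in elements)
    scores = [el["confidence"] for el in elements if "confidence" in el]
    return {
        "total_pages": len(pages),
        "tables": counts["table"],        # assumed type string
        "figures": counts["figure"],      # assumed type string
        "equations": counts["equation"],  # assumed type string
        # Unweighted mean over elements; None when nothing carries a score.
        "average_confidence": round(sum(scores) / len(scores), 2) if scores else None,
    }
```

A check like this could be used to confirm that a generated `summary` agrees with the elements actually present in the file.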
59
85
 
60
- ### Step 4: Run structure
86
+ ## Architecture
61
87
 
62
- Inside Claude Code:
88
+ structurecc uses a chunk-based parallel processing approach:
89
+
90
+ 1. **Document Analysis** - Determine page count and split into chunks (5 pages each)
91
+ 2. **Parallel Extraction** - Launch one Task agent per chunk for parallel processing
92
+ 3. **Chunk Merge** - Combine chunk results with page offset correction
93
+ 4. **Output Generation** - Create JSON and Markdown outputs
63
94
 
64
95
  ```
65
- /structure document.pdf
96
+ Document (20 pages)
97
+
98
+ ├── Chunk 1 (Pages 1-5) → Agent 1
99
+ ├── Chunk 2 (Pages 6-10) → Agent 2
100
+ ├── Chunk 3 (Pages 11-15)→ Agent 3
101
+ └── Chunk 4 (Pages 16-20)→ Agent 4
102
+
103
+
104
+ Merged Output
66
105
  ```
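The chunking shown in the diagram above amounts to splitting a page range into fixed-size windows. Here is a minimal sketch of that arithmetic, assuming 1-based inclusive page ranges and the 5-page chunk size stated in the README; `split_into_chunks` is a hypothetical helper, since the package itself expresses this step as a prompt rather than code.

```python
def split_into_chunks(page_count: int, chunk_size: int = 5) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges covering the document."""
    if page_count < 0 or chunk_size < 1:
        raise ValueError("page_count must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, page_count))  # last chunk may be short
        for start in range(1, page_count + 1, chunk_size)
    ]
```

For a 20-page document this yields four ranges, one per agent; an 8-page document yields `(1, 5)` and `(6, 8)`.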
67
106
 
68
- <p align="center">
69
- <img src="assets/screenshots/step3.png" alt="Run /structure" width="520">
70
- </p>
107
+ This approach:
108
+ - Maximizes throughput via parallel processing
109
+ - Preserves context within chunks (figures and captions stay together)
110
+ - Uses Claude's native vision (no external APIs)
111
+ - Each agent has 200K context for thorough extraction
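The "page offset correction" in the chunk merge step is not spelled out in the README. One plausible reading, sketched below, is that each agent numbers pages relative to its own chunk (starting at 1) and the merge shifts them back to document page numbers. That chunk-local numbering is an assumption, and `merge_chunks` is a hypothetical name.

```python
def merge_chunks(chunk_results: list[tuple[int, list[dict]]]) -> list[dict]:
    """Merge per-chunk page lists into one list with document-level page numbers.

    chunk_results holds (first_page, pages) pairs, where first_page is the
    document page the chunk starts on and each page dict carries a
    chunk-local, 1-based "page" number (assumed convention).
    """
    merged = []
    # Sort by starting page so output order does not depend on agent finish order.
    for first_page, pages in sorted(chunk_results, key=lambda chunk: chunk[0]):
        for page in pages:
            corrected = dict(page)  # shallow copy; leave the agent's result untouched
            corrected["page"] = first_page + page["page"] - 1
            merged.append(corrected)
    return merged
```

Sorting by start page matters because parallel agents can finish in any order.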
71
112
 
72
- Supports **PDF**, **DOCX**, **PNG**, **JPG**, and **TIFF**.
113
+ ## Element Types
73
114
 
74
- ---
115
+ ### Tables
75
116
 
76
- ## Output
117
+ Extracted with:
118
+ - Headers and all rows
119
+ - Cell values with exact formatting
120
+ - Flags (H, L, *, †)
121
+ - Footnotes
122
+ - Merged cell information
77
123
 
78
- ```
79
- document_extracted/
80
- ├── images/ # Extracted visuals
81
- ├── elements/ # Markdown per element
82
- └── STRUCTURED.md # Combined output
83
- ```
124
+ ### Figures
84
125
 
85
- ---
126
+ Supports various figure types:
127
+ - **Charts/Graphs**: Line, bar, scatter, pie with data series and axes
128
+ - **Scientific Images**: Western blots, gels, micrographs
129
+ - **Diagrams**: Flowcharts, illustrations, photographs
86
130
 
87
- ## Troubleshooting
131
+ Each figure includes:
132
+ - Title and caption
133
+ - Data points (when visible)
134
+ - Axis labels and ranges
135
+ - Annotations and legends
88
136
 
89
- | Issue | Solution |
90
- |-------|----------|
91
- | `npm: command not found` | Install Node.js from [nodejs.org](https://nodejs.org/) |
92
- | `/structure: No such file` | Run `claude` first, then type `/structure` inside Claude Code |
93
- | No images found | PDF may be text-only with no embedded images |
137
+ ### Equations
94
138
 
95
- ---
139
+ Extracted as:
140
+ - LaTeX representation
141
+ - Plain text fallback
142
+ - Variable definitions
96
143
 
97
- ## Uninstall
144
+ ### Text Blocks
98
145
 
99
- ```bash
100
- npx structurecc --uninstall
101
- ```
146
+ Captured with:
147
+ - Full content
148
+ - Type (header, paragraph, caption, footnote)
149
+ - Formatting information
150
+
151
+ ## Confidence Scores
152
+
153
+ Every element includes a confidence score (0.0-1.0):
154
+
155
+ | Score | Meaning |
156
+ |-------|---------|
157
+ | 0.95-1.00 | Crystal clear extraction |
158
+ | 0.85-0.94 | Clear with minor uncertainty |
159
+ | 0.70-0.84 | Readable but some ambiguity |
160
+ | < 0.70 | Needs manual verification |
161
+
162
+ Low confidence items are flagged in the output for review.
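The score table above can be read as a simple threshold lookup. This sketch treats each band's lower bound as the cut-off, so a value such as 0.945 that falls between the printed ranges lands in the lower band; that tie-breaking rule is an assumption, as is the function name.

```python
def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to the README's description of that band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if score >= 0.95:
        return "Crystal clear extraction"
    if score >= 0.85:
        return "Clear with minor uncertainty"
    if score >= 0.70:
        return "Readable but some ambiguity"
    return "Needs manual verification"
```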
163
+
164
+ ## Use Cases
165
+
166
+ - **Medical Lab Results**: Extract patient data from PDF reports
167
+ - **Research Papers**: Structure tables and figures from publications
168
+ - **Scientific Images**: Transcribe gel/blot data for documentation
169
+ - **Patient Records**: Batch process document folders
170
+ - **Data Digitization**: Convert scanned documents to structured data
171
+
172
+ ## Requirements
173
+
174
+ - Claude Code CLI
175
+ - No external dependencies (uses Claude's native capabilities)
176
+
177
+ ## How It Works
178
+
179
+ structurecc leverages Claude's multimodal capabilities:
180
+
181
+ 1. **Claude Vision**: Reads PDFs and images natively without OCR
182
+ 2. **Parallel Agents**: Task tool spawns chunk agents for parallel processing
183
+ 3. **Structured Output**: JSON schema ensures consistent, parseable output
184
+ 4. **Markdown Summary**: Human-readable format for quick review
185
+
186
+ No web searches, no external APIs, no Python dependencies. Just Claude + document = structured data.
187
+
188
+ ## Limitations
102
189
 
103
- ---
190
+ - Very large documents (100+ pages) may require multiple runs
191
+ - Handwritten content has lower accuracy than printed text
192
+ - Low-resolution images may have reduced confidence scores
193
+ - Complex nested tables may require manual verification
104
194
 
105
195
  ## License
106
196
 
@@ -0,0 +1,276 @@
1
+ ---
2
+ description: Batch extract structured data from multiple documents in a directory
3
+ argument-hint: <directory>
4
+ ---
5
+
6
+ # Batch Document Structure Extraction
7
+
8
+ You are extracting structured data from multiple documents in a directory.
9
+
10
+ ## Input
11
+
12
+ **Directory path:** $ARGUMENTS
13
+
14
+ Parse arguments:
15
+ - First argument: Directory path (required)
16
+ - `--output <dir>`: Custom output directory (optional, defaults to source directory)
17
+ - `--pattern <glob>`: File pattern to match (optional, defaults to all supported types)
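The command file is a prompt that Claude interprets, so the package contains no parser for these arguments. To pin down the intended semantics, here is a sketch of equivalent parsing using Python's standard `argparse` and `shlex`; the function name `parse_batch_args` is hypothetical, and representing "all supported types" as `None` is an assumption.

```python
import argparse
import shlex


def parse_batch_args(raw: str) -> argparse.Namespace:
    """Parse a raw argument string such as './docs/ --output ./out/'."""
    parser = argparse.ArgumentParser(prog="/structure:batch")
    parser.add_argument("directory", help="directory containing documents (required)")
    parser.add_argument("--output", default=None, help="custom output directory")
    parser.add_argument("--pattern", default=None, help="glob pattern; None means all supported types")
    args = parser.parse_args(shlex.split(raw))
    if args.output is None:
        # Documented default: write results next to the source documents.
        args.output = args.directory
    return args
```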
18
+
19
+ ## Workflow
20
+
21
+ ### Step 1: Discover Documents
22
+
23
+ Use Glob to find all supported documents in the directory:
24
+
25
+ ```
26
+ Supported patterns:
27
+ - *.pdf
28
+ - *.docx
29
+ - *.doc
30
+ - *.png
31
+ - *.jpg
32
+ - *.jpeg
33
+ - *.tiff
34
+ - *.bmp
35
+ ```
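The discovery step relies on Claude's Glob tool. As a rough illustration of the same filter outside Claude Code, the sketch below lists matching files with `pathlib`. It assumes a non-recursive search and case-insensitive extension matching, neither of which the command file specifies; `discover_documents` is a hypothetical name.

```python
from pathlib import Path

# Extensions taken from the supported patterns listed in the command file.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def discover_documents(directory: str) -> list[Path]:
    """Return supported documents directly inside `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(directory)
    return sorted(
        path
        for path in root.iterdir()  # top level only (assumption)
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```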
36
+
37
+ List all discovered documents:
38
+
39
+ ```
40
+ ┌──────────────────────────────────────────────────────────────────────┐
41
+ │ BATCH EXTRACTION │
42
+ ├──────────────────────────────────────────────────────────────────────┤
43
+ │ │
44
+ │ Directory: {path} │
45
+ │ │
46
+ │ Documents Found: {count} │
47
+ │ 1. document1.pdf (15 pages) │
48
+ │ 2. lab_results.png (1 image) │
49
+ │ 3. report.docx (8 pages) │
50
+ │ ... │
51
+ │ │
52
+ │ Estimated chunks: {total_chunks} │
53
+ │ │
54
+ └──────────────────────────────────────────────────────────────────────┘
55
+ ```
56
+
57
+ ### Step 2: Create Batch Output Directory
58
+
59
+ ```
60
+ {output_dir}/
61
+ ├── batch_summary.json # Summary of all extractions
62
+ ├── batch_summary.md # Human-readable summary
63
+ ├── document1_extracted/ # Per-document outputs
64
+ │ ├── structure.json
65
+ │ └── STRUCTURE.md
66
+ ├── lab_results_extracted/
67
+ │ ├── structure.json
68
+ │ └── STRUCTURE.md
69
+ └── report_extracted/
70
+ ├── structure.json
71
+ └── STRUCTURE.md
72
+ ```
73
+
74
+ ### Step 3: Process Documents
75
+
76
+ For each document, invoke the /structure command:
77
+
78
+ **Option A: Sequential Processing (safer, easier to track)**
79
+ Process one document at a time, reporting progress:
80
+
81
+ ```
82
+ Processing document 1/3: document1.pdf
83
+ Launching 3 chunk agents...
84
+ Chunks complete. Merging...
85
+ Done. Tables: 5, Figures: 3
86
+
87
+ Processing document 2/3: lab_results.png
88
+ Processing as single chunk...
89
+ Done. Tables: 1, Figures: 0
90
+
91
+ Processing document 3/3: report.docx
92
+ Launching 2 chunk agents...
93
+ Chunks complete. Merging...
94
+ Done. Tables: 2, Figures: 4
95
+ ```
96
+
97
+ **Option B: Parallel Documents (faster for many small docs)**
98
+ For directories with many single-page documents (like images), launch multiple document extractions in parallel using Task agents.
99
+
100
+ Choose based on:
101
+ - <5 documents or large PDFs: Sequential
102
+ - >5 small documents (images, short PDFs): Parallel
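The selection rule above leaves two things open: what counts as a "small" document, and what to do with exactly five documents. The sketch below makes both choices explicit and is only one possible reading. It assumes "small" means the document fits in a single 5-page chunk and that exactly five documents fall back to sequential; `choose_mode` and its parameters are hypothetical.

```python
def choose_mode(page_counts: list[int], small_max_pages: int = 5, min_parallel_docs: int = 6) -> str:
    """Pick 'sequential' or 'parallel' from the page count of each document."""
    many_documents = len(page_counts) >= min_parallel_docs  # ">5 documents"
    all_small = all(pages <= small_max_pages for pages in page_counts)  # assumed meaning of "small"
    return "parallel" if many_documents and all_small else "sequential"
```

With the three-document example used in this file (15, 1, and 8 pages) the sketch selects sequential processing.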
103
+
104
+ ### Step 4: Generate Batch Summary
105
+
106
+ **batch_summary.json:**
107
+
108
+ ```json
109
+ {
110
+ "batch": {
111
+ "directory": "/path/to/documents",
112
+ "processed": "2026-01-30T14:30:22Z",
113
+ "document_count": 3,
114
+ "total_pages": 24,
115
+ "total_elements": 45
116
+ },
117
+ "documents": [
118
+ {
119
+ "filename": "document1.pdf",
120
+ "pages": 15,
121
+ "tables": 5,
122
+ "figures": 3,
123
+ "equations": 1,
124
+ "average_confidence": 0.94,
125
+ "output": "document1_extracted/structure.json"
126
+ },
127
+ {
128
+ "filename": "lab_results.png",
129
+ "pages": 1,
130
+ "tables": 1,
131
+ "figures": 0,
132
+ "equations": 0,
133
+ "average_confidence": 0.97,
134
+ "output": "lab_results_extracted/structure.json"
135
+ },
136
+ {
137
+ "filename": "report.docx",
138
+ "pages": 8,
139
+ "tables": 2,
140
+ "figures": 4,
141
+ "equations": 0,
142
+ "average_confidence": 0.91,
143
+ "output": "report_extracted/structure.json"
144
+ }
145
+ ],
146
+ "summary": {
147
+ "total_tables": 8,
148
+ "total_figures": 7,
149
+ "total_equations": 1,
150
+ "overall_confidence": 0.94,
151
+ "low_confidence_items": 2
152
+ }
153
+ }
154
+ ```
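The totals in the example above follow from the per-document entries. A minimal sketch of that aggregation is shown here, assuming `overall_confidence` is the unweighted mean of the per-document averages (which matches the example values, though the command file does not state the formula). `total_elements` and `low_confidence_items` need element-level data and are left out; `aggregate_batch` is a hypothetical name.

```python
def aggregate_batch(documents: list[dict]) -> dict:
    """Build batch-level totals from per-document summary entries."""
    confidences = [doc["average_confidence"] for doc in documents]
    return {
        "document_count": len(documents),
        "total_pages": sum(doc["pages"] for doc in documents),
        "total_tables": sum(doc["tables"] for doc in documents),
        "total_figures": sum(doc["figures"] for doc in documents),
        "total_equations": sum(doc["equations"] for doc in documents),
        # Unweighted mean across documents (assumed formula).
        "overall_confidence": round(sum(confidences) / len(confidences), 2) if confidences else None,
    }
```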
155
+
156
+ **batch_summary.md:**
157
+
158
+ ```markdown
159
+ # Batch Extraction Summary
160
+
161
+ **Directory:** {path}
162
+ **Processed:** {timestamp}
163
+ **Documents:** {count}
164
+
165
+ ---
166
+
167
+ ## Overview
168
+
169
+ | Metric | Count |
170
+ |--------|-------|
171
+ | Total Documents | 3 |
172
+ | Total Pages | 24 |
173
+ | Total Tables | 8 |
174
+ | Total Figures | 7 |
175
+ | Total Equations | 1 |
176
+ | Overall Confidence | 94% |
177
+
178
+ ---
179
+
180
+ ## Document Details
181
+
182
+ ### 1. document1.pdf
183
+
184
+ - **Pages:** 15
185
+ - **Tables:** 5
186
+ - **Figures:** 3
187
+ - **Confidence:** 94%
188
+ - **Output:** [document1_extracted/](document1_extracted/)
189
+
190
+ ### 2. lab_results.png
191
+
192
+ - **Pages:** 1
193
+ - **Tables:** 1
194
+ - **Figures:** 0
195
+ - **Confidence:** 97%
196
+ - **Output:** [lab_results_extracted/](lab_results_extracted/)
197
+
198
+ ### 3. report.docx
199
+
200
+ - **Pages:** 8
201
+ - **Tables:** 2
202
+ - **Figures:** 4
203
+ - **Confidence:** 91%
204
+ - **Output:** [report_extracted/](report_extracted/)
205
+
206
+ ---
207
+
208
+ ## Low Confidence Items
209
+
210
+ Items with confidence < 80% requiring review:
211
+
212
+ 1. **document1.pdf, Page 7, Table 3** (72%)
213
+ - Reason: Partially obscured by watermark
214
+
215
+ 2. **report.docx, Page 5, Figure 2** (68%)
216
+ - Reason: Low resolution image
217
+
218
+ ---
219
+
220
+ ## Files Generated
221
+
222
+ ```
223
+ {output_dir}/
224
+ ├── batch_summary.json
225
+ ├── batch_summary.md
226
+ ├── document1_extracted/
227
+ │ ├── structure.json
228
+ │ └── STRUCTURE.md
229
+ ├── lab_results_extracted/
230
+ │ ├── structure.json
231
+ │ └── STRUCTURE.md
232
+ └── report_extracted/
233
+ ├── structure.json
234
+ └── STRUCTURE.md
235
+ ```
236
+ ```
237
+
238
+ ### Step 5: Display Completion
239
+
240
+ ```
241
+ ┌──────────────────────────────────────────────────────────────────────┐
242
+ │ BATCH EXTRACTION COMPLETE │
243
+ ├──────────────────────────────────────────────────────────────────────┤
244
+ │ │
245
+ │ Documents Processed: {count} │
246
+ │ Total Pages: {pages} │
247
+ │ Total Elements: {elements} │
248
+ │ │
249
+ │ Summary: │
250
+ │ Tables: {count} │
251
+ │ Figures: {count} │
252
+ │ Equations: {count} │
253
+ │ │
254
+ │ Overall Confidence: {score}% │
255
+ │ Items Needing Review: {count} │
256
+ │ │
257
+ │ Output: │
258
+ │ {output_dir}/batch_summary.json │
259
+ │ {output_dir}/batch_summary.md │
260
+ │ │
261
+ └──────────────────────────────────────────────────────────────────────┘
262
+ ```
263
+
264
+ ## Error Handling
265
+
266
+ - If directory doesn't exist: Report error
267
+ - If no supported documents found: Report and suggest checking patterns
268
+ - If individual document fails: Log error, continue with others, report in summary
269
+ - If output directory cannot be created: Report error
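The third rule above (log the failure, continue, report in the summary) is the usual collect-and-continue pattern. A short sketch under stated assumptions: `extract` stands in for whatever performs a single-document extraction, and both it and `run_batch` are hypothetical names rather than parts of the package.

```python
from typing import Callable


def run_batch(paths: list[str], extract: Callable[[str], dict]) -> tuple[list[tuple[str, dict]], list[tuple[str, str]]]:
    """Run `extract` on each path; return (successes, failures) without aborting early."""
    successes: list[tuple[str, dict]] = []
    failures: list[tuple[str, str]] = []
    for path in paths:
        try:
            successes.append((path, extract(path)))
        except Exception as exc:  # one bad document must not stop the batch
            failures.append((path, str(exc)))
    return successes, failures
```

The `failures` list is what would feed the error section of a batch summary.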
270
+
271
+ ## Notes
272
+
273
+ - Each document is processed independently using /structure
274
+ - Batch processing is ideal for patient files, lab results, research papers
275
+ - Low confidence items are aggregated for easy review
276
+ - All outputs are self-contained and can be accessed individually