structurecc 1.0.0
- package/LICENSE +21 -0
- package/README.md +179 -0
- package/agents/structureit-extractor.md +70 -0
- package/bin/install.js +173 -0
- package/commands/structure/structure.md +242 -0
- package/package.json +34 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 James Weatherhead

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,179 @@
<h1 align="center">STRUCTUREIT</h1>

<p align="center">
<strong>Agentic Document Extraction for Claude Code</strong><br>
<em>One command. Every figure. Every table.</em>
</p>

<p align="center">
<a href="https://www.npmjs.com/package/structurecc"><img src="https://img.shields.io/npm/v/structurecc.svg" alt="npm version"></a>
<a href="https://github.com/JamesWeatherhead/structurecc/stargazers"><img src="https://img.shields.io/github/stars/JamesWeatherhead/structurecc" alt="GitHub stars"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</p>

<p align="center">
<em>Unstructured in. Structured out.</em>
</p>

---

## The Problem

You have a PDF with figures, tables, and charts. You need that data.

**Manual approach:** Screenshot each figure. Copy tables cell by cell. Spend hours on one document.

**structurecc:**

```
/structure paper.pdf
```

Done.

---

## What It Does

```
PDF ───▶ [Agent 1] ───┐
         [Agent 2] ───┤
         [Agent 3] ───┼───▶ STRUCTURED.md
         [Agent N] ───┘
```

1. **Extracts** every image from your document
2. **Spawns** one AI agent per image (running in parallel)
3. **Analyzes** each element exhaustively
4. **Outputs** clean, structured markdown

Like [Landing AI's Agentic Document Extraction](https://landing.ai/agentic-document-extraction), but running locally via Claude Code.

---

## Install

```bash
npx structurecc
```

## Use

In Claude Code:

```
/structure path/to/document.pdf
```

Works with: **PDF, DOCX, PNG, JPG**

---

## What You Get

```
document_extracted/
├── images/            # All extracted visuals
├── elements/          # One markdown file per element
│   ├── element_1.md   # Table fully extracted
│   ├── element_2.md   # Figure analyzed
│   └── ...
└── STRUCTURED.md      # Everything combined
```

### Example: Table Extraction

```markdown
# Patient Demographics

**Type:** Table
**Source:** Page 3, clinical_trial.pdf

## Content

| Group | N | Age (mean±SD) | Male (%) |
|-------|---|---------------|----------|
| Treatment | 245 | 54.3±12.1 | 58.4 |
| Placebo | 248 | 53.8±11.9 | 56.9 |
| p-value | - | 0.67 | 0.73 |

## Notes
- * Missing data excluded
```

### Example: Figure Analysis

```markdown
# Kaplan-Meier Survival Curves

**Type:** Figure
**Source:** Page 7, clinical_trial.pdf

## Content

Survival curves comparing treatment (blue) vs placebo (red) over 24 months.

- 12-month survival: Treatment 0.89, Placebo 0.78
- 24-month survival: Treatment 0.76, Placebo 0.61
- Log-rank p = 0.003

## Labels & Text
- "Survival Probability"
- "Time (months)"
- "Treatment (n=245)"
- "Placebo (n=248)"
```

---

## How It Works

1. **Extract** - PyMuPDF pulls all images from the PDF (or the DOCX media folder is unzipped)
2. **Swarm** - Launch N parallel agents, one per image
3. **Analyze** - Each agent reads its image, extracts everything, writes markdown
4. **Combine** - Merge all element files into STRUCTURED.md

Agents run simultaneously. 10 images = 10 agents = fast.

---

## Cost

Depends on document complexity:

| Document | Elements | ~Cost |
|----------|----------|-------|
| Simple paper | 5-10 | $0.50-$1 |
| Full paper | 15-25 | $2-$4 |
| Dense report | 40+ | $5-$10 |

Uses Claude's multimodal vision. Works best with **Opus 4.5**.

---

## Requirements

- Node.js
- Claude Code (`npm install -g @anthropic-ai/claude-code`)
- Anthropic API key or Claude Pro/Max

PyMuPDF is installed automatically if needed.

---

## Uninstall

```bash
npx structurecc --uninstall
```

---

## License

MIT

---

<p align="center">
<strong>Stop copying tables by hand.</strong>
</p>
package/agents/structureit-extractor.md
ADDED
@@ -0,0 +1,70 @@
---
name: structureit-extractor
description: Extract and analyze any visual element from documents
---

# Visual Element Extractor

You are an expert at extracting structured data from document images.

## Your Task

Given an image from a document, you:

1. **Identify** what type of visual this is
2. **Extract** every piece of data and text visible
3. **Structure** the output as clean markdown
4. **Ground** your extraction to the source location

## Visual Types You Handle

- Tables (any format - simple, complex, merged cells)
- Figures (scientific, photographs, illustrations)
- Charts (bar, line, pie, scatter, area, box plots)
- Heatmaps (color matrices, correlation plots, expression data)
- Diagrams (flowcharts, architectures, schematics, networks)
- Forms (fields, checkboxes, filled forms)
- Mixed (images containing multiple element types)

## Extraction Rules

1. **Be exhaustive** - Extract EVERY visible data point, label, annotation
2. **Be exact** - Copy text verbatim, preserve numbers precisely
3. **Be structured** - Output clean markdown tables when appropriate
4. **Be honest** - Mark unclear items with [?], note confidence levels
5. **Be grounded** - Always cite source page/location

## Output Format

```markdown
# [Descriptive Title Based on Content]

**Type:** [table/figure/chart/heatmap/diagram/form/mixed]
**Source:** Page [N], [document name]

## Content

[Primary extraction - tables as markdown tables, descriptions for figures,
data points for charts, etc.]

## Labels & Annotations

[All text visible in the image, verbatim]

## Notes

[Extraction observations, confidence levels, anything unclear]
```

## Special Handling

**For complex heatmaps:**
- Extract all row/column labels
- Describe the color scale
- Identify patterns and notable regions
- Sample representative values if the matrix is very large

**For multi-panel figures:**
- Label each panel (A, B, C, etc.)
- Extract each panel's content separately
- Note relationships between panels
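A quick way to sanity-check that an agent's element file follows the Output Format above. This is a sketch, not part of the package; the section names (`## Content`, `## Labels & Annotations`, `## Notes`) and header fields come from the format spec, and the function name is hypothetical:

```python
# Sketch: validate an element_<N>.md against the extractor's Output Format.
# Returns a list of problems; empty list means the file looks well-formed.
import re

REQUIRED_SECTIONS = ("## Content", "## Labels & Annotations", "## Notes")

def check_element(markdown_text):
    problems = []
    # Header fields required by the format spec
    if not re.search(r"^\*\*Type:\*\* ", markdown_text, re.M):
        problems.append("missing **Type:** field")
    if not re.search(r"^\*\*Source:\*\* Page \d+", markdown_text, re.M):
        problems.append("missing or unpaged **Source:** field")
    # Required section headings
    for section in REQUIRED_SECTIONS:
        if section not in markdown_text:
            problems.append(f"missing {section} section")
    return problems
```

This only checks structure, not extraction quality; the "Be grounded" rule is enforced by requiring a page number in the **Source:** field.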
package/bin/install.js
ADDED
@@ -0,0 +1,173 @@
#!/usr/bin/env node

const fs = require('fs');
const path = require('path');
const os = require('os');

const VERSION = '1.0.0';
const PACKAGE_NAME = 'structurecc';

// Colors
const colors = {
  reset: '\x1b[0m',
  bright: '\x1b[1m',
  dim: '\x1b[2m',
  red: '\x1b[31m',
  green: '\x1b[32m',
  yellow: '\x1b[33m',
  cyan: '\x1b[36m',
  magenta: '\x1b[35m',
  white: '\x1b[37m'
};

function log(msg, color = '') {
  console.log(`${color}${msg}${colors.reset}`);
}

function banner() {
  console.log(`
${colors.cyan}
╔══════════════════════════════════════════════════════════════════════════════╗
║                                                                              ║
║ ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗ ███████╗║
║ ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗██╔════╝║
║ ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝█████╗ ║
║ ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗██╔══╝ ║
║ ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║███████╗║
║ ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝║
║                                                                              ║
║ ${colors.reset}${colors.bright}Agentic Document Extraction${colors.reset}${colors.cyan} ║
║ ${colors.reset}${colors.dim}One command. Every figure. Every table.${colors.reset}${colors.cyan} ║
║                                                                              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║ ${colors.reset}${colors.yellow}PDF${colors.cyan} ───▶ ${colors.green}[Agent 1]${colors.cyan} ───┐ ║
║          ${colors.green}[Agent 2]${colors.cyan} ───┤ ║
║          ${colors.green}[Agent 3]${colors.cyan} ───┼───▶ ${colors.magenta}STRUCTURED.md${colors.cyan} ║
║          ${colors.green}[Agent N]${colors.cyan} ───┘ ║
║                                                                              ║
║ ${colors.reset}${colors.white}Unstructured in. Structured out.${colors.reset}${colors.cyan} ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
${colors.reset}
${colors.bright}structurecc${colors.reset} v${VERSION}
`);
}

function getClaudeDir() {
  return path.join(os.homedir(), '.claude');
}

function ensureDir(dir) {
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
}

function copyDir(src, dest) {
  ensureDir(dest);
  const entries = fs.readdirSync(src, { withFileTypes: true });

  for (const entry of entries) {
    const srcPath = path.join(src, entry.name);
    const destPath = path.join(dest, entry.name);

    if (entry.isDirectory()) {
      copyDir(srcPath, destPath);
    } else {
      fs.copyFileSync(srcPath, destPath);
    }
  }
}

function install() {
  const claudeDir = getClaudeDir();
  const commandsDir = path.join(claudeDir, 'commands', 'structure');
  const agentsDir = path.join(claudeDir, 'agents');

  const packageDir = path.resolve(__dirname, '..');
  const srcCommandsDir = path.join(packageDir, 'commands', 'structure');
  const srcAgentsDir = path.join(packageDir, 'agents');

  log('Installing structurecc...', colors.yellow);
  log('');

  // Install command
  if (fs.existsSync(srcCommandsDir)) {
    copyDir(srcCommandsDir, commandsDir);
    log('  ✓ Installed /structure command', colors.green);
  }

  // Install agents
  if (fs.existsSync(srcAgentsDir)) {
    const agentFiles = fs.readdirSync(srcAgentsDir);
    ensureDir(agentsDir);
    for (const file of agentFiles) {
      if (file.startsWith('structureit-')) {
        fs.copyFileSync(
          path.join(srcAgentsDir, file),
          path.join(agentsDir, file)
        );
        log(`  ✓ Installed ${file.replace('.md', '')}`, colors.green);
      }
    }
  }

  log('');
  log(`${colors.green}Done!${colors.reset}`);
  log('');
  log(`Run in Claude Code:`, colors.bright);
  log(`  /structure path/to/document.pdf`, colors.cyan);
  log('');
  log(`${colors.dim}Supports: PDF, DOCX, PNG, JPG, TIFF${colors.reset}`);
  log('');
}

function uninstall() {
  const claudeDir = getClaudeDir();
  const commandsDir = path.join(claudeDir, 'commands', 'structure');
  const agentsDir = path.join(claudeDir, 'agents');

  log('Uninstalling structurecc...', colors.yellow);

  if (fs.existsSync(commandsDir)) {
    fs.rmSync(commandsDir, { recursive: true });
    log('  ✓ /structure command removed', colors.green);
  }

  if (fs.existsSync(agentsDir)) {
    const agentFiles = fs.readdirSync(agentsDir);
    for (const file of agentFiles) {
      if (file.startsWith('structureit-')) {
        fs.unlinkSync(path.join(agentsDir, file));
        log(`  ✓ Removed ${file}`, colors.green);
      }
    }
  }

  log('');
  log('Uninstall complete.', colors.green);
  log('');
}

// Main
const args = process.argv.slice(2);

banner();

if (args.includes('--uninstall') || args.includes('-u')) {
  uninstall();
} else if (args.includes('--help') || args.includes('-h')) {
  log('Usage: npx structurecc [options]', colors.bright);
  log('');
  log('Options:', colors.bright);
  log('  --help, -h       Show this help', colors.dim);
  log('  --uninstall, -u  Remove from Claude Code', colors.dim);
  log('');
  log('After install, use in Claude Code:', colors.bright);
  log('  /structure path/to/document.pdf', colors.cyan);
  log('  /structure path/to/document.docx', colors.cyan);
  log('');
} else {
  install();
}
package/commands/structure/structure.md
ADDED
@@ -0,0 +1,242 @@
---
name: structure
description: Extract structured data from PDFs and Word docs using AI agent swarms
arguments:
  - name: path
    description: Path to document (PDF, DOCX, or image)
    required: true
---

# /structure - Agentic Document Extraction

Turn complex documents into structured markdown using parallel AI subagents.

## Overview

1. Extract all images from the document
2. Spawn ONE subagent PER IMAGE (all in parallel)
3. Each agent analyzes its image and writes structured markdown
4. Combine into final STRUCTURED.md

## Step 1: Setup

Create output directory next to the document:
```
<document_name>_extracted/
├── images/          # Extracted visuals
├── elements/        # Per-element markdown
└── STRUCTURED.md    # Final output
```

## Step 2: Extract Images

**For PDF files** - Use PyMuPDF:

```python
import fitz  # PyMuPDF
import os

pdf_path = "<document_path>"
output_dir = "<output_dir>"
images_dir = os.path.join(output_dir, "images")
os.makedirs(images_dir, exist_ok=True)

doc = fitz.open(pdf_path)
extracted = []

for page_num in range(len(doc)):
    page = doc[page_num]
    for img_idx, img in enumerate(page.get_images()):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # CMYK etc.: convert to RGB before saving as PNG
            pix = fitz.Pixmap(fitz.csRGB, pix)

        img_name = f"p{page_num + 1}_img{img_idx + 1}.png"
        pix.save(os.path.join(images_dir, img_name))
        extracted.append({"path": os.path.join(images_dir, img_name), "page": page_num + 1, "name": img_name})
        pix = None

doc.close()
print(f"Extracted {len(extracted)} images")
```

**For DOCX files** - Unzip and extract media:

```python
from zipfile import ZipFile
import os

docx_path = "<document_path>"
output_dir = "<output_dir>"
images_dir = os.path.join(output_dir, "images")
os.makedirs(images_dir, exist_ok=True)

extracted = []
with ZipFile(docx_path, 'r') as z:
    for f in z.namelist():
        if f.startswith('word/media/'):
            name = os.path.basename(f)
            path = os.path.join(images_dir, name)
            with z.open(f) as src, open(path, 'wb') as dst:
                dst.write(src.read())
            extracted.append({"path": path, "name": name})

print(f"Extracted {len(extracted)} images")
```

**For standalone images** - Just process them directly.

Also extract the main text:
- PDF: `page.get_text()` for each page
- DOCX: `textutil -convert txt "<path>" -stdout` (macOS)
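Since `textutil` ships only with macOS, the DOCX body text can also be pulled straight out of the zip with the standard library; a sketch, assuming the default `word/document.xml` layout inside the archive (the function name is illustrative):

```python
# Sketch: cross-platform DOCX text extraction without textutil.
# A .docx is a zip; body text lives in word/document.xml as <w:t> runs
# grouped into <w:p> paragraphs.
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used throughout word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_main_text(docx_path):
    """Return the body text of a .docx, paragraphs separated by blank lines."""
    with zipfile.ZipFile(docx_path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

This loses tables and formatting, which is fine here: visual elements are handled by the image swarm, and only the running text needs to reach `main_text.md`.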

## Step 3: Spawn Agent Swarm

**CRITICAL:** Launch ALL agents in ONE message with MULTIPLE Task tool calls.

For EACH extracted image:

````
Task(
  subagent_type: "general-purpose",
  description: "Extract element [N]",
  prompt: """
You are extracting structured data from a document image.

**Image:** <full_path_to_image>
**Source:** Page <N> of <document_name>
**Output:** Write to <output_dir>/elements/element_<N>.md

## Instructions

1. Read the image carefully
2. Identify what it contains (table, figure, chart, heatmap, diagram, etc.)
3. Extract ALL visible data - be exhaustive
4. Structure as clean markdown

## Output Format

Write this to the output file:

```markdown
# [Descriptive Title]

**Type:** [table/figure/chart/heatmap/diagram/other]
**Source:** Page [N], [document name]

## Content

[For tables: markdown table with all data]
[For figures: detailed description + all visible text/labels]
[For charts: data points, axes, trends]
[For heatmaps: labels, color scale, patterns]
[For diagrams: components, relationships, flow]

## Labels & Text

[Every piece of text visible, verbatim]

## Notes

[Confidence level, unclear items marked with [?]]
```

Be thorough. Extract every data point.
"""
)
````

Launch 10 images = 10 Task calls in ONE message. They run in parallel.

## Step 4: Extract Main Text

Save document text to `elements/main_text.md`:

```markdown
# Main Document Text

**Source:** [document name]

---

[Full text extracted from document, preserving structure]
```

## Step 5: Combine Results

After all agents complete, read all `elements/*.md` files and create:

**STRUCTURED.md:**

```markdown
# [Document Name] - Structured Extraction

**Original:** [filename]
**Extracted:** [date/time]
**Elements:** [N] visual elements processed

---

## Main Text

[Content from main_text.md]

---

## Visual Elements

### Element 1
[Content from element_1.md]

### Element 2
[Content from element_2.md]

[... continue for all elements ...]

---

## Extraction Summary

| # | Type | Source | Status |
|---|------|--------|--------|
| 1 | Table | Page 2 | ✓ |
| 2 | Figure | Page 3 | ✓ |
| ... | ... | ... | ... |
```
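The combine step can also be scripted rather than assembled by hand; a minimal sketch assuming the layout from Step 1 (`elements/main_text.md` plus `elements/element_<N>.md` files) — the command itself leaves the exact composition to the agent, and the function name is illustrative:

```python
# Sketch: merge per-element markdown files into STRUCTURED.md.
import os
import re

def combine_elements(output_dir, doc_name):
    elements_dir = os.path.join(output_dir, "elements")

    def element_number(filename):
        # Numeric sort so element_10.md comes after element_2.md
        match = re.search(r"element_(\d+)", filename)
        return int(match.group(1)) if match else 0

    element_files = sorted(
        (f for f in os.listdir(elements_dir)
         if f.startswith("element_") and f.endswith(".md")),
        key=element_number,
    )

    parts = [f"# {doc_name} - Structured Extraction", "",
             f"**Original:** {doc_name}",
             f"**Elements:** {len(element_files)} visual elements processed",
             "", "---", "", "## Main Text", ""]

    main_text = os.path.join(elements_dir, "main_text.md")
    if os.path.exists(main_text):
        with open(main_text) as fh:
            parts.append(fh.read())

    parts += ["", "---", "", "## Visual Elements", ""]
    for i, filename in enumerate(element_files, start=1):
        with open(os.path.join(elements_dir, filename)) as fh:
            parts += [f"### Element {i}", fh.read(), ""]

    out_path = os.path.join(output_dir, "STRUCTURED.md")
    with open(out_path, "w") as fh:
        fh.write("\n".join(parts))
    return out_path
```

Keeping the per-element files on disk alongside the combined output is what makes the "check individual element files" tip below possible.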

## Step 6: Display Results

```
╔═══════════════════════════════════════════════════════════╗
║                  EXTRACTION COMPLETE                      ║
╠═══════════════════════════════════════════════════════════╣
║                                                           ║
║  Document:  [name]                                        ║
║  Output:    [path]_extracted/                             ║
║                                                           ║
║  Extracted: [N] visual elements                           ║
║                                                           ║
║  Files:                                                   ║
║    images/        [N] extracted images                    ║
║    elements/      [N] element markdown files              ║
║    STRUCTURED.md  Combined output                         ║
║                                                           ║
╚═══════════════════════════════════════════════════════════╝
```

Then open: `open "<output_dir>/STRUCTURED.md"`

## Dependencies

Install PyMuPDF if not present:
```bash
pip3 install PyMuPDF --quiet
```

## Tips

- Use the opus model for best extraction quality on complex visuals
- Each image = one agent = one API call
- Agents run in parallel for speed
- Check individual element files if one extraction looks wrong
package/package.json
ADDED
@@ -0,0 +1,34 @@
{
  "name": "structurecc",
  "version": "1.0.0",
  "description": "Agentic document extraction for Claude Code. One command. Every figure. Every table.",
  "keywords": [
    "document-extraction",
    "pdf",
    "structure",
    "agentic",
    "claude-code",
    "llm",
    "multimodal",
    "tables",
    "figures",
    "markdown",
    "ai-agents",
    "ocr"
  ],
  "author": "James Weatherhead",
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/JamesWeatherhead/structurecc"
  },
  "homepage": "https://github.com/JamesWeatherhead/structurecc#readme",
  "bin": {
    "structurecc": "./bin/install.js"
  },
  "files": [
    "bin/",
    "commands/",
    "agents/"
  ]
}