structurecc 2.0.0 → 2.0.1
- package/README.md +34 -307
- package/package.json +3 -8
package/README.md
CHANGED
@@ -1,7 +1,7 @@
-<h1 align="center">STRUCTURE
+<h1 align="center">STRUCTURE</h1>
 
 <p align="center">
-<strong>
+<strong>Extract structured data from PDFs, Word docs, and images using Claude Code.</strong>
 </p>
 
 <p align="center">
@@ -13,339 +13,84 @@
 <img src="assets/terminal.png" alt="structurecc" width="550">
 </p>
 
-<p align="center">
-<em>Works on Mac, Windows, and Linux</em>
-</p>
-
----
-
-## What's New in v2.0
-
-**3-Phase Pipeline with Quality Verification**
-
-```
-Image → [Classify] → [Extract] → [Verify] → Output
-↑_______↻_______↓
-```
-
-| Phase | Agent | Purpose |
-|-------|-------|---------|
-| 1. Classify | `structurecc-classifier` | Fast triage to route to correct extractor |
-| 2. Extract | 6 specialized extractors | Type-specific verbatim extraction |
-| 3. Verify | `structurecc-verifier` | Quality scoring with auto-revision |
-
-**Verbatim Extraction** - Text is copied EXACTLY as shown. No paraphrasing, no "cleanup."
-
-**Quality Scoring** - Each extraction gets a 0.0-1.0 score. Failures auto-retry up to 2x.
-
-**Specialized Extractors** - Tables, charts, heatmaps, diagrams each get dedicated agents.
-
----
-
-## The Problem
-
-You have a 50-page PDF with figures, tables, and charts. You need that data.
-
-**Manual approach:** Screenshot each figure. Transcribe tables cell by cell. Spend hours on one document.
-
-**With structurecc:** One command. Walk away. Come back to perfectly structured markdown with quality verification.
-
-```
-/structure paper.pdf
-```
-
-Spawns parallel AI agents. Each agent analyzes one visual element. All run simultaneously. Quality verified. Done in minutes, not hours.
-
----
-
-## Specialized Extractors
-
-| Extractor | Handles |
-|-----------|---------|
-| `structurecc-extract-table` | Tables with cell-by-cell accuracy, merged cells, footnotes |
-| `structurecc-extract-chart` | Kaplan-Meier, bar, line, scatter, forest plots with axes, legends, data |
-| `structurecc-extract-heatmap` | Expression heatmaps, correlation matrices with full label extraction |
-| `structurecc-extract-diagram` | CONSORT flows, timelines, network diagrams with all node text |
-| `structurecc-extract-multipanel` | Multi-panel figures (A, B, C, D) with per-panel extraction |
-| `structurecc-extract-generic` | Photographs, schematics, equations, other visuals |
-
----
-
-## Quality Verification
-
-Every extraction is verified against the source image:
-
-```json
-{
-  "scores": {
-    "completeness": 0.95,
-    "accuracy": 0.92,
-    "verbatim_compliance": 0.88,
-    "structure_correctness": 0.97,
-    "overall": 0.93
-  },
-  "pass": true,
-  "threshold": 0.90
-}
-```
-
-| Score | Meaning |
-|-------|---------|
-| **completeness** | Was every visible element captured? |
-| **accuracy** | Are values (numbers, stats) correct? |
-| **verbatim_compliance** | Was text copied exactly as shown? |
-| **structure_correctness** | Is the JSON structure valid? |
-
-**Auto-revision:** If score < 0.90, extraction is re-run with specific feedback. Max 2 attempts.
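
The scoring and auto-revision rules in the removed README text above can be sketched roughly as follows. This is a minimal illustration, not structurecc's implementation. It assumes `overall` is the unweighted mean of the four sub-scores (this matches the sample values, but the README does not state the formula). It also reads "Max 2 attempts" as at most two re-runs after the first extraction. `extract` and `verify` are hypothetical callables standing in for the extractor and verifier agents.

```python
from statistics import mean
from typing import Callable, Optional

THRESHOLD = 0.90    # pass threshold quoted in the README
MAX_REVISIONS = 2   # assumption: "Max 2 attempts" means two re-runs

SCORE_KEYS = (
    "completeness",
    "accuracy",
    "verbatim_compliance",
    "structure_correctness",
)


def overall_score(scores: dict) -> float:
    """Assumed aggregation: unweighted mean of the four sub-scores."""
    return round(mean(scores[key] for key in SCORE_KEYS), 2)


def extract_with_revision(
    extract: Callable[[Optional[str]], dict],
    verify: Callable[[dict], tuple],
) -> tuple:
    """Run extract/verify, re-running with feedback while the score is low.

    `extract` takes the previous feedback (None on the first run) and
    returns an extraction; `verify` returns (scores, feedback).
    Returns (extraction, overall score, number of revisions used).
    """
    feedback = None
    for revision in range(MAX_REVISIONS + 1):
        result = extract(feedback)
        scores, feedback = verify(result)
        overall = overall_score(scores)
        if overall >= THRESHOLD:
            break
    return result, overall, revision


# Sub-scores copied from the README's sample verification output.
sample = {
    "completeness": 0.95,
    "accuracy": 0.92,
    "verbatim_compliance": 0.88,
    "structure_correctness": 0.97,
}
print(overall_score(sample), overall_score(sample) >= THRESHOLD)
```

An extraction that still scores below the threshold after the last revision is returned as-is. The README's `elements_human_review` counter suggests such cases are flagged for a person, but that handling is not shown here.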
-
----
-
-## Before You Start
-
-You need two things:
-
-### 1. Node.js
-
-Check if you have it:
-
-```bash
-node --version
-```
-
-If you see a version number, you're good. If you see "command not found", download Node.js from **[nodejs.org](https://nodejs.org/)** and install it.
-
-### 2. Anthropic API Key or Pro/Max Plan
-
-You need one of these to use Claude Code:
-
-- **API key:** Get one at **[console.anthropic.com](https://console.anthropic.com/)**. Requires a payment method.
-- **Pro or Max plan:** If you subscribe to Claude Pro ($20/mo) or Max ($100/mo), you can use Claude Code without a separate API key.
-
 ---
 
-##
-
-### Step 1: Open your terminal
+## Requirements
 
-**
-
-**Windows:** Press `Win + X`, click "Terminal" or "PowerShell"
-
-**Linux:** Press `Ctrl + Alt + T`
+- **Node.js** - [nodejs.org](https://nodejs.org/)
+- **Claude Code** - Requires API key or Pro/Max subscription
 
 ---
 
-
+## Install
 
-
+### Step 1: Install Claude Code
 
 ```bash
 npm install -g @anthropic-ai/claude-code
 ```
 
-
-
-
-
-### Step 3: Install structurecc
+<p align="center">
+<img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
+</p>
 
-
+### Step 2: Install structurecc
 
 ```bash
 npx structurecc
 ```
 
-
-
-
-
-### Step 4: Set up your document folder
-
-Create a folder with your document:
-
-```
-documents/
-└── document.pdf ← your PDF, DOCX, or image
-```
-
-**Put your document in a folder. That's it.**
-
----
+<p align="center">
+<img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
+</p>
 
-### Step
+### Step 3: Start Claude Code
 
-Navigate to your document folder and
+Navigate to your document folder and run:
 
 ```bash
 cd ~/Desktop/documents
 claude
 ```
 
-
-
-
-
----
+<p align="center">
+<img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
+</p>
 
-### Step
+### Step 4: Run structure
 
-
+Inside Claude Code:
 
 ```
 /structure document.pdf
 ```
 
-
+<p align="center">
+<img src="assets/screenshots/step3.png" alt="Run /structure" width="520">
+</p>
 
-
-1. Extract every image from your document
-2. Classify each image (table, chart, heatmap, diagram, etc.)
-3. Spawn specialized extractors in parallel
-4. Verify each extraction against the source
-5. Auto-revise failed extractions
-6. Combine everything into `STRUCTURED.md`
+Supports **PDF**, **DOCX**, **PNG**, **JPG**, and **TIFF**.
 
 ---
 
-##
-
-A comprehensive output directory with full traceability:
+## Output
 
 ```
 document_extracted/
-├── images/
-├──
-
-│   └── ...
-├── extractions/ # Phase 2: JSON extractions
-│   ├── element_001.json
-│   └── ...
-├── verifications/ # Phase 3: quality scores
-│   ├── element_001_verify.json
-│   └── ...
-├── elements/ # Markdown per element
-│   ├── element_001.md
-│   └── ...
-├── STRUCTURED.md # Combined output
-└── extraction_report.json # Quality metrics summary
-```
-
-### Quality Report
-
-```json
-{
-  "document": "clinical_trial.pdf",
-  "pipeline_version": "2.0.0",
-  "elements_total": 15,
-  "elements_passed": 13,
-  "elements_revised": 2,
-  "elements_human_review": 0,
-  "average_quality_score": 0.92
-}
-```
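
A report like the sample above could be summarized with a few lines of Python. This is only a sketch under stated assumptions: the field names come from the removed README text, their exact semantics are inferred from the names alone, and `summarize` is a hypothetical helper, not part of structurecc.

```python
import json

# Sample report copied from the 2.0.0 README; real reports may carry more fields.
REPORT_JSON = """
{
  "document": "clinical_trial.pdf",
  "pipeline_version": "2.0.0",
  "elements_total": 15,
  "elements_passed": 13,
  "elements_revised": 2,
  "elements_human_review": 0,
  "average_quality_score": 0.92
}
"""


def summarize(report: dict) -> dict:
    """Turn the raw element counts into shares of the total."""
    total = report["elements_total"]

    def share(key: str) -> float:
        # Guard against an empty document with zero elements.
        return report[key] / total if total else 0.0

    return {
        "passed_share": share("elements_passed"),
        "revised_share": share("elements_revised"),
        "needs_human_review": report["elements_human_review"] > 0,
    }


summary = summarize(json.loads(REPORT_JSON))
print(summary)
```

In practice the JSON would be read from `extraction_report.json` in the output directory rather than from an inline string.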
-
-### Example: Table Extraction
-
-```markdown
-# Patient Demographics
-
-**Type:** Table
-**Source:** Page 3, clinical_trial.pdf
-
-## Data
-
-| Characteristic | Treatment (n=245) | Placebo (n=248) | P-value |
-|---|---|---|---|
-| Age, years | 54.3 ± 12.1 | 53.8 ± 11.9 | 0.67 |
-| Male (%) | 58.4 | 56.9 | 0.73 |
-| BMI (kg/m²) | 28.7 ± 4.2 | 28.4 ± 4.1 | 0.42 |
-
-## Footnotes
-- * Missing data excluded from analysis
-- † Adjusted for baseline
-```
-
-### Example: Kaplan-Meier Extraction
-
-```markdown
-# Kaplan-Meier Survival Curves
-
-**Type:** kaplan_meier
-**Source:** Page 7, clinical_trial.pdf
-
-## Axes
-
-- **X-axis:** Time (Days) Since HSV Diagnosis
-  - Range: 0 to 7000
-- **Y-axis:** Cumulative Risk of Dementia
-  - Range: 0 to 0.6
-
-## Legend
-
-- **HSV: Dementia Risk**: purple solid
-- **Control: Dementia Risk**: dark blue solid
-- **HSV: Dementia Risk 95% CI**: light purple shaded area
-- **Control: Dementia Risk 95% CI**: light orange shaded area
-
-## Statistical Annotations
-
-- p_value: < 0.001
-- hazard_ratio: 1.52 (95% CI: 1.38-1.68)
-
-## Risk Table
-
-| Time (days) | 0 | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 |
-|---|---|---|---|---|---|---|---|---|
-| HSV | 8,362 | 7,891 | 6,543 | 5,102 | 3,876 | 2,654 | 1,432 | 521 |
-| Control | 41,810 | 39,765 | 33,421 | 26,543 | 19,876 | 13,543 | 7,654 | 2,876 |
+├── images/ # Extracted visuals
+├── elements/ # Markdown per element
+└── STRUCTURED.md # Combined output
 ```
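
For downstream scripts, the 2.0.1 output layout shown above could be walked roughly like this. It is a sketch under assumptions: the directory and file names are taken from the README tree, `collect_outputs` is a hypothetical helper, and the per-element file naming (`element_001.md`) appears only in the 2.0.0 tree, so it may differ in practice.

```python
from pathlib import Path


def collect_outputs(root: Path) -> dict:
    """Gather the paths the README lists under <document>_extracted/."""
    images_dir = root / "images"
    elements_dir = root / "elements"
    return {
        # Combined markdown output for the whole document.
        "combined": root / "STRUCTURED.md",
        # One markdown file per extracted element, in name order.
        "elements": sorted(elements_dir.glob("*.md")) if elements_dir.is_dir() else [],
        # Extracted visuals, whatever their image format.
        "images": sorted(p for p in images_dir.iterdir() if p.is_file())
        if images_dir.is_dir()
        else [],
    }


outputs = collect_outputs(Path("document_extracted"))
print(len(outputs["elements"]), "element files;", len(outputs["images"]), "images")
```

Missing directories yield empty lists instead of raising, so the helper is safe to call before an extraction has run.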
 
 ---
 
-## Cost
-
-| Document | Elements | ~Cost |
-|----------|----------|-------|
-| Simple paper | 5-10 | $1-$2 |
-| Full paper | 15-25 | $3-$6 |
-| Dense report | 40+ | $8-$15 |
-
-Uses Claude's multimodal vision with model-appropriate routing:
-- **Haiku** for classification (fast, cheap)
-- **Opus** for extraction (highest quality)
-- **Sonnet** for verification (balanced)
-
----
-
-## Supported Formats
-
-- **PDF** - Extracts embedded images via PyMuPDF
-- **DOCX** - Extracts images from Word's media folder
-- **PNG/JPG/TIFF** - Analyzes images directly
-
----
-
 ## Troubleshooting
 
-
-
-
-
-
-
-You typed `/structure` in your regular terminal. You need to type it inside Claude Code. First run `claude` to start Claude Code, then type `/structure`.
-
-**"No images found"**
-
-Make sure your PDF contains actual images, not just text. Some PDFs render everything as text.
-
-**Low quality scores**
-
-Check `verifications/` for specific issues. Complex tables or poor image quality may need human review.
-
-**Claude Code asks for an API key**
-
-Either get an API key at [console.anthropic.com](https://console.anthropic.com/), or subscribe to Claude Pro/Max at [claude.ai](https://claude.ai/).
+| Issue | Solution |
+|-------|----------|
+| `npm: command not found` | Install Node.js from [nodejs.org](https://nodejs.org/) |
+| `/structure: No such file` | Run `claude` first, then type `/structure` inside Claude Code |
+| No images found | PDF may be text-only with no embedded images |
 
 ---
 
@@ -357,24 +102,6 @@ npx structurecc --uninstall
 
 ---
 
-## Upgrade from v1.x
-
-Just run the installer again:
-
-```bash
-npx structurecc
-```
-
-The installer automatically removes the old `structurecc-extractor` and installs the new 8-agent pipeline.
-
----
-
 ## License
 
 MIT
-
----
-
-<p align="center">
-<strong>Verbatim in. Quality verified out.</strong>
-</p>
package/package.json
CHANGED
@@ -1,24 +1,19 @@
 {
   "name": "structurecc",
-  "version": "2.0.
-  "description": "
+  "version": "2.0.1",
+  "description": "Extract structured data from PDFs, Word docs, and images using Claude Code.",
   "keywords": [
     "document-extraction",
     "pdf",
     "structure",
-    "agentic",
     "claude-code",
     "llm",
     "multimodal",
     "tables",
     "figures",
     "charts",
-    "heatmaps",
     "markdown",
-    "ai-agents"
-    "ocr",
-    "verbatim",
-    "quality-assurance"
+    "ai-agents"
   ],
   "author": "James Weatherhead",
   "license": "MIT",