structurecc 2.0.0 → 2.0.1

Files changed (2)
  1. package/README.md +34 -307
  2. package/package.json +3 -8
package/README.md CHANGED
@@ -1,7 +1,7 @@
- <h1 align="center">STRUCTURE v2.0</h1>
+ <h1 align="center">STRUCTURE</h1>
 
  <p align="center">
- <strong>Landing AI charges $500/month for agentic document structuring.<br>This is free.</strong>
+ <strong>Extract structured data from PDFs, Word docs, and images using Claude Code.</strong>
  </p>
 
  <p align="center">
@@ -13,339 +13,84 @@
  <img src="assets/terminal.png" alt="structurecc" width="550">
  </p>
 
- <p align="center">
- <em>Works on Mac, Windows, and Linux</em>
- </p>
-
- ---
-
- ## What's New in v2.0
-
- **3-Phase Pipeline with Quality Verification**
-
- ```
- Image → [Classify] → [Extract] → [Verify] → Output
- ↑_______↻_______↓
- ```
-
- | Phase | Agent | Purpose |
- |-------|-------|---------|
- | 1. Classify | `structurecc-classifier` | Fast triage to route to correct extractor |
- | 2. Extract | 6 specialized extractors | Type-specific verbatim extraction |
- | 3. Verify | `structurecc-verifier` | Quality scoring with auto-revision |
-
- **Verbatim Extraction** - Text is copied EXACTLY as shown. No paraphrasing, no "cleanup."
-
- **Quality Scoring** - Each extraction gets a 0.0-1.0 score. Failures auto-retry up to 2x.
-
- **Specialized Extractors** - Tables, charts, heatmaps, diagrams each get dedicated agents.
-
- ---
-
- ## The Problem
-
- You have a 50-page PDF with figures, tables, and charts. You need that data.
-
- **Manual approach:** Screenshot each figure. Transcribe tables cell by cell. Spend hours on one document.
-
- **With structurecc:** One command. Walk away. Come back to perfectly structured markdown with quality verification.
-
- ```
- /structure paper.pdf
- ```
-
- Spawns parallel AI agents. Each agent analyzes one visual element. All run simultaneously. Quality verified. Done in minutes, not hours.
-
- ---
-
- ## Specialized Extractors
-
- | Extractor | Handles |
- |-----------|---------|
- | `structurecc-extract-table` | Tables with cell-by-cell accuracy, merged cells, footnotes |
- | `structurecc-extract-chart` | Kaplan-Meier, bar, line, scatter, forest plots with axes, legends, data |
- | `structurecc-extract-heatmap` | Expression heatmaps, correlation matrices with full label extraction |
- | `structurecc-extract-diagram` | CONSORT flows, timelines, network diagrams with all node text |
- | `structurecc-extract-multipanel` | Multi-panel figures (A, B, C, D) with per-panel extraction |
- | `structurecc-extract-generic` | Photographs, schematics, equations, other visuals |
-
- ---
-
- ## Quality Verification
-
- Every extraction is verified against the source image:
-
- ```json
- {
- "scores": {
- "completeness": 0.95,
- "accuracy": 0.92,
- "verbatim_compliance": 0.88,
- "structure_correctness": 0.97,
- "overall": 0.93
- },
- "pass": true,
- "threshold": 0.90
- }
- ```
-
- | Score | Meaning |
- |-------|---------|
- | **completeness** | Was every visible element captured? |
- | **accuracy** | Are values (numbers, stats) correct? |
- | **verbatim_compliance** | Was text copied exactly as shown? |
- | **structure_correctness** | Is the JSON structure valid? |
-
- **Auto-revision:** If score < 0.90, extraction is re-run with specific feedback. Max 2 attempts.
-
- ---
-
- ## Before You Start
-
- You need two things:
-
- ### 1. Node.js
-
- Check if you have it:
-
- ```bash
- node --version
- ```
-
- If you see a version number, you're good. If you see "command not found", download Node.js from **[nodejs.org](https://nodejs.org/)** and install it.
-
- ### 2. Anthropic API Key or Pro/Max Plan
-
- You need one of these to use Claude Code:
-
- - **API key:** Get one at **[console.anthropic.com](https://console.anthropic.com/)**. Requires a payment method.
- - **Pro or Max plan:** If you subscribe to Claude Pro ($20/mo) or Max ($100/mo), you can use Claude Code without a separate API key.
-
  ---
 
- ## Setup (5 minutes)
-
- ### Step 1: Open your terminal
+ ## Requirements
 
- **Mac:** Press `Cmd + Space`, type `Terminal`, press Enter
-
- **Windows:** Press `Win + X`, click "Terminal" or "PowerShell"
-
- **Linux:** Press `Ctrl + Alt + T`
+ - **Node.js** - [nodejs.org](https://nodejs.org/)
+ - **Claude Code** - Requires API key or Pro/Max subscription
 
  ---
 
- ### Step 2: Install Claude Code
+ ## Install
 
- Copy this command and paste it into your terminal:
+ ### Step 1: Install Claude Code
 
  ```bash
  npm install -g @anthropic-ai/claude-code
  ```
 
- Wait for it to finish.
-
- ---
-
- ### Step 3: Install structurecc
+ <p align="center">
+ <img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
+ </p>
 
- Copy and run this:
+ ### Step 2: Install structurecc
 
  ```bash
  npx structurecc
  ```
 
- You will see a STRUCTURE banner and 8 agents being installed. You only do this once.
-
- ---
-
- ### Step 4: Set up your document folder
-
- Create a folder with your document:
-
- ```
- documents/
- └── document.pdf ← your PDF, DOCX, or image
- ```
-
- **Put your document in a folder. That's it.**
-
- ---
+ <p align="center">
+ <img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
+ </p>
 
- ### Step 5: Open Claude Code
+ ### Step 3: Start Claude Code
 
- Navigate to your document folder and start Claude Code:
+ Navigate to your document folder and run:
 
  ```bash
  cd ~/Desktop/documents
  claude
  ```
 
- **Windows users:** Replace `~/Desktop/documents` with your actual path, like `C:\Users\YourName\Desktop\documents`
-
- The first time you run `claude`, it will ask for your API key. Paste it in.
-
- ---
+ <p align="center">
+ <img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
+ </p>
 
- ### Step 6: Run structure
+ ### Step 4: Run structure
 
- Now you are inside Claude Code. Type this command:
+ Inside Claude Code:
 
  ```
  /structure document.pdf
  ```
 
- **Important:** The `/structure` command only works inside Claude Code. If you type it in your regular terminal, it will not work.
+ <p align="center">
+ <img src="assets/screenshots/step3.png" alt="Run /structure" width="520">
+ </p>
 
- structurecc will:
- 1. Extract every image from your document
- 2. Classify each image (table, chart, heatmap, diagram, etc.)
- 3. Spawn specialized extractors in parallel
- 4. Verify each extraction against the source
- 5. Auto-revise failed extractions
- 6. Combine everything into `STRUCTURED.md`
+ Supports **PDF**, **DOCX**, **PNG**, **JPG**, and **TIFF**.
 
  ---
 
- ## What You Get
-
- A comprehensive output directory with full traceability:
+ ## Output
 
  ```
  document_extracted/
- ├── images/ # All extracted visuals
- ├── classifications/ # Phase 1: type detection
- │ ├── element_001_class.json
- │ └── ...
- ├── extractions/ # Phase 2: JSON extractions
- │ ├── element_001.json
- │ └── ...
- ├── verifications/ # Phase 3: quality scores
- │ ├── element_001_verify.json
- │ └── ...
- ├── elements/ # Markdown per element
- │ ├── element_001.md
- │ └── ...
- ├── STRUCTURED.md # Combined output
- └── extraction_report.json # Quality metrics summary
- ```
-
- ### Quality Report
-
- ```json
- {
- "document": "clinical_trial.pdf",
- "pipeline_version": "2.0.0",
- "elements_total": 15,
- "elements_passed": 13,
- "elements_revised": 2,
- "elements_human_review": 0,
- "average_quality_score": 0.92
- }
- ```
-
- ### Example: Table Extraction
-
- ```markdown
- # Patient Demographics
-
- **Type:** Table
- **Source:** Page 3, clinical_trial.pdf
-
- ## Data
-
- | Characteristic | Treatment (n=245) | Placebo (n=248) | P-value |
- |---|---|---|---|
- | Age, years | 54.3 ± 12.1 | 53.8 ± 11.9 | 0.67 |
- | Male (%) | 58.4 | 56.9 | 0.73 |
- | BMI (kg/m²) | 28.7 ± 4.2 | 28.4 ± 4.1 | 0.42 |
-
- ## Footnotes
- - * Missing data excluded from analysis
- - † Adjusted for baseline
- ```
-
- ### Example: Kaplan-Meier Extraction
-
- ```markdown
- # Kaplan-Meier Survival Curves
-
- **Type:** kaplan_meier
- **Source:** Page 7, clinical_trial.pdf
-
- ## Axes
-
- - **X-axis:** Time (Days) Since HSV Diagnosis
- - Range: 0 to 7000
- - **Y-axis:** Cumulative Risk of Dementia
- - Range: 0 to 0.6
-
- ## Legend
-
- - **HSV: Dementia Risk**: purple solid
- - **Control: Dementia Risk**: dark blue solid
- - **HSV: Dementia Risk 95% CI**: light purple shaded area
- - **Control: Dementia Risk 95% CI**: light orange shaded area
-
- ## Statistical Annotations
-
- - p_value: < 0.001
- - hazard_ratio: 1.52 (95% CI: 1.38-1.68)
-
- ## Risk Table
-
- | Time (days) | 0 | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 |
- |---|---|---|---|---|---|---|---|---|
- | HSV | 8,362 | 7,891 | 6,543 | 5,102 | 3,876 | 2,654 | 1,432 | 521 |
- | Control | 41,810 | 39,765 | 33,421 | 26,543 | 19,876 | 13,543 | 7,654 | 2,876 |
+ ├── images/ # Extracted visuals
+ ├── elements/ # Markdown per element
+ └── STRUCTURED.md # Combined output
  ```
 
  ---
 
- ## Cost
-
- | Document | Elements | ~Cost |
- |----------|----------|-------|
- | Simple paper | 5-10 | $1-$2 |
- | Full paper | 15-25 | $3-$6 |
- | Dense report | 40+ | $8-$15 |
-
- Uses Claude's multimodal vision with model-appropriate routing:
- - **Haiku** for classification (fast, cheap)
- - **Opus** for extraction (highest quality)
- - **Sonnet** for verification (balanced)
-
- ---
-
- ## Supported Formats
-
- - **PDF** - Extracts embedded images via PyMuPDF
- - **DOCX** - Extracts images from Word's media folder
- - **PNG/JPG/TIFF** - Analyzes images directly
-
- ---
-
  ## Troubleshooting
 
- **"npm: command not found"**
-
- You need Node.js. Download it from [nodejs.org](https://nodejs.org/).
-
- **"bash: /structure: No such file or directory"**
-
- You typed `/structure` in your regular terminal. You need to type it inside Claude Code. First run `claude` to start Claude Code, then type `/structure`.
-
- **"No images found"**
-
- Make sure your PDF contains actual images, not just text. Some PDFs render everything as text.
-
- **Low quality scores**
-
- Check `verifications/` for specific issues. Complex tables or poor image quality may need human review.
-
- **Claude Code asks for an API key**
-
- Either get an API key at [console.anthropic.com](https://console.anthropic.com/), or subscribe to Claude Pro/Max at [claude.ai](https://claude.ai/).
+ | Issue | Solution |
+ |-------|----------|
+ | `npm: command not found` | Install Node.js from [nodejs.org](https://nodejs.org/) |
+ | `/structure: No such file` | Run `claude` first, then type `/structure` inside Claude Code |
+ | No images found | PDF may be text-only with no embedded images |
 
  ---
 
@@ -357,24 +102,6 @@ npx structurecc --uninstall
 
  ---
 
- ## Upgrade from v1.x
-
- Just run the installer again:
-
- ```bash
- npx structurecc
- ```
-
- The installer automatically removes the old `structurecc-extractor` and installs the new 8-agent pipeline.
-
- ---
-
  ## License
 
  MIT
-
- ---
-
- <p align="center">
- <strong>Verbatim in. Quality verified out.</strong>
- </p>
package/package.json CHANGED
@@ -1,24 +1,19 @@
  {
  "name": "structurecc",
- "version": "2.0.0",
- "description": "Agentic document structuring for Claude Code with verbatim extraction and quality verification. 3-phase pipeline: Classify → Extract → Verify.",
+ "version": "2.0.1",
+ "description": "Extract structured data from PDFs, Word docs, and images using Claude Code.",
  "keywords": [
  "document-extraction",
  "pdf",
  "structure",
- "agentic",
  "claude-code",
  "llm",
  "multimodal",
  "tables",
  "figures",
  "charts",
- "heatmaps",
  "markdown",
- "ai-agents",
- "ocr",
- "verbatim",
- "quality-assurance"
+ "ai-agents"
  ],
  "author": "James Weatherhead",
  "license": "MIT",