structurecc 1.0.4 → 2.0.0

package/README.md CHANGED
@@ -1,4 +1,4 @@
1
- <h1 align="center">STRUCTURE</h1>
1
+ <h1 align="center">STRUCTURE v2.0</h1>
2
2
 
3
3
  <p align="center">
4
4
  <strong>Landing AI charges $500/month for agentic document structuring.<br>This is free.</strong>
@@ -19,29 +19,84 @@
19
19
 
20
20
  ---
21
21
 
22
+ ## What's New in v2.0
23
+
24
+ **3-Phase Pipeline with Quality Verification**
25
+
26
+ ```
27
+ Image → [Classify] → [Extract] → [Verify] → Output
28
+ ↑_______↻_______↓
29
+ ```
30
+
31
+ | Phase | Agent | Purpose |
32
+ |-------|-------|---------|
33
+ | 1. Classify | `structurecc-classifier` | Fast triage to route to correct extractor |
34
+ | 2. Extract | 6 specialized extractors | Type-specific verbatim extraction |
35
+ | 3. Verify | `structurecc-verifier` | Quality scoring with auto-revision |
36
+
37
+ **Verbatim Extraction** - Text is copied EXACTLY as shown. No paraphrasing, no "cleanup."
38
+
39
+ **Quality Scoring** - Each extraction gets a 0.0-1.0 score. Failures auto-retry up to 2x.
40
+
41
+ **Specialized Extractors** - Tables, charts, heatmaps, diagrams each get dedicated agents.
42
+
43
+ ---
44
+
22
45
  ## The Problem
23
46
 
24
47
  You have a 50-page PDF with figures, tables, and charts. You need that data.
25
48
 
26
49
  **Manual approach:** Screenshot each figure. Transcribe tables cell by cell. Spend hours on one document.
27
50
 
28
- **With structurecc:** One command. Walk away. Come back to perfectly structured markdown.
51
+ **With structurecc:** One command. Walk away. Come back to perfectly structured markdown with quality verification.
29
52
 
30
53
  ```
31
54
  /structure paper.pdf
32
55
  ```
33
56
 
34
- Spawns parallel AI agents. Each agent analyzes one visual element. All run simultaneously. Done in minutes, not hours.
57
+ Spawns parallel AI agents. Each agent analyzes one visual element. All run simultaneously. Quality verified. Done in minutes, not hours.
35
58
 
36
59
  ---
37
60
 
38
- ## What is this?
61
+ ## Specialized Extractors
39
62
 
40
- Give it a document. It extracts every image. Spawns one AI agent per image. Each agent exhaustively analyzes its element—tables become markdown tables, figures get descriptions, charts get data points extracted.
63
+ | Extractor | Handles |
64
+ |-----------|---------|
65
+ | `structurecc-extract-table` | Tables with cell-by-cell accuracy, merged cells, footnotes |
66
+ | `structurecc-extract-chart` | Kaplan-Meier, bar, line, scatter, forest plots with axes, legends, data |
67
+ | `structurecc-extract-heatmap` | Expression heatmaps, correlation matrices with full label extraction |
68
+ | `structurecc-extract-diagram` | CONSORT flows, timelines, network diagrams with all node text |
69
+ | `structurecc-extract-multipanel` | Multi-panel figures (A, B, C, D) with per-panel extraction |
70
+ | `structurecc-extract-generic` | Photographs, schematics, equations, other visuals |
41
71
 
42
- Runs inside **[Claude Code](https://docs.anthropic.com/en/docs/claude-code)** (Anthropic's terminal assistant). One command. ~$0.50-$5 per document.
72
+ ---
43
73
 
44
- Like [Landing AI's Agentic Document Extraction](https://landing.ai/agentic-document-extraction), but running locally via Claude Code.
74
+ ## Quality Verification
75
+
76
+ Every extraction is verified against the source image:
77
+
78
+ ```json
79
+ {
80
+ "scores": {
81
+ "completeness": 0.95,
82
+ "accuracy": 0.92,
83
+ "verbatim_compliance": 0.88,
84
+ "structure_correctness": 0.97,
85
+ "overall": 0.93
86
+ },
87
+ "pass": true,
88
+ "threshold": 0.90
89
+ }
90
+ ```
91
+
92
+ | Score | Meaning |
93
+ |-------|---------|
94
+ | **completeness** | Was every visible element captured? |
95
+ | **accuracy** | Are values (numbers, stats) correct? |
96
+ | **verbatim_compliance** | Was text copied exactly as shown? |
97
+ | **structure_correctness** | Is the JSON structure valid? |
98
+
99
+ **Auto-revision:** If score < 0.90, extraction is re-run with specific feedback. Max 2 attempts.
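The threshold check and revision loop described above could look roughly like the following Python sketch. It is not taken from the package: `overall_score`, `extract_with_revision`, and the two callables are hypothetical names. The README does not say how `overall` is derived, so the unweighted mean is an assumption, although it does match the sample numbers ((0.95 + 0.92 + 0.88 + 0.97) / 4 = 0.93). The sketch also assumes "Max 2 attempts" means two revision passes after the initial extraction.

```python
from statistics import mean

THRESHOLD = 0.90     # pass threshold shown in the sample verification JSON
MAX_REVISIONS = 2    # assumed reading of "Max 2 attempts"

SUB_SCORES = ("completeness", "accuracy", "verbatim_compliance", "structure_correctness")


def overall_score(scores):
    """Unweighted mean of the four sub-scores (an assumption, not confirmed)."""
    return mean(scores[name] for name in SUB_SCORES)


def extract_with_revision(extract, verify):
    """Extract, verify, and retry with feedback while the score is too low.

    extract(feedback) returns an extraction; verify(extraction) returns
    (scores, feedback). Both are hypothetical callables supplied by the caller.
    """
    feedback = None
    result = {}
    for attempt in range(1, MAX_REVISIONS + 2):  # initial run plus revisions
        extraction = extract(feedback)
        scores, feedback = verify(extraction)
        overall = overall_score(scores)
        result = {
            "extraction": extraction,
            "overall": overall,
            "attempts": attempt,
            "pass": overall >= THRESHOLD,
        }
        if result["pass"]:
            break
    return result
```

The feedback from a failed verification is passed into the next extraction call, which mirrors the "re-run with specific feedback" wording above.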
45
100
 
46
101
  ---
47
102
 
@@ -88,10 +143,6 @@ Copy this command and paste it into your terminal:
88
143
  npm install -g @anthropic-ai/claude-code
89
144
  ```
90
145
 
91
- <p align="center">
92
- <img src="assets/screenshots/step0.png" alt="Install Claude Code" width="550">
93
- </p>
94
-
95
146
  Wait for it to finish.
96
147
 
97
148
  ---
@@ -104,11 +155,7 @@ Copy and run this:
104
155
  npx structurecc
105
156
  ```
106
157
 
107
- <p align="center">
108
- <img src="assets/screenshots/step1.png" alt="Install structurecc" width="420">
109
- </p>
110
-
111
- You will see a STRUCTURE banner. That means it worked. You only do this once.
158
+ You will see a STRUCTURE banner and 8 agents being installed. You only do this once.
112
159
 
113
160
  ---
114
161
 
@@ -116,17 +163,9 @@ You will see a STRUCTURE banner. That means it worked. You only do this once.
116
163
 
117
164
  Create a folder with your document:
118
165
 
119
- <p align="center">
120
- <img src="assets/screenshots/step2.png" alt="Folder structure" width="380">
121
- </p>
122
-
123
166
  ```
124
167
  documents/
125
- ├── document.pdf ← your PDF, DOCX, or image
126
- └── images/ ← extracted images go here (created automatically)
127
- ├── figure_1.png
128
- ├── table_2.png
129
- └── chart_3.png
168
+ └── document.pdf ← your PDF, DOCX, or image
130
169
  ```
131
170
 
132
171
  **Put your document in a folder. That's it.**
@@ -142,10 +181,6 @@ cd ~/Desktop/documents
142
181
  claude
143
182
  ```
144
183
 
145
- <p align="center">
146
- <img src="assets/screenshots/step3a.png" alt="Start Claude Code" width="460">
147
- </p>
148
-
149
184
  **Windows users:** Replace `~/Desktop/documents` with your actual path, like `C:\Users\YourName\Desktop\documents`
150
185
 
151
186
  The first time you run `claude`, it will ask for your API key. Paste it in.
@@ -160,32 +195,53 @@ Now you are inside Claude Code. Type this command:
160
195
  /structure document.pdf
161
196
  ```
162
197
 
163
- <p align="center">
164
- <img src="assets/screenshots/step3.png" alt="Run /structure" width="500">
165
- </p>
166
-
167
198
  **Important:** The `/structure` command only works inside Claude Code. If you type it in your regular terminal, it will not work.
168
199
 
169
200
  structurecc will:
170
201
  1. Extract every image from your document
171
- 2. Spawn one agent per image (all running in parallel)
172
- 3. Each agent exhaustively analyzes its visual element
173
- 4. Combine everything into `STRUCTURED.md`
202
+ 2. Classify each image (table, chart, heatmap, diagram, etc.)
203
+ 3. Spawn specialized extractors in parallel
204
+ 4. Verify each extraction against the source
205
+ 5. Auto-revise failed extractions
206
+ 6. Combine everything into `STRUCTURED.md`
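The classify, extract, verify, and combine steps above can be pictured with a small orchestration sketch in Python. This is illustrative only, since the package itself runs as Claude Code agents. `classify`, `extract`, and `verify` are hypothetical stand-ins for the agent calls, the `"markdown"` key on an extraction is an assumed field, and image extraction (step 1) and auto-revision (step 5) are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def process_element(image_path, classify, extract, verify):
    """Run steps 2 to 4 for a single image."""
    classification = classify(image_path)
    extraction = extract(image_path, classification)
    verification = verify(image_path, extraction)
    return {
        "image": image_path,
        "classification": classification,
        "extraction": extraction,
        "verification": verification,
    }


def run_pipeline(image_paths, classify, extract, verify, max_workers=8):
    """Process all images in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(
                lambda path: process_element(path, classify, extract, verify),
                image_paths,
            )
        )


def combine(results):
    """Step 6: join the per-element markdown into one document."""
    return "\n\n---\n\n".join(r["extraction"]["markdown"] for r in results)
```

Threads are used here because each step would mostly wait on a remote model call; that choice is part of the sketch, not something the README specifies.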
174
207
 
175
208
  ---
176
209
 
177
210
  ## What You Get
178
211
 
179
- A comprehensive markdown file with every visual element extracted:
212
+ A comprehensive output directory with full traceability:
180
213
 
181
214
  ```
182
215
  document_extracted/
183
- ├── images/ # All extracted visuals
184
- ├── elements/ # One markdown file per element
185
- │ ├── element_1.md # Table fully extracted
186
- ├── element_2.md # Figure analyzed
216
+ ├── images/ # All extracted visuals
217
+ ├── classifications/ # Phase 1: type detection
218
+ │ ├── element_001_class.json
219
+ │ └── ...
220
+ ├── extractions/ # Phase 2: JSON extractions
221
+ │ ├── element_001.json
222
+ │ └── ...
223
+ ├── verifications/ # Phase 3: quality scores
224
+ │ ├── element_001_verify.json
187
225
  │ └── ...
188
- └── STRUCTURED.md # Everything combined
226
+ ├── elements/ # Markdown per element
227
+ │ ├── element_001.md
228
+ │ └── ...
229
+ ├── STRUCTURED.md # Combined output
230
+ └── extraction_report.json # Quality metrics summary
231
+ ```
232
+
233
+ ### Quality Report
234
+
235
+ ```json
236
+ {
237
+ "document": "clinical_trial.pdf",
238
+ "pipeline_version": "2.0.0",
239
+ "elements_total": 15,
240
+ "elements_passed": 13,
241
+ "elements_revised": 2,
242
+ "elements_human_review": 0,
243
+ "average_quality_score": 0.92
244
+ }
189
245
  ```
190
246
 
191
247
  ### Example: Table Extraction
@@ -196,40 +252,52 @@ document_extracted/
196
252
  **Type:** Table
197
253
  **Source:** Page 3, clinical_trial.pdf
198
254
 
199
- ## Content
255
+ ## Data
200
256
 
201
- | Group | N | Age (mean±SD) | Male (%) |
202
- |-------|---|---------------|----------|
203
- | Treatment | 245 | 54.3±12.1 | 58.4 |
204
- | Placebo | 248 | 53.8±11.9 | 56.9 |
205
- | p-value | - | 0.67 | 0.73 |
257
+ | Characteristic | Treatment (n=245) | Placebo (n=248) | P-value |
258
+ |---|---|---|---|
259
+ | Age, years | 54.3 ± 12.1 | 53.8 ± 11.9 | 0.67 |
260
+ | Male (%) | 58.4 | 56.9 | 0.73 |
261
+ | BMI (kg/m²) | 28.7 ± 4.2 | 28.4 ± 4.1 | 0.42 |
206
262
 
207
- ## Notes
208
- - Confidence level: High
263
+ ## Footnotes
209
264
  - * Missing data excluded from analysis
265
+ - † Adjusted for baseline
210
266
  ```
211
267
 
212
- ### Example: Chart Analysis
268
+ ### Example: Kaplan-Meier Extraction
213
269
 
214
270
  ```markdown
215
271
  # Kaplan-Meier Survival Curves
216
272
 
217
- **Type:** Chart (Line/Survival)
273
+ **Type:** kaplan_meier
218
274
  **Source:** Page 7, clinical_trial.pdf
219
275
 
220
- ## Content
276
+ ## Axes
277
+
278
+ - **X-axis:** Time (Days) Since HSV Diagnosis
279
+ - Range: 0 to 7000
280
+ - **Y-axis:** Cumulative Risk of Dementia
281
+ - Range: 0 to 0.6
282
+
283
+ ## Legend
284
+
285
+ - **HSV: Dementia Risk**: purple solid
286
+ - **Control: Dementia Risk**: dark blue solid
287
+ - **HSV: Dementia Risk 95% CI**: light purple shaded area
288
+ - **Control: Dementia Risk 95% CI**: light orange shaded area
221
289
 
222
- Survival curves comparing treatment (blue) vs placebo (red) over 24 months.
290
+ ## Statistical Annotations
223
291
 
224
- Key data points:
225
- - 12-month survival: Treatment 0.89, Placebo 0.78
226
- - 24-month survival: Treatment 0.76, Placebo 0.61
227
- - Log-rank p = 0.003
292
+ - p_value: < 0.001
293
+ - hazard_ratio: 1.52 (95% CI: 1.38-1.68)
228
294
 
229
- ## Labels & Annotations
230
- - Y-axis: "Survival Probability"
231
- - X-axis: "Time (months)"
232
- - Legend: "Treatment (n=245)", "Placebo (n=248)"
295
+ ## Risk Table
296
+
297
+ | Time (days) | 0 | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 |
298
+ |---|---|---|---|---|---|---|---|---|
299
+ | HSV | 8,362 | 7,891 | 6,543 | 5,102 | 3,876 | 2,654 | 1,432 | 521 |
300
+ | Control | 41,810 | 39,765 | 33,421 | 26,543 | 19,876 | 13,543 | 7,654 | 2,876 |
233
301
  ```
234
302
 
235
303
  ---
@@ -238,11 +306,14 @@ Key data points:
238
306
 
239
307
  | Document | Elements | ~Cost |
240
308
  |----------|----------|-------|
241
- | Simple paper | 5-10 | $0.50-$1 |
242
- | Full paper | 15-25 | $2-$4 |
243
- | Dense report | 40+ | $5-$10 |
309
+ | Simple paper | 5-10 | $1-$2 |
310
+ | Full paper | 15-25 | $3-$6 |
311
+ | Dense report | 40+ | $8-$15 |
244
312
 
245
- Uses Claude's multimodal vision. Works best with **Opus 4.5** for complex tables and charts.
313
+ Uses Claude's multimodal vision with model-appropriate routing:
314
+ - **Haiku** for classification (fast, cheap)
315
+ - **Opus** for extraction (highest quality)
316
+ - **Sonnet** for verification (balanced)
246
317
 
247
318
  ---
248
319
 
@@ -268,6 +339,10 @@ You typed `/structure` in your regular terminal. You need to type it inside Clau
268
339
 
269
340
  Make sure your PDF contains actual images, not just text. Some PDFs render everything as text.
270
341
 
342
+ **Low quality scores**
343
+
344
+ Check `verifications/` for specific issues. Complex tables or poor image quality may need human review.
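One way to do that check in bulk is a short Python script that lists the failing elements. This is a sketch, not a tool shipped with the package: `failed_verifications` is a hypothetical helper, and it assumes each `*_verify.json` file follows the schema shown under Quality Verification (`scores`, `pass`, `threshold`).

```python
import json
from pathlib import Path


def failed_verifications(output_dir):
    """Return (filename, overall, weakest_metric) for each failing element."""
    failures = []
    pattern = "*_verify.json"  # naming taken from the output tree in the README
    for path in sorted(Path(output_dir, "verifications").glob(pattern)):
        report = json.loads(path.read_text(encoding="utf-8"))
        if report.get("pass", False):
            continue
        scores = report.get("scores", {})
        # Ignore the aggregate when looking for the weakest sub-score.
        sub_scores = {k: v for k, v in scores.items() if k != "overall"}
        weakest = min(sub_scores, key=sub_scores.get) if sub_scores else None
        failures.append((path.name, scores.get("overall"), weakest))
    return failures
```

Calling `failed_verifications("document_extracted")` would return one tuple per element that did not pass, with the lowest-scoring metric as a hint about where to look first.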
345
+
271
346
  **Claude Code asks for an API key**
272
347
 
273
348
  Either get an API key at [console.anthropic.com](https://console.anthropic.com/), or subscribe to Claude Pro/Max at [claude.ai](https://claude.ai/).
@@ -282,6 +357,18 @@ npx structurecc --uninstall
282
357
 
283
358
  ---
284
359
 
360
+ ## Upgrade from v1.x
361
+
362
+ Just run the installer again:
363
+
364
+ ```bash
365
+ npx structurecc
366
+ ```
367
+
368
+ The installer automatically removes the old `structurecc-extractor` and installs the new 8-agent pipeline.
369
+
370
+ ---
371
+
285
372
  ## License
286
373
 
287
374
  MIT
@@ -289,5 +376,5 @@ MIT
289
376
  ---
290
377
 
291
378
  <p align="center">
292
- <strong>Unstructured in. Structured out.</strong>
379
+ <strong>Verbatim in. Quality verified out.</strong>
293
380
  </p>
@@ -0,0 +1,135 @@
1
+ ---
2
+ name: structurecc-classifier
3
+ description: Phase 1 - Classify visual elements for specialized extraction
4
+ ---
5
+
6
+ # Visual Element Classifier
7
+
8
+ You are a rapid visual classifier. Your ONLY job is to identify what type of visual element an image contains so the correct specialized extractor can be dispatched.
9
+
10
+ ## Classification Task
11
+
12
+ Given an image, output a JSON classification. Nothing else.
13
+
14
+ ## Classification Types
15
+
16
+ | Type | Description |
17
+ |------|-------------|
18
+ | `table_simple` | Standard grid table with clear rows/columns, no merged cells |
19
+ | `table_complex` | Table with merged cells, nested headers, or irregular structure |
20
+ | `chart_kaplan_meier` | Survival curves / time-to-event plots with step functions |
21
+ | `chart_bar` | Bar charts (horizontal or vertical), grouped or stacked |
22
+ | `chart_line` | Line graphs showing trends over continuous x-axis |
23
+ | `chart_scatter` | Scatter plots with individual data points |
24
+ | `chart_box` | Box plots / whisker plots showing distributions |
25
+ | `chart_pie` | Pie charts or donut charts |
26
+ | `chart_area` | Area charts (filled line charts) |
27
+ | `chart_forest` | Forest plots (meta-analysis results) |
28
+ | `chart_volcano` | Volcano plots (differential expression) |
29
+ | `heatmap` | Color-coded matrix (correlation, expression, etc.) |
30
+ | `diagram_flowchart` | Process flows with boxes and arrows |
31
+ | `diagram_timeline` | Temporal sequences, study timelines, CONSORT diagrams |
32
+ | `diagram_network` | Network graphs, pathway diagrams, interaction maps |
33
+ | `diagram_schematic` | Technical schematics, anatomical diagrams |
34
+ | `diagram_venn` | Venn diagrams showing set overlaps |
35
+ | `multi_panel` | Composite figure with labeled panels (A, B, C, D) |
36
+ | `photograph` | Real-world photographs, microscopy images, scans |
37
+ | `equation` | Mathematical equations, formulas |
38
+ | `text_block` | Text-heavy image, caption, or label |
39
+ | `unknown` | Cannot confidently classify |
40
+
41
+ ## Output Format
42
+
43
+ Return ONLY valid JSON:
44
+
45
+ ```json
46
+ {
47
+ "classification": {
48
+ "primary_type": "chart_kaplan_meier",
49
+ "confidence": 0.95,
50
+ "secondary_type": null,
51
+ "is_multi_panel": false,
52
+ "panel_count": 1,
53
+ "contains_table": false,
54
+ "contains_text_annotations": true
55
+ },
56
+ "routing": {
57
+ "extractor": "structurecc-extract-chart",
58
+ "extraction_hints": ["survival_curve", "two_groups", "has_risk_table"]
59
+ }
60
+ }
61
+ ```
62
+
63
+ ## Field Definitions
64
+
65
+ ### classification
66
+ - `primary_type`: Main visual type from the table above
67
+ - `confidence`: 0.0-1.0 confidence in classification
68
+ - `secondary_type`: If image contains a secondary element (e.g., chart with embedded table)
69
+ - `is_multi_panel`: True if figure has labeled sub-panels (A, B, C...)
70
+ - `panel_count`: Number of panels if multi-panel
71
+ - `contains_table`: True if any tabular data is present
72
+ - `contains_text_annotations`: True if significant text labels/annotations present
73
+
74
+ ### routing
75
+ - `extractor`: Which specialized extractor to use
76
+ - `extraction_hints`: List of specific features to watch for
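A consumer of this output might validate it before dispatching an extractor. The Python sketch below checks only the fields defined above; `parse_classification` is a hypothetical helper, and treating a missing `secondary_type` the same as `null` is an assumption.

```python
import json


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_classification(raw):
    """Parse classifier output and raise ValueError if a field is off."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    cls = data.get("classification") if isinstance(data, dict) else None
    routing = data.get("routing") if isinstance(data, dict) else None
    if not isinstance(cls, dict) or not isinstance(routing, dict):
        raise ValueError("expected 'classification' and 'routing' objects")

    confidence = cls.get("confidence")
    panel_count = cls.get("panel_count")
    secondary = cls.get("secondary_type")
    checks = {
        "primary_type": isinstance(cls.get("primary_type"), str),
        "confidence": _is_number(confidence) and 0.0 <= confidence <= 1.0,
        "secondary_type": secondary is None or isinstance(secondary, str),
        "is_multi_panel": isinstance(cls.get("is_multi_panel"), bool),
        "panel_count": isinstance(panel_count, int) and not isinstance(panel_count, bool),
        "contains_table": isinstance(cls.get("contains_table"), bool),
        "contains_text_annotations": isinstance(cls.get("contains_text_annotations"), bool),
        "extractor": isinstance(routing.get("extractor"), str),
        "extraction_hints": isinstance(routing.get("extraction_hints"), list),
    }
    bad = [name for name, ok in checks.items() if not ok]
    if bad:
        raise ValueError("invalid or missing fields: " + ", ".join(bad))
    return data
```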
77
+
78
+ ## Extractor Routing
79
+
80
+ | Primary Type | Extractor |
81
+ |--------------|-----------|
82
+ | `table_simple`, `table_complex` | `structurecc-extract-table` |
83
+ | `chart_*` | `structurecc-extract-chart` |
84
+ | `heatmap` | `structurecc-extract-heatmap` |
85
+ | `diagram_*` | `structurecc-extract-diagram` |
86
+ | `multi_panel` | `structurecc-extract-multipanel` |
87
+ | `photograph`, `equation`, `text_block`, `unknown` | `structurecc-extract-generic` |
88
+
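The routing table above maps directly onto a small lookup function. The Python sketch below is one possible reading of it: `route` is a hypothetical name, and sending any unrecognized type to the generic extractor is an assumption that extends the table's handling of `unknown`.

```python
def route(primary_type):
    """Map a classifier primary_type to an extractor name."""
    if primary_type in ("table_simple", "table_complex"):
        return "structurecc-extract-table"
    if primary_type.startswith("chart_"):
        return "structurecc-extract-chart"
    if primary_type == "heatmap":
        return "structurecc-extract-heatmap"
    if primary_type.startswith("diagram_"):
        return "structurecc-extract-diagram"
    if primary_type == "multi_panel":
        return "structurecc-extract-multipanel"
    # photograph, equation, text_block, unknown, and (by assumption)
    # anything unrecognized fall through to the generic extractor.
    return "structurecc-extract-generic"
```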
89
+ ## Rules
90
+
91
+ 1. **Be fast** - This is a triage step, not deep analysis
92
+ 2. **Be decisive** - Pick the best match, use confidence to express uncertainty
93
+ 3. **Detect multi-panel** - If you see A, B, C, D labels, set `is_multi_panel: true`
94
+ 4. **Note secondary elements** - Charts often have risk tables, legends, etc.
95
+ 5. **Output ONLY JSON** - No explanations, no markdown, just the JSON object
96
+
97
+ ## Examples
98
+
99
+ **Kaplan-Meier curve with risk table below:**
100
+ ```json
101
+ {
102
+ "classification": {
103
+ "primary_type": "chart_kaplan_meier",
104
+ "confidence": 0.98,
105
+ "secondary_type": "table_simple",
106
+ "is_multi_panel": false,
107
+ "panel_count": 1,
108
+ "contains_table": true,
109
+ "contains_text_annotations": true
110
+ },
111
+ "routing": {
112
+ "extractor": "structurecc-extract-chart",
113
+ "extraction_hints": ["survival_curve", "has_risk_table", "has_confidence_intervals"]
114
+ }
115
+ }
116
+ ```
117
+
118
+ **Four-panel figure with A=bar chart, B=heatmap, C=box plot, D=table:**
119
+ ```json
120
+ {
121
+ "classification": {
122
+ "primary_type": "multi_panel",
123
+ "confidence": 0.99,
124
+ "secondary_type": null,
125
+ "is_multi_panel": true,
126
+ "panel_count": 4,
127
+ "contains_table": true,
128
+ "contains_text_annotations": true
129
+ },
130
+ "routing": {
131
+ "extractor": "structurecc-extract-multipanel",
132
+ "extraction_hints": ["panel_A_bar", "panel_B_heatmap", "panel_C_boxplot", "panel_D_table"]
133
+ }
134
+ }
135
+ ```