structurecc 1.0.4 → 2.0.0

@@ -0,0 +1,265 @@
1
+ ---
2
+ name: structurecc-verifier
3
+ description: Phase 3 - Quality verification and scoring for extractions
4
+ ---
5
+
6
+ # Extraction Verifier
7
+
8
+ You verify extractions against source images. You score quality and identify issues for revision.
9
+
10
+ ## Your Role
11
+
12
+ Given:
13
+ 1. A source image
14
+ 2. A JSON extraction
15
+
16
+ You produce a verification report with quality scores and specific feedback.
17
+
18
+ ## Output Schema
19
+
20
+ Return ONLY this JSON structure:
21
+
22
+ ```json
23
+ {
24
+ "verification_type": "extraction_quality",
25
+ "element_id": "element_001",
26
+ "source_image": "/path/to/image.png",
27
+ "extraction_file": "/path/to/extraction.json",
28
+ "scores": {
29
+ "completeness": 0.95,
30
+ "accuracy": 0.92,
31
+ "verbatim_compliance": 0.88,
32
+ "structure_correctness": 0.97,
33
+ "overall": 0.93
34
+ },
35
+ "pass": true,
36
+ "threshold": 0.90,
37
+ "issues": [
38
+ {
39
+ "severity": "minor",
40
+ "category": "verbatim",
41
+ "field": "axes.x.label",
42
+ "extracted": "Time (days)",
43
+ "expected": "Time (Days)",
44
+ "issue": "Capitalization changed"
45
+ },
46
+ {
47
+ "severity": "minor",
48
+ "category": "completeness",
49
+ "field": "legend.entries",
50
+ "extracted": "3 entries",
51
+ "expected": "4 entries",
52
+ "issue": "Missing legend entry for '95% CI (Control)'"
53
+ }
54
+ ],
55
+ "revision_feedback": null,
56
+ "needs_human_review": false,
57
+ "verification_notes": "Extraction is high quality with minor capitalization inconsistencies."
58
+ }
59
+ ```
60
+
61
+ ## Scoring Criteria
62
+
63
+ ### Completeness (0.0 - 1.0)
64
+ How much of the visible content was captured?
65
+
66
+ | Score | Meaning |
67
+ |-------|---------|
68
+ | 1.0 | Every visible element captured |
69
+ | 0.9 | 1-2 minor elements missing |
70
+ | 0.8 | Several minor elements or 1 significant element missing |
71
+ | 0.7 | Multiple elements missing |
72
+ | 0.6 | Substantial content missing |
73
+ | <0.6 | Major content gaps |
74
+
75
+ **Check for:**
76
+ - All axis labels
77
+ - All legend entries
78
+ - All data points/rows/columns
79
+ - All annotations
80
+ - All footnotes
81
+ - Title and caption
82
+
83
+ ### Accuracy (0.0 - 1.0)
84
+ Are the extracted values correct?
85
+
86
+ | Score | Meaning |
87
+ |-------|---------|
88
+ | 1.0 | All values correct |
89
+ | 0.9 | 1-2 minor numerical errors |
90
+ | 0.8 | Several minor errors |
91
+ | 0.7 | Significant errors present |
92
+ | <0.7 | Major accuracy issues |
93
+
94
+ **Check for:**
95
+ - Numerical values match exactly
96
+ - Statistical values (p-values, CIs) exact
97
+ - Sample sizes correct
98
+ - Percentages correct
99
+ - Dates/times correct
100
+
101
+ ### Verbatim Compliance (0.0 - 1.0)
102
+ Was text copied exactly as shown?
103
+
104
+ | Score | Meaning |
105
+ |-------|---------|
106
+ | 1.0 | Perfect verbatim copy |
107
+ | 0.9 | 1-2 minor formatting changes |
108
+ | 0.8 | Several capitalization/spacing changes |
109
+ | 0.7 | Abbreviations expanded or text paraphrased |
110
+ | <0.7 | Significant rewording |
111
+
112
+ **Check for:**
113
+ - Capitalization preserved
114
+ - Abbreviations kept as-is
115
+ - Special symbols preserved (±, μ, ≤)
116
+ - Superscripts/subscripts noted
117
+ - Typos NOT corrected (leave them)
118
+ - No "helpful" expansions
119
+
120
+ ### Structure Correctness (0.0 - 1.0)
121
+ Is the JSON structure valid and appropriate?
122
+
123
+ | Score | Meaning |
124
+ |-------|---------|
125
+ | 1.0 | Perfect structure |
126
+ | 0.9 | Minor structural issues |
127
+ | 0.8 | Some misplaced fields |
128
+ | 0.7 | Structure partially matches schema |
129
+ | <0.7 | Major structural problems |
130
+
131
+ **Check for:**
132
+ - JSON is valid
133
+ - Required fields present
134
+ - Correct schema for element type
135
+ - Arrays where arrays expected
136
+ - Nested structures correct
137
+
138
+ ### Overall Score
139
+ ```
140
+ overall = (completeness * 0.35) + (accuracy * 0.30) + (verbatim_compliance * 0.25) + (structure_correctness * 0.10)
141
+ ```
142
+
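The weighted sum above can be checked by hand or with a few lines of code. The following is a minimal sketch, not part of the package: the weights are copied from the formula, while `WEIGHTS` and `overallScore` are illustrative names.

```javascript
// Weights copied from the formula above; the names are illustrative.
const WEIGHTS = {
  completeness: 0.35,
  accuracy: 0.30,
  verbatim_compliance: 0.25,
  structure_correctness: 0.10,
};

// Returns the weighted sum of the four dimension scores.
function overallScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dimension, weight]) => sum + scores[dimension] * weight,
    0
  );
}

// Scores taken from the Output Schema example.
const overall = overallScore({
  completeness: 0.95,
  accuracy: 0.92,
  verbatim_compliance: 0.88,
  structure_correctness: 0.97,
});
console.log(overall.toFixed(2)); // "0.93" (0.9255 before rounding)
```

Because completeness carries the largest weight, a missing element lowers the overall score more than a structural slip of the same size.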
143
+ ## Pass/Fail Decision
144
+
145
+ ```
146
+ {
147
+ "pass": overall >= 0.90,
148
+ "threshold": 0.90
149
+ }
150
+ ```
151
+
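The decision block above is shorthand rather than literal JSON, since `overall >= 0.90` is an expression. A sketch of how the two fields could be computed, assuming `overall` is a number between 0 and 1 (`passFail` and `THRESHOLD` are hypothetical names):

```javascript
const THRESHOLD = 0.90; // threshold value from the section above

// Builds the pass/threshold pair for a verification report.
function passFail(overallValue, threshold = THRESHOLD) {
  return { pass: overallValue >= threshold, threshold };
}

console.log(JSON.stringify(passFail(0.93))); // {"pass":true,"threshold":0.9}
console.log(JSON.stringify(passFail(0.85))); // {"pass":false,"threshold":0.9}
```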
152
+ ## Issue Severity Levels
153
+
154
+ | Severity | Impact | Examples |
155
+ |----------|--------|----------|
156
+ | `critical` | Data integrity compromised | Wrong numbers, missing tables, fabricated data |
157
+ | `major` | Significant content affected | Missing legend, wrong axis labels, incomplete rows |
158
+ | `minor` | Small inaccuracies | Capitalization, spacing, minor formatting |
159
+ | `cosmetic` | Non-data formatting | JSON formatting, field ordering |
160
+
161
+ ## Revision Feedback
162
+
163
+ When `pass: false`, provide specific revision instructions:
164
+
165
+ ```json
166
+ {
167
+ "revision_feedback": {
168
+ "revision_number": 1,
169
+ "max_revisions": 2,
170
+ "specific_fixes": [
171
+ "Add missing legend entry: 'Control: 95% CI' with color 'light orange' and style 'shaded'",
172
+ "Fix axes.x.label capitalization: use 'Time (Days)' not 'Time (days)'",
173
+ "Add missing risk table row for timepoint 7000"
174
+ ],
175
+ "re_extract_sections": ["legend", "risk_table"],
176
+ "preserve_sections": ["axes", "data_series", "annotations"]
177
+ }
178
+ }
179
+ ```
180
+
181
+ ## Human Review Triggers
182
+
183
+ Set `needs_human_review: true` when:
184
+
185
+ 1. **Revision limit reached:** Already revised twice, still failing
186
+ 2. **Unreadable content:** Image quality too poor
187
+ 3. **Complex ambiguity:** Multiple valid interpretations exist
188
+ 4. **Unusual format:** Element doesn't fit any schema
189
+ 5. **Confidence too low:** Verifier cannot reliably assess
190
+
191
+ ```json
192
+ {
193
+ "needs_human_review": true,
194
+ "human_review_reason": "Extraction has been revised twice but legend colors cannot be reliably determined from image quality. Score: 0.85"
195
+ }
196
+ ```
197
+
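Only the first trigger (revision limit reached) is mechanical; the other four rely on the verifier's judgment. A sketch of that one rule, assuming the caller tracks how many revisions have already been applied (`nextAction` and its return values are hypothetical, not part of the package):

```javascript
// Hypothetical helper: decides the next step after one verification pass.
// `revisionsDone` counts revisions already applied to this extraction.
function nextAction(pass, revisionsDone, maxRevisions = 2) {
  if (pass) return 'accept';
  if (revisionsDone >= maxRevisions) return 'human_review';
  return 'revise';
}

console.log(nextAction(true, 0));  // accept
console.log(nextAction(false, 1)); // revise
console.log(nextAction(false, 2)); // human_review
```

The default of 2 mirrors `max_revisions` in the revision feedback example.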
198
+ ## Verification Process
199
+
200
+ 1. **Load source image** - View the original
201
+ 2. **Load extraction JSON** - Parse the extraction
202
+ 3. **Systematic comparison:**
203
+ - Title/caption → exact match?
204
+ - Axes → all labels, all ticks?
205
+ - Legend → all entries, colors, styles?
206
+ - Data → all values present and correct?
207
+ - Annotations → all text captured?
208
+ 4. **Score each dimension**
209
+ 5. **List all issues found**
210
+ 6. **Calculate overall score**
211
+ 7. **Determine pass/fail**
212
+ 8. **Generate revision feedback if needed**
213
+
214
+ ## Example Verification
215
+
216
+ **Source:** Kaplan-Meier curve with risk table
217
+ **Extraction:** JSON with missing 95% CI legend entries
218
+
219
+ ```json
220
+ {
221
+ "verification_type": "extraction_quality",
222
+ "element_id": "element_004",
223
+ "source_image": "/output/images/p8_img1.png",
224
+ "extraction_file": "/output/extractions/element_004.json",
225
+ "scores": {
226
+ "completeness": 0.85,
227
+ "accuracy": 0.98,
228
+ "verbatim_compliance": 0.92,
229
+ "structure_correctness": 1.0,
230
+ "overall": 0.92
231
+ },
232
+ "pass": true,
233
+ "threshold": 0.90,
234
+ "issues": [
235
+ {
236
+ "severity": "major",
237
+ "category": "completeness",
238
+ "field": "legend.entries",
239
+ "extracted": "2 entries (HSV line, Control line)",
240
+ "expected": "4 entries (lines + CI shading)",
241
+ "issue": "Missing confidence interval legend entries"
242
+ },
243
+ {
244
+ "severity": "minor",
245
+ "category": "verbatim",
246
+ "field": "axes.y.label",
247
+ "extracted": "Cumulative Risk",
248
+ "expected": "Cumulative Risk of Dementia",
249
+ "issue": "Truncated axis label"
250
+ }
251
+ ],
252
+ "revision_feedback": null,
253
+ "needs_human_review": false,
254
+ "verification_notes": "Extraction meets threshold despite missing CI legend entries. Minor label truncation noted but overall quality acceptable."
255
+ }
256
+ ```
257
+
258
+ ## Output Rules
259
+
260
+ 1. Return ONLY the JSON object
261
+ 2. No markdown code fences
262
+ 3. No explanatory text
263
+ 4. Be specific in issue descriptions
264
+ 5. Provide actionable revision feedback
265
+ 6. Always include all score dimensions
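A consumer of the verifier's reply could enforce some of these rules mechanically. The sketch below assumes the reply arrives as a raw JSON string; `checkReport` and `SCORE_KEYS` are illustrative names, and it checks only a few fields of the schema.

```javascript
const SCORE_KEYS = [
  'completeness',
  'accuracy',
  'verbatim_compliance',
  'structure_correctness',
  'overall',
];

// Returns a list of problems found in a parsed report (empty if none).
function checkReport(report) {
  const problems = [];
  const scores = report.scores || {};
  for (const key of SCORE_KEYS) {
    const value = scores[key];
    if (typeof value !== 'number' || value < 0 || value > 1) {
      problems.push(`scores.${key} must be a number between 0.0 and 1.0`);
    }
  }
  if (typeof report.pass !== 'boolean') problems.push('pass must be a boolean');
  if (!Array.isArray(report.issues)) problems.push('issues must be an array');
  return problems;
}

// Rule 2 forbids code fences, so the raw reply should parse directly as JSON.
const reply = '{"scores":{"completeness":0.85,"accuracy":0.98,' +
  '"verbatim_compliance":0.92,"structure_correctness":1.0},' +
  '"pass":true,"issues":[]}';
const found = checkReport(JSON.parse(reply));
console.log(found); // one problem: scores.overall is missing
```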
package/bin/install.js CHANGED
@@ -4,9 +4,21 @@ const fs = require('fs');
4
4
  const path = require('path');
5
5
  const os = require('os');
6
6
 
7
- const VERSION = '1.0.4';
7
+ const VERSION = '2.0.0';
8
8
  const PACKAGE_NAME = 'structurecc';
9
9
 
10
+ // Agent files in v2.0
11
+ const AGENT_FILES = [
12
+ 'structurecc-classifier.md',
13
+ 'structurecc-extract-table.md',
14
+ 'structurecc-extract-chart.md',
15
+ 'structurecc-extract-heatmap.md',
16
+ 'structurecc-extract-diagram.md',
17
+ 'structurecc-extract-multipanel.md',
18
+ 'structurecc-extract-generic.md',
19
+ 'structurecc-verifier.md'
20
+ ];
21
+
10
22
  // Colors
11
23
  const colors = {
12
24
  reset: '\x1b[0m',
@@ -25,33 +37,25 @@ function log(msg, color = '') {
25
37
  }
26
38
 
27
39
  function banner() {
28
- console.log(`
29
- ${colors.cyan}
30
- ╔══════════════════════════════════════════════════════════════════════════════╗
31
- ║ ║
32
- ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗ ███████╗║
33
- ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗██╔════╝║
34
- ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝█████╗
35
- ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗██╔══╝
36
- ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║███████╗║
37
- ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝║
38
- ║ ║
39
- ${colors.reset}${colors.bright}Agentic Document Structuring${colors.reset}${colors.cyan} ║
40
- ${colors.reset}${colors.dim}One command. Every figure. Every table.${colors.reset}${colors.cyan} ║
41
- ║ ║
42
- ╠══════════════════════════════════════════════════════════════════════════════╣
43
- ║ ║
44
- ${colors.reset}${colors.yellow}PDF${colors.cyan} ───▶ ${colors.green}[Agent 1]${colors.cyan} ───┐ ║
45
- ║ ${colors.green}[Agent 2]${colors.cyan} ───┤ ║
46
- ║ ${colors.green}[Agent 3]${colors.cyan} ───┼───▶ ${colors.magenta}STRUCTURED.md${colors.cyan} ║
47
- ║ ${colors.green}[Agent N]${colors.cyan} ───┘ ║
48
- ║ ║
49
- ║ ${colors.reset}${colors.white}Unstructured in. Structured out.${colors.reset}${colors.cyan} ║
50
- ║ ║
51
- ╚══════════════════════════════════════════════════════════════════════════════╝
52
- ${colors.reset}
53
- ${colors.bright}structurecc${colors.reset} v${VERSION}
54
- `);
40
+ const c = colors;
41
+ console.log('');
42
+ console.log(c.cyan + ' ┌─────────────────────────────────────────────────────┐' + c.reset);
43
+ console.log(c.cyan + ' │ │' + c.reset);
44
+ console.log(c.cyan + ' │ ' + c.bright + 'S T R U C T U R E v2.0' + c.reset + c.cyan + ' │' + c.reset);
45
+ console.log(c.cyan + ' │ │' + c.reset);
46
+ console.log(c.cyan + ' │ ' + c.reset + 'Agentic Document Structuring' + c.cyan + ' │' + c.reset);
47
+ console.log(c.cyan + ' │ ' + c.dim + 'Verbatim extraction. Quality verified.' + c.reset + c.cyan + ' │' + c.reset);
48
+ console.log(c.cyan + ' │ │' + c.reset);
49
+ console.log(c.cyan + ' ├─────────────────────────────────────────────────────┤' + c.reset);
50
+ console.log(c.cyan + ' │ │' + c.reset);
51
+ console.log(c.cyan + ' │ ' + c.yellow + 'PDF' + c.reset + ' ──▶ ' + c.magenta + '[Classify]' + c.reset + ' ──▶ ' + c.green + '[Extract]' + c.reset + ' ──▶ ' + c.cyan + '[Verify]' + c.reset + ' ' + c.cyan + '│' + c.reset);
52
+ console.log(c.cyan + ' │ ↑_______↻_______↓ │' + c.reset);
53
+ console.log(c.cyan + ' │ │' + c.reset);
54
+ console.log(c.cyan + ' │ ' + c.white + '3-phase pipeline with quality scoring' + c.reset + c.cyan + ' │' + c.reset);
55
+ console.log(c.cyan + ' │ │' + c.reset);
56
+ console.log(c.cyan + ' └─────────────────────────────────────────────────────┘' + c.reset);
57
+ console.log('');
58
+ console.log(c.bright + 'structurecc' + c.reset + ' v' + VERSION);
55
59
  }
56
60
 
57
61
  function getClaudeDir() {
@@ -89,33 +93,59 @@ function install() {
89
93
  const srcCommandsDir = path.join(packageDir, 'commands', 'structure');
90
94
  const srcAgentsDir = path.join(packageDir, 'agents');
91
95
 
92
- log('Installing structurecc...', colors.yellow);
96
+ log('Installing structurecc v2.0...', colors.yellow);
93
97
  log('');
94
98
 
95
99
  // Install command
96
100
  if (fs.existsSync(srcCommandsDir)) {
97
101
  copyDir(srcCommandsDir, commandsDir);
98
- log(' ✓ Installed /structure command', colors.green);
102
+ log(' ✓ Installed /structure command (3-phase pipeline)', colors.green);
99
103
  }
100
104
 
101
105
  // Install agents
102
106
  if (fs.existsSync(srcAgentsDir)) {
103
- const agentFiles = fs.readdirSync(srcAgentsDir);
104
107
  ensureDir(agentsDir);
105
- for (const file of agentFiles) {
106
- if (file.startsWith('structurecc-')) {
107
- fs.copyFileSync(
108
- path.join(srcAgentsDir, file),
109
- path.join(agentsDir, file)
110
- );
111
- log(` ✓ Installed ${file.replace('.md', '')}`, colors.green);
108
+ let installed = 0;
109
+ let skipped = 0;
110
+
111
+ for (const file of AGENT_FILES) {
112
+ const srcPath = path.join(srcAgentsDir, file);
113
+ const destPath = path.join(agentsDir, file);
114
+
115
+ if (fs.existsSync(srcPath)) {
116
+ fs.copyFileSync(srcPath, destPath);
117
+ const agentName = file.replace('.md', '');
118
+ log(` ✓ Installed ${agentName}`, colors.green);
119
+ installed++;
120
+ } else {
121
+ log(` ⚠ Missing ${file}`, colors.yellow);
122
+ skipped++;
112
123
  }
113
124
  }
125
+
126
+ // Remove old extractor if present
127
+ const oldExtractor = path.join(agentsDir, 'structurecc-extractor.md');
128
+ if (fs.existsSync(oldExtractor)) {
129
+ fs.unlinkSync(oldExtractor);
130
+ log(' ✓ Removed legacy structurecc-extractor', colors.dim);
131
+ }
132
+
133
+ log('');
134
+ log(` Agents installed: ${installed}`, colors.dim);
135
+ if (skipped > 0) {
136
+ log(` Agents skipped: ${skipped}`, colors.yellow);
137
+ }
114
138
  }
115
139
 
116
140
  log('');
117
141
  log(`${colors.green}Done!${colors.reset}`);
118
142
  log('');
143
+ log(`${colors.bright}What's new in v2.0:${colors.reset}`);
144
+ log(` • 3-phase pipeline: Classify → Extract → Verify`, colors.dim);
145
+ log(` • 6 specialized extractors (tables, charts, heatmaps, etc.)`, colors.dim);
146
+ log(` • Verbatim extraction with quality scoring`, colors.dim);
147
+ log(` • Auto-revision loop for failed extractions`, colors.dim);
148
+ log('');
119
149
  log(`Run in Claude Code:`, colors.bright);
120
150
  log(` /structure path/to/document.pdf`, colors.cyan);
121
151
  log('');
@@ -136,13 +166,17 @@ function uninstall() {
136
166
  }
137
167
 
138
168
  if (fs.existsSync(agentsDir)) {
169
+ let removed = 0;
170
+ // Remove all structurecc agents (both old and new)
139
171
  const agentFiles = fs.readdirSync(agentsDir);
140
172
  for (const file of agentFiles) {
141
173
  if (file.startsWith('structurecc-')) {
142
174
  fs.unlinkSync(path.join(agentsDir, file));
143
175
  log(` ✓ Removed ${file}`, colors.green);
176
+ removed++;
144
177
  }
145
178
  }
179
+ log(` Total agents removed: ${removed}`, colors.dim);
146
180
  }
147
181
 
148
182
  log('');
@@ -150,6 +184,14 @@ function uninstall() {
150
184
  log('');
151
185
  }
152
186
 
187
+ function showVersion() {
188
+ log(`structurecc v${VERSION}`, colors.bright);
189
+ log('');
190
+ log('Pipeline: 3-phase with verification', colors.dim);
191
+ log('Agents: 8 (classifier + 6 extractors + verifier)', colors.dim);
192
+ log('');
193
+ }
194
+
153
195
  // Main
154
196
  const args = process.argv.slice(2);
155
197
 
@@ -157,17 +199,33 @@ banner();
157
199
 
158
200
  if (args.includes('--uninstall') || args.includes('-u')) {
159
201
  uninstall();
202
+ } else if (args.includes('--version') || args.includes('-v')) {
203
+ showVersion();
160
204
  } else if (args.includes('--help') || args.includes('-h')) {
161
205
  log('Usage: npx structurecc [options]', colors.bright);
162
206
  log('');
163
207
  log('Options:', colors.bright);
164
208
  log(' --help, -h Show this help', colors.dim);
209
+ log(' --version, -v Show version info', colors.dim);
165
210
  log(' --uninstall, -u Remove from Claude Code', colors.dim);
166
211
  log('');
167
212
  log('After install, use in Claude Code:', colors.bright);
168
213
  log(' /structure path/to/document.pdf', colors.cyan);
169
214
  log(' /structure path/to/document.docx', colors.cyan);
170
215
  log('');
216
+ log('Pipeline:', colors.bright);
217
+ log(' Phase 1: Classification (haiku - fast triage)', colors.dim);
218
+ log(' Phase 2: Specialized extraction (opus - quality)', colors.dim);
219
+ log(' Phase 3: Verification (sonnet - balance)', colors.dim);
220
+ log('');
221
+ log('Extractors:', colors.bright);
222
+ log(' • structurecc-extract-table - Tables with cell-by-cell accuracy', colors.dim);
223
+ log(' • structurecc-extract-chart - Charts with axes, legends, data', colors.dim);
224
+ log(' • structurecc-extract-heatmap - Heatmaps with color scales', colors.dim);
225
+ log(' • structurecc-extract-diagram - Flowcharts, timelines, networks', colors.dim);
226
+ log(' • structurecc-extract-multipanel - Multi-panel figures (A,B,C,D)', colors.dim);
227
+ log(' • structurecc-extract-generic - Fallback for other visuals', colors.dim);
228
+ log('');
171
229
  } else {
172
230
  install();
173
231
  }
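
The main behavioral change in `install()` is the switch from copying every `structurecc-*` file found in the source directory to copying only the names listed in `AGENT_FILES`, and reporting any that are missing. A standalone sketch of that allowlist pattern, run against a throwaway temp directory (`installAgents` and the demo file names are made up for illustration):

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Copies only the files named in `allowlist`, counting copies and misses.
function installAgents(srcDir, destDir, allowlist) {
  fs.mkdirSync(destDir, { recursive: true });
  let installed = 0;
  let skipped = 0;
  for (const file of allowlist) {
    const srcPath = path.join(srcDir, file);
    if (fs.existsSync(srcPath)) {
      fs.copyFileSync(srcPath, path.join(destDir, file));
      installed++;
    } else {
      skipped++;
    }
  }
  return { installed, skipped };
}

// Demo in a temporary directory; these file names are hypothetical.
const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'agents-demo-'));
const src = path.join(tmp, 'src');
fs.mkdirSync(src);
fs.writeFileSync(path.join(src, 'structurecc-a.md'), '# a');
fs.writeFileSync(path.join(src, 'structurecc-stray.md'), '# not listed');

const result = installAgents(src, path.join(tmp, 'dest'), [
  'structurecc-a.md',
  'structurecc-b.md',
]);
console.log(result); // { installed: 1, skipped: 1 }
```

With an allowlist, a stray file in the package is never copied and a missing file is reported rather than silently absent. The trade-off is that the list must be updated whenever an agent file is added or renamed.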