structurecc 2.0.3 → 2.1.0

@@ -168,13 +168,65 @@ Return ONLY this JSON structure:
 
  ## Chart Type Specifications
 
- ### Kaplan-Meier / Survival Curves
+ ### Kaplan-Meier / Survival / Cumulative Incidence Curves
+
+ **CRITICAL: These are STEP FUNCTIONS, not smooth lines.**
+
  Required fields:
- - Step function data points
- - Risk table (if present)
- - Censoring marks (if visible)
- - Confidence interval bands (colors and styles)
- - Log-rank p-value
+ - **Curve style**: Must note `"curve_style": "step_function"`
+ - **Step function data points** - Capture visible steps/jumps
+ - **Risk table** (if present below chart)
+ - **Censoring marks** (vertical ticks if visible)
+ - **Confidence interval bands** - Note BOTH colors (often different per group)
+ - **Log-rank p-value** or other statistical annotation
+ - **Endpoint values** - Where each curve ends (y-value at max x)
+
+ **LEGEND EXTRACTION - VERBATIM:**
+ Read the legend box word-for-word. Common patterns:
+ - "HSV: Dementia Risk" (not "HSV group")
+ - "Control: Dementia Risk" (not "Control group")
+ - "HSV: Dementia Risk 95% CI" (for confidence bands)
+ - "Control: Dementia Risk 95% CI"
+
+ **COLOR PRECISION:**
+ - Main lines: Often purple vs dark blue
+ - CI bands: Often light purple vs yellow/orange (DIFFERENT colors!)
+ - Be specific: "light purple shaded band", "yellow/orange shaded band"
+
+ **Example for cumulative incidence:**
+ ```json
+ {
+   "chart_type": "kaplan_meier",
+   "curve_style": "step_function",
+   "chart_metadata": {
+     "title": "Cumulative Incidence of Dementia",
+     "source_page": 7
+   },
+   "axes": {
+     "x": {"label": "Time (Days) Since HSV Diagnosis", "min": 0, "max": 7000},
+     "y": {"label": "Cumulative Risk of Dementia", "min": 0.0, "max": 0.6}
+   },
+   "legend": {
+     "position": "right",
+     "title": "Legend",
+     "entries": [
+       {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid step"},
+       {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid step"},
+       {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+       {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+     ]
+   },
+   "curve_endpoints": [
+     {"series": "HSV: Dementia Risk", "final_x": 7000, "final_y": 0.32},
+     {"series": "Control: Dementia Risk", "final_x": 7000, "final_y": 0.05}
+   ],
+   "key_observations": [
+     "Step-function curves with visible jumps at event times",
+     "Curves diverge early and separation increases",
+     "CI bands widen substantially after day 5000"
+   ]
+ }
+ ```
 
  ### Bar Charts
  ```json
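The step-function requirement added above can double as a sanity check on extracted data points. A minimal sketch (the `{"x": ..., "y": ...}` point shape is an assumption for illustration, not part of the released schema): cumulative incidence must be non-decreasing in y with strictly increasing x.

```python
def plausible_step_series(points):
    """Return True if (x, y) points could come from a cumulative
    incidence step function: x strictly increasing, y never decreasing."""
    xs = [p["x"] for p in points]
    ys = [p["y"] for p in points]
    return all(a < b for a, b in zip(xs, xs[1:])) and \
           all(a <= b for a, b in zip(ys, ys[1:]))

# A plausible extraction for the HSV curve in the example above
hsv = [{"x": 0, "y": 0.0}, {"x": 500, "y": 0.04},
       {"x": 5000, "y": 0.18}, {"x": 7000, "y": 0.32}]
print(plausible_step_series(hsv))   # True

# A series that dips (impossible for cumulative incidence) fails
bad = [{"x": 0, "y": 0.10}, {"x": 100, "y": 0.05}]
print(plausible_step_series(bad))   # False
```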
@@ -5,7 +5,7 @@ description: Phase 2 - Verbatim multi-panel figure extraction (A, B, C, D panels
 
  # Multi-Panel Figure Extractor
 
- You extract multi-panel figures by processing EACH PANEL SEPARATELY with full verbatim accuracy.
+ You extract multi-panel figures by processing EACH PANEL SEPARATELY with ABSOLUTE verbatim accuracy.
 
  ## VERBATIM EXTRACTION RULES
 
@@ -16,6 +16,82 @@ You extract multi-panel figures by processing EACH PANEL SEPARATELY with full ve
  3. **Classify each panel** - Each panel may be a different type (chart, table, heatmap, etc.)
  4. **Preserve panel relationships** - Note when panels share legends, axes, or data
 
+ ## LEGEND EXTRACTION - VERBATIM REQUIRED
+
+ **CRITICAL FOR KAPLAN-MEIER/SURVIVAL CURVES:**
+
+ Read the legend box carefully and extract EVERY entry EXACTLY as written:
+
+ Example - If you see:
+ ```
+ Legend
+ — HSV: Dementia Risk
+ — Control: Dementia Risk
+ ▒ HSV: Dementia Risk 95% CI
+ ▒ Control: Dementia Risk 95% CI
+ ```
+
+ You MUST output:
+ ```json
+ "legend": {
+   "entries": [
+     {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid"},
+     {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid"},
+     {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+     {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+   ]
+ }
+ ```
+
+ Do NOT summarize as "HSV group" or "Control group" - use the EXACT text from the legend.
+
+ ## KAPLAN-MEIER / CUMULATIVE INCIDENCE CURVES
+
+ For survival/incidence curves, capture:
+
+ 1. **Curve Shape**: Note that these are STEP FUNCTIONS, not smooth lines
+ 2. **Key Inflection Points**: Where curves separate, where steps occur
+ 3. **Endpoint Values**: The y-value where each curve ends (be precise)
+ 4. **Confidence Intervals**: Shaded bands - note BOTH colors (often different for each group)
+ 5. **Widening CI**: Note if confidence intervals widen over time
+
+ Example output for a Kaplan-Meier panel:
+ ```json
+ {
+   "panel_id": "A",
+   "panel_type": "chart_kaplan_meier",
+   "panel_title": "Overall HSV vs Control",
+   "extraction": {
+     "chart_type": "kaplan_meier",
+     "curve_style": "step_function",
+     "axes": {
+       "x": {"label": "Time (Days) Since HSV Diagnosis", "min": 0, "max": 7000, "ticks": [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000]},
+       "y": {"label": "Cumulative Risk of Dementia", "min": 0.0, "max": 0.6, "ticks": [0, 0.2, 0.4, 0.6]}
+     },
+     "legend": {
+       "position": "right",
+       "title": "Legend",
+       "entries": [
+         {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid step"},
+         {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid step"},
+         {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+         {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+       ]
+     },
+     "curve_endpoints": [
+       {"series": "HSV", "final_x": 7000, "final_y": 0.32, "note": "steep step around day 6500"},
+       {"series": "Control", "final_x": 7000, "final_y": 0.05, "note": "relatively flat"}
+     ],
+     "key_observations": [
+       "Curves diverge early (~day 500) and separation increases over time",
+       "HSV group shows multiple step increases, particularly steep after day 5000",
+       "Control group remains relatively flat throughout",
+       "95% CI bands widen substantially after day 5000, especially for HSV group"
+     ]
+   }
+ }
+ ```
+
  ## Output Schema
 
  Return ONLY this JSON structure:
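The "do NOT summarize" rule these hunks add is mechanical enough to lint after the fact. A rough sketch (the helper and its word list are hypothetical, not part of the package): flag legend labels that read like generic group paraphrases rather than verbatim legend text.

```python
GENERIC_HINTS = ("group", "arm", "cohort")

def label_looks_summarized(label):
    """Heuristic lint: a legend label like 'HSV group' suggests the
    extractor paraphrased instead of copying the legend verbatim."""
    return any(hint in label.lower().split() for hint in GENERIC_HINTS)

print(label_looks_summarized("HSV group"))           # True
print(label_looks_summarized("HSV: Dementia Risk"))  # False
```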
package/bin/install.js CHANGED
@@ -4,7 +4,7 @@ const fs = require('fs');
  const path = require('path');
  const os = require('os');
 
- const VERSION = '2.0.0';
+ const VERSION = '2.1.0';
  const PACKAGE_NAME = 'structurecc';
 
  // Agent files in v2.0
@@ -39,23 +39,18 @@ function log(msg, color = '') {
  function banner() {
    const c = colors;
    console.log('');
-   console.log(c.cyan + ' ┌─────────────────────────────────────────────────────┐' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' ' + c.bright + 'S T R U C T U R E v2.0' + c.reset + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' ' + c.reset + 'Agentic Document Structuring' + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' ' + c.dim + 'Verbatim extraction. Quality verified.' + c.reset + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' ├─────────────────────────────────────────────────────┤' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + '' + c.yellow + 'PDF' + c.reset + ' ──▶ ' + c.magenta + '[Classify]' + c.reset + ' ──▶ ' + c.green + '[Extract]' + c.reset + ' ──▶ ' + c.cyan + '[Verify]' + c.reset + ' ' + c.cyan + '│' + c.reset);
-   console.log(c.cyan + ' │ ↑_______↻_______↓ │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' │ ' + c.white + '3-phase pipeline with quality scoring' + c.reset + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' └─────────────────────────────────────────────────────┘' + c.reset);
+   console.log(c.cyan + ' ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗ ███████╗' + c.reset);
+   console.log(c.cyan + ' ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗██╔════╝' + c.reset);
+   console.log(c.cyan + ' ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝█████╗ ' + c.reset);
+   console.log(c.cyan + ' ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗██╔══╝ ' + c.reset);
+   console.log(c.cyan + ' ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║███████╗' + c.reset);
+   console.log(c.cyan + ' ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝' + c.reset);
+   console.log('');
+   console.log(c.dim + ' Agentic Document Structuring' + c.reset + ' ' + c.yellow + 'v' + VERSION + c.reset);
+   console.log(c.dim + ' Verbatim extraction. Quality verified.' + c.reset);
+   console.log('');
+   console.log(c.dim + ' ─────────────────────────────────────────────────────────────────────────────' + c.reset);
    console.log('');
-   console.log(c.bright + 'structurecc' + c.reset + ' v' + VERSION);
  }
 
  function getClaudeDir() {
@@ -93,13 +88,13 @@ function install() {
    const srcCommandsDir = path.join(packageDir, 'commands', 'structure');
    const srcAgentsDir = path.join(packageDir, 'agents');
 
-   log('Installing structurecc v2.0...', colors.yellow);
+   log('Installing...', colors.dim);
    log('');
 
    // Install command
    if (fs.existsSync(srcCommandsDir)) {
      copyDir(srcCommandsDir, commandsDir);
-     log(' ✓ Installed /structure command (3-phase pipeline)', colors.green);
+     log(' ✓ /structure command', colors.green);
    }
 
    // Install agents
@@ -115,11 +110,8 @@ function install() {
      if (fs.existsSync(srcPath)) {
        fs.copyFileSync(srcPath, destPath);
        const agentName = file.replace('.md', '');
-       log(` ✓ Installed ${agentName}`, colors.green);
+       log(` ✓ ${agentName}`, colors.green);
        installed++;
-     } else {
-       log(` ⚠ Missing ${file}`, colors.yellow);
-       skipped++;
      }
    }
 
@@ -127,29 +119,11 @@ function install() {
    const oldExtractor = path.join(agentsDir, 'structurecc-extractor.md');
    if (fs.existsSync(oldExtractor)) {
      fs.unlinkSync(oldExtractor);
-     log(' ✓ Removed legacy structurecc-extractor', colors.dim);
-   }
-
-   log('');
-   log(` Agents installed: ${installed}`, colors.dim);
-   if (skipped > 0) {
-     log(` Agents skipped: ${skipped}`, colors.yellow);
    }
  }
 
  log('');
- log(`${colors.green}Done!${colors.reset}`);
- log('');
- log(`${colors.bright}What's new in v2.0:${colors.reset}`);
- log(` • 3-phase pipeline: Classify → Extract → Verify`, colors.dim);
- log(` • 7 specialized extractors (tables, charts, heatmaps, etc.)`, colors.dim);
- log(` • Verbatim extraction with quality scoring`, colors.dim);
- log(` • Auto-revision loop for failed extractions`, colors.dim);
- log('');
- log(`Run in Claude Code:`, colors.bright);
- log(` /structure path/to/document.pdf`, colors.cyan);
- log('');
- log(`${colors.dim}Supports: PDF, DOCX, PNG, JPG, TIFF${colors.reset}`);
+ log(`${colors.green}Done!${colors.reset} Run ${colors.cyan}/structure <path>${colors.reset} to extract.`);
  log('');
  }
 
@@ -187,9 +161,6 @@ function uninstall() {
  function showVersion() {
    log(`structurecc v${VERSION}`, colors.bright);
    log('');
-   log('Pipeline: 3-phase with verification', colors.dim);
-   log('Agents: 8 (classifier + 6 extractors + verifier)', colors.dim);
-   log('');
  }
 
  // Main
@@ -204,27 +175,11 @@ if (args.includes('--uninstall') || args.includes('-u')) {
  } else if (args.includes('--help') || args.includes('-h')) {
    log('Usage: npx structurecc [options]', colors.bright);
    log('');
-   log('Options:', colors.bright);
-   log('   --help, -h       Show this help', colors.dim);
-   log('   --version, -v    Show version info', colors.dim);
-   log('   --uninstall, -u  Remove from Claude Code', colors.dim);
+   log('Options:', colors.dim);
+   log('   --uninstall, -u  Remove from Claude Code');
    log('');
-   log('After install, use in Claude Code:', colors.bright);
+   log('After install, run in Claude Code:', colors.bright);
    log('   /structure path/to/document.pdf', colors.cyan);
-   log('   /structure path/to/document.docx', colors.cyan);
-   log('');
-   log('Pipeline:', colors.bright);
-   log('   Phase 1: Classification (haiku - fast triage)', colors.dim);
-   log('   Phase 2: Specialized extraction (opus - quality)', colors.dim);
-   log('   Phase 3: Verification (sonnet - balance)', colors.dim);
-   log('');
-   log('Extractors:', colors.bright);
-   log('   • structurecc-extract-table      - Tables with cell-by-cell accuracy', colors.dim);
-   log('   • structurecc-extract-chart      - Charts with axes, legends, data', colors.dim);
-   log('   • structurecc-extract-heatmap    - Heatmaps with color scales', colors.dim);
-   log('   • structurecc-extract-diagram    - Flowcharts, timelines, networks', colors.dim);
-   log('   • structurecc-extract-multipanel - Multi-panel figures (A,B,C,D)', colors.dim);
-   log('   • structurecc-extract-generic    - Fallback for other visuals', colors.dim);
    log('');
  } else {
    install();
@@ -52,7 +52,132 @@ for subdir in ["images", "classifications", "extractions", "verifications", "ele
  print(f"Output directory: {output_dir}")
  ```
 
- ## Step 2: Extract Images
+ ## Step 2: Extract Text and Images
+
+ **CRITICAL:** Extract BOTH the manuscript text AND images. Figures need context from the paper.
+
+ ### 2A: Extract Manuscript Text (PDF)
+
+ ```python
+ import fitz
+ import re
+ import json
+ from pathlib import Path
+
+ pdf_path = "<document_path>"
+ output_dir = Path("<output_dir>")
+
+ doc = fitz.open(pdf_path)
+ full_text = []
+ page_texts = {}
+
+ for page_num in range(len(doc)):
+     page = doc[page_num]
+     text = page.get_text("text")
+     full_text.append(text)
+     page_texts[page_num + 1] = text
+
+ # Save full manuscript text
+ with open(output_dir / "manuscript_text.txt", "w") as f:
+     f.write("\n\n---PAGE BREAK---\n\n".join(full_text))
+
+ # Parse figure and table captions
+ def extract_captions(text):
+     """Extract Figure and Table captions from manuscript text."""
+     captions = {"figures": {}, "tables": {}}
+
+     # Figure patterns: "Figure 1.", "Figure 1:", "Fig. 1.", "FIGURE 1"
+     fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
+     for match in re.finditer(fig_pattern, text, re.IGNORECASE):
+         fig_num = match.group(1)
+         caption = match.group(2).strip()
+         # Clean up caption (remove extra whitespace)
+         caption = ' '.join(caption.split())
+         captions["figures"][fig_num] = caption
+
+     # Table patterns: "Table 1.", "Table 1:", "TABLE 1"
+     table_pattern = r'(?:Table|TABLE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
+     for match in re.finditer(table_pattern, text, re.IGNORECASE):
+         table_num = match.group(1)
+         caption = match.group(2).strip()
+         caption = ' '.join(caption.split())
+         captions["tables"][table_num] = caption
+
+     return captions
+
+ all_text = "\n".join(full_text)
+ captions = extract_captions(all_text)
+
+ # Save captions
+ with open(output_dir / "captions.json", "w") as f:
+     json.dump(captions, f, indent=2)
+
+ print(f"Extracted text from {len(doc)} pages")
+ print(f"Found {len(captions['figures'])} figure captions")
+ print(f"Found {len(captions['tables'])} table captions")
+
+ doc.close()
+ ```
+
+ ### 2B: Extract Context Snippets
+
+ For each figure/table, extract surrounding manuscript context:
+
+ ```python
+ def extract_context_for_element(text, element_type, element_num, window=500):
+     """Extract text context surrounding references to a figure/table."""
+     contexts = []
+
+     if element_type == "figure":
+         patterns = [
+             rf'(?:Figure|Fig\.?)\s*{element_num}\b',
+             rf'(?:figure|fig\.?)\s*{element_num}\b'
+         ]
+     else:
+         patterns = [
+             rf'(?:Table)\s*{element_num}\b',
+             rf'(?:table)\s*{element_num}\b'
+         ]
+
+     for pattern in patterns:
+         for match in re.finditer(pattern, text):
+             start = max(0, match.start() - window)
+             end = min(len(text), match.end() + window)
+             context = text[start:end].strip()
+             # Clean up
+             context = ' '.join(context.split())
+             if context not in contexts:
+                 contexts.append(context)
+
+     return contexts
+
+ # Extract contexts for each figure
+ figure_contexts = {}
+ for fig_num in captions["figures"]:
+     contexts = extract_context_for_element(all_text, "figure", fig_num)
+     figure_contexts[fig_num] = {
+         "caption": captions["figures"][fig_num],
+         "contexts": contexts
+     }
+
+ # Extract contexts for each table
+ table_contexts = {}
+ for table_num in captions["tables"]:
+     contexts = extract_context_for_element(all_text, "table", table_num)
+     table_contexts[table_num] = {
+         "caption": captions["tables"][table_num],
+         "contexts": contexts
+     }
+
+ # Save context data
+ with open(output_dir / "manuscript_context.json", "w") as f:
+     json.dump({
+         "figures": figure_contexts,
+         "tables": table_contexts
+     }, f, indent=2)
+ ```
+
+ ### 2C: Extract Images
 
  **For PDF files** - Use PyMuPDF:
 
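The `fig_pattern` regex in Step 2A packs in several behaviors (number-plus-letter suffixes, `.` or `:` separators, captions that stop at a line beginning with a capitalized word); a quick check on a toy snippet shows what it captures:

```python
import re

# Same pattern as in the extract_captions function above
fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'

sample = "Figure 1. Cumulative incidence of dementia.\nFig. 2: Study flow diagram."
found = {m.group(1): ' '.join(m.group(2).split())
         for m in re.finditer(fig_pattern, sample, re.IGNORECASE)}
print(found)
# {'1': 'Cumulative incidence of dementia.', '2': 'Study flow diagram.'}
```

One quirk worth knowing: under `re.IGNORECASE` the `[A-Z][a-z]` inside the negative lookahead matches any two letters regardless of case, so in practice a caption never continues onto a following line that starts with two letters.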
@@ -166,7 +291,7 @@ After classifications complete, read each classification file and dispatch to th
  | `multi_panel` | `structurecc-extract-multipanel.md` |
  | Everything else | `structurecc-extract-generic.md` |
 
- For EACH element, spawn the appropriate extractor:
+ For EACH element, spawn the appropriate extractor WITH manuscript context:
 
  ```
  Task(
@@ -185,12 +310,31 @@ Read the agent instructions from:
  **Source:** Page <N> of <document_name>
  **Output:** Write JSON to <output_dir>/extractions/<element_id>.json
 
+ ## MANUSCRIPT CONTEXT (Use this to understand the figure)
+
+ **Figure Caption:**
+ <caption_from_captions.json>
+
+ **Relevant Manuscript Text:**
+ <context_snippets_from_manuscript_context.json>
+
+ ---
+
  Follow the extractor instructions EXACTLY. Output ONLY valid JSON.
- Remember: VERBATIM extraction only. Copy text exactly as shown.
+
+ CRITICAL REQUIREMENTS:
+ 1. VERBATIM extraction - Copy ALL text exactly as shown in the image
+ 2. Use manuscript context to understand what the figure shows
+ 3. Include the figure caption in your extraction
+ 4. For charts: capture EXACT legend text, axis labels, tick values
+ 5. For Kaplan-Meier/survival curves: note step-function nature, describe curve progression
+ 6. Describe colors precisely (e.g., "purple line", "light purple shaded 95% CI", "yellow/orange shaded band")
  """
  )
  ```
 
+ **IMPORTANT:** Read `manuscript_context.json` to get the caption and context for each element.
+
  Launch ALL extractions in ONE message for parallel processing.
 
  ## Step 5: Phase 3 - Verification (Parallel)
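The `<caption_from_captions.json>` and `<context_snippets_from_manuscript_context.json>` placeholders in the Task prompt have to be filled from the files written in Step 2B. A sketch of that interpolation (the helper name and formatting are assumptions; the command does this inline when building each prompt):

```python
import json
from pathlib import Path

def build_context_block(output_dir, fig_num, max_snippets=2):
    """Render the MANUSCRIPT CONTEXT section of an extractor prompt
    from manuscript_context.json (illustrative helper, not in the package)."""
    path = Path(output_dir) / "manuscript_context.json"
    data = json.loads(path.read_text())
    entry = data.get("figures", {}).get(str(fig_num), {})
    caption = entry.get("caption", "(no caption found)")
    snippets = "\n\n".join(entry.get("contexts", [])[:max_snippets])
    return (f"**Figure Caption:**\n{caption}\n\n"
            f"**Relevant Manuscript Text:**\n{snippets}")
```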
@@ -317,48 +461,73 @@ for extract_file in extractions_dir.glob("*.json"):
  **Markdown conversion function:**
 
  ```python
- def json_to_markdown(extraction: dict) -> str:
-     """Convert JSON extraction to clean markdown."""
+ import json
+ from pathlib import Path
+
+ def json_to_markdown(extraction: dict, context: dict = None) -> str:
+     """Convert JSON extraction to clean markdown with manuscript context."""
 
      ext_type = extraction.get("extraction_type")
 
      if ext_type == "table":
-         return table_to_markdown(extraction)
+         return table_to_markdown(extraction, context)
      elif ext_type == "chart":
-         return chart_to_markdown(extraction)
+         return chart_to_markdown(extraction, context)
      elif ext_type == "heatmap":
-         return heatmap_to_markdown(extraction)
+         return heatmap_to_markdown(extraction, context)
      elif ext_type == "diagram":
-         return diagram_to_markdown(extraction)
+         return diagram_to_markdown(extraction, context)
      elif ext_type == "multi_panel":
-         return multipanel_to_markdown(extraction)
+         return multipanel_to_markdown(extraction, context)
      else:
-         return generic_to_markdown(extraction)
+         return generic_to_markdown(extraction, context)
+
+ # Load manuscript context for element processing
+ def get_context_for_element(output_dir: Path, element_num: int, element_type: str = "figure"):
+     """Get manuscript context for a specific element."""
+     context_file = output_dir / "manuscript_context.json"
+     if not context_file.exists():
+         return None
 
+     with open(context_file) as f:
+         manuscript_context = json.load(f)
 
- def table_to_markdown(ext: dict) -> str:
+     key = "figures" if element_type == "figure" else "tables"
+     return manuscript_context.get(key, {}).get(str(element_num))
+
+
+ def table_to_markdown(ext: dict, context: dict = None) -> str:
      md = []
      meta = ext.get("table_metadata", {})
 
-     md.append(f"# {meta.get('title', 'Table')}")
-     md.append(f"\n**Type:** Table")
+     md.append(f"## {meta.get('title', 'Table')}")
+     md.append(f"\n**Type:** table ({meta.get('table_type', 'standard')})")
      md.append(f"**Source:** Page {meta.get('source_page', '?')}")
 
+     # Add manuscript context if available
+     if context:
+         if context.get("caption"):
+             md.append(f"\n> **Table Caption (from manuscript):** {context['caption']}")
+         if context.get("contexts"):
+             md.append("\n### Manuscript Context\n")
+             for ctx in context["contexts"][:2]:
+                 md.append(f"> {ctx[:400]}...")
+
      if meta.get("caption"):
-         md.append(f"\n> {meta['caption']}")
+         md.append(f"\n> **Caption (from image):** {meta['caption']}")
 
-     md.append("\n## Data\n")
+     md.append("\n### Data\n")
      md.append(ext.get("markdown_table", ""))
 
      if meta.get("footnotes"):
-         md.append("\n## Footnotes\n")
+         md.append("\n### Footnotes\n")
          for fn in meta["footnotes"]:
              md.append(f"- {fn}")
 
      return "\n".join(md)
 
 
- def chart_to_markdown(ext: dict) -> str:
+ def chart_to_markdown(ext: dict, context: dict = None) -> str:
      md = []
      meta = ext.get("chart_metadata", {})
 
@@ -366,31 +535,61 @@ def chart_to_markdown(ext: dict) -> str:
      md.append(f"\n**Type:** {ext.get('chart_type', 'Chart')}")
      md.append(f"**Source:** Page {meta.get('source_page', '?')}")
 
+     # Add manuscript context if available
+     if context:
+         if context.get("caption"):
+             md.append(f"\n> **Caption:** {context['caption']}")
+         if context.get("contexts"):
+             md.append("\n### Manuscript Context\n")
+             for ctx in context["contexts"][:2]:  # Limit to 2 most relevant
+                 md.append(f"> {ctx[:500]}...")  # Truncate long contexts
+
      axes = ext.get("axes", {})
      md.append("\n## Axes\n")
      if axes.get("x"):
          md.append(f"- **X-axis:** {axes['x'].get('label', 'unlabeled')}")
          md.append(f"  - Range: {axes['x'].get('min')} to {axes['x'].get('max')}")
+         if axes['x'].get('ticks'):
+             md.append(f"  - Ticks: {axes['x']['ticks']}")
      if axes.get("y"):
          md.append(f"- **Y-axis:** {axes['y'].get('label', 'unlabeled')}")
          md.append(f"  - Range: {axes['y'].get('min')} to {axes['y'].get('max')}")
+         if axes['y'].get('ticks'):
+             md.append(f"  - Ticks: {axes['y']['ticks']}")
 
      legend = ext.get("legend", {})
      if legend.get("entries"):
-         md.append("\n## Legend\n")
+         md.append("\n## Legend (Verbatim)\n")
          for entry in legend["entries"]:
              style = entry.get("line_style") or entry.get("style", "")
-             md.append(f"- **{entry['label']}**: {entry.get('color', '')} {style}")
+             color = entry.get("color", "")
+             md.append(f"- **\"{entry['label']}\"**: {color} {style}")
+
+     # Data series details (for Kaplan-Meier etc.)
+     series = ext.get("data_series", [])
+     if series:
+         md.append("\n## Data Series\n")
+         for s in series:
+             md.append(f"### {s.get('name', 'Series')}")
+             if s.get("data_points"):
+                 md.append("Key data points:")
+                 for pt in s["data_points"][:10]:  # First 10 points
+                     md.append(f"  - x={pt.get('x')}, y={pt.get('y')}")
 
      stats = ext.get("statistical_annotations", [])
      if stats:
          md.append("\n## Statistical Annotations\n")
          for stat in stats:
-             md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
+             if stat.get("type") == "hazard_ratio":
+                 md.append(f"- Hazard Ratio: {stat.get('value')} (95% CI: {stat.get('ci_lower')}-{stat.get('ci_upper')})")
+             elif stat.get("type") == "p_value":
+                 md.append(f"- {stat.get('test', 'P-value')}: {stat.get('value')}")
+             else:
+                 md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
 
      risk = ext.get("risk_table", {})
      if risk.get("present"):
-         md.append("\n## Risk Table\n")
+         md.append("\n## Risk Table (Number at Risk)\n")
          headers = risk.get("headers", [])
          md.append("| " + " | ".join(headers) + " |")
          md.append("| " + " | ".join(["---"] * len(headers)) + " |")
399
598
  md.append("| " + " | ".join(values) + " |")
400
599
 
401
600
  return "\n".join(md)
601
+
602
+
603
+ def multipanel_to_markdown(ext: dict, context: dict = None) -> str:
604
+ """Convert multi-panel figure extraction to detailed markdown."""
605
+ md = []
606
+ meta = ext.get("figure_metadata", {})
607
+
608
+ md.append(f"## {meta.get('title', 'Multi-Panel Figure')}")
609
+ md.append(f"\n**Type:** multi_panel ({meta.get('total_panels', '?')} panels)")
610
+ md.append(f"**Source:** Page {meta.get('source_page', '?')}")
611
+ md.append(f"**Layout:** {meta.get('layout', 'unknown')}")
612
+
613
+ # Add manuscript context if available
614
+ if context:
615
+ if context.get("caption"):
616
+ md.append(f"\n> **Figure Caption (from manuscript):** {context['caption']}")
617
+ if context.get("contexts"):
618
+ md.append("\n### Manuscript Context\n")
619
+ md.append("*How this figure is described in the paper:*\n")
620
+ for ctx in context["contexts"][:2]:
621
+ md.append(f"> ...{ctx[:500]}...\n")
622
+
623
+ # Process each panel in detail
624
+ panels = ext.get("panels", [])
625
+ for panel in panels:
626
+ panel_id = panel.get("panel_id", "?")
627
+ panel_type = panel.get("panel_type", "unknown")
628
+ panel_title = panel.get("panel_title", f"Panel {panel_id}")
629
+
630
+ md.append(f"\n### Panel {panel_id}: {panel_title}")
631
+ md.append(f"**Type:** {panel_type}")
632
+
633
+ extraction = panel.get("extraction", {})
634
+
635
+ # Axes
636
+ axes = extraction.get("axes", {})
637
+ if axes:
638
+ md.append("\n**Axes:**")
639
+ if axes.get("x"):
640
+ x = axes["x"]
641
+ md.append(f"- X-axis: \"{x.get('label', 'unlabeled')}\"")
642
+ md.append(f" - Range: {x.get('min')} to {x.get('max')}")
643
+ if x.get("ticks"):
644
+ md.append(f" - Tick values: {x['ticks']}")
645
+ if axes.get("y"):
646
+ y = axes["y"]
647
+ md.append(f"- Y-axis: \"{y.get('label', 'unlabeled')}\"")
648
+ md.append(f" - Range: {y.get('min')} to {y.get('max')}")
649
+ if y.get("ticks"):
650
+ md.append(f" - Tick values: {y['ticks']}")
651
+
652
+ # Legend (VERBATIM)
653
+ legend = extraction.get("legend", {})
654
+ if legend.get("entries"):
655
+ md.append("\n**Legend (Verbatim from image):**")
656
+ if legend.get("title"):
657
+ md.append(f"*{legend['title']}*")
658
+ for entry in legend["entries"]:
659
+ label = entry.get("label", "")
660
+ color = entry.get("color", "")
661
+ style = entry.get("line_style") or entry.get("style", "")
662
+ md.append(f"- \"{label}\" — {color} {style}")
663
+
664
+ # Curve endpoints (for Kaplan-Meier)
665
+ endpoints = extraction.get("curve_endpoints", [])
666
+ if endpoints:
667
+ md.append("\n**Curve Endpoints:**")
668
+ for ep in endpoints:
669
+ md.append(f"- {ep.get('series', 'Series')}: y={ep.get('final_y')} at x={ep.get('final_x')}")
670
+ if ep.get("note"):
671
+ md.append(f" - Note: {ep['note']}")
672
+
673
+ # Key observations
674
+ observations = extraction.get("key_observations", [])
675
+ if observations:
676
+ md.append("\n**Key Observations:**")
677
+ for obs in observations:
678
+ md.append(f"- {obs}")
679
+
680
+ # Statistical annotations
681
+ stats = extraction.get("statistical_annotations", [])
682
+ if stats:
683
+ md.append("\n**Statistical Annotations:**")
684
+ for stat in stats:
685
+ md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
686
+
687
+ # All visible text
688
+ all_text = extraction.get("all_visible_text", [])
689
+ if all_text:
690
+ md.append("\n**All Visible Text:**")
691
+ md.append(f"```\n{', '.join(all_text)}\n```")
692
+
693
+ # Shared elements
694
+ shared = ext.get("shared_elements", {})
695
+ if shared.get("shared_legend") or shared.get("cross_references"):
696
+ md.append("\n### Shared Elements")
697
+ if shared.get("shared_legend"):
698
+ md.append(f"- Shared legend applies to panels: {shared['shared_legend'].get('applies_to', [])}")
699
+ if shared.get("cross_references"):
700
+ for ref in shared["cross_references"]:
701
+ md.append(f"- {ref}")
702
+
703
+ return "\n".join(md)
402
704
  ```
403
705
 
404
- ## Step 8: Generate Combined STRUCTURED.md
706
+ ## Step 8: Generate Combined STRUCTURED.md with Manuscript Context
405
707
 
406
708
  ```python
709
+ import json
407
710
  from pathlib import Path
408
711
  from datetime import datetime
409
712
 
410
713
  output_dir = Path("<output_dir>")
411
714
  elements_dir = output_dir / "elements"
715
+ extractions_dir = output_dir / "extractions"
412
716
  doc_name = "<document_name>"
413
717
 
718
+ # Load manuscript context
719
+ context_file = output_dir / "manuscript_context.json"
720
+ manuscript_context = {}
721
+ if context_file.exists():
722
+ with open(context_file) as f:
723
+ manuscript_context = json.load(f)
724
+
725
+ # Load captions
726
+ captions_file = output_dir / "captions.json"
727
+ captions = {"figures": {}, "tables": {}}
728
+ if captions_file.exists():
729
+ with open(captions_file) as f:
730
+ captions = json.load(f)
731
+
414
732
  # Read all element files in order
415
733
  element_files = sorted(elements_dir.glob("element_*.md"))
416
734
 
@@ -419,21 +737,70 @@ sections.append(f"# {doc_name} - Structured Extraction")
419
737
  sections.append(f"\n**Original:** {doc_name}")
420
738
  sections.append(f"**Extracted:** {datetime.now().isoformat()}")
421
739
  sections.append(f"**Elements:** {len(element_files)} visual elements processed")
422
- sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification)")
740
+ sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification + manuscript context)")
423
741
  sections.append("\n---\n")
424
742
 
425
- # Add each element
743
+ # Table of contents
744
+ sections.append("## Table of Contents\n")
426
745
  for i, elem_file in enumerate(element_files, 1):
746
+ elem_id = elem_file.stem
747
+ # Try to get title from extraction
748
+ extract_file = extractions_dir / f"{elem_id}.json"
749
+ title = f"Element {i}"
750
+ if extract_file.exists():
751
+ with open(extract_file) as f:
752
+ ext = json.load(f)
753
+ title = ext.get("chart_metadata", {}).get("title") or \
754
+ ext.get("table_metadata", {}).get("title") or \
755
+ ext.get("figure_metadata", {}).get("title") or \
756
+ f"Element {i}"
757
+ sections.append(f"{i}. [{title}](#{elem_id})")
758
+ sections.append("\n---\n")
759
+
760
+ # Add each element with context
761
+ for i, elem_file in enumerate(element_files, 1):
762
+ elem_id = elem_file.stem
763
+
427
764
  with open(elem_file) as f:
428
765
  content = f.read()
429
766
 
430
- sections.append(f"## Element {i}")
767
+ sections.append(f'<a id="{elem_id}"></a>\n')
768
+ sections.append(f"## Element {i}: {elem_id}\n")
769
+
770
+ # Try to match with manuscript context
771
+ # Heuristic: Figure 1 = first figure element, etc.
772
+ fig_num = str(i)
773
+ if fig_num in manuscript_context.get("figures", {}):
774
+ ctx = manuscript_context["figures"][fig_num]
775
+ if ctx.get("caption"):
776
+ sections.append(f"\n> **Figure Caption:** {ctx['caption']}\n")
777
+ if ctx.get("contexts"):
778
+ sections.append("\n### Manuscript Context\n")
779
+ sections.append("*Relevant text from the manuscript:*\n")
780
+ for c in ctx["contexts"][:2]:
781
+ # Truncate and clean
782
+ clean_ctx = ' '.join(c.split())[:600]
783
+ sections.append(f"> ...{clean_ctx}...\n")
784
+
431
785
  sections.append(content)
432
786
  sections.append("\n---\n")
433
787
 
788
+ # Add manuscript summary section
789
+ if manuscript_context.get("figures") or manuscript_context.get("tables"):
790
+ sections.append("## Manuscript References Summary\n")
791
+ sections.append("### Figure Captions from Manuscript\n")
792
+ for fig_num, ctx in manuscript_context.get("figures", {}).items():
793
+ sections.append(f"- **Figure {fig_num}:** {ctx.get('caption', 'No caption found')}")
794
+ sections.append("\n### Table Captions from Manuscript\n")
795
+ for table_num, ctx in manuscript_context.get("tables", {}).items():
796
+ sections.append(f"- **Table {table_num}:** {ctx.get('caption', 'No caption found')}")
797
+ sections.append("\n---\n")
798
+
434
799
  # Write combined file
435
800
  with open(output_dir / "STRUCTURED.md", "w") as f:
436
801
  f.write("\n".join(sections))
802
+
803
+ print(f"Generated STRUCTURED.md with {len(element_files)} elements and manuscript context")
437
804
  ```
438
805
 
439
806
  ## Step 9: Generate Quality Report
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "structurecc",
-   "version": "2.0.3",
+   "version": "2.1.0",
    "description": "Extract structured data from PDFs, Word docs, and images using Claude Code.",
    "keywords": [
      "document-extraction",