structurecc 2.0.3 → 2.1.0

@@ -168,13 +168,65 @@ Return ONLY this JSON structure:
 
  ## Chart Type Specifications
 
- ### Kaplan-Meier / Survival Curves
+ ### Kaplan-Meier / Survival / Cumulative Incidence Curves
+
+ **CRITICAL: These are STEP FUNCTIONS, not smooth lines.**
+
  Required fields:
- - Step function data points
- - Risk table (if present)
- - Censoring marks (if visible)
- - Confidence interval bands (colors and styles)
- - Log-rank p-value
+ - **Curve style**: Must note `"curve_style": "step_function"`
+ - **Step function data points** - Capture visible steps/jumps
+ - **Risk table** (if present below chart)
+ - **Censoring marks** (vertical ticks if visible)
+ - **Confidence interval bands** - Note BOTH colors (often different per group)
+ - **Log-rank p-value** or other statistical annotation
+ - **Endpoint values** - Where each curve ends (y-value at max x)
+
+ **LEGEND EXTRACTION - VERBATIM:**
+ Read the legend box word-for-word. Common patterns:
+ - "HSV: Dementia Risk" (not "HSV group")
+ - "Control: Dementia Risk" (not "Control group")
+ - "HSV: Dementia Risk 95% CI" (for confidence bands)
+ - "Control: Dementia Risk 95% CI"
+
+ **COLOR PRECISION:**
+ - Main lines: Often purple vs dark blue
+ - CI bands: Often light purple vs yellow/orange (DIFFERENT colors!)
+ - Be specific: "light purple shaded band", "yellow/orange shaded band"
+
+ **Example for cumulative incidence:**
+ ```json
+ {
+   "chart_type": "kaplan_meier",
+   "curve_style": "step_function",
+   "chart_metadata": {
+     "title": "Cumulative Incidence of Dementia",
+     "source_page": 7
+   },
+   "axes": {
+     "x": {"label": "Time (Days) Since HSV Diagnosis", "min": 0, "max": 7000},
+     "y": {"label": "Cumulative Risk of Dementia", "min": 0.0, "max": 0.6}
+   },
+   "legend": {
+     "position": "right",
+     "title": "Legend",
+     "entries": [
+       {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid step"},
+       {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid step"},
+       {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+       {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+     ]
+   },
+   "curve_endpoints": [
+     {"series": "HSV: Dementia Risk", "final_x": 7000, "final_y": 0.32},
+     {"series": "Control: Dementia Risk", "final_x": 7000, "final_y": 0.05}
+   ],
+   "key_observations": [
+     "Step-function curves with visible jumps at event times",
+     "Curves diverge early and separation increases",
+     "CI bands widen substantially after day 5000"
+   ]
+ }
+ ```
 
  ### Bar Charts
  ```json
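The step-function requirement added above can double as a sanity check on extracted data points. A minimal sketch (the `{"x": ..., "y": ...}` point shape is an assumption for illustration, not part of the released schema): cumulative incidence must be non-decreasing in y with strictly increasing x.

```python
def plausible_step_series(points):
    """Return True if (x, y) points could come from a cumulative
    incidence step function: x strictly increasing, y never decreasing."""
    xs = [p["x"] for p in points]
    ys = [p["y"] for p in points]
    return all(a < b for a, b in zip(xs, xs[1:])) and \
           all(a <= b for a, b in zip(ys, ys[1:]))

# A plausible extraction for the HSV curve in the example above
hsv = [{"x": 0, "y": 0.0}, {"x": 500, "y": 0.04},
       {"x": 5000, "y": 0.18}, {"x": 7000, "y": 0.32}]
print(plausible_step_series(hsv))   # True

# A series that dips (impossible for cumulative incidence) fails
bad = [{"x": 0, "y": 0.10}, {"x": 100, "y": 0.05}]
print(plausible_step_series(bad))   # False
```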
@@ -5,7 +5,7 @@ description: Phase 2 - Verbatim multi-panel figure extraction (A, B, C, D panels
 
  # Multi-Panel Figure Extractor
 
- You extract multi-panel figures by processing EACH PANEL SEPARATELY with full verbatim accuracy.
+ You extract multi-panel figures by processing EACH PANEL SEPARATELY with ABSOLUTE verbatim accuracy.
 
  ## VERBATIM EXTRACTION RULES
 
@@ -16,6 +16,82 @@ You extract multi-panel figures by processing EACH PANEL SEPARATELY with full ve
  3. **Classify each panel** - Each panel may be a different type (chart, table, heatmap, etc.)
  4. **Preserve panel relationships** - Note when panels share legends, axes, or data
 
+ ## LEGEND EXTRACTION - VERBATIM REQUIRED
+
+ **CRITICAL FOR KAPLAN-MEIER/SURVIVAL CURVES:**
+
+ Read the legend box carefully and extract EVERY entry EXACTLY as written:
+
+ Example - If you see:
+ ```
+ Legend
+ — HSV: Dementia Risk
+ — Control: Dementia Risk
+ ▒ HSV: Dementia Risk 95% CI
+ ▒ Control: Dementia Risk 95% CI
+ ```
+
+ You MUST output:
+ ```json
+ "legend": {
+   "entries": [
+     {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid"},
+     {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid"},
+     {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+     {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+   ]
+ }
+ ```
+
+ Do NOT summarize as "HSV group" or "Control group" - use the EXACT text from the legend.
+
+ ## KAPLAN-MEIER / CUMULATIVE INCIDENCE CURVES
+
+ For survival/incidence curves, capture:
+
+ 1. **Curve Shape**: Note that these are STEP FUNCTIONS, not smooth lines
+ 2. **Key Inflection Points**: Where curves separate, where steps occur
+ 3. **Endpoint Values**: The y-value where each curve ends (be precise)
+ 4. **Confidence Intervals**: Shaded bands - note BOTH colors (often different for each group)
+ 5. **Widening CI**: Note if confidence intervals widen over time
+
+ Example output for a Kaplan-Meier panel:
+ ```json
+ {
+   "panel_id": "A",
+   "panel_type": "chart_kaplan_meier",
+   "panel_title": "Overall HSV vs Control",
+   "extraction": {
+     "chart_type": "kaplan_meier",
+     "curve_style": "step_function",
+     "axes": {
+       "x": {"label": "Time (Days) Since HSV Diagnosis", "min": 0, "max": 7000, "ticks": [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000]},
+       "y": {"label": "Cumulative Risk of Dementia", "min": 0.0, "max": 0.6, "ticks": [0, 0.2, 0.4, 0.6]}
+     },
+     "legend": {
+       "position": "right",
+       "title": "Legend",
+       "entries": [
+         {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid step"},
+         {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid step"},
+         {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+         {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+       ]
+     },
+     "curve_endpoints": [
+       {"series": "HSV", "final_x": 7000, "final_y": 0.32, "note": "steep step around day 6500"},
+       {"series": "Control", "final_x": 7000, "final_y": 0.05, "note": "relatively flat"}
+     ],
+     "key_observations": [
+       "Curves diverge early (~day 500) and separation increases over time",
+       "HSV group shows multiple step increases, particularly steep after day 5000",
+       "Control group remains relatively flat throughout",
+       "95% CI bands widen substantially after day 5000, especially for HSV group"
+     ]
+   }
+ }
+ ```
+
  ## Output Schema
 
  Return ONLY this JSON structure:
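The "do NOT summarize" rule these hunks add is mechanical enough to lint after the fact. A rough sketch (the helper and its word list are hypothetical, not part of the package): flag legend labels that read like generic group paraphrases rather than verbatim legend text.

```python
GENERIC_HINTS = ("group", "arm", "cohort")

def label_looks_summarized(label):
    """Heuristic lint: a legend label like 'HSV group' suggests the
    extractor paraphrased instead of copying the legend verbatim."""
    return any(hint in label.lower().split() for hint in GENERIC_HINTS)

print(label_looks_summarized("HSV group"))           # True
print(label_looks_summarized("HSV: Dementia Risk"))  # False
```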
package/bin/install.js CHANGED
@@ -4,7 +4,7 @@ const fs = require('fs');
  const path = require('path');
  const os = require('os');
 
- const VERSION = '2.0.0';
+ const VERSION = '2.1.0';
  const PACKAGE_NAME = 'structurecc';
 
  // Agent files in v2.0
@@ -39,23 +39,18 @@ function log(msg, color = '') {
  function banner() {
    const c = colors;
    console.log('');
-   console.log(c.cyan + ' ┌─────────────────────────────────────────────────────┐' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' ' + c.bright + 'S T R U C T U R E v2.0' + c.reset + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' ' + c.reset + 'Agentic Document Structuring' + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' ' + c.dim + 'Verbatim extraction. Quality verified.' + c.reset + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' ├─────────────────────────────────────────────────────┤' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + '' + c.yellow + 'PDF' + c.reset + ' ──▶ ' + c.magenta + '[Classify]' + c.reset + ' ──▶ ' + c.green + '[Extract]' + c.reset + ' ──▶ ' + c.cyan + '[Verify]' + c.reset + ' ' + c.cyan + '│' + c.reset);
-   console.log(c.cyan + ' │ ↑_______↻_______↓ │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' │ ' + c.white + '3-phase pipeline with quality scoring' + c.reset + c.cyan + ' │' + c.reset);
-   console.log(c.cyan + ' │ │' + c.reset);
-   console.log(c.cyan + ' └─────────────────────────────────────────────────────┘' + c.reset);
+   console.log(c.cyan + ' ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗ ███████╗' + c.reset);
+   console.log(c.cyan + ' ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗██╔════╝' + c.reset);
+   console.log(c.cyan + ' ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝█████╗ ' + c.reset);
+   console.log(c.cyan + ' ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗██╔══╝ ' + c.reset);
+   console.log(c.cyan + ' ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║███████╗' + c.reset);
+   console.log(c.cyan + ' ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝' + c.reset);
+   console.log('');
+   console.log(c.dim + ' Agentic Document Structuring' + c.reset + ' ' + c.yellow + 'v' + VERSION + c.reset);
+   console.log(c.dim + ' Verbatim extraction. Quality verified.' + c.reset);
+   console.log('');
+   console.log(c.dim + ' ─────────────────────────────────────────────────────────────────────────────' + c.reset);
    console.log('');
-   console.log(c.bright + 'structurecc' + c.reset + ' v' + VERSION);
  }
 
  function getClaudeDir() {
@@ -93,13 +88,13 @@ function install() {
    const srcCommandsDir = path.join(packageDir, 'commands', 'structure');
    const srcAgentsDir = path.join(packageDir, 'agents');
 
-   log('Installing structurecc v2.0...', colors.yellow);
+   log('Installing...', colors.dim);
    log('');
 
    // Install command
    if (fs.existsSync(srcCommandsDir)) {
      copyDir(srcCommandsDir, commandsDir);
-     log(' ✓ Installed /structure command (3-phase pipeline)', colors.green);
+     log(' ✓ /structure command', colors.green);
    }
 
    // Install agents
@@ -115,11 +110,8 @@ function install() {
      if (fs.existsSync(srcPath)) {
        fs.copyFileSync(srcPath, destPath);
        const agentName = file.replace('.md', '');
-       log(` ✓ Installed ${agentName}`, colors.green);
+       log(` ✓ ${agentName}`, colors.green);
        installed++;
-     } else {
-       log(` ⚠ Missing ${file}`, colors.yellow);
-       skipped++;
      }
    }
 
@@ -127,29 +119,11 @@ function install() {
    const oldExtractor = path.join(agentsDir, 'structurecc-extractor.md');
    if (fs.existsSync(oldExtractor)) {
      fs.unlinkSync(oldExtractor);
-     log(' ✓ Removed legacy structurecc-extractor', colors.dim);
-   }
-
-   log('');
-   log(` Agents installed: ${installed}`, colors.dim);
-   if (skipped > 0) {
-     log(` Agents skipped: ${skipped}`, colors.yellow);
    }
  }
 
  log('');
- log(`${colors.green}Done!${colors.reset}`);
- log('');
- log(`${colors.bright}What's new in v2.0:${colors.reset}`);
- log(` • 3-phase pipeline: Classify → Extract → Verify`, colors.dim);
- log(` • 7 specialized extractors (tables, charts, heatmaps, etc.)`, colors.dim);
- log(` • Verbatim extraction with quality scoring`, colors.dim);
- log(` • Auto-revision loop for failed extractions`, colors.dim);
- log('');
- log(`Run in Claude Code:`, colors.bright);
- log(` /structure path/to/document.pdf`, colors.cyan);
- log('');
- log(`${colors.dim}Supports: PDF, DOCX, PNG, JPG, TIFF${colors.reset}`);
+ log(`${colors.green}Done!${colors.reset} Run ${colors.cyan}/structure <path>${colors.reset} to extract.`);
  log('');
  }
 
@@ -187,9 +161,6 @@ function uninstall() {
  function showVersion() {
    log(`structurecc v${VERSION}`, colors.bright);
    log('');
-   log('Pipeline: 3-phase with verification', colors.dim);
-   log('Agents: 8 (classifier + 6 extractors + verifier)', colors.dim);
-   log('');
  }
 
  // Main
@@ -204,27 +175,11 @@ if (args.includes('--uninstall') || args.includes('-u')) {
  } else if (args.includes('--help') || args.includes('-h')) {
    log('Usage: npx structurecc [options]', colors.bright);
    log('');
-   log('Options:', colors.bright);
-   log('   --help, -h       Show this help', colors.dim);
-   log('   --version, -v    Show version info', colors.dim);
-   log('   --uninstall, -u  Remove from Claude Code', colors.dim);
+   log('Options:', colors.dim);
+   log('   --uninstall, -u  Remove from Claude Code');
    log('');
-   log('After install, use in Claude Code:', colors.bright);
+   log('After install, run in Claude Code:', colors.bright);
    log('   /structure path/to/document.pdf', colors.cyan);
-   log('   /structure path/to/document.docx', colors.cyan);
-   log('');
-   log('Pipeline:', colors.bright);
-   log('   Phase 1: Classification (haiku - fast triage)', colors.dim);
-   log('   Phase 2: Specialized extraction (opus - quality)', colors.dim);
-   log('   Phase 3: Verification (sonnet - balance)', colors.dim);
-   log('');
-   log('Extractors:', colors.bright);
-   log('   • structurecc-extract-table      - Tables with cell-by-cell accuracy', colors.dim);
-   log('   • structurecc-extract-chart      - Charts with axes, legends, data', colors.dim);
-   log('   • structurecc-extract-heatmap    - Heatmaps with color scales', colors.dim);
-   log('   • structurecc-extract-diagram    - Flowcharts, timelines, networks', colors.dim);
-   log('   • structurecc-extract-multipanel - Multi-panel figures (A,B,C,D)', colors.dim);
-   log('   • structurecc-extract-generic    - Fallback for other visuals', colors.dim);
    log('');
  } else {
    install();
@@ -52,7 +52,132 @@ for subdir in ["images", "classifications", "extractions", "verifications", "ele
  print(f"Output directory: {output_dir}")
  ```
 
- ## Step 2: Extract Images
+ ## Step 2: Extract Text and Images
+
+ **CRITICAL:** Extract BOTH the manuscript text AND images. Figures need context from the paper.
+
+ ### 2A: Extract Manuscript Text (PDF)
+
+ ```python
+ import fitz
+ import re
+ import json
+ from pathlib import Path
+
+ pdf_path = "<document_path>"
+ output_dir = Path("<output_dir>")
+
+ doc = fitz.open(pdf_path)
+ full_text = []
+ page_texts = {}
+
+ for page_num in range(len(doc)):
+     page = doc[page_num]
+     text = page.get_text("text")
+     full_text.append(text)
+     page_texts[page_num + 1] = text
+
+ # Save full manuscript text
+ with open(output_dir / "manuscript_text.txt", "w") as f:
+     f.write("\n\n---PAGE BREAK---\n\n".join(full_text))
+
+ # Parse figure and table captions
+ def extract_captions(text):
+     """Extract Figure and Table captions from manuscript text."""
+     captions = {"figures": {}, "tables": {}}
+
+     # Figure patterns: "Figure 1.", "Figure 1:", "Fig. 1.", "FIGURE 1"
+     fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
+     for match in re.finditer(fig_pattern, text, re.IGNORECASE):
+         fig_num = match.group(1)
+         caption = match.group(2).strip()
+         # Clean up caption (remove extra whitespace)
+         caption = ' '.join(caption.split())
+         captions["figures"][fig_num] = caption
+
+     # Table patterns: "Table 1.", "Table 1:", "TABLE 1"
+     table_pattern = r'(?:Table|TABLE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
+     for match in re.finditer(table_pattern, text, re.IGNORECASE):
+         table_num = match.group(1)
+         caption = match.group(2).strip()
+         caption = ' '.join(caption.split())
+         captions["tables"][table_num] = caption
+
+     return captions
+
+ all_text = "\n".join(full_text)
+ captions = extract_captions(all_text)
+
+ # Save captions
+ with open(output_dir / "captions.json", "w") as f:
+     json.dump(captions, f, indent=2)
+
+ print(f"Extracted text from {len(doc)} pages")
+ print(f"Found {len(captions['figures'])} figure captions")
+ print(f"Found {len(captions['tables'])} table captions")
+
+ doc.close()
+ ```
+
+ ### 2B: Extract Context Snippets
+
+ For each figure/table, extract surrounding manuscript context:
+
+ ```python
+ def extract_context_for_element(text, element_type, element_num, window=500):
+     """Extract text context surrounding references to a figure/table."""
+     contexts = []
+
+     if element_type == "figure":
+         patterns = [
+             rf'(?:Figure|Fig\.?)\s*{element_num}\b',
+             rf'(?:figure|fig\.?)\s*{element_num}\b'
+         ]
+     else:
+         patterns = [
+             rf'(?:Table)\s*{element_num}\b',
+             rf'(?:table)\s*{element_num}\b'
+         ]
+
+     for pattern in patterns:
+         for match in re.finditer(pattern, text):
+             start = max(0, match.start() - window)
+             end = min(len(text), match.end() + window)
+             context = text[start:end].strip()
+             # Clean up
+             context = ' '.join(context.split())
+             if context not in contexts:
+                 contexts.append(context)
+
+     return contexts
+
+ # Extract contexts for each figure
+ figure_contexts = {}
+ for fig_num in captions["figures"]:
+     contexts = extract_context_for_element(all_text, "figure", fig_num)
+     figure_contexts[fig_num] = {
+         "caption": captions["figures"][fig_num],
+         "contexts": contexts
+     }
+
+ # Extract contexts for each table
+ table_contexts = {}
+ for table_num in captions["tables"]:
+     contexts = extract_context_for_element(all_text, "table", table_num)
+     table_contexts[table_num] = {
+         "caption": captions["tables"][table_num],
+         "contexts": contexts
+     }
+
+ # Save context data
+ with open(output_dir / "manuscript_context.json", "w") as f:
+     json.dump({
+         "figures": figure_contexts,
+         "tables": table_contexts
+     }, f, indent=2)
+ ```
+
+ ### 2C: Extract Images
 
  **For PDF files** - Use PyMuPDF:
 
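The `fig_pattern` regex in Step 2A packs in several behaviors (number-plus-letter suffixes, `.` or `:` separators, captions that stop at a line beginning with a capitalized word); a quick check on a toy snippet shows what it captures:

```python
import re

# Same pattern as in the extract_captions function above
fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'

sample = "Figure 1. Cumulative incidence of dementia.\nFig. 2: Study flow diagram."
found = {m.group(1): ' '.join(m.group(2).split())
         for m in re.finditer(fig_pattern, sample, re.IGNORECASE)}
print(found)
# {'1': 'Cumulative incidence of dementia.', '2': 'Study flow diagram.'}
```

One quirk worth knowing: under `re.IGNORECASE` the `[A-Z][a-z]` inside the negative lookahead matches any two letters regardless of case, so in practice a caption never continues onto a following line that starts with two letters.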
@@ -166,7 +291,7 @@ After classifications complete, read each classification file and dispatch to th
  | `multi_panel` | `structurecc-extract-multipanel.md` |
  | Everything else | `structurecc-extract-generic.md` |
 
- For EACH element, spawn the appropriate extractor:
+ For EACH element, spawn the appropriate extractor WITH manuscript context:
 
  ```
  Task(
@@ -185,12 +310,31 @@ Read the agent instructions from:
  **Source:** Page <N> of <document_name>
  **Output:** Write JSON to <output_dir>/extractions/<element_id>.json
 
+ ## MANUSCRIPT CONTEXT (Use this to understand the figure)
+
+ **Figure Caption:**
+ <caption_from_captions.json>
+
+ **Relevant Manuscript Text:**
+ <context_snippets_from_manuscript_context.json>
+
+ ---
+
  Follow the extractor instructions EXACTLY. Output ONLY valid JSON.
- Remember: VERBATIM extraction only. Copy text exactly as shown.
+
+ CRITICAL REQUIREMENTS:
+ 1. VERBATIM extraction - Copy ALL text exactly as shown in the image
+ 2. Use manuscript context to understand what the figure shows
+ 3. Include the figure caption in your extraction
+ 4. For charts: capture EXACT legend text, axis labels, tick values
+ 5. For Kaplan-Meier/survival curves: note step-function nature, describe curve progression
+ 6. Describe colors precisely (e.g., "purple line", "light purple shaded 95% CI", "yellow/orange shaded band")
  """
  )
  ```
 
+ **IMPORTANT:** Read `manuscript_context.json` to get the caption and context for each element.
+
  Launch ALL extractions in ONE message for parallel processing.
 
  ## Step 5: Phase 3 - Verification (Parallel)
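The `<caption_from_captions.json>` and `<context_snippets_from_manuscript_context.json>` placeholders in the Task prompt have to be filled from the files written in Step 2B. A sketch of that interpolation (the helper name and formatting are assumptions; the command does this inline when building each prompt):

```python
import json
from pathlib import Path

def build_context_block(output_dir, fig_num, max_snippets=2):
    """Render the MANUSCRIPT CONTEXT section of an extractor prompt
    from manuscript_context.json (illustrative helper, not in the package)."""
    path = Path(output_dir) / "manuscript_context.json"
    data = json.loads(path.read_text())
    entry = data.get("figures", {}).get(str(fig_num), {})
    caption = entry.get("caption", "(no caption found)")
    snippets = "\n\n".join(entry.get("contexts", [])[:max_snippets])
    return (f"**Figure Caption:**\n{caption}\n\n"
            f"**Relevant Manuscript Text:**\n{snippets}")
```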
@@ -317,48 +461,73 @@ for extract_file in extractions_dir.glob("*.json"):
  **Markdown conversion function:**
 
  ```python
- def json_to_markdown(extraction: dict) -> str:
-     """Convert JSON extraction to clean markdown."""
+ import json
+ from pathlib import Path
+
+ def json_to_markdown(extraction: dict, context: dict = None) -> str:
+     """Convert JSON extraction to clean markdown with manuscript context."""
 
      ext_type = extraction.get("extraction_type")
 
      if ext_type == "table":
-         return table_to_markdown(extraction)
+         return table_to_markdown(extraction, context)
      elif ext_type == "chart":
-         return chart_to_markdown(extraction)
+         return chart_to_markdown(extraction, context)
      elif ext_type == "heatmap":
-         return heatmap_to_markdown(extraction)
+         return heatmap_to_markdown(extraction, context)
      elif ext_type == "diagram":
-         return diagram_to_markdown(extraction)
+         return diagram_to_markdown(extraction, context)
      elif ext_type == "multi_panel":
-         return multipanel_to_markdown(extraction)
+         return multipanel_to_markdown(extraction, context)
      else:
-         return generic_to_markdown(extraction)
+         return generic_to_markdown(extraction, context)
+
+ # Load manuscript context for element processing
+ def get_context_for_element(output_dir: Path, element_num: int, element_type: str = "figure"):
+     """Get manuscript context for a specific element."""
+     context_file = output_dir / "manuscript_context.json"
+     if not context_file.exists():
+         return None
 
+     with open(context_file) as f:
+         manuscript_context = json.load(f)
 
- def table_to_markdown(ext: dict) -> str:
+     key = "figures" if element_type == "figure" else "tables"
+     return manuscript_context.get(key, {}).get(str(element_num))
+
+
+ def table_to_markdown(ext: dict, context: dict = None) -> str:
      md = []
      meta = ext.get("table_metadata", {})
 
-     md.append(f"# {meta.get('title', 'Table')}")
-     md.append(f"\n**Type:** Table")
+     md.append(f"## {meta.get('title', 'Table')}")
+     md.append(f"\n**Type:** table ({meta.get('table_type', 'standard')})")
      md.append(f"**Source:** Page {meta.get('source_page', '?')}")
 
+     # Add manuscript context if available
+     if context:
+         if context.get("caption"):
+             md.append(f"\n> **Table Caption (from manuscript):** {context['caption']}")
+         if context.get("contexts"):
+             md.append("\n### Manuscript Context\n")
+             for ctx in context["contexts"][:2]:
+                 md.append(f"> {ctx[:400]}...")
+
      if meta.get("caption"):
-         md.append(f"\n> {meta['caption']}")
+         md.append(f"\n> **Caption (from image):** {meta['caption']}")
 
-     md.append("\n## Data\n")
+     md.append("\n### Data\n")
      md.append(ext.get("markdown_table", ""))
 
      if meta.get("footnotes"):
-         md.append("\n## Footnotes\n")
+         md.append("\n### Footnotes\n")
          for fn in meta["footnotes"]:
              md.append(f"- {fn}")
 
      return "\n".join(md)
 
 
- def chart_to_markdown(ext: dict) -> str:
+ def chart_to_markdown(ext: dict, context: dict = None) -> str:
      md = []
      meta = ext.get("chart_metadata", {})
 
@@ -366,31 +535,61 @@ def chart_to_markdown(ext: dict) -> str:
      md.append(f"\n**Type:** {ext.get('chart_type', 'Chart')}")
      md.append(f"**Source:** Page {meta.get('source_page', '?')}")
 
+     # Add manuscript context if available
+     if context:
+         if context.get("caption"):
+             md.append(f"\n> **Caption:** {context['caption']}")
+         if context.get("contexts"):
+             md.append("\n### Manuscript Context\n")
+             for ctx in context["contexts"][:2]:  # Limit to 2 most relevant
+                 md.append(f"> {ctx[:500]}...")  # Truncate long contexts
+
      axes = ext.get("axes", {})
      md.append("\n## Axes\n")
      if axes.get("x"):
          md.append(f"- **X-axis:** {axes['x'].get('label', 'unlabeled')}")
          md.append(f"  - Range: {axes['x'].get('min')} to {axes['x'].get('max')}")
+         if axes['x'].get('ticks'):
+             md.append(f"  - Ticks: {axes['x']['ticks']}")
      if axes.get("y"):
          md.append(f"- **Y-axis:** {axes['y'].get('label', 'unlabeled')}")
          md.append(f"  - Range: {axes['y'].get('min')} to {axes['y'].get('max')}")
+         if axes['y'].get('ticks'):
+             md.append(f"  - Ticks: {axes['y']['ticks']}")
 
      legend = ext.get("legend", {})
      if legend.get("entries"):
-         md.append("\n## Legend\n")
+         md.append("\n## Legend (Verbatim)\n")
          for entry in legend["entries"]:
              style = entry.get("line_style") or entry.get("style", "")
-             md.append(f"- **{entry['label']}**: {entry.get('color', '')} {style}")
+             color = entry.get("color", "")
+             md.append(f"- **\"{entry['label']}\"**: {color} {style}")
+
+     # Data series details (for Kaplan-Meier etc.)
+     series = ext.get("data_series", [])
+     if series:
+         md.append("\n## Data Series\n")
+         for s in series:
+             md.append(f"### {s.get('name', 'Series')}")
+             if s.get("data_points"):
+                 md.append("Key data points:")
+                 for pt in s["data_points"][:10]:  # First 10 points
+                     md.append(f"  - x={pt.get('x')}, y={pt.get('y')}")
 
      stats = ext.get("statistical_annotations", [])
      if stats:
          md.append("\n## Statistical Annotations\n")
          for stat in stats:
-             md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
+             if stat.get("type") == "hazard_ratio":
+                 md.append(f"- Hazard Ratio: {stat.get('value')} (95% CI: {stat.get('ci_lower')}-{stat.get('ci_upper')})")
+             elif stat.get("type") == "p_value":
+                 md.append(f"- {stat.get('test', 'P-value')}: {stat.get('value')}")
+             else:
+                 md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
 
      risk = ext.get("risk_table", {})
      if risk.get("present"):
-         md.append("\n## Risk Table\n")
+         md.append("\n## Risk Table (Number at Risk)\n")
          headers = risk.get("headers", [])
          md.append("| " + " | ".join(headers) + " |")
          md.append("| " + " | ".join(["---"] * len(headers)) + " |")
399
598
  md.append("| " + " | ".join(values) + " |")
400
599
 
401
600
  return "\n".join(md)
601
+
602
+
603
+ def multipanel_to_markdown(ext: dict, context: dict = None) -> str:
604
+ """Convert multi-panel figure extraction to detailed markdown."""
605
+ md = []
606
+ meta = ext.get("figure_metadata", {})
607
+
608
+ md.append(f"## {meta.get('title', 'Multi-Panel Figure')}")
609
+ md.append(f"\n**Type:** multi_panel ({meta.get('total_panels', '?')} panels)")
610
+ md.append(f"**Source:** Page {meta.get('source_page', '?')}")
611
+ md.append(f"**Layout:** {meta.get('layout', 'unknown')}")
612
+
613
+ # Add manuscript context if available
614
+ if context:
615
+ if context.get("caption"):
616
+ md.append(f"\n> **Figure Caption (from manuscript):** {context['caption']}")
617
+ if context.get("contexts"):
618
+ md.append("\n### Manuscript Context\n")
619
+ md.append("*How this figure is described in the paper:*\n")
620
+ for ctx in context["contexts"][:2]:
621
+ md.append(f"> ...{ctx[:500]}...\n")
622
+
623
+ # Process each panel in detail
624
+ panels = ext.get("panels", [])
625
+ for panel in panels:
626
+ panel_id = panel.get("panel_id", "?")
627
+ panel_type = panel.get("panel_type", "unknown")
628
+ panel_title = panel.get("panel_title", f"Panel {panel_id}")
629
+
630
+ md.append(f"\n### Panel {panel_id}: {panel_title}")
631
+ md.append(f"**Type:** {panel_type}")
632
+
633
+ extraction = panel.get("extraction", {})
634
+
635
+ # Axes
636
+ axes = extraction.get("axes", {})
637
+ if axes:
638
+ md.append("\n**Axes:**")
639
+ if axes.get("x"):
640
+ x = axes["x"]
641
+ md.append(f"- X-axis: \"{x.get('label', 'unlabeled')}\"")
642
+ md.append(f" - Range: {x.get('min')} to {x.get('max')}")
643
+ if x.get("ticks"):
644
+ md.append(f" - Tick values: {x['ticks']}")
645
+ if axes.get("y"):
646
+ y = axes["y"]
647
+ md.append(f"- Y-axis: \"{y.get('label', 'unlabeled')}\"")
648
+ md.append(f" - Range: {y.get('min')} to {y.get('max')}")
649
+ if y.get("ticks"):
650
+ md.append(f" - Tick values: {y['ticks']}")
651
+
652
+ # Legend (VERBATIM)
653
+ legend = extraction.get("legend", {})
654
+ if legend.get("entries"):
655
+ md.append("\n**Legend (Verbatim from image):**")
656
+ if legend.get("title"):
657
+ md.append(f"*{legend['title']}*")
658
+ for entry in legend["entries"]:
659
+ label = entry.get("label", "")
660
+ color = entry.get("color", "")
661
+ style = entry.get("line_style") or entry.get("style", "")
662
+ md.append(f"- \"{label}\" — {color} {style}")
663
+
664
+ # Curve endpoints (for Kaplan-Meier)
665
+ endpoints = extraction.get("curve_endpoints", [])
666
+ if endpoints:
667
+ md.append("\n**Curve Endpoints:**")
668
+ for ep in endpoints:
669
+ md.append(f"- {ep.get('series', 'Series')}: y={ep.get('final_y')} at x={ep.get('final_x')}")
670
+ if ep.get("note"):
671
+ md.append(f" - Note: {ep['note']}")
672
+
673
+ # Key observations
674
+ observations = extraction.get("key_observations", [])
675
+ if observations:
676
+ md.append("\n**Key Observations:**")
677
+ for obs in observations:
678
+ md.append(f"- {obs}")
679
+
680
+ # Statistical annotations
681
+ stats = extraction.get("statistical_annotations", [])
682
+ if stats:
683
+ md.append("\n**Statistical Annotations:**")
684
+ for stat in stats:
685
+ md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
686
+
687
+ # All visible text
688
+ all_text = extraction.get("all_visible_text", [])
689
+ if all_text:
690
+ md.append("\n**All Visible Text:**")
691
+ md.append(f"```\n{', '.join(all_text)}\n```")
692
+
693
+ # Shared elements
694
+ shared = ext.get("shared_elements", {})
695
+ if shared.get("shared_legend") or shared.get("cross_references"):
696
+ md.append("\n### Shared Elements")
697
+ if shared.get("shared_legend"):
698
+ md.append(f"- Shared legend applies to panels: {shared['shared_legend'].get('applies_to', [])}")
699
+ if shared.get("cross_references"):
700
+ for ref in shared["cross_references"]:
701
+ md.append(f"- {ref}")
702
+
703
+ return "\n".join(md)
402
704
  ```
403
705
 
404
- ## Step 8: Generate Combined STRUCTURED.md
706
+ ## Step 8: Generate Combined STRUCTURED.md with Manuscript Context
405
707
 
406
708
  ```python
709
+ import json
407
710
  from pathlib import Path
408
711
  from datetime import datetime
409
712
 
410
713
  output_dir = Path("<output_dir>")
411
714
  elements_dir = output_dir / "elements"
715
+ extractions_dir = output_dir / "extractions"
412
716
  doc_name = "<document_name>"
413
717
 
718
+ # Load manuscript context
719
+ context_file = output_dir / "manuscript_context.json"
720
+ manuscript_context = {}
721
+ if context_file.exists():
722
+ with open(context_file) as f:
723
+ manuscript_context = json.load(f)
724
+
725
+ # Load captions
726
+ captions_file = output_dir / "captions.json"
727
+ captions = {"figures": {}, "tables": {}}
728
+ if captions_file.exists():
729
+ with open(captions_file) as f:
730
+ captions = json.load(f)
731
+
414
732
  # Read all element files in order
415
733
  element_files = sorted(elements_dir.glob("element_*.md"))
416
734
 
@@ -419,21 +737,70 @@ sections.append(f"# {doc_name} - Structured Extraction")
419
737
  sections.append(f"\n**Original:** {doc_name}")
420
738
  sections.append(f"**Extracted:** {datetime.now().isoformat()}")
421
739
  sections.append(f"**Elements:** {len(element_files)} visual elements processed")
422
- sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification)")
740
+ sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification + manuscript context)")
423
741
  sections.append("\n---\n")
424
742
 
425
- # Add each element
743
+ # Table of contents
744
+ sections.append("## Table of Contents\n")
426
745
  for i, elem_file in enumerate(element_files, 1):
746
+ elem_id = elem_file.stem
747
+ # Try to get title from extraction
748
+ extract_file = extractions_dir / f"{elem_id}.json"
749
+ title = f"Element {i}"
750
+ if extract_file.exists():
751
+ with open(extract_file) as f:
752
+ ext = json.load(f)
753
+ title = ext.get("chart_metadata", {}).get("title") or \
754
+ ext.get("table_metadata", {}).get("title") or \
755
+ ext.get("figure_metadata", {}).get("title") or \
756
+ f"Element {i}"
757
+ sections.append(f"{i}. [{title}](#{elem_id})")
758
+ sections.append("\n---\n")
759
+
760
+ # Add each element with context
761
+ for i, elem_file in enumerate(element_files, 1):
762
+ elem_id = elem_file.stem
763
+
427
764
  with open(elem_file) as f:
428
765
  content = f.read()
429
766
 
430
- sections.append(f"## Element {i}")
767
+ sections.append(f'<a id="{elem_id}"></a>\n')
768
+ sections.append(f"## Element {i}: {elem_id}\n")
769
+
770
+ # Try to match with manuscript context
771
+ # Heuristic: Figure 1 = first figure element, etc.
772
+ fig_num = str(i)
773
+ if fig_num in manuscript_context.get("figures", {}):
774
+ ctx = manuscript_context["figures"][fig_num]
775
+ if ctx.get("caption"):
776
+ sections.append(f"\n> **Figure Caption:** {ctx['caption']}\n")
777
+ if ctx.get("contexts"):
778
+ sections.append("\n### Manuscript Context\n")
779
+ sections.append("*Relevant text from the manuscript:*\n")
780
+ for c in ctx["contexts"][:2]:
781
+ # Truncate and clean
782
+ clean_ctx = ' '.join(c.split())[:600]
783
+ sections.append(f"> ...{clean_ctx}...\n")
784
+
431
785
  sections.append(content)
432
786
  sections.append("\n---\n")
433
787
 
788
+ # Add manuscript summary section
789
+ if manuscript_context.get("figures") or manuscript_context.get("tables"):
790
+ sections.append("## Manuscript References Summary\n")
791
+ sections.append("### Figure Captions from Manuscript\n")
792
+ for fig_num, ctx in manuscript_context.get("figures", {}).items():
793
+ sections.append(f"- **Figure {fig_num}:** {ctx.get('caption', 'No caption found')}")
794
+ sections.append("\n### Table Captions from Manuscript\n")
795
+ for table_num, ctx in manuscript_context.get("tables", {}).items():
796
+ sections.append(f"- **Table {table_num}:** {ctx.get('caption', 'No caption found')}")
797
+ sections.append("\n---\n")
798
+
434
799
  # Write combined file
435
800
  with open(output_dir / "STRUCTURED.md", "w") as f:
436
801
  f.write("\n".join(sections))
802
+
803
+ print(f"Generated STRUCTURED.md with {len(element_files)} elements and manuscript context")
437
804
  ```
438
805
 
439
806
  ## Step 9: Generate Quality Report
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "structurecc",
-   "version": "2.0.3",
+   "version": "2.1.0",
    "description": "Extract structured data from PDFs, Word docs, and images using Claude Code.",
    "keywords": [
      "document-extraction",