structurecc 2.1.0 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,931 +0,0 @@
---
name: structure
description: Extract structured data from PDFs and Word docs using AI agent swarms with verbatim accuracy
arguments:
  - name: path
    description: Path to document (PDF, DOCX, or image)
    required: true
---

# /structure - Agentic Document Extraction v2.0

Turn complex documents into structured JSON + markdown using a 3-phase pipeline with verification.

## Pipeline Overview

```
Image → [Phase 1: Classify] → [Phase 2: Extract] → [Phase 3: Verify] → Output
                                       ↑__________REVISION LOOP__________↓
```

1. **Classify** - Identify visual element type (table, chart, heatmap, diagram, etc.)
2. **Extract** - Use specialized extractor for that type with verbatim accuracy
3. **Verify** - Score extraction quality, trigger revision if < 0.9
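
The control flow above can be sketched as a small driver function. This is illustrative only: `classify`, `extract`, and `verify` stand in for the agent dispatches described in the steps below, and are passed in as callables rather than being a real API.

```python
PASS_THRESHOLD = 0.9   # verification score needed to pass
MAX_REVISIONS = 2      # after this, the element goes to human review

def process_element(image_path, classify, extract, verify):
    """Run one element through classify -> extract -> verify,
    re-extracting with feedback until it passes or revisions run out."""
    element_type = classify(image_path)                            # Phase 1
    extraction = extract(image_path, element_type, feedback=None)  # Phase 2
    report = verify(image_path, extraction)                        # Phase 3
    revisions = 0
    while report["scores"]["overall"] < PASS_THRESHOLD and revisions < MAX_REVISIONS:
        revisions += 1
        extraction = extract(image_path, element_type,
                             feedback=report.get("revision_feedback"))
        report = verify(image_path, extraction)
    # Whatever is still failing after MAX_REVISIONS goes to human review
    report["needs_human_review"] = report["scores"]["overall"] < PASS_THRESHOLD
    return extraction, report
```

In the actual pipeline each phase is a batch of parallel agent Task calls rather than a sequential loop, but the pass/revise/human-review logic is the same.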

## Step 1: Set Up Output Directory

Create the output directory next to the document:

```
<document_name>_extracted/
├── images/                  # Raw extracted images
├── classifications/         # Phase 1: type detection results
├── extractions/             # Phase 2: JSON extractions
├── verifications/           # Phase 3: quality scores
├── elements/                # Final markdown per element
├── STRUCTURED.md            # Combined markdown output
└── extraction_report.json   # Quality metrics summary
```

```python
from pathlib import Path

doc_path = "<document_path>"
doc_name = Path(doc_path).stem
output_dir = Path(doc_path).parent / f"{doc_name}_extracted"

# Create all subdirectories
for subdir in ["images", "classifications", "extractions", "verifications", "elements"]:
    (output_dir / subdir).mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")
```

## Step 2: Extract Text and Images

**CRITICAL:** Extract BOTH the manuscript text AND images. Figures need context from the paper.

### 2A: Extract Manuscript Text (PDF)

```python
import fitz
import re
import json
from pathlib import Path

pdf_path = "<document_path>"
output_dir = Path("<output_dir>")

doc = fitz.open(pdf_path)
full_text = []
page_texts = {}

for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text("text")
    full_text.append(text)
    page_texts[page_num + 1] = text

# Save full manuscript text
with open(output_dir / "manuscript_text.txt", "w") as f:
    f.write("\n\n---PAGE BREAK---\n\n".join(full_text))

# Parse figure and table captions
def extract_captions(text):
    """Extract Figure and Table captions from manuscript text."""
    captions = {"figures": {}, "tables": {}}

    # Figure patterns: "Figure 1.", "Figure 1:", "Fig. 1.", "FIGURE 1"
    fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
    for match in re.finditer(fig_pattern, text, re.IGNORECASE):
        fig_num = match.group(1)
        caption = match.group(2).strip()
        # Clean up caption (remove extra whitespace)
        caption = ' '.join(caption.split())
        captions["figures"][fig_num] = caption

    # Table patterns: "Table 1.", "Table 1:", "TABLE 1"
    table_pattern = r'(?:Table|TABLE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
    for match in re.finditer(table_pattern, text, re.IGNORECASE):
        table_num = match.group(1)
        caption = match.group(2).strip()
        caption = ' '.join(caption.split())
        captions["tables"][table_num] = caption

    return captions

all_text = "\n".join(full_text)
captions = extract_captions(all_text)

# Save captions
with open(output_dir / "captions.json", "w") as f:
    json.dump(captions, f, indent=2)

print(f"Extracted text from {len(doc)} pages")
print(f"Found {len(captions['figures'])} figure captions")
print(f"Found {len(captions['tables'])} table captions")

doc.close()
```

### 2B: Extract Context Snippets

For each figure/table, extract the surrounding manuscript context. This continues the 2A script (it reuses `re`, `json`, `output_dir`, `all_text`, and `captions` from above):

```python
def extract_context_for_element(text, element_type, element_num, window=500):
    """Extract text context surrounding references to a figure/table."""
    contexts = []

    if element_type == "figure":
        patterns = [
            rf'(?:Figure|Fig\.?)\s*{element_num}\b',
            rf'(?:figure|fig\.?)\s*{element_num}\b'
        ]
    else:
        patterns = [
            rf'(?:Table)\s*{element_num}\b',
            rf'(?:table)\s*{element_num}\b'
        ]

    for pattern in patterns:
        for match in re.finditer(pattern, text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            context = text[start:end].strip()
            # Clean up whitespace
            context = ' '.join(context.split())
            if context not in contexts:
                contexts.append(context)

    return contexts

# Extract contexts for each figure
figure_contexts = {}
for fig_num in captions["figures"]:
    contexts = extract_context_for_element(all_text, "figure", fig_num)
    figure_contexts[fig_num] = {
        "caption": captions["figures"][fig_num],
        "contexts": contexts
    }

# Extract contexts for each table
table_contexts = {}
for table_num in captions["tables"]:
    contexts = extract_context_for_element(all_text, "table", table_num)
    table_contexts[table_num] = {
        "caption": captions["tables"][table_num],
        "contexts": contexts
    }

# Save context data
with open(output_dir / "manuscript_context.json", "w") as f:
    json.dump({
        "figures": figure_contexts,
        "tables": table_contexts
    }, f, indent=2)
```

### 2C: Extract Images

**For PDF files** - Use PyMuPDF:

```python
import fitz
import json
from pathlib import Path

pdf_path = "<document_path>"
output_dir = Path("<output_dir>")
images_dir = output_dir / "images"

doc = fitz.open(pdf_path)
extracted = []

for page_num in range(len(doc)):
    page = doc[page_num]
    for img_idx, img in enumerate(page.get_images()):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # CMYK or other non-RGB: convert
            pix = fitz.Pixmap(fitz.csRGB, pix)

        img_name = f"p{page_num + 1}_img{img_idx + 1}.png"
        img_path = images_dir / img_name
        pix.save(str(img_path))
        extracted.append({
            "id": f"element_{len(extracted) + 1:03d}",
            "path": str(img_path),
            "page": page_num + 1,
            "name": img_name
        })
        pix = None  # Free the pixmap

doc.close()

# Save image manifest
with open(output_dir / "image_manifest.json", "w") as f:
    json.dump(extracted, f, indent=2)

print(f"Extracted {len(extracted)} images")
```

**For DOCX files** - Unzip and extract media:

```python
import json
import os
from pathlib import Path
from zipfile import ZipFile

docx_path = "<document_path>"
output_dir = Path("<output_dir>")
images_dir = output_dir / "images"

extracted = []
with ZipFile(docx_path, 'r') as z:
    for f in z.namelist():
        if f.startswith('word/media/'):
            name = os.path.basename(f)
            path = images_dir / name
            with z.open(f) as src, open(path, 'wb') as dst:
                dst.write(src.read())
            extracted.append({
                "id": f"element_{len(extracted) + 1:03d}",
                "path": str(path),
                "name": name
            })

# Save image manifest
with open(output_dir / "image_manifest.json", "w") as f:
    json.dump(extracted, f, indent=2)

print(f"Extracted {len(extracted)} images")
```

## Step 3: Phase 1 - Classification (Parallel)

**CRITICAL:** Launch ALL classification agents in ONE message.

For EACH extracted image, spawn a classification agent:

```
Task(
  subagent_type: "general-purpose",
  model: "haiku",  # Fast classification
  description: "Classify element [N]",
  prompt: """
    You are a visual element classifier. Read the agent instructions from:
    ~/.claude/agents/structurecc-classifier.md

    **Image:** <full_path_to_image>
    **Element ID:** <element_id>
    **Output:** Write JSON to <output_dir>/classifications/<element_id>_class.json

    Analyze the image and output the classification JSON as specified in the agent file.
  """
)
```

Launch 10 images = 10 Task calls in ONE message. They run in parallel.
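
One way to generate those per-image prompts mechanically from `image_manifest.json` (a sketch; the Task calls themselves are issued in the chat, not from Python):

```python
import json
from pathlib import Path

def classification_prompts(output_dir):
    """Build one classifier prompt per image listed in image_manifest.json."""
    output_dir = Path(output_dir)
    with open(output_dir / "image_manifest.json") as f:
        manifest = json.load(f)

    prompts = []
    for entry in manifest:
        out_file = output_dir / "classifications" / f"{entry['id']}_class.json"
        prompts.append(
            "You are a visual element classifier. Read the agent instructions from:\n"
            "~/.claude/agents/structurecc-classifier.md\n"
            f"**Image:** {entry['path']}\n"
            f"**Element ID:** {entry['id']}\n"
            f"**Output:** Write JSON to {out_file}\n"
            "Analyze the image and output the classification JSON as specified in the agent file."
        )
    return prompts
```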

## Step 4: Phase 2 - Specialized Extraction (Parallel)

After classifications complete, read each classification file and dispatch to the correct extractor.

**Extractor Routing:**

| Classification | Extractor Agent |
|---------------|-----------------|
| `table_simple`, `table_complex` | `structurecc-extract-table.md` |
| `chart_*` (all chart types) | `structurecc-extract-chart.md` |
| `heatmap` | `structurecc-extract-heatmap.md` |
| `diagram_*` (all diagram types) | `structurecc-extract-diagram.md` |
| `multi_panel` | `structurecc-extract-multipanel.md` |
| Everything else | `structurecc-extract-generic.md` |
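
The same routing can be written as a small dispatch function (a sketch mirroring the table above):

```python
def pick_extractor(classification: str) -> str:
    """Map a Phase 1 classification label to its extractor agent file."""
    if classification in ("table_simple", "table_complex"):
        return "structurecc-extract-table.md"
    if classification.startswith("chart_"):
        return "structurecc-extract-chart.md"
    if classification == "heatmap":
        return "structurecc-extract-heatmap.md"
    if classification.startswith("diagram_"):
        return "structurecc-extract-diagram.md"
    if classification == "multi_panel":
        return "structurecc-extract-multipanel.md"
    # Everything else falls through to the generic extractor
    return "structurecc-extract-generic.md"
```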

For EACH element, spawn the appropriate extractor WITH manuscript context:

```
Task(
  subagent_type: "general-purpose",
  model: "opus",  # Best quality for extraction
  description: "Extract element [N]",
  prompt: """
    You are extracting structured data from a visual element.

    Read the agent instructions from:
    ~/.claude/agents/<appropriate_extractor>.md

    **Image:** <full_path_to_image>
    **Element ID:** <element_id>
    **Classification:** <classification_type>
    **Source:** Page <N> of <document_name>
    **Output:** Write JSON to <output_dir>/extractions/<element_id>.json

    ## MANUSCRIPT CONTEXT (Use this to understand the figure)

    **Figure Caption:**
    <caption_from_captions.json>

    **Relevant Manuscript Text:**
    <context_snippets_from_manuscript_context.json>

    ---

    Follow the extractor instructions EXACTLY. Output ONLY valid JSON.

    CRITICAL REQUIREMENTS:
    1. VERBATIM extraction - Copy ALL text exactly as shown in the image
    2. Use manuscript context to understand what the figure shows
    3. Include the figure caption in your extraction
    4. For charts: capture EXACT legend text, axis labels, tick values
    5. For Kaplan-Meier/survival curves: note step-function nature, describe curve progression
    6. Describe colors precisely (e.g., "purple line", "light purple shaded 95% CI", "yellow/orange shaded band")
  """
)
```

**IMPORTANT:** Read `manuscript_context.json` to get the caption and context for each element.

Launch ALL extractions in ONE message for parallel processing.

## Step 5: Phase 3 - Verification (Parallel)

After extractions complete, verify each extraction:

```
Task(
  subagent_type: "general-purpose",
  model: "sonnet",  # Good balance for verification
  description: "Verify element [N]",
  prompt: """
    You are verifying an extraction against its source image.

    Read the agent instructions from:
    ~/.claude/agents/structurecc-verifier.md

    **Source Image:** <full_path_to_image>
    **Extraction JSON:** <output_dir>/extractions/<element_id>.json
    **Element ID:** <element_id>
    **Output:** Write JSON to <output_dir>/verifications/<element_id>_verify.json

    Compare the extraction to the source image and produce a verification report.
  """
)
```

Launch ALL verifications in ONE message.

## Step 6: Revision Loop

After verifications complete, check for failures:

```python
import json
from pathlib import Path

output_dir = Path("<output_dir>")
verifications_dir = output_dir / "verifications"

needs_revision = []
passed = []
needs_human_review = []

for verify_file in verifications_dir.glob("*_verify.json"):
    with open(verify_file) as f:
        result = json.load(f)

    element_id = result["element_id"]

    if result.get("needs_human_review"):
        needs_human_review.append(element_id)
    elif not result["pass"]:
        revision_num = result.get("revision_feedback", {}).get("revision_number", 0)
        if revision_num < 2:
            needs_revision.append({
                "element_id": element_id,
                "feedback": result["revision_feedback"],
                "score": result["scores"]["overall"]
            })
        else:
            needs_human_review.append(element_id)
    else:
        passed.append(element_id)

print(f"Passed: {len(passed)}")
print(f"Needs revision: {len(needs_revision)}")
print(f"Needs human review: {len(needs_human_review)}")
```

For elements needing revision, re-run extraction with specific feedback:

```
Task(
  subagent_type: "general-purpose",
  model: "opus",
  description: "Re-extract element [N] (revision)",
  prompt: """
    REVISION EXTRACTION - Previous extraction failed verification.

    Read the agent instructions from:
    ~/.claude/agents/<appropriate_extractor>.md

    **Image:** <full_path_to_image>
    **Element ID:** <element_id>
    **Previous Score:** <score>
    **Output:** Write JSON to <output_dir>/extractions/<element_id>.json

    SPECIFIC FIXES REQUIRED:
    <list_specific_fixes_from_revision_feedback>

    Focus on fixing these specific issues while preserving correct sections.
  """
)
```

After re-extraction, re-verify. Max 2 revision attempts per element.

## Step 7: Generate Markdown Elements

After all verifications pass (or reach human review), convert the JSON extractions to markdown:

```python
import json
from pathlib import Path

output_dir = Path("<output_dir>")
extractions_dir = output_dir / "extractions"
elements_dir = output_dir / "elements"

for extract_file in extractions_dir.glob("*.json"):
    element_id = extract_file.stem

    with open(extract_file) as f:
        extraction = json.load(f)

    # Convert to markdown based on type (json_to_markdown is defined below)
    md_content = json_to_markdown(extraction)

    with open(elements_dir / f"{element_id}.md", "w") as f:
        f.write(md_content)
```

**Markdown conversion function:**

```python
import json
from pathlib import Path


def json_to_markdown(extraction: dict, context: dict = None) -> str:
    """Convert JSON extraction to clean markdown with manuscript context."""

    ext_type = extraction.get("extraction_type")

    if ext_type == "table":
        return table_to_markdown(extraction, context)
    elif ext_type == "chart":
        return chart_to_markdown(extraction, context)
    elif ext_type == "heatmap":
        return heatmap_to_markdown(extraction, context)
    elif ext_type == "diagram":
        return diagram_to_markdown(extraction, context)
    elif ext_type == "multi_panel":
        return multipanel_to_markdown(extraction, context)
    else:
        return generic_to_markdown(extraction, context)


# Load manuscript context for element processing
def get_context_for_element(output_dir: Path, element_num: int, element_type: str = "figure"):
    """Get manuscript context for a specific element."""
    context_file = output_dir / "manuscript_context.json"
    if not context_file.exists():
        return None

    with open(context_file) as f:
        manuscript_context = json.load(f)

    key = "figures" if element_type == "figure" else "tables"
    return manuscript_context.get(key, {}).get(str(element_num))


def table_to_markdown(ext: dict, context: dict = None) -> str:
    md = []
    meta = ext.get("table_metadata", {})

    md.append(f"## {meta.get('title', 'Table')}")
    md.append(f"\n**Type:** table ({meta.get('table_type', 'standard')})")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")

    # Add manuscript context if available
    if context:
        if context.get("caption"):
            md.append(f"\n> **Table Caption (from manuscript):** {context['caption']}")
        if context.get("contexts"):
            md.append("\n### Manuscript Context\n")
            for ctx in context["contexts"][:2]:
                md.append(f"> {ctx[:400]}...")

    if meta.get("caption"):
        md.append(f"\n> **Caption (from image):** {meta['caption']}")

    md.append("\n### Data\n")
    md.append(ext.get("markdown_table", ""))

    if meta.get("footnotes"):
        md.append("\n### Footnotes\n")
        for fn in meta["footnotes"]:
            md.append(f"- {fn}")

    return "\n".join(md)


def chart_to_markdown(ext: dict, context: dict = None) -> str:
    md = []
    meta = ext.get("chart_metadata", {})

    md.append(f"# {meta.get('title', 'Chart')}")
    md.append(f"\n**Type:** {ext.get('chart_type', 'Chart')}")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")

    # Add manuscript context if available
    if context:
        if context.get("caption"):
            md.append(f"\n> **Caption:** {context['caption']}")
        if context.get("contexts"):
            md.append("\n### Manuscript Context\n")
            for ctx in context["contexts"][:2]:  # Limit to 2 most relevant
                md.append(f"> {ctx[:500]}...")  # Truncate long contexts

    axes = ext.get("axes", {})
    md.append("\n## Axes\n")
    if axes.get("x"):
        md.append(f"- **X-axis:** {axes['x'].get('label', 'unlabeled')}")
        md.append(f"  - Range: {axes['x'].get('min')} to {axes['x'].get('max')}")
        if axes['x'].get('ticks'):
            md.append(f"  - Ticks: {axes['x']['ticks']}")
    if axes.get("y"):
        md.append(f"- **Y-axis:** {axes['y'].get('label', 'unlabeled')}")
        md.append(f"  - Range: {axes['y'].get('min')} to {axes['y'].get('max')}")
        if axes['y'].get('ticks'):
            md.append(f"  - Ticks: {axes['y']['ticks']}")

    legend = ext.get("legend", {})
    if legend.get("entries"):
        md.append("\n## Legend (Verbatim)\n")
        for entry in legend["entries"]:
            style = entry.get("line_style") or entry.get("style", "")
            color = entry.get("color", "")
            md.append(f"- **\"{entry['label']}\"**: {color} {style}")

    # Data series details (for Kaplan-Meier etc.)
    series = ext.get("data_series", [])
    if series:
        md.append("\n## Data Series\n")
        for s in series:
            md.append(f"### {s.get('name', 'Series')}")
            if s.get("data_points"):
                md.append("Key data points:")
                for pt in s["data_points"][:10]:  # First 10 points
                    md.append(f"  - x={pt.get('x')}, y={pt.get('y')}")

    stats = ext.get("statistical_annotations", [])
    if stats:
        md.append("\n## Statistical Annotations\n")
        for stat in stats:
            if stat.get("type") == "hazard_ratio":
                md.append(f"- Hazard Ratio: {stat.get('value')} (95% CI: {stat.get('ci_lower')}-{stat.get('ci_upper')})")
            elif stat.get("type") == "p_value":
                md.append(f"- {stat.get('test', 'P-value')}: {stat.get('value')}")
            else:
                md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")

    risk = ext.get("risk_table", {})
    if risk.get("present"):
        md.append("\n## Risk Table (Number at Risk)\n")
        headers = risk.get("headers", [])
        md.append("| " + " | ".join(headers) + " |")
        md.append("| " + " | ".join(["---"] * len(headers)) + " |")
        for row in risk.get("rows", []):
            values = [row.get("group", "")] + row.get("values", [])
            md.append("| " + " | ".join(values) + " |")

    return "\n".join(md)


def multipanel_to_markdown(ext: dict, context: dict = None) -> str:
    """Convert multi-panel figure extraction to detailed markdown."""
    md = []
    meta = ext.get("figure_metadata", {})

    md.append(f"## {meta.get('title', 'Multi-Panel Figure')}")
    md.append(f"\n**Type:** multi_panel ({meta.get('total_panels', '?')} panels)")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")
    md.append(f"**Layout:** {meta.get('layout', 'unknown')}")

    # Add manuscript context if available
    if context:
        if context.get("caption"):
            md.append(f"\n> **Figure Caption (from manuscript):** {context['caption']}")
        if context.get("contexts"):
            md.append("\n### Manuscript Context\n")
            md.append("*How this figure is described in the paper:*\n")
            for ctx in context["contexts"][:2]:
                md.append(f"> ...{ctx[:500]}...\n")

    # Process each panel in detail
    panels = ext.get("panels", [])
    for panel in panels:
        panel_id = panel.get("panel_id", "?")
        panel_type = panel.get("panel_type", "unknown")
        panel_title = panel.get("panel_title", f"Panel {panel_id}")

        md.append(f"\n### Panel {panel_id}: {panel_title}")
        md.append(f"**Type:** {panel_type}")

        extraction = panel.get("extraction", {})

        # Axes
        axes = extraction.get("axes", {})
        if axes:
            md.append("\n**Axes:**")
            if axes.get("x"):
                x = axes["x"]
                md.append(f"- X-axis: \"{x.get('label', 'unlabeled')}\"")
                md.append(f"  - Range: {x.get('min')} to {x.get('max')}")
                if x.get("ticks"):
                    md.append(f"  - Tick values: {x['ticks']}")
            if axes.get("y"):
                y = axes["y"]
                md.append(f"- Y-axis: \"{y.get('label', 'unlabeled')}\"")
                md.append(f"  - Range: {y.get('min')} to {y.get('max')}")
                if y.get("ticks"):
                    md.append(f"  - Tick values: {y['ticks']}")

        # Legend (VERBATIM)
        legend = extraction.get("legend", {})
        if legend.get("entries"):
            md.append("\n**Legend (Verbatim from image):**")
            if legend.get("title"):
                md.append(f"*{legend['title']}*")
            for entry in legend["entries"]:
                label = entry.get("label", "")
                color = entry.get("color", "")
                style = entry.get("line_style") or entry.get("style", "")
                md.append(f"- \"{label}\" — {color} {style}")

        # Curve endpoints (for Kaplan-Meier)
        endpoints = extraction.get("curve_endpoints", [])
        if endpoints:
            md.append("\n**Curve Endpoints:**")
            for ep in endpoints:
                md.append(f"- {ep.get('series', 'Series')}: y={ep.get('final_y')} at x={ep.get('final_x')}")
                if ep.get("note"):
                    md.append(f"  - Note: {ep['note']}")

        # Key observations
        observations = extraction.get("key_observations", [])
        if observations:
            md.append("\n**Key Observations:**")
            for obs in observations:
                md.append(f"- {obs}")

        # Statistical annotations
        stats = extraction.get("statistical_annotations", [])
        if stats:
            md.append("\n**Statistical Annotations:**")
            for stat in stats:
                md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")

        # All visible text
        all_text = extraction.get("all_visible_text", [])
        if all_text:
            md.append("\n**All Visible Text:**")
            md.append(f"```\n{', '.join(all_text)}\n```")

    # Shared elements
    shared = ext.get("shared_elements", {})
    if shared.get("shared_legend") or shared.get("cross_references"):
        md.append("\n### Shared Elements")
        if shared.get("shared_legend"):
            md.append(f"- Shared legend applies to panels: {shared['shared_legend'].get('applies_to', [])}")
        if shared.get("cross_references"):
            for ref in shared["cross_references"]:
                md.append(f"- {ref}")

    return "\n".join(md)
```
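
`heatmap_to_markdown`, `diagram_to_markdown`, and `generic_to_markdown` are dispatched above but not shown. A minimal fallback following the same pattern might look like this; the `metadata`, `description`, and `all_visible_text` field names are assumptions, so match them to your extractor schemas:

```python
def generic_to_markdown(ext: dict, context: dict = None) -> str:
    """Fallback converter for element types without a dedicated formatter."""
    md = []
    meta = ext.get("metadata", {})  # field name assumed, not from the extractor spec

    md.append(f"## {meta.get('title', 'Figure')}")
    md.append(f"\n**Type:** {ext.get('extraction_type', 'unknown')}")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")

    if context and context.get("caption"):
        md.append(f"\n> **Caption (from manuscript):** {context['caption']}")

    if ext.get("description"):
        md.append(f"\n{ext['description']}")

    all_text = ext.get("all_visible_text", [])
    if all_text:
        md.append("\n**All Visible Text:**")
        md.append(", ".join(all_text))

    return "\n".join(md)
```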

## Step 8: Generate Combined STRUCTURED.md with Manuscript Context

```python
import json
from pathlib import Path
from datetime import datetime

output_dir = Path("<output_dir>")
elements_dir = output_dir / "elements"
extractions_dir = output_dir / "extractions"
doc_name = "<document_name>"

# Load manuscript context
context_file = output_dir / "manuscript_context.json"
manuscript_context = {}
if context_file.exists():
    with open(context_file) as f:
        manuscript_context = json.load(f)

# Load captions
captions_file = output_dir / "captions.json"
captions = {"figures": {}, "tables": {}}
if captions_file.exists():
    with open(captions_file) as f:
        captions = json.load(f)

# Read all element files in order
element_files = sorted(elements_dir.glob("element_*.md"))

sections = []
sections.append(f"# {doc_name} - Structured Extraction")
sections.append(f"\n**Original:** {doc_name}")
sections.append(f"**Extracted:** {datetime.now().isoformat()}")
sections.append(f"**Elements:** {len(element_files)} visual elements processed")
sections.append("**Pipeline:** structurecc v2.0 (3-phase with verification + manuscript context)")
sections.append("\n---\n")

# Table of contents
sections.append("## Table of Contents\n")
for i, elem_file in enumerate(element_files, 1):
    elem_id = elem_file.stem
    # Try to get title from extraction
    extract_file = extractions_dir / f"{elem_id}.json"
    title = f"Element {i}"
    if extract_file.exists():
        with open(extract_file) as f:
            ext = json.load(f)
        title = ext.get("chart_metadata", {}).get("title") or \
                ext.get("table_metadata", {}).get("title") or \
                ext.get("figure_metadata", {}).get("title") or \
                f"Element {i}"
    sections.append(f"{i}. [{title}](#{elem_id})")
sections.append("\n---\n")

# Add each element with context
for i, elem_file in enumerate(element_files, 1):
    elem_id = elem_file.stem

    with open(elem_file) as f:
        content = f.read()

    sections.append(f'<a id="{elem_id}"></a>\n')
    sections.append(f"## Element {i}: {elem_id}\n")

    # Try to match with manuscript context
    # Heuristic: Figure 1 = first figure element, etc.
    fig_num = str(i)
    if fig_num in manuscript_context.get("figures", {}):
        ctx = manuscript_context["figures"][fig_num]
        if ctx.get("caption"):
            sections.append(f"\n> **Figure Caption:** {ctx['caption']}\n")
        if ctx.get("contexts"):
            sections.append("\n### Manuscript Context\n")
            sections.append("*Relevant text from the manuscript:*\n")
            for c in ctx["contexts"][:2]:
                # Truncate and clean
                clean_ctx = ' '.join(c.split())[:600]
                sections.append(f"> ...{clean_ctx}...\n")

    sections.append(content)
    sections.append("\n---\n")

# Add manuscript summary section
if manuscript_context.get("figures") or manuscript_context.get("tables"):
    sections.append("## Manuscript References Summary\n")
    sections.append("### Figure Captions from Manuscript\n")
    for fig_num, ctx in manuscript_context.get("figures", {}).items():
        sections.append(f"- **Figure {fig_num}:** {ctx.get('caption', 'No caption found')}")
    sections.append("\n### Table Captions from Manuscript\n")
    for table_num, ctx in manuscript_context.get("tables", {}).items():
        sections.append(f"- **Table {table_num}:** {ctx.get('caption', 'No caption found')}")
    sections.append("\n---\n")

# Write combined file
with open(output_dir / "STRUCTURED.md", "w") as f:
    f.write("\n".join(sections))

print(f"Generated STRUCTURED.md with {len(element_files)} elements and manuscript context")
```

## Step 9: Generate Quality Report

```python
import json
from pathlib import Path
from datetime import datetime

output_dir = Path("<output_dir>")
verifications_dir = output_dir / "verifications"

report = {
    "document": "<document_name>",
    "timestamp": datetime.now().isoformat(),
    "pipeline_version": "2.0.0",
    "elements_total": 0,
    "elements_passed": 0,
    "elements_revised": 0,
    "elements_human_review": 0,
    "average_quality_score": 0.0,
    "element_details": []
}

scores = []
for verify_file in sorted(verifications_dir.glob("*_verify.json")):
    with open(verify_file) as f:
        result = json.load(f)

    report["elements_total"] += 1

    detail = {
        "element_id": result["element_id"],
        "scores": result["scores"],
        "status": "passed" if result["pass"] else "failed",
        "issues_count": len(result.get("issues", []))
    }

    if result.get("needs_human_review"):
        detail["status"] = "human_review"
        report["elements_human_review"] += 1
    elif result["pass"]:
        report["elements_passed"] += 1

    if result.get("revision_feedback", {}).get("revision_number", 0) > 0:
        report["elements_revised"] += 1

    scores.append(result["scores"]["overall"])
    report["element_details"].append(detail)

report["average_quality_score"] = sum(scores) / len(scores) if scores else 0

# Write report
with open(output_dir / "extraction_report.json", "w") as f:
    json.dump(report, f, indent=2)

print("\nQuality Report:")
print(f"  Total elements: {report['elements_total']}")
print(f"  Passed: {report['elements_passed']}")
print(f"  Revised: {report['elements_revised']}")
print(f"  Human review: {report['elements_human_review']}")
print(f"  Average score: {report['average_quality_score']:.2f}")
```

## Step 10: Display Results

```
╔═══════════════════════════════════════════════════════════════╗
║                      EXTRACTION COMPLETE                      ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  Document: [name]                                             ║
║  Output:   [path]_extracted/                                  ║
║  Pipeline: structurecc v2.0 (3-phase with verification)       ║
║                                                               ║
║  QUALITY SUMMARY                                              ║
║  ──────────────────────────────────────────────               ║
║  Total elements:   [N]                                        ║
║  Passed (≥0.90):   [N] ✓                                      ║
║  Revised:          [N] ↻                                      ║
║  Human review:     [N] ⚠                                      ║
║  Average score:    [0.XX]                                     ║
║                                                               ║
║  FILES                                                        ║
║  ──────────────────────────────────────────────               ║
║  images/                 [N] extracted images                 ║
║  classifications/        [N] type classifications             ║
║  extractions/            [N] JSON extractions                 ║
║  verifications/          [N] quality verifications            ║
║  elements/               [N] markdown files                   ║
║  STRUCTURED.md           Combined output                      ║
║  extraction_report.json  Quality metrics                      ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝
```

Then open: `open "<output_dir>/STRUCTURED.md"`

## Dependencies

Install PyMuPDF if it is not present:

```bash
pip3 install PyMuPDF --quiet
```

## Tips

- Use the opus model for the best extraction quality on complex visuals
- Classification uses haiku for speed (it's just triage)
- Verification uses sonnet for a good balance of speed and quality
- Each phase runs in parallel for speed
- Max 2 revision attempts before human review
- Check `extraction_report.json` for quality metrics
- Individual JSON extractions are preserved for programmatic use

## Troubleshooting

**Low quality scores:**
- Check the source image quality
- Complex tables may need human review
- Handwritten text is challenging

**Revision loop stuck:**
- After 2 revisions, the element goes to human review
- Check `verifications/` for specific issues

**Missing elements:**
- Some PDFs render text as images - check the image count
- Very small images may be logos/icons (expected)
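
A size check during Step 2C can screen those small images out before classification. The 100 px threshold is an assumption, not from the pipeline spec; tune it per document:

```python
MIN_DIMENSION = 100  # px; assumed threshold - logos and icons are usually smaller

def is_substantive(width: int, height: int) -> bool:
    """Heuristic: treat very small images as logos/icons and skip them."""
    return width >= MIN_DIMENSION and height >= MIN_DIMENSION
```

In the PyMuPDF extraction loop, check `pix.width` and `pix.height` with this before saving and adding the image to the manifest.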