structurecc 2.1.0 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,931 +0,0 @@
---
name: structure
description: Extract structured data from PDFs and Word docs using AI agent swarms with verbatim accuracy
arguments:
  - name: path
    description: Path to document (PDF, DOCX, or image)
    required: true
---

# /structure - Agentic Document Extraction v2.0

Turn complex documents into structured JSON + markdown using a 3-phase pipeline with verification.

## Pipeline Overview

```
Image → [Phase 1: Classify] → [Phase 2: Extract] → [Phase 3: Verify] → Output
                                       ↑__________REVISION LOOP__________↓
```

1. **Classify** - Identify visual element type (table, chart, heatmap, diagram, etc.)
2. **Extract** - Use specialized extractor for that type with verbatim accuracy
3. **Verify** - Score extraction quality, trigger revision if < 0.9
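
The control flow above can be sketched as a small driver function. This is illustrative only: `classify`, `extract`, and `verify` stand in for the agent dispatches described in the steps below, and are passed in as callables rather than being a real API.

```python
PASS_THRESHOLD = 0.9   # verification score needed to pass
MAX_REVISIONS = 2      # after this, the element goes to human review

def process_element(image_path, classify, extract, verify):
    """Run one element through classify -> extract -> verify,
    re-extracting with feedback until it passes or revisions run out."""
    element_type = classify(image_path)                            # Phase 1
    extraction = extract(image_path, element_type, feedback=None)  # Phase 2
    report = verify(image_path, extraction)                        # Phase 3
    revisions = 0
    while report["scores"]["overall"] < PASS_THRESHOLD and revisions < MAX_REVISIONS:
        revisions += 1
        extraction = extract(image_path, element_type,
                             feedback=report.get("revision_feedback"))
        report = verify(image_path, extraction)
    # Whatever is still failing after MAX_REVISIONS goes to human review
    report["needs_human_review"] = report["scores"]["overall"] < PASS_THRESHOLD
    return extraction, report
```

In the actual pipeline each phase is a batch of parallel agent Task calls rather than a sequential loop, but the pass/revise/human-review logic is the same.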

## Step 1: Set Up Output Directory

Create the output directory next to the document:

```
<document_name>_extracted/
├── images/                  # Raw extracted images
├── classifications/         # Phase 1: type detection results
├── extractions/             # Phase 2: JSON extractions
├── verifications/           # Phase 3: quality scores
├── elements/                # Final markdown per element
├── STRUCTURED.md            # Combined markdown output
└── extraction_report.json   # Quality metrics summary
```

```python
from pathlib import Path

doc_path = "<document_path>"
doc_name = Path(doc_path).stem
output_dir = Path(doc_path).parent / f"{doc_name}_extracted"

# Create all subdirectories
for subdir in ["images", "classifications", "extractions", "verifications", "elements"]:
    (output_dir / subdir).mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir}")
```

## Step 2: Extract Text and Images

**CRITICAL:** Extract BOTH the manuscript text AND images. Figures need context from the paper.

### 2A: Extract Manuscript Text (PDF)

```python
import fitz
import re
import json
from pathlib import Path

pdf_path = "<document_path>"
output_dir = Path("<output_dir>")

doc = fitz.open(pdf_path)
full_text = []
page_texts = {}

for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text("text")
    full_text.append(text)
    page_texts[page_num + 1] = text

# Save full manuscript text
with open(output_dir / "manuscript_text.txt", "w") as f:
    f.write("\n\n---PAGE BREAK---\n\n".join(full_text))

# Parse figure and table captions
def extract_captions(text):
    """Extract Figure and Table captions from manuscript text."""
    captions = {"figures": {}, "tables": {}}

    # Figure patterns: "Figure 1.", "Figure 1:", "Fig. 1.", "FIGURE 1"
    fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
    for match in re.finditer(fig_pattern, text, re.IGNORECASE):
        fig_num = match.group(1)
        caption = match.group(2).strip()
        # Clean up caption (remove extra whitespace)
        caption = ' '.join(caption.split())
        captions["figures"][fig_num] = caption

    # Table patterns: "Table 1.", "Table 1:", "TABLE 1"
    table_pattern = r'(?:Table|TABLE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
    for match in re.finditer(table_pattern, text, re.IGNORECASE):
        table_num = match.group(1)
        caption = match.group(2).strip()
        caption = ' '.join(caption.split())
        captions["tables"][table_num] = caption

    return captions

all_text = "\n".join(full_text)
captions = extract_captions(all_text)

# Save captions
with open(output_dir / "captions.json", "w") as f:
    json.dump(captions, f, indent=2)

print(f"Extracted text from {len(doc)} pages")
print(f"Found {len(captions['figures'])} figure captions")
print(f"Found {len(captions['tables'])} table captions")

doc.close()
```

### 2B: Extract Context Snippets

For each figure/table, extract the surrounding manuscript context. This continues the 2A script (it reuses `re`, `json`, `output_dir`, `all_text`, and `captions` from above):

```python
def extract_context_for_element(text, element_type, element_num, window=500):
    """Extract text context surrounding references to a figure/table."""
    contexts = []

    if element_type == "figure":
        patterns = [
            rf'(?:Figure|Fig\.?)\s*{element_num}\b',
            rf'(?:figure|fig\.?)\s*{element_num}\b'
        ]
    else:
        patterns = [
            rf'(?:Table)\s*{element_num}\b',
            rf'(?:table)\s*{element_num}\b'
        ]

    for pattern in patterns:
        for match in re.finditer(pattern, text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            context = text[start:end].strip()
            # Clean up whitespace
            context = ' '.join(context.split())
            if context not in contexts:
                contexts.append(context)

    return contexts

# Extract contexts for each figure
figure_contexts = {}
for fig_num in captions["figures"]:
    contexts = extract_context_for_element(all_text, "figure", fig_num)
    figure_contexts[fig_num] = {
        "caption": captions["figures"][fig_num],
        "contexts": contexts
    }

# Extract contexts for each table
table_contexts = {}
for table_num in captions["tables"]:
    contexts = extract_context_for_element(all_text, "table", table_num)
    table_contexts[table_num] = {
        "caption": captions["tables"][table_num],
        "contexts": contexts
    }

# Save context data
with open(output_dir / "manuscript_context.json", "w") as f:
    json.dump({
        "figures": figure_contexts,
        "tables": table_contexts
    }, f, indent=2)
```

### 2C: Extract Images

**For PDF files** - Use PyMuPDF:

```python
import fitz
import json
from pathlib import Path

pdf_path = "<document_path>"
output_dir = Path("<output_dir>")
images_dir = output_dir / "images"

doc = fitz.open(pdf_path)
extracted = []

for page_num in range(len(doc)):
    page = doc[page_num]
    for img_idx, img in enumerate(page.get_images()):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # CMYK or other non-RGB: convert
            pix = fitz.Pixmap(fitz.csRGB, pix)

        img_name = f"p{page_num + 1}_img{img_idx + 1}.png"
        img_path = images_dir / img_name
        pix.save(str(img_path))
        extracted.append({
            "id": f"element_{len(extracted) + 1:03d}",
            "path": str(img_path),
            "page": page_num + 1,
            "name": img_name
        })
        pix = None  # Free the pixmap

doc.close()

# Save image manifest
with open(output_dir / "image_manifest.json", "w") as f:
    json.dump(extracted, f, indent=2)

print(f"Extracted {len(extracted)} images")
```

**For DOCX files** - Unzip and extract media:

```python
import json
import os
from pathlib import Path
from zipfile import ZipFile

docx_path = "<document_path>"
output_dir = Path("<output_dir>")
images_dir = output_dir / "images"

extracted = []
with ZipFile(docx_path, 'r') as z:
    for f in z.namelist():
        if f.startswith('word/media/'):
            name = os.path.basename(f)
            path = images_dir / name
            with z.open(f) as src, open(path, 'wb') as dst:
                dst.write(src.read())
            extracted.append({
                "id": f"element_{len(extracted) + 1:03d}",
                "path": str(path),
                "name": name
            })

# Save image manifest
with open(output_dir / "image_manifest.json", "w") as f:
    json.dump(extracted, f, indent=2)

print(f"Extracted {len(extracted)} images")
```

## Step 3: Phase 1 - Classification (Parallel)

**CRITICAL:** Launch ALL classification agents in ONE message.

For EACH extracted image, spawn a classification agent:

```
Task(
  subagent_type: "general-purpose",
  model: "haiku",  # Fast classification
  description: "Classify element [N]",
  prompt: """
    You are a visual element classifier. Read the agent instructions from:
    ~/.claude/agents/structurecc-classifier.md

    **Image:** <full_path_to_image>
    **Element ID:** <element_id>
    **Output:** Write JSON to <output_dir>/classifications/<element_id>_class.json

    Analyze the image and output the classification JSON as specified in the agent file.
  """
)
```

Launch 10 images = 10 Task calls in ONE message. They run in parallel.
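
One way to generate those per-image prompts mechanically from `image_manifest.json` (a sketch; the Task calls themselves are issued in the chat, not from Python):

```python
import json
from pathlib import Path

def classification_prompts(output_dir):
    """Build one classifier prompt per image listed in image_manifest.json."""
    output_dir = Path(output_dir)
    with open(output_dir / "image_manifest.json") as f:
        manifest = json.load(f)

    prompts = []
    for entry in manifest:
        out_file = output_dir / "classifications" / f"{entry['id']}_class.json"
        prompts.append(
            "You are a visual element classifier. Read the agent instructions from:\n"
            "~/.claude/agents/structurecc-classifier.md\n"
            f"**Image:** {entry['path']}\n"
            f"**Element ID:** {entry['id']}\n"
            f"**Output:** Write JSON to {out_file}\n"
            "Analyze the image and output the classification JSON as specified in the agent file."
        )
    return prompts
```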

## Step 4: Phase 2 - Specialized Extraction (Parallel)

After classifications complete, read each classification file and dispatch to the correct extractor.

**Extractor Routing:**

| Classification | Extractor Agent |
|---------------|-----------------|
| `table_simple`, `table_complex` | `structurecc-extract-table.md` |
| `chart_*` (all chart types) | `structurecc-extract-chart.md` |
| `heatmap` | `structurecc-extract-heatmap.md` |
| `diagram_*` (all diagram types) | `structurecc-extract-diagram.md` |
| `multi_panel` | `structurecc-extract-multipanel.md` |
| Everything else | `structurecc-extract-generic.md` |
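
The same routing can be written as a small dispatch function (a sketch mirroring the table above):

```python
def pick_extractor(classification: str) -> str:
    """Map a Phase 1 classification label to its extractor agent file."""
    if classification in ("table_simple", "table_complex"):
        return "structurecc-extract-table.md"
    if classification.startswith("chart_"):
        return "structurecc-extract-chart.md"
    if classification == "heatmap":
        return "structurecc-extract-heatmap.md"
    if classification.startswith("diagram_"):
        return "structurecc-extract-diagram.md"
    if classification == "multi_panel":
        return "structurecc-extract-multipanel.md"
    # Everything else falls through to the generic extractor
    return "structurecc-extract-generic.md"
```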

For EACH element, spawn the appropriate extractor WITH manuscript context:

```
Task(
  subagent_type: "general-purpose",
  model: "opus",  # Best quality for extraction
  description: "Extract element [N]",
  prompt: """
    You are extracting structured data from a visual element.

    Read the agent instructions from:
    ~/.claude/agents/<appropriate_extractor>.md

    **Image:** <full_path_to_image>
    **Element ID:** <element_id>
    **Classification:** <classification_type>
    **Source:** Page <N> of <document_name>
    **Output:** Write JSON to <output_dir>/extractions/<element_id>.json

    ## MANUSCRIPT CONTEXT (Use this to understand the figure)

    **Figure Caption:**
    <caption_from_captions.json>

    **Relevant Manuscript Text:**
    <context_snippets_from_manuscript_context.json>

    ---

    Follow the extractor instructions EXACTLY. Output ONLY valid JSON.

    CRITICAL REQUIREMENTS:
    1. VERBATIM extraction - Copy ALL text exactly as shown in the image
    2. Use manuscript context to understand what the figure shows
    3. Include the figure caption in your extraction
    4. For charts: capture EXACT legend text, axis labels, tick values
    5. For Kaplan-Meier/survival curves: note step-function nature, describe curve progression
    6. Describe colors precisely (e.g., "purple line", "light purple shaded 95% CI", "yellow/orange shaded band")
  """
)
```

**IMPORTANT:** Read `manuscript_context.json` to get the caption and context for each element.

Launch ALL extractions in ONE message for parallel processing.

## Step 5: Phase 3 - Verification (Parallel)

After extractions complete, verify each extraction:

```
Task(
  subagent_type: "general-purpose",
  model: "sonnet",  # Good balance for verification
  description: "Verify element [N]",
  prompt: """
    You are verifying an extraction against its source image.

    Read the agent instructions from:
    ~/.claude/agents/structurecc-verifier.md

    **Source Image:** <full_path_to_image>
    **Extraction JSON:** <output_dir>/extractions/<element_id>.json
    **Element ID:** <element_id>
    **Output:** Write JSON to <output_dir>/verifications/<element_id>_verify.json

    Compare the extraction to the source image and produce a verification report.
  """
)
```

Launch ALL verifications in ONE message.

## Step 6: Revision Loop

After verifications complete, check for failures:

```python
import json
from pathlib import Path

output_dir = Path("<output_dir>")
verifications_dir = output_dir / "verifications"

needs_revision = []
passed = []
needs_human_review = []

for verify_file in verifications_dir.glob("*_verify.json"):
    with open(verify_file) as f:
        result = json.load(f)

    element_id = result["element_id"]

    if result.get("needs_human_review"):
        needs_human_review.append(element_id)
    elif not result["pass"]:
        revision_num = result.get("revision_feedback", {}).get("revision_number", 0)
        if revision_num < 2:
            needs_revision.append({
                "element_id": element_id,
                "feedback": result["revision_feedback"],
                "score": result["scores"]["overall"]
            })
        else:
            needs_human_review.append(element_id)
    else:
        passed.append(element_id)

print(f"Passed: {len(passed)}")
print(f"Needs revision: {len(needs_revision)}")
print(f"Needs human review: {len(needs_human_review)}")
```

For elements needing revision, re-run extraction with specific feedback:

```
Task(
  subagent_type: "general-purpose",
  model: "opus",
  description: "Re-extract element [N] (revision)",
  prompt: """
    REVISION EXTRACTION - Previous extraction failed verification.

    Read the agent instructions from:
    ~/.claude/agents/<appropriate_extractor>.md

    **Image:** <full_path_to_image>
    **Element ID:** <element_id>
    **Previous Score:** <score>
    **Output:** Write JSON to <output_dir>/extractions/<element_id>.json

    SPECIFIC FIXES REQUIRED:
    <list_specific_fixes_from_revision_feedback>

    Focus on fixing these specific issues while preserving correct sections.
  """
)
```

After re-extraction, re-verify. Max 2 revision attempts per element.

## Step 7: Generate Markdown Elements

After all verifications pass (or reach human review), convert the JSON extractions to markdown:

```python
import json
from pathlib import Path

output_dir = Path("<output_dir>")
extractions_dir = output_dir / "extractions"
elements_dir = output_dir / "elements"

for extract_file in extractions_dir.glob("*.json"):
    element_id = extract_file.stem

    with open(extract_file) as f:
        extraction = json.load(f)

    # Convert to markdown based on type (json_to_markdown is defined below)
    md_content = json_to_markdown(extraction)

    with open(elements_dir / f"{element_id}.md", "w") as f:
        f.write(md_content)
```

**Markdown conversion function:**

```python
import json
from pathlib import Path


def json_to_markdown(extraction: dict, context: dict = None) -> str:
    """Convert JSON extraction to clean markdown with manuscript context."""

    ext_type = extraction.get("extraction_type")

    if ext_type == "table":
        return table_to_markdown(extraction, context)
    elif ext_type == "chart":
        return chart_to_markdown(extraction, context)
    elif ext_type == "heatmap":
        return heatmap_to_markdown(extraction, context)
    elif ext_type == "diagram":
        return diagram_to_markdown(extraction, context)
    elif ext_type == "multi_panel":
        return multipanel_to_markdown(extraction, context)
    else:
        return generic_to_markdown(extraction, context)


# Load manuscript context for element processing
def get_context_for_element(output_dir: Path, element_num: int, element_type: str = "figure"):
    """Get manuscript context for a specific element."""
    context_file = output_dir / "manuscript_context.json"
    if not context_file.exists():
        return None

    with open(context_file) as f:
        manuscript_context = json.load(f)

    key = "figures" if element_type == "figure" else "tables"
    return manuscript_context.get(key, {}).get(str(element_num))


def table_to_markdown(ext: dict, context: dict = None) -> str:
    md = []
    meta = ext.get("table_metadata", {})

    md.append(f"## {meta.get('title', 'Table')}")
    md.append(f"\n**Type:** table ({meta.get('table_type', 'standard')})")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")

    # Add manuscript context if available
    if context:
        if context.get("caption"):
            md.append(f"\n> **Table Caption (from manuscript):** {context['caption']}")
        if context.get("contexts"):
            md.append("\n### Manuscript Context\n")
            for ctx in context["contexts"][:2]:
                md.append(f"> {ctx[:400]}...")

    if meta.get("caption"):
        md.append(f"\n> **Caption (from image):** {meta['caption']}")

    md.append("\n### Data\n")
    md.append(ext.get("markdown_table", ""))

    if meta.get("footnotes"):
        md.append("\n### Footnotes\n")
        for fn in meta["footnotes"]:
            md.append(f"- {fn}")

    return "\n".join(md)


def chart_to_markdown(ext: dict, context: dict = None) -> str:
    md = []
    meta = ext.get("chart_metadata", {})

    md.append(f"# {meta.get('title', 'Chart')}")
    md.append(f"\n**Type:** {ext.get('chart_type', 'Chart')}")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")

    # Add manuscript context if available
    if context:
        if context.get("caption"):
            md.append(f"\n> **Caption:** {context['caption']}")
        if context.get("contexts"):
            md.append("\n### Manuscript Context\n")
            for ctx in context["contexts"][:2]:  # Limit to 2 most relevant
                md.append(f"> {ctx[:500]}...")  # Truncate long contexts

    axes = ext.get("axes", {})
    md.append("\n## Axes\n")
    if axes.get("x"):
        md.append(f"- **X-axis:** {axes['x'].get('label', 'unlabeled')}")
        md.append(f"  - Range: {axes['x'].get('min')} to {axes['x'].get('max')}")
        if axes['x'].get('ticks'):
            md.append(f"  - Ticks: {axes['x']['ticks']}")
    if axes.get("y"):
        md.append(f"- **Y-axis:** {axes['y'].get('label', 'unlabeled')}")
        md.append(f"  - Range: {axes['y'].get('min')} to {axes['y'].get('max')}")
        if axes['y'].get('ticks'):
            md.append(f"  - Ticks: {axes['y']['ticks']}")

    legend = ext.get("legend", {})
    if legend.get("entries"):
        md.append("\n## Legend (Verbatim)\n")
        for entry in legend["entries"]:
            style = entry.get("line_style") or entry.get("style", "")
            color = entry.get("color", "")
            md.append(f"- **\"{entry['label']}\"**: {color} {style}")

    # Data series details (for Kaplan-Meier etc.)
    series = ext.get("data_series", [])
    if series:
        md.append("\n## Data Series\n")
        for s in series:
            md.append(f"### {s.get('name', 'Series')}")
            if s.get("data_points"):
                md.append("Key data points:")
                for pt in s["data_points"][:10]:  # First 10 points
                    md.append(f"  - x={pt.get('x')}, y={pt.get('y')}")

    stats = ext.get("statistical_annotations", [])
    if stats:
        md.append("\n## Statistical Annotations\n")
        for stat in stats:
            if stat.get("type") == "hazard_ratio":
                md.append(f"- Hazard Ratio: {stat.get('value')} (95% CI: {stat.get('ci_lower')}-{stat.get('ci_upper')})")
            elif stat.get("type") == "p_value":
                md.append(f"- {stat.get('test', 'P-value')}: {stat.get('value')}")
            else:
                md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")

    risk = ext.get("risk_table", {})
    if risk.get("present"):
        md.append("\n## Risk Table (Number at Risk)\n")
        headers = risk.get("headers", [])
        md.append("| " + " | ".join(headers) + " |")
        md.append("| " + " | ".join(["---"] * len(headers)) + " |")
        for row in risk.get("rows", []):
            values = [row.get("group", "")] + row.get("values", [])
            md.append("| " + " | ".join(values) + " |")

    return "\n".join(md)


def multipanel_to_markdown(ext: dict, context: dict = None) -> str:
    """Convert multi-panel figure extraction to detailed markdown."""
    md = []
    meta = ext.get("figure_metadata", {})

    md.append(f"## {meta.get('title', 'Multi-Panel Figure')}")
    md.append(f"\n**Type:** multi_panel ({meta.get('total_panels', '?')} panels)")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")
    md.append(f"**Layout:** {meta.get('layout', 'unknown')}")

    # Add manuscript context if available
    if context:
        if context.get("caption"):
            md.append(f"\n> **Figure Caption (from manuscript):** {context['caption']}")
        if context.get("contexts"):
            md.append("\n### Manuscript Context\n")
            md.append("*How this figure is described in the paper:*\n")
            for ctx in context["contexts"][:2]:
                md.append(f"> ...{ctx[:500]}...\n")

    # Process each panel in detail
    panels = ext.get("panels", [])
    for panel in panels:
        panel_id = panel.get("panel_id", "?")
        panel_type = panel.get("panel_type", "unknown")
        panel_title = panel.get("panel_title", f"Panel {panel_id}")

        md.append(f"\n### Panel {panel_id}: {panel_title}")
        md.append(f"**Type:** {panel_type}")

        extraction = panel.get("extraction", {})

        # Axes
        axes = extraction.get("axes", {})
        if axes:
            md.append("\n**Axes:**")
            if axes.get("x"):
                x = axes["x"]
                md.append(f"- X-axis: \"{x.get('label', 'unlabeled')}\"")
                md.append(f"  - Range: {x.get('min')} to {x.get('max')}")
                if x.get("ticks"):
                    md.append(f"  - Tick values: {x['ticks']}")
            if axes.get("y"):
                y = axes["y"]
                md.append(f"- Y-axis: \"{y.get('label', 'unlabeled')}\"")
                md.append(f"  - Range: {y.get('min')} to {y.get('max')}")
                if y.get("ticks"):
                    md.append(f"  - Tick values: {y['ticks']}")

        # Legend (VERBATIM)
        legend = extraction.get("legend", {})
        if legend.get("entries"):
            md.append("\n**Legend (Verbatim from image):**")
            if legend.get("title"):
                md.append(f"*{legend['title']}*")
            for entry in legend["entries"]:
                label = entry.get("label", "")
                color = entry.get("color", "")
                style = entry.get("line_style") or entry.get("style", "")
                md.append(f"- \"{label}\" — {color} {style}")

        # Curve endpoints (for Kaplan-Meier)
        endpoints = extraction.get("curve_endpoints", [])
        if endpoints:
            md.append("\n**Curve Endpoints:**")
            for ep in endpoints:
                md.append(f"- {ep.get('series', 'Series')}: y={ep.get('final_y')} at x={ep.get('final_x')}")
                if ep.get("note"):
                    md.append(f"  - Note: {ep['note']}")

        # Key observations
        observations = extraction.get("key_observations", [])
        if observations:
            md.append("\n**Key Observations:**")
            for obs in observations:
                md.append(f"- {obs}")

        # Statistical annotations
        stats = extraction.get("statistical_annotations", [])
        if stats:
            md.append("\n**Statistical Annotations:**")
            for stat in stats:
                md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")

        # All visible text
        all_text = extraction.get("all_visible_text", [])
        if all_text:
            md.append("\n**All Visible Text:**")
            md.append(f"```\n{', '.join(all_text)}\n```")

    # Shared elements
    shared = ext.get("shared_elements", {})
    if shared.get("shared_legend") or shared.get("cross_references"):
        md.append("\n### Shared Elements")
        if shared.get("shared_legend"):
            md.append(f"- Shared legend applies to panels: {shared['shared_legend'].get('applies_to', [])}")
        if shared.get("cross_references"):
            for ref in shared["cross_references"]:
                md.append(f"- {ref}")

    return "\n".join(md)
```
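
`heatmap_to_markdown`, `diagram_to_markdown`, and `generic_to_markdown` are dispatched above but not shown. A minimal fallback following the same pattern might look like this; the `metadata`, `description`, and `all_visible_text` field names are assumptions, so match them to your extractor schemas:

```python
def generic_to_markdown(ext: dict, context: dict = None) -> str:
    """Fallback converter for element types without a dedicated formatter."""
    md = []
    meta = ext.get("metadata", {})  # field name assumed, not from the extractor spec

    md.append(f"## {meta.get('title', 'Figure')}")
    md.append(f"\n**Type:** {ext.get('extraction_type', 'unknown')}")
    md.append(f"**Source:** Page {meta.get('source_page', '?')}")

    if context and context.get("caption"):
        md.append(f"\n> **Caption (from manuscript):** {context['caption']}")

    if ext.get("description"):
        md.append(f"\n{ext['description']}")

    all_text = ext.get("all_visible_text", [])
    if all_text:
        md.append("\n**All Visible Text:**")
        md.append(", ".join(all_text))

    return "\n".join(md)
```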

## Step 8: Generate Combined STRUCTURED.md with Manuscript Context

```python
import json
from pathlib import Path
from datetime import datetime

output_dir = Path("<output_dir>")
elements_dir = output_dir / "elements"
extractions_dir = output_dir / "extractions"
doc_name = "<document_name>"

# Load manuscript context
context_file = output_dir / "manuscript_context.json"
manuscript_context = {}
if context_file.exists():
    with open(context_file) as f:
        manuscript_context = json.load(f)

# Load captions
captions_file = output_dir / "captions.json"
captions = {"figures": {}, "tables": {}}
if captions_file.exists():
    with open(captions_file) as f:
        captions = json.load(f)

# Read all element files in order
element_files = sorted(elements_dir.glob("element_*.md"))

sections = []
sections.append(f"# {doc_name} - Structured Extraction")
sections.append(f"\n**Original:** {doc_name}")
sections.append(f"**Extracted:** {datetime.now().isoformat()}")
sections.append(f"**Elements:** {len(element_files)} visual elements processed")
sections.append("**Pipeline:** structurecc v2.0 (3-phase with verification + manuscript context)")
sections.append("\n---\n")

# Table of contents
sections.append("## Table of Contents\n")
for i, elem_file in enumerate(element_files, 1):
    elem_id = elem_file.stem
    # Try to get title from extraction
    extract_file = extractions_dir / f"{elem_id}.json"
    title = f"Element {i}"
    if extract_file.exists():
        with open(extract_file) as f:
            ext = json.load(f)
        title = ext.get("chart_metadata", {}).get("title") or \
                ext.get("table_metadata", {}).get("title") or \
                ext.get("figure_metadata", {}).get("title") or \
                f"Element {i}"
    sections.append(f"{i}. [{title}](#{elem_id})")
sections.append("\n---\n")

# Add each element with context
for i, elem_file in enumerate(element_files, 1):
    elem_id = elem_file.stem

    with open(elem_file) as f:
        content = f.read()

    sections.append(f'<a id="{elem_id}"></a>\n')
    sections.append(f"## Element {i}: {elem_id}\n")

    # Try to match with manuscript context
    # Heuristic: Figure 1 = first figure element, etc.
    fig_num = str(i)
    if fig_num in manuscript_context.get("figures", {}):
        ctx = manuscript_context["figures"][fig_num]
        if ctx.get("caption"):
            sections.append(f"\n> **Figure Caption:** {ctx['caption']}\n")
        if ctx.get("contexts"):
            sections.append("\n### Manuscript Context\n")
            sections.append("*Relevant text from the manuscript:*\n")
            for c in ctx["contexts"][:2]:
                # Truncate and clean
                clean_ctx = ' '.join(c.split())[:600]
                sections.append(f"> ...{clean_ctx}...\n")

    sections.append(content)
    sections.append("\n---\n")

# Add manuscript summary section
if manuscript_context.get("figures") or manuscript_context.get("tables"):
    sections.append("## Manuscript References Summary\n")
    sections.append("### Figure Captions from Manuscript\n")
    for fig_num, ctx in manuscript_context.get("figures", {}).items():
        sections.append(f"- **Figure {fig_num}:** {ctx.get('caption', 'No caption found')}")
    sections.append("\n### Table Captions from Manuscript\n")
    for table_num, ctx in manuscript_context.get("tables", {}).items():
        sections.append(f"- **Table {table_num}:** {ctx.get('caption', 'No caption found')}")
    sections.append("\n---\n")

# Write combined file
with open(output_dir / "STRUCTURED.md", "w") as f:
    f.write("\n".join(sections))

print(f"Generated STRUCTURED.md with {len(element_files)} elements and manuscript context")
```

## Step 9: Generate Quality Report

```python
import json
from pathlib import Path
from datetime import datetime

output_dir = Path("<output_dir>")
verifications_dir = output_dir / "verifications"

report = {
    "document": "<document_name>",
    "timestamp": datetime.now().isoformat(),
    "pipeline_version": "2.0.0",
    "elements_total": 0,
    "elements_passed": 0,
    "elements_revised": 0,
    "elements_human_review": 0,
    "average_quality_score": 0.0,
    "element_details": []
}

scores = []
for verify_file in sorted(verifications_dir.glob("*_verify.json")):
    with open(verify_file) as f:
        result = json.load(f)

    report["elements_total"] += 1

    detail = {
        "element_id": result["element_id"],
        "scores": result["scores"],
        "status": "passed" if result["pass"] else "failed",
        "issues_count": len(result.get("issues", []))
    }

    if result.get("needs_human_review"):
        detail["status"] = "human_review"
        report["elements_human_review"] += 1
    elif result["pass"]:
        report["elements_passed"] += 1

    if result.get("revision_feedback", {}).get("revision_number", 0) > 0:
        report["elements_revised"] += 1

    scores.append(result["scores"]["overall"])
    report["element_details"].append(detail)

report["average_quality_score"] = sum(scores) / len(scores) if scores else 0

# Write report
with open(output_dir / "extraction_report.json", "w") as f:
    json.dump(report, f, indent=2)

print("\nQuality Report:")
print(f"  Total elements: {report['elements_total']}")
print(f"  Passed: {report['elements_passed']}")
print(f"  Revised: {report['elements_revised']}")
print(f"  Human review: {report['elements_human_review']}")
print(f"  Average score: {report['average_quality_score']:.2f}")
```

## Step 10: Display Results

```
╔═══════════════════════════════════════════════════════════════╗
║                      EXTRACTION COMPLETE                      ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  Document: [name]                                             ║
║  Output:   [path]_extracted/                                  ║
║  Pipeline: structurecc v2.0 (3-phase with verification)       ║
║                                                               ║
║  QUALITY SUMMARY                                              ║
║  ──────────────────────────────────────────────               ║
║  Total elements:   [N]                                        ║
║  Passed (≥0.90):   [N] ✓                                      ║
║  Revised:          [N] ↻                                      ║
║  Human review:     [N] ⚠                                      ║
║  Average score:    [0.XX]                                     ║
║                                                               ║
║  FILES                                                        ║
║  ──────────────────────────────────────────────               ║
║  images/                 [N] extracted images                 ║
║  classifications/        [N] type classifications             ║
║  extractions/            [N] JSON extractions                 ║
║  verifications/          [N] quality verifications            ║
║  elements/               [N] markdown files                   ║
║  STRUCTURED.md           Combined output                      ║
║  extraction_report.json  Quality metrics                      ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝
```

Then open: `open "<output_dir>/STRUCTURED.md"`

## Dependencies

Install PyMuPDF if it is not present:

```bash
pip3 install PyMuPDF --quiet
```

## Tips

- Use the opus model for the best extraction quality on complex visuals
- Classification uses haiku for speed (it's just triage)
- Verification uses sonnet for a good balance of speed and quality
- Each phase runs in parallel for speed
- Max 2 revision attempts before human review
- Check `extraction_report.json` for quality metrics
- Individual JSON extractions are preserved for programmatic use

## Troubleshooting

**Low quality scores:**
- Check the source image quality
- Complex tables may need human review
- Handwritten text is challenging

**Revision loop stuck:**
- After 2 revisions, the element goes to human review
- Check `verifications/` for specific issues

**Missing elements:**
- Some PDFs render text as images - check the image count
- Very small images may be logos/icons (expected)
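
A size check during Step 2C can screen those small images out before classification. The 100 px threshold is an assumption, not from the pipeline spec; tune it per document:

```python
MIN_DIMENSION = 100  # px; assumed threshold - logos and icons are usually smaller

def is_substantive(width: int, height: int) -> bool:
    """Heuristic: treat very small images as logos/icons and skip them."""
    return width >= MIN_DIMENSION and height >= MIN_DIMENSION
```

In the PyMuPDF extraction loop, check `pix.width` and `pix.height` with this before saving and adding the image to the manifest.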