@nexus-cortex/server 4.26.0

Files changed (130)
  1. package/.cortex/agents/AGENT_PROFILE_GUIDE.md +307 -0
  2. package/.cortex/agents/README.md +268 -0
  3. package/.cortex/agents/a-frontend-landing-page-designer.md +41 -0
  4. package/.cortex/agents/autoresearch-agent.md +49 -0
  5. package/.cortex/agents/code-reviewer.md +63 -0
  6. package/.cortex/agents/context-research.md +26 -0
  7. package/.cortex/agents/doc-writer.md +92 -0
  8. package/.cortex/agents/explore.md +63 -0
  9. package/.cortex/agents/new-model-api-integrator-analyst.md +41 -0
  10. package/.cortex/agents/plan.md +109 -0
  11. package/.cortex/agents/pr-architecture-reviewer.md +77 -0
  12. package/.cortex/agents/pr-code-quality.md +78 -0
  13. package/.cortex/agents/pr-implementer.md +50 -0
  14. package/.cortex/agents/pr-security-auditor.md +62 -0
  15. package/.cortex/agents/pr-test-writer.md +67 -0
  16. package/.cortex/agents/refactor.md +118 -0
  17. package/.cortex/agents/test-writer.md +72 -0
  18. package/.cortex/agents/web-researcher.md +72 -0
  19. package/.cortex/bench/tasks/sample-tasks.json +20 -0
  20. package/.cortex/commands/compare.md +14 -0
  21. package/.cortex/commands/deps.md +16 -0
  22. package/.cortex/commands/diff.md +14 -0
  23. package/.cortex/commands/explain.md +16 -0
  24. package/.cortex/commands/find-bug.md +13 -0
  25. package/.cortex/commands/profile.md +15 -0
  26. package/.cortex/commands/review.md +18 -0
  27. package/.cortex/commands/search.md +16 -0
  28. package/.cortex/commands/test.md +15 -0
  29. package/.cortex/permissions.dev.json +20 -0
  30. package/.cortex/permissions.example.json +71 -0
  31. package/.cortex/permissions.prod.json +63 -0
  32. package/.cortex/permissions.test.json +19 -0
  33. package/.cortex/skills/autoresearch/SKILL.md +77 -0
  34. package/.cortex/skills/autoresearch/personas/README.md +45 -0
  35. package/.cortex/skills/autoresearch/personas/aggressive-refactor.md +25 -0
  36. package/.cortex/skills/autoresearch/personas/creative.md +29 -0
  37. package/.cortex/skills/autoresearch/personas/perf-hunter.md +27 -0
  38. package/.cortex/skills/autoresearch/personas/precise.md +23 -0
  39. package/.cortex/skills/autoresearch/personas/root-cause.md +26 -0
  40. package/.cortex/skills/autoresearch/personas/security-auditor.md +29 -0
  41. package/.cortex/skills/autoresearch/personas/skeptic-reviewer.md +31 -0
  42. package/.cortex/skills/autoresearch/personas/test-first.md +25 -0
  43. package/.cortex/skills/best-of-n/SKILL.md +76 -0
  44. package/.cortex/skills/cortex/SKILL.md +834 -0
  45. package/.cortex/skills/cortex-bench/SKILL.md +354 -0
  46. package/.cortex/skills/docx/SKILL.md +83 -0
  47. package/.cortex/skills/pdf-documents/SKILL.md +297 -0
  48. package/.cortex/skills/pdf-documents/sections/01-image-acquisition.md +132 -0
  49. package/.cortex/skills/pdf-documents/sections/02-ai-image-generation.md +274 -0
  50. package/.cortex/skills/pdf-documents/sections/03-paper-sizes.md +89 -0
  51. package/.cortex/skills/pdf-documents/sections/04-design-system.md +549 -0
  52. package/.cortex/skills/pdf-documents/sections/05-css-print-rules.md +135 -0
  53. package/.cortex/skills/pdf-documents/sections/06-svg-charts.md +100 -0
  54. package/.cortex/skills/pdf-documents/sections/07-templates.md +224 -0
  55. package/.cortex/skills/pdf-documents/sections/08-scaled-output.md +164 -0
  56. package/.cortex/skills/pdf-documents/sections/09-preview-qa.md +66 -0
  57. package/.cortex/skills/pdf-documents/sections/10-reading-pdfs.md +499 -0
  58. package/.cortex/skills/pdf-documents/sections/11-form-filling.md +241 -0
  59. package/.cortex/skills/pptx/SKILL.md +90 -0
  60. package/.cortex/skills/resume-analyst/SKILL.md +373 -0
  61. package/.cortex/skills/verify-work/SKILL.md +74 -0
  62. package/.cortex/skills/xlsx/SKILL.md +101 -0
  63. package/.cortex/system-messages/messages/WORK_QUALITY.md +159 -0
  64. package/.cortex/system-messages/registry.json +18 -0
  65. package/LICENSE +202 -0
  66. package/NOTICE +2 -0
  67. package/README.md +13 -0
  68. package/bin/cortex-daemon.js +47 -0
  69. package/bin/cortex-server.js +15 -0
  70. package/dist/index.d.ts +30 -0
  71. package/dist/index.d.ts.map +1 -0
  72. package/dist/index.js +513 -0
  73. package/dist/index.js.map +1 -0
  74. package/dist/middleware/cors.d.ts +10 -0
  75. package/dist/middleware/cors.d.ts.map +1 -0
  76. package/dist/middleware/cors.js +11 -0
  77. package/dist/middleware/cors.js.map +1 -0
  78. package/dist/middleware/errorHandler.d.ts +10 -0
  79. package/dist/middleware/errorHandler.d.ts.map +1 -0
  80. package/dist/middleware/errorHandler.js +15 -0
  81. package/dist/middleware/errorHandler.js.map +1 -0
  82. package/dist/routes/approval.d.ts +2 -0
  83. package/dist/routes/approval.d.ts.map +1 -0
  84. package/dist/routes/approval.js +96 -0
  85. package/dist/routes/approval.js.map +1 -0
  86. package/dist/routes/config.d.ts +2 -0
  87. package/dist/routes/config.d.ts.map +1 -0
  88. package/dist/routes/config.js +70 -0
  89. package/dist/routes/config.js.map +1 -0
  90. package/dist/routes/health.d.ts +2 -0
  91. package/dist/routes/health.d.ts.map +1 -0
  92. package/dist/routes/health.js +1031 -0
  93. package/dist/routes/health.js.map +1 -0
  94. package/dist/routes/mcp.d.ts +2 -0
  95. package/dist/routes/mcp.d.ts.map +1 -0
  96. package/dist/routes/mcp.js +251 -0
  97. package/dist/routes/mcp.js.map +1 -0
  98. package/dist/routes/messages.d.ts +5 -0
  99. package/dist/routes/messages.d.ts.map +1 -0
  100. package/dist/routes/messages.js +136 -0
  101. package/dist/routes/messages.js.map +1 -0
  102. package/dist/routes/middleware.d.ts +2 -0
  103. package/dist/routes/middleware.d.ts.map +1 -0
  104. package/dist/routes/middleware.js +146 -0
  105. package/dist/routes/middleware.js.map +1 -0
  106. package/dist/routes/models.d.ts +2 -0
  107. package/dist/routes/models.d.ts.map +1 -0
  108. package/dist/routes/models.js +29 -0
  109. package/dist/routes/models.js.map +1 -0
  110. package/dist/routes/permissions.d.ts +2 -0
  111. package/dist/routes/permissions.d.ts.map +1 -0
  112. package/dist/routes/permissions.js +253 -0
  113. package/dist/routes/permissions.js.map +1 -0
  114. package/dist/routes/pr.d.ts +2 -0
  115. package/dist/routes/pr.d.ts.map +1 -0
  116. package/dist/routes/pr.js +222 -0
  117. package/dist/routes/pr.js.map +1 -0
  118. package/dist/routes/sessions.d.ts +2 -0
  119. package/dist/routes/sessions.d.ts.map +1 -0
  120. package/dist/routes/sessions.js +628 -0
  121. package/dist/routes/sessions.js.map +1 -0
  122. package/dist/routes/system-messages.d.ts +2 -0
  123. package/dist/routes/system-messages.d.ts.map +1 -0
  124. package/dist/routes/system-messages.js +146 -0
  125. package/dist/routes/system-messages.js.map +1 -0
  126. package/dist/routes/tools.d.ts +2 -0
  127. package/dist/routes/tools.d.ts.map +1 -0
  128. package/dist/routes/tools.js +79 -0
  129. package/dist/routes/tools.js.map +1 -0
  130. package/package.json +63 -0
@@ -0,0 +1,499 @@
1
+ # Reading and Deconstructing PDFs
2
+
3
+ ## The Problem
4
+
5
+ Agents currently read PDFs via OCR or text extraction, which destroys:
6
+ - Table structure (becomes ambiguous whitespace)
7
+ - Diagram/chart data (becomes nothing or alt text)
8
+ - Heading hierarchy (inferred from font size, unreliable)
9
+ - Image-text association (captions detach from figures)
10
+ - Links (become plain text)
11
+
12
+ ## Runtime Context — Choose Your Path
13
+
14
+ | Runtime | Best Tool | How |
15
+ |---|---|---|
16
+ | **Claude Code** | Built-in Read tool | `Read({ file_path: "doc.pdf", pages: "1-10" })` — up to 20 pages/request |
17
+ | **Agent with browser-MCP access** | nexus-browser-sandbox | `run_command` with `pdf2html` utility or `pdf2htmlEX` (both pre-installed) |
18
+ | **Server-side / CLI** | PyMuPDF directly | `pip install pymupdf` then Python script |
19
+ | **Browser-only (no sandbox)** | Vision model on screenshots | Browse PDF, screenshot pages, send to VLM |
20
+
21
+ ---
22
+
23
+ ## Nexus Browser Sandbox (for agents with browser-MCP access)
24
+
25
+ The `nexus-browser-sandbox` container has **PyMuPDF**, **pdf2htmlEX**, and a pre-built extraction utility installed. Any agent with nexus-browser MCP access can use them via `run_command` or `run_code` — no setup, no pip install.
26
+
27
+ ### Container Additions (Dockerfile)
28
+
29
+ ```dockerfile
30
+ # ── PDF Reading Toolkit ──────────────────────────────────────
31
+ # PyMuPDF — structured HTML extraction, image extraction, text
32
+ RUN pip install --no-cache-dir pymupdf beautifulsoup4
33
+
34
+ # pdf2htmlEX — pixel-perfect visual fidelity conversion
35
+ RUN apt-get update && apt-get install -y --no-install-recommends \
36
+ pdf2htmlex \
37
+ && rm -rf /var/lib/apt/lists/*
38
+
39
+ # Pre-built extraction utility
40
+ COPY pdf2html /usr/local/bin/pdf2html
41
+ RUN chmod +x /usr/local/bin/pdf2html
42
+ ```
43
+
44
+ ### Extraction Utility: `/usr/local/bin/pdf2html`
45
+
46
+ A Python script pre-installed in the container. Takes a PDF path, outputs structured HTML + a JSON manifest to stdout. Agents call it via `run_command` and get agent-ready output with zero boilerplate.
47
+
48
+ ```python
49
+ #!/usr/bin/env python3
50
+ """pdf2html — Extract structured HTML + manifest from a PDF.
51
+
52
+ Usage:
53
+ pdf2html input.pdf # structured HTML (agent consumption)
54
+ pdf2html input.pdf --manifest # JSON manifest only (outline, tables, images)
55
+ pdf2html input.pdf --images /out/dir # also extract images as PNG files
56
+ pdf2html input.pdf --lossless # delegates to pdf2htmlEX (pixel-perfect)
57
+ pdf2html input.pdf --text # plain text only (cheapest)
58
+ pdf2html input.pdf --pages 1-5 # specific page range
59
+ """
60
+
61
+ import sys
62
+ import os
63
+ import json
64
+ import argparse
65
+
66
+ def extract_structured(pdf_path, pages=None, extract_images_dir=None):
+     """PyMuPDF structured extraction — HTML + manifest."""
+     import fitz
+     doc = fitz.open(pdf_path)
+
+     page_range = range(len(doc))
+     if pages:
+         start, end = pages
+         page_range = range(max(0, start - 1), min(len(doc), end))
+
+     html_parts = []
+     manifest = {
+         "source": os.path.basename(pdf_path),
+         "pages": len(doc),
+         # 1-based, consistent with the "page" fields below
+         "extracted_pages": [i + 1 for i in page_range],
+         "images": [],
+         "tables": [],
+         "headings": [],
+     }
+
+     for i in page_range:
+         page = doc[i]
+         page_num = i + 1
+
+         html_parts.append(f'<div class="page" data-page="{page_num}">')
+         html_parts.append(page.get_text('html'))
+         html_parts.append('</div>')
+
+         # Extract images
+         images = page.get_images(full=True)
+         for img_idx, img_info in enumerate(images):
+             img_id = f"img-p{page_num}-{img_idx}"
+             manifest["images"].append({
+                 "id": img_id,
+                 "page": page_num,
+                 "xref": img_info[0],
+                 "width": img_info[2],
+                 "height": img_info[3],
+             })
+
+             if extract_images_dir:
+                 xref = img_info[0]
+                 pix = fitz.Pixmap(doc, xref)
+                 # CMYK (with or without alpha) must be converted to RGB before saving as PNG
+                 if pix.n - pix.alpha >= 4:
+                     pix = fitz.Pixmap(fitz.csRGB, pix)
+                 out_path = os.path.join(extract_images_dir, f"{img_id}.png")
+                 pix.save(out_path)
+                 manifest["images"][-1]["file"] = out_path
+
+         # Extract text blocks for headings (heuristic: large font = heading)
+         blocks = page.get_text("dict")["blocks"]
+         for block in blocks:
+             if "lines" not in block:
+                 continue  # image block, no text
+             for line in block["lines"]:
+                 for span in line["spans"]:
+                     size = span["size"]
+                     text = span["text"].strip()
+                     if text and size >= 14:
+                         level = 1 if size >= 22 else 2 if size >= 16 else 3
+                         manifest["headings"].append({
+                             "level": level,
+                             "text": text,
+                             "page": page_num,
+                             "font_size": round(size, 1),
+                         })
+
+         # Detect tables (PyMuPDF table detection)
+         try:
+             tables = page.find_tables().tables
+             for t_idx, table in enumerate(tables):
+                 t_id = f"table-p{page_num}-{t_idx}"
+                 extracted = table.extract()
+                 headers = extracted[0] if extracted else []
+                 manifest["tables"].append({
+                     "id": t_id,
+                     "page": page_num,
+                     "headers": [h for h in headers if h],
+                     "rows": len(extracted),  # includes the header row
+                     "cols": len(headers),
+                 })
+         except Exception:
+             pass  # find_tables() is not available in older PyMuPDF (added in 1.23.0)
+
+     full_html = f"""<!DOCTYPE html>
+ <html><head><meta charset="UTF-8"><title>{manifest['source']}</title>
+ <style>
+ .page {{ border-bottom: 2px solid #ccc; padding: 20px 0; margin-bottom: 20px; }}
+ .page::before {{ content: "Page " attr(data-page); font-size: 10pt; color: #888;
+ display: block; margin-bottom: 10px; }}
+ img {{ max-width: 100%; height: auto; }}
+ table {{ border-collapse: collapse; margin: 10px 0; }}
+ td, th {{ border: 1px solid #ddd; padding: 4px 8px; }}
+ </style>
+ </head><body>{''.join(html_parts)}</body></html>"""
+
+     return full_html, manifest
163
+
164
+
165
+ def extract_text_only(pdf_path, pages=None):
+     """Cheapest extraction — plain text."""
+     import fitz
+     doc = fitz.open(pdf_path)
+     page_range = range(len(doc))
+     if pages:
+         start, end = pages
+         page_range = range(max(0, start - 1), min(len(doc), end))
+
+     parts = []
+     for i in page_range:
+         parts.append(f"\n--- PAGE {i+1} ---\n")
+         parts.append(doc[i].get_text("text"))
+     return ''.join(parts)
179
+
180
+
181
+ def run_pdf2htmlex(pdf_path):
+     """Delegate to pdf2htmlEX for pixel-perfect conversion."""
+     import subprocess
+     out_dir = "/tmp/pdf2htmlex_out"
+     os.makedirs(out_dir, exist_ok=True)
+     basename = os.path.splitext(os.path.basename(pdf_path))[0]
+     result = subprocess.run(
+         ["pdf2htmlEX", "--zoom", "1.5", "--process-outline", "1",
+          "--dest-dir", out_dir, pdf_path],
+         capture_output=True, text=True
+     )
+     if result.returncode != 0:
+         print(f"pdf2htmlEX error: {result.stderr}", file=sys.stderr)
+         sys.exit(1)
+
+     out_file = os.path.join(out_dir, f"{basename}.html")
+     if os.path.exists(out_file):
+         # Read explicitly as UTF-8 so the result does not depend on the container locale
+         with open(out_file, encoding="utf-8") as f:
+             print(f.read())
+     else:
+         print(f"Output not found at {out_file}", file=sys.stderr)
+         sys.exit(1)
203
+
204
+
205
+ def parse_pages(pages_str):
+     """Parse '1-5' or '3' into (start, end) tuple."""
+     if '-' in pages_str:
+         parts = pages_str.split('-')
+         return (int(parts[0]), int(parts[1]))
+     else:
+         n = int(pages_str)
+         return (n, n)
213
+
214
+
215
+ def main():
+     parser = argparse.ArgumentParser(description="Extract structured HTML from PDF")
+     parser.add_argument("pdf", help="Path to PDF file")
+     parser.add_argument("--manifest", action="store_true", help="Output JSON manifest only")
+     parser.add_argument("--images", metavar="DIR", help="Extract images to directory")
+     parser.add_argument("--lossless", action="store_true", help="Use pdf2htmlEX (pixel-perfect)")
+     parser.add_argument("--text", action="store_true", help="Plain text only (cheapest)")
+     parser.add_argument("--pages", metavar="RANGE", help="Page range, e.g. 1-5 or 3")
+     args = parser.parse_args()
+
+     if not os.path.exists(args.pdf):
+         print(f"File not found: {args.pdf}", file=sys.stderr)
+         sys.exit(1)
+
+     pages = parse_pages(args.pages) if args.pages else None
+
+     if args.lossless:
+         run_pdf2htmlex(args.pdf)
+         return
+
+     if args.text:
+         print(extract_text_only(args.pdf, pages))
+         return
+
+     if args.images:
+         os.makedirs(args.images, exist_ok=True)
+
+     html, manifest = extract_structured(args.pdf, pages, args.images)
+
+     if args.manifest:
+         print(json.dumps(manifest, indent=2))
+     else:
+         # Print manifest as HTML comment at top, then full HTML
+         print(f"<!-- MANIFEST\n{json.dumps(manifest, indent=2)}\n-->")
+         print(html)
+
+
+ if __name__ == "__main__":
+     main()
254
+ ```
255
+
256
+ ### Agent Workflow via nexus-browser MCP
257
+
258
+ ```bash
259
+ # ── Structured extraction (default — best for agent consumption) ──
260
+ run_command({ command: "pdf2html /tmp/document.pdf" })
261
+ # Returns: HTML comment with JSON manifest + structured HTML body
262
+
263
+ # ── Manifest only (smallest structured output: just the outline, tables, image refs) ──
264
+ run_command({ command: "pdf2html /tmp/document.pdf --manifest" })
265
+ # Returns: JSON with pages, headings, table summaries, image locations
266
+
267
+ # ── Plain text only (cheapest tokens) ──
268
+ run_command({ command: "pdf2html /tmp/document.pdf --text" })
269
+ # Returns: raw text with page markers
270
+
271
+ # ── Extract images to files ──
272
+ run_command({ command: "pdf2html /tmp/document.pdf --images /tmp/pdf-images/" })
273
+ # Returns: HTML + manifest with image file paths
274
+
275
+ # ── Pixel-perfect lossless conversion (pdf2htmlEX) ──
276
+ run_command({ command: "pdf2html /tmp/document.pdf --lossless" })
277
+ # Returns: pixel-perfect HTML (absolute-positioned divs, visually exact)
278
+
279
+ # ── Specific pages only ──
280
+ run_command({ command: "pdf2html /tmp/document.pdf --pages 1-5 --manifest" })
281
+ ```
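The default mode returns the manifest as a leading HTML comment followed by the HTML document. A minimal sketch of how a caller might separate the two is shown below; `split_manifest` is a hypothetical helper name, not part of the utility, and it assumes the output format printed by `main()` above (`<!-- MANIFEST`, the indented JSON, then `-->`).

```python
import json
import re

# Matches the leading "<!-- MANIFEST ... -->" comment printed by the default mode.
# json.dumps escapes newlines inside strings, so the first "\n-->" ends the comment.
_MANIFEST_RE = re.compile(r"\A\s*<!-- MANIFEST\n(.*?)\n-->\n?", re.DOTALL)


def split_manifest(output):
    """Split default pdf2html output into (manifest dict or None, HTML string)."""
    match = _MANIFEST_RE.match(output)
    if match is None:
        # No manifest comment (for example --text or --lossless output).
        return None, output
    manifest = json.loads(match.group(1))
    return manifest, output[match.end():]
```

Reading the manifest first lets an agent decide which pages or tables are worth loading before it spends tokens on the HTML body.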
282
+
283
+ **Getting the PDF into the sandbox:**
284
+ ```bash
285
+ # Option 1: Download from URL
286
+ run_command({ command: "curl -sL -o /tmp/document.pdf 'https://example.com/report.pdf'" })
287
+
288
+ # Option 2: Browse to PDF URL (Chromium downloads it)
289
+ browse({ url: "https://example.com/report.pdf" })
290
+ # Then find it in the browser's download directory
291
+
292
+ # Option 3: If PDF is already in VFS, copy via the agent's file tools
293
+ # (depends on sandbox ↔ VFS bridge)
294
+ ```
295
+
296
+ ### When to Use --lossless (pdf2htmlEX)
297
+
298
+ The `--lossless` flag delegates to pdf2htmlEX instead of PyMuPDF. Use it when:
299
+ - User asks for **exact visual reproduction** of the PDF in a browser
300
+ - The PDF has complex **multi-column layouts** that PyMuPDF misinterprets
301
+ - The output will be **displayed to a human**, not parsed by an agent
302
+ - You need to **embed the PDF in a web page** with exact fidelity
303
+
304
+ Do NOT use `--lossless` for agent consumption — the output is CSS absolute-positioned divs that are impossible to parse semantically. Use the default PyMuPDF path for anything an agent needs to understand.
305
+
306
+ ---
307
+
308
+ ## Direct PyMuPDF Usage (CLI / Server-Side)
309
+
310
+ When you have Python available (Claude Code bash, server-side scripts, CI), use PyMuPDF directly:
311
+
312
+ ```bash
313
+ pip install pymupdf
314
+ python3 -c "
+ import fitz
+ doc = fitz.open('input.pdf')
+ html_parts = []
+ for i, page in enumerate(doc):
+     html_parts.append(f'<div class=\"page\" data-page=\"{i+1}\">')
+     html_parts.append(page.get_text('html'))
+     html_parts.append('</div>')
+ with open('output.html', 'w', encoding='utf-8') as f:
+     f.write('<html><body>' + ''.join(html_parts) + '</body></html>')
+ "
325
+ ```
326
+
327
+ ## Deconstructing PDF-to-HTML for Agent Consumption
328
+
329
+ Once you have HTML from any extraction method, deconstruct it into an indexed structure:
330
+
331
+ ### Step 1: Build a Media Index
332
+
333
+ ```python
334
+ from bs4 import BeautifulSoup
+ import json
+
+ with open('converted.html', encoding='utf-8') as f:
+     soup = BeautifulSoup(f.read(), 'html.parser')
+
+ def page_of(elem):
+     """Return the data-page of the enclosing page div, or '?' if there is none."""
+     parent = elem.find_parent(attrs={"data-page": True})
+     return parent.get("data-page", "?") if parent else "?"
+
+ media_index = {
+     "images": [],
+     "tables": [],
+     "headings": [],
+     "links": [],
+     "figures": [],
+ }
+
+ for i, img in enumerate(soup.find_all('img')):
+     media_index["images"].append({
+         "id": f"img-{i}",
+         "src": img.get('src', '')[:100],  # truncate long (e.g. data:) URIs
+         "alt": img.get('alt', ''),
+         "page": page_of(img),
+         "width": img.get('width'),
+         "height": img.get('height'),
+     })
+
+ for i, table in enumerate(soup.find_all('table')):
+     rows = table.find_all('tr')
+     headers = [th.get_text(strip=True) for th in rows[0].find_all(['th', 'td'])] if rows else []
+     media_index["tables"].append({
+         "id": f"table-{i}",
+         "headers": headers,
+         "row_count": len(rows),
+         "page": page_of(table),
+     })
+
+ for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
+     media_index["headings"].append({
+         "level": int(h.name[1]),
+         "text": h.get_text(strip=True),
+     })
+
+ for a in soup.find_all('a', href=True):
+     media_index["links"].append({
+         "href": a['href'],
+         "text": a.get_text(strip=True),
+     })
+
+ for i, fig in enumerate(soup.find_all('figure')):
+     caption = fig.find('figcaption')
+     media_index["figures"].append({
+         "id": f"fig-{i}",
+         "caption": caption.get_text(strip=True) if caption else "",
+         "page": page_of(fig),
+     })
+
+ print(json.dumps(media_index, indent=2))
375
+ ```
376
+
377
+ ### Step 2: Extract Text as Structured Markdown
378
+
379
+ ```python
380
+ from bs4 import BeautifulSoup
+
+ BLOCK_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'table', 'img', 'figure']
+
+ def html_to_agent_markdown(html_path):
+     with open(html_path, encoding='utf-8') as f:
+         soup = BeautifulSoup(f.read(), 'html.parser')
+
+     # Number images document-wide so refs match the img-{i} ids from Step 1
+     img_ids = {id(img): i for i, img in enumerate(soup.find_all('img'))}
+
+     sections = []
+     for page_div in soup.find_all('div', class_='page'):
+         page_num = page_div.get('data-page', '?')
+         sections.append(f"\n<!-- PAGE {page_num} -->\n")
+
+         # Search descendants rather than direct children: extractors may
+         # nest page content inside wrapper elements
+         for elem in page_div.find_all(BLOCK_TAGS):
+             if elem.find_parent(['table', 'figure']):
+                 continue  # handled as part of the enclosing table or figure
+             if elem.name in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
+                 level = int(elem.name[1])
+                 sections.append(f"{'#' * level} {elem.get_text(strip=True)}\n")
+             elif elem.name == 'table':
+                 # table_to_markdown() must be defined elsewhere
+                 sections.append(table_to_markdown(elem))
+             elif elem.name == 'p':
+                 text = elem.get_text(strip=True)
+                 if text:
+                     sections.append(f"{text}\n")
+             elif elem.name == 'img':
+                 alt = elem.get('alt', 'image')
+                 sections.append(f"![{alt}](ref:img-{img_ids[id(elem)]})\n")
+             elif elem.name == 'figure':
+                 caption = elem.find('figcaption')
+                 sections.append(f"[FIGURE: {caption.get_text(strip=True) if caption else 'untitled'}]\n")
+
+     return '\n'.join(sections)
407
+ ```
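Step 2 calls `table_to_markdown`, which this section does not define. One possible implementation is sketched below, assuming the first row of the table is the header; `rows_to_markdown` is a hypothetical helper introduced here so that the formatting logic can be used without BeautifulSoup.

```python
def rows_to_markdown(rows):
    """Render rows (lists of cell strings) as a Markdown pipe table."""
    rows = [list(r) for r in rows if r]  # drop empty rows
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so each row stays on one line.
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(r) for r in rows[1:]]
    return "\n".join(lines) + "\n"


def table_to_markdown(table):
    """Convert a BeautifulSoup <table> Tag to Markdown (first row = header)."""
    rows = [
        [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return rows_to_markdown(rows)
```

Merged cells (`rowspan`/`colspan`) are not expanded in this sketch; such tables come out ragged and are padded with empty cells.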
408
+
409
+ ### Step 3: Agent Consumption Format
410
+
411
+ The final output for agent ingestion should be:
412
+
413
+ ```
414
+ DOCUMENT_MANIFEST:
415
+ source: input.pdf
416
+ pages: 13
417
+ images: 4 (indexed as img-0 through img-3)
418
+ tables: 3 (indexed as table-0 through table-2)
419
+ headings: 12 (see outline below)
420
+
421
+ OUTLINE:
422
+ 1. Introduction
423
+ 1.1 Background
424
+ 1.2 Methodology
425
+ 2. Findings
426
+ 2.1 Data Analysis (contains table-0, table-1)
427
+ 2.2 Visual Results (contains img-1, img-2)
428
+ 3. Conclusions
429
+
430
+ TABLE_SUMMARIES:
431
+ table-0 (page 4): Revenue by Quarter | 4 cols | 12 rows
432
+ table-1 (page 5): Cost Breakdown | 6 cols | 8 rows
433
+ table-2 (page 9): Comparison Matrix | 5 cols | 5 rows
434
+
435
+ IMAGE_DESCRIPTIONS:
436
+ img-0 (page 1): [cover photo] — requires vision model for description
437
+ img-1 (page 6): [bar chart] — requires vision model for data extraction
438
+ img-2 (page 7): [photograph] — requires vision model for description
439
+ img-3 (page 11): [diagram] — requires vision model for description
440
+
441
+ TEXT_CONTENT:
442
+ [structured markdown follows...]
443
+ ```
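As a rough illustration, the header sections of this format could be rendered from a manifest dictionary shaped like the one the `pdf2html` utility emits (`source`, `pages`, `images`, `tables`, `headings`). `render_manifest` is a hypothetical name, and the outline here is simplified: it indents by heading level and does not map tables or images to sections.

```python
def render_manifest(manifest):
    """Format a pdf2html-style manifest as DOCUMENT_MANIFEST text."""
    images = manifest.get("images", [])
    tables = manifest.get("tables", [])
    headings = manifest.get("headings", [])

    lines = [
        "DOCUMENT_MANIFEST:",
        f"source: {manifest.get('source', '?')}",
        f"pages: {manifest.get('pages', '?')}",
        f"images: {len(images)}",
        f"tables: {len(tables)}",
        f"headings: {len(headings)}",
        "",
        "OUTLINE:",
    ]
    for h in headings:
        indent = "  " * (h.get("level", 1) - 1)  # two spaces per nesting level
        lines.append(f"{indent}{h['text']} (page {h['page']})")

    lines += ["", "TABLE_SUMMARIES:"]
    for t in tables:
        title = " / ".join(t.get("headers", [])) or "untitled"
        lines.append(
            f"{t['id']} (page {t['page']}): {title} | {t['cols']} cols | {t['rows']} rows"
        )

    lines += ["", "IMAGE_DESCRIPTIONS:"]
    for img in images:
        lines.append(
            f"{img['id']} (page {img['page']}): {img['width']}x{img['height']} px, "
            "requires vision model for description"
        )
    return "\n".join(lines)
```

The `TEXT_CONTENT` section would then be appended from the Step 2 Markdown output.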
444
+
445
+ ## Handling Charts and Diagrams in PDFs
446
+
447
+ Charts in PDFs are stored as either:
448
+ 1. **Vector paths** (lines, curves, fills) — can be extracted as SVG but lose data labels
449
+ 2. **Rasterized images** — requires vision model to extract data
450
+
451
+ **Recommended approach for charts:**
452
+ ```
453
+ 1. Extract chart region as image (PyMuPDF page.get_pixmap(clip=rect))
454
+ 2. Send image to vision model: "Extract the data from this chart as a markdown table"
455
+ 3. Include both: image reference (for visual context) + extracted data table (for agent reasoning)
456
+ ```
457
+
458
+ ```python
459
+ import fitz
460
+
461
+ doc = fitz.open('input.pdf')
462
+ page = doc[5] # page with the chart
463
+
464
+ chart_rect = fitz.Rect(50, 200, 500, 450)
465
+ pix = page.get_pixmap(clip=chart_rect, dpi=300)
466
+ pix.save('chart-page6.png')
467
+ ```
468
+
469
+ ## Tool Selection Guide
470
+
471
+ | Tool | Text | Tables | Images | Layout | Speed | Available In |
472
+ |------|------|--------|--------|--------|-------|-------------|
473
+ | **pdf2html utility** | Excellent | Good | Extract as PNG | Positioned HTML | Fast | nexus-browser-sandbox |
474
+ | **pdf2htmlEX** | Excellent | Visual only | Embeds | **Pixel-perfect** | Medium | nexus-browser-sandbox |
475
+ | **PyMuPDF (fitz)** | Excellent | Good (`find_tables()`) | Extract as PNG | Positioned HTML | Fast | sandbox, CLI, server |
476
+ | **Claude Code Read** | Good | Basic | No | Text only | Instant | Claude Code |
477
+ | **Tabula** | N/A | **Best** (structured) | N/A | N/A | Medium | CLI (Java) |
478
+ | **Camelot** | N/A | Excellent | N/A | N/A | Medium | CLI (Python) |
479
+ | **Vision model** | Good | Good | **Best** (understanding) | Visual | Expensive | Any agent |
480
+
481
+ **Decision tree:**
482
+ ```
483
+ Agent with browser-MCP access?
484
+ → run_command("pdf2html doc.pdf") # structured extraction
485
+ → run_command("pdf2html doc.pdf --lossless") # visual fidelity
486
+
487
+ Agent in Claude Code?
488
+ → Read tool for text (up to 20 pages)
489
+ → Bash + PyMuPDF for structured HTML
490
+
491
+ Need tables specifically?
492
+ → Tabula or Camelot (CLI only)
493
+
494
+ Need chart/diagram data?
495
+ → Extract image region + vision model
496
+
497
+ User wants exact visual reproduction?
498
+ → pdf2htmlEX via --lossless flag
499
+ ```
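The decision tree can also be written as a small lookup function. This is only one reading of it: the `runtime` and `need` labels are hypothetical, and the order in which the branches are checked is an assumption rather than something the guide specifies.

```python
def choose_pdf_tool(runtime, need="structure"):
    """Suggest a tool.

    runtime: "browser-mcp", "claude-code", "cli", or "browser-only"
    need:    "structure", "text", "tables", "charts", or "visual"
    """
    if need == "charts":
        return "extract image region + vision model"
    if need == "visual":
        return "pdf2htmlEX (pdf2html --lossless)"
    if need == "tables" and runtime in ("claude-code", "cli"):
        return "Tabula or Camelot"  # CLI only
    if runtime == "browser-mcp":
        return "pdf2html"
    if runtime == "claude-code":
        return "Read tool" if need == "text" else "Bash + PyMuPDF"
    if runtime == "browser-only":
        return "vision model on screenshots"
    return "PyMuPDF"
```

For example, `choose_pdf_tool("claude-code", "text")` returns `"Read tool"`, while `choose_pdf_tool("browser-mcp")` returns `"pdf2html"`.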