@nexus-cortex/server 4.26.0
- package/.cortex/agents/AGENT_PROFILE_GUIDE.md +307 -0
- package/.cortex/agents/README.md +268 -0
- package/.cortex/agents/a-frontend-landing-page-designer.md +41 -0
- package/.cortex/agents/autoresearch-agent.md +49 -0
- package/.cortex/agents/code-reviewer.md +63 -0
- package/.cortex/agents/context-research.md +26 -0
- package/.cortex/agents/doc-writer.md +92 -0
- package/.cortex/agents/explore.md +63 -0
- package/.cortex/agents/new-model-api-integrator-analyst.md +41 -0
- package/.cortex/agents/plan.md +109 -0
- package/.cortex/agents/pr-architecture-reviewer.md +77 -0
- package/.cortex/agents/pr-code-quality.md +78 -0
- package/.cortex/agents/pr-implementer.md +50 -0
- package/.cortex/agents/pr-security-auditor.md +62 -0
- package/.cortex/agents/pr-test-writer.md +67 -0
- package/.cortex/agents/refactor.md +118 -0
- package/.cortex/agents/test-writer.md +72 -0
- package/.cortex/agents/web-researcher.md +72 -0
- package/.cortex/bench/tasks/sample-tasks.json +20 -0
- package/.cortex/commands/compare.md +14 -0
- package/.cortex/commands/deps.md +16 -0
- package/.cortex/commands/diff.md +14 -0
- package/.cortex/commands/explain.md +16 -0
- package/.cortex/commands/find-bug.md +13 -0
- package/.cortex/commands/profile.md +15 -0
- package/.cortex/commands/review.md +18 -0
- package/.cortex/commands/search.md +16 -0
- package/.cortex/commands/test.md +15 -0
- package/.cortex/permissions.dev.json +20 -0
- package/.cortex/permissions.example.json +71 -0
- package/.cortex/permissions.prod.json +63 -0
- package/.cortex/permissions.test.json +19 -0
- package/.cortex/skills/autoresearch/SKILL.md +77 -0
- package/.cortex/skills/autoresearch/personas/README.md +45 -0
- package/.cortex/skills/autoresearch/personas/aggressive-refactor.md +25 -0
- package/.cortex/skills/autoresearch/personas/creative.md +29 -0
- package/.cortex/skills/autoresearch/personas/perf-hunter.md +27 -0
- package/.cortex/skills/autoresearch/personas/precise.md +23 -0
- package/.cortex/skills/autoresearch/personas/root-cause.md +26 -0
- package/.cortex/skills/autoresearch/personas/security-auditor.md +29 -0
- package/.cortex/skills/autoresearch/personas/skeptic-reviewer.md +31 -0
- package/.cortex/skills/autoresearch/personas/test-first.md +25 -0
- package/.cortex/skills/best-of-n/SKILL.md +76 -0
- package/.cortex/skills/cortex/SKILL.md +834 -0
- package/.cortex/skills/cortex-bench/SKILL.md +354 -0
- package/.cortex/skills/docx/SKILL.md +83 -0
- package/.cortex/skills/pdf-documents/SKILL.md +297 -0
- package/.cortex/skills/pdf-documents/sections/01-image-acquisition.md +132 -0
- package/.cortex/skills/pdf-documents/sections/02-ai-image-generation.md +274 -0
- package/.cortex/skills/pdf-documents/sections/03-paper-sizes.md +89 -0
- package/.cortex/skills/pdf-documents/sections/04-design-system.md +549 -0
- package/.cortex/skills/pdf-documents/sections/05-css-print-rules.md +135 -0
- package/.cortex/skills/pdf-documents/sections/06-svg-charts.md +100 -0
- package/.cortex/skills/pdf-documents/sections/07-templates.md +224 -0
- package/.cortex/skills/pdf-documents/sections/08-scaled-output.md +164 -0
- package/.cortex/skills/pdf-documents/sections/09-preview-qa.md +66 -0
- package/.cortex/skills/pdf-documents/sections/10-reading-pdfs.md +499 -0
- package/.cortex/skills/pdf-documents/sections/11-form-filling.md +241 -0
- package/.cortex/skills/pptx/SKILL.md +90 -0
- package/.cortex/skills/resume-analyst/SKILL.md +373 -0
- package/.cortex/skills/verify-work/SKILL.md +74 -0
- package/.cortex/skills/xlsx/SKILL.md +101 -0
- package/.cortex/system-messages/messages/WORK_QUALITY.md +159 -0
- package/.cortex/system-messages/registry.json +18 -0
- package/LICENSE +202 -0
- package/NOTICE +2 -0
- package/README.md +13 -0
- package/bin/cortex-daemon.js +47 -0
- package/bin/cortex-server.js +15 -0
- package/dist/index.d.ts +30 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +513 -0
- package/dist/index.js.map +1 -0
- package/dist/middleware/cors.d.ts +10 -0
- package/dist/middleware/cors.d.ts.map +1 -0
- package/dist/middleware/cors.js +11 -0
- package/dist/middleware/cors.js.map +1 -0
- package/dist/middleware/errorHandler.d.ts +10 -0
- package/dist/middleware/errorHandler.d.ts.map +1 -0
- package/dist/middleware/errorHandler.js +15 -0
- package/dist/middleware/errorHandler.js.map +1 -0
- package/dist/routes/approval.d.ts +2 -0
- package/dist/routes/approval.d.ts.map +1 -0
- package/dist/routes/approval.js +96 -0
- package/dist/routes/approval.js.map +1 -0
- package/dist/routes/config.d.ts +2 -0
- package/dist/routes/config.d.ts.map +1 -0
- package/dist/routes/config.js +70 -0
- package/dist/routes/config.js.map +1 -0
- package/dist/routes/health.d.ts +2 -0
- package/dist/routes/health.d.ts.map +1 -0
- package/dist/routes/health.js +1031 -0
- package/dist/routes/health.js.map +1 -0
- package/dist/routes/mcp.d.ts +2 -0
- package/dist/routes/mcp.d.ts.map +1 -0
- package/dist/routes/mcp.js +251 -0
- package/dist/routes/mcp.js.map +1 -0
- package/dist/routes/messages.d.ts +5 -0
- package/dist/routes/messages.d.ts.map +1 -0
- package/dist/routes/messages.js +136 -0
- package/dist/routes/messages.js.map +1 -0
- package/dist/routes/middleware.d.ts +2 -0
- package/dist/routes/middleware.d.ts.map +1 -0
- package/dist/routes/middleware.js +146 -0
- package/dist/routes/middleware.js.map +1 -0
- package/dist/routes/models.d.ts +2 -0
- package/dist/routes/models.d.ts.map +1 -0
- package/dist/routes/models.js +29 -0
- package/dist/routes/models.js.map +1 -0
- package/dist/routes/permissions.d.ts +2 -0
- package/dist/routes/permissions.d.ts.map +1 -0
- package/dist/routes/permissions.js +253 -0
- package/dist/routes/permissions.js.map +1 -0
- package/dist/routes/pr.d.ts +2 -0
- package/dist/routes/pr.d.ts.map +1 -0
- package/dist/routes/pr.js +222 -0
- package/dist/routes/pr.js.map +1 -0
- package/dist/routes/sessions.d.ts +2 -0
- package/dist/routes/sessions.d.ts.map +1 -0
- package/dist/routes/sessions.js +628 -0
- package/dist/routes/sessions.js.map +1 -0
- package/dist/routes/system-messages.d.ts +2 -0
- package/dist/routes/system-messages.d.ts.map +1 -0
- package/dist/routes/system-messages.js +146 -0
- package/dist/routes/system-messages.js.map +1 -0
- package/dist/routes/tools.d.ts +2 -0
- package/dist/routes/tools.d.ts.map +1 -0
- package/dist/routes/tools.js +79 -0
- package/dist/routes/tools.js.map +1 -0
- package/package.json +63 -0
|
@@ -0,0 +1,499 @@
# Reading and Deconstructing PDFs

## The Problem

Agents currently read PDFs via OCR or text extraction, which destroys:
- Table structure (becomes ambiguous whitespace)
- Diagram/chart data (becomes nothing or alt text)
- Heading hierarchy (inferred from font size, unreliable)
- Image-text association (captions detach from figures)
- Links (become plain text)

## Runtime Context: Choose Your Path

| Runtime | Best Tool | How |
|---|---|---|
| **Claude Code** | Built-in Read tool | `Read({ file_path: "doc.pdf", pages: "1-10" })`, up to 20 pages per request |
| **Agent with browser-MCP access** | nexus-browser-sandbox | `run_command` with `pdf2html` utility or `pdf2htmlEX` (both pre-installed) |
| **Server-side / CLI** | PyMuPDF directly | `pip install pymupdf` then Python script |
| **Browser-only (no sandbox)** | Vision model on screenshots | Browse PDF, screenshot pages, send to VLM |
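
The 20-page limit on the Read tool means a longer PDF has to be read in several requests. Here is a minimal sketch of how the page ranges could be generated, assuming the `pages` argument accepts `"start-end"` strings as in the table. `page_batches` is a hypothetical helper, not part of any tool.

```python
def page_batches(total_pages, batch_size=20):
    """Yield 1-based "start-end" page ranges that together cover total_pages.

    batch_size defaults to 20, the per-request limit quoted in the table above.
    """
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield f"{start}-{end}"


if __name__ == "__main__":
    # A 45-page document needs three Read requests.
    print(list(page_batches(45)))
```

Each yielded string can be passed as the `pages` value of one Read call.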

---

## Nexus Browser Sandbox (for agents with browser-MCP access)

The `nexus-browser-sandbox` container has **PyMuPDF**, **pdf2htmlEX**, and a pre-built extraction utility installed. Any agent with nexus-browser MCP access can use them via `run_command` or `run_code`, with no setup and no `pip install`.

### Container Additions (Dockerfile)

```dockerfile
# ── PDF Reading Toolkit ──────────────────────────────────────
# PyMuPDF: structured HTML extraction, image extraction, text
RUN pip install --no-cache-dir pymupdf beautifulsoup4

# pdf2htmlEX: pixel-perfect visual fidelity conversion
RUN apt-get update && apt-get install -y --no-install-recommends \
    pdf2htmlex \
    && rm -rf /var/lib/apt/lists/*

# Pre-built extraction utility
COPY pdf2html /usr/local/bin/pdf2html
RUN chmod +x /usr/local/bin/pdf2html
```

### Extraction Utility: `/usr/local/bin/pdf2html`

A Python script pre-installed in the container. It takes a PDF path and writes structured HTML plus a JSON manifest to stdout. Agents call it via `run_command` and get agent-ready output with zero boilerplate.

```python
#!/usr/bin/env python3
"""pdf2html: Extract structured HTML + manifest from a PDF.

Usage:
    pdf2html input.pdf                    # structured HTML (agent consumption)
    pdf2html input.pdf --manifest         # JSON manifest only (outline, tables, images)
    pdf2html input.pdf --images /out/dir  # also extract images as PNG files
    pdf2html input.pdf --lossless         # delegates to pdf2htmlEX (pixel-perfect)
    pdf2html input.pdf --text             # plain text only (cheapest)
    pdf2html input.pdf --pages 1-5        # specific page range
"""

import sys
import os
import json
import argparse


def extract_structured(pdf_path, pages=None, extract_images_dir=None):
    """PyMuPDF structured extraction: HTML + manifest."""
    import fitz
    doc = fitz.open(pdf_path)

    # pages is a 1-based inclusive (start, end) tuple; page_range is 0-based
    page_range = range(len(doc))
    if pages:
        start, end = pages
        page_range = range(max(0, start - 1), min(len(doc), end))

    html_parts = []
    manifest = {
        "source": os.path.basename(pdf_path),
        "pages": len(doc),
        # 1-based, consistent with every other page number in the manifest
        "extracted_pages": [i + 1 for i in page_range],
        "images": [],
        "tables": [],
        "headings": [],
    }

    for i in page_range:
        page = doc[i]
        page_num = i + 1

        html_parts.append(f'<div class="page" data-page="{page_num}">')
        html_parts.append(page.get_text('html'))
        html_parts.append('</div>')

        # Extract images
        images = page.get_images(full=True)
        for img_idx, img_info in enumerate(images):
            img_id = f"img-p{page_num}-{img_idx}"
            manifest["images"].append({
                "id": img_id,
                "page": page_num,
                "xref": img_info[0],
                "width": img_info[2],
                "height": img_info[3],
            })

            if extract_images_dir:
                xref = img_info[0]
                pix = fitz.Pixmap(doc, xref)
                # CMYK (4 colour components, with or without alpha) cannot be
                # written as PNG, so convert to RGB first
                if pix.n - pix.alpha >= 4:
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                out_path = os.path.join(extract_images_dir, f"{img_id}.png")
                pix.save(out_path)
                manifest["images"][-1]["file"] = out_path

        # Extract text blocks for headings (heuristic: large font = heading)
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if "lines" not in block:  # image blocks carry no text lines
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    size = span["size"]
                    text = span["text"].strip()
                    if text and size >= 14:
                        level = 1 if size >= 22 else 2 if size >= 16 else 3
                        manifest["headings"].append({
                            "level": level,
                            "text": text,
                            "page": page_num,
                            "font_size": round(size, 1),
                        })

        # Detect tables (PyMuPDF table detection)
        try:
            tables = page.find_tables()
            for t_idx, table in enumerate(tables.tables):
                t_id = f"table-p{page_num}-{t_idx}"
                extracted = table.extract()
                headers = extracted[0] if extracted else []
                manifest["tables"].append({
                    "id": t_id,
                    "page": page_num,
                    "headers": [h for h in headers if h],
                    "rows": len(extracted),  # includes the header row
                    "cols": len(headers),
                })
        except Exception:
            pass  # table detection not available in older PyMuPDF

    full_html = f"""<!DOCTYPE html>
<html><head><meta charset="UTF-8"><title>{manifest['source']}</title>
<style>
  .page {{ border-bottom: 2px solid #ccc; padding: 20px 0; margin-bottom: 20px; }}
  .page::before {{ content: "Page " attr(data-page); font-size: 10pt; color: #888;
                   display: block; margin-bottom: 10px; }}
  img {{ max-width: 100%; height: auto; }}
  table {{ border-collapse: collapse; margin: 10px 0; }}
  td, th {{ border: 1px solid #ddd; padding: 4px 8px; }}
</style>
</head><body>{''.join(html_parts)}</body></html>"""

    return full_html, manifest


def extract_text_only(pdf_path, pages=None):
    """Cheapest extraction: plain text with page markers."""
    import fitz
    doc = fitz.open(pdf_path)
    page_range = range(len(doc))
    if pages:
        start, end = pages
        page_range = range(max(0, start - 1), min(len(doc), end))

    parts = []
    for i in page_range:
        parts.append(f"\n--- PAGE {i+1} ---\n")
        parts.append(doc[i].get_text("text"))
    return ''.join(parts)


def run_pdf2htmlex(pdf_path):
    """Delegate to pdf2htmlEX for pixel-perfect conversion."""
    import subprocess
    out_dir = "/tmp/pdf2htmlex_out"
    os.makedirs(out_dir, exist_ok=True)
    basename = os.path.splitext(os.path.basename(pdf_path))[0]
    result = subprocess.run(
        ["pdf2htmlEX", "--zoom", "1.5", "--process-outline", "1",
         "--dest-dir", out_dir, pdf_path],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"pdf2htmlEX error: {result.stderr}", file=sys.stderr)
        sys.exit(1)

    out_file = os.path.join(out_dir, f"{basename}.html")
    if os.path.exists(out_file):
        with open(out_file, encoding="utf-8") as f:
            print(f.read())
    else:
        print(f"Output not found at {out_file}", file=sys.stderr)
        sys.exit(1)


def parse_pages(pages_str):
    """Parse '1-5' or '3' into a 1-based inclusive (start, end) tuple."""
    if '-' in pages_str:
        parts = pages_str.split('-')
        return (int(parts[0]), int(parts[1]))
    else:
        n = int(pages_str)
        return (n, n)


def main():
    parser = argparse.ArgumentParser(description="Extract structured HTML from PDF")
    parser.add_argument("pdf", help="Path to PDF file")
    parser.add_argument("--manifest", action="store_true", help="Output JSON manifest only")
    parser.add_argument("--images", metavar="DIR", help="Extract images to directory")
    parser.add_argument("--lossless", action="store_true", help="Use pdf2htmlEX (pixel-perfect)")
    parser.add_argument("--text", action="store_true", help="Plain text only (cheapest)")
    parser.add_argument("--pages", metavar="RANGE", help="Page range, e.g. 1-5 or 3")
    args = parser.parse_args()

    if not os.path.exists(args.pdf):
        print(f"File not found: {args.pdf}", file=sys.stderr)
        sys.exit(1)

    pages = parse_pages(args.pages) if args.pages else None

    if args.lossless:
        run_pdf2htmlex(args.pdf)
        return

    if args.text:
        print(extract_text_only(args.pdf, pages))
        return

    if args.images:
        os.makedirs(args.images, exist_ok=True)

    html, manifest = extract_structured(args.pdf, pages, args.images)

    if args.manifest:
        print(json.dumps(manifest, indent=2))
    else:
        # Print manifest as HTML comment at top, then full HTML
        print(f"<!-- MANIFEST\n{json.dumps(manifest, indent=2)}\n-->")
        print(html)


if __name__ == "__main__":
    main()
```
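
The heading detection in the script is a font-size heuristic, so it helps to be able to test and tune the thresholds in isolation. The sketch below mirrors the cut-offs used in the script (14, 16 and 22 pt). `heading_level` is a hypothetical name and is not part of the utility.

```python
def heading_level(font_size):
    """Map a text span's font size in points to a heading level.

    Returns 1, 2 or 3 for heading-sized text and None for body text,
    using the same thresholds as the pdf2html script.
    """
    if font_size < 14:
        return None  # treated as body text
    if font_size >= 22:
        return 1
    if font_size >= 16:
        return 2
    return 3  # 14 <= size < 16


if __name__ == "__main__":
    for size in (10, 14, 16, 22):
        print(size, heading_level(size))
```

Because the rule looks only at size, a document set entirely in 14 pt or larger type would have every span reported as a heading. In that case the thresholds probably need to be raised, or derived from the document's most common font size.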

### Agent Workflow via nexus-browser MCP

```bash
# ── Structured extraction (default, best for agent consumption) ──
run_command({ command: "pdf2html /tmp/document.pdf" })
# Returns: HTML comment with JSON manifest + structured HTML body

# ── Manifest only (compact: just the outline, tables, image refs) ──
run_command({ command: "pdf2html /tmp/document.pdf --manifest" })
# Returns: JSON with pages, headings, table summaries, image locations

# ── Plain text only (cheapest tokens) ──
run_command({ command: "pdf2html /tmp/document.pdf --text" })
# Returns: raw text with page markers

# ── Extract images to files ──
run_command({ command: "pdf2html /tmp/document.pdf --images /tmp/pdf-images/" })
# Returns: HTML + manifest with image file paths

# ── Pixel-perfect lossless conversion (pdf2htmlEX) ──
run_command({ command: "pdf2html /tmp/document.pdf --lossless" })
# Returns: pixel-perfect HTML (absolute-positioned divs, visually exact)

# ── Specific pages only ──
run_command({ command: "pdf2html /tmp/document.pdf --pages 1-5 --manifest" })
```
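
In the default mode the utility prints the manifest as a leading HTML comment followed by the HTML document. An agent-side helper can split the two so that the manifest is inspected before any HTML is read. This is a minimal sketch that assumes `run_command` returns the script's stdout unchanged. `split_manifest` is a hypothetical helper.

```python
import json

MANIFEST_OPEN = "<!-- MANIFEST\n"
MANIFEST_CLOSE = "\n-->"


def split_manifest(output):
    """Split pdf2html's default output into (manifest_dict, html_string)."""
    if not output.startswith(MANIFEST_OPEN):
        raise ValueError("output does not start with a MANIFEST comment")
    # json.dumps escapes newlines inside strings, so the first "\n-->" after
    # the opening marker is the end of the comment.
    end = output.find(MANIFEST_CLOSE, len(MANIFEST_OPEN))
    if end == -1:
        raise ValueError("MANIFEST comment is not terminated")
    manifest = json.loads(output[len(MANIFEST_OPEN):end])
    html = output[end + len(MANIFEST_CLOSE):].lstrip("\n")
    return manifest, html


if __name__ == "__main__":
    sample = (
        MANIFEST_OPEN
        + json.dumps({"source": "report.pdf", "pages": 2}, indent=2)
        + MANIFEST_CLOSE
        + "\n<!DOCTYPE html>\n<html><body></body></html>\n"
    )
    manifest, html = split_manifest(sample)
    print(manifest["pages"], html.splitlines()[0])
```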

**Getting the PDF into the sandbox:**
```bash
# Option 1: Download from URL
run_command({ command: "curl -sL -o /tmp/document.pdf 'https://example.com/report.pdf'" })

# Option 2: Browse to PDF URL (Chromium downloads it)
browse({ url: "https://example.com/report.pdf" })
# Then find it in the browser's download directory

# Option 3: If PDF is already in VFS, copy via the agent's file tools
# (depends on sandbox ↔ VFS bridge)
```
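
`curl -sL` exits successfully even when the server answers with an HTML error page, because curl only fails on HTTP errors when `-f` is passed. A cheap sanity check before extraction is therefore useful. The sketch below looks for the `%PDF-` header near the start of the file. `looks_like_pdf` is a hypothetical helper, and the 1024-byte window is an assumption chosen to tolerate files with a little leading junk.

```python
def looks_like_pdf(path, window=1024):
    """Return True if the PDF header marker appears in the first `window` bytes."""
    with open(path, "rb") as f:
        head = f.read(window)
    return b"%PDF-" in head


if __name__ == "__main__":
    import os
    import tempfile

    # Write a tiny fake file carrying only the header, just to show the check.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(b"%PDF-1.7\n")
    print(looks_like_pdf(tmp.name))
    os.remove(tmp.name)
```

The check only confirms that the file claims to be a PDF. A truncated or corrupt download would still pass and would fail later when PyMuPDF opens it.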

### When to Use --lossless (pdf2htmlEX)

The `--lossless` flag delegates to pdf2htmlEX instead of PyMuPDF. Use it when:
- User asks for **exact visual reproduction** of the PDF in a browser
- The PDF has complex **multi-column layouts** that PyMuPDF misinterprets
- The output will be **displayed to a human**, not parsed by an agent
- You need to **embed the PDF in a web page** with exact fidelity

Do NOT use `--lossless` for agent consumption: the output consists of absolutely positioned `div` elements that carry no semantic structure and are very hard to parse. Use the default PyMuPDF path for anything an agent needs to understand. As the script is written, `--lossless` also converts the whole document and ignores `--pages`.

---

## Direct PyMuPDF Usage (CLI / Server-Side)

When you have Python available (Claude Code bash, server-side scripts, CI), use PyMuPDF directly:

```bash
pip install pymupdf
python3 -c "
import fitz
doc = fitz.open('input.pdf')
html_parts = []
for i, page in enumerate(doc):
    html_parts.append(f'<div class=\"page\" data-page=\"{i+1}\">')
    html_parts.append(page.get_text('html'))
    html_parts.append('</div>')
with open('output.html', 'w', encoding='utf-8') as f:
    f.write('<html><body>' + ''.join(html_parts) + '</body></html>')
"
```

## Deconstructing PDF-to-HTML for Agent Consumption

Once you have HTML from any extraction method, deconstruct it into an indexed structure:

### Step 1: Build a Media Index

```python
from bs4 import BeautifulSoup
import json

with open('converted.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')


def page_of(tag):
    """Return the data-page value of the enclosing page div, or '?' if there is none."""
    parent = tag.find_parent(attrs={"data-page": True})
    return parent.get("data-page", "?") if parent else "?"


media_index = {
    "images": [],
    "tables": [],
    "headings": [],
    "links": [],
    "figures": [],
}

for i, img in enumerate(soup.find_all('img')):
    media_index["images"].append({
        "id": f"img-{i}",
        "src": img.get('src', '')[:100],  # truncate long data URIs
        "alt": img.get('alt', ''),
        "page": page_of(img),
        "width": img.get('width'),
        "height": img.get('height'),
    })

for i, table in enumerate(soup.find_all('table')):
    rows = table.find_all('tr')
    # Treat the first row as the header row, whether it uses <th> or <td>
    headers = [th.get_text(strip=True) for th in rows[0].find_all(['th', 'td'])] if rows else []
    media_index["tables"].append({
        "id": f"table-{i}",
        "headers": headers,
        "row_count": len(rows),
        "page": page_of(table),
    })

for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
    media_index["headings"].append({
        "level": int(h.name[1]),
        "text": h.get_text(strip=True),
    })

for a in soup.find_all('a', href=True):
    media_index["links"].append({
        "href": a['href'],
        "text": a.get_text(strip=True),
        "page": page_of(a),
    })

for i, fig in enumerate(soup.find_all('figure')):
    caption = fig.find('figcaption')
    media_index["figures"].append({
        "id": f"fig-{i}",
        "caption": caption.get_text(strip=True) if caption else "",
        "page": page_of(fig),
    })

print(json.dumps(media_index, indent=2))
```

### Step 2: Extract Text as Structured Markdown

```python
from bs4 import BeautifulSoup


def html_to_agent_markdown(html_path):
    """Convert extracted HTML to Markdown with page markers.

    Relies on a table_to_markdown(table_element) helper that is not
    defined in this snippet.
    """
    with open(html_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    sections = []
    for page_div in soup.find_all('div', class_='page'):
        page_num = page_div.get('data-page', '?')
        sections.append(f"\n<!-- PAGE {page_num} -->\n")

        # Only direct children of the page div are inspected
        for elem in page_div.children:
            if elem.name in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                level = int(elem.name[1])
                sections.append(f"{'#' * level} {elem.get_text(strip=True)}\n")
            elif elem.name == 'table':
                sections.append(table_to_markdown(elem))
            elif elem.name == 'p':
                text = elem.get_text(strip=True)
                if text:
                    sections.append(f"{text}\n")
            elif elem.name == 'img':
                alt = elem.get('alt', 'image')
                sections.append(f"[IMAGE: {alt}]\n")
            elif elem.name == 'figure':
                caption = elem.find('figcaption')
                sections.append(f"[FIGURE: {caption.get_text(strip=True) if caption else 'untitled'}]\n")

    return '\n'.join(sections)
```
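
The Step 2 function calls `table_to_markdown`, which the snippet does not define. One possible shape for it is sketched below, as an assumption rather than the original helper. It treats the first row as the header and escapes pipe characters inside cells. `rows_to_markdown` is a hypothetical intermediate function added so that the formatting can be tested without BeautifulSoup.

```python
def rows_to_markdown(rows):
    """Render rows (lists of cell strings) as a Markdown pipe table.

    The first row is used as the header. Short rows are padded with empty cells.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    if width == 0:
        return ""

    def fmt(row):
        # Escape pipes and flatten newlines so each row stays on one line
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "|" + "---|" * width]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines) + "\n"


def table_to_markdown(table):
    """Convert a BeautifulSoup <table> element to a Markdown table.

    Only find_all() and get_text() are used, so no bs4 import is needed here.
    """
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return rows_to_markdown(rows)


if __name__ == "__main__":
    print(rows_to_markdown([["Quarter", "Revenue"], ["Q1", "10"], ["Q2"]]))
```

Merged cells (`rowspan`, `colspan`) and nested tables are not handled. For tables like those, the HTML itself or a dedicated table extractor is likely a better input for the agent.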

### Step 3: Agent Consumption Format

The final output for agent ingestion should be:

```
DOCUMENT_MANIFEST:
  source: input.pdf
  pages: 13
  images: 4 (indexed as img-0 through img-3)
  tables: 3 (indexed as table-0 through table-2)
  headings: 12 (see outline below)

OUTLINE:
  1. Introduction
    1.1 Background
    1.2 Methodology
  2. Findings
    2.1 Data Analysis (contains table-0, table-1)
    2.2 Visual Results (contains img-1, img-2)
  3. Conclusions

TABLE_SUMMARIES:
  table-0 (page 4): Revenue by Quarter | 4 cols | 12 rows
  table-1 (page 5): Cost Breakdown | 6 cols | 8 rows
  table-2 (page 9): Comparison Matrix | 5 cols | 5 rows

IMAGE_DESCRIPTIONS:
  img-0 (page 1): [cover photo], requires vision model for description
  img-1 (page 6): [bar chart], requires vision model for data extraction
  img-2 (page 7): [photograph], requires vision model for description
  img-3 (page 11): [diagram], requires vision model for description

TEXT_CONTENT:
  [structured markdown follows...]
```
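
The header sections of this format can be produced mechanically from a manifest dict of the shape emitted by `pdf2html --manifest`. The following is a rough sketch under several assumptions. `render_manifest` is a hypothetical function. The manifest has no table titles, so the joined header cells stand in for them. The ids come out as the manifest's own (`img-p1-0` style) rather than the `img-0` style shown in the example.

```python
def render_manifest(manifest):
    """Render DOCUMENT_MANIFEST, OUTLINE and TABLE_SUMMARIES from a manifest dict."""
    images = manifest.get("images", [])
    tables = manifest.get("tables", [])
    headings = manifest.get("headings", [])

    def indexed(items):
        # Produces e.g. "3 (indexed as table-p4-0 through table-p9-0)"
        if not items:
            return "0"
        if len(items) == 1:
            return f"1 (indexed as {items[0]['id']})"
        return f"{len(items)} (indexed as {items[0]['id']} through {items[-1]['id']})"

    lines = [
        "DOCUMENT_MANIFEST:",
        f"  source: {manifest['source']}",
        f"  pages: {manifest['pages']}",
        f"  images: {indexed(images)}",
        f"  tables: {indexed(tables)}",
        f"  headings: {len(headings)} (see outline below)",
        "",
        "OUTLINE:",
    ]
    for h in headings:
        # Indent two spaces per heading level
        lines.append("  " * h["level"] + h["text"])

    lines += ["", "TABLE_SUMMARIES:"]
    for t in tables:
        title = ", ".join(t["headers"]) or "(no headers)"
        lines.append(
            f"  {t['id']} (page {t['page']}): {title} | {t['cols']} cols | {t['rows']} rows"
        )
    return "\n".join(lines)
```

The `IMAGE_DESCRIPTIONS` and `TEXT_CONTENT` sections are left out because they depend on a vision model and on the Markdown conversion from Step 2.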

## Handling Charts and Diagrams in PDFs

Charts in PDFs are stored as either:
1. **Vector paths** (lines, curves, fills), which can be extracted as SVG but lose their data labels
2. **Rasterized images**, which require a vision model to extract data

**Recommended approach for charts:**
```
1. Extract chart region as image (PyMuPDF page.get_pixmap(clip=rect))
2. Send image to vision model: "Extract the data from this chart as a markdown table"
3. Include both: image reference (for visual context) + extracted data table (for agent reasoning)
```

```python
import fitz

doc = fitz.open('input.pdf')
page = doc[5]  # zero-based index, so this is page 6 (the page with the chart)

# Chart bounds in PDF points: (x0, y0, x1, y1)
chart_rect = fitz.Rect(50, 200, 500, 450)
pix = page.get_pixmap(clip=chart_rect, dpi=300)
pix.save('chart-page6.png')
```
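
Since vision models are the expensive path, it can help to estimate how large the rendered crop will be before rendering it. PDF coordinates are in points (72 per inch), so the pixel size follows from the clip rectangle and the DPI. This is a back-of-the-envelope sketch. `clip_pixel_size` is a hypothetical helper, and PyMuPDF's own rounding may differ by a pixel.

```python
def clip_pixel_size(x0, y0, x1, y1, dpi=300):
    """Approximate (width, height) in pixels of a clip rectangle given in PDF points.

    PDF user space has 72 points per inch, so pixels = points * dpi / 72.
    """
    scale = dpi / 72
    return round((x1 - x0) * scale), round((y1 - y0) * scale)


if __name__ == "__main__":
    # The 450 x 250 pt rectangle from the example above, rendered at 300 dpi
    print(clip_pixel_size(50, 200, 500, 450))
```

If the result is larger than the vision model needs, lowering `dpi` in `get_pixmap` is the simplest adjustment.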

## Tool Selection Guide

| Tool | Text | Tables | Images | Layout | Speed | Available In |
|------|------|--------|--------|--------|-------|-------------|
| **pdf2html utility** | Excellent | Good | Extract as PNG | Positioned HTML | Fast | nexus-browser-sandbox |
| **pdf2htmlEX** | Excellent | Visual only | Embeds | **Pixel-perfect** | Medium | nexus-browser-sandbox |
| **PyMuPDF (fitz)** | Excellent | Good (HTML mode) | Extract as PNG | Positioned HTML | Fast | sandbox, CLI, server |
| **Claude Code Read** | Good | Basic | No | Text only | Instant | Claude Code |
| **Tabula** | N/A | **Best** (structured) | N/A | N/A | Medium | CLI (Java) |
| **Camelot** | N/A | Excellent | N/A | N/A | Medium | CLI (Python) |
| **Vision model** | Good | Good | **Best** (understanding) | Visual | Expensive | Any agent |

**Decision tree:**
```
Agent with browser-MCP access?
  → run_command("pdf2html doc.pdf")             # structured extraction
  → run_command("pdf2html doc.pdf --lossless")  # visual fidelity

Agent in Claude Code?
  → Read tool for text (up to 20 pages)
  → Bash + PyMuPDF for structured HTML

Need tables specifically?
  → Tabula or Camelot (CLI only)

Need chart/diagram data?
  → Extract image region + vision model

User wants exact visual reproduction?
  → pdf2htmlEX via --lossless flag
```
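
The decision tree can also be written as a small lookup function, which makes the precedence between its branches explicit. The ordering below is one reading of the tree, not something the guide specifies. The `runtime` and `need` labels are illustrative strings, and `pick_tool` is a hypothetical name.

```python
def pick_tool(runtime, need="structure"):
    """Suggest a PDF tool for a runtime and a need, following the decision tree.

    runtime: "browser-mcp", "claude-code", "cli" or "browser-only"
    need:    "structure", "text", "tables", "charts" or "visual"
    """
    # Assumption: chart data always goes through a vision model
    if need == "charts":
        return "extract image region + vision model"
    # Assumption: Claude Code counts as a CLI context because it has Bash
    if need == "tables" and runtime in ("cli", "claude-code"):
        return "Tabula or Camelot"
    if runtime == "browser-mcp":
        return "pdf2html --lossless" if need == "visual" else "pdf2html"
    if runtime == "claude-code":
        return "Read tool" if need == "text" else "Bash + PyMuPDF"
    if runtime == "cli":
        return "PyMuPDF"
    # Browser-only runtimes have no sandbox to run extraction tools in
    return "vision model on screenshots"


if __name__ == "__main__":
    print(pick_tool("browser-mcp"))
    print(pick_tool("claude-code", "text"))
```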