@sylphx/pdf-reader-mcp 2.4.3 → 2.5.0

Files changed (3)
  1. package/README.md +193 -32
  2. package/dist/index.js +1514 -555
  3. package/package.json +5 -4
package/README.md CHANGED
@@ -11,7 +11,7 @@
11
11
  [![TypeScript](https://img.shields.io/badge/TypeScript-6.0-blue.svg?style=flat-square)](https://www.typescriptlang.org/)
12
12
  [![Downloads](https://img.shields.io/npm/dm/@sylphx/pdf-reader-mcp?style=flat-square)](https://www.npmjs.com/package/@sylphx/pdf-reader-mcp)
13
13
 
14
- **5-10x faster parallel processing** • **Y-coordinate content ordering** • **94%+ test coverage** • **173 tests passing**
14
+ **5-10x faster parallel processing** • **Structured element output** • **Semantic citation chunks** • **CI-backed quality**
15
15
 
16
16
  <a href="https://mseep.ai/app/SylphxAI-pdf-reader-mcp">
17
17
  <img src="https://mseep.net/pr/SylphxAI-pdf-reader-mcp-badge.png" alt="Security Validated" width="200"/>
@@ -23,7 +23,7 @@
23
23
 
24
24
  ## 🚀 Overview
25
25
 
26
- PDF Reader MCP is a **production-ready** Model Context Protocol server that empowers AI agents with **enterprise-grade PDF processing capabilities**. Extract text, images, and metadata with unmatched performance and reliability.
26
+ PDF Reader MCP is a **production-ready** Model Context Protocol server that empowers AI agents with **structured, local-first PDF processing capabilities**. Extract text, Markdown, semantic citation chunks, images, tables, annotations, outlines, structure trees, form fields, attachment metadata, and agent-ready document elements with strong performance and reliability.
27
27
 
28
28
  **The Problem:**
29
29
  ```typescript
@@ -38,10 +38,14 @@ PDF Reader MCP is a **production-ready** Model Context Protocol server that empo
38
38
  ```typescript
39
39
  // PDF Reader MCP
40
40
  - 5-10x faster parallel processing ⚡
41
- - Y-coordinate based ordering 📐
41
+ - Structured element output for agent workflows 🧩
42
+ - Markdown rendering for RAG and summarization 📝
43
+ - Citation-ready semantic/table/page chunks 🔗
44
+ - Outlines, annotations, structure trees, forms, attachments, labels, and permission signals 🗂️
45
+ - Column-aware reading order 📐
42
46
  - Flexible path support (absolute/relative) 🎯
43
47
  - Per-page error resilience 🛡️
44
- - 94%+ test coverage
48
+ - CI-backed quality
45
49
  ```
46
50
 
47
51
  **Result: Production-ready PDF processing that scales.**
@@ -60,9 +64,13 @@ PDF Reader MCP is a **production-ready** Model Context Protocol server that empo
60
64
  ### Developer Experience
61
65
 
62
66
  - 🎯 **Path Flexibility** - Absolute & relative paths, Windows/Unix support (v1.3.0)
63
- - 🖼️ **Smart Ordering** - Y-coordinate based content preserves document layout
67
+ - 🧩 **Structured Elements** - Optional page-level elements with stable IDs, provenance, and best-effort bounding boxes
68
+ - 📝 **Markdown Rendering** - Optional page-aware Markdown for RAG, summarization, and agent context
69
+ - 🔗 **Citation Chunks** - Optional page, semantic, size, and table chunks with element IDs and best-effort bounding boxes
70
+ - 🗂️ **Document Signals** - Optional outlines, page labels, annotations, structure trees, forms, attachments, permissions, and mark info
71
+ - 🖼️ **Smart Ordering** - Column-aware content ordering improves natural reading flow
64
72
  - 🛡️ **Type Safe** - Full TypeScript with strict mode enabled
65
- - 📚 **Battle-tested** - 173 tests, 94%+ coverage, 98%+ function coverage
73
+ - 📚 **Battle-tested** - Automated tests, strict TypeScript, and CI validation
66
74
  - 🎨 **Simple API** - Single tool handles all operations elegantly
67
75
 
68
76
  ---
@@ -211,7 +219,7 @@ npm install -g @sylphx/pdf-reader-mcp
211
219
  - ✅ Full text content extracted
212
220
  - ✅ PDF metadata (author, title, dates)
213
221
  - ✅ Total page count
214
- - ✅ Structural sharing - unchanged parts preserved
222
+ - ✅ Structured JSON summary for agent workflows
215
223
 
216
224
  ### Extract Specific Pages
217
225
 
@@ -225,6 +233,96 @@ npm install -g @sylphx/pdf-reader-mcp
225
233
  }
226
234
  ```
227
235
 
236
+ ### Structured Elements for Agents
237
+
238
+ ```json
239
+ {
240
+ "sources": [{
241
+ "path": "documents/report.pdf",
242
+ "pages": "1-3"
243
+ }],
244
+ "include_elements": true,
245
+ "include_metadata": true,
246
+ "include_page_count": true
247
+ }
248
+ ```
249
+
250
+ **Response includes:**
251
+ - Stable element IDs such as `p1-text-1`
252
+ - Page numbers and provenance for each element
253
+ - Best-effort bounding boxes when coordinates are available
254
+ - Text, image metadata, and table elements without embedding image bytes in the JSON summary
255
+ - Table elements include best-effort table and cell bounding boxes when coordinates are available
256
+
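A minimal sketch of how a client might read the stable element IDs listed above. The `parseElementId` helper and the `p<page>-<kind>-<index>` grammar are assumptions inferred from the single `p1-text-1` example; the README does not document an ID format, so unknown shapes return `null` instead of being guessed.

```typescript
// Hypothetical helper: not part of @sylphx/pdf-reader-mcp.
interface ParsedElementId {
  page: number;
  kind: string;
  index: number;
}

// Parse IDs of the assumed form p<page>-<kind>-<index>, e.g. "p1-text-1".
function parseElementId(id: string): ParsedElementId | null {
  const match = /^p(\d+)-([a-z]+)-(\d+)$/.exec(id);
  if (match === null) return null; // unknown shape: do not guess
  return {
    page: Number(match[1]),
    kind: String(match[2]),
    index: Number(match[3]),
  };
}
```

Treating the ID as opaque and reading the page number from the element itself is likely the safer choice when the response already carries it.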
257
+ ### Markdown for RAG and Summaries
258
+
259
+ ```json
260
+ {
261
+ "sources": [{
262
+ "path": "documents/report.pdf",
263
+ "pages": "1-5"
264
+ }],
265
+ "include_markdown": true,
266
+ "include_full_text": false
267
+ }
268
+ ```
269
+
270
+ **Response includes:**
271
+ - Page-aware Markdown sections
272
+ - Text blocks in extraction order
273
+ - Image placeholders with dimensions when images are requested
274
+ - Extracted tables appended as Markdown when `include_tables` is enabled
275
+
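To illustrate what "tables appended as Markdown" can look like, here is a small sketch that renders rows of cell strings as a pipe table. The `tableToMarkdown` function and the choice to treat the first row as the header are assumptions for illustration; the package's actual renderer may differ.

```typescript
// Hypothetical renderer: not the package's implementation.
// Renders rows of cells as a Markdown pipe table, first row as header.
function tableToMarkdown(rows: string[][]): string {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((row) => row.length));
  // Escape pipes and flatten newlines so a cell cannot break the table.
  const escapeCell = (cell: string): string =>
    cell.replace(/\|/g, "\\|").replace(/\r?\n/g, " ").trim();
  const renderRow = (row: string[]): string => {
    const cells = Array.from({ length: width }, (_, i) => escapeCell(row[i] ?? ""));
    return `| ${cells.join(" | ")} |`;
  };
  const header = rows[0] ?? [];
  const separator = `| ${Array.from({ length: width }, () => "---").join(" | ")} |`;
  return [renderRow(header), separator, ...rows.slice(1).map(renderRow)].join("\n");
}
```

Short rows are padded with empty cells so every line has the same column count.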
276
+ ### Citation-Ready Chunks
277
+
278
+ ```json
279
+ {
280
+ "sources": [{
281
+ "path": "documents/report.pdf",
282
+ "pages": "1-5"
283
+ }],
284
+ "include_chunks": true,
285
+ "include_semantic_hints": true,
286
+ "include_tables": true,
287
+ "include_full_text": false
288
+ }
289
+ ```
290
+
291
+ **Response includes:**
292
+ - Stable chunk IDs such as `p1-chunk-1`
293
+ - Page ranges for each chunk
294
+ - Chunk strategies such as `page`, `semantic`, `size`, and `table`
295
+ - Semantic headings when heading boundaries are available
296
+ - Element IDs that map back to structured elements
297
+ - Best-effort bounding boxes for source highlighting
298
+
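The list above says chunks carry element IDs that map back to structured elements. A sketch of that lookup follows; the `Chunk`, `DocElement`, and `BoundingBox` field names are assumptions, since the README describes the concepts but not the exact JSON keys.

```typescript
// Hypothetical response shapes: field names are illustrative only.
interface BoundingBox { x: number; y: number; width: number; height: number; }
interface DocElement { id: string; page: number; bbox?: BoundingBox; }
interface Chunk {
  id: string;
  strategy: "page" | "semantic" | "size" | "table";
  elementIds: string[];
}
interface Citation { chunkId: string; pages: number[]; boxes: BoundingBox[]; }

// Resolve a chunk's element references into pages and highlight boxes.
function resolveCitation(chunk: Chunk, elements: DocElement[]): Citation {
  const byId = new Map<string, DocElement>();
  for (const element of elements) byId.set(element.id, element);

  const pages = new Set<number>();
  const boxes: BoundingBox[] = [];
  for (const id of chunk.elementIds) {
    const element = byId.get(id);
    if (element === undefined) continue; // tolerate dangling references
    pages.add(element.page);
    if (element.bbox !== undefined) boxes.push(element.bbox); // boxes are best-effort
  }
  return {
    chunkId: chunk.id,
    pages: Array.from(pages).sort((a, b) => a - b),
    boxes,
  };
}
```

Because bounding boxes are described as best-effort, the sketch treats a missing box as normal rather than as an error.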
299
+ ### Outlines, Forms, Attachments, and Document Signals
300
+
301
+ ```json
302
+ {
303
+ "sources": [{
304
+ "path": "documents/spec.pdf",
305
+ "pages": "1-5"
306
+ }],
307
+ "include_outline": true,
308
+ "include_annotations": true,
309
+ "include_page_labels": true,
310
+ "include_permissions": true,
311
+ "include_structure_tree": true,
312
+ "include_form_fields": true,
313
+ "include_attachments": true
314
+ }
315
+ ```
316
+
317
+ **Response includes, when available:**
318
+ - Bookmark/outline trees
319
+ - Page labels such as roman numerals or section labels
320
+ - Link and note annotation summaries with bounding boxes
321
+ - Tagged PDF structure trees for selected pages when available
322
+ - Form field summaries with values, field types, and bounding boxes when available
323
+ - Embedded attachment metadata without returning attachment bytes
324
+ - Permission labels and marking signals
325
+
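Since every document-signal flag is optional and defaults to `false` in the parameter table, a client can send only the flags it needs. The sketch below builds such an arguments object; `buildSignalRequest` and the type names are hypothetical, while the flag names are taken from this README.

```typescript
// Hypothetical client-side helper: only the flag names come from the README.
interface PdfSource { path: string; pages?: string; }

const SIGNAL_FLAGS = [
  "include_outline",
  "include_annotations",
  "include_page_labels",
  "include_page_geometry",
  "include_permissions",
  "include_structure_tree",
  "include_form_fields",
  "include_attachments",
] as const;
type SignalFlag = (typeof SIGNAL_FLAGS)[number];
type SignalRequest = { sources: PdfSource[] } & Partial<Record<SignalFlag, boolean>>;

// Set only the requested flags; omitted flags keep the server default.
function buildSignalRequest(source: PdfSource, flags: readonly SignalFlag[]): SignalRequest {
  const request: SignalRequest = { sources: [source] };
  for (const flag of flags) request[flag] = true;
  return request;
}
```

Leaving unrequested flags out of the payload keeps the request, and presumably the response, as small as the default shape allows.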
228
326
  ### Absolute Paths (v1.3.0+)
229
327
 
230
328
  ```json
@@ -261,7 +359,7 @@ npm install -g @sylphx/pdf-reader-mcp
261
359
  ```
262
360
 
263
361
  **Response includes:**
264
- - Text and images in **exact document order** (Y-coordinate sorted)
362
+ - Text and images in **Y-coordinate reading order**
265
363
  - Base64-encoded images with metadata (width, height, format)
266
364
  - Natural reading flow preserved for AI comprehension
267
365
 
@@ -287,7 +385,11 @@ npm install -g @sylphx/pdf-reader-mcp
287
385
  ### Core Capabilities
288
386
  - ✅ **Text Extraction** - Full document or specific pages with intelligent parsing
289
387
  - ✅ **Image Extraction** - Base64-encoded with complete metadata (width, height, format)
290
- - ✅ **Content Ordering** - Y-coordinate based layout preservation for natural reading flow
388
+ - ✅ **Structured Elements** - Agent-ready elements with stable IDs, provenance, and best-effort bounding boxes
389
+ - ✅ **Markdown Output** - Page-aware Markdown for RAG, summaries, and context preparation
390
+ - ✅ **Citation Chunks** - Page, semantic, size, and table chunks with source references for downstream retrieval
391
+ - ✅ **Document Signals** - Outlines, annotations, structure trees, forms, attachments, page labels, permissions, and mark info when exposed by the PDF
392
+ - ✅ **Content Ordering** - Column-aware layout preservation for natural reading flow
291
393
  - ✅ **Metadata Extraction** - Author, title, creation date, and custom properties
292
394
  - ✅ **Page Counting** - Fast enumeration without loading full content
293
395
  - ✅ **Dual Sources** - Local files (absolute or relative paths) and HTTP/HTTPS URLs
@@ -304,9 +406,37 @@ npm install -g @sylphx/pdf-reader-mcp
304
406
 
305
407
  ---
306
408
 
307
- ## 🆕 What's New in v1.3.0
409
+ ## 🆕 Latest Improvements
410
+
411
+ ### Agent-Ready Structured Output
412
+
413
+ `include_elements` adds structured document elements to the JSON response while keeping the existing text, metadata, image, and table outputs backward compatible.
414
+
415
+ ```json
416
+ {
417
+ "sources": [{ "path": "report.pdf" }],
418
+ "include_elements": true,
419
+ "include_semantic_hints": true
420
+ }
421
+ ```
422
+
423
+ Elements include stable IDs, page numbers, provenance, and best-effort bounding boxes where available. Image bytes stay out of the JSON summary so MCP clients can keep context payloads manageable.
424
+
425
+ `include_semantic_hints` adds deterministic heading/list/paragraph hints to text elements, with confidence and signals, without claiming a full semantic parser.
426
+
427
+ `include_markdown` adds page-aware Markdown for workflows that need clean text context without manually rebuilding sections from raw page text.
428
+
429
+ `include_html` adds an escaped HTML rendering for previews, export workflows, and downstream conversion.
430
+
431
+ The extraction pipeline also separates distant same-line text into independent segments before ordering, which improves multi-column PDFs without requiring any extra configuration.
432
+
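One way to picture "separating distant same-line text into independent segments" is a gap-based split along the x-axis. The sketch below is an assumption about the general technique, not the package's code: `TextItem`, `splitLineByGap`, and the `maxGap` threshold are all hypothetical.

```typescript
// Hypothetical sketch of gap-based segmentation for one text line.
interface TextItem { text: string; x: number; width: number; }

// Split items that share a baseline into segments wherever the
// horizontal gap between neighbours exceeds maxGap.
function splitLineByGap(items: TextItem[], maxGap: number): TextItem[][] {
  const sorted = items.slice().sort((a, b) => a.x - b.x);
  const segments: TextItem[][] = [];
  let current: TextItem[] = [];
  let prevRight = Number.NEGATIVE_INFINITY;

  for (const item of sorted) {
    if (current.length > 0 && item.x - prevRight > maxGap) {
      segments.push(current); // gap is too wide: close the segment
      current = [];
    }
    current.push(item);
    prevRight = Math.max(prevRight, item.x + item.width);
  }
  if (current.length > 0) segments.push(current);
  return segments;
}
```

With two columns, text from the left and right column that happens to share a Y-coordinate ends up in separate segments, so a later ordering step does not interleave them.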
433
+ `include_chunks` adds citation-ready chunks with stable IDs, strategy labels, element references, and best-effort bounding boxes for downstream retrieval and citation workflows. When `include_semantic_hints` is also enabled, chunks split on deterministic heading boundaries; table chunks are emitted when table extraction is requested.
434
+
435
+ `include_outline`, `include_annotations`, `include_page_labels`, `include_page_geometry`, `include_permissions`, `include_structure_tree`, `include_form_fields`, and `include_attachments` expose additional document signals without changing the default lightweight response shape.
308
436
 
309
- ### 🎉 Absolute Paths Now Supported!
437
+ `include_safety_findings` adds deterministic findings for common prompt-injection patterns, tiny text, and off-page text so agents can inspect risky document content before using it as instructions.
438
+
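The three finding categories named above (prompt-injection patterns, tiny text, off-page text) can each be expressed as a deterministic check. The following is a sketch under stated assumptions: the type names, the 4pt font threshold, and the two regular expressions are invented for illustration and are not the package's rules.

```typescript
// Hypothetical checks: thresholds and patterns are illustrative only.
interface PositionedText {
  text: string;
  fontSize: number;
  x: number;
  y: number;
  width: number;
  height: number;
}
interface PageSize { width: number; height: number; }
type FindingKind = "prompt_injection_pattern" | "tiny_text" | "off_page_text";
interface Finding { kind: FindingKind; text: string; }

// No global flag, so RegExp.test keeps no state between calls.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /disregard (the )?system prompt/i,
];

function findSafetyIssues(items: PositionedText[], page: PageSize, minFontSize = 4): Finding[] {
  const findings: Finding[] = [];
  for (const item of items) {
    if (INJECTION_PATTERNS.some((pattern) => pattern.test(item.text))) {
      findings.push({ kind: "prompt_injection_pattern", text: item.text });
    }
    if (item.fontSize < minFontSize) {
      findings.push({ kind: "tiny_text", text: item.text });
    }
    // Off-page: the box lies entirely outside the page rectangle.
    const offPage =
      item.x + item.width < 0 || item.y + item.height < 0 ||
      item.x > page.width || item.y > page.height;
    if (offPage) findings.push({ kind: "off_page_text", text: item.text });
  }
  return findings;
}
```

Checks like these only flag content for inspection; deciding what to do with a finding stays with the agent or its operator.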
439
+ ### Absolute Paths Supported
310
440
 
311
441
  ```json
312
442
  // ✅ Windows
@@ -322,9 +452,9 @@ npm install -g @sylphx/pdf-reader-mcp
322
452
  ```
323
453
 
324
454
  **Other Improvements:**
325
- - 🐛 Fixed Zod validation error handling
326
- - 📦 Updated all dependencies to latest versions
327
- - 173 tests passing, 94%+ coverage maintained
455
+ - 🛡️ Filesystem and HTTP access restrictions for safer deployments
456
+ - 📊 Table extraction with Markdown output
457
+ - 📦 Updated parser resources for CMaps, fonts, WASM decoders, and color profiles
328
458
 
329
459
  <details>
330
460
  <summary><strong>📋 View Full Changelog</strong></summary>
@@ -339,7 +469,7 @@ npm install -g @sylphx/pdf-reader-mcp
339
469
  **v1.1.0 - Image Extraction & Performance**
340
470
  - Base64-encoded image extraction
341
471
  - 10x speedup with parallel processing
342
- - Comprehensive test coverage (94%+)
472
+ - Comprehensive test coverage
343
473
 
344
474
  [View Full Changelog →](./CHANGELOG.md)
345
475
 
@@ -362,6 +492,21 @@ The single tool that handles all PDF operations.
362
492
  | `include_metadata` | boolean | Extract PDF metadata | `true` |
363
493
  | `include_page_count` | boolean | Include total page count | `true` |
364
494
  | `include_images` | boolean | Extract embedded images | `false` |
495
+ | `include_tables` | boolean | Detect tables with rows, cell metadata, confidence, and best-effort geometry | `false` |
496
+ | `include_elements` | boolean | Include structured document elements for agent workflows | `false` |
497
+ | `include_semantic_hints` | boolean | Include deterministic heading/list/paragraph hints on text elements | `false` |
498
+ | `include_markdown` | boolean | Include page-aware Markdown for RAG and summarization | `false` |
499
+ | `include_html` | boolean | Include escaped page-aware HTML for preview/export workflows | `false` |
500
+ | `include_chunks` | boolean | Include page, semantic, size, and table chunks with source references | `false` |
501
+ | `include_outline` | boolean | Include PDF outline/bookmarks when available | `false` |
502
+ | `include_annotations` | boolean | Include safe annotation summaries for selected pages | `false` |
503
+ | `include_page_labels` | boolean | Include PDF page labels when available | `false` |
504
+ | `include_page_geometry` | boolean | Include page viewport geometry and PDF view boxes | `false` |
505
+ | `include_permissions` | boolean | Include permission labels and mark info when available | `false` |
506
+ | `include_structure_tree` | boolean | Include tagged PDF structure trees for selected pages when available | `false` |
507
+ | `include_form_fields` | boolean | Include PDF form field summaries when available | `false` |
508
+ | `include_attachments` | boolean | Include embedded attachment metadata without attachment bytes | `false` |
509
+ | `include_safety_findings` | boolean | Include deterministic content safety findings for agent workflows | `false` |
365
510
 
366
511
  #### Source Object
367
512
 
@@ -405,16 +550,27 @@ The single tool that handles all PDF operations.
405
550
  }
406
551
  ```
407
552
 
553
+ **Structured elements:**
554
+ ```json
555
+ {
556
+ "sources": [{ "path": "report.pdf", "pages": "1-3" }],
557
+ "include_elements": true,
558
+ "include_metadata": true
559
+ }
560
+ ```
561
+
562
+ Elements are designed for agent workflows that need stable page references, provenance, and best-effort coordinates for citation-ready downstream processing.
563
+
408
564
  ---
409
565
 
410
566
  ## 🔧 Advanced Usage
411
567
 
412
568
  <details>
413
- <summary><strong>📐 Y-Coordinate Content Ordering</strong></summary>
569
+ <summary><strong>📐 Column-Aware Content Ordering</strong></summary>
414
570
 
415
571
  <br/>
416
572
 
417
- Content is returned in natural reading order based on Y-coordinates:
573
+ Content is returned in natural reading order using Y-coordinates plus lightweight column segmentation:
418
574
 
419
575
  ```
420
576
  Document Layout:
@@ -441,6 +597,7 @@ Response Order:
441
597
  - Natural document comprehension
442
598
  - Perfect for vision-enabled models
443
599
  - Automatic multi-line text grouping
600
+ - Better ordering for common two-column PDFs
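As a rough illustration of "Y-coordinates plus lightweight column segmentation", the sketch below orders segments column by column, then top to bottom. It assumes PDF user space (y increases upward) and column boundaries that are already known; `Segment`, `orderByColumns`, and `columnBoundaries` are hypothetical names, and the package's real heuristic is not shown in this diff.

```typescript
// Hypothetical ordering sketch: not the package's algorithm.
interface Segment { text: string; x: number; y: number; }

// columnBoundaries holds the x positions where a new column starts,
// e.g. [300] for a two-column page split at x = 300.
function orderByColumns(segments: Segment[], columnBoundaries: number[]): Segment[] {
  const columnOf = (x: number): number =>
    columnBoundaries.filter((boundary) => x >= boundary).length;

  return segments.slice().sort((a, b) => {
    const columnDelta = columnOf(a.x) - columnOf(b.x);
    if (columnDelta !== 0) return columnDelta; // left column first
    if (a.y !== b.y) return b.y - a.y;         // higher y is nearer the top
    return a.x - b.x;                          // same line: left to right
  });
}
```

A pure Y-coordinate sort would return the first line of the left column, then the first line of the right column, and so on; grouping by column first avoids that interleaving. Full-width headings that span both columns are not handled by this sketch.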
444
601
 
445
602
  </details>
446
603
 
@@ -713,7 +870,7 @@ CMD ["bun", "node_modules/@sylphx/pdf-reader-mcp/dist/index.js"]
713
870
  | **Validation** | Vex + JSON Schema |
714
871
  | **Protocol** | MCP SDK |
715
872
  | **Language** | TypeScript (strict) |
716
- | **Testing** | Bun test (173 tests) |
873
+ | **Testing** | Bun test suite |
717
874
  | **Quality** | Biome (50x faster) |
718
875
  | **CI/CD** | GitHub Actions |
719
876
 
@@ -723,7 +880,7 @@ CMD ["bun", "node_modules/@sylphx/pdf-reader-mcp/dist/index.js"]
723
880
  - 🎯 **Simple Interface** - One tool, all operations
724
881
  - ⚡ **Performance** - Parallel processing, efficient memory
725
882
  - 🛡️ **Reliability** - Per-page isolation, detailed errors
726
- - 🧪 **Quality** - 94%+ coverage, strict TypeScript
883
+ - 🧪 **Quality** - Automated tests, strict TypeScript, and CI validation
727
884
  - 📝 **Type Safety** - No `any` types, strict mode
728
885
  - 🔄 **Backward Compatible** - Smooth upgrades always
729
886
 
@@ -750,17 +907,17 @@ bun install && bun run build
750
907
  **Scripts:**
751
908
  ```bash
752
909
  bun run build # Build with bunup
753
- bun test # Run 173 tests
754
- bun run test:cov # Coverage (94%+)
910
+ bun test # Run the test suite
911
+ bun run test:cov # Run coverage
755
912
  bun run check # Lint + format
756
913
  bun run check:fix # Auto-fix
757
914
  bun run benchmark # Performance tests
758
915
  ```
759
916
 
760
917
  **Quality:**
761
- - ✅ 173 tests
762
- - ✅ 94%+ coverage
763
- - ✅ 98%+ function coverage
918
+ - ✅ Automated tests
919
+ - ✅ Coverage reporting
920
+ - ✅ Strict TypeScript
764
921
  - ✅ Zero lint errors
765
922
  - ✅ Strict TypeScript
766
923
 
@@ -810,16 +967,21 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md)
810
967
  - [x] 5-10x parallel speedup (v1.1.0)
811
968
  - [x] Y-coordinate ordering (v1.2.0)
812
969
  - [x] Absolute paths (v1.3.0)
813
- - [x] 94%+ test coverage (v1.3.0)
970
+ - [x] Table extraction
971
+ - [x] Structured element output
972
+ - [x] Markdown rendering
973
+ - [x] Citation-ready page, semantic, size, and table chunks
974
+ - [x] Outlines, annotations, structure trees, form fields, attachment metadata, page labels, and permission signals
975
+ - [x] Column-aware ordering for common multi-column PDFs
976
+ - [x] Quality evals for semantic chunks, table ordering, renderers, and safety findings
977
+ - [x] Filesystem and HTTP access restrictions
814
978
 
815
979
  **🚀 Next**
816
980
  - [ ] OCR for scanned PDFs
817
- - [ ] Annotation extraction
818
- - [ ] Form field extraction
819
- - [ ] Table detection
981
+ - [ ] Richer semantic layout detection
982
+ - [ ] Optional advanced parser engines
820
983
  - [ ] 100+ MB streaming
821
984
  - [ ] Advanced caching
822
- - [ ] PDF generation
823
985
 
824
986
  Vote at [Discussions](https://github.com/SylphxAI/pdf-reader-mcp/discussions)
825
987
 
@@ -832,7 +994,7 @@ Vote at [Discussions](https://github.com/SylphxAI/pdf-reader-mcp/discussions)
832
994
  - [Glama](https://glama.ai/mcp/servers/@sylphx/pdf-reader-mcp) - AI marketplace
833
995
  - [MseeP.ai](https://mseep.ai/app/SylphxAI-pdf-reader-mcp) - Security validated
834
996
 
835
- **Trusted worldwide** • **Enterprise adoption** • **Battle-tested**
997
+ **Local-first** • **Agent-ready** • **Battle-tested**
836
998
 
837
999
  ---
838
1000
 
@@ -858,7 +1020,7 @@ Vote at [Discussions](https://github.com/SylphxAI/pdf-reader-mcp/discussions)
858
1020
  ![Downloads](https://img.shields.io/npm/dm/@sylphx/pdf-reader-mcp)
859
1021
  ![Contributors](https://img.shields.io/github/contributors/SylphxAI/pdf-reader-mcp)
860
1022
 
861
- **103 Tests** • **94%+ Coverage** • **Production Ready**
1023
+ **CI-backed quality** • **Structured extraction** • **Production ready**
862
1024
 
863
1025
  ---
864
1026
 
@@ -884,7 +1046,6 @@ This project uses the following [@sylphx](https://github.com/SylphxAI) packages:
884
1046
  - [@sylphx/vex](https://github.com/SylphxAI/vex) - Schema validation
885
1047
  - [@sylphx/biome-config](https://github.com/SylphxAI/biome-config) - Biome configuration
886
1048
  - [@sylphx/tsconfig](https://github.com/SylphxAI/tsconfig) - TypeScript configuration
887
- - [@sylphx/bump](https://github.com/SylphxAI/bump) - Version management
888
1049
  - [@sylphx/doctor](https://github.com/SylphxAI/doctor) - Project health checker
889
1050
 
890
1051
  ---