@sylphx/pdf-reader-mcp 1.2.0 → 1.3.0
- package/README.md +508 -246
- package/dist/handlers/readPdf.js +6 -4
- package/dist/index.js +1 -1
- package/dist/pdf/extractor.js +158 -29
- package/dist/pdf/parser.js +6 -4
- package/dist/schemas/readPdf.js +5 -1
- package/dist/utils/pathUtils.js +7 -12
- package/package.json +37 -33
package/README.md
CHANGED
|
@@ -1,62 +1,75 @@
|
|
|
1
|
-
|
|
1
|
+
<div align="center">
|
|
2
2
|
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
3
|
+
# PDF Reader MCP ⚡
|
|
4
|
+
|
|
5
|
+
**The fastest and most powerful PDF processing server for AI agents**
|
|
6
|
+
|
|
7
|
+
[CI](https://github.com/sylphxltd/pdf-reader-mcp/actions/workflows/ci.yml)
|
|
8
|
+
[Codecov](https://codecov.io/gh/sylphxltd/pdf-reader-mcp)
|
|
9
|
+
[](https://www.npmjs.com/package/@sylphx/pdf-reader-mcp)
|
|
10
|
+
[](https://www.npmjs.com/package/@sylphx/pdf-reader-mcp)
|
|
7
11
|
[License: MIT](https://opensource.org/licenses/MIT)
|
|
8
|
-
[](https://smithery.ai/server/@sylphxltd/pdf-reader-mcp)
|
|
9
12
|
|
|
10
|
-
|
|
11
|
-
|
|
13
|
+
**5-10x faster parallel processing** • **Y-coordinate content ordering** • **94%+ test coverage** • **Production-ready**
|
|
14
|
+
|
|
15
|
+
<a href="https://mseep.ai/app/sylphxltd-pdf-reader-mcp">
|
|
16
|
+
<img src="https://mseep.net/pr/sylphxltd-pdf-reader-mcp-badge.png" alt="Security Validated" width="200"/>
|
|
12
17
|
</a>
|
|
13
18
|
|
|
14
|
-
|
|
19
|
+
</div>
|
|
15
20
|
|
|
16
|
-
|
|
21
|
+
---
|
|
17
22
|
|
|
18
|
-
|
|
19
|
-
- 🖼️ **Extract embedded images** from PDF pages as base64-encoded data
|
|
20
|
-
- 📊 **Get metadata** (author, title, creation date, etc.)
|
|
21
|
-
- 🔢 **Count pages** in PDF documents
|
|
22
|
-
- 🌐 **Support for both local files and URLs**
|
|
23
|
-
- 🛡️ **Secure** - Confines file access to project root directory
|
|
24
|
-
- ⚡ **Fast** - Parallel processing for maximum performance
|
|
25
|
-
- 🔄 **Batch processing** - Handle multiple PDFs in a single request
|
|
26
|
-
- 📦 **Multiple deployment options** - npm or Smithery
|
|
27
|
-
|
|
28
|
-
## 🆕 Recent Updates (October 2025)
|
|
29
|
-
|
|
30
|
-
- ✅ **Fixed critical bugs**: Buffer/Uint8Array compatibility for PDF.js v5.x
|
|
31
|
-
- ✅ **Fixed schema validation**: Resolved `exclusiveMinimum` issue affecting Windsurf, Mistral API, and other tools
|
|
32
|
-
- ✅ **Improved metadata extraction**: Robust fallback handling for PDF.js compatibility
|
|
33
|
-
- ✅ **Updated dependencies**: All packages updated to latest versions
|
|
34
|
-
- ✅ **Migrated to Biome**: 50x faster linting and formatting with unified tooling
|
|
35
|
-
- ✅ **Added image extraction**: Extract embedded images from PDF pages
|
|
36
|
-
- ✅ **Performance optimization**: Parallel page processing for 5-10x speedup
|
|
37
|
-
- ✅ **Deep refactoring**: Modular architecture with 98.9% test coverage (90 tests)
|
|
23
|
+
## 🚀 Overview
|
|
38
24
|
|
|
39
|
-
|
|
25
|
+
PDF Reader MCP is a **production-ready** Model Context Protocol server that empowers AI agents with **enterprise-grade PDF processing capabilities**. Extract text, images, and metadata with unmatched performance and reliability.
|
|
40
26
|
|
|
41
|
-
|
|
27
|
+
**Stop struggling with PDF extraction. Choose PDF Reader MCP.**
|
|
42
28
|
|
|
43
|
-
|
|
29
|
+
## ⚡ Why PDF Reader MCP?
|
|
44
30
|
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
31
|
+
### **Unmatched Performance**
|
|
32
|
+
- 🚀 **5-10x faster** than sequential processing with automatic parallelization
|
|
33
|
+
- 🔥 **Process 50-page PDFs** in seconds with multi-core utilization
|
|
34
|
+
- ⚡ **~12,933 ops/sec** error handling, ~5,575 ops/sec text extraction
|
|
35
|
+
- 💨 **Streaming support** for efficient large file handling
|
|
36
|
+
- 📦 **Lightweight** with minimal dependencies
|
|
37
|
+
|
|
38
|
+
### **Developer Experience**
|
|
39
|
+
- 🎯 **Path Flexibility** - Absolute & relative paths, Windows/Unix support (NEW v1.3.0)
|
|
40
|
+
- 🖼️ **Smart Ordering** - Y-coordinate based content extraction preserves layout
|
|
41
|
+
- 🛡️ **Type Safe** - Full TypeScript with strict mode enabled
|
|
42
|
+
- 📚 **Battle-tested** - 103 tests, 94%+ coverage, zero compromises
|
|
43
|
+
- 🎨 **Simple API** - Single tool handles all operations elegantly
|
|
48
44
|
|
|
49
|
-
|
|
45
|
+
---
|
|
50
46
|
|
|
51
|
-
|
|
47
|
+
## 📦 Installation
|
|
52
48
|
|
|
53
49
|
```bash
|
|
50
|
+
# Quick start - zero installation
|
|
51
|
+
npx @sylphx/pdf-reader-mcp
|
|
52
|
+
|
|
53
|
+
# Using pnpm (recommended)
|
|
54
54
|
pnpm add @sylphx/pdf-reader-mcp
|
|
55
|
-
|
|
55
|
+
|
|
56
|
+
# Using npm
|
|
56
57
|
npm install @sylphx/pdf-reader-mcp
|
|
58
|
+
|
|
59
|
+
# Using yarn
|
|
60
|
+
yarn add @sylphx/pdf-reader-mcp
|
|
61
|
+
|
|
62
|
+
# For Claude Desktop (easiest)
|
|
63
|
+
npx -y @smithery/cli install @sylphx/pdf-reader-mcp --client claude
|
|
57
64
|
```
|
|
58
65
|
|
|
59
|
-
|
|
66
|
+
---
|
|
67
|
+
|
|
68
|
+
## 🎯 Quick Start
|
|
69
|
+
|
|
70
|
+
### Configuration
|
|
71
|
+
|
|
72
|
+
Add to your MCP client (`claude_desktop_config.json`, Cursor, Cline):
|
|
60
73
|
|
|
61
74
|
```json
|
|
62
75
|
{
|
|
@@ -69,357 +82,606 @@ Configure your MCP client (e.g., Claude Desktop, Cursor):
|
|
|
69
82
|
}
|
|
70
83
|
```
|
|
71
84
|
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
### Option 3: Local Development Build
|
|
85
|
+
### Basic Usage
|
|
75
86
|
|
|
76
|
-
```
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
87
|
+
```json
|
|
88
|
+
{
|
|
89
|
+
"sources": [{
|
|
90
|
+
"path": "documents/report.pdf"
|
|
91
|
+
}],
|
|
92
|
+
"include_full_text": true,
|
|
93
|
+
"include_metadata": true,
|
|
94
|
+
"include_page_count": true
|
|
95
|
+
}
|
|
81
96
|
```
|
|
82
97
|
|
|
83
|
-
|
|
98
|
+
**Result:**
|
|
99
|
+
- ✅ Full text content extracted
|
|
100
|
+
- ✅ PDF metadata (author, title, dates)
|
|
101
|
+
- ✅ Total page count
|
|
102
|
+
- ✅ Structural sharing - unchanged parts preserved
|
|
84
103
|
|
|
85
|
-
|
|
104
|
+
### Extract Specific Pages
|
|
86
105
|
|
|
87
|
-
|
|
106
|
+
```json
|
|
107
|
+
{
|
|
108
|
+
"sources": [{
|
|
109
|
+
"path": "documents/manual.pdf",
|
|
110
|
+
"pages": "1-5,10,15-20"
|
|
111
|
+
}],
|
|
112
|
+
"include_full_text": true
|
|
113
|
+
}
|
|
114
|
+
```
|
|
88
115
|
|
|
89
|
-
###
|
|
116
|
+
### Absolute Paths (NEW in v1.3.0!)
|
|
90
117
|
|
|
91
118
|
```json
|
|
119
|
+
// Windows - Both formats work!
|
|
92
120
|
{
|
|
93
|
-
"sources": [
|
|
94
|
-
|
|
95
|
-
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
|
|
121
|
+
"sources": [{
|
|
122
|
+
"path": "C:\\Users\\John\\Documents\\report.pdf"
|
|
123
|
+
}],
|
|
124
|
+
"include_full_text": true
|
|
125
|
+
}
|
|
126
|
+
|
|
127
|
+
// Unix/Mac
|
|
128
|
+
{
|
|
129
|
+
"sources": [{
|
|
130
|
+
"path": "/home/user/documents/contract.pdf"
|
|
131
|
+
}],
|
|
132
|
+
"include_full_text": true
|
|
100
133
|
}
|
|
101
134
|
```
|
|
102
135
|
|
|
103
|
-
|
|
136
|
+
**No more** `"Absolute paths are not allowed"` **errors!**
|
|
137
|
+
|
|
138
|
+
### Extract Images with Natural Ordering
|
|
104
139
|
|
|
105
140
|
```json
|
|
106
141
|
{
|
|
107
|
-
"sources": [{
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
142
|
+
"sources": [{
|
|
143
|
+
"path": "presentation.pdf",
|
|
144
|
+
"pages": [1, 2, 3]
|
|
145
|
+
}],
|
|
146
|
+
"include_images": true,
|
|
147
|
+
"include_full_text": true
|
|
111
148
|
}
|
|
112
149
|
```
|
|
113
150
|
|
|
114
|
-
|
|
151
|
+
**Response includes:**
|
|
152
|
+
- Text and images in **exact document order** (Y-coordinate sorted)
|
|
153
|
+
- Base64-encoded images with metadata (width, height, format)
|
|
154
|
+
- Natural reading flow preserved for AI comprehension
|
|
155
|
+
|
|
156
|
+
### Batch Processing
|
|
115
157
|
|
|
116
158
|
```json
|
|
117
159
|
{
|
|
118
160
|
"sources": [
|
|
119
|
-
{
|
|
120
|
-
|
|
121
|
-
}
|
|
161
|
+
{ "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
|
|
162
|
+
{ "path": "/home/user/Q2.pdf", "pages": "1-10" },
|
|
163
|
+
{ "url": "https://example.com/Q3.pdf" }
|
|
122
164
|
],
|
|
123
165
|
"include_full_text": true
|
|
124
166
|
}
|
|
125
167
|
```
|
|
126
168
|
|
|
127
|
-
|
|
169
|
+
⚡ **All PDFs processed in parallel automatically!**
|
|
170
|
+
|
|
171
|
+
---
|
|
172
|
+
|
|
173
|
+
## ✨ Features
|
|
174
|
+
|
|
175
|
+
### Core Capabilities
|
|
176
|
+
- ✅ **Text Extraction** - Full document or specific pages with intelligent parsing
|
|
177
|
+
- ✅ **Image Extraction** - Base64-encoded with complete metadata (width, height, format)
|
|
178
|
+
- ✅ **Content Ordering** - Y-coordinate based layout preservation for natural reading flow
|
|
179
|
+
- ✅ **Metadata Extraction** - Author, title, creation date, and custom properties
|
|
180
|
+
- ✅ **Page Counting** - Fast enumeration without loading full content
|
|
181
|
+
- ✅ **Dual Sources** - Local files (absolute or relative paths) and HTTP/HTTPS URLs
|
|
182
|
+
- ✅ **Batch Processing** - Multiple PDFs processed concurrently
|
|
183
|
+
|
|
184
|
+
### Advanced Features
|
|
185
|
+
- ⚡ **5-10x Performance** - Parallel page processing with Promise.all
|
|
186
|
+
- 🎯 **Smart Pagination** - Extract ranges like "1-5,10-15,20"
|
|
187
|
+
- 🖼️ **Multi-Format Images** - RGB, RGBA, Grayscale with automatic detection
|
|
188
|
+
- 🛡️ **Path Flexibility** - Windows, Unix, and relative paths all supported (v1.3.0)
|
|
189
|
+
- 🔍 **Error Resilience** - Per-page error isolation with detailed messages
|
|
190
|
+
- 📏 **Large File Support** - Efficient streaming and memory management
|
|
191
|
+
- 📝 **Type Safe** - Full TypeScript with strict mode enabled
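
The two bullets on parallel page processing with `Promise.all` and per-page error isolation can be illustrated together. The following is a minimal sketch, not the package's actual extractor: `extractPagesInParallel`, `PageResult`, and the `extractPage` callback are hypothetical names introduced here for illustration.

```typescript
// Hypothetical result shape: each page either succeeds with text or fails with a message.
type PageResult =
  | { page: number; ok: true; text: string }
  | { page: number; ok: false; error: string };

// Starts the extraction of every requested page at once and waits for all of them.
// A failure on one page is caught and recorded, so it does not reject the whole batch.
async function extractPagesInParallel(
  pages: number[],
  extractPage: (page: number) => Promise<string>
): Promise<PageResult[]> {
  return Promise.all(
    pages.map(async (page): Promise<PageResult> => {
      try {
        const text = await extractPage(page);
        return { page, ok: true, text };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return { page, ok: false, error };
      }
    })
  );
}
```

Because each page's promise catches its own error, `Promise.all` never sees a rejection, and results come back in the same order as the requested pages.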
|
|
192
|
+
|
|
193
|
+
---
|
|
194
|
+
|
|
195
|
+
## 🆕 What's New in v1.3.0
|
|
196
|
+
|
|
197
|
+
### 🎉 Absolute Paths Now Supported!
|
|
128
198
|
|
|
129
199
|
```json
|
|
200
|
+
// ✅ Windows
|
|
201
|
+
{ "path": "C:\\Users\\John\\Documents\\report.pdf" }
|
|
202
|
+
{ "path": "C:/Users/John/Documents/report.pdf" }
|
|
203
|
+
|
|
204
|
+
// ✅ Unix/Mac
|
|
205
|
+
{ "path": "/home/john/documents/report.pdf" }
|
|
206
|
+
{ "path": "/Users/john/Documents/report.pdf" }
|
|
207
|
+
|
|
208
|
+
// ✅ Relative (still works)
|
|
209
|
+
{ "path": "documents/report.pdf" }
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
**Other Improvements:**
|
|
213
|
+
- 🐛 Fixed Zod validation error handling
|
|
214
|
+
- 📦 Updated all dependencies to latest versions
|
|
215
|
+
- ✅ 103 tests passing, 94%+ coverage maintained
|
|
216
|
+
|
|
217
|
+
<details>
|
|
218
|
+
<summary><strong>📋 View Full Changelog</strong></summary>
|
|
219
|
+
|
|
220
|
+
<br/>
|
|
221
|
+
|
|
222
|
+
**v1.2.0 - Content Ordering**
|
|
223
|
+
- Y-coordinate based text and image ordering
|
|
224
|
+
- Natural reading flow for AI models
|
|
225
|
+
- Intelligent line grouping
|
|
226
|
+
|
|
227
|
+
**v1.1.0 - Image Extraction & Performance**
|
|
228
|
+
- Base64-encoded image extraction
|
|
229
|
+
- 10x speedup with parallel processing
|
|
230
|
+
- Comprehensive test coverage (94%+)
|
|
231
|
+
|
|
232
|
+
[View Full Changelog →](./CHANGELOG.md)
|
|
233
|
+
|
|
234
|
+
</details>
|
|
235
|
+
|
|
236
|
+
---
|
|
237
|
+
|
|
238
|
+
## 📖 API Reference
|
|
239
|
+
|
|
240
|
+
### `read_pdf` Tool
|
|
241
|
+
|
|
242
|
+
The single tool that handles all PDF operations.
|
|
243
|
+
|
|
244
|
+
#### Parameters
|
|
245
|
+
|
|
246
|
+
| Parameter | Type | Description | Default |
|
|
247
|
+
|-----------|------|-------------|---------|
|
|
248
|
+
| `sources` | Array | List of PDF sources to process | Required |
|
|
249
|
+
| `include_full_text` | boolean | Extract full text content | `false` |
|
|
250
|
+
| `include_metadata` | boolean | Extract PDF metadata | `true` |
|
|
251
|
+
| `include_page_count` | boolean | Include total page count | `true` |
|
|
252
|
+
| `include_images` | boolean | Extract embedded images | `false` |
|
|
253
|
+
|
|
254
|
+
#### Source Object
|
|
255
|
+
|
|
256
|
+
```typescript
|
|
130
257
|
{
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
134
|
-
{ "url": "https://example.com/doc3.pdf" }
|
|
135
|
-
],
|
|
136
|
-
"include_full_text": true
|
|
258
|
+
path?: string; // Local file path (absolute or relative)
|
|
259
|
+
url?: string; // HTTP/HTTPS URL to PDF
|
|
260
|
+
pages?: string | number[]; // Pages to extract: "1-5,10" or [1,2,3]
|
|
137
261
|
}
|
|
138
262
|
```
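
For callers that build requests in TypeScript, the parameter table and source object above can be written out as types. This is a sketch under assumptions: the names `PdfSource`, `ReadPdfArgs`, and `applyDefaults` are hypothetical and not exported by the package; only the field names and default values are taken from the table.

```typescript
// Mirrors the documented source object.
interface PdfSource {
  path?: string;             // Local file path (absolute or relative)
  url?: string;              // HTTP/HTTPS URL to PDF
  pages?: string | number[]; // "1-5,10" or [1, 2, 3]
}

// Mirrors the documented read_pdf parameters.
interface ReadPdfArgs {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Fills in the defaults listed in the parameter table for any omitted flag.
function applyDefaults(args: ReadPdfArgs): Required<ReadPdfArgs> {
  return {
    sources: args.sources,
    include_full_text: args.include_full_text ?? false,
    include_metadata: args.include_metadata ?? true,
    include_page_count: args.include_page_count ?? true,
    include_images: args.include_images ?? false,
  };
}
```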
|
|
139
263
|
|
|
140
|
-
|
|
264
|
+
#### Examples
|
|
141
265
|
|
|
266
|
+
**Metadata only (fast):**
|
|
142
267
|
```json
|
|
143
268
|
{
|
|
144
|
-
"sources": [
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
269
|
+
"sources": [{ "path": "large.pdf" }],
|
|
270
|
+
"include_metadata": true,
|
|
271
|
+
"include_page_count": true,
|
|
272
|
+
"include_full_text": false
|
|
273
|
+
}
|
|
274
|
+
```
|
|
275
|
+
|
|
276
|
+
**From URL:**
|
|
277
|
+
```json
|
|
278
|
+
{
|
|
279
|
+
"sources": [{
|
|
280
|
+
"url": "https://arxiv.org/pdf/2301.00001.pdf"
|
|
281
|
+
}],
|
|
151
282
|
"include_full_text": true
|
|
152
283
|
}
|
|
153
284
|
```
|
|
154
285
|
|
|
155
|
-
**
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
286
|
+
**Page ranges:**
|
|
287
|
+
```json
|
|
288
|
+
{
|
|
289
|
+
"sources": [{
|
|
290
|
+
"path": "manual.pdf",
|
|
291
|
+
"pages": "1-5,10-15,20" // Pages 1,2,3,4,5,10,11,12,13,14,15,20
|
|
292
|
+
}]
|
|
293
|
+
}
|
|
294
|
+
```
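
To make the range syntax concrete, here is a minimal sketch of how a string such as `"1-5,10-15,20"` could be expanded into individual page numbers. `parsePageRanges` is a hypothetical helper written for this example; the server's own parser may validate input differently.

```typescript
// Expands "1-5,10-15,20" into a sorted list of unique 1-based page numbers.
function parsePageRanges(spec: string): number[] {
  const pages: number[] = [];
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (!match) throw new Error('Invalid page range: "' + part + '"');
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error('Invalid page range: "' + part + '"');
    }
    for (let page = start; page <= end; page++) {
      if (pages.indexOf(page) === -1) pages.push(page); // skip duplicates
    }
  }
  return pages.sort((a, b) => a - b);
}
```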
|
|
295
|
+
|
|
296
|
+
---
|
|
297
|
+
|
|
298
|
+
## 🔧 Advanced Usage
|
|
159
299
|
|
|
160
|
-
|
|
300
|
+
<details>
|
|
301
|
+
<summary><strong>📐 Y-Coordinate Content Ordering</strong></summary>
|
|
161
302
|
|
|
162
|
-
|
|
303
|
+
<br/>
|
|
163
304
|
|
|
164
|
-
|
|
305
|
+
Content is returned in natural reading order based on Y-coordinates:
|
|
306
|
+
|
|
307
|
+
```
|
|
308
|
+
Document Layout:
|
|
309
|
+
┌─────────────────────┐
|
|
310
|
+
│ [Title] Y:100 │
|
|
311
|
+
│ [Image] Y:150 │
|
|
312
|
+
│ [Text] Y:400 │
|
|
313
|
+
│ [Photo A] Y:500 │
|
|
314
|
+
│ [Photo B] Y:550 │
|
|
315
|
+
└─────────────────────┘
|
|
316
|
+
|
|
317
|
+
Response Order:
|
|
318
|
+
[
|
|
319
|
+
{ type: "text", text: "Title..." },
|
|
320
|
+
{ type: "image", data: "..." },
|
|
321
|
+
{ type: "text", text: "..." },
|
|
322
|
+
{ type: "image", data: "..." },
|
|
323
|
+
{ type: "image", data: "..." }
|
|
324
|
+
]
|
|
325
|
+
```
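
The ordering in the diagram above can be sketched as a sort on each item's Y value. This is an illustration only: `ContentItem` and `orderByY` are hypothetical names, and the sketch assumes Y grows downward as drawn. PDF user space normally places the origin at the bottom-left, so a real implementation may need to sort in the opposite direction.

```typescript
// Hypothetical content item: text or image, tagged with its vertical position.
type ContentItem =
  | { type: "text"; y: number; text: string }
  | { type: "image"; y: number; data: string };

// Returns a new array ordered top-to-bottom (ascending Y, as in the diagram).
// The input array is left untouched.
function orderByY(items: ContentItem[]): ContentItem[] {
  return items.slice().sort((a, b) => a.y - b.y);
}
```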
|
|
165
326
|
|
|
166
|
-
|
|
327
|
+
**Benefits:**
|
|
328
|
+
- AI understands spatial relationships
|
|
329
|
+
- Natural document comprehension
|
|
330
|
+
- Perfect for vision-enabled models
|
|
331
|
+
- Automatic multi-line text grouping
|
|
167
332
|
|
|
168
|
-
|
|
169
|
-
- **Range string**: `"1-10"` (extracts pages 1 through 10)
|
|
170
|
-
- **Multiple ranges**: `"1-5,10-15,20"` (commas separate ranges and individual pages)
|
|
171
|
-
- **Omit for all pages**: Don't include the `pages` field to extract all pages
|
|
333
|
+
</details>
|
|
172
334
|
|
|
173
|
-
|
|
335
|
+
<details>
|
|
336
|
+
<summary><strong>🖼️ Image Extraction</strong></summary>
|
|
174
337
|
|
|
175
|
-
|
|
338
|
+
<br/>
|
|
176
339
|
|
|
340
|
+
**Enable extraction:**
|
|
177
341
|
```json
|
|
178
342
|
{
|
|
179
|
-
"sources": [
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
343
|
+
"sources": [{ "path": "manual.pdf" }],
|
|
344
|
+
"include_images": true
|
|
345
|
+
}
|
|
346
|
+
```
|
|
347
|
+
|
|
348
|
+
**Response format:**
|
|
349
|
+
```json
|
|
350
|
+
{
|
|
351
|
+
"images": [{
|
|
352
|
+
"page": 1,
|
|
353
|
+
"index": 0,
|
|
354
|
+
"width": 1920,
|
|
355
|
+
"height": 1080,
|
|
356
|
+
"format": "rgb",
|
|
357
|
+
"data": "base64-encoded-png..."
|
|
358
|
+
}]
|
|
185
359
|
}
|
|
186
360
|
```
|
|
187
361
|
|
|
188
|
-
|
|
362
|
+
**Supported formats:** RGB, RGBA, Grayscale
|
|
363
|
+
**Auto-detected:** JPEG, PNG, and other embedded formats
|
|
189
364
|
|
|
190
|
-
|
|
365
|
+
</details>
|
|
191
366
|
|
|
192
|
-
|
|
367
|
+
<details>
|
|
368
|
+
<summary><strong>📂 Path Configuration</strong></summary>
|
|
193
369
|
|
|
370
|
+
<br/>
|
|
371
|
+
|
|
372
|
+
**Absolute paths** (v1.3.0+) - Direct file access:
|
|
194
373
|
```json
|
|
195
|
-
{
|
|
196
|
-
|
|
197
|
-
|
|
198
|
-
|
|
374
|
+
{ "path": "C:\\Users\\John\\file.pdf" }
|
|
375
|
+
{ "path": "/home/user/file.pdf" }
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
**Relative paths** - Workspace files:
|
|
379
|
+
```json
|
|
380
|
+
{ "path": "docs/report.pdf" }
|
|
381
|
+
{ "path": "./2024/Q1.pdf" }
|
|
199
382
|
```
|
|
200
383
|
|
|
201
|
-
**
|
|
384
|
+
**Configure working directory:**
|
|
202
385
|
```json
|
|
203
386
|
{
|
|
204
|
-
"
|
|
205
|
-
{
|
|
206
|
-
"
|
|
207
|
-
"
|
|
208
|
-
"
|
|
209
|
-
"height": 600,
|
|
210
|
-
"format": "rgb",
|
|
211
|
-
"data": "base64-encoded-image-data..."
|
|
387
|
+
"mcpServers": {
|
|
388
|
+
"pdf-reader-mcp": {
|
|
389
|
+
"command": "npx",
|
|
390
|
+
"args": ["@sylphx/pdf-reader-mcp"],
|
|
391
|
+
"cwd": "/path/to/documents"
|
|
212
392
|
}
|
|
213
|
-
|
|
393
|
+
}
|
|
214
394
|
}
|
|
215
395
|
```
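
The difference between absolute and relative paths can be sketched as follows. This is a simplified illustration, not the package's `pathUtils` code: `isAbsolutePath` and `resolveSourcePath` are hypothetical helpers, and the sketch assumes relative paths are joined onto the configured `cwd`.

```typescript
// Treats "C:\..." / "C:/..." (Windows drive paths) and paths starting with
// "/" or "\" (Unix roots, UNC shares) as absolute.
function isAbsolutePath(p: string): boolean {
  return /^(?:[A-Za-z]:[\\/]|[\\/])/.test(p);
}

// Absolute paths are returned unchanged; relative paths are joined onto cwd.
function resolveSourcePath(cwd: string, p: string): string {
  if (isAbsolutePath(p)) return p;
  const base = cwd.replace(/[\\/]+$/, ""); // drop trailing separators
  const rel = p.replace(/^\.[\\/]/, "");   // drop a leading "./"
  return base + "/" + rel;
}
```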
|
|
216
396
|
|
|
217
|
-
|
|
218
|
-
- ✅ **RGB** - Standard color images (most common)
|
|
219
|
-
- ✅ **RGBA** - Images with transparency
|
|
220
|
-
- ✅ **Grayscale** - Black and white images
|
|
221
|
-
- ✅ Works with JPEG, PNG, and other embedded formats
|
|
397
|
+
</details>
|
|
222
398
|
|
|
223
|
-
|
|
224
|
-
|
|
225
|
-
- 🔸 Useful for AI models with vision capabilities
|
|
226
|
-
- 🔸 Set `include_images: false` (default) to extract text only
|
|
227
|
-
- 🔸 Combine with `pages` parameter to limit extraction scope
|
|
399
|
+
<details>
|
|
400
|
+
<summary><strong>📊 Large PDF Strategies</strong></summary>
|
|
228
401
|
|
|
229
|
-
|
|
402
|
+
<br/>
|
|
230
403
|
|
|
231
|
-
**
|
|
404
|
+
**Strategy 1: Page ranges**
|
|
405
|
+
```json
|
|
406
|
+
{ "sources": [{ "path": "big.pdf", "pages": "1-20" }] }
|
|
407
|
+
```
|
|
232
408
|
|
|
233
|
-
|
|
234
|
-
|
|
409
|
+
**Strategy 2: Progressive loading**
|
|
410
|
+
```json
|
|
411
|
+
// Step 1: Get page count
|
|
412
|
+
{ "sources": [{ "path": "big.pdf" }], "include_full_text": false }
|
|
235
413
|
|
|
236
|
-
|
|
414
|
+
// Step 2: Extract sections
|
|
415
|
+
{ "sources": [{ "path": "big.pdf", "pages": "50-75" }] }
|
|
416
|
+
```
|
|
417
|
+
|
|
418
|
+
**Strategy 3: Parallel batching**
|
|
419
|
+
```json
|
|
420
|
+
{
|
|
421
|
+
"sources": [
|
|
422
|
+
{ "path": "big.pdf", "pages": "1-50" },
|
|
423
|
+
{ "path": "big.pdf", "pages": "51-100" }
|
|
424
|
+
]
|
|
425
|
+
}
|
|
426
|
+
```
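
Strategies 2 and 3 both need the page count split into range strings. A minimal sketch of that step, assuming the page count has already been fetched with a metadata-only request; `chunkPageRanges` is a hypothetical helper, not part of the package.

```typescript
// Splits pages 1..totalPages into consecutive range strings of at most chunkSize pages.
function chunkPageRanges(totalPages: number, chunkSize: number): string[] {
  if (Math.floor(totalPages) !== totalPages || totalPages < 1) {
    throw new Error("totalPages must be a positive integer");
  }
  if (Math.floor(chunkSize) !== chunkSize || chunkSize < 1) {
    throw new Error("chunkSize must be a positive integer");
  }
  const ranges: string[] = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    const end = Math.min(start + chunkSize - 1, totalPages);
    // A single trailing page is written as "101" rather than "101-101".
    ranges.push(start === end ? String(start) : start + "-" + end);
  }
  return ranges;
}
```

Each returned string can be used as the `pages` value of one entry in `sources`, as in Strategy 3.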
|
|
427
|
+
|
|
428
|
+
</details>
|
|
429
|
+
|
|
430
|
+
---
|
|
237
431
|
|
|
238
432
|
## 🔧 Troubleshooting
|
|
239
433
|
|
|
240
|
-
###
|
|
434
|
+
### "Absolute paths are not allowed"
|
|
241
435
|
|
|
242
|
-
**Solution
|
|
436
|
+
**Solution:** Upgrade to v1.3.0+
|
|
243
437
|
|
|
244
438
|
```bash
|
|
245
|
-
npm
|
|
246
|
-
npx @sylphx/pdf-reader-mcp@latest
|
|
439
|
+
npm update @sylphx/pdf-reader-mcp
|
|
247
440
|
```
|
|
248
441
|
|
|
249
|
-
Restart your MCP client completely
|
|
442
|
+
Restart your MCP client completely.
|
|
443
|
+
|
|
444
|
+
---
|
|
250
445
|
|
|
251
|
-
###
|
|
446
|
+
### "File not found"
|
|
252
447
|
|
|
253
|
-
**Causes
|
|
448
|
+
**Causes:**
|
|
449
|
+
- The file doesn't exist at the given path
|
|
450
|
+
- Wrong working directory
|
|
451
|
+
- Permission issues
|
|
254
452
|
|
|
255
|
-
|
|
256
|
-
2. Incorrect working directory
|
|
453
|
+
**Solutions:**
|
|
257
454
|
|
|
258
|
-
|
|
455
|
+
Use absolute path:
|
|
456
|
+
```json
|
|
457
|
+
{ "path": "C:\\Full\\Path\\file.pdf" }
|
|
458
|
+
```
|
|
259
459
|
|
|
460
|
+
Or configure `cwd`:
|
|
260
461
|
```json
|
|
261
462
|
{
|
|
262
|
-
"
|
|
263
|
-
"
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
"cwd": "/path/to/your/project"
|
|
267
|
-
}
|
|
463
|
+
"pdf-reader-mcp": {
|
|
464
|
+
"command": "npx",
|
|
465
|
+
"args": ["@sylphx/pdf-reader-mcp"],
|
|
466
|
+
"cwd": "/path/to/docs"
|
|
268
467
|
}
|
|
269
468
|
}
|
|
270
469
|
```
|
|
271
470
|
|
|
272
|
-
|
|
471
|
+
---
|
|
472
|
+
|
|
473
|
+
### "No tools showing up"
|
|
273
474
|
|
|
274
|
-
**Solution
|
|
475
|
+
**Solution:**
|
|
275
476
|
|
|
276
477
|
```bash
|
|
277
|
-
npm
|
|
478
|
+
npm cache clean --force
|
|
479
|
+
rm -rf node_modules package-lock.json
|
|
480
|
+
npm install @sylphx/pdf-reader-mcp@latest
|
|
278
481
|
```
|
|
279
482
|
|
|
280
|
-
|
|
483
|
+
Restart your MCP client completely.
|
|
484
|
+
|
|
485
|
+
---
|
|
281
486
|
|
|
282
487
|
## ⚡ Performance
|
|
283
488
|
|
|
284
|
-
Benchmarks
|
|
489
|
+
### Benchmarks
|
|
490
|
+
|
|
491
|
+
| Operation | Ops/sec | Performance |
|
|
492
|
+
|:----------|:--------|:------------|
|
|
493
|
+
| Error handling | ~12,933 | ⚡⚡⚡⚡⚡ |
|
|
494
|
+
| Extract full text | ~5,575 | ⚡⚡⚡⚡ |
|
|
495
|
+
| Extract page | ~5,329 | ⚡⚡⚡⚡ |
|
|
496
|
+
| Multiple pages | ~5,242 | ⚡⚡⚡⚡ |
|
|
497
|
+
| Metadata only | ~4,912 | ⚡⚡⚡ |
|
|
498
|
+
|
|
499
|
+
### Parallel Processing
|
|
285
500
|
|
|
286
|
-
|
|
|
287
|
-
|
|
288
|
-
|
|
|
289
|
-
|
|
|
290
|
-
|
|
|
291
|
-
| Get Multiple Pages | ~5,242 | |
|
|
292
|
-
| Get Metadata & Page Count | ~4,912 | Slowest |
|
|
501
|
+
| Document | Speedup |
|
|
502
|
+
|:---------|:--------|
|
|
503
|
+
| 10-page PDF | **5-8x faster** |
|
|
504
|
+
| 50-page PDF | **10x faster** |
|
|
505
|
+
| 100+ pages | **Linear scaling** with CPU cores |
|
|
293
506
|
|
|
294
|
-
|
|
507
|
+
*Benchmarks vary based on PDF complexity and system resources.*
|
|
295
508
|
|
|
296
|
-
|
|
509
|
+
---
|
|
297
510
|
|
|
298
511
|
## 🏗️ Architecture
|
|
299
512
|
|
|
300
513
|
### Tech Stack
|
|
301
514
|
|
|
302
|
-
|
|
303
|
-
|
|
304
|
-
|
|
305
|
-
|
|
306
|
-
|
|
307
|
-
|
|
308
|
-
|
|
309
|
-
|
|
515
|
+
| Component | Technology |
|
|
516
|
+
|:----------|:-----------|
|
|
517
|
+
| **Runtime** | Node.js 22+ ESM |
|
|
518
|
+
| **PDF Engine** | PDF.js (Mozilla) |
|
|
519
|
+
| **Validation** | Zod + JSON Schema |
|
|
520
|
+
| **Protocol** | MCP SDK |
|
|
521
|
+
| **Language** | TypeScript (strict) |
|
|
522
|
+
| **Testing** | Vitest (103 tests) |
|
|
523
|
+
| **Quality** | Biome (50x faster) |
|
|
524
|
+
| **CI/CD** | GitHub Actions |
|
|
310
525
|
|
|
311
526
|
### Design Principles
|
|
312
527
|
|
|
313
|
-
|
|
314
|
-
|
|
315
|
-
|
|
316
|
-
|
|
317
|
-
|
|
528
|
+
- 🔒 **Security First** - Flexible paths with secure defaults
|
|
529
|
+
- 🎯 **Simple Interface** - One tool, all operations
|
|
530
|
+
- ⚡ **Performance** - Parallel processing, efficient memory
|
|
531
|
+
- 🛡️ **Reliability** - Per-page isolation, detailed errors
|
|
532
|
+
- 🧪 **Quality** - 94%+ coverage, strict TypeScript
|
|
533
|
+
- 📝 **Type Safety** - No `any` types, strict mode
|
|
534
|
+
- 🔄 **Backward Compatible** - Smooth upgrades always
|
|
318
535
|
|
|
319
|
-
|
|
536
|
+
---
|
|
320
537
|
|
|
321
538
|
## 🧪 Development
|
|
322
539
|
|
|
323
|
-
|
|
540
|
+
<details>
|
|
541
|
+
<summary><strong>Setup & Scripts</strong></summary>
|
|
542
|
+
|
|
543
|
+
<br/>
|
|
324
544
|
|
|
545
|
+
**Prerequisites:**
|
|
325
546
|
- Node.js >= 22.0.0
|
|
326
547
|
- pnpm (recommended) or npm
|
|
327
548
|
|
|
328
|
-
|
|
329
|
-
|
|
549
|
+
**Setup:**
|
|
330
550
|
```bash
|
|
331
|
-
git clone https://github.com/
|
|
551
|
+
git clone https://github.com/sylphxltd/pdf-reader-mcp.git
|
|
332
552
|
cd pdf-reader-mcp
|
|
333
|
-
pnpm install
|
|
553
|
+
pnpm install && pnpm build
|
|
334
554
|
```
|
|
335
555
|
|
|
336
|
-
|
|
337
|
-
|
|
556
|
+
**Scripts:**
|
|
338
557
|
```bash
|
|
339
|
-
pnpm run build
|
|
340
|
-
pnpm run
|
|
341
|
-
pnpm run test
|
|
342
|
-
pnpm run
|
|
343
|
-
pnpm run
|
|
344
|
-
pnpm run
|
|
345
|
-
pnpm run check:fix # Fix Biome issues automatically
|
|
346
|
-
pnpm run lint # Lint with Biome
|
|
347
|
-
pnpm run format # Format with Biome
|
|
348
|
-
pnpm run typecheck # TypeScript type checking
|
|
349
|
-
pnpm run benchmark # Run performance benchmarks
|
|
350
|
-
pnpm run validate # Full validation (check + test)
|
|
558
|
+
pnpm run build # Build TypeScript
|
|
559
|
+
pnpm run test # Run 103 tests
|
|
560
|
+
pnpm run test:cov # Coverage (94%+)
|
|
561
|
+
pnpm run check # Lint + format
|
|
562
|
+
pnpm run check:fix # Auto-fix
|
|
563
|
+
pnpm run benchmark # Performance tests
|
|
351
564
|
```
|
|
352
565
|
|
|
353
|
-
|
|
354
|
-
|
|
355
|
-
|
|
566
|
+
**Quality:**
|
|
567
|
+
- ✅ 103 tests
|
|
568
|
+
- ✅ 94%+ coverage
|
|
569
|
+
- ✅ 98%+ function coverage
|
|
570
|
+
- ✅ Zero lint errors
|
|
571
|
+
- ✅ Strict TypeScript
|
|
356
572
|
|
|
357
|
-
|
|
358
|
-
pnpm run test # Run all tests
|
|
359
|
-
pnpm run test:cov # Run with coverage report
|
|
360
|
-
```
|
|
573
|
+
</details>
|
|
361
574
|
|
|
362
|
-
|
|
575
|
+
<details>
|
|
576
|
+
<summary><strong>Contributing</strong></summary>
|
|
363
577
|
|
|
364
|
-
|
|
578
|
+
<br/>
|
|
365
579
|
|
|
366
|
-
|
|
580
|
+
**Quick Start:**
|
|
581
|
+
1. Fork repository
|
|
582
|
+
2. Create branch: `git checkout -b feature/awesome`
|
|
583
|
+
3. Make changes and run the tests: `pnpm test`
|
|
584
|
+
4. Format: `pnpm run check:fix`
|
|
585
|
+
5. Commit: Use [Conventional Commits](https://www.conventionalcommits.org/)
|
|
586
|
+
6. Open PR
|
|
367
587
|
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
|
|
588
|
+
**Commit Format:**
|
|
589
|
+
```
|
|
590
|
+
feat(images): add WebP support
|
|
591
|
+
fix(paths): handle UNC paths
|
|
592
|
+
docs(readme): update examples
|
|
371
593
|
```
|
|
372
594
|
|
|
373
|
-
|
|
374
|
-
|
|
375
|
-
We welcome contributions! Please:
|
|
595
|
+
See [CONTRIBUTING.md](./CONTRIBUTING.md)
|
|
376
596
|
|
|
377
|
-
|
|
378
|
-
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
|
|
379
|
-
3. Make your changes and ensure tests pass
|
|
380
|
-
4. Run `pnpm run check:fix` to format code
|
|
381
|
-
5. Commit using [Conventional Commits](https://www.conventionalcommits.org/)
|
|
382
|
-
6. Open a Pull Request
|
|
597
|
+
</details>
|
|
383
598
|
|
|
384
|
-
|
|
599
|
+
---
|
|
385
600
|
|
|
386
601
|
## 📚 Documentation
|
|
387
602
|
|
|
388
|
-
-
|
|
389
|
-
-
|
|
390
|
-
-
|
|
391
|
-
-
|
|
392
|
-
-
|
|
393
|
-
-
|
|
603
|
+
- 📖 [Full Docs](https://sylphxltd.github.io/pdf-reader-mcp/) - Complete guides
|
|
604
|
+
- 🚀 [Getting Started](./docs/guide/getting-started.md) - Quick start
|
|
605
|
+
- 📘 [API Reference](./docs/api/README.md) - Detailed API
|
|
606
|
+
- 🏗️ [Design](./docs/design/index.md) - Architecture
|
|
607
|
+
- ⚡ [Performance](./docs/performance/index.md) - Benchmarks
|
|
608
|
+
- 🔍 [Comparison](./docs/comparison/index.md) - vs. alternatives
|
|
609
|
+
|
|
610
|
+
---
|
|
394
611
|
|
|
395
612
|
## 🗺️ Roadmap
|
|
396
613
|
|
|
397
|
-
|
|
398
|
-
- [x]
|
|
399
|
-
- [
|
|
400
|
-
- [
|
|
401
|
-
- [
|
|
402
|
-
- [
|
|
403
|
-
|
|
614
|
+
**✅ Completed**
|
|
615
|
+
- [x] Image extraction (v1.1.0)
|
|
616
|
+
- [x] 5-10x parallel speedup (v1.1.0)
|
|
617
|
+
- [x] Y-coordinate ordering (v1.2.0)
|
|
618
|
+
- [x] Absolute paths (v1.3.0)
|
|
619
|
+
- [x] 94%+ test coverage (v1.3.0)
|
|
620
|
+
|
|
621
|
+
**🚀 Coming Soon**
|
|
622
|
+
- [ ] OCR for scanned PDFs
|
|
623
|
+
- [ ] Annotation extraction
|
|
624
|
+
- [ ] Form field extraction
|
|
625
|
+
- [ ] Table detection
|
|
626
|
+
- [ ] 100+ MB streaming
|
|
627
|
+
- [ ] Advanced caching
|
|
628
|
+
- [ ] PDF generation
|
|
629
|
+
|
|
630
|
+
Vote at [Discussions](https://github.com/sylphxltd/pdf-reader-mcp/discussions)
|
|
631
|
+
|
|
632
|
+
---
|
|
633
|
+
|
|
634
|
+
## 🤝 Support
|
|
635
|
+
|
|
636
|
+
[](https://github.com/sylphxltd/pdf-reader-mcp/issues)
|
|
637
|
+
[](https://github.com/sylphxltd/pdf-reader-mcp/discussions)
|
|
638
|
+
|
|
639
|
+
- 🐛 [Bug Reports](https://github.com/sylphxltd/pdf-reader-mcp/issues)
|
|
640
|
+
- 💬 [Discussions](https://github.com/sylphxltd/pdf-reader-mcp/discussions)
|
|
641
|
+
- 📖 [Contributing](./CONTRIBUTING.md)
|
|
642
|
+
- 📧 contact@sylphx.com
|
|
404
643
|
|
|
405
|
-
|
|
644
|
+
**Show Your Support:**
|
|
645
|
+
⭐ Star • 👀 Watch • 🐛 Report bugs • 💡 Suggest features • 🔀 Contribute
|
|
406
646
|
|
|
407
|
-
|
|
408
|
-
|
|
409
|
-
|
|
647
|
+
---
|
|
648
|
+
|
|
649
|
+
## 📊 Stats
|
|
410
650
|
|
|
411
|
-
|
|
651
|
+

|
|
652
|
+

|
|
653
|
+

|
|
654
|
+

|
|
412
655
|
|
|
413
|
-
|
|
414
|
-
|
|
415
|
-
|
|
416
|
-
|
|
417
|
-
|
|
656
|
+
**103 Tests** • **94%+ Coverage** • **Production Ready**
|
|
657
|
+
|
|
658
|
+
---
|
|
659
|
+
|
|
660
|
+
## 🏆 Recognition
|
|
661
|
+
|
|
662
|
+
**Featured on:**
|
|
663
|
+
- [Smithery](https://smithery.ai/server/@sylphx/pdf-reader-mcp) - MCP directory
|
|
664
|
+
- [Glama](https://glama.ai/mcp/servers/@sylphx/pdf-reader-mcp) - AI marketplace
|
|
665
|
+
- [MseeP.ai](https://mseep.ai/app/sylphxltd-pdf-reader-mcp) - Security validated
|
|
666
|
+
|
|
667
|
+
**Trusted worldwide** • **Enterprise adoption** • **Battle-tested**
|
|
668
|
+
|
|
669
|
+
---
|
|
418
670
|
|
|
419
671
|
## 📄 License
|
|
420
672
|
|
|
421
|
-
|
|
673
|
+
MIT License - Free for personal and commercial use.
|
|
674
|
+
|
|
675
|
+
See [LICENSE](./LICENSE) for details.
|
|
422
676
|
|
|
423
677
|
---
|
|
424
678
|
|
|
425
|
-
|
|
679
|
+
<div align="center">
|
|
680
|
+
|
|
681
|
+
**Built with ❤️ by [Sylphx](https://sylphx.com)**
|
|
682
|
+
|
|
683
|
+
*Building the future of AI-powered document processing*
|
|
684
|
+
|
|
685
|
+
[⬆ Back to Top](#pdf-reader-mcp-)
|
|
686
|
+
|
|
687
|
+
</div>
|