@sylphx/pdf-reader-mcp 1.0.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 SylphLab
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,390 @@
1
+ # PDF Reader MCP Server
2
+
3
+ [![MseeP.ai Security Assessment Badge](https://mseep.net/pr/sylphxltd-pdf-reader-mcp-badge.png)](https://mseep.ai/app/sylphxltd-pdf-reader-mcp)
4
+ [![CI/CD Pipeline](https://github.com/sylphlab/pdf-reader-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/sylphlab/pdf-reader-mcp/actions/workflows/ci.yml)
5
+ [![codecov](https://codecov.io/gh/sylphlab/pdf-reader-mcp/graph/badge.svg?token=VYRQFB40UN)](https://codecov.io/gh/sylphlab/pdf-reader-mcp)
6
+ [![npm version](https://badge.fury.io/js/%40sylphlab%2Fpdf-reader-mcp.svg)](https://badge.fury.io/js/%40sylphlab%2Fpdf-reader-mcp)
7
+ [![Docker Pulls](https://img.shields.io/docker/pulls/sylphlab/pdf-reader-mcp.svg)](https://hub.docker.com/r/sylphlab/pdf-reader-mcp)
8
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
9
+ [![smithery badge](https://smithery.ai/badge/@sylphxltd/pdf-reader-mcp)](https://smithery.ai/server/@sylphxltd/pdf-reader-mcp)
10
+
11
+ <a href="https://glama.ai/mcp/servers/@sylphlab/pdf-reader-mcp">
12
+ <img width="380" height="200" src="https://glama.ai/mcp/servers/@sylphlab/pdf-reader-mcp/badge" alt="PDF Reader Server MCP server" />
13
+ </a>
14
+
15
+ **Empower your AI agents** with the ability to securely read and extract information from PDF files using the Model Context Protocol (MCP).
16
+
17
+ ## ✨ Features
18
+
19
+ - 📄 **Extract text content** from PDF files (full document or specific pages)
20
+ - 📊 **Get metadata** (author, title, creation date, etc.)
21
+ - 🔢 **Count pages** in PDF documents
22
+ - 🌐 **Support for both local files and URLs**
23
+ - 🛡️ **Secure** - Confines file access to project root directory
24
+ - ⚡ **Fast** - Powered by PDF.js with optimized performance
25
+ - 🔄 **Batch processing** - Handle multiple PDFs in a single request
26
+ - 📦 **Multiple deployment options** - npm, Docker, or Smithery
27
+
28
+ ## 🆕 Recent Updates (October 2025)
29
+
30
+ - ✅ **Fixed critical bugs**: Buffer/Uint8Array compatibility for PDF.js v5.x
31
+ - ✅ **Fixed schema validation**: Resolved `exclusiveMinimum` issue affecting Windsurf, Mistral API, and other tools
32
+ - ✅ **Improved metadata extraction**: Robust fallback handling for PDF.js compatibility
33
+ - ✅ **Updated dependencies**: All packages updated to latest versions
34
+ - ✅ **Migrated to Biome**: 50x faster linting and formatting with unified tooling
35
+ - ✅ **Added Docker support**: Easy deployment with containerization
36
+ - ✅ **All tests passing**: 31/31 tests with comprehensive coverage
37
+
38
+ ## 📦 Installation
39
+
40
+ ### Option 1: Using Smithery (Easiest)
41
+
42
+ Install automatically for Claude Desktop:
43
+
44
+ ```bash
45
+ npx -y @smithery/cli install @sylphxltd/pdf-reader-mcp --client claude
46
+ ```
47
+
48
+ ### Option 2: Using npm/pnpm (Recommended)
49
+
50
+ Install the package:
51
+
52
+ ```bash
53
+ pnpm add @sylphx/pdf-reader-mcp
54
+ # or
55
+ npm install @sylphx/pdf-reader-mcp
56
+ ```
57
+
58
+ Configure your MCP client (e.g., Claude Desktop, Cursor):
59
+
60
+ ```json
61
+ {
62
+ "mcpServers": {
63
+ "pdf-reader-mcp": {
64
+ "command": "npx",
65
+ "args": ["@sylphx/pdf-reader-mcp"]
66
+ }
67
+ }
68
+ }
69
+ ```
70
+
71
+ **Important:** Make sure your MCP client sets the correct working directory (`cwd`) to your project root.
72
+
73
+ ### Option 3: Using Docker
74
+
75
+ Pull the image:
76
+
77
+ ```bash
78
+ docker pull sylphx/pdf-reader-mcp:latest
79
+ ```
80
+
81
+ Configure your MCP client:
82
+
83
+ ```json
84
+ {
85
+ "mcpServers": {
86
+ "pdf-reader-mcp": {
87
+ "command": "docker",
88
+ "args": [
89
+ "run",
90
+ "-i",
91
+ "--rm",
92
+ "-v",
93
+ "/path/to/your/project:/app",
94
+ "sylphx/pdf-reader-mcp:latest"
95
+ ]
96
+ }
97
+ }
98
+ }
99
+ ```
100
+
101
+ ### Option 4: Local Development Build
102
+
103
+ ```bash
104
+ git clone https://github.com/sylphlab/pdf-reader-mcp.git
105
+ cd pdf-reader-mcp
106
+ pnpm install
107
+ pnpm run build
108
+ ```
109
+
110
+ Then configure your MCP client to use `node dist/index.js`.
111
+
112
+ ## 🚀 Quick Start
113
+
114
+ Once configured, your AI agent can read PDFs using the `read_pdf` tool:
115
+
116
+ ### Example 1: Extract text from specific pages
117
+
118
+ ```json
119
+ {
120
+ "sources": [
121
+ {
122
+ "path": "documents/report.pdf",
123
+ "pages": [1, 2, 3]
124
+ }
125
+ ],
126
+ "include_metadata": true
127
+ }
128
+ ```
129
+
130
+ ### Example 2: Get metadata and page count only
131
+
132
+ ```json
133
+ {
134
+ "sources": [{ "path": "documents/report.pdf" }],
135
+ "include_metadata": true,
136
+ "include_page_count": true,
137
+ "include_full_text": false
138
+ }
139
+ ```
140
+
141
+ ### Example 3: Read from URL
142
+
143
+ ```json
144
+ {
145
+ "sources": [
146
+ {
147
+ "url": "https://example.com/document.pdf"
148
+ }
149
+ ],
150
+ "include_full_text": true
151
+ }
152
+ ```
153
+
154
+ ### Example 4: Process multiple PDFs
155
+
156
+ ```json
157
+ {
158
+ "sources": [
159
+ { "path": "doc1.pdf", "pages": "1-5" },
160
+ { "path": "doc2.pdf" },
161
+ { "url": "https://example.com/doc3.pdf" }
162
+ ],
163
+ "include_full_text": true
164
+ }
165
+ ```
166
+
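Each entry in `sources` is expected to carry either `path` or `url`, but not both. The following is a minimal client-side sketch of that rule, useful for catching a malformed request before it is sent. The helper name `checkSources` is hypothetical and not part of the package; the server performs its own validation regardless.

```javascript
// Hypothetical pre-flight check (not part of @sylphx/pdf-reader-mcp).
// Returns a list of human-readable problems; an empty list means every
// source has exactly one of `path` or `url`.
function checkSources(sources) {
  return sources
    .map((source, index) => {
      const hasPath = typeof source.path === 'string' && source.path.length > 0;
      const hasUrl = typeof source.url === 'string' && source.url.length > 0;
      // Valid only when exactly one of the two is present.
      return hasPath !== hasUrl
        ? null
        : `sources[${index}] must have either 'path' or 'url', but not both.`;
    })
    .filter((problem) => problem !== null);
}

const problems = checkSources([
  { path: 'doc1.pdf', pages: '1-5' },
  { url: 'https://example.com/doc3.pdf' },
]);
console.log(problems.length === 0 ? 'request looks valid' : problems.join('\n'));
```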
167
+ ## 📖 Usage Guide
168
+
169
+ ### Page Specification
170
+
171
+ You can specify pages in multiple ways:
172
+
173
+ - **Array of page numbers**: `[1, 3, 5]` (1-based indexing)
174
+ - **Range string**: `"1-10"` (extracts pages 1 through 10)
175
+ - **Multiple ranges**: `"1-5,10-15,20"` (commas separate ranges and individual pages)
176
+ - **Omit for all pages**: Don't include the `pages` field to extract all pages
177
+
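To make the range-string forms above concrete, here is a minimal sketch of how a string such as `"1-5,10-15,20"` could be expanded into a sorted list of page numbers. The function name `expandPageSpec` is hypothetical, and the sketch covers only the forms listed above; it is an illustration, not the server's actual parser.

```javascript
// Hypothetical helper: expands a page specification string into a sorted,
// de-duplicated array of 1-based page numbers.
function expandPageSpec(spec) {
  const pages = new Set();
  for (const part of spec.split(',')) {
    const trimmed = part.trim();
    // Accept either a single page ("20") or a closed range ("10-15").
    const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
    if (!match) {
      throw new Error(`Invalid page spec part: "${trimmed}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || start > end) {
      throw new Error(`Invalid page range: "${trimmed}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return [...pages].sort((a, b) => a - b);
}

console.log(expandPageSpec('1-5,10-15,20'));
```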
178
+ ### Working with Large PDFs
179
+
180
+ For large PDF files (>20 MB), extract specific pages instead of the full document:
181
+
182
+ ```json
183
+ {
184
+ "sources": [
185
+ {
186
+ "path": "large-document.pdf",
187
+ "pages": "1-10"
188
+ }
189
+ ]
190
+ }
191
+ ```
192
+
193
+ This prevents hitting AI model context limits and improves performance.
194
+
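One way to work through a large document is to request the page count first (`include_page_count: true`, `include_full_text: false`) and then read the pages in fixed-size batches. The sketch below builds the range strings for those follow-up requests; `pageBatches` and the batch size of 10 are illustrative assumptions, not part of the package.

```javascript
// Hypothetical helper: splits pages 1..totalPages into range strings
// of at most `batchSize` pages each, suitable for the `pages` field.
function pageBatches(totalPages, batchSize) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError('totalPages must be a positive integer');
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const ranges = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages);
    // A batch of one page is written as a single number, not "n-n".
    ranges.push(start === end ? String(start) : `${start}-${end}`);
  }
  return ranges;
}

// Build one request body per batch for a 25-page document.
const requests = pageBatches(25, 10).map((pages) => ({
  sources: [{ path: 'large-document.pdf', pages }],
}));
console.log(JSON.stringify(requests, null, 2));
```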
195
+ ### Security: Relative Paths Only
196
+
197
+ **Important:** The server only accepts **relative paths** for security reasons. Absolute paths are blocked to prevent unauthorized file system access.
198
+
199
+ ✅ **Good**: `"path": "documents/report.pdf"`
200
+ ❌ **Bad**: `"path": "/Users/john/documents/report.pdf"`
201
+
202
+ **Solution**: Configure the `cwd` (current working directory) in your MCP client settings.
203
+
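The idea behind this restriction can be sketched with Node's built-in `path` module: resolve the user-supplied path against the project root, then reject anything that ends up outside it. This is a standalone CommonJS illustration under stated assumptions; `resolveInsideRoot` is a hypothetical name and the check shown is one common approach, not necessarily the exact one the server uses.

```javascript
const path = require('node:path');

// Hypothetical helper: resolves `userPath` against `root` and throws if the
// result would fall outside `root`.
function resolveInsideRoot(root, userPath) {
  if (path.isAbsolute(userPath)) {
    throw new Error('Absolute paths are not allowed.');
  }
  const absoluteRoot = path.resolve(root);
  const resolved = path.resolve(absoluteRoot, userPath);
  // path.relative yields ".." segments when `resolved` is outside the root.
  // Comparing whole segments avoids accepting sibling directories that merely
  // share a name prefix with the root (e.g. "project" vs "project-secrets").
  const relative = path.relative(absoluteRoot, resolved);
  const escapesRoot =
    relative === '..' ||
    relative.startsWith(`..${path.sep}`) ||
    path.isAbsolute(relative);
  if (escapesRoot) {
    throw new Error('Path traversal detected.');
  }
  return resolved;
}

console.log(resolveInsideRoot(process.cwd(), 'documents/report.pdf'));
```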
204
+ ## 🔧 Troubleshooting
205
+
206
+ ### Issue: "No tools" showing up
207
+
208
+ **Solution**: Clear npm cache and reinstall:
209
+
210
+ ```bash
211
+ npm cache clean --force
212
+ npx @sylphx/pdf-reader-mcp@latest
213
+ ```
214
+
215
+ Restart your MCP client completely after updating.
216
+
217
+ ### Issue: "File not found" errors
218
+
219
+ **Causes**:
220
+
221
+ 1. Using absolute paths (not allowed for security)
222
+ 2. Incorrect working directory
223
+
224
+ **Solution**: Use relative paths and configure `cwd` in your MCP client:
225
+
226
+ ```json
227
+ {
228
+ "mcpServers": {
229
+ "pdf-reader-mcp": {
230
+ "command": "npx",
231
+ "args": ["@sylphx/pdf-reader-mcp"],
232
+ "cwd": "/path/to/your/project"
233
+ }
234
+ }
235
+ }
236
+ ```
237
+
238
+ ### Issue: Cursor/Claude Code compatibility
239
+
240
+ **Solution**: Update to the latest version (all recent compatibility issues have been fixed):
241
+
242
+ ```bash
243
+ npm install @sylphx/pdf-reader-mcp@latest
244
+ ```
245
+
246
+ Then restart your editor completely.
247
+
248
+ ## ⚡ Performance
249
+
250
+ Benchmarks on a standard PDF file:
251
+
252
+ | Operation | Ops/sec | Speed |
253
+ | :------------------------------- | :-------- | :--------- |
254
+ | Handle Non-Existent File | ~12,933 | Fastest |
255
+ | Get Full Text | ~5,575 | |
256
+ | Get Specific Page | ~5,329 | |
257
+ | Get Multiple Pages | ~5,242 | |
258
+ | Get Metadata & Page Count | ~4,912 | Slowest |
259
+
260
+ _Performance varies based on PDF complexity and system resources._
261
+
262
+ See [Performance Documentation](./docs/performance/index.md) for details.
263
+
264
+ ## 🏗️ Architecture
265
+
266
+ ### Tech Stack
267
+
268
+ - **Runtime**: Node.js 22+
269
+ - **PDF Processing**: PDF.js (pdfjs-dist)
270
+ - **Validation**: Zod with JSON Schema generation
271
+ - **Protocol**: Model Context Protocol (MCP) SDK
272
+ - **Build**: TypeScript
273
+ - **Testing**: Vitest with 100% coverage goal
274
+ - **Code Quality**: Biome (linting + formatting)
275
+ - **CI/CD**: GitHub Actions
276
+
277
+ ### Design Principles
278
+
279
+ 1. **Security First**: Strict path validation and sandboxing
280
+ 2. **Simple Interface**: Single tool handles all PDF operations
281
+ 3. **Structured Output**: Predictable JSON format for AI parsing
282
+ 4. **Performance**: Efficient caching and lazy loading
283
+ 5. **Reliability**: Comprehensive error handling and validation
284
+
285
+ See [Design Philosophy](./docs/design/index.md) for more details.
286
+
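As an illustration of the structured-output principle, the sketch below unpacks a tool response on the client side. It assumes the response carries a single text item whose body is a JSON object with a `results` array, and that each result has `source`, `success`, and either `data` or `error`; these field names follow the handler code shipped in this package version, while `summarizeResults` and the sample response are made up for the example.

```javascript
// Hypothetical client-side helper: turns a read_pdf tool response into
// one summary line per source.
function summarizeResults(toolResponse) {
  const textItem = toolResponse.content.find((item) => item.type === 'text');
  if (!textItem) {
    throw new Error('No text content in tool response.');
  }
  const { results } = JSON.parse(textItem.text);
  return results.map((result) =>
    result.success
      ? `${result.source}: ok (${result.data?.num_pages ?? 'unknown'} pages)`
      : `${result.source}: failed (${result.error})`
  );
}

// Sample response constructed by hand for illustration only.
const sampleResponse = {
  content: [
    {
      type: 'text',
      text: JSON.stringify({
        results: [
          { source: 'doc1.pdf', success: true, data: { num_pages: 3 } },
          { source: 'doc2.pdf', success: false, error: "File not found at 'doc2.pdf'." },
        ],
      }),
    },
  ],
};
console.log(summarizeResults(sampleResponse).join('\n'));
```

Because failures are reported per source, one unreadable file does not prevent the other results in the same batch from being used.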
287
+ ## 🧪 Development
288
+
289
+ ### Prerequisites
290
+
291
+ - Node.js >= 22.0.0
292
+ - pnpm (recommended) or npm
293
+
294
+ ### Setup
295
+
296
+ ```bash
297
+ git clone https://github.com/sylphlab/pdf-reader-mcp.git
298
+ cd pdf-reader-mcp
299
+ pnpm install
300
+ ```
301
+
302
+ ### Available Scripts
303
+
304
+ ```bash
305
+ pnpm run build # Build TypeScript to dist/
306
+ pnpm run watch # Build in watch mode
307
+ pnpm run test # Run tests
308
+ pnpm run test:watch # Run tests in watch mode
309
+ pnpm run test:cov # Run tests with coverage
310
+ pnpm run check # Run Biome (lint + format check)
311
+ pnpm run check:fix # Fix Biome issues automatically
312
+ pnpm run lint # Lint with Biome
313
+ pnpm run format # Format with Biome
314
+ pnpm run typecheck # TypeScript type checking
315
+ pnpm run benchmark # Run performance benchmarks
316
+ pnpm run validate # Full validation (check + test)
317
+ ```
318
+
319
+ ### Testing
320
+
321
+ We maintain high test coverage using Vitest:
322
+
323
+ ```bash
324
+ pnpm run test # Run all tests
325
+ pnpm run test:cov # Run with coverage report
326
+ ```
327
+
328
+ All tests must pass before merging. Current: **31/31 tests passing** ✅
329
+
330
+ ### Code Quality
331
+
332
+ The project uses [Biome](https://biomejs.dev/) for fast, unified linting and formatting:
333
+
334
+ ```bash
335
+ pnpm run check # Check code quality
336
+ pnpm run check:fix # Auto-fix issues
337
+ ```
338
+
339
+ ### Contributing
340
+
341
+ We welcome contributions! Please:
342
+
343
+ 1. Fork the repository
344
+ 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
345
+ 3. Make your changes and ensure tests pass
346
+ 4. Run `pnpm run check:fix` to format code
347
+ 5. Commit using [Conventional Commits](https://www.conventionalcommits.org/)
348
+ 6. Open a Pull Request
349
+
350
+ See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed guidelines.
351
+
352
+ ## 📚 Documentation
353
+
354
+ - **[Full Documentation](https://sylphlab.github.io/pdf-reader-mcp/)** - Complete guides and API reference
355
+ - **[Getting Started Guide](./docs/guide/getting-started.md)** - Quick start guide
356
+ - **[API Reference](./docs/api/README.md)** - Detailed API documentation
357
+ - **[Design Philosophy](./docs/design/index.md)** - Architecture and design decisions
358
+ - **[Performance](./docs/performance/index.md)** - Benchmarks and optimization
359
+ - **[Comparison](./docs/comparison/index.md)** - How it compares to alternatives
360
+
361
+ ## 🗺️ Roadmap
362
+
363
+ - [ ] Image extraction from PDFs
364
+ - [ ] Annotation extraction support
365
+ - [ ] OCR integration for scanned PDFs
366
+ - [ ] Streaming support for very large files
367
+ - [ ] Enhanced caching mechanisms
368
+ - [ ] Performance optimizations for large batches
369
+
370
+ ## 🤝 Support & Community
371
+
372
+ - **Issues**: [GitHub Issues](https://github.com/sylphlab/pdf-reader-mcp/issues)
373
+ - **Discussions**: [GitHub Discussions](https://github.com/sylphlab/pdf-reader-mcp/discussions)
374
+ - **Contributing**: [CONTRIBUTING.md](./CONTRIBUTING.md)
375
+
376
+ If you find this project useful, please:
377
+
378
+ - ⭐ Star the repository
379
+ - 👀 Watch for updates
380
+ - 🐛 Report bugs
381
+ - 💡 Suggest features
382
+ - 🔀 Contribute code
383
+
384
+ ## 📄 License
385
+
386
+ This project is licensed under the [MIT License](./LICENSE).
387
+
388
+ ---
389
+
390
+ **Made with ❤️ by [Sylphx](https://sylphx.com)**
@@ -0,0 +1,4 @@
+ // Import only the consolidated PDF tool definition
+ import { readPdfToolDefinition } from './readPdf.js';
+ // Aggregate only the consolidated PDF tool definition
+ export const allToolDefinitions = [readPdfToolDefinition];
@@ -0,0 +1,342 @@
1
+ import fs from 'node:fs/promises';
2
+ import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
3
+ import * as pdfjsLib from 'pdfjs-dist/legacy/build/pdf.mjs';
4
+ import { z } from 'zod';
5
+ import { resolvePath } from '../utils/pathUtils.js';
6
+ // Helper to parse page range strings (e.g., "1-3,5,7-")
7
+ // Helper to parse a single range part (e.g., "1-3", "5", "7-")
8
+ const parseRangePart = (part, pages) => {
9
+ const trimmedPart = part.trim();
10
+ if (trimmedPart.includes('-')) {
11
+ const [startStr, endStr] = trimmedPart.split('-');
12
+ if (startStr === undefined) {
13
+ // Basic check
14
+ throw new Error(`Invalid page range format: ${trimmedPart}`);
15
+ }
16
+ const start = parseInt(startStr, 10);
17
+ const end = endStr === '' || endStr === undefined ? Infinity : parseInt(endStr, 10);
18
+ if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
19
+ throw new Error(`Invalid page range values: ${trimmedPart}`);
20
+ }
21
+ // Add a reasonable upper limit to prevent infinite loops for open ranges
22
+ const practicalEnd = Math.min(end, start + 10000); // Limit range parsing depth
23
+ for (let i = start; i <= practicalEnd; i++) {
24
+ pages.add(i);
25
+ }
26
+ if (end === Infinity && practicalEnd === start + 10000) {
27
+ console.warn(`[PDF Reader MCP] Open-ended range starting at ${String(start)} was truncated at page ${String(practicalEnd)} during parsing.`);
28
+ }
29
+ }
30
+ else {
31
+ const page = parseInt(trimmedPart, 10);
32
+ if (Number.isNaN(page) || page <= 0) {
33
+ throw new Error(`Invalid page number: ${trimmedPart}`);
34
+ }
35
+ pages.add(page);
36
+ }
37
+ };
38
+ // Parses the complete page range string (e.g., "1-3,5,7-")
39
+ const parsePageRanges = (ranges) => {
40
+ const pages = new Set();
41
+ const parts = ranges.split(',');
42
+ for (const part of parts) {
43
+ parseRangePart(part, pages); // Delegate parsing of each part
44
+ }
45
+ if (pages.size === 0) {
46
+ throw new Error('Page range string resulted in zero valid pages.');
47
+ }
48
+ return Array.from(pages).sort((a, b) => a - b);
49
+ };
50
+ // --- Zod Schemas ---
51
+ const pageSpecifierSchema = z.union([
52
+ z
53
+ .array(z.number().int().min(1))
54
+ .min(1), // Array of integers with minimum value 1 (pages are 1-based)
55
+ z
56
+ .string()
57
+ .min(1)
58
+ .refine((val) => /^[0-9,-]+$/.test(val.replace(/\s/g, '')), {
59
+ // Allow spaces but test without them
60
+ message: 'Page string must contain only numbers, commas, and hyphens.',
61
+ }),
62
+ ]);
63
+ const PdfSourceSchema = z
64
+ .object({
65
+ path: z.string().min(1).optional().describe('Relative path to the local PDF file.'),
66
+ url: z.string().url().optional().describe('URL of the PDF file.'),
67
+ pages: pageSpecifierSchema
68
+ .optional()
69
+ .describe("Extract text only from specific pages (1-based) or ranges for *this specific source*. If provided, 'include_full_text' for the entire request is ignored for this source."),
70
+ })
71
+ .strict()
72
+ .refine((data) => !!(data.path && !data.url) || !!(!data.path && data.url), {
73
+ // Exactly one of 'path' or 'url' must be provided (logical XOR)
74
+ message: "Each source must have either 'path' or 'url', but not both.",
75
+ });
76
+ const ReadPdfArgsSchema = z
77
+ .object({
78
+ sources: z
79
+ .array(PdfSourceSchema)
80
+ .min(1)
81
+ .describe('An array of PDF sources to process, each can optionally specify pages.'),
82
+ include_full_text: z
83
+ .boolean()
84
+ .optional()
85
+ .default(false)
86
+ .describe("Include the full text content of each PDF (only if 'pages' is not specified for that source)."),
87
+ include_metadata: z
88
+ .boolean()
89
+ .optional()
90
+ .default(true)
91
+ .describe('Include metadata and info objects for each PDF.'),
92
+ include_page_count: z
93
+ .boolean()
94
+ .optional()
95
+ .default(true)
96
+ .describe('Include the total number of pages for each PDF.'),
97
+ })
98
+ .strict();
99
+ // --- Helper Functions ---
100
+ // Parses the page specification for a single source
101
+ const getTargetPages = (sourcePages, sourceDescription) => {
102
+ if (!sourcePages) {
103
+ return undefined;
104
+ }
105
+ try {
106
+ let targetPages;
107
+ if (typeof sourcePages === 'string') {
108
+ targetPages = parsePageRanges(sourcePages);
109
+ }
110
+ else {
111
+ // Ensure array elements are positive integers
112
+ if (sourcePages.some((p) => !Number.isInteger(p) || p <= 0)) {
113
+ throw new Error('Page numbers in array must be positive integers.');
114
+ }
115
+ targetPages = [...new Set(sourcePages)].sort((a, b) => a - b);
116
+ }
117
+ if (targetPages.length === 0) {
118
+ // Check after potential Set deduplication
119
+ throw new Error('Page specification resulted in an empty set of pages.');
120
+ }
121
+ return targetPages;
122
+ }
123
+ catch (error) {
124
+ const message = error instanceof Error ? error.message : String(error);
125
+ // Throw McpError for invalid page specs caught during parsing
126
+ throw new McpError(ErrorCode.InvalidParams, `Invalid page specification for source ${sourceDescription}: ${message}`);
127
+ }
128
+ };
129
+ // Loads the PDF document from path or URL
130
+ const loadPdfDocument = async (source, // Object with either 'path' or 'url' ('pages' already removed by the caller)
131
+ sourceDescription) => {
132
+ let pdfDataSource;
133
+ try {
134
+ if (source.path) {
135
+ const safePath = resolvePath(source.path); // resolvePath handles security checks
136
+ const buffer = await fs.readFile(safePath);
137
+ pdfDataSource = new Uint8Array(buffer); // Convert Buffer to Uint8Array
138
+ }
139
+ else if (source.url) {
140
+ pdfDataSource = { url: source.url };
141
+ }
142
+ else {
143
+ // This case should be caught by Zod, but added for robustness
144
+ throw new McpError(ErrorCode.InvalidParams, `Source ${sourceDescription} missing 'path' or 'url'.`);
145
+ }
146
+ }
147
+ catch (err) {
148
+ // Handle errors during path resolution or file reading
149
+ let errorMessage; // Declare errorMessage here
150
+ const message = err instanceof Error ? err.message : String(err);
151
+ const errorCode = ErrorCode.InvalidRequest; // Default error code
152
+ if (typeof err === 'object' &&
153
+ err !== null &&
154
+ 'code' in err &&
155
+ err.code === 'ENOENT' &&
156
+ source.path) {
157
+ // Specific handling for file not found
158
+ errorMessage = `File not found at '${source.path}'.`;
159
+ // Optionally keep errorCode as InvalidRequest or change if needed
160
+ }
161
+ else {
162
+ // Generic error for other file prep issues or resolvePath errors
163
+ errorMessage = `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`;
164
+ }
165
+ throw new McpError(errorCode, errorMessage, { cause: err instanceof Error ? err : undefined });
166
+ }
167
+ const loadingTask = pdfjsLib.getDocument(pdfDataSource);
168
+ try {
169
+ return await loadingTask.promise;
170
+ }
171
+ catch (err) {
172
+ console.error(`[PDF Reader MCP] PDF.js loading error for ${sourceDescription}:`, err);
173
+ const message = err instanceof Error ? err.message : String(err);
174
+ // Fall back to a generic message when the error message is empty
175
+ throw new McpError(ErrorCode.InvalidRequest, `Failed to load PDF document from ${sourceDescription}. Reason: ${message || 'Unknown loading error'}`,
176
+ { cause: err instanceof Error ? err : undefined });
177
+ }
178
+ };
179
+ // Extracts metadata and page count
180
+ const extractMetadataAndPageCount = async (pdfDocument, includeMetadata, includePageCount) => {
181
+ const output = {};
182
+ if (includePageCount) {
183
+ output.num_pages = pdfDocument.numPages;
184
+ }
185
+ if (includeMetadata) {
186
+ try {
187
+ const pdfMetadata = await pdfDocument.getMetadata();
188
+ const infoData = pdfMetadata.info;
189
+ if (infoData !== undefined) {
190
+ output.info = infoData;
191
+ }
192
+ const metadataObj = pdfMetadata.metadata;
193
+ // Convert the metadata object to a plain object by extracting all properties
194
+ // Check if it has a getAll method (as used in tests)
195
+ if (typeof metadataObj.getAll === 'function') {
196
+ output.metadata = metadataObj.getAll();
197
+ }
198
+ else {
199
+ // For real PDF.js metadata, convert to plain object
200
+ const metadataRecord = {};
201
+ // Extract enumerable properties
202
+ for (const key in metadataObj) {
203
+ if (Object.hasOwn(metadataObj, key)) {
204
+ metadataRecord[key] = metadataObj[key];
205
+ }
206
+ }
207
+ output.metadata = metadataRecord;
208
+ }
209
+ }
210
+ catch (metaError) {
211
+ console.warn(`[PDF Reader MCP] Error extracting metadata: ${metaError instanceof Error ? metaError.message : String(metaError)}`);
212
+ // Optionally add a warning to the result if metadata extraction fails partially
213
+ }
214
+ }
215
+ return output;
216
+ };
217
+ // Extracts text from specified pages
218
+ const extractPageTexts = async (pdfDocument, pagesToProcess, sourceDescription) => {
219
+ const extractedPageTexts = [];
220
+ for (const pageNum of pagesToProcess) {
221
+ let pageText = '';
222
+ try {
223
+ const page = await pdfDocument.getPage(pageNum);
224
+ const textContent = await page.getTextContent();
225
+ pageText = textContent.items
226
+ .map((item) => item.str) // Each text item exposes its string content in 'str'
227
+ .join('');
228
+ }
229
+ catch (pageError) {
230
+ const message = pageError instanceof Error ? pageError.message : String(pageError);
231
+ console.warn(`[PDF Reader MCP] Error getting text content for page ${String(pageNum)} in ${sourceDescription}: ${message}` // Explicit string conversion
232
+ );
233
+ pageText = `Error processing page: ${message}`; // Include error in text
234
+ }
235
+ extractedPageTexts.push({ page: pageNum, text: pageText });
236
+ }
237
+ // Sorting is likely unnecessary if pagesToProcess was sorted, but keep for safety
238
+ extractedPageTexts.sort((a, b) => a.page - b.page);
239
+ return extractedPageTexts;
240
+ };
241
+ // Determines the actual list of pages to process based on target pages and total pages
242
+ const determinePagesToProcess = (targetPages, totalPages, includeFullText) => {
243
+ let pagesToProcess = [];
244
+ let invalidPages = [];
245
+ if (targetPages) {
246
+ // Filter target pages based on actual total pages
247
+ pagesToProcess = targetPages.filter((p) => p <= totalPages);
248
+ invalidPages = targetPages.filter((p) => p > totalPages);
249
+ }
250
+ else if (includeFullText) {
251
+ // If no specific pages requested for this source, use global flag
252
+ pagesToProcess = Array.from({ length: totalPages }, (_, i) => i + 1);
253
+ }
254
+ return { pagesToProcess, invalidPages };
255
+ };
256
+ // Processes a single PDF source
257
+ const processSingleSource = async (source, globalIncludeFullText, globalIncludeMetadata, globalIncludePageCount) => {
258
+ const sourceDescription = source.path ?? source.url ?? 'unknown source';
259
+ let individualResult = { source: sourceDescription, success: false };
260
+ try {
261
+ // 1. Parse target pages for this source (throws McpError on invalid spec)
262
+ const targetPages = getTargetPages(source.pages, sourceDescription);
263
+ // 2. Load PDF Document (throws McpError on loading failure)
264
+ // Destructure to remove 'pages' before passing to loadPdfDocument due to exactOptionalPropertyTypes
265
+ const { pages: _pages, ...loadArgs } = source;
266
+ const pdfDocument = await loadPdfDocument(loadArgs, sourceDescription);
267
+ const totalPages = pdfDocument.numPages;
268
+ // 3. Extract Metadata & Page Count
269
+ const metadataOutput = await extractMetadataAndPageCount(pdfDocument, globalIncludeMetadata, globalIncludePageCount);
270
+ const output = { ...metadataOutput }; // Start building output
271
+ // 4. Determine actual pages to process
272
+ const { pagesToProcess, invalidPages } = determinePagesToProcess(targetPages, totalPages, globalIncludeFullText // Pass the global flag
273
+ );
274
+ // Add warnings for invalid requested pages
275
+ if (invalidPages.length > 0) {
276
+ output.warnings = output.warnings ?? [];
277
+ output.warnings.push(`Requested page numbers ${invalidPages.join(', ')} exceed total pages (${String(totalPages)}).`);
278
+ }
279
+ // 5. Extract Text (if needed)
280
+ if (pagesToProcess.length > 0) {
281
+ const extractedPageTexts = await extractPageTexts(pdfDocument, pagesToProcess, sourceDescription);
282
+ if (targetPages) {
283
+ // If specific pages were requested for *this source*
284
+ output.page_texts = extractedPageTexts;
285
+ }
286
+ else {
287
+ // Only assign full_text if pages were NOT specified for this source
288
+ output.full_text = extractedPageTexts.map((p) => p.text).join('\n\n');
289
+ }
290
+ }
291
+ individualResult = { ...individualResult, data: output, success: true };
292
+ }
293
+ catch (error) {
294
+ let errorMessage = `Failed to process PDF from ${sourceDescription}.`;
295
+ if (error instanceof McpError) {
296
+ errorMessage = error.message; // Use message from McpError directly
297
+ }
298
+ else if (error instanceof Error) {
299
+ errorMessage += ` Reason: ${error.message}`;
300
+ }
301
+ else {
302
+ errorMessage += ` Unknown error: ${JSON.stringify(error)}`;
303
+ }
304
+ individualResult.error = errorMessage;
305
+ individualResult.success = false;
306
+ individualResult.data = undefined; // Ensure no data on error
307
+ }
308
+ return individualResult;
309
+ };
310
+ // --- Main Handler Function ---
311
+ export const handleReadPdfFunc = async (args) => {
312
+ let parsedArgs;
313
+ try {
314
+ parsedArgs = ReadPdfArgsSchema.parse(args);
315
+ }
316
+ catch (error) {
317
+ if (error instanceof z.ZodError) {
318
+ throw new McpError(ErrorCode.InvalidParams, `Invalid arguments: ${error.errors.map((e) => `${e.path.join('.')} (${e.message})`).join(', ')}`);
319
+ }
320
+ // Added fallback for non-Zod errors during parsing
321
+ const message = error instanceof Error ? error.message : String(error);
322
+ throw new McpError(ErrorCode.InvalidParams, `Argument validation failed: ${message}`);
323
+ }
324
+ const { sources, include_full_text, include_metadata, include_page_count } = parsedArgs;
325
+ // Process all sources concurrently
326
+ const results = await Promise.all(sources.map((source) => processSingleSource(source, include_full_text, include_metadata, include_page_count)));
327
+ return {
328
+ content: [
329
+ {
330
+ type: 'text',
331
+ text: JSON.stringify({ results }, null, 2),
332
+ },
333
+ ],
334
+ };
335
+ };
336
+ // Export the consolidated ToolDefinition
337
+ export const readPdfToolDefinition = {
338
+ name: 'read_pdf',
339
+ description: 'Reads content/metadata from one or more PDFs (local/URL). Each source can specify pages to extract.',
340
+ schema: ReadPdfArgsSchema,
341
+ handler: handleReadPdfFunc,
342
+ };
package/dist/index.js ADDED
@@ -0,0 +1,57 @@
+ #!/usr/bin/env node
+ import { Server } from '@modelcontextprotocol/sdk/server/index.js';
+ import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
+ import { CallToolRequestSchema, ErrorCode, ListToolsRequestSchema, McpError, } from '@modelcontextprotocol/sdk/types.js';
+ import { zodToJsonSchema } from 'zod-to-json-schema';
+ // Import the aggregated tool definitions
+ import { allToolDefinitions } from './handlers/index.js';
+ // Removed incorrect import left over from partial diff
+ // --- Tool Names (Constants) ---
+ // Removed tool name constants, names are now in the definitions
+ // --- Server Setup ---
+ const server = new Server({
+     name: 'filesystem-mcp',
+     version: '0.4.0', // Increment version for definition refactor
+     description: 'MCP Server for filesystem operations relative to the project root.',
+ }, {
+     capabilities: { tools: {} },
+ });
+ // Helper function to convert Zod schema to JSON schema for MCP
+ // Use 'unknown' instead of 'any' for better type safety, although casting is still needed for the SDK
+ const generateInputSchema = (schema) => {
+     // Need to cast as 'unknown' then 'object' because zodToJsonSchema might return slightly incompatible types for MCP SDK
+     return zodToJsonSchema(schema, { target: 'openApi3' });
+ };
+ server.setRequestHandler(ListToolsRequestSchema, () => {
+     // Removed unnecessary async
+     // Removed log
+     // Map the aggregated definitions to the format expected by the SDK
+     const availableTools = allToolDefinitions.map((def) => ({
+         name: def.name,
+         description: def.description,
+         inputSchema: generateInputSchema(def.schema), // Generate JSON schema from Zod schema
+     }));
+     return { tools: availableTools };
+ });
+ server.setRequestHandler(CallToolRequestSchema, async (request) => {
+     // Use imported handlers
+     // Find the tool definition by name and call its handler
+     const toolDefinition = allToolDefinitions.find((def) => def.name === request.params.name);
+     if (!toolDefinition) {
+         throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${request.params.name}`);
+     }
+     // Call the handler associated with the found definition
+     // The handler itself will perform Zod validation on the arguments
+     return toolDefinition.handler(request.params.arguments);
+ });
+ // --- Server Start ---
+ async function main() {
+     const transport = new StdioServerTransport();
+     await server.connect(transport);
+     console.error('[Filesystem MCP] Server running on stdio');
+ }
+ main().catch((error) => {
+     // Specify 'unknown' type for catch variable
+     console.error('[Filesystem MCP] Server error:', error);
+     process.exit(1);
+ });
@@ -0,0 +1,30 @@
1
+ // Removed unused import: import { fileURLToPath } from 'url';
2
+ import path from 'node:path';
3
+ import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
4
+ // Use the server's current working directory as the project root.
5
+ // This relies on the process launching the server to set the CWD correctly.
6
+ export const PROJECT_ROOT = process.cwd();
7
+ console.error(`[Filesystem MCP - pathUtils] Project Root determined from CWD: ${PROJECT_ROOT}`); // Log to stderr: console.info writes to stdout, which the stdio transport uses for JSON-RPC messages
8
+ /**
9
+ * Resolves a user-provided relative path against the project root,
10
+ * ensuring it stays within the project boundaries.
11
+ * Throws McpError on invalid input, absolute paths, or path traversal.
12
+ * @param userPath The relative path provided by the user.
13
+ * @returns The resolved absolute path.
14
+ */
15
+ export const resolvePath = (userPath) => {
16
+ if (typeof userPath !== 'string') {
17
+ throw new McpError(ErrorCode.InvalidParams, 'Path must be a string.');
18
+ }
19
+ const normalizedUserPath = path.normalize(userPath);
20
+ if (path.isAbsolute(normalizedUserPath)) {
21
+ throw new McpError(ErrorCode.InvalidParams, 'Absolute paths are not allowed.');
22
+ }
23
+ // Resolve against the calculated PROJECT_ROOT
24
+ const resolved = path.resolve(PROJECT_ROOT, normalizedUserPath);
25
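+ // Security check: ensure the resolved path is the project root or lies beneath it (a bare prefix match would also accept sibling directories such as "<root>-backup")
26
+ if (resolved !== PROJECT_ROOT && !resolved.startsWith(PROJECT_ROOT.endsWith(path.sep) ? PROJECT_ROOT : PROJECT_ROOT + path.sep)) {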
27
+ throw new McpError(ErrorCode.InvalidRequest, 'Path traversal detected. Access denied.');
28
+ }
29
+ return resolved;
30
+ };
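A containment check built on string prefixes has to compare against the root plus a trailing separator. A bare `startsWith(root)` also accepts sibling directories whose names merely begin with the root's name. The sketch below illustrates the difference on already-resolved absolute paths. `isInsideRoot`, its `sep` parameter, and the sample paths are hypothetical and not part of the package.

```javascript
// Both arguments are assumed to be already-resolved absolute paths.
// `sep` defaults to the POSIX separator to keep the sketch platform-independent.
function isInsideRoot(root, candidate, sep = '/') {
  // Avoid doubling the separator when the root is the filesystem root.
  const rootWithSep = root.endsWith(sep) ? root : root + sep;
  return candidate === root || candidate.startsWith(rootWithSep);
}

const root = '/srv/project';
const sibling = '/srv/project-backup/a.pdf';
const child = '/srv/project/docs/a.pdf';

console.log(sibling.startsWith(root)); // true: the bare prefix check is fooled
console.log(isInsideRoot(root, sibling)); // false
console.log(isInsideRoot(root, child)); // true
```

An alternative with the same intent is to compute `path.relative(root, candidate)` and reject results that start with a `..` segment or are absolute.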
package/package.json ADDED
@@ -0,0 +1,102 @@
1
+ {
2
+ "name": "@sylphx/pdf-reader-mcp",
3
+ "version": "1.0.0",
4
+ "description": "An MCP server providing tools to read PDF files.",
5
+ "type": "module",
6
+ "bin": {
7
+ "pdf-reader-mcp": "./dist/index.js"
8
+ },
9
+ "files": [
10
+ "dist/",
11
+ "README.md",
12
+ "LICENSE"
13
+ ],
14
+ "publishConfig": {
15
+ "access": "public"
16
+ },
17
+ "engines": {
18
+ "node": ">=22.0.0"
19
+ },
20
+ "repository": {
21
+ "type": "git",
22
+ "url": "git+https://github.com/sylphlab/pdf-reader-mcp.git"
23
+ },
24
+ "bugs": {
25
+ "url": "https://github.com/sylphlab/pdf-reader-mcp/issues"
26
+ },
27
+ "homepage": "https://github.com/sylphlab/pdf-reader-mcp#readme",
28
+ "author": "Sylphx <contact@sylphx.com> (https://sylphx.com)",
29
+ "license": "MIT",
30
+ "keywords": [
31
+ "mcp",
32
+ "model-context-protocol",
33
+ "pdf",
34
+ "reader",
35
+ "parser",
36
+ "typescript",
37
+ "node",
38
+ "ai",
39
+ "agent",
40
+ "tool"
41
+ ],
42
+ "scripts": {
43
+ "build": "tsc",
44
+ "watch": "tsc --watch",
45
+ "inspector": "npx @modelcontextprotocol/inspector dist/index.js",
46
+ "test": "vitest run",
47
+ "test:watch": "vitest watch",
48
+ "test:cov": "vitest run --coverage --reporter=junit --outputFile=test-report.junit.xml",
49
+ "lint": "biome lint .",
50
+ "lint:fix": "biome lint --write .",
51
+ "format": "biome format --write .",
52
+ "check-format": "biome format .",
53
+ "check": "biome check .",
54
+ "check:fix": "biome check --write .",
55
+ "validate": "npm run check && npm run test",
56
+ "docs:dev": "vitepress dev docs",
57
+ "docs:build": "vitepress build docs",
58
+ "docs:preview": "vitepress preview docs",
59
+ "start": "node dist/index.js",
60
+ "typecheck": "tsc --noEmit",
61
+ "benchmark": "vitest bench",
62
+ "clean": "rm -rf dist coverage",
63
+ "docs:api": "typedoc --entryPoints src/index.ts --tsconfig tsconfig.json --plugin typedoc-plugin-markdown --out docs/api --readme none",
64
+ "prepublishOnly": "pnpm run clean && pnpm run build",
65
+ "release": "standard-version",
66
+ "prepare": "husky"
67
+ },
68
+ "dependencies": {
69
+ "@modelcontextprotocol/sdk": "1.20.2",
70
+ "glob": "^11.0.1",
71
+ "pdfjs-dist": "^5.4.296",
72
+ "zod": "^3.24.2",
73
+ "zod-to-json-schema": "^3.24.5"
74
+ },
75
+ "devDependencies": {
76
+ "@biomejs/biome": "^2.3.2",
77
+ "@commitlint/cli": "^19.8.0",
78
+ "@commitlint/config-conventional": "^19.8.0",
79
+ "@types/glob": "^8.1.0",
80
+ "@types/node": "^24.0.7",
81
+ "@vitest/coverage-v8": "^3.1.1",
82
+ "husky": "^9.1.7",
83
+ "lint-staged": "^16.2.6",
84
+ "standard-version": "^9.5.0",
85
+ "typedoc": "^0.28.2",
86
+ "typedoc-plugin-markdown": "^4.9.0",
87
+ "typescript": "^5.8.3",
88
+ "vitepress": "^1.6.3",
89
+ "vitest": "^3.1.1",
90
+ "vue": "^3.5.13"
91
+ },
92
+ "commitlint": {
93
+ "extends": [
94
+ "@commitlint/config-conventional"
95
+ ]
96
+ },
97
+ "lint-staged": {
98
+ "*.{ts,tsx,js,cjs,json}": [
99
+ "biome check --write --no-errors-on-unmatched --files-ignore-unknown=true"
100
+ ]
101
+ }
102
+ }