@sylphx/pdf-reader-mcp 1.0.0 → 1.1.0

package/README.md CHANGED
@@ -4,7 +4,6 @@
  [![CI/CD Pipeline](https://github.com/sylphlab/pdf-reader-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/sylphlab/pdf-reader-mcp/actions/workflows/ci.yml)
  [![codecov](https://codecov.io/gh/sylphlab/pdf-reader-mcp/graph/badge.svg?token=VYRQFB40UN)](https://codecov.io/gh/sylphlab/pdf-reader-mcp)
  [![npm version](https://badge.fury.io/js/%40sylphlab%2Fpdf-reader-mcp.svg)](https://badge.fury.io/js/%40sylphlab%2Fpdf-reader-mcp)
- [![Docker Pulls](https://img.shields.io/docker/pulls/sylphlab/pdf-reader-mcp.svg)](https://hub.docker.com/r/sylphlab/pdf-reader-mcp)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![smithery badge](https://smithery.ai/badge/@sylphxltd/pdf-reader-mcp)](https://smithery.ai/server/@sylphxltd/pdf-reader-mcp)

@@ -17,13 +16,14 @@
  ## ✨ Features

  - 📄 **Extract text content** from PDF files (full document or specific pages)
+ - 🖼️ **Extract embedded images** from PDF pages as base64-encoded data
  - 📊 **Get metadata** (author, title, creation date, etc.)
  - 🔢 **Count pages** in PDF documents
  - 🌐 **Support for both local files and URLs**
  - 🛡️ **Secure** - Confines file access to project root directory
- - ⚡ **Fast** - Powered by PDF.js with optimized performance
+ - ⚡ **Fast** - Parallel processing for maximum performance
  - 🔄 **Batch processing** - Handle multiple PDFs in a single request
- - 📦 **Multiple deployment options** - npm, Docker, or Smithery
+ - 📦 **Multiple deployment options** - npm or Smithery

  ## 🆕 Recent Updates (October 2025)

@@ -32,8 +32,9 @@
  - ✅ **Improved metadata extraction**: Robust fallback handling for PDF.js compatibility
  - ✅ **Updated dependencies**: All packages updated to latest versions
  - ✅ **Migrated to Biome**: 50x faster linting and formatting with unified tooling
- - ✅ **Added Docker support**: Easy deployment with containerization
- - ✅ **All tests passing**: 31/31 tests with comprehensive coverage
+ - ✅ **Added image extraction**: Extract embedded images from PDF pages
+ - ✅ **Performance optimization**: Parallel page processing for 5-10x speedup
+ - ✅ **Deep refactoring**: Modular architecture with 98.9% test coverage (90 tests)

  ## 📦 Installation

@@ -70,35 +71,7 @@ Configure your MCP client (e.g., Claude Desktop, Cursor):

  **Important:** Make sure your MCP client sets the correct working directory (`cwd`) to your project root.

- ### Option 3: Using Docker
-
- Pull the image:
-
- ```bash
- docker pull sylphx/pdf-reader-mcp:latest
- ```
-
- Configure your MCP client:
-
- ```json
- {
-   "mcpServers": {
-     "pdf-reader-mcp": {
-       "command": "docker",
-       "args": [
-         "run",
-         "-i",
-         "--rm",
-         "-v",
-         "/path/to/your/project:/app",
-         "sylphx/pdf-reader-mcp:latest"
-       ]
-     }
-   }
- }
- ```
-
- ### Option 4: Local Development Build
+ ### Option 3: Local Development Build

  ```bash
  git clone https://github.com/sylphlab/pdf-reader-mcp.git
@@ -164,6 +137,28 @@ Once configured, your AI agent can read PDFs using the `read_pdf` tool:
  }
  ```

+ ### Example 5: Extract images from PDF
+
+ ```json
+ {
+   "sources": [
+     {
+       "path": "presentation.pdf",
+       "pages": [1, 2, 3]
+     }
+   ],
+   "include_images": true,
+   "include_full_text": true
+ }
+ ```
+
+ **Response includes**:
+ - Text content from each page
+ - Embedded images as base64-encoded data with metadata (width, height, format)
+ - Each image includes page number and index
+
+ **Note**: Image extraction works best with JPEG and PNG images. Large PDFs with many images may produce large responses.
+
  ## 📖 Usage Guide

  ### Page Specification
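
The usage guide advises extracting specific pages from large PDFs rather than the full document. A minimal sketch of how a client might split a long document into several smaller `read_pdf` requests is shown below. `batchPages` and `buildBatchedRequests` are hypothetical helper names, not part of the package; the request shape (`sources`, `path`, `pages`, `include_metadata`, `include_page_count`) is taken from the README examples and the schema visible in this diff.

```javascript
// Hypothetical client-side helpers; not part of @sylphx/pdf-reader-mcp.

// Split pages 1..totalPages into consecutive batches of at most batchSize pages.
function batchPages(totalPages, batchSize) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError('totalPages must be a positive integer');
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages);
    // Pages are 1-based, matching the README examples.
    batches.push(Array.from({ length: end - start + 1 }, (_, i) => start + i));
  }
  return batches;
}

// Build one read_pdf argument object per batch of pages.
function buildBatchedRequests(path, totalPages, batchSize) {
  return batchPages(totalPages, batchSize).map((pages) => ({
    sources: [{ path, pages }],
    include_metadata: false,
    include_page_count: false,
  }));
}

const requests = buildBatchedRequests('report.pdf', 7, 3);
console.log(JSON.stringify(requests.map((r) => r.sources[0].pages)));
// [[1,2,3],[4,5,6],[7]]
```

Sending the batches one at a time keeps each response small, at the cost of reloading the PDF for every request.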
@@ -192,6 +187,45 @@ For large PDF files (>20 MB), extract specific pages instead of the full documen

  This prevents hitting AI model context limits and improves performance.

+ ### Image Extraction
+
+ Extract embedded images from PDF pages as base64-encoded data:
+
+ ```json
+ {
+   "sources": [{ "path": "document.pdf" }],
+   "include_images": true
+ }
+ ```
+
+ **Image data format**:
+ ```json
+ {
+   "images": [
+     {
+       "page": 1,
+       "index": 0,
+       "width": 800,
+       "height": 600,
+       "format": "rgb",
+       "data": "base64-encoded-image-data..."
+     }
+   ]
+ }
+ ```
+
+ **Supported formats**:
+ - ✅ **RGB** - Standard color images (most common)
+ - ✅ **RGBA** - Images with transparency
+ - ✅ **Grayscale** - Black and white images
+ - ✅ Works with JPEG, PNG, and other embedded formats
+
+ **Important considerations**:
+ - 🔸 Image extraction increases response size significantly
+ - 🔸 Useful for AI models with vision capabilities
+ - 🔸 Set `include_images: false` (default) to extract text only
+ - 🔸 Combine with `pages` parameter to limit extraction scope
+
  ### Security: Relative Paths Only

  **Important:** The server only accepts **relative paths** for security reasons. Absolute paths are blocked to prevent unauthorized file system access.
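
The image-extraction notes in the hunk above warn that `include_images` increases response size significantly. A rough sketch of how a client could measure that cost from the documented image objects follows. It assumes `data` is standard padded base64, which matches the `Buffer.from(img.data).toString('base64')` call in the extractor added by this release; `base64ByteLength` and `summarizeImages` are hypothetical names, and the sample image is synthetic.

```javascript
// Hypothetical helpers for inspecting extracted images; not part of the package.

// Decoded size in bytes of a padded base64 string, computed without decoding it.
function base64ByteLength(b64) {
  const padding = b64.endsWith('==') ? 2 : b64.endsWith('=') ? 1 : 0;
  return (b64.length / 4) * 3 - padding;
}

// Total decoded bytes, plus image count and bytes per page, for an `images`
// array shaped like the "Image data format" example in the README.
function summarizeImages(images) {
  const perPage = new Map();
  let totalBytes = 0;
  for (const image of images) {
    const bytes = base64ByteLength(image.data);
    totalBytes += bytes;
    const entry = perPage.get(image.page) ?? { count: 0, bytes: 0 };
    entry.count += 1;
    entry.bytes += bytes;
    perPage.set(image.page, entry);
  }
  return { totalBytes, perPage };
}

// Synthetic example: 12 zero bytes stand in for a 2x2 image at 3 bytes per pixel.
const sampleImages = [
  {
    page: 1,
    index: 0,
    width: 2,
    height: 2,
    format: 'rgb',
    data: Buffer.alloc(12).toString('base64'),
  },
];
console.log(summarizeImages(sampleImages).totalBytes); // 12
```

A summary like this can help decide whether to narrow the `pages` parameter before requesting images again.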
@@ -360,12 +394,13 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed guidelines.

  ## 🗺️ Roadmap

- - [ ] Image extraction from PDFs
+ - [x] ~~Image extraction from PDFs~~ ✅ Completed (v1.0.0)
+ - [x] ~~Performance optimizations for parallel processing~~ ✅ Completed (v1.0.0)
  - [ ] Annotation extraction support
  - [ ] OCR integration for scanned PDFs
  - [ ] Streaming support for very large files
  - [ ] Enhanced caching mechanisms
- - [ ] Performance optimizations for large batches
+ - [ ] PDF form field extraction

  ## 🤝 Support & Community

@@ -1,299 +1,58 @@
- import fs from 'node:fs/promises';
+ // PDF reading handler - orchestrates PDF processing workflow
  import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
- import * as pdfjsLib from 'pdfjs-dist/legacy/build/pdf.mjs';
  import { z } from 'zod';
- import { resolvePath } from '../utils/pathUtils.js';
- // Helper to parse page range strings (e.g., "1-3,5,7-")
- // Helper to parse a single range part (e.g., "1-3", "5", "7-")
- const parseRangePart = (part, pages) => {
-     const trimmedPart = part.trim();
-     if (trimmedPart.includes('-')) {
-         const [startStr, endStr] = trimmedPart.split('-');
-         if (startStr === undefined) {
-             // Basic check
-             throw new Error(`Invalid page range format: ${trimmedPart}`);
-         }
-         const start = parseInt(startStr, 10);
-         const end = endStr === '' || endStr === undefined ? Infinity : parseInt(endStr, 10);
-         if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
-             throw new Error(`Invalid page range values: ${trimmedPart}`);
-         }
-         // Add a reasonable upper limit to prevent infinite loops for open ranges
-         const practicalEnd = Math.min(end, start + 10000); // Limit range parsing depth
-         for (let i = start; i <= practicalEnd; i++) {
-             pages.add(i);
-         }
-         if (end === Infinity && practicalEnd === start + 10000) {
-             console.warn(`[PDF Reader MCP] Open-ended range starting at ${String(start)} was truncated at page ${String(practicalEnd)} during parsing.`);
-         }
-     }
-     else {
-         const page = parseInt(trimmedPart, 10);
-         if (Number.isNaN(page) || page <= 0) {
-             throw new Error(`Invalid page number: ${trimmedPart}`);
-         }
-         pages.add(page);
-     }
- };
- // Parses the complete page range string (e.g., "1-3,5,7-")
- const parsePageRanges = (ranges) => {
-     const pages = new Set();
-     const parts = ranges.split(',');
-     for (const part of parts) {
-         parseRangePart(part, pages); // Delegate parsing of each part
-     }
-     if (pages.size === 0) {
-         throw new Error('Page range string resulted in zero valid pages.');
-     }
-     return Array.from(pages).sort((a, b) => a - b);
- };
- // --- Zod Schemas ---
- const pageSpecifierSchema = z.union([
-     z
-         .array(z.number().int().min(1))
-         .min(1), // Array of integers with minimum value 1 (pages are 1-based)
-     z
-         .string()
-         .min(1)
-         .refine((val) => /^[0-9,-]+$/.test(val.replace(/\s/g, '')), {
-         // Allow spaces but test without them
-         message: 'Page string must contain only numbers, commas, and hyphens.',
-     }),
- ]);
- const PdfSourceSchema = z
-     .object({
-     path: z.string().min(1).optional().describe('Relative path to the local PDF file.'),
-     url: z.string().url().optional().describe('URL of the PDF file.'),
-     pages: pageSpecifierSchema
-         .optional()
-         .describe("Extract text only from specific pages (1-based) or ranges for *this specific source*. If provided, 'include_full_text' for the entire request is ignored for this source."),
- })
-     .strict()
-     .refine((data) => !!(data.path && !data.url) || !!(!data.path && data.url), {
-     // Use boolean coercion instead of || for truthiness check if needed, though refine expects boolean
-     message: "Each source must have either 'path' or 'url', but not both.",
- });
- const ReadPdfArgsSchema = z
-     .object({
-     sources: z
-         .array(PdfSourceSchema)
-         .min(1)
-         .describe('An array of PDF sources to process, each can optionally specify pages.'),
-     include_full_text: z
-         .boolean()
-         .optional()
-         .default(false)
-         .describe("Include the full text content of each PDF (only if 'pages' is not specified for that source)."),
-     include_metadata: z
-         .boolean()
-         .optional()
-         .default(true)
-         .describe('Include metadata and info objects for each PDF.'),
-     include_page_count: z
-         .boolean()
-         .optional()
-         .default(true)
-         .describe('Include the total number of pages for each PDF.'),
- })
-     .strict();
- // --- Helper Functions ---
- // Parses the page specification for a single source
- const getTargetPages = (sourcePages, sourceDescription) => {
-     if (!sourcePages) {
-         return undefined;
-     }
-     try {
-         let targetPages;
-         if (typeof sourcePages === 'string') {
-             targetPages = parsePageRanges(sourcePages);
-         }
-         else {
-             // Ensure array elements are positive integers
-             if (sourcePages.some((p) => !Number.isInteger(p) || p <= 0)) {
-                 throw new Error('Page numbers in array must be positive integers.');
-             }
-             targetPages = [...new Set(sourcePages)].sort((a, b) => a - b);
-         }
-         if (targetPages.length === 0) {
-             // Check after potential Set deduplication
-             throw new Error('Page specification resulted in an empty set of pages.');
-         }
-         return targetPages;
-     }
-     catch (error) {
-         const message = error instanceof Error ? error.message : String(error);
-         // Throw McpError for invalid page specs caught during parsing
-         throw new McpError(ErrorCode.InvalidParams, `Invalid page specification for source ${sourceDescription}: ${message}`);
-     }
- };
- // Loads the PDF document from path or URL
- const loadPdfDocument = async (source, // Explicitly allow undefined
- sourceDescription) => {
-     let pdfDataSource;
-     try {
-         if (source.path) {
-             const safePath = resolvePath(source.path); // resolvePath handles security checks
-             const buffer = await fs.readFile(safePath);
-             pdfDataSource = new Uint8Array(buffer); // Convert Buffer to Uint8Array
-         }
-         else if (source.url) {
-             pdfDataSource = { url: source.url };
-         }
-         else {
-             // This case should be caught by Zod, but added for robustness
-             throw new McpError(ErrorCode.InvalidParams, `Source ${sourceDescription} missing 'path' or 'url'.`);
-         }
-     }
-     catch (err) {
-         // Handle errors during path resolution or file reading
-         let errorMessage; // Declare errorMessage here
-         const message = err instanceof Error ? err.message : String(err);
-         const errorCode = ErrorCode.InvalidRequest; // Default error code
-         if (typeof err === 'object' &&
-             err !== null &&
-             'code' in err &&
-             err.code === 'ENOENT' &&
-             source.path) {
-             // Specific handling for file not found
-             errorMessage = `File not found at '${source.path}'.`;
-             // Optionally keep errorCode as InvalidRequest or change if needed
-         }
-         else {
-             // Generic error for other file prep issues or resolvePath errors
-             errorMessage = `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`;
-         }
-         throw new McpError(errorCode, errorMessage, { cause: err instanceof Error ? err : undefined });
-     }
-     const loadingTask = pdfjsLib.getDocument(pdfDataSource);
-     try {
-         return await loadingTask.promise;
-     }
-     catch (err) {
-         console.error(`[PDF Reader MCP] PDF.js loading error for ${sourceDescription}:`, err);
-         const message = err instanceof Error ? err.message : String(err);
-         // Use ?? for default message
-         throw new McpError(ErrorCode.InvalidRequest, `Failed to load PDF document from ${sourceDescription}. Reason: ${message || 'Unknown loading error'}`, // Revert to || as message is likely always string here
-         { cause: err instanceof Error ? err : undefined });
-     }
- };
- // Extracts metadata and page count
- const extractMetadataAndPageCount = async (pdfDocument, includeMetadata, includePageCount) => {
-     const output = {};
-     if (includePageCount) {
-         output.num_pages = pdfDocument.numPages;
-     }
-     if (includeMetadata) {
-         try {
-             const pdfMetadata = await pdfDocument.getMetadata();
-             const infoData = pdfMetadata.info;
-             if (infoData !== undefined) {
-                 output.info = infoData;
-             }
-             const metadataObj = pdfMetadata.metadata;
-             // Convert the metadata object to a plain object by extracting all properties
-             // Check if it has a getAll method (as used in tests)
-             if (typeof metadataObj.getAll === 'function') {
-                 output.metadata = metadataObj.getAll();
-             }
-             else {
-                 // For real PDF.js metadata, convert to plain object
-                 const metadataRecord = {};
-                 // Extract enumerable properties
-                 for (const key in metadataObj) {
-                     if (Object.hasOwn(metadataObj, key)) {
-                         metadataRecord[key] = metadataObj[key];
-                     }
-                 }
-                 output.metadata = metadataRecord;
-             }
-         }
-         catch (metaError) {
-             console.warn(`[PDF Reader MCP] Error extracting metadata: ${metaError instanceof Error ? metaError.message : String(metaError)}`);
-             // Optionally add a warning to the result if metadata extraction fails partially
-         }
-     }
-     return output;
- };
- // Extracts text from specified pages
- const extractPageTexts = async (pdfDocument, pagesToProcess, sourceDescription) => {
-     const extractedPageTexts = [];
-     for (const pageNum of pagesToProcess) {
-         let pageText = '';
-         try {
-             const page = await pdfDocument.getPage(pageNum);
-             const textContent = await page.getTextContent();
-             pageText = textContent.items
-                 .map((item) => item.str) // Type assertion
-                 .join('');
-         }
-         catch (pageError) {
-             const message = pageError instanceof Error ? pageError.message : String(pageError);
-             console.warn(`[PDF Reader MCP] Error getting text content for page ${String(pageNum)} in ${sourceDescription}: ${message}` // Explicit string conversion
-             );
-             pageText = `Error processing page: ${message}`; // Include error in text
-         }
-         extractedPageTexts.push({ page: pageNum, text: pageText });
-     }
-     // Sorting is likely unnecessary if pagesToProcess was sorted, but keep for safety
-     extractedPageTexts.sort((a, b) => a.page - b.page);
-     return extractedPageTexts;
- };
- // Determines the actual list of pages to process based on target pages and total pages
- const determinePagesToProcess = (targetPages, totalPages, includeFullText) => {
-     let pagesToProcess = [];
-     let invalidPages = [];
-     if (targetPages) {
-         // Filter target pages based on actual total pages
-         pagesToProcess = targetPages.filter((p) => p <= totalPages);
-         invalidPages = targetPages.filter((p) => p > totalPages);
-     }
-     else if (includeFullText) {
-         // If no specific pages requested for this source, use global flag
-         pagesToProcess = Array.from({ length: totalPages }, (_, i) => i + 1);
-     }
-     return { pagesToProcess, invalidPages };
- };
- // Processes a single PDF source
- const processSingleSource = async (source, globalIncludeFullText, globalIncludeMetadata, globalIncludePageCount) => {
+ import { buildWarnings, extractImages, extractMetadataAndPageCount, extractPageTexts, } from '../pdf/extractor.js';
+ import { loadPdfDocument } from '../pdf/loader.js';
+ import { determinePagesToProcess, getTargetPages } from '../pdf/parser.js';
+ import { readPdfArgsSchema } from '../schemas/readPdf.js';
+ /**
+  * Process a single PDF source
+  */
+ const processSingleSource = async (source, options) => {
      const sourceDescription = source.path ?? source.url ?? 'unknown source';
      let individualResult = { source: sourceDescription, success: false };
      try {
-         // 1. Parse target pages for this source (throws McpError on invalid spec)
+         // Parse target pages
          const targetPages = getTargetPages(source.pages, sourceDescription);
-         // 2. Load PDF Document (throws McpError on loading failure)
-         // Destructure to remove 'pages' before passing to loadPdfDocument due to exactOptionalPropertyTypes
+         // Load PDF document
          const { pages: _pages, ...loadArgs } = source;
          const pdfDocument = await loadPdfDocument(loadArgs, sourceDescription);
          const totalPages = pdfDocument.numPages;
-         // 3. Extract Metadata & Page Count
-         const metadataOutput = await extractMetadataAndPageCount(pdfDocument, globalIncludeMetadata, globalIncludePageCount);
-         const output = { ...metadataOutput }; // Start building output
-         // 4. Determine actual pages to process
-         const { pagesToProcess, invalidPages } = determinePagesToProcess(targetPages, totalPages, globalIncludeFullText // Pass the global flag
-         );
-         // Add warnings for invalid requested pages
-         if (invalidPages.length > 0) {
-             output.warnings = output.warnings ?? [];
-             output.warnings.push(`Requested page numbers ${invalidPages.join(', ')} exceed total pages (${String(totalPages)}).`);
-         }
-         // 5. Extract Text (if needed)
+         // Extract metadata and page count
+         const metadataOutput = await extractMetadataAndPageCount(pdfDocument, options.includeMetadata, options.includePageCount);
+         const output = { ...metadataOutput };
+         // Determine pages to process
+         const { pagesToProcess, invalidPages } = determinePagesToProcess(targetPages, totalPages, options.includeFullText);
+         // Add warnings for invalid pages
+         const warnings = buildWarnings(invalidPages, totalPages);
+         if (warnings.length > 0) {
+             output.warnings = warnings;
+         }
+         // Extract text if needed
          if (pagesToProcess.length > 0) {
              const extractedPageTexts = await extractPageTexts(pdfDocument, pagesToProcess, sourceDescription);
              if (targetPages) {
-                 // If specific pages were requested for *this source*
+                 // Specific pages requested
                  output.page_texts = extractedPageTexts;
              }
              else {
-                 // Only assign full_text if pages were NOT specified for this source
+                 // Full text requested
                  output.full_text = extractedPageTexts.map((p) => p.text).join('\n\n');
              }
          }
+         // Extract images if needed
+         if (options.includeImages && pagesToProcess.length > 0) {
+             const extractedImages = await extractImages(pdfDocument, pagesToProcess);
+             if (extractedImages.length > 0) {
+                 output.images = extractedImages;
+             }
+         }
          individualResult = { ...individualResult, data: output, success: true };
      }
      catch (error) {
          let errorMessage = `Failed to process PDF from ${sourceDescription}.`;
          if (error instanceof McpError) {
-             errorMessage = error.message; // Use message from McpError directly
+             errorMessage = error.message;
          }
          else if (error instanceof Error) {
              errorMessage += ` Reason: ${error.message}`;
@@ -303,40 +62,100 @@ const processSingleSource = async (source, globalIncludeFullText, globalIncludeM
          }
          individualResult.error = errorMessage;
          individualResult.success = false;
-         individualResult.data = undefined; // Ensure no data on error
+         individualResult.data = undefined;
      }
      return individualResult;
  };
- // --- Main Handler Function ---
+ /**
+  * Main handler function for read_pdf tool
+  */
  export const handleReadPdfFunc = async (args) => {
      let parsedArgs;
      try {
-         parsedArgs = ReadPdfArgsSchema.parse(args);
+         parsedArgs = readPdfArgsSchema.parse(args);
      }
      catch (error) {
          if (error instanceof z.ZodError) {
              throw new McpError(ErrorCode.InvalidParams, `Invalid arguments: ${error.errors.map((e) => `${e.path.join('.')} (${e.message})`).join(', ')}`);
          }
-         // Added fallback for non-Zod errors during parsing
          const message = error instanceof Error ? error.message : String(error);
          throw new McpError(ErrorCode.InvalidParams, `Argument validation failed: ${message}`);
      }
-     const { sources, include_full_text, include_metadata, include_page_count } = parsedArgs;
+     const { sources, include_full_text, include_metadata, include_page_count, include_images } = parsedArgs;
      // Process all sources concurrently
-     const results = await Promise.all(sources.map((source) => processSingleSource(source, include_full_text, include_metadata, include_page_count)));
-     return {
-         content: [
-             {
-                 type: 'text',
-                 text: JSON.stringify({ results }, null, 2),
-             },
-         ],
-     };
+     const results = await Promise.all(sources.map((source) => processSingleSource(source, {
+         includeFullText: include_full_text,
+         includeMetadata: include_metadata,
+         includePageCount: include_page_count,
+         includeImages: include_images,
+     })));
+     // Build content parts - start with structured JSON for backward compatibility
+     const content = [];
+     // Strip image data from JSON to keep it manageable
+     const resultsForJson = results.map((result) => {
+         if (result.data?.images) {
+             const { images, ...dataWithoutImages } = result.data;
+             // Include image count and metadata in JSON, but not the base64 data
+             const imageInfo = images.map((img) => ({
+                 page: img.page,
+                 index: img.index,
+                 width: img.width,
+                 height: img.height,
+                 format: img.format,
+             }));
+             return { ...result, data: { ...dataWithoutImages, image_info: imageInfo } };
+         }
+         return result;
+     });
+     // First content part: Structured JSON results
+     content.push({
+         type: 'text',
+         text: JSON.stringify({ results: resultsForJson }, null, 2),
+     });
+     // Add page content in order: text then images for each page
+     if (include_images) {
+         for (const result of results) {
+             if (!result.success || !result.data)
+                 continue;
+             // Handle page_texts (specific pages requested)
+             if (result.data.page_texts) {
+                 for (const pageText of result.data.page_texts) {
+                     // Add images for this page (if any) right after page text
+                     if (result.data.images) {
+                         const pageImages = result.data.images.filter((img) => img.page === pageText.page);
+                         for (const image of pageImages) {
+                             content.push({
+                                 type: 'image',
+                                 data: image.data,
+                                 mimeType: image.format === 'rgba' ? 'image/png' : 'image/jpeg',
+                             });
+                         }
+                     }
+                 }
+             }
+             // Handle full_text mode - add all images by page order
+             if (result.data.full_text && result.data.images) {
+                 // Group images by page and add in order
+                 const pageNumbers = [...new Set(result.data.images.map((img) => img.page))].sort((a, b) => a - b);
+                 for (const pageNum of pageNumbers) {
+                     const pageImages = result.data.images.filter((img) => img.page === pageNum);
+                     for (const image of pageImages) {
+                         content.push({
+                             type: 'image',
+                             data: image.data,
+                             mimeType: image.format === 'rgba' ? 'image/png' : 'image/jpeg',
+                         });
+                     }
+                 }
+             }
+         }
+     }
+     return { content };
  };
- // Export the consolidated ToolDefinition
+ // Export the tool definition
  export const readPdfToolDefinition = {
      name: 'read_pdf',
-     description: 'Reads content/metadata from one or more PDFs (local/URL). Each source can specify pages to extract.',
-     schema: ReadPdfArgsSchema,
+     description: 'Reads content/metadata/images from one or more PDFs (local/URL). Each source can specify pages to extract.',
+     schema: readPdfArgsSchema,
      handler: handleReadPdfFunc,
  };
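
The 1.1.0 handler above returns a `content` array whose first part is a JSON text summary (base64 data replaced by `image_info`), followed by one `image` part per extracted image when `include_images` is set. A minimal sketch of a consumer that separates the two follows, assuming exactly that layout. `splitToolResult` is a hypothetical name, and the sample result is hand-built for illustration rather than captured from the server.

```javascript
// Hypothetical consumer of a read_pdf tool result; not part of the package.
function splitToolResult(toolResult) {
  const [first, ...rest] = toolResult.content;
  if (!first || first.type !== 'text') {
    throw new Error('Expected the first content part to be JSON text');
  }
  // The first part carries { results: [...] } serialized with JSON.stringify.
  const { results } = JSON.parse(first.text);
  // Remaining parts of type "image" carry base64 data and a mimeType.
  const images = rest.filter((part) => part.type === 'image');
  return { results, images };
}

// Hand-built sample mirroring the shape the handler produces.
const sampleResult = {
  content: [
    {
      type: 'text',
      text: JSON.stringify({
        results: [
          {
            source: 'presentation.pdf',
            success: true,
            data: {
              image_info: [{ page: 1, index: 0, width: 2, height: 2, format: 'rgb' }],
            },
          },
        ],
      }),
    },
    { type: 'image', data: 'AAAA', mimeType: 'image/jpeg' },
  ],
};

const parsed = splitToolResult(sampleResult);
console.log(parsed.results[0].data.image_info.length, parsed.images.length); // 1 1
```

The image parts carry no page or index fields, so pairing them with `image_info` entries would have to rely on their order, which is an assumption a client should verify.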
package/dist/index.js CHANGED
@@ -10,9 +10,9 @@ import { allToolDefinitions } from './handlers/index.js';
  // Removed tool name constants, names are now in the definitions
  // --- Server Setup ---
  const server = new Server({
-     name: 'filesystem-mcp',
-     version: '0.4.0', // Increment version for definition refactor
-     description: 'MCP Server for filesystem operations relative to the project root.',
+     name: 'pdf-reader-mcp',
+     version: '1.1.0',
+     description: 'MCP Server for reading PDF files and extracting text, metadata, images, and page information.',
  }, {
      capabilities: { tools: {} },
  });
@@ -48,10 +48,10 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
  async function main() {
      const transport = new StdioServerTransport();
      await server.connect(transport);
-     console.error('[Filesystem MCP] Server running on stdio');
+     console.error('[PDF Reader MCP] Server running on stdio');
  }
  main().catch((error) => {
      // Specify 'unknown' type for catch variable
-     console.error('[Filesystem MCP] Server error:', error);
+     console.error('[PDF Reader MCP] Server error:', error);
      process.exit(1);
  });
@@ -0,0 +1,153 @@
1
+ // PDF text and metadata extraction utilities
2
+ import { OPS } from 'pdfjs-dist/legacy/build/pdf.mjs';
3
+ /**
4
+ * Extract metadata and page count from a PDF document
5
+ */
6
+ export const extractMetadataAndPageCount = async (pdfDocument, includeMetadata, includePageCount) => {
7
+ const output = {};
8
+ if (includePageCount) {
9
+ output.num_pages = pdfDocument.numPages;
10
+ }
11
+ if (includeMetadata) {
12
+ try {
13
+ const pdfMetadata = await pdfDocument.getMetadata();
14
+ const infoData = pdfMetadata.info;
15
+ if (infoData !== undefined) {
16
+ output.info = infoData;
17
+ }
18
+ const metadataObj = pdfMetadata.metadata;
19
+ // Check if it has a getAll method (as used in tests)
20
+ if (typeof metadataObj.getAll === 'function') {
21
+ output.metadata = metadataObj.getAll();
22
+ }
23
+ else {
24
+ // For real PDF.js metadata, convert to plain object
25
+ const metadataRecord = {};
26
+ for (const key in metadataObj) {
27
+ if (Object.hasOwn(metadataObj, key)) {
28
+ metadataRecord[key] = metadataObj[key];
29
+ }
30
+ }
31
+ output.metadata = metadataRecord;
32
+ }
33
+ }
34
+ catch (metaError) {
35
+ console.warn(`[PDF Reader MCP] Error extracting metadata: ${metaError instanceof Error ? metaError.message : String(metaError)}`);
36
+ }
37
+ }
38
+ return output;
39
+ };
40
+ /**
41
+ * Extract text from a single page
42
+ */
43
+ const extractSinglePageText = async (pdfDocument, pageNum, sourceDescription) => {
44
+ try {
45
+ const page = await pdfDocument.getPage(pageNum);
46
+ const textContent = await page.getTextContent();
47
+ const pageText = textContent.items
48
+ .map((item) => item.str)
49
+ .join('');
50
+ return { page: pageNum, text: pageText };
51
+ }
52
+ catch (pageError) {
53
+ const message = pageError instanceof Error ? pageError.message : String(pageError);
54
+ console.warn(`[PDF Reader MCP] Error getting text content for page ${String(pageNum)} in ${sourceDescription}: ${message}`);
55
+ return { page: pageNum, text: `Error processing page: ${message}` };
56
+ }
57
+ };
58
+ /**
59
+ * Extract text from specified pages (parallel processing for performance)
60
+ */
61
+ export const extractPageTexts = async (pdfDocument, pagesToProcess, sourceDescription) => {
62
+ // Process all pages in parallel for better performance
63
+ const extractedPageTexts = await Promise.all(pagesToProcess.map((pageNum) => extractSinglePageText(pdfDocument, pageNum, sourceDescription)));
64
+ return extractedPageTexts.sort((a, b) => a.page - b.page);
65
+ };
66
+ /**
67
+ * Extract images from a single page
68
+ */
69
+ const extractImagesFromPage = async (page, pageNum) => {
70
+ const images = [];
71
+ try {
72
+ const operatorList = await page.getOperatorList();
73
+ // Find all image painting operations
74
+ const imageIndices = [];
75
+ for (let i = 0; i < operatorList.fnArray.length; i++) {
76
+ const op = operatorList.fnArray[i];
77
+ if (op === OPS.paintImageXObject || op === OPS.paintXObject) {
78
+ imageIndices.push(i);
79
+ }
80
+ }
81
+ // Extract each image using Promise-based approach
82
+ const imagePromises = imageIndices.map((imgIndex, arrayIndex) => new Promise((resolve) => {
83
+ const argsArray = operatorList.argsArray[imgIndex];
84
+ if (!argsArray || argsArray.length === 0) {
85
+ resolve(null);
86
+ return;
87
+ }
88
+ const imageName = argsArray[0];
89
+ // Use callback-based get() as images may not be resolved yet
90
+ page.objs.get(imageName, (imageData) => {
91
+ if (!imageData || typeof imageData !== 'object') {
92
+ resolve(null);
93
+ return;
94
+ }
95
+ const img = imageData;
96
+ if (!img.data || !img.width || !img.height) {
97
+ resolve(null);
98
+ return;
99
+ }
100
+ // Determine image format based on kind
101
+ // kind === 1 = grayscale, 2 = RGB, 3 = RGBA
102
+ const format = img.kind === 1 ? 'grayscale' : img.kind === 3 ? 'rgba' : 'rgb';
103
+ // Convert Uint8Array to base64
104
+ const base64 = Buffer.from(img.data).toString('base64');
105
+ resolve({
106
+ page: pageNum,
107
+ index: arrayIndex,
108
+ width: img.width,
109
+ height: img.height,
110
+ format,
111
+ data: base64,
112
+ });
113
+ });
114
+ }));
115
+ const resolvedImages = await Promise.all(imagePromises);
116
+ images.push(...resolvedImages.filter((img) => img !== null));
117
+ }
118
+ catch (error) {
119
+ const message = error instanceof Error ? error.message : String(error);
120
+ console.warn(`[PDF Reader MCP] Error extracting images from page ${String(pageNum)}: ${message}`);
121
+ }
122
+ return images;
123
+ };
124
+ /**
125
+ * Extract images from specified pages
126
+ */
127
+ export const extractImages = async (pdfDocument, pagesToProcess) => {
128
+ const allImages = [];
129
+ // Process pages sequentially to avoid overwhelming PDF.js
130
+ for (const pageNum of pagesToProcess) {
131
+ try {
132
+ const page = await pdfDocument.getPage(pageNum);
133
+ const pageImages = await extractImagesFromPage(page, pageNum);
134
+ allImages.push(...pageImages);
135
+ }
136
+ catch (error) {
137
+ const message = error instanceof Error ? error.message : String(error);
138
+ console.warn(`[PDF Reader MCP] Error getting page ${String(pageNum)} for image extraction: ${message}`);
139
+ }
140
+ }
141
+ return allImages;
142
+ };
143
+ /**
144
+ * Build warnings array for invalid page numbers
145
+ */
146
+ export const buildWarnings = (invalidPages, totalPages) => {
147
+ if (invalidPages.length === 0) {
148
+ return [];
149
+ }
150
+ return [
151
+ `Requested page numbers ${invalidPages.join(', ')} exceed total pages (${String(totalPages)}).`,
152
+ ];
153
+ };
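The image objects built above carry raw pixel bytes as base64 plus a `format` string derived from `img.kind`. A minimal sketch of how a consumer might sanity-check a payload before decoding it follows. It assumes the byte layouts implied by the PDF.js `ImageKind` names (1 bit per pixel with byte-padded rows for grayscale, 3 bytes per pixel for RGB, 4 for RGBA). The helper names `expectedByteLength` and `matchesDeclaredLayout` are hypothetical and not part of the package.

```javascript
// Hypothetical helpers, not part of the package.
// Expected size of the raw pixel buffer for each `format` value emitted above.
function expectedByteLength(width, height, format) {
  switch (format) {
    case 'grayscale': // assumed GRAYSCALE_1BPP: 1 bit per pixel, rows padded to whole bytes
      return Math.ceil(width / 8) * height;
    case 'rgb': // assumed RGB_24BPP: 3 bytes per pixel
      return width * height * 3;
    case 'rgba': // assumed RGBA_32BPP: 4 bytes per pixel
      return width * height * 4;
    default:
      return undefined; // unknown format: no layout assumed
  }
}

// Decode the base64 payload and compare its size with the declared layout.
function matchesDeclaredLayout(image) {
  const bytes = Buffer.from(image.data, 'base64');
  return bytes.length === expectedByteLength(image.width, image.height, image.format);
}

// A 2x2 RGB image needs 2 * 2 * 3 = 12 bytes.
const sample = {
  page: 1,
  index: 0,
  width: 2,
  height: 2,
  format: 'rgb',
  data: Buffer.from(new Uint8Array(12)).toString('base64'),
};
console.log(matchesDeclaredLayout(sample)); // true
```

Because the extraction code maps every kind other than 1 and 3 to `'rgb'`, a size mismatch on an `'rgb'` image may simply mean the source image had an unrecognised kind.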
@@ -0,0 +1,53 @@
1
+ // PDF document loading utilities
2
+ import fs from 'node:fs/promises';
3
+ import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
4
+ import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';
5
+ import { resolvePath } from '../utils/pathUtils.js';
6
+ /**
7
+ * Load a PDF document from a local file path or URL
8
+ * @param source - Object containing either path or url
9
+ * @param sourceDescription - Description for error messages
10
+ * @returns PDF document proxy
11
+ */
12
+ export const loadPdfDocument = async (source, sourceDescription) => {
13
+ let pdfDataSource;
14
+ try {
15
+ if (source.path) {
16
+ const safePath = resolvePath(source.path);
17
+ const buffer = await fs.readFile(safePath);
18
+ pdfDataSource = new Uint8Array(buffer);
19
+ }
20
+ else if (source.url) {
21
+ pdfDataSource = { url: source.url };
22
+ }
23
+ else {
24
+ throw new McpError(ErrorCode.InvalidParams, `Source ${sourceDescription} missing 'path' or 'url'.`);
25
+ }
26
+ }
27
+ catch (err) {
28
+ if (err instanceof McpError) {
29
+ throw err;
30
+ }
31
+ const message = err instanceof Error ? err.message : String(err);
32
+ const errorCode = ErrorCode.InvalidRequest;
33
+ if (typeof err === 'object' &&
34
+ err !== null &&
35
+ 'code' in err &&
36
+ err.code === 'ENOENT' &&
37
+ source.path) {
38
+ throw new McpError(errorCode, `File not found at '${source.path}'.`, {
39
+ cause: err instanceof Error ? err : undefined,
40
+ });
41
+ }
42
+ throw new McpError(errorCode, `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`, { cause: err instanceof Error ? err : undefined });
43
+ }
44
+ const loadingTask = getDocument(pdfDataSource);
45
+ try {
46
+ return await loadingTask.promise;
47
+ }
48
+ catch (err) {
49
+ console.error(`[PDF Reader MCP] PDF.js loading error for ${sourceDescription}:`, err);
50
+ const message = err instanceof Error ? err.message : String(err);
51
+ throw new McpError(ErrorCode.InvalidRequest, `Failed to load PDF document from ${sourceDescription}. Reason: ${message || 'Unknown loading error'}`, { cause: err instanceof Error ? err : undefined });
52
+ }
53
+ };
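The first `catch` block in the loader above separates a missing file (`ENOENT`) from every other preparation failure. The sketch below restates that branching as a pure function so the two messages can be inspected in isolation. The name `describePrepareError` is hypothetical, and the real code throws an `McpError` with these messages rather than returning a string.

```javascript
// Hypothetical restatement of the loader's error branching, not part of the package.
function describePrepareError(err, source, sourceDescription) {
  // Node file-system errors expose a string `code` property such as 'ENOENT'.
  const isMissingFile =
    typeof err === 'object' && err !== null && 'code' in err && err.code === 'ENOENT';
  if (isMissingFile && source.path) {
    return `File not found at '${source.path}'.`;
  }
  const message = err instanceof Error ? err.message : String(err);
  return `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`;
}

// Simulated errors: one shaped like a Node ENOENT error, one generic.
const missing = Object.assign(new Error('no such file or directory'), { code: 'ENOENT' });
const denied = new Error('permission denied');

console.log(describePrepareError(missing, { path: 'docs/a.pdf' }, 'docs/a.pdf'));
// File not found at 'docs/a.pdf'.
console.log(describePrepareError(denied, { path: 'docs/a.pdf' }, 'docs/a.pdf'));
// Failed to prepare PDF source docs/a.pdf. Reason: permission denied
```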
@@ -0,0 +1,94 @@
1
+ // Page range parsing utilities
2
+ import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
3
+ const MAX_RANGE_SIZE = 10000; // Cap the span of any single range so open-ended ranges (e.g., "7-") terminate
4
+ /**
5
+ * Parse a single range part (e.g., "1-3", "5", "7-")
6
+ */
7
+ const parseRangePart = (part, pages) => {
8
+ const trimmedPart = part.trim();
9
+ if (trimmedPart.includes('-')) {
10
+ const [startStr, endStr] = trimmedPart.split('-');
11
+ if (startStr === undefined) {
12
+ throw new Error(`Invalid page range format: ${trimmedPart}`);
13
+ }
14
+ const start = parseInt(startStr, 10);
15
+ const end = endStr === '' || endStr === undefined ? Infinity : parseInt(endStr, 10);
16
+ if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
17
+ throw new Error(`Invalid page range values: ${trimmedPart}`);
18
+ }
19
+ const practicalEnd = Math.min(end, start + MAX_RANGE_SIZE);
20
+ for (let i = start; i <= practicalEnd; i++) {
21
+ pages.add(i);
22
+ }
23
+ if (end === Infinity && practicalEnd === start + MAX_RANGE_SIZE) {
24
+ console.warn(`[PDF Reader MCP] Open-ended range starting at ${String(start)} was truncated at page ${String(practicalEnd)}.`);
25
+ }
26
+ }
27
+ else {
28
+ const page = parseInt(trimmedPart, 10);
29
+ if (Number.isNaN(page) || page <= 0) {
30
+ throw new Error(`Invalid page number: ${trimmedPart}`);
31
+ }
32
+ pages.add(page);
33
+ }
34
+ };
35
+ /**
36
+ * Parse page range string into array of page numbers
37
+ * @param ranges - Range string (e.g., "1-3,5,7-10")
38
+ * @returns Sorted array of unique page numbers
39
+ */
40
+ export const parsePageRanges = (ranges) => {
41
+ const pages = new Set();
42
+ const parts = ranges.split(',');
43
+ for (const part of parts) {
44
+ parseRangePart(part, pages);
45
+ }
46
+ if (pages.size === 0) {
47
+ throw new Error('Page range string resulted in zero valid pages.');
48
+ }
49
+ return Array.from(pages).sort((a, b) => a - b);
50
+ };
51
+ /**
52
+ * Get target pages from page specification
53
+ * @param sourcePages - Page specification (string or array)
54
+ * @param sourceDescription - Description for error messages
55
+ * @returns Sorted array of unique page numbers, or undefined if no page specification was given
56
+ */
57
+ export const getTargetPages = (sourcePages, sourceDescription) => {
58
+ if (!sourcePages) {
59
+ return undefined;
60
+ }
61
+ try {
62
+ if (typeof sourcePages === 'string') {
63
+ return parsePageRanges(sourcePages);
64
+ }
65
+ // Array of page numbers
66
+ if (sourcePages.some((p) => !Number.isInteger(p) || p <= 0)) {
67
+ throw new Error('Page numbers in array must be positive integers.');
68
+ }
69
+ const uniquePages = [...new Set(sourcePages)].sort((a, b) => a - b);
70
+ if (uniquePages.length === 0) {
71
+ throw new Error('Page specification resulted in an empty set of pages.');
72
+ }
73
+ return uniquePages;
74
+ }
75
+ catch (error) {
76
+ const message = error instanceof Error ? error.message : String(error);
77
+ throw new McpError(ErrorCode.InvalidParams, `Invalid page specification for source ${sourceDescription}: ${message}`);
78
+ }
79
+ };
80
+ /**
81
+ * Determine which pages to process based on target pages and document size
82
+ */
83
+ export const determinePagesToProcess = (targetPages, totalPages, includeFullText) => {
84
+ if (targetPages) {
85
+ const pagesToProcess = targetPages.filter((p) => p <= totalPages);
86
+ const invalidPages = targetPages.filter((p) => p > totalPages);
87
+ return { pagesToProcess, invalidPages };
88
+ }
89
+ if (includeFullText) {
90
+ const pagesToProcess = Array.from({ length: totalPages }, (_, i) => i + 1);
91
+ return { pagesToProcess, invalidPages: [] };
92
+ }
93
+ return { pagesToProcess: [], invalidPages: [] };
94
+ };
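One edge case in the range parsing above: `trimmedPart.split('-')` keeps only the first two pieces, so an input such as `"1-3-5"` is read as `1-3` and the trailing `-5` is silently dropped. A minimal sketch of a stricter alternative follows. The function `parseRangePartStrict` is hypothetical and not part of the package. It returns the bounds instead of filling a `Set`, and it swaps the `split` for a regular expression.

```javascript
// Hypothetical stricter variant of the range-part parsing, not part of the package.
function parseRangePartStrict(part) {
  const trimmed = part.trim();
  // Accept "5", "1-3", or "7-" and nothing else.
  const match = /^(\d+)(?:-(\d*))?$/.exec(trimmed);
  if (!match) {
    throw new Error(`Invalid page range format: ${trimmed}`);
  }
  const start = Number.parseInt(match[1], 10);
  let end;
  if (match[2] === undefined) {
    end = start; // single page, e.g. "5"
  } else if (match[2] === '') {
    end = Infinity; // open-ended range, e.g. "7-"
  } else {
    end = Number.parseInt(match[2], 10);
  }
  if (start <= 0 || start > end) {
    throw new Error(`Invalid page range values: ${trimmed}`);
  }
  return { start, end };
}

console.log(parseRangePartStrict('1-3')); // { start: 1, end: 3 }
console.log(parseRangePartStrict(' 7- ')); // { start: 7, end: Infinity }
```

A caller would still need to clamp open-ended ranges, as the original does with `MAX_RANGE_SIZE`, before expanding them into page numbers.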
@@ -0,0 +1,55 @@
1
+ // Zod validation schemas for PDF reading
2
+ import { z } from 'zod';
3
+ // Schema for page specification (array of numbers or range string)
4
+ export const pageSpecifierSchema = z.union([
5
+ z.array(z.number().int().min(1)).min(1).describe('Array of page numbers (1-based)'),
6
+ z
7
+ .string()
8
+ .min(1)
9
+ .refine((val) => /^[0-9,-]+$/.test(val.replace(/\s/g, '')), {
10
+ message: 'Page string must contain only numbers, commas, and hyphens.',
11
+ })
12
+ .describe('Page range string (e.g., "1-5,10,15-20")'),
13
+ ]);
14
+ // Schema for a single PDF source (path or URL)
15
+ export const pdfSourceSchema = z
16
+ .object({
17
+ path: z.string().min(1).optional().describe('Relative path to the local PDF file.'),
18
+ url: z.string().url().optional().describe('URL of the PDF file.'),
19
+ pages: pageSpecifierSchema
20
+ .optional()
21
+ .describe("Extract text only from specific pages (1-based) or ranges for this source. If provided, 'include_full_text' is ignored for this source."),
22
+ })
23
+ .strict()
24
+ .refine((data) => !!(data.path && !data.url) || !!(!data.path && data.url), {
25
+ message: "Each source must have either 'path' or 'url', but not both.",
26
+ });
27
+ // Schema for the read_pdf tool arguments
28
+ export const readPdfArgsSchema = z
29
+ .object({
30
+ sources: z
31
+ .array(pdfSourceSchema)
32
+ .min(1)
33
+ .describe('An array of PDF sources to process, each can optionally specify pages.'),
34
+ include_full_text: z
35
+ .boolean()
36
+ .optional()
37
+ .default(false)
38
+ .describe("Include the full text content of each PDF (only if 'pages' is not specified for that source)."),
39
+ include_metadata: z
40
+ .boolean()
41
+ .optional()
42
+ .default(true)
43
+ .describe('Include metadata and info objects for each PDF.'),
44
+ include_page_count: z
45
+ .boolean()
46
+ .optional()
47
+ .default(true)
48
+ .describe('Include the total number of pages for each PDF.'),
49
+ include_images: z
50
+ .boolean()
51
+ .optional()
52
+ .default(false)
53
+ .describe('Extract and include embedded images from the PDF pages as base64-encoded data.'),
54
+ })
55
+ .strict();
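The two `refine` predicates in the schemas above can be exercised without Zod. The sketch below copies the page-string regular expression and restates the path/URL check as an exclusive-or on truthiness, which should be equivalent to the original boolean expression. The function names `pageStringLooksValid` and `hasExactlyOneLocation` are hypothetical.

```javascript
// Hypothetical stand-alone versions of the two refine() predicates, not part of the package.

// Same test as the page string refinement: after removing whitespace,
// only digits, commas, and hyphens may remain.
const pageStringLooksValid = (val) => /^[0-9,-]+$/.test(val.replace(/\s/g, ''));

// Same intent as the source refinement: exactly one of `path` or `url` is set.
const hasExactlyOneLocation = (data) => Boolean(data.path) !== Boolean(data.url);

console.log(pageStringLooksValid('1-5, 10, 15-20')); // true
console.log(pageStringLooksValid('1-3-5')); // true: shape check only
console.log(pageStringLooksValid('one-five')); // false

console.log(hasExactlyOneLocation({ path: 'docs/a.pdf' })); // true
console.log(hasExactlyOneLocation({ path: 'docs/a.pdf', url: 'https://example.com/a.pdf' })); // false
console.log(hasExactlyOneLocation({})); // false
```

The regular expression only checks which characters appear, so strings such as `"1-3-5"` or `",,"` pass the schema and are left for the range parser to accept or reject.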
@@ -0,0 +1,2 @@
1
+ // PDF-related TypeScript type definitions
2
+ export {};
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@sylphx/pdf-reader-mcp",
3
- "version": "1.0.0",
3
+ "version": "1.1.0",
4
4
  "description": "An MCP server providing tools to read PDF files.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -39,32 +39,6 @@
39
39
  "agent",
40
40
  "tool"
41
41
  ],
42
- "scripts": {
43
- "build": "tsc",
44
- "watch": "tsc --watch",
45
- "inspector": "npx @modelcontextprotocol/inspector dist/index.js",
46
- "test": "vitest run",
47
- "test:watch": "vitest watch",
48
- "test:cov": "vitest run --coverage --reporter=junit --outputFile=test-report.junit.xml",
49
- "lint": "biome lint .",
50
- "lint:fix": "biome lint --write .",
51
- "format": "biome format --write .",
52
- "check-format": "biome format .",
53
- "check": "biome check .",
54
- "check:fix": "biome check --write .",
55
- "validate": "npm run check && npm run test",
56
- "docs:dev": "vitepress dev docs",
57
- "docs:build": "vitepress build docs",
58
- "docs:preview": "vitepress preview docs",
59
- "start": "node dist/index.js",
60
- "typecheck": "tsc --noEmit",
61
- "benchmark": "vitest bench",
62
- "clean": "rm -rf dist coverage",
63
- "docs:api": "typedoc --entryPoints src/index.ts --tsconfig tsconfig.json --plugin typedoc-plugin-markdown --out docs/api --readme none",
64
- "prepublishOnly": "pnpm run clean && pnpm run build",
65
- "release": "standard-version",
66
- "prepare": "husky"
67
- },
68
42
  "dependencies": {
69
43
  "@modelcontextprotocol/sdk": "1.20.2",
70
44
  "glob": "^11.0.1",
@@ -98,5 +72,29 @@
98
72
  "*.{ts,tsx,js,cjs,json}": [
99
73
  "biome check --write --no-errors-on-unmatched --files-ignore-unknown=true"
100
74
  ]
75
+ },
76
+ "scripts": {
77
+ "build": "tsc",
78
+ "watch": "tsc --watch",
79
+ "inspector": "npx @modelcontextprotocol/inspector dist/index.js",
80
+ "test": "vitest run",
81
+ "test:watch": "vitest watch",
82
+ "test:cov": "vitest run --coverage --reporter=junit --outputFile=test-report.junit.xml",
83
+ "lint": "biome lint .",
84
+ "lint:fix": "biome lint --write .",
85
+ "format": "biome format --write .",
86
+ "check-format": "biome format .",
87
+ "check": "biome check .",
88
+ "check:fix": "biome check --write .",
89
+ "validate": "npm run check && npm run test",
90
+ "docs:dev": "vitepress dev docs",
91
+ "docs:build": "vitepress build docs",
92
+ "docs:preview": "vitepress preview docs",
93
+ "start": "node dist/index.js",
94
+ "typecheck": "tsc --noEmit",
95
+ "benchmark": "vitest bench",
96
+ "clean": "rm -rf dist coverage",
97
+ "docs:api": "typedoc --entryPoints src/index.ts --tsconfig tsconfig.json --plugin typedoc-plugin-markdown --out docs/api --readme none",
98
+ "release": "standard-version"
101
99
  }
102
- }
100
+ }