@sylphx/pdf-reader-mcp 1.0.0 → 1.2.0
- package/README.md +71 -36
- package/dist/handlers/readPdf.js +117 -291
- package/dist/index.js +5 -5
- package/dist/pdf/extractor.js +265 -0
- package/dist/pdf/loader.js +53 -0
- package/dist/pdf/parser.js +94 -0
- package/dist/schemas/readPdf.js +55 -0
- package/dist/types/pdf.js +2 -0
- package/package.json +26 -28
package/README.md
CHANGED
|
@@ -4,7 +4,6 @@
|
|
|
4
4
|
[](https://github.com/sylphlab/pdf-reader-mcp/actions/workflows/ci.yml)
|
|
5
5
|
[](https://codecov.io/gh/sylphlab/pdf-reader-mcp)
|
|
6
6
|
[](https://badge.fury.io/js/%40sylphlab%2Fpdf-reader-mcp)
|
|
7
|
-
[](https://hub.docker.com/r/sylphlab/pdf-reader-mcp)
|
|
8
7
|
[](https://opensource.org/licenses/MIT)
|
|
9
8
|
[](https://smithery.ai/server/@sylphxltd/pdf-reader-mcp)
|
|
10
9
|
|
|
@@ -17,13 +16,14 @@
|
|
|
17
16
|
## ✨ Features
|
|
18
17
|
|
|
19
18
|
- 📄 **Extract text content** from PDF files (full document or specific pages)
|
|
19
|
+
- 🖼️ **Extract embedded images** from PDF pages as base64-encoded data
|
|
20
20
|
- 📊 **Get metadata** (author, title, creation date, etc.)
|
|
21
21
|
- 🔢 **Count pages** in PDF documents
|
|
22
22
|
- 🌐 **Support for both local files and URLs**
|
|
23
23
|
- 🛡️ **Secure** - Confines file access to project root directory
|
|
24
|
-
- ⚡ **Fast** -
|
|
24
|
+
- ⚡ **Fast** - Parallel processing for maximum performance
|
|
25
25
|
- 🔄 **Batch processing** - Handle multiple PDFs in a single request
|
|
26
|
-
- 📦 **Multiple deployment options** - npm
|
|
26
|
+
- 📦 **Multiple deployment options** - npm or Smithery
|
|
27
27
|
|
|
28
28
|
## 🆕 Recent Updates (October 2025)
|
|
29
29
|
|
|
@@ -32,8 +32,9 @@
|
|
|
32
32
|
- ✅ **Improved metadata extraction**: Robust fallback handling for PDF.js compatibility
|
|
33
33
|
- ✅ **Updated dependencies**: All packages updated to latest versions
|
|
34
34
|
- ✅ **Migrated to Biome**: 50x faster linting and formatting with unified tooling
|
|
35
|
-
- ✅ **Added
|
|
36
|
-
- ✅ **
|
|
35
|
+
- ✅ **Added image extraction**: Extract embedded images from PDF pages
|
|
36
|
+
- ✅ **Performance optimization**: Parallel page processing for 5-10x speedup
|
|
37
|
+
- ✅ **Deep refactoring**: Modular architecture with 98.9% test coverage (90 tests)
|
|
37
38
|
|
|
38
39
|
## 📦 Installation
|
|
39
40
|
|
|
@@ -70,35 +71,7 @@ Configure your MCP client (e.g., Claude Desktop, Cursor):
|
|
|
70
71
|
|
|
71
72
|
**Important:** Make sure your MCP client sets the correct working directory (`cwd`) to your project root.
|
|
72
73
|
|
|
73
|
-
### Option 3:
|
|
74
|
-
|
|
75
|
-
Pull the image:
|
|
76
|
-
|
|
77
|
-
```bash
|
|
78
|
-
docker pull sylphx/pdf-reader-mcp:latest
|
|
79
|
-
```
|
|
80
|
-
|
|
81
|
-
Configure your MCP client:
|
|
82
|
-
|
|
83
|
-
```json
|
|
84
|
-
{
|
|
85
|
-
"mcpServers": {
|
|
86
|
-
"pdf-reader-mcp": {
|
|
87
|
-
"command": "docker",
|
|
88
|
-
"args": [
|
|
89
|
-
"run",
|
|
90
|
-
"-i",
|
|
91
|
-
"--rm",
|
|
92
|
-
"-v",
|
|
93
|
-
"/path/to/your/project:/app",
|
|
94
|
-
"sylphx/pdf-reader-mcp:latest"
|
|
95
|
-
]
|
|
96
|
-
}
|
|
97
|
-
}
|
|
98
|
-
}
|
|
99
|
-
```
|
|
100
|
-
|
|
101
|
-
### Option 4: Local Development Build
|
|
74
|
+
### Option 3: Local Development Build
|
|
102
75
|
|
|
103
76
|
```bash
|
|
104
77
|
git clone https://github.com/sylphlab/pdf-reader-mcp.git
|
|
@@ -164,6 +137,28 @@ Once configured, your AI agent can read PDFs using the `read_pdf` tool:
|
|
|
164
137
|
}
|
|
165
138
|
```
|
|
166
139
|
|
|
140
|
+
### Example 5: Extract images from PDF
|
|
141
|
+
|
|
142
|
+
```json
|
|
143
|
+
{
|
|
144
|
+
"sources": [
|
|
145
|
+
{
|
|
146
|
+
"path": "presentation.pdf",
|
|
147
|
+
"pages": [1, 2, 3]
|
|
148
|
+
}
|
|
149
|
+
],
|
|
150
|
+
"include_images": true,
|
|
151
|
+
"include_full_text": true
|
|
152
|
+
}
|
|
153
|
+
```
|
|
154
|
+
|
|
155
|
+
**Response includes**:
|
|
156
|
+
- Text content from each page
|
|
157
|
+
- Embedded images as base64-encoded data with metadata (width, height, format)
|
|
158
|
+
- Each image includes page number and index
|
|
159
|
+
|
|
160
|
+
**Note**: Image extraction works best with JPEG and PNG images. Large PDFs with many images may produce large responses.
|
|
161
|
+
|
|
167
162
|
## 📖 Usage Guide
|
|
168
163
|
|
|
169
164
|
### Page Specification
|
|
@@ -192,6 +187,45 @@ For large PDF files (>20 MB), extract specific pages instead of the full documen
|
|
|
192
187
|
|
|
193
188
|
This prevents hitting AI model context limits and improves performance.
|
|
194
189
|
|
|
190
|
+
### Image Extraction
|
|
191
|
+
|
|
192
|
+
Extract embedded images from PDF pages as base64-encoded data:
|
|
193
|
+
|
|
194
|
+
```json
|
|
195
|
+
{
|
|
196
|
+
"sources": [{ "path": "document.pdf" }],
|
|
197
|
+
"include_images": true
|
|
198
|
+
}
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
**Image data format**:
|
|
202
|
+
```json
|
|
203
|
+
{
|
|
204
|
+
"images": [
|
|
205
|
+
{
|
|
206
|
+
"page": 1,
|
|
207
|
+
"index": 0,
|
|
208
|
+
"width": 800,
|
|
209
|
+
"height": 600,
|
|
210
|
+
"format": "rgb",
|
|
211
|
+
"data": "base64-encoded-image-data..."
|
|
212
|
+
}
|
|
213
|
+
]
|
|
214
|
+
}
|
|
215
|
+
```
|
|
216
|
+
|
|
217
|
+
**Supported formats**:
|
|
218
|
+
- ✅ **RGB** - Standard color images (most common)
|
|
219
|
+
- ✅ **RGBA** - Images with transparency
|
|
220
|
+
- ✅ **Grayscale** - Black and white images
|
|
221
|
+
- ✅ Works with JPEG, PNG, and other embedded formats
|
|
222
|
+
|
|
223
|
+
**Important considerations**:
|
|
224
|
+
- 🔸 Image extraction increases response size significantly
|
|
225
|
+
- 🔸 Useful for AI models with vision capabilities
|
|
226
|
+
- 🔸 Set `include_images: false` (default) to extract text only
|
|
227
|
+
- 🔸 Combine with `pages` parameter to limit extraction scope
|
|
228
|
+
|
|
195
229
|
### Security: Relative Paths Only
|
|
196
230
|
|
|
197
231
|
**Important:** The server only accepts **relative paths** for security reasons. Absolute paths are blocked to prevent unauthorized file system access.
|
|
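The "Image data format" added to the README in the hunk above lists `width`, `height`, `format`, and base64 `data` for each image. As a rough aid for sizing responses, the sketch below estimates how many raw bytes one entry should decode to. It assumes that `data` holds unencoded pixel data: 3 bytes per pixel for `rgb`, 4 for `rgba`, and 1 bit per pixel with byte-padded rows for `grayscale`. The README does not state these layouts. `expectedByteLength` and `checkImage` are hypothetical helper names, not part of the package.

```javascript
// Hypothetical helpers; the pixel layouts below are assumptions,
// not behaviour documented by the package.
function expectedByteLength({ width, height, format }) {
  switch (format) {
    case 'rgb':
      return width * height * 3;
    case 'rgba':
      return width * height * 4;
    case 'grayscale':
      // Assumed: 1 bit per pixel, each row padded to a whole byte.
      return Math.ceil(width / 8) * height;
    default:
      throw new Error(`Unknown image format: ${format}`);
  }
}

// Decode the base64 payload and compare its size against the estimate.
function checkImage(image) {
  const actual = Buffer.from(image.data, 'base64').length;
  const expected = expectedByteLength(image);
  return { actual, expected, matches: actual === expected };
}
```

Under these assumptions an 800x600 `rgb` image is 1,440,000 raw bytes, which base64 expands to about 1.92 million characters. That is consistent with the README's warning that image extraction increases response size significantly.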
@@ -360,12 +394,13 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed guidelines.
|
|
|
360
394
|
|
|
361
395
|
## 🗺️ Roadmap
|
|
362
396
|
|
|
363
|
-
- [
|
|
397
|
+
- [x] ~~Image extraction from PDFs~~ ✅ Completed (v1.0.0)
|
|
398
|
+
- [x] ~~Performance optimizations for parallel processing~~ ✅ Completed (v1.0.0)
|
|
364
399
|
- [ ] Annotation extraction support
|
|
365
400
|
- [ ] OCR integration for scanned PDFs
|
|
366
401
|
- [ ] Streaming support for very large files
|
|
367
402
|
- [ ] Enhanced caching mechanisms
|
|
368
|
-
- [ ]
|
|
403
|
+
- [ ] PDF form field extraction
|
|
369
404
|
|
|
370
405
|
## 🤝 Support & Community
|
|
371
406
|
|
package/dist/handlers/readPdf.js
CHANGED
|
@@ -1,299 +1,75 @@
|
|
|
1
|
-
|
|
1
|
+
// PDF reading handler - orchestrates PDF processing workflow
|
|
2
2
|
import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
|
|
3
|
-
import * as pdfjsLib from 'pdfjs-dist/legacy/build/pdf.mjs';
|
|
4
3
|
import { z } from 'zod';
|
|
5
|
-
import {
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
// Basic check
|
|
14
|
-
throw new Error(`Invalid page range format: ${trimmedPart}`);
|
|
15
|
-
}
|
|
16
|
-
const start = parseInt(startStr, 10);
|
|
17
|
-
const end = endStr === '' || endStr === undefined ? Infinity : parseInt(endStr, 10);
|
|
18
|
-
if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
|
|
19
|
-
throw new Error(`Invalid page range values: ${trimmedPart}`);
|
|
20
|
-
}
|
|
21
|
-
// Add a reasonable upper limit to prevent infinite loops for open ranges
|
|
22
|
-
const practicalEnd = Math.min(end, start + 10000); // Limit range parsing depth
|
|
23
|
-
for (let i = start; i <= practicalEnd; i++) {
|
|
24
|
-
pages.add(i);
|
|
25
|
-
}
|
|
26
|
-
if (end === Infinity && practicalEnd === start + 10000) {
|
|
27
|
-
console.warn(`[PDF Reader MCP] Open-ended range starting at ${String(start)} was truncated at page ${String(practicalEnd)} during parsing.`);
|
|
28
|
-
}
|
|
29
|
-
}
|
|
30
|
-
else {
|
|
31
|
-
const page = parseInt(trimmedPart, 10);
|
|
32
|
-
if (Number.isNaN(page) || page <= 0) {
|
|
33
|
-
throw new Error(`Invalid page number: ${trimmedPart}`);
|
|
34
|
-
}
|
|
35
|
-
pages.add(page);
|
|
36
|
-
}
|
|
37
|
-
};
|
|
38
|
-
// Parses the complete page range string (e.g., "1-3,5,7-")
|
|
39
|
-
const parsePageRanges = (ranges) => {
|
|
40
|
-
const pages = new Set();
|
|
41
|
-
const parts = ranges.split(',');
|
|
42
|
-
for (const part of parts) {
|
|
43
|
-
parseRangePart(part, pages); // Delegate parsing of each part
|
|
44
|
-
}
|
|
45
|
-
if (pages.size === 0) {
|
|
46
|
-
throw new Error('Page range string resulted in zero valid pages.');
|
|
47
|
-
}
|
|
48
|
-
return Array.from(pages).sort((a, b) => a - b);
|
|
49
|
-
};
|
|
50
|
-
// --- Zod Schemas ---
|
|
51
|
-
const pageSpecifierSchema = z.union([
|
|
52
|
-
z
|
|
53
|
-
.array(z.number().int().min(1))
|
|
54
|
-
.min(1), // Array of integers with minimum value 1 (pages are 1-based)
|
|
55
|
-
z
|
|
56
|
-
.string()
|
|
57
|
-
.min(1)
|
|
58
|
-
.refine((val) => /^[0-9,-]+$/.test(val.replace(/\s/g, '')), {
|
|
59
|
-
// Allow spaces but test without them
|
|
60
|
-
message: 'Page string must contain only numbers, commas, and hyphens.',
|
|
61
|
-
}),
|
|
62
|
-
]);
|
|
63
|
-
const PdfSourceSchema = z
|
|
64
|
-
.object({
|
|
65
|
-
path: z.string().min(1).optional().describe('Relative path to the local PDF file.'),
|
|
66
|
-
url: z.string().url().optional().describe('URL of the PDF file.'),
|
|
67
|
-
pages: pageSpecifierSchema
|
|
68
|
-
.optional()
|
|
69
|
-
.describe("Extract text only from specific pages (1-based) or ranges for *this specific source*. If provided, 'include_full_text' for the entire request is ignored for this source."),
|
|
70
|
-
})
|
|
71
|
-
.strict()
|
|
72
|
-
.refine((data) => !!(data.path && !data.url) || !!(!data.path && data.url), {
|
|
73
|
-
// Use boolean coercion instead of || for truthiness check if needed, though refine expects boolean
|
|
74
|
-
message: "Each source must have either 'path' or 'url', but not both.",
|
|
75
|
-
});
|
|
76
|
-
const ReadPdfArgsSchema = z
|
|
77
|
-
.object({
|
|
78
|
-
sources: z
|
|
79
|
-
.array(PdfSourceSchema)
|
|
80
|
-
.min(1)
|
|
81
|
-
.describe('An array of PDF sources to process, each can optionally specify pages.'),
|
|
82
|
-
include_full_text: z
|
|
83
|
-
.boolean()
|
|
84
|
-
.optional()
|
|
85
|
-
.default(false)
|
|
86
|
-
.describe("Include the full text content of each PDF (only if 'pages' is not specified for that source)."),
|
|
87
|
-
include_metadata: z
|
|
88
|
-
.boolean()
|
|
89
|
-
.optional()
|
|
90
|
-
.default(true)
|
|
91
|
-
.describe('Include metadata and info objects for each PDF.'),
|
|
92
|
-
include_page_count: z
|
|
93
|
-
.boolean()
|
|
94
|
-
.optional()
|
|
95
|
-
.default(true)
|
|
96
|
-
.describe('Include the total number of pages for each PDF.'),
|
|
97
|
-
})
|
|
98
|
-
.strict();
|
|
99
|
-
// --- Helper Functions ---
|
|
100
|
-
// Parses the page specification for a single source
|
|
101
|
-
const getTargetPages = (sourcePages, sourceDescription) => {
|
|
102
|
-
if (!sourcePages) {
|
|
103
|
-
return undefined;
|
|
104
|
-
}
|
|
105
|
-
try {
|
|
106
|
-
let targetPages;
|
|
107
|
-
if (typeof sourcePages === 'string') {
|
|
108
|
-
targetPages = parsePageRanges(sourcePages);
|
|
109
|
-
}
|
|
110
|
-
else {
|
|
111
|
-
// Ensure array elements are positive integers
|
|
112
|
-
if (sourcePages.some((p) => !Number.isInteger(p) || p <= 0)) {
|
|
113
|
-
throw new Error('Page numbers in array must be positive integers.');
|
|
114
|
-
}
|
|
115
|
-
targetPages = [...new Set(sourcePages)].sort((a, b) => a - b);
|
|
116
|
-
}
|
|
117
|
-
if (targetPages.length === 0) {
|
|
118
|
-
// Check after potential Set deduplication
|
|
119
|
-
throw new Error('Page specification resulted in an empty set of pages.');
|
|
120
|
-
}
|
|
121
|
-
return targetPages;
|
|
122
|
-
}
|
|
123
|
-
catch (error) {
|
|
124
|
-
const message = error instanceof Error ? error.message : String(error);
|
|
125
|
-
// Throw McpError for invalid page specs caught during parsing
|
|
126
|
-
throw new McpError(ErrorCode.InvalidParams, `Invalid page specification for source ${sourceDescription}: ${message}`);
|
|
127
|
-
}
|
|
128
|
-
};
|
|
129
|
-
// Loads the PDF document from path or URL
|
|
130
|
-
const loadPdfDocument = async (source, // Explicitly allow undefined
|
|
131
|
-
sourceDescription) => {
|
|
132
|
-
let pdfDataSource;
|
|
133
|
-
try {
|
|
134
|
-
if (source.path) {
|
|
135
|
-
const safePath = resolvePath(source.path); // resolvePath handles security checks
|
|
136
|
-
const buffer = await fs.readFile(safePath);
|
|
137
|
-
pdfDataSource = new Uint8Array(buffer); // Convert Buffer to Uint8Array
|
|
138
|
-
}
|
|
139
|
-
else if (source.url) {
|
|
140
|
-
pdfDataSource = { url: source.url };
|
|
141
|
-
}
|
|
142
|
-
else {
|
|
143
|
-
// This case should be caught by Zod, but added for robustness
|
|
144
|
-
throw new McpError(ErrorCode.InvalidParams, `Source ${sourceDescription} missing 'path' or 'url'.`);
|
|
145
|
-
}
|
|
146
|
-
}
|
|
147
|
-
catch (err) {
|
|
148
|
-
// Handle errors during path resolution or file reading
|
|
149
|
-
let errorMessage; // Declare errorMessage here
|
|
150
|
-
const message = err instanceof Error ? err.message : String(err);
|
|
151
|
-
const errorCode = ErrorCode.InvalidRequest; // Default error code
|
|
152
|
-
if (typeof err === 'object' &&
|
|
153
|
-
err !== null &&
|
|
154
|
-
'code' in err &&
|
|
155
|
-
err.code === 'ENOENT' &&
|
|
156
|
-
source.path) {
|
|
157
|
-
// Specific handling for file not found
|
|
158
|
-
errorMessage = `File not found at '${source.path}'.`;
|
|
159
|
-
// Optionally keep errorCode as InvalidRequest or change if needed
|
|
160
|
-
}
|
|
161
|
-
else {
|
|
162
|
-
// Generic error for other file prep issues or resolvePath errors
|
|
163
|
-
errorMessage = `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`;
|
|
164
|
-
}
|
|
165
|
-
throw new McpError(errorCode, errorMessage, { cause: err instanceof Error ? err : undefined });
|
|
166
|
-
}
|
|
167
|
-
const loadingTask = pdfjsLib.getDocument(pdfDataSource);
|
|
168
|
-
try {
|
|
169
|
-
return await loadingTask.promise;
|
|
170
|
-
}
|
|
171
|
-
catch (err) {
|
|
172
|
-
console.error(`[PDF Reader MCP] PDF.js loading error for ${sourceDescription}:`, err);
|
|
173
|
-
const message = err instanceof Error ? err.message : String(err);
|
|
174
|
-
// Use ?? for default message
|
|
175
|
-
throw new McpError(ErrorCode.InvalidRequest, `Failed to load PDF document from ${sourceDescription}. Reason: ${message || 'Unknown loading error'}`, // Revert to || as message is likely always string here
|
|
176
|
-
{ cause: err instanceof Error ? err : undefined });
|
|
177
|
-
}
|
|
178
|
-
};
|
|
179
|
-
// Extracts metadata and page count
|
|
180
|
-
const extractMetadataAndPageCount = async (pdfDocument, includeMetadata, includePageCount) => {
|
|
181
|
-
const output = {};
|
|
182
|
-
if (includePageCount) {
|
|
183
|
-
output.num_pages = pdfDocument.numPages;
|
|
184
|
-
}
|
|
185
|
-
if (includeMetadata) {
|
|
186
|
-
try {
|
|
187
|
-
const pdfMetadata = await pdfDocument.getMetadata();
|
|
188
|
-
const infoData = pdfMetadata.info;
|
|
189
|
-
if (infoData !== undefined) {
|
|
190
|
-
output.info = infoData;
|
|
191
|
-
}
|
|
192
|
-
const metadataObj = pdfMetadata.metadata;
|
|
193
|
-
// Convert the metadata object to a plain object by extracting all properties
|
|
194
|
-
// Check if it has a getAll method (as used in tests)
|
|
195
|
-
if (typeof metadataObj.getAll === 'function') {
|
|
196
|
-
output.metadata = metadataObj.getAll();
|
|
197
|
-
}
|
|
198
|
-
else {
|
|
199
|
-
// For real PDF.js metadata, convert to plain object
|
|
200
|
-
const metadataRecord = {};
|
|
201
|
-
// Extract enumerable properties
|
|
202
|
-
for (const key in metadataObj) {
|
|
203
|
-
if (Object.hasOwn(metadataObj, key)) {
|
|
204
|
-
metadataRecord[key] = metadataObj[key];
|
|
205
|
-
}
|
|
206
|
-
}
|
|
207
|
-
output.metadata = metadataRecord;
|
|
208
|
-
}
|
|
209
|
-
}
|
|
210
|
-
catch (metaError) {
|
|
211
|
-
console.warn(`[PDF Reader MCP] Error extracting metadata: ${metaError instanceof Error ? metaError.message : String(metaError)}`);
|
|
212
|
-
// Optionally add a warning to the result if metadata extraction fails partially
|
|
213
|
-
}
|
|
214
|
-
}
|
|
215
|
-
return output;
|
|
216
|
-
};
|
|
217
|
-
// Extracts text from specified pages
|
|
218
|
-
const extractPageTexts = async (pdfDocument, pagesToProcess, sourceDescription) => {
|
|
219
|
-
const extractedPageTexts = [];
|
|
220
|
-
for (const pageNum of pagesToProcess) {
|
|
221
|
-
let pageText = '';
|
|
222
|
-
try {
|
|
223
|
-
const page = await pdfDocument.getPage(pageNum);
|
|
224
|
-
const textContent = await page.getTextContent();
|
|
225
|
-
pageText = textContent.items
|
|
226
|
-
.map((item) => item.str) // Type assertion
|
|
227
|
-
.join('');
|
|
228
|
-
}
|
|
229
|
-
catch (pageError) {
|
|
230
|
-
const message = pageError instanceof Error ? pageError.message : String(pageError);
|
|
231
|
-
console.warn(`[PDF Reader MCP] Error getting text content for page ${String(pageNum)} in ${sourceDescription}: ${message}` // Explicit string conversion
|
|
232
|
-
);
|
|
233
|
-
pageText = `Error processing page: ${message}`; // Include error in text
|
|
234
|
-
}
|
|
235
|
-
extractedPageTexts.push({ page: pageNum, text: pageText });
|
|
236
|
-
}
|
|
237
|
-
// Sorting is likely unnecessary if pagesToProcess was sorted, but keep for safety
|
|
238
|
-
extractedPageTexts.sort((a, b) => a.page - b.page);
|
|
239
|
-
return extractedPageTexts;
|
|
240
|
-
};
|
|
241
|
-
// Determines the actual list of pages to process based on target pages and total pages
|
|
242
|
-
const determinePagesToProcess = (targetPages, totalPages, includeFullText) => {
|
|
243
|
-
let pagesToProcess = [];
|
|
244
|
-
let invalidPages = [];
|
|
245
|
-
if (targetPages) {
|
|
246
|
-
// Filter target pages based on actual total pages
|
|
247
|
-
pagesToProcess = targetPages.filter((p) => p <= totalPages);
|
|
248
|
-
invalidPages = targetPages.filter((p) => p > totalPages);
|
|
249
|
-
}
|
|
250
|
-
else if (includeFullText) {
|
|
251
|
-
// If no specific pages requested for this source, use global flag
|
|
252
|
-
pagesToProcess = Array.from({ length: totalPages }, (_, i) => i + 1);
|
|
253
|
-
}
|
|
254
|
-
return { pagesToProcess, invalidPages };
|
|
255
|
-
};
|
|
256
|
-
// Processes a single PDF source
|
|
257
|
-
const processSingleSource = async (source, globalIncludeFullText, globalIncludeMetadata, globalIncludePageCount) => {
|
|
4
|
+
import { buildWarnings, extractMetadataAndPageCount, extractPageContent, } from '../pdf/extractor.js';
|
|
5
|
+
import { loadPdfDocument } from '../pdf/loader.js';
|
|
6
|
+
import { determinePagesToProcess, getTargetPages } from '../pdf/parser.js';
|
|
7
|
+
import { readPdfArgsSchema } from '../schemas/readPdf.js';
|
|
8
|
+
/**
|
|
9
|
+
* Process a single PDF source
|
|
10
|
+
*/
|
|
11
|
+
const processSingleSource = async (source, options) => {
|
|
258
12
|
const sourceDescription = source.path ?? source.url ?? 'unknown source';
|
|
259
13
|
let individualResult = { source: sourceDescription, success: false };
|
|
260
14
|
try {
|
|
261
|
-
//
|
|
15
|
+
// Parse target pages
|
|
262
16
|
const targetPages = getTargetPages(source.pages, sourceDescription);
|
|
263
|
-
//
|
|
264
|
-
// Destructure to remove 'pages' before passing to loadPdfDocument due to exactOptionalPropertyTypes
|
|
17
|
+
// Load PDF document
|
|
265
18
|
const { pages: _pages, ...loadArgs } = source;
|
|
266
19
|
const pdfDocument = await loadPdfDocument(loadArgs, sourceDescription);
|
|
267
20
|
const totalPages = pdfDocument.numPages;
|
|
268
|
-
//
|
|
269
|
-
const metadataOutput = await extractMetadataAndPageCount(pdfDocument,
|
|
270
|
-
const output = { ...metadataOutput };
|
|
271
|
-
//
|
|
272
|
-
const { pagesToProcess, invalidPages } = determinePagesToProcess(targetPages, totalPages,
|
|
273
|
-
|
|
274
|
-
|
|
275
|
-
if (
|
|
276
|
-
output.warnings =
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
// 5. Extract Text (if needed)
|
|
21
|
+
// Extract metadata and page count
|
|
22
|
+
const metadataOutput = await extractMetadataAndPageCount(pdfDocument, options.includeMetadata, options.includePageCount);
|
|
23
|
+
const output = { ...metadataOutput };
|
|
24
|
+
// Determine pages to process
|
|
25
|
+
const { pagesToProcess, invalidPages } = determinePagesToProcess(targetPages, totalPages, options.includeFullText);
|
|
26
|
+
// Add warnings for invalid pages
|
|
27
|
+
const warnings = buildWarnings(invalidPages, totalPages);
|
|
28
|
+
if (warnings.length > 0) {
|
|
29
|
+
output.warnings = warnings;
|
|
30
|
+
}
|
|
31
|
+
// Extract content with ordering preserved
|
|
280
32
|
if (pagesToProcess.length > 0) {
|
|
281
|
-
|
|
33
|
+
// Use new extractPageContent to preserve Y-coordinate ordering
|
|
34
|
+
const pageContents = await Promise.all(pagesToProcess.map((pageNum) => extractPageContent(pdfDocument, pageNum, options.includeImages, sourceDescription)));
|
|
35
|
+
// Store page contents for ordered retrieval
|
|
36
|
+
output.page_contents = pageContents.map((items, idx) => ({
|
|
37
|
+
page: pagesToProcess[idx],
|
|
38
|
+
items,
|
|
39
|
+
}));
|
|
40
|
+
// For backward compatibility, also provide text-only outputs
|
|
41
|
+
const extractedPageTexts = pageContents.map((items, idx) => ({
|
|
42
|
+
page: pagesToProcess[idx],
|
|
43
|
+
text: items
|
|
44
|
+
.filter((item) => item.type === 'text')
|
|
45
|
+
.map((item) => item.textContent)
|
|
46
|
+
.join(''),
|
|
47
|
+
}));
|
|
282
48
|
if (targetPages) {
|
|
283
|
-
//
|
|
49
|
+
// Specific pages requested
|
|
284
50
|
output.page_texts = extractedPageTexts;
|
|
285
51
|
}
|
|
286
52
|
else {
|
|
287
|
-
//
|
|
53
|
+
// Full text requested
|
|
288
54
|
output.full_text = extractedPageTexts.map((p) => p.text).join('\n\n');
|
|
289
55
|
}
|
|
56
|
+
// Extract image metadata for JSON response
|
|
57
|
+
if (options.includeImages) {
|
|
58
|
+
const extractedImages = pageContents
|
|
59
|
+
.flatMap((items) => items.filter((item) => item.type === 'image' && item.imageData))
|
|
60
|
+
.map((item) => item.imageData)
|
|
61
|
+
.filter((img) => img !== undefined);
|
|
62
|
+
if (extractedImages.length > 0) {
|
|
63
|
+
output.images = extractedImages;
|
|
64
|
+
}
|
|
65
|
+
}
|
|
290
66
|
}
|
|
291
67
|
individualResult = { ...individualResult, data: output, success: true };
|
|
292
68
|
}
|
|
293
69
|
catch (error) {
|
|
294
70
|
let errorMessage = `Failed to process PDF from ${sourceDescription}.`;
|
|
295
71
|
if (error instanceof McpError) {
|
|
296
|
-
errorMessage = error.message;
|
|
72
|
+
errorMessage = error.message;
|
|
297
73
|
}
|
|
298
74
|
else if (error instanceof Error) {
|
|
299
75
|
errorMessage += ` Reason: ${error.message}`;
|
|
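The new `processSingleSource` in the hunk above pairs each extraction result with `pagesToProcess[idx]`. That pairing relies on `Promise.all` returning results in input order, regardless of which promise settles first. A minimal sketch of that property follows; `fakeExtract` is a hypothetical stand-in for `extractPageContent`, with artificial delays so that later pages finish sooner.

```javascript
// Hypothetical stand-in for extractPageContent: page 3 resolves first,
// page 1 resolves last.
const fakeExtract = (pageNum) =>
  new Promise((resolve) => {
    setTimeout(() => resolve(`content of page ${pageNum}`), (4 - pageNum) * 10);
  });

async function extractInOrder(pagesToProcess) {
  // Promise.all keeps the input order even though completion order differs.
  const contents = await Promise.all(pagesToProcess.map(fakeExtract));
  // Same index-based pairing as in the diff.
  return contents.map((items, idx) => ({ page: pagesToProcess[idx], items }));
}
```

Because the order is preserved, the index-based pairing is safe without an extra sort, as long as `pagesToProcess` itself is already in the desired order.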
@@ -303,40 +79,90 @@ const processSingleSource = async (source, globalIncludeFullText, globalIncludeM
|
|
|
303
79
|
}
|
|
304
80
|
individualResult.error = errorMessage;
|
|
305
81
|
individualResult.success = false;
|
|
306
|
-
individualResult.data = undefined;
|
|
82
|
+
individualResult.data = undefined;
|
|
307
83
|
}
|
|
308
84
|
return individualResult;
|
|
309
85
|
};
|
|
310
|
-
|
|
86
|
+
/**
|
|
87
|
+
* Main handler function for read_pdf tool
|
|
88
|
+
*/
|
|
311
89
|
export const handleReadPdfFunc = async (args) => {
|
|
312
90
|
let parsedArgs;
|
|
313
91
|
try {
|
|
314
|
-
parsedArgs =
|
|
92
|
+
parsedArgs = readPdfArgsSchema.parse(args);
|
|
315
93
|
}
|
|
316
94
|
catch (error) {
|
|
317
95
|
if (error instanceof z.ZodError) {
|
|
318
96
|
throw new McpError(ErrorCode.InvalidParams, `Invalid arguments: ${error.errors.map((e) => `${e.path.join('.')} (${e.message})`).join(', ')}`);
|
|
319
97
|
}
|
|
320
|
-
// Added fallback for non-Zod errors during parsing
|
|
321
98
|
const message = error instanceof Error ? error.message : String(error);
|
|
322
99
|
throw new McpError(ErrorCode.InvalidParams, `Argument validation failed: ${message}`);
|
|
323
100
|
}
|
|
324
|
-
const { sources, include_full_text, include_metadata, include_page_count } = parsedArgs;
|
|
101
|
+
const { sources, include_full_text, include_metadata, include_page_count, include_images } = parsedArgs;
|
|
325
102
|
// Process all sources concurrently
|
|
326
|
-
const results = await Promise.all(sources.map((source) => processSingleSource(source,
|
|
327
|
-
|
|
328
|
-
|
|
329
|
-
|
|
330
|
-
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
|
|
334
|
-
|
|
103
|
+
const results = await Promise.all(sources.map((source) => processSingleSource(source, {
|
|
104
|
+
includeFullText: include_full_text,
|
|
105
|
+
includeMetadata: include_metadata,
|
|
106
|
+
includePageCount: include_page_count,
|
|
107
|
+
includeImages: include_images,
|
|
108
|
+
})));
|
|
109
|
+
// Build content parts - start with structured JSON for backward compatibility
|
|
110
|
+
const content = [];
|
|
111
|
+
// Strip image data and page_contents from JSON to keep it manageable
|
|
112
|
+
const resultsForJson = results.map((result) => {
|
|
113
|
+
if (result.data) {
|
|
114
|
+
const { images, page_contents, ...dataWithoutBinaryContent } = result.data;
|
|
115
|
+
// Include image count and metadata in JSON, but not the base64 data
|
|
116
|
+
if (images) {
|
|
117
|
+
const imageInfo = images.map((img) => ({
|
|
118
|
+
page: img.page,
|
|
119
|
+
index: img.index,
|
|
120
|
+
width: img.width,
|
|
121
|
+
height: img.height,
|
|
122
|
+
format: img.format,
|
|
123
|
+
}));
|
|
124
|
+
return { ...result, data: { ...dataWithoutBinaryContent, image_info: imageInfo } };
|
|
125
|
+
}
|
|
126
|
+
return { ...result, data: dataWithoutBinaryContent };
|
|
127
|
+
}
|
|
128
|
+
return result;
|
|
129
|
+
});
|
|
130
|
+
// First content part: Structured JSON results
|
|
131
|
+
content.push({
|
|
132
|
+
type: 'text',
|
|
133
|
+
text: JSON.stringify({ results: resultsForJson }, null, 2),
|
|
134
|
+
});
|
|
135
|
+
// Add page content in exact Y-coordinate order
|
|
136
|
+
for (const result of results) {
|
|
137
|
+
if (!result.success || !result.data?.page_contents)
|
|
138
|
+
continue;
|
|
139
|
+
// Process each page's content items in order
|
|
140
|
+
for (const pageContent of result.data.page_contents) {
|
|
141
|
+
for (const item of pageContent.items) {
|
|
142
|
+
if (item.type === 'text' && item.textContent) {
|
|
143
|
+
// Add text content part
|
|
144
|
+
content.push({
|
|
145
|
+
type: 'text',
|
|
146
|
+
text: item.textContent,
|
|
147
|
+
});
|
|
148
|
+
}
|
|
149
|
+
else if (item.type === 'image' && item.imageData) {
|
|
150
|
+
// Add image content part
|
|
151
|
+
content.push({
|
|
152
|
+
type: 'image',
|
|
153
|
+
data: item.imageData.data,
|
|
154
|
+
mimeType: item.imageData.format === 'rgba' ? 'image/png' : 'image/jpeg',
|
|
155
|
+
});
|
|
156
|
+
}
|
|
157
|
+
}
|
|
158
|
+
}
|
|
159
|
+
}
|
|
160
|
+
return { content };
|
|
335
161
|
};
|
|
336
|
-
// Export the
|
|
162
|
+
// Export the tool definition
|
|
337
163
|
export const readPdfToolDefinition = {
|
|
338
164
|
name: 'read_pdf',
|
|
339
|
-
description: 'Reads content/metadata from one or more PDFs (local/URL). Each source can specify pages to extract.',
|
|
340
|
-
schema:
|
|
165
|
+
description: 'Reads content/metadata/images from one or more PDFs (local/URL). Each source can specify pages to extract.',
|
|
166
|
+
schema: readPdfArgsSchema,
|
|
341
167
|
handler: handleReadPdfFunc,
|
|
342
168
|
};
|
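The page-range parsing removed from `readPdf.js` is only partly visible in the diff above, and per the new imports it now appears to live in `../pdf/parser.js`. The sketch below reconstructs the visible behaviour under stated assumptions. How a part is split on `-` and what the "basic check" tests are guesses, since those lines are elided. `maxOpenRange` is a hypothetical parameter name for the hard-coded 10000 cap.

```javascript
// Sketch of the removed page-range parser. The split and format check
// are assumptions; the validation messages follow the visible lines.
function parsePageRanges(ranges, maxOpenRange = 10000) {
  const pages = new Set();
  for (const part of ranges.split(',')) {
    const trimmed = part.trim();
    if (trimmed.includes('-')) {
      const pieces = trimmed.split('-');
      if (pieces.length !== 2) {
        throw new Error(`Invalid page range format: ${trimmed}`);
      }
      const [startStr, endStr] = pieces;
      const start = parseInt(startStr, 10);
      // An empty end ("7-") means an open-ended range.
      const end = endStr === '' ? Infinity : parseInt(endStr, 10);
      if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
        throw new Error(`Invalid page range values: ${trimmed}`);
      }
      // Cap open-ended ranges so the loop always terminates.
      const practicalEnd = Math.min(end, start + maxOpenRange);
      for (let i = start; i <= practicalEnd; i++) {
        pages.add(i);
      }
    } else {
      const page = parseInt(trimmed, 10);
      if (Number.isNaN(page) || page <= 0) {
        throw new Error(`Invalid page number: ${trimmed}`);
      }
      pages.add(page);
    }
  }
  if (pages.size === 0) {
    throw new Error('Page range string resulted in zero valid pages.');
  }
  // Deduplicated via the Set, returned in ascending order.
  return Array.from(pages).sort((a, b) => a - b);
}
```

Pages beyond the document's real page count are not rejected here; in the diff that filtering is left to `determinePagesToProcess`, which reports them as `invalidPages`.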
package/dist/index.js
CHANGED
|
@@ -10,9 +10,9 @@ import { allToolDefinitions } from './handlers/index.js';
|
|
|
10
10
|
// Removed tool name constants, names are now in the definitions
|
|
11
11
|
// --- Server Setup ---
|
|
12
12
|
const server = new Server({
|
|
13
|
-
name: '
|
|
14
|
-
version: '
|
|
15
|
-
description: 'MCP Server for
|
|
13
|
+
name: 'pdf-reader-mcp',
|
|
14
|
+
version: '1.2.0',
|
|
15
|
+
description: 'MCP Server for reading PDF files and extracting text, metadata, images, and page information.',
|
|
16
16
|
}, {
|
|
17
17
|
capabilities: { tools: {} },
|
|
18
18
|
});
|
|
@@ -48,10 +48,10 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
|
|
|
48
48
|
async function main() {
|
|
49
49
|
const transport = new StdioServerTransport();
|
|
50
50
|
await server.connect(transport);
|
|
51
|
-
console.error('[
|
|
51
|
+
console.error('[PDF Reader MCP] Server running on stdio');
|
|
52
52
|
}
|
|
53
53
|
main().catch((error) => {
|
|
54
54
|
// Specify 'unknown' type for catch variable
|
|
55
|
-
console.error('[
|
|
55
|
+
console.error('[PDF Reader MCP] Server error:', error);
|
|
56
56
|
process.exit(1);
|
|
57
57
|
});
|
|
package/dist/pdf/extractor.js
@@ -0,0 +1,265 @@
|
|
|
1
|
+
// PDF text and metadata extraction utilities
|
|
2
|
+
import { OPS } from 'pdfjs-dist/legacy/build/pdf.mjs';
|
|
3
|
+
/**
|
|
4
|
+
* Extract metadata and page count from a PDF document
|
|
5
|
+
*/
|
|
6
|
+
export const extractMetadataAndPageCount = async (pdfDocument, includeMetadata, includePageCount) => {
|
|
7
|
+
const output = {};
|
|
8
|
+
if (includePageCount) {
|
|
9
|
+
output.num_pages = pdfDocument.numPages;
|
|
10
|
+
}
|
|
11
|
+
if (includeMetadata) {
|
|
12
|
+
try {
|
|
13
|
+
const pdfMetadata = await pdfDocument.getMetadata();
|
|
14
|
+
const infoData = pdfMetadata.info;
|
|
15
|
+
if (infoData !== undefined) {
|
|
16
|
+
output.info = infoData;
|
|
17
|
+
}
|
|
18
|
+
const metadataObj = pdfMetadata.metadata;
|
|
19
|
+
// Check if it has a getAll method (as used in tests)
|
|
20
|
+
if (typeof metadataObj.getAll === 'function') {
|
|
21
|
+
output.metadata = metadataObj.getAll();
|
|
22
|
+
}
|
|
23
|
+
else {
|
|
24
|
+
// For real PDF.js metadata, convert to plain object
|
|
25
|
+
const metadataRecord = {};
|
|
26
|
+
for (const key in metadataObj) {
|
|
27
|
+
if (Object.hasOwn(metadataObj, key)) {
|
|
28
|
+
metadataRecord[key] = metadataObj[key];
|
|
29
|
+
}
|
|
30
|
+
}
|
|
31
|
+
output.metadata = metadataRecord;
|
|
32
|
+
}
|
|
33
|
+
}
|
|
34
|
+
catch (metaError) {
|
|
35
|
+
console.warn(`[PDF Reader MCP] Error extracting metadata: ${metaError instanceof Error ? metaError.message : String(metaError)}`);
|
|
36
|
+
}
|
|
37
|
+
}
|
|
38
|
+
return output;
|
|
39
|
+
};
|
|
40
|
+
/**
|
|
41
|
+
* Extract text from a single page
|
|
42
|
+
*/
|
|
43
|
+
const extractSinglePageText = async (pdfDocument, pageNum, sourceDescription) => {
|
|
44
|
+
try {
|
|
45
|
+
const page = await pdfDocument.getPage(pageNum);
|
|
46
|
+
const textContent = await page.getTextContent();
|
|
47
|
+
const pageText = textContent.items
|
|
48
|
+
.map((item) => item.str)
|
|
49
|
+
.join('');
|
|
50
|
+
return { page: pageNum, text: pageText };
|
|
51
|
+
}
|
|
52
|
+
catch (pageError) {
|
|
53
|
+
const message = pageError instanceof Error ? pageError.message : String(pageError);
|
|
54
|
+
console.warn(`[PDF Reader MCP] Error getting text content for page ${String(pageNum)} in ${sourceDescription}: ${message}`);
|
|
55
|
+
return { page: pageNum, text: `Error processing page: ${message}` };
|
|
56
|
+
}
|
|
57
|
+
};
|
|
58
|
+
/**
|
|
59
|
+
* Extract text from specified pages (parallel processing for performance)
|
|
60
|
+
*/
|
|
61
|
+
export const extractPageTexts = async (pdfDocument, pagesToProcess, sourceDescription) => {
|
|
62
|
+
// Process all pages in parallel for better performance
|
|
63
|
+
const extractedPageTexts = await Promise.all(pagesToProcess.map((pageNum) => extractSinglePageText(pdfDocument, pageNum, sourceDescription)));
|
|
64
|
+
return extractedPageTexts.sort((a, b) => a.page - b.page);
|
|
65
|
+
};
|
|
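The parallel extraction above can be tried without a real PDF. The sketch below assumes a hypothetical `fakeDocument` that imitates only the two calls the diff relies on (`getPage` and `getTextContent`). It simplifies the functions from the diff and is not the package's code.

```javascript
// Hypothetical stand-in for a PDF.js document proxy with three pages.
const fakeDocument = {
  getPage: async (pageNum) => {
    if (pageNum > 3) throw new Error(`Invalid page request: ${pageNum}`);
    return {
      getTextContent: async () => ({ items: [{ str: 'page ' }, { str: String(pageNum) }] }),
    };
  },
};

// Per-page failures become an error string instead of rejecting the whole batch.
const extractSinglePageText = async (doc, pageNum) => {
  try {
    const page = await doc.getPage(pageNum);
    const textContent = await page.getTextContent();
    return { page: pageNum, text: textContent.items.map((item) => item.str).join('') };
  } catch (pageError) {
    const message = pageError instanceof Error ? pageError.message : String(pageError);
    return { page: pageNum, text: `Error processing page: ${message}` };
  }
};

// Start every page at once, then normalize to ascending page order.
const extractPageTexts = async (doc, pagesToProcess) => {
  const results = await Promise.all(pagesToProcess.map((p) => extractSinglePageText(doc, p)));
  return results.sort((a, b) => a.page - b.page);
};

const pending = extractPageTexts(fakeDocument, [3, 1, 4, 2]);
pending.then((pages) => console.log(pages));
```

`Promise.all` already resolves in input order, so the final sort matters only when the caller passes an unsorted page list. Each page catches its own error, so one bad page (page 4 here) does not reject the whole batch.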
66
|
+
/**
|
|
67
|
+
* Extract images from a single page
|
|
68
|
+
*/
|
|
69
|
+
const extractImagesFromPage = async (page, pageNum) => {
|
|
70
|
+
const images = [];
|
|
71
|
+
try {
|
|
72
|
+
const operatorList = await page.getOperatorList();
|
|
73
|
+
// Find all image painting operations
|
|
74
|
+
const imageIndices = [];
|
|
75
|
+
for (let i = 0; i < operatorList.fnArray.length; i++) {
|
|
76
|
+
const op = operatorList.fnArray[i];
|
|
77
|
+
if (op === OPS.paintImageXObject || op === OPS.paintXObject) {
|
|
78
|
+
imageIndices.push(i);
|
|
79
|
+
}
|
|
80
|
+
}
|
|
81
|
+
// Extract each image using Promise-based approach
|
|
82
|
+
const imagePromises = imageIndices.map((imgIndex, arrayIndex) => new Promise((resolve) => {
|
|
83
|
+
const argsArray = operatorList.argsArray[imgIndex];
|
|
84
|
+
if (!argsArray || argsArray.length === 0) {
|
|
85
|
+
resolve(null);
|
|
86
|
+
return;
|
|
87
|
+
}
|
|
88
|
+
const imageName = argsArray[0];
|
|
89
|
+
// Use callback-based get() as images may not be resolved yet
|
|
90
|
+
page.objs.get(imageName, (imageData) => {
|
|
91
|
+
if (!imageData || typeof imageData !== 'object') {
|
|
92
|
+
resolve(null);
|
|
93
|
+
return;
|
|
94
|
+
}
|
|
95
|
+
const img = imageData;
|
|
96
|
+
if (!img.data || !img.width || !img.height) {
|
|
97
|
+
resolve(null);
|
|
98
|
+
return;
|
|
99
|
+
}
|
|
100
|
+
// Determine image format based on kind
|
|
101
|
+
// kind === 1 = grayscale, 2 = RGB, 3 = RGBA
|
|
102
|
+
const format = img.kind === 1 ? 'grayscale' : img.kind === 3 ? 'rgba' : 'rgb';
|
|
103
|
+
// Convert Uint8Array to base64
|
|
104
|
+
const base64 = Buffer.from(img.data).toString('base64');
|
|
105
|
+
resolve({
|
|
106
|
+
page: pageNum,
|
|
107
|
+
index: arrayIndex,
|
|
108
|
+
width: img.width,
|
|
109
|
+
height: img.height,
|
|
110
|
+
format,
|
|
111
|
+
data: base64,
|
|
112
|
+
});
|
|
113
|
+
});
|
|
114
|
+
}));
|
|
115
|
+
const resolvedImages = await Promise.all(imagePromises);
|
|
116
|
+
images.push(...resolvedImages.filter((img) => img !== null));
|
|
117
|
+
}
|
|
118
|
+
catch (error) {
|
|
119
|
+
const message = error instanceof Error ? error.message : String(error);
|
|
120
|
+
console.warn(`[PDF Reader MCP] Error extracting images from page ${String(pageNum)}: ${message}`);
|
|
121
|
+
}
|
|
122
|
+
return images;
|
|
123
|
+
};
|
|
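The `data` field produced above is base64 of the raw pixel buffer, not an encoded PNG or JPEG. The following sketch shows the `kind` mapping and the base64 round trip on a hypothetical one-pixel image. The `img` object and the helper name `formatForKind` are assumptions for illustration.

```javascript
// Map the numeric `kind` to the label used in the diff.
// Any value other than 1 or 3 falls back to 'rgb'.
const formatForKind = (kind) => (kind === 1 ? 'grayscale' : kind === 3 ? 'rgba' : 'rgb');

// Hypothetical decoded image: a single red RGB pixel (3 bytes).
const img = { width: 1, height: 1, kind: 2, data: new Uint8Array([255, 0, 0]) };

// Same shape as the object resolved in the diff, minus page and index.
const encoded = {
  width: img.width,
  height: img.height,
  format: formatForKind(img.kind),
  data: Buffer.from(img.data).toString('base64'),
};
console.log(encoded);

// A consumer reverses the encoding to get the raw bytes back.
const roundTrip = new Uint8Array(Buffer.from(encoded.data, 'base64'));
console.log(roundTrip);
```

Because the payload is raw pixels, a consumer needs `width`, `height`, and `format` together to interpret the bytes. It cannot be written to disk as a viewable image file without first being encoded.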
124
|
+
/**
|
|
125
|
+
* Extract images from specified pages
|
|
126
|
+
*/
|
|
127
|
+
export const extractImages = async (pdfDocument, pagesToProcess) => {
|
|
128
|
+
const allImages = [];
|
|
129
|
+
// Process pages sequentially to avoid overwhelming PDF.js
|
|
130
|
+
for (const pageNum of pagesToProcess) {
|
|
131
|
+
try {
|
|
132
|
+
const page = await pdfDocument.getPage(pageNum);
|
|
133
|
+
const pageImages = await extractImagesFromPage(page, pageNum);
|
|
134
|
+
allImages.push(...pageImages);
|
|
135
|
+
}
|
|
136
|
+
catch (error) {
|
|
137
|
+
const message = error instanceof Error ? error.message : String(error);
|
|
138
|
+
console.warn(`[PDF Reader MCP] Error getting page ${String(pageNum)} for image extraction: ${message}`);
|
|
139
|
+
}
|
|
140
|
+
}
|
|
141
|
+
return allImages;
|
|
142
|
+
};
|
|
143
|
+
/**
|
|
144
|
+
* Build warnings array for invalid page numbers
|
|
145
|
+
*/
|
|
146
|
+
export const buildWarnings = (invalidPages, totalPages) => {
|
|
147
|
+
if (invalidPages.length === 0) {
|
|
148
|
+
return [];
|
|
149
|
+
}
|
|
150
|
+
return [
|
|
151
|
+
`Requested page numbers ${invalidPages.join(', ')} exceed total pages (${String(totalPages)}).`,
|
|
152
|
+
];
|
|
153
|
+
};
|
|
154
|
+
/**
|
|
155
|
+
* Extract all content (text and images) from a single page with Y-coordinate ordering
|
|
156
|
+
*/
|
|
157
|
+
export const extractPageContent = async (pdfDocument, pageNum, includeImages, sourceDescription) => {
|
|
158
|
+
const contentItems = [];
|
|
159
|
+
try {
|
|
160
|
+
const page = await pdfDocument.getPage(pageNum);
|
|
161
|
+
// Extract text content with Y-coordinates
|
|
162
|
+
const textContent = await page.getTextContent();
|
|
163
|
+
// Group text items by Y-coordinate (items on same line have similar Y values)
|
|
164
|
+
const textByY = new Map();
|
|
165
|
+
for (const item of textContent.items) {
|
|
166
|
+
const textItem = item;
|
|
167
|
+
// transform[5] is the Y coordinate
|
|
168
|
+
const yCoord = textItem.transform[5];
|
|
169
|
+
if (yCoord === undefined)
|
|
170
|
+
continue;
|
|
171
|
+
const y = Math.round(yCoord);
|
|
172
|
+
if (!textByY.has(y)) {
|
|
173
|
+
textByY.set(y, []);
|
|
174
|
+
}
|
|
175
|
+
textByY.get(y)?.push(textItem.str);
|
|
176
|
+
}
|
|
177
|
+
// Convert grouped text to content items
|
|
178
|
+
for (const [y, textParts] of textByY.entries()) {
|
|
179
|
+
const textContent = textParts.join('');
|
|
180
|
+
if (textContent.trim()) {
|
|
181
|
+
contentItems.push({
|
|
182
|
+
type: 'text',
|
|
183
|
+
yPosition: y,
|
|
184
|
+
textContent,
|
|
185
|
+
});
|
|
186
|
+
}
|
|
187
|
+
}
|
|
188
|
+
// Extract images with Y-coordinates if requested
|
|
189
|
+
if (includeImages) {
|
|
190
|
+
const operatorList = await page.getOperatorList();
|
|
191
|
+
// Find all image painting operations
|
|
192
|
+
const imageIndices = [];
|
|
193
|
+
for (let i = 0; i < operatorList.fnArray.length; i++) {
|
|
194
|
+
const op = operatorList.fnArray[i];
|
|
195
|
+
if (op === OPS.paintImageXObject || op === OPS.paintXObject) {
|
|
196
|
+
imageIndices.push(i);
|
|
197
|
+
}
|
|
198
|
+
}
|
|
199
|
+
// Extract each image with its Y-coordinate
|
|
200
|
+
const imagePromises = imageIndices.map((imgIndex, arrayIndex) => new Promise((resolve) => {
|
|
201
|
+
const argsArray = operatorList.argsArray[imgIndex];
|
|
202
|
+
if (!argsArray || argsArray.length === 0) {
|
|
203
|
+
resolve(null);
|
|
204
|
+
return;
|
|
205
|
+
}
|
|
206
|
+
const imageName = argsArray[0];
|
|
207
|
+
// Get transform matrix from the args (if available)
|
|
208
|
+
// The transform is typically in argsArray[1] for some ops
|
|
209
|
+
let yPosition = 0;
|
|
210
|
+
if (argsArray.length > 1 && Array.isArray(argsArray[1])) {
|
|
211
|
+
const transform = argsArray[1];
|
|
212
|
+
// transform[5] is the Y coordinate
|
|
213
|
+
const yCoord = transform[5];
|
|
214
|
+
if (yCoord !== undefined) {
|
|
215
|
+
yPosition = Math.round(yCoord);
|
|
216
|
+
}
|
|
217
|
+
}
|
|
218
|
+
// Use callback-based get() as images may not be resolved yet
|
|
219
|
+
page.objs.get(imageName, (imageData) => {
|
|
220
|
+
if (!imageData || typeof imageData !== 'object') {
|
|
221
|
+
resolve(null);
|
|
222
|
+
return;
|
|
223
|
+
}
|
|
224
|
+
const img = imageData;
|
|
225
|
+
if (!img.data || !img.width || !img.height) {
|
|
226
|
+
resolve(null);
|
|
227
|
+
return;
|
|
228
|
+
}
|
|
229
|
+
// Determine image format based on kind
|
|
230
|
+
const format = img.kind === 1 ? 'grayscale' : img.kind === 3 ? 'rgba' : 'rgb';
|
|
231
|
+
// Convert Uint8Array to base64
|
|
232
|
+
const base64 = Buffer.from(img.data).toString('base64');
|
|
233
|
+
resolve({
|
|
234
|
+
type: 'image',
|
|
235
|
+
yPosition,
|
|
236
|
+
imageData: {
|
|
237
|
+
page: pageNum,
|
|
238
|
+
index: arrayIndex,
|
|
239
|
+
width: img.width,
|
|
240
|
+
height: img.height,
|
|
241
|
+
format,
|
|
242
|
+
data: base64,
|
|
243
|
+
},
|
|
244
|
+
});
|
|
245
|
+
});
|
|
246
|
+
}));
|
|
247
|
+
const resolvedImages = await Promise.all(imagePromises);
|
|
248
|
+
contentItems.push(...resolvedImages.filter((item) => item !== null));
|
|
249
|
+
}
|
|
250
|
+
}
|
|
251
|
+
catch (error) {
|
|
252
|
+
const message = error instanceof Error ? error.message : String(error);
|
|
253
|
+
console.warn(`[PDF Reader MCP] Error extracting page content for page ${String(pageNum)} in ${sourceDescription}: ${message}`);
|
|
254
|
+
// Return error message as text content
|
|
255
|
+
return [
|
|
256
|
+
{
|
|
257
|
+
type: 'text',
|
|
258
|
+
yPosition: 0,
|
|
259
|
+
textContent: `Error processing page: ${message}`,
|
|
260
|
+
},
|
|
261
|
+
];
|
|
262
|
+
}
|
|
263
|
+
// Sort by Y-position (descending = top to bottom in PDF coordinates)
|
|
264
|
+
return contentItems.sort((a, b) => b.yPosition - a.yPosition);
|
|
265
|
+
};
|
|
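The Y-coordinate grouping in `extractPageContent` can be illustrated with plain data. The sketch below assumes hypothetical text items shaped like the ones the diff reads (`str` plus a `transform` array whose index 5 is the Y value). The helper name `groupByLine` is not part of the package.

```javascript
// Hypothetical text items; coordinates are made up for illustration.
const items = [
  { str: 'Footer', transform: [1, 0, 0, 1, 72, 40.2] },
  { str: 'Hello, ', transform: [1, 0, 0, 1, 72, 700.4] },
  { str: 'world', transform: [1, 0, 0, 1, 120, 699.8] },
  { str: 'Body', transform: [1, 0, 0, 1, 72, 650] },
];

const groupByLine = (textItems) => {
  // Bucket strings by rounded Y so fragments of one line end up together.
  const textByY = new Map();
  for (const item of textItems) {
    const yCoord = item.transform[5];
    if (yCoord === undefined) continue;
    const y = Math.round(yCoord);
    if (!textByY.has(y)) textByY.set(y, []);
    textByY.get(y).push(item.str);
  }
  // Turn each non-empty bucket into a content item.
  const lines = [];
  for (const [y, parts] of textByY.entries()) {
    const textContent = parts.join('');
    if (textContent.trim()) lines.push({ type: 'text', yPosition: y, textContent });
  }
  // PDF Y grows upward, so descending Y reads top to bottom.
  return lines.sort((a, b) => b.yPosition - a.yPosition);
};

const lines = groupByLine(items);
console.log(lines);
```

Two consequences follow from the logic in the diff. Rounding splits fragments whose Y values straddle a `.5` boundary (for example 700.4 and 700.6) into separate lines. An image with no transform array in its operator arguments keeps `yPosition` 0 and sorts to the bottom of the page.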
@@ -0,0 +1,53 @@
|
|
|
1
|
+
// PDF document loading utilities
|
|
2
|
+
import fs from 'node:fs/promises';
|
|
3
|
+
import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
|
|
4
|
+
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';
|
|
5
|
+
import { resolvePath } from '../utils/pathUtils.js';
|
|
6
|
+
/**
|
|
7
|
+
* Load a PDF document from a local file path or URL
|
|
8
|
+
* @param source - Object containing either path or url
|
|
9
|
+
* @param sourceDescription - Description for error messages
|
|
10
|
+
* @returns PDF document proxy
|
|
11
|
+
*/
|
|
12
|
+
export const loadPdfDocument = async (source, sourceDescription) => {
|
|
13
|
+
let pdfDataSource;
|
|
14
|
+
try {
|
|
15
|
+
if (source.path) {
|
|
16
|
+
const safePath = resolvePath(source.path);
|
|
17
|
+
const buffer = await fs.readFile(safePath);
|
|
18
|
+
pdfDataSource = new Uint8Array(buffer);
|
|
19
|
+
}
|
|
20
|
+
else if (source.url) {
|
|
21
|
+
pdfDataSource = { url: source.url };
|
|
22
|
+
}
|
|
23
|
+
else {
|
|
24
|
+
throw new McpError(ErrorCode.InvalidParams, `Source ${sourceDescription} missing 'path' or 'url'.`);
|
|
25
|
+
}
|
|
26
|
+
}
|
|
27
|
+
catch (err) {
|
|
28
|
+
if (err instanceof McpError) {
|
|
29
|
+
throw err;
|
|
30
|
+
}
|
|
31
|
+
const message = err instanceof Error ? err.message : String(err);
|
|
32
|
+
const errorCode = ErrorCode.InvalidRequest;
|
|
33
|
+
if (typeof err === 'object' &&
|
|
34
|
+
err !== null &&
|
|
35
|
+
'code' in err &&
|
|
36
|
+
err.code === 'ENOENT' &&
|
|
37
|
+
source.path) {
|
|
38
|
+
throw new McpError(errorCode, `File not found at '${source.path}'.`, {
|
|
39
|
+
cause: err instanceof Error ? err : undefined,
|
|
40
|
+
});
|
|
41
|
+
}
|
|
42
|
+
throw new McpError(errorCode, `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`, { cause: err instanceof Error ? err : undefined });
|
|
43
|
+
}
|
|
44
|
+
const loadingTask = getDocument(pdfDataSource);
|
|
45
|
+
try {
|
|
46
|
+
return await loadingTask.promise;
|
|
47
|
+
}
|
|
48
|
+
catch (err) {
|
|
49
|
+
console.error(`[PDF Reader MCP] PDF.js loading error for ${sourceDescription}:`, err);
|
|
50
|
+
const message = err instanceof Error ? err.message : String(err);
|
|
51
|
+
throw new McpError(ErrorCode.InvalidRequest, `Failed to load PDF document from ${sourceDescription}. Reason: ${message || 'Unknown loading error'}`, { cause: err instanceof Error ? err : undefined });
|
|
52
|
+
}
|
|
53
|
+
};
|
|
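The first `catch` block in `loadPdfDocument` distinguishes a missing file from other preparation failures. Below is a dependency-free sketch of that classification, assuming a hypothetical helper `describeSourceError` that returns the message string instead of throwing an `McpError`.

```javascript
// Hypothetical helper: pick the message the loader would attach to its error.
const describeSourceError = (err, source, sourceDescription) => {
  const message = err instanceof Error ? err.message : String(err);
  const isMissingFile =
    typeof err === 'object' &&
    err !== null &&
    'code' in err &&
    err.code === 'ENOENT' &&
    Boolean(source.path);
  return isMissingFile
    ? `File not found at '${source.path}'.`
    : `Failed to prepare PDF source ${sourceDescription}. Reason: ${message}`;
};

// Simulated Node.js file-system error; real ones carry the same `code` field.
const enoent = Object.assign(new Error("ENOENT: no such file or directory, open 'missing.pdf'"), {
  code: 'ENOENT',
});

const missingMessage = describeSourceError(enoent, { path: 'missing.pdf' }, 'missing.pdf');
const otherMessage = describeSourceError(new Error('Permission denied'), { path: 'locked.pdf' }, 'locked.pdf');
console.log(missingMessage);
console.log(otherMessage);
```

In the diff, URL sources are handed to `getDocument` as `{ url }` without a preliminary fetch. Network failures for URLs therefore surface in the second `try` block as "Failed to load PDF document" rather than in this preparation step.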
@@ -0,0 +1,94 @@
|
|
|
1
|
+
// Page range parsing utilities
|
|
2
|
+
import { ErrorCode, McpError } from '@modelcontextprotocol/sdk/types.js';
|
|
3
|
+
const MAX_RANGE_SIZE = 10000; // Prevent infinite loops for open ranges
|
|
4
|
+
/**
|
|
5
|
+
* Parse a single range part (e.g., "1-3", "5", "7-")
|
|
6
|
+
*/
|
|
7
|
+
const parseRangePart = (part, pages) => {
|
|
8
|
+
const trimmedPart = part.trim();
|
|
9
|
+
if (trimmedPart.includes('-')) {
|
|
10
|
+
const [startStr, endStr] = trimmedPart.split('-');
|
|
11
|
+
if (startStr === undefined) {
|
|
12
|
+
throw new Error(`Invalid page range format: ${trimmedPart}`);
|
|
13
|
+
}
|
|
14
|
+
const start = parseInt(startStr, 10);
|
|
15
|
+
const end = endStr === '' || endStr === undefined ? Infinity : parseInt(endStr, 10);
|
|
16
|
+
if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
|
|
17
|
+
throw new Error(`Invalid page range values: ${trimmedPart}`);
|
|
18
|
+
}
|
|
19
|
+
const practicalEnd = Math.min(end, start + MAX_RANGE_SIZE);
|
|
20
|
+
for (let i = start; i <= practicalEnd; i++) {
|
|
21
|
+
pages.add(i);
|
|
22
|
+
}
|
|
23
|
+
if (end === Infinity && practicalEnd === start + MAX_RANGE_SIZE) {
|
|
24
|
+
console.warn(`[PDF Reader MCP] Open-ended range starting at ${String(start)} was truncated at page ${String(practicalEnd)}.`);
|
|
25
|
+
}
|
|
26
|
+
}
|
|
27
|
+
else {
|
|
28
|
+
const page = parseInt(trimmedPart, 10);
|
|
29
|
+
if (Number.isNaN(page) || page <= 0) {
|
|
30
|
+
throw new Error(`Invalid page number: ${trimmedPart}`);
|
|
31
|
+
}
|
|
32
|
+
pages.add(page);
|
|
33
|
+
}
|
|
34
|
+
};
|
|
35
|
+
/**
|
|
36
|
+
* Parse page range string into array of page numbers
|
|
37
|
+
* @param ranges - Range string (e.g., "1-3,5,7-10")
|
|
38
|
+
* @returns Sorted array of unique page numbers
|
|
39
|
+
*/
|
|
40
|
+
export const parsePageRanges = (ranges) => {
|
|
41
|
+
const pages = new Set();
|
|
42
|
+
const parts = ranges.split(',');
|
|
43
|
+
for (const part of parts) {
|
|
44
|
+
parseRangePart(part, pages);
|
|
45
|
+
}
|
|
46
|
+
if (pages.size === 0) {
|
|
47
|
+
throw new Error('Page range string resulted in zero valid pages.');
|
|
48
|
+
}
|
|
49
|
+
return Array.from(pages).sort((a, b) => a - b);
|
|
50
|
+
};
|
|
51
|
+
/**
|
|
52
|
+
* Get target pages from page specification
|
|
53
|
+
* @param sourcePages - Page specification (string or array)
|
|
54
|
+
* @param sourceDescription - Description for error messages
|
|
55
|
+
* @returns Array of page numbers or undefined
|
|
56
|
+
*/
|
|
57
|
+
export const getTargetPages = (sourcePages, sourceDescription) => {
|
|
58
|
+
if (!sourcePages) {
|
|
59
|
+
return undefined;
|
|
60
|
+
}
|
|
61
|
+
try {
|
|
62
|
+
if (typeof sourcePages === 'string') {
|
|
63
|
+
return parsePageRanges(sourcePages);
|
|
64
|
+
}
|
|
65
|
+
// Array of page numbers
|
|
66
|
+
if (sourcePages.some((p) => !Number.isInteger(p) || p <= 0)) {
|
|
67
|
+
throw new Error('Page numbers in array must be positive integers.');
|
|
68
|
+
}
|
|
69
|
+
const uniquePages = [...new Set(sourcePages)].sort((a, b) => a - b);
|
|
70
|
+
if (uniquePages.length === 0) {
|
|
71
|
+
throw new Error('Page specification resulted in an empty set of pages.');
|
|
72
|
+
}
|
|
73
|
+
return uniquePages;
|
|
74
|
+
}
|
|
75
|
+
catch (error) {
|
|
76
|
+
const message = error instanceof Error ? error.message : String(error);
|
|
77
|
+
throw new McpError(ErrorCode.InvalidParams, `Invalid page specification for source ${sourceDescription}: ${message}`);
|
|
78
|
+
}
|
|
79
|
+
};
|
|
80
|
+
/**
|
|
81
|
+
* Determine which pages to process based on target pages and document size
|
|
82
|
+
*/
|
|
83
|
+
export const determinePagesToProcess = (targetPages, totalPages, includeFullText) => {
|
|
84
|
+
if (targetPages) {
|
|
85
|
+
const pagesToProcess = targetPages.filter((p) => p <= totalPages);
|
|
86
|
+
const invalidPages = targetPages.filter((p) => p > totalPages);
|
|
87
|
+
return { pagesToProcess, invalidPages };
|
|
88
|
+
}
|
|
89
|
+
if (includeFullText) {
|
|
90
|
+
const pagesToProcess = Array.from({ length: totalPages }, (_, i) => i + 1);
|
|
91
|
+
return { pagesToProcess, invalidPages: [] };
|
|
92
|
+
}
|
|
93
|
+
return { pagesToProcess: [], invalidPages: [] };
|
|
94
|
+
};
|
|
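The range parsing and page filtering above can be exercised end to end. The sketch below condenses `parseRangePart`, `parsePageRanges`, and the `targetPages` branch of `determinePagesToProcess` into two functions. The name `splitByDocumentSize` is an assumption for illustration, and the empty-set check is omitted for brevity.

```javascript
const MAX_RANGE_SIZE = 10000; // same cap on open-ended ranges as in the diff

const parsePageRanges = (ranges) => {
  const pages = new Set();
  for (const rawPart of ranges.split(',')) {
    const part = rawPart.trim();
    if (part.includes('-')) {
      const [startStr, endStr] = part.split('-');
      const start = parseInt(startStr, 10);
      // "7-" has an empty end, which means "to the end of the document".
      const end = endStr === '' || endStr === undefined ? Infinity : parseInt(endStr, 10);
      if (Number.isNaN(start) || Number.isNaN(end) || start <= 0 || start > end) {
        throw new Error(`Invalid page range values: ${part}`);
      }
      const practicalEnd = Math.min(end, start + MAX_RANGE_SIZE);
      for (let i = start; i <= practicalEnd; i++) pages.add(i);
    } else {
      const page = parseInt(part, 10);
      if (Number.isNaN(page) || page <= 0) throw new Error(`Invalid page number: ${part}`);
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
};

// Separate requested pages that exist from those beyond the document.
const splitByDocumentSize = (targetPages, totalPages) => ({
  pagesToProcess: targetPages.filter((p) => p <= totalPages),
  invalidPages: targetPages.filter((p) => p > totalPages),
});

const closed = splitByDocumentSize(parsePageRanges('1-3, 5, 12'), 10);
const open = splitByDocumentSize(parsePageRanges('8-'), 10);
console.log(closed);
console.log(open.pagesToProcess, open.invalidPages.length);
```

With this logic an open range such as `8-` on a 10-page document yields pages 8 to 10 as processable, while every page from 11 up to the cap lands in `invalidPages`. If that list were passed unchanged to `buildWarnings`, the warning string would enumerate all of those page numbers. Whether the handler does so is not visible in this part of the diff.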
@@ -0,0 +1,55 @@
|
|
|
1
|
+
// Zod validation schemas for PDF reading
|
|
2
|
+
import { z } from 'zod';
|
|
3
|
+
// Schema for page specification (array of numbers or range string)
|
|
4
|
+
export const pageSpecifierSchema = z.union([
|
|
5
|
+
z.array(z.number().int().min(1)).min(1).describe('Array of page numbers (1-based)'),
|
|
6
|
+
z
|
|
7
|
+
.string()
|
|
8
|
+
.min(1)
|
|
9
|
+
.refine((val) => /^[0-9,-]+$/.test(val.replace(/\s/g, '')), {
|
|
10
|
+
message: 'Page string must contain only numbers, commas, and hyphens.',
|
|
11
|
+
})
|
|
12
|
+
.describe('Page range string (e.g., "1-5,10,15-20")'),
|
|
13
|
+
]);
|
|
14
|
+
// Schema for a single PDF source (path or URL)
|
|
15
|
+
export const pdfSourceSchema = z
|
|
16
|
+
.object({
|
|
17
|
+
path: z.string().min(1).optional().describe('Relative path to the local PDF file.'),
|
|
18
|
+
url: z.string().url().optional().describe('URL of the PDF file.'),
|
|
19
|
+
pages: pageSpecifierSchema
|
|
20
|
+
.optional()
|
|
21
|
+
.describe("Extract text only from specific pages (1-based) or ranges for this source. If provided, 'include_full_text' is ignored for this source."),
|
|
22
|
+
})
|
|
23
|
+
.strict()
|
|
24
|
+
.refine((data) => !!(data.path && !data.url) || !!(!data.path && data.url), {
|
|
25
|
+
message: "Each source must have either 'path' or 'url', but not both.",
|
|
26
|
+
});
|
|
27
|
+
// Schema for the read_pdf tool arguments
|
|
28
|
+
export const readPdfArgsSchema = z
|
|
29
|
+
.object({
|
|
30
|
+
sources: z
|
|
31
|
+
.array(pdfSourceSchema)
|
|
32
|
+
.min(1)
|
|
33
|
+
.describe('An array of PDF sources to process, each can optionally specify pages.'),
|
|
34
|
+
include_full_text: z
|
|
35
|
+
.boolean()
|
|
36
|
+
.optional()
|
|
37
|
+
.default(false)
|
|
38
|
+
.describe("Include the full text content of each PDF (only if 'pages' is not specified for that source)."),
|
|
39
|
+
include_metadata: z
|
|
40
|
+
.boolean()
|
|
41
|
+
.optional()
|
|
42
|
+
.default(true)
|
|
43
|
+
.describe('Include metadata and info objects for each PDF.'),
|
|
44
|
+
include_page_count: z
|
|
45
|
+
.boolean()
|
|
46
|
+
.optional()
|
|
47
|
+
.default(true)
|
|
48
|
+
.describe('Include the total number of pages for each PDF.'),
|
|
49
|
+
include_images: z
|
|
50
|
+
.boolean()
|
|
51
|
+
.optional()
|
|
52
|
+
.default(false)
|
|
53
|
+
.describe('Extract and include embedded images from the PDF pages as base64-encoded data.'),
|
|
54
|
+
})
|
|
55
|
+
.strict();
|
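The two `.refine()` predicates in the schemas above can be checked in isolation. The sketch below reuses their expressions as plain functions, without Zod, and applies them to a hypothetical `read_pdf` arguments object. The file path, URL, and function names are placeholders and do not come from the package.

```javascript
// Same predicate as the refine on the source schema: exactly one of path/url.
const hasExactlyOneLocation = (source) =>
  !!(source.path && !source.url) || !!(!source.path && source.url);

// Same character check as the page-string refine:
// digits, commas, and hyphens only, with whitespace ignored.
const isPageString = (val) => /^[0-9,-]+$/.test(val.replace(/\s/g, ''));

// Hypothetical arguments for the read_pdf tool.
const args = {
  sources: [
    { path: 'docs/report.pdf', pages: '1-3, 7' },
    { url: 'https://example.com/sample.pdf', pages: [1, 2] },
  ],
  include_metadata: true,
  include_images: false,
};

console.log(args.sources.every(hasExactlyOneLocation));
console.log(isPageString('1-3, 7'), isPageString('1..3'), isPageString('--'));
```

The page-string check validates characters only, so a string such as `--` passes the schema and is rejected later by `parsePageRanges`. Flags omitted from the arguments take the schema defaults: `include_full_text` and `include_images` default to `false`, `include_metadata` and `include_page_count` to `true`.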
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@sylphx/pdf-reader-mcp",
|
|
3
|
-
"version": "1.
|
|
3
|
+
"version": "1.2.0",
|
|
4
4
|
"description": "An MCP server providing tools to read PDF files.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"bin": {
|
|
@@ -39,32 +39,6 @@
|
|
|
39
39
|
"agent",
|
|
40
40
|
"tool"
|
|
41
41
|
],
|
|
42
|
-
"scripts": {
|
|
43
|
-
"build": "tsc",
|
|
44
|
-
"watch": "tsc --watch",
|
|
45
|
-
"inspector": "npx @modelcontextprotocol/inspector dist/index.js",
|
|
46
|
-
"test": "vitest run",
|
|
47
|
-
"test:watch": "vitest watch",
|
|
48
|
-
"test:cov": "vitest run --coverage --reporter=junit --outputFile=test-report.junit.xml",
|
|
49
|
-
"lint": "biome lint .",
|
|
50
|
-
"lint:fix": "biome lint --write .",
|
|
51
|
-
"format": "biome format --write .",
|
|
52
|
-
"check-format": "biome format .",
|
|
53
|
-
"check": "biome check .",
|
|
54
|
-
"check:fix": "biome check --write .",
|
|
55
|
-
"validate": "npm run check && npm run test",
|
|
56
|
-
"docs:dev": "vitepress dev docs",
|
|
57
|
-
"docs:build": "vitepress build docs",
|
|
58
|
-
"docs:preview": "vitepress preview docs",
|
|
59
|
-
"start": "node dist/index.js",
|
|
60
|
-
"typecheck": "tsc --noEmit",
|
|
61
|
-
"benchmark": "vitest bench",
|
|
62
|
-
"clean": "rm -rf dist coverage",
|
|
63
|
-
"docs:api": "typedoc --entryPoints src/index.ts --tsconfig tsconfig.json --plugin typedoc-plugin-markdown --out docs/api --readme none",
|
|
64
|
-
"prepublishOnly": "pnpm run clean && pnpm run build",
|
|
65
|
-
"release": "standard-version",
|
|
66
|
-
"prepare": "husky"
|
|
67
|
-
},
|
|
68
42
|
"dependencies": {
|
|
69
43
|
"@modelcontextprotocol/sdk": "1.20.2",
|
|
70
44
|
"glob": "^11.0.1",
|
|
@@ -98,5 +72,29 @@
|
|
|
98
72
|
"*.{ts,tsx,js,cjs,json}": [
|
|
99
73
|
"biome check --write --no-errors-on-unmatched --files-ignore-unknown=true"
|
|
100
74
|
]
|
|
75
|
+
},
|
|
76
|
+
"scripts": {
|
|
77
|
+
"build": "tsc",
|
|
78
|
+
"watch": "tsc --watch",
|
|
79
|
+
"inspector": "npx @modelcontextprotocol/inspector dist/index.js",
|
|
80
|
+
"test": "vitest run",
|
|
81
|
+
"test:watch": "vitest watch",
|
|
82
|
+
"test:cov": "vitest run --coverage --reporter=junit --outputFile=test-report.junit.xml",
|
|
83
|
+
"lint": "biome lint .",
|
|
84
|
+
"lint:fix": "biome lint --write .",
|
|
85
|
+
"format": "biome format --write .",
|
|
86
|
+
"check-format": "biome format .",
|
|
87
|
+
"check": "biome check .",
|
|
88
|
+
"check:fix": "biome check --write .",
|
|
89
|
+
"validate": "npm run check && npm run test",
|
|
90
|
+
"docs:dev": "vitepress dev docs",
|
|
91
|
+
"docs:build": "vitepress build docs",
|
|
92
|
+
"docs:preview": "vitepress preview docs",
|
|
93
|
+
"start": "node dist/index.js",
|
|
94
|
+
"typecheck": "tsc --noEmit",
|
|
95
|
+
"benchmark": "vitest bench",
|
|
96
|
+
"clean": "rm -rf dist coverage",
|
|
97
|
+
"docs:api": "typedoc --entryPoints src/index.ts --tsconfig tsconfig.json --plugin typedoc-plugin-markdown --out docs/api --readme none",
|
|
98
|
+
"release": "standard-version"
|
|
101
99
|
}
|
|
102
|
-
}
|
|
100
|
+
}
|