pdf-plus 1.0.1 โ†’ 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -7,8 +7,14 @@ A comprehensive PDF content extraction library with support for text, images, an
7
7
  - ๐Ÿ“ **Text Extraction** - High-quality text extraction with positioning
8
8
  - ๐Ÿ–ผ๏ธ **Image Detection** - Detect and reference images in PDF content
9
9
  - ๐Ÿ’พ **Image File Extraction** - Extract actual image files from PDFs
10
- - ๐ŸŽจ **Flexible Formatting** - Customizable image reference formats
11
- - โšก **Performance Options** - Text-only, images-only, or combined modes
10
+ - ๐ŸŽจ **Image Optimization** - Optional Sharp/Imagemin optimization with quality control
11
+ - ๐Ÿ”„ **JP2 Conversion** - Automatic JPEG 2000 to JPG conversion for compatibility
12
+ - ๐Ÿš€ **Parallel Processing** - 1.5-3x faster with configurable concurrency (Phase 1)
13
+ - โšก **Async I/O** - Non-blocking file operations for better performance (Phase 2)
14
+ - ๐Ÿงต **Worker Threads** - True multi-threading for CPU-intensive operations (Phase 3)
15
+ - ๐ŸŒŠ **Streaming API** - Process large PDFs with 10-100x lower memory usage (Phase 4)
16
+ - ๐Ÿ“„ **Page to Image** - Convert PDF pages to images (PNG, JPG, WebP) (Phase 5 - NEW!)
17
+ - ๐ŸŽฏ **Format Preservation** - Preserves original image formats (JPG, PNG) and full quality
12
18
  - ๐Ÿ”ง **TypeScript Support** - Full TypeScript definitions included
13
19
  - ๐Ÿ›ก๏ธ **Robust Validation** - Comprehensive input validation and error handling
14
20
 
@@ -43,6 +49,69 @@ console.log(
43
49
  console.log(`Text content: ${result.cleanText.substring(0, 100)}...`);
44
50
  ```
45
51
 
52
+ ### Streaming API for Large PDFs (NEW! - Phase 4)
53
+
54
+ For large PDFs, use the streaming API for lower memory usage and real-time progress:
55
+
56
+ ```typescript
57
+ import { extractPdfStream } from "pdf-plus";
58
+
59
+ const stream = extractPdfStream("large-document.pdf", {
60
+ extractImageFiles: true,
61
+ imageOutputDir: "./images",
62
+ streamMode: true,
63
+ });
64
+
65
+ for await (const event of stream) {
66
+ if (event.type === "page") {
67
+ console.log(`Page ${event.pageNumber}/${event.totalPages} complete`);
68
+ } else if (event.type === "progress") {
69
+ console.log(`Progress: ${event.percentComplete.toFixed(1)}%`);
70
+ } else if (event.type === "complete") {
71
+ console.log(`Done! ${event.totalImages} images extracted`);
72
+ }
73
+ }
74
+ ```
75
+
76
+ **Benefits:**
77
+
78
+ - ๐Ÿ“‰ **10-100x lower memory usage** for large PDFs
79
+ - โšก **100x faster time to first result**
80
+ - ๐Ÿ“Š **Real-time progress tracking**
81
+ - ๐Ÿ›‘ **Cancellation support**
82
+
83
+ See [PHASE4-STREAMING.md](./PHASE4-STREAMING.md) for complete streaming API documentation.
84
+
85
+ ### Convert PDF Pages to Images (NEW! - Phase 5)
86
+
87
+ Convert PDF pages to high-quality images (PNG, JPG, WebP):
88
+
89
+ ```typescript
90
+ import { PageToImageConverter } from "pdf-plus";
91
+
92
+ const converter = new PageToImageConverter();
93
+
94
+ // Convert all pages to PNG
95
+ const result = await converter.convertToImages("document.pdf", {
96
+ outputDir: "./page-images",
97
+ format: "png",
98
+ dpi: 150,
99
+ verbose: true,
100
+ });
101
+
102
+ console.log(`Converted ${result.totalPages} pages`);
103
+ ```
104
+
105
+ **Features:**
106
+
107
+ - ๐ŸŽจ **Multiple formats** - PNG, JPG, WebP
108
+ - ๐Ÿ“ **Quality control** - Adjustable DPI (72, 150, 300, 600) and quality
109
+ - ๐Ÿ“„ **Page selection** - Convert specific pages or ranges
110
+ - ๐Ÿ–ผ๏ธ **Thumbnails** - Generate low-res previews
111
+ - ๐Ÿ’พ **Buffer/Base64** - In-memory conversion for web apps
112
+
113
+ See [PAGE-TO-IMAGE-FEATURE.md](./PAGE-TO-IMAGE-FEATURE.md) for complete page-to-image documentation.
114
+
46
115
  ## Usage Examples
47
116
 
48
117
  ### Text-Only Extraction (Fast)
@@ -67,6 +136,92 @@ const images = await extractImages("document.pdf", {
67
136
  console.log(`Found ${images.length} images`);
68
137
  ```
69
138
 
139
+ ### Image Extraction with Optimization
140
+
141
+ ```typescript
142
+ import { extractPdfContent } from "pdf-plus";
143
+
144
+ const result = await extractPdfContent("document.pdf", {
145
+ extractImageFiles: true,
146
+ imageOutputDir: "./images",
147
+
148
+ // Enable optimization
149
+ optimizeImages: true,
150
+ imageOptimizer: "auto", // or 'sharp', 'imagemin'
151
+ imageQuality: 80,
152
+ imageProgressive: true,
153
+
154
+ // Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
155
+ convertJp2ToJpg: true,
156
+ imageQuality: 100, // Default: 100 for JP2 conversion (max quality)
157
+
158
+ verbose: true,
159
+ });
160
+
161
+ // Check optimization results
162
+ result.images.forEach((img) => {
163
+ console.log(`${img.filename}: Optimized and saved`);
164
+ });
165
+ ```
166
+
167
+ ### Performance Optimization (NEW! ๐Ÿš€)
168
+
169
+ ```typescript
170
+ import { extractPdfContent } from "pdf-plus";
171
+
172
+ // BASIC: Parallel processing (enabled by default)
173
+ const result = await extractPdfContent("document.pdf", {
174
+ extractImageFiles: true,
175
+ imageOutputDir: "./images",
176
+ parallelProcessing: true, // 1.5-3x faster
177
+ });
178
+
179
+ // ADVANCED: With worker threads for CPU-intensive operations
180
+ const result = await extractPdfContent("large-document.pdf", {
181
+ extractImageFiles: true,
182
+ imageOutputDir: "./images",
183
+
184
+ // Enable parallel processing (default: true)
185
+ parallelProcessing: true,
186
+
187
+ // Enable worker threads for true multi-threading (default: false)
188
+ useWorkerThreads: true, // 2.5-3.2x additional speedup!
189
+ autoScaleWorkers: true, // Auto-adjust based on system resources
190
+ maxWorkerThreads: 8, // Max worker threads (default: CPU cores - 1)
191
+
192
+ // Fine-tune concurrency for your workload
193
+ maxConcurrentPages: 20, // Process up to 20 pages simultaneously
194
+ maxConcurrentImages: 50, // Extract up to 50 images per page in parallel
195
+ maxConcurrentConversions: 5, // Convert up to 5 JP2 files simultaneously
196
+ maxConcurrentOptimizations: 5, // Optimize up to 5 images simultaneously
197
+
198
+ verbose: true,
199
+ });
200
+
201
+ // Performance gains (tested on Art Basel PDF, 54 images):
202
+ // - Baseline (sequential): 140ms
203
+ // - Parallel processing: 47ms (2.96x faster)
204
+ // - Parallel + Workers: 44ms (3.23x faster) ๐Ÿš€
205
+ ```
206
+
207
+ **Performance Recommendations:**
208
+
209
+ | PDF Size | Images | Recommended Settings |
210
+ | -------- | ------ | ------------------------------------------------------------------------------------------------------------------------- |
211
+ | Small | <20 | `parallelProcessing: true` (default settings) |
212
+ | Medium | 20-50 | `parallelProcessing: true, maxConcurrentPages: 10, maxConcurrentImages: 20` |
213
+ | Large | 50+ | `parallelProcessing: true, useWorkerThreads: true, maxConcurrentPages: 20, maxConcurrentImages: 50` |
214
+ | Huge | 200+ | `parallelProcessing: true, useWorkerThreads: true, maxWorkerThreads: 8, maxConcurrentPages: 30, maxConcurrentImages: 100` |
215
+
216
+ **Worker Threads Benefits:**
217
+
218
+ - โœ… True multi-threading (runs on separate CPU cores)
219
+ - โœ… 2.5-3.2x faster for CPU-intensive operations (JP2 conversion, optimization)
220
+ - โœ… Auto-scaling based on memory and CPU usage
221
+ - โœ… Opt-in (default: false) - no breaking changes
222
+
223
+ See [PERFORMANCE.md](./PERFORMANCE.md) and [PHASE3-WORKERS.md](./PHASE3-WORKERS.md) for detailed benchmarks and optimization guide.
224
+
70
225
  ### Custom Image References
71
226
 
72
227
  ```typescript
@@ -217,6 +372,7 @@ Extract and save actual image files.
217
372
 
218
373
  ```typescript
219
374
  interface ExtractionOptions {
375
+ // Basic extraction options
220
376
  extractText?: boolean; // Extract text content (default: true)
221
377
  extractImages?: boolean; // Extract image references (default: true)
222
378
  extractImageFiles?: boolean; // Save actual image files (default: false)
@@ -228,9 +384,62 @@ interface ExtractionOptions {
228
384
  memoryLimit?: string; // Memory limit (e.g., '512MB', '1GB')
229
385
  batchSize?: number; // Pages per batch (1-100)
230
386
  progressCallback?: (progress: ProgressInfo) => void;
387
+
388
+ // Image optimization options
389
+ optimizeImages?: boolean; // Enable image optimization (default: false)
390
+ imageOptimizer?: "auto" | "sharp" | "imagemin"; // Optimizer to use (default: 'auto')
391
+ imageQuality?: number; // Image quality 1-100 (default: 80, JP2 conversion: 100)
392
+ imageProgressive?: boolean; // Progressive JPEG (default: true)
393
+ convertJp2ToJpg?: boolean; // Convert JP2 to JPG (default: true)
394
+
395
+ // Performance options (NEW!)
396
+ parallelProcessing?: boolean; // Enable parallel processing (default: true)
397
+ maxConcurrentPages?: number; // Max pages in parallel (default: 10)
398
+ maxConcurrentImages?: number; // Max images per page in parallel (default: 20)
399
+ maxConcurrentConversions?: number; // Max JP2 conversions in parallel (default: 5)
400
+ maxConcurrentOptimizations?: number; // Max optimizations in parallel (default: 5)
401
+
402
+ // Worker thread options (NEW! ๐Ÿš€)
403
+ useWorkerThreads?: boolean; // Enable worker threads (default: false)
404
+ autoScaleWorkers?: boolean; // Auto-scale workers (default: true)
405
+ maxWorkerThreads?: number; // Max worker threads (default: CPU cores - 1)
406
+ minWorkerThreads?: number; // Min worker threads (default: 1)
407
+ memoryThreshold?: number; // Memory threshold 0-1 (default: 0.8)
408
+ cpuThreshold?: number; // CPU threshold 0-1 (default: 0.9)
409
+ workerTaskTimeout?: number; // Task timeout ms (default: 30000)
410
+ workerIdleTimeout?: number; // Idle timeout ms (default: 60000)
411
+ workerMemoryLimit?: number; // Memory per worker MB (default: 512)
412
+ enableWorkerForConversion?: boolean; // Workers for JP2 (default: true)
413
+ enableWorkerForOptimization?: boolean; // Workers for optimization (default: true)
414
+ enableWorkerForDecoding?: boolean; // Workers for decoding (default: true)
231
415
  }
232
416
  ```
233
417
 
418
+ **Performance Options Explained:**
419
+
420
+ **Parallel Processing:**
421
+
422
+ - **`parallelProcessing`**: Enable/disable parallel processing. Enabled by default for 1.5-3x speedup.
423
+ - **`maxConcurrentPages`**: How many pages to process simultaneously. Higher values = faster for multi-page PDFs, but more memory usage.
424
+ - **`maxConcurrentImages`**: How many images per page to extract in parallel. Increase for pages with many images.
425
+ - **`maxConcurrentConversions`**: How many JP2โ†’JPG conversions to run simultaneously. Keep moderate (5-10) to avoid memory issues.
426
+ - **`maxConcurrentOptimizations`**: How many image optimizations to run simultaneously. Keep moderate (5-10) as optimization is CPU-intensive.
427
+
428
+ **Worker Threads (NEW! ๐Ÿš€):**
429
+
430
+ - **`useWorkerThreads`**: Enable true multi-threading using Node.js worker threads. Provides 2.5-3.2x additional speedup for CPU-intensive operations. Default: `false` (opt-in).
431
+ - **`autoScaleWorkers`**: Automatically adjust worker count based on system memory and CPU usage. Default: `true`.
432
+ - **`maxWorkerThreads`**: Maximum number of worker threads. Default: CPU cores - 1.
433
+ - **`minWorkerThreads`**: Minimum number of worker threads to keep alive. Default: 1.
434
+ - **`memoryThreshold`**: Memory usage threshold (0-1) before scaling down workers. Default: 0.8 (80%).
435
+ - **`cpuThreshold`**: CPU usage threshold (0-1) before scaling down workers. Default: 0.9 (90%).
436
+ - **`workerTaskTimeout`**: Maximum time (ms) for a worker task before timeout. Default: 30000 (30 seconds).
437
+ - **`workerIdleTimeout`**: Time (ms) before idle workers are terminated. Default: 60000 (60 seconds).
438
+ - **`workerMemoryLimit`**: Memory limit (MB) per worker thread. Default: 512MB.
439
+ - **`enableWorkerForConversion`**: Use workers for JP2 conversion. Default: `true`.
440
+ - **`enableWorkerForOptimization`**: Use workers for image optimization. Default: `true`.
441
+ - **`enableWorkerForDecoding`**: Use workers for image decoding. Default: `true`.
442
+
234
443
  ### Format Placeholders
235
444
 
236
445
  Use these placeholders in `imageRefFormat`:
@@ -248,6 +457,98 @@ Use these placeholders in `imageRefFormat`:
248
457
  - `{name} on page {page}` โ†’ `artwork_1 on page 5`
249
458
  - `<img src="{path}">` โ†’ `<img src="./images/img_1.jpg">`
250
459
 
460
+ ## Image Optimization & Conversion
461
+
462
+ Extract and optimize images in one step using Sharp or Imagemin:
463
+
464
+ ```typescript
465
+ import { extractPdfContent } from "pdf-plus";
466
+
467
+ const result = await extractPdfContent("document.pdf", {
468
+ extractImageFiles: true,
469
+ imageOutputDir: "./images",
470
+
471
+ // Enable optimization
472
+ optimizeImages: true,
473
+ imageOptimizer: "auto", // Automatically selects best available
474
+ imageQuality: 80,
475
+ imageProgressive: true,
476
+
477
+ // Convert JP2 (JPEG 2000) to JPG for better compatibility (default: true)
478
+ convertJp2ToJpg: true,
479
+
480
+ verbose: true,
481
+ });
482
+
483
+ // Output:
484
+ // ๐Ÿ–ผ๏ธ Extracting images from: document.pdf
485
+ // ๐Ÿ“Š Processing 50 pages with PDF-lib engine
486
+ // ๐Ÿ’พ Extracted real image: img_p1_1.jpg (245KB)
487
+ // ๐Ÿ”„ Converting 16 JP2 images to JPG...
488
+ // ๐Ÿ”„ Converted JP2 โ†’ JPG: img_p2_2.jpg (24026 โ†’ 18500 bytes)
489
+ // ๐ŸŽจ Optimizing 54 images...
490
+ // โœ… img_p1_1.jpg: 251904 โ†’ 184320 bytes (-26.8%) [sharp]
491
+ // โœ… img_p2_2.jpg: 18500 โ†’ 15200 bytes (-17.8%) [sharp]
492
+ ```
493
+
494
+ ### JP2 to JPG Conversion
495
+
496
+ JP2 (JPEG 2000) files are not widely supported by browsers and image tools. The library automatically converts them to standard JPG format:
497
+
498
+ ```typescript
499
+ const result = await extractPdfContent("document.pdf", {
500
+ extractImageFiles: true,
501
+ convertJp2ToJpg: true, // Default: true
502
+ imageQuality: 100, // Default: 100 (maximum quality preservation)
503
+ });
504
+
505
+ // All JP2 images are now JPG files with better compatibility
506
+ ```
507
+
508
+ **Quality Preservation:**
509
+
510
+ - **Default quality: 100** - Preserves maximum quality from JP2
511
+ - Use lower values (80-90) if you want additional compression
512
+ - Original JP2 files are deleted after successful conversion
513
+
514
+ **Benefits:**
515
+
516
+ - โœ… Better browser compatibility
517
+ - โœ… Can be optimized by Sharp/Imagemin
518
+ - โœ… Maximum quality preserved (quality=100)
519
+ - โœ… Works everywhere
520
+
521
+ ### Optimizer Comparison
522
+
523
+ | Optimizer | Speed | Quality | Formats | Platform |
524
+ | ---------- | -------- | --------- | ------------------ | ----------------------------------------- |
525
+ | `sharp` | Fast | Excellent | JPG, PNG, WebP | Native (requires compilation) |
526
+ | `imagemin` | Medium | Excellent | JPG, PNG, GIF, SVG | Cross-platform |
527
+ | `auto` | Variable | Excellent | All supported | Tries sharp first, falls back to imagemin |
528
+
529
+ ### Optimization Presets
530
+
531
+ ```typescript
532
+ // Maximum compression (slower, smaller files)
533
+ const result = await extractPdfContent("document.pdf", {
534
+ optimizeImages: true,
535
+ imageQuality: 70,
536
+ });
537
+
538
+ // Balanced (recommended)
539
+ const result = await extractPdfContent("document.pdf", {
540
+ optimizeImages: true,
541
+ imageQuality: 80, // Default
542
+ });
543
+
544
+ // Fast optimization with Sharp
545
+ const result = await extractPdfContent("document.pdf", {
546
+ optimizeImages: true,
547
+ imageOptimizer: "sharp",
548
+ imageQuality: 85,
549
+ });
550
+ ```
551
+
251
552
  ## Performance Modes
252
553
 
253
554
  ### Text-Only Mode (Fastest)