@kreuzberg/wasm 4.0.0-rc.6

Files changed (41)
  1. package/README.md +982 -0
  2. package/dist/adapters/wasm-adapter.d.mts +121 -0
  3. package/dist/adapters/wasm-adapter.d.ts +121 -0
  4. package/dist/adapters/wasm-adapter.js +241 -0
  5. package/dist/adapters/wasm-adapter.js.map +1 -0
  6. package/dist/adapters/wasm-adapter.mjs +221 -0
  7. package/dist/adapters/wasm-adapter.mjs.map +1 -0
  8. package/dist/index.d.mts +466 -0
  9. package/dist/index.d.ts +466 -0
  10. package/dist/index.js +383 -0
  11. package/dist/index.js.map +1 -0
  12. package/dist/index.mjs +384 -0
  13. package/dist/index.mjs.map +1 -0
  14. package/dist/kreuzberg_wasm.d.mts +758 -0
  15. package/dist/kreuzberg_wasm.d.ts +758 -0
  16. package/dist/kreuzberg_wasm.js +1913 -0
  17. package/dist/kreuzberg_wasm.mjs +48 -0
  18. package/dist/kreuzberg_wasm_bg.wasm +0 -0
  19. package/dist/kreuzberg_wasm_bg.wasm.d.ts +54 -0
  20. package/dist/ocr/registry.d.mts +102 -0
  21. package/dist/ocr/registry.d.ts +102 -0
  22. package/dist/ocr/registry.js +90 -0
  23. package/dist/ocr/registry.js.map +1 -0
  24. package/dist/ocr/registry.mjs +70 -0
  25. package/dist/ocr/registry.mjs.map +1 -0
  26. package/dist/ocr/tesseract-wasm-backend.d.mts +257 -0
  27. package/dist/ocr/tesseract-wasm-backend.d.ts +257 -0
  28. package/dist/ocr/tesseract-wasm-backend.js +454 -0
  29. package/dist/ocr/tesseract-wasm-backend.js.map +1 -0
  30. package/dist/ocr/tesseract-wasm-backend.mjs +424 -0
  31. package/dist/ocr/tesseract-wasm-backend.mjs.map +1 -0
  32. package/dist/runtime.d.mts +256 -0
  33. package/dist/runtime.d.ts +256 -0
  34. package/dist/runtime.js +172 -0
  35. package/dist/runtime.js.map +1 -0
  36. package/dist/runtime.mjs +152 -0
  37. package/dist/runtime.mjs.map +1 -0
  38. package/dist/snippets/wasm-bindgen-rayon-38edf6e439f6d70d/src/workerHelpers.js +107 -0
  39. package/dist/types-GJVIvbPy.d.mts +221 -0
  40. package/dist/types-GJVIvbPy.d.ts +221 -0
  41. package/package.json +138 -0
package/README.md ADDED
@@ -0,0 +1,982 @@
1
+ # Kreuzberg
2
+
3
+ [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
+ [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
+ [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
+ [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
+ [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
+ [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
+ [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
+ [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
+
12
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
+ [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
+
16
+ High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
17
+
18
+ Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
+
20
+ > **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
21
+ >
22
+ > **This WASM package is designed for:**
23
+ > - Browser applications (including web workers)
24
+ > - Cloudflare Workers and edge runtimes
25
+ > - Deno applications
26
+ > - Environments without a native build toolchain
27
+
28
+ > **🚀 Version 4.0.0 Release Candidate**
29
+ > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
30
+
31
+ ## Features
32
+
33
+ - **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
34
+ - **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
35
+ - **Table Extraction**: Advanced table detection and structured data extraction
36
+ - **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
37
+ - **Type-Safe**: Full TypeScript definitions from the shared @kreuzberg/core package
38
+ - **API Parity**: All extraction functions from the Node.js binding
39
+ - **Plugin System**: Custom post-processors, validators, and OCR backends
40
+ - **Optimized Bundle**: <5MB uncompressed, <2MB compressed
41
+ - **Zero Configuration**: Works out of the box with sensible defaults
42
+ - **Portable**: Runs anywhere WASM is supported without native dependencies
43
+
44
+ ## Requirements
45
+
46
+ - **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
47
+ - **Node.js**: 18 or higher
48
+ - **Deno**: 1.0 or higher
49
+ - **Cloudflare Workers**: Compatible with Workers runtime
50
+
51
+ ### Optional Dependencies
52
+
53
+ - **tesseract-wasm**: Automatically loaded for OCR functionality (40+ language support)
54
+
55
+ ## Installation
56
+
57
+ ### Choosing the Right Package
58
+
59
+ | Use Case | Recommendation | Reason |
60
+ |----------|---|---|
61
+ | **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
62
+ | **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
63
+ | **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
64
+ | **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
65
+ | **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
66
+
67
+ ### Install via npm/pnpm/yarn
68
+
69
+ ```bash
70
+ npm install @kreuzberg/wasm
71
+ ```
72
+
73
+ Or with pnpm:
74
+
75
+ ```bash
76
+ pnpm add @kreuzberg/wasm
77
+ ```
78
+
79
+ Or with yarn:
80
+
81
+ ```bash
82
+ yarn add @kreuzberg/wasm
83
+ ```
84
+
85
+ ### Deno
86
+
87
+ ```typescript
88
+ import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
89
+ ```
90
+
91
+ ## Quick Start
92
+
93
+ ### Browser (ESM)
94
+
95
+ ```typescript
96
+ import { extractFile } from '@kreuzberg/wasm';
97
+
98
+ async function handleFileUpload() {
99
+ const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
100
+ const file = fileInput.files[0];
101
+
102
+ const result = await extractFile(file, undefined, { // mimeType is optional; config is the third argument
103
+ extract_tables: true,
104
+ extract_images: true
105
+ });
106
+
107
+ console.log('Extracted text:', result.content);
108
+ console.log('Tables found:', result.tables.length);
109
+ }
110
+ ```
111
+
112
+ ### Node.js (ESM)
113
+
114
+ ```typescript
115
+ import { extractBytes } from '@kreuzberg/wasm';
116
+ import { readFile } from 'fs/promises';
117
+
118
+ const pdfBytes = await readFile('./document.pdf');
119
+ const result = await extractBytes(
120
+ new Uint8Array(pdfBytes),
121
+ 'application/pdf',
122
+ { extract_tables: true }
123
+ );
124
+
125
+ console.log(result.content);
126
+ console.log('Found', result.tables.length, 'tables');
127
+ ```
128
+
129
+ ### Deno
130
+
131
+ ```typescript
132
+ import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
133
+
134
+ const pdfBytes = await Deno.readFile("./document.pdf");
135
+ const result = await extractBytes(pdfBytes, "application/pdf");
136
+
137
+ console.log(result.content);
138
+ ```
139
+
140
+ ### Cloudflare Workers
141
+
142
+ ```typescript
143
+ import { extractBytes } from '@kreuzberg/wasm';
144
+
145
+ export default {
146
+ async fetch(request: Request): Promise<Response> {
147
+ if (request.method === 'POST') {
148
+ const formData = await request.formData();
149
+ const file = formData.get('file') as File;
150
+
151
+ const arrayBuffer = await file.arrayBuffer();
152
+ const bytes = new Uint8Array(arrayBuffer);
153
+
154
+ const result = await extractBytes(bytes, file.type);
155
+
156
+ return Response.json({
157
+ text: result.content,
158
+ metadata: result.metadata,
159
+ tables: result.tables
160
+ });
161
+ }
162
+
163
+ return new Response('Upload a file', { status: 400 });
164
+ }
165
+ };
166
+ ```
167
+
168
+ ## Performance Comparison
169
+
170
+ Kreuzberg WASM trades some performance for portability. Here's how it compares to the native bindings (percentages are WASM speed relative to native):
171
+
172
+ | Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
173
+ |--------|---|---|---|
174
+ | **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
175
+ | **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
176
+ | **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
177
+ | **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
178
+ | **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
179
+
180
+ ### When to Use WASM vs Native
181
+
182
+ **Use WASM (@kreuzberg/wasm) when:**
183
+ - Building browser applications (no choice, WASM required)
184
+ - Targeting Cloudflare Workers or edge runtimes
185
+ - Supporting Deno applications
186
+ - You don't have a native build toolchain available
187
+ - Portability across runtimes is critical
188
+
189
+ **Use Native (@kreuzberg/node) when:**
190
+ - Building Node.js or Bun applications (2-3x faster)
191
+ - Performance is your primary concern
192
+ - You're processing large volumes of documents
193
+ - You have native build tools available
194
+
195
+ ### Performance Tips for WASM
196
+
197
+ 1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
198
+ 2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
199
+ 3. **Cache WASM module** by loading it once per application
200
+ 4. **Preload OCR models** by calling extraction with OCR enabled early
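Tip 3 above (load the module once) can be sketched with a small memoizing wrapper. This is a minimal illustration rather than part of the package: `once` is a hypothetical helper, and the wiring shown in the trailing comment assumes the `initWasm` and `initThreadPool` exports used elsewhere in this README.

```typescript
// Wrap an async initializer so it runs at most once.
// Concurrent and later callers all receive the same promise.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = init().catch((error) => {
        pending = undefined; // allow a retry after a failed attempt
        throw error;
      });
    }
    return pending;
  };
}

// Hypothetical wiring (not executed here):
// const ensureReady = once(async () => {
//   await initWasm();
//   await initThreadPool(navigator.hardwareConcurrency);
// });
// await ensureReady(); // safe to call from every code path that extracts
```

Calling the returned function from each extraction path keeps initialization idempotent without a global "is ready" flag.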
201
+
202
+ ## Examples
203
+
204
+ Kreuzberg WASM includes complete working examples for different environments:
205
+
206
+ - **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
207
+ - **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
208
+ - **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
209
+
210
+ See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
211
+
212
+ ## Multi-Threading with wasm-bindgen-rayon
213
+
214
+ Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
215
+
216
+ ### Initializing the Thread Pool
217
+
218
+ To unlock multi-threaded performance, initialize the thread pool with the number of available CPU cores:
219
+
220
+ ```typescript
221
+ import { initThreadPool, extractBytes } from '@kreuzberg/wasm';
222
+
223
+ // Initialize thread pool for multi-threaded extraction
224
+ await initThreadPool(navigator.hardwareConcurrency);
225
+
226
+ // Now extractions will use multiple threads for better performance
227
+ const result = await extractBytes(pdfBytes, 'application/pdf');
228
+ ```
229
+
230
+ ### Required HTTP Headers for SharedArrayBuffer
231
+
232
+ Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers.
233
+
234
+ **Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
235
+
236
+ Set these headers in your server configuration:
237
+
238
+ ```
239
+ Cross-Origin-Opener-Policy: same-origin
240
+ Cross-Origin-Embedder-Policy: require-corp
241
+ ```
242
+
243
+ #### Server Configuration Examples
244
+
245
+ **Express.js:**
246
+ ```javascript
247
+ app.use((req, res, next) => {
248
+ res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
249
+ res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
250
+ next();
251
+ });
252
+ ```
253
+
254
+ **Nginx:**
255
+ ```nginx
256
+ add_header 'Cross-Origin-Opener-Policy' 'same-origin';
257
+ add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
258
+ ```
259
+
260
+ **Apache:**
261
+ ```apache
262
+ Header set Cross-Origin-Opener-Policy "same-origin"
263
+ Header set Cross-Origin-Embedder-Policy "require-corp"
264
+ ```
265
+
266
+ **Cloudflare Workers:**
267
+ ```typescript
268
+ export default {
269
+ async fetch(request: Request): Promise<Response> {
270
+ const response = new Response('OK'); // replace 'OK' with your actual response body
271
+ response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
272
+ response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
273
+ return response;
274
+ }
275
+ };
276
+ ```
277
+
278
+ ### Browser Compatibility
279
+
280
+ Multi-threading with SharedArrayBuffer is available in:
281
+
282
+ - **Chrome/Edge**: 74+
283
+ - **Firefox**: 79+
284
+ - **Safari**: 15.2+
285
+ - **Opera**: 60+
286
+
287
+ In unsupported browsers or when headers are not set, the library automatically degrades to single-threaded mode.
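If you would rather check up front than rely on the fallback, a feature test along these lines may help. It is a sketch under assumptions: `canUseThreads` and `ThreadingEnv` are hypothetical names, and the check relies on the standard browser globals `SharedArrayBuffer` and `crossOriginIsolated` rather than on anything exported by this package.

```typescript
// The subset of global state the check needs; passing it in keeps the function testable.
interface ThreadingEnv {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

// True only when SharedArrayBuffer exists and the page is cross-origin isolated,
// which is what the COOP/COEP headers above are meant to achieve in browsers.
function canUseThreads(env: ThreadingEnv): boolean {
  return typeof env.SharedArrayBuffer !== "undefined" && env.crossOriginIsolated === true;
}

// Hypothetical browser usage (not executed here):
// if (canUseThreads(globalThis as ThreadingEnv)) { await initThreadPool(navigator.hardwareConcurrency); }
```

The check targets browsers; server runtimes typically do not define `crossOriginIsolated`, so treat a `false` result there as "unknown" rather than "unsupported".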
288
+
289
+ ### Graceful Degradation
290
+
291
+ The library handles thread pool initialization gracefully. If initialization fails or is unavailable:
292
+
293
+ ```typescript
294
+ import { initThreadPool, extractBytes } from '@kreuzberg/wasm';
295
+
296
+ try {
297
+ await initThreadPool(navigator.hardwareConcurrency);
298
+ console.log('Multi-threading enabled');
299
+ } catch (error) {
300
+ // Fall back to single-threaded processing
301
+ console.warn('Multi-threading unavailable:', error);
302
+ console.log('Using single-threaded extraction');
303
+ }
304
+
305
+ // Extraction will work in both cases
306
+ const result = await extractBytes(pdfBytes, 'application/pdf');
307
+ ```
308
+
309
+ ### Complete Example with Thread Pool
310
+
311
+ ```typescript
312
+ import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
313
+
314
+ async function initializeKreuzbergWithThreading() {
315
+ try {
316
+ // Initialize WASM module
317
+ await initWasm();
318
+
319
+ // Initialize multi-threading
320
+ const cpuCount = navigator.hardwareConcurrency || 1;
321
+ try {
322
+ await initThreadPool(cpuCount);
323
+ console.log(`Thread pool initialized with ${cpuCount} workers`);
324
+ } catch (error) {
325
+ console.warn('Could not initialize thread pool, using single-threaded mode');
326
+ }
327
+
328
+ } catch (error) {
329
+ console.error('Failed to initialize Kreuzberg:', error);
330
+ }
331
+ }
332
+
333
+ async function extractDocument(file: File) {
334
+ const bytes = new Uint8Array(await file.arrayBuffer());
335
+
336
+ // Extraction will use multiple threads if available
337
+ const result = await extractBytes(bytes, file.type, {
338
+ extract_tables: true,
339
+ extract_images: true
340
+ });
341
+
342
+ return result;
343
+ }
344
+
345
+ // Initialize once at app startup
346
+ await initializeKreuzbergWithThreading();
347
+
348
+ // Later, handle file uploads
349
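+ const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
+ fileInput?.addEventListener('change', async (e) => {
350
+ const file = (e.target as HTMLInputElement).files?.[0];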
351
+ if (file) {
352
+ const result = await extractDocument(file);
353
+ console.log('Extracted text:', result.content);
354
+ }
355
+ });
356
+ ```
357
+
358
+ ### Performance Considerations
359
+
360
+ - **Thread Pool Size**: Generally, using `navigator.hardwareConcurrency` is optimal. For servers, use the number of available CPU cores.
361
+ - **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
362
+ - **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
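Because each thread carries its own memory context, you may want to bound the pool size on machines with many cores. The helper below is one possible sketch: `pickThreadCount` is a hypothetical name and the default cap of 8 is an arbitrary assumption, not a value recommended by the library.

```typescript
// Choose a worker count from the reported core count, never below 1 and never above `cap`.
function pickThreadCount(reported: number | undefined, cap: number = 8): number {
  // hardwareConcurrency can be missing or nonsensical in some environments; fall back to 1.
  const cores = typeof reported === "number" && reported >= 1 ? Math.floor(reported) : 1;
  return Math.max(1, Math.min(cores, Math.floor(cap)));
}

// Hypothetical usage (not executed here):
// await initThreadPool(pickThreadCount(navigator.hardwareConcurrency));
```

Lowering the cap trades peak throughput for a smaller memory footprint, which can matter when processing large documents.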
363
+
364
+ ## OCR Support
365
+
366
+ The WASM binding integrates [tesseract-wasm](https://github.com/robertknight/tesseract-wasm) for OCR support with 40+ languages.
367
+
368
+ ### Basic OCR
369
+
370
+ ```typescript
371
+ import { extractBytes } from '@kreuzberg/wasm';
372
+
373
+ const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());
374
+
375
+ const result = await extractBytes(
376
+ new Uint8Array(imageBytes),
377
+ 'image/jpeg',
378
+ {
379
+ enable_ocr: true,
380
+ ocr_config: {
381
+ languages: ['eng'], // English
382
+ backend: 'tesseract-wasm'
383
+ }
384
+ }
385
+ );
386
+
387
+ console.log('OCR text:', result.content);
388
+ ```
389
+
390
+ ### Multi-Language OCR
391
+
392
+ ```typescript
393
+ const result = await extractBytes(imageBytes, 'image/png', {
394
+ enable_ocr: true,
395
+ ocr_config: {
396
+ languages: ['eng', 'deu', 'fra'], // English, German, French
397
+ backend: 'tesseract-wasm'
398
+ }
399
+ });
400
+ ```
401
+
402
+ ### Supported Languages
403
+
404
+ `eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
405
+
406
+ Training data is automatically loaded from the jsDelivr CDN:
407
+ ```
408
+ https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
409
+ ```
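To see which file a given language code resolves to, the URL pattern above can be expanded by hand. This is only an illustration of the template: `trainedDataUrl` is a hypothetical helper, and the language-code check is an assumption that covers codes such as `eng` and `chi_sim` but may reject other valid Tesseract codes.

```typescript
const TESSERACT_WASM_VERSION = "0.11.0"; // version taken from the URL pattern above

// Expand the CDN template for one language code.
function trainedDataUrl(lang: string): string {
  // Assumed code shape: three lowercase letters, optionally followed by "_suffix".
  if (!/^[a-z]{3}(_[a-z]+)?$/.test(lang)) {
    throw new Error(`Unexpected language code: ${lang}`);
  }
  return `https://cdn.jsdelivr.net/npm/tesseract-wasm@${TESSERACT_WASM_VERSION}/dist/${lang}.traineddata`;
}
```

A helper like this can be handy for prefetching or for allow-listing the CDN host in a Content Security Policy.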
410
+
411
+ ## Configuration
412
+
413
+ ### Extract Tables
414
+
415
+ ```typescript
416
+ import { extractBytes } from '@kreuzberg/wasm';
417
+
418
+ const result = await extractBytes(pdfBytes, 'application/pdf', {
419
+ extract_tables: true
420
+ });
421
+
422
+ if (result.tables) {
423
+ for (const table of result.tables) {
424
+ console.log('Table as Markdown:');
425
+ console.log(table.markdown);
426
+
427
+ console.log('Table cells:');
428
+ console.log(JSON.stringify(table.cells, null, 2));
429
+ }
430
+ }
431
+ ```
432
+
433
+ ### Extract Images
434
+
435
+ ```typescript
436
+ import { extractBytes } from '@kreuzberg/wasm';
437
+
438
+ const result = await extractBytes(pdfBytes, 'application/pdf', {
439
+ extract_images: true,
440
+ image_config: {
441
+ target_dpi: 300,
442
+ max_image_dimension: 4096
443
+ }
444
+ });
445
+
446
+ if (result.images) {
447
+ for (const image of result.images) {
448
+ console.log(`Image ${image.index}: ${image.format}`);
449
+ // image.data is a Uint8Array
450
+ }
451
+ }
452
+ ```
453
+
454
+ ### Text Chunking
455
+
456
+ ```typescript
457
+ import { extractBytes } from '@kreuzberg/wasm';
458
+
459
+ const result = await extractBytes(pdfBytes, 'application/pdf', {
460
+ enable_chunking: true,
461
+ chunking_config: {
462
+ max_chars: 1000,
463
+ max_overlap: 200
464
+ }
465
+ });
466
+
467
+ if (result.chunks) {
468
+ for (const chunk of result.chunks) {
469
+ console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
470
+ }
471
+ }
472
+ ```
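To build intuition for how `max_chars` and `max_overlap` interact, here is a deliberately naive fixed-window chunker. It is not the library's algorithm (which is described as producing semantic chunks); `chunkText` is a hypothetical function that only shows how an overlap makes consecutive chunks share trailing and leading characters.

```typescript
// Split text into windows of at most `maxChars`, each starting `maxChars - maxOverlap` after the previous one.
function chunkText(text: string, maxChars: number, maxOverlap: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  if (maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new RangeError("maxOverlap must be in [0, maxChars)");
  }
  const chunks: string[] = [];
  const step = maxChars - maxOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // the last window reached the end
  }
  return chunks;
}

// chunkText("abcdefghij", 4, 1) yields ["abcd", "defg", "ghij"]:
// each chunk repeats the last character of the one before it.
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.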
473
+
474
+ ### Language Detection
475
+
476
+ ```typescript
477
+ import { extractBytes } from '@kreuzberg/wasm';
478
+
479
+ const result = await extractBytes(pdfBytes, 'application/pdf', {
480
+ enable_language_detection: true
481
+ });
482
+
483
+ if (result.language) {
484
+ console.log(`Detected language: ${result.language.code}`);
485
+ console.log(`Confidence: ${result.language.confidence}`);
486
+ }
487
+ ```
488
+
489
+ ### Complete Configuration Example
490
+
491
+ ```typescript
492
+ import {
493
+ extractBytes,
494
+ type ExtractionConfig
495
+ } from '@kreuzberg/wasm';
496
+
497
+ const config: ExtractionConfig = {
498
+ extract_tables: true,
499
+ extract_images: true,
500
+ extract_metadata: true,
501
+
502
+ enable_ocr: true,
503
+ ocr_config: {
504
+ languages: ['eng'],
505
+ backend: 'tesseract-wasm',
506
+ dpi: 300,
507
+ preprocessing: {
508
+ deskew: true,
509
+ denoise: true,
510
+ binarize: true
511
+ }
512
+ },
513
+
514
+ enable_chunking: true,
515
+ chunking_config: {
516
+ max_chars: 1000,
517
+ max_overlap: 200
518
+ },
519
+
520
+ enable_language_detection: true,
521
+
522
+ enable_quality: true,
523
+
524
+ extract_keywords: true,
525
+ keywords_config: {
526
+ max_keywords: 10,
527
+ method: 'yake'
528
+ }
529
+ };
530
+
531
+ const result = await extractBytes(data, mimeType, config);
532
+ ```
533
+
534
+ ## Advanced Usage
535
+
536
+ ### Batch Processing
537
+
538
+ ```typescript
539
+ import { batchExtractFiles, batchExtractBytes } from '@kreuzberg/wasm';
540
+
541
+ // Browser: Process multiple files
542
+ const fileInput = document.querySelector<HTMLInputElement>('#files');
543
+ const files = Array.from(fileInput.files);
544
+
545
+ const results = await batchExtractFiles(files, {
546
+ extract_tables: true
547
+ });
548
+
549
+ for (const result of results) {
550
+ console.log(`${result.mime_type}: ${result.content.length} characters`);
551
+ }
552
+
553
+ // Or from Uint8Arrays
554
+ const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
555
+ const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
556
+
557
+ const byteResults = await batchExtractBytes(dataList, mimeTypes);
558
+ ```
559
+
560
+ ### Synchronous Extraction
561
+
562
+ ```typescript
563
+ import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
564
+
565
+ // Synchronous single extraction
566
+ const result = extractBytesSync(data, 'application/pdf', config);
567
+
568
+ // Synchronous batch extraction
569
+ const results = batchExtractBytesSync(dataList, mimeTypes, config);
570
+ ```
571
+
572
+ ### Plugin System
573
+
574
+ #### Custom Post-Processors
575
+
576
+ ```typescript
577
+ import { registerPostProcessor, extractBytes } from '@kreuzberg/wasm';
578
+
579
+ registerPostProcessor({
580
+ name: 'uppercase',
581
+ async process(result) {
582
+ return {
583
+ ...result,
584
+ content: result.content.toUpperCase()
585
+ };
586
+ }
587
+ });
588
+
589
+ // Now all extractions will use this processor
590
+ const result = await extractBytes(data, mimeType);
591
+ console.log(result.content); // UPPERCASE TEXT
592
+ ```
593
+
594
+ #### Custom Validators
595
+
596
+ ```typescript
597
+ import { registerValidator } from '@kreuzberg/wasm';
598
+
599
+ registerValidator({
600
+ name: 'min-length',
601
+ async validate(result) {
602
+ if (result.content.length < 100) {
603
+ throw new Error('Document too short');
604
+ }
605
+ }
606
+ });
607
+ ```
608
+
609
+ #### Custom OCR Backends
610
+
611
+ ```typescript
612
+ import { registerOcrBackend } from '@kreuzberg/wasm';
613
+
614
+ registerOcrBackend({
615
+ name: 'custom-ocr',
616
+ supportedLanguages() {
617
+ return ['eng', 'fra'];
618
+ },
619
+ async initialize() {
620
+ // Initialize your OCR backend
621
+ },
622
+ async processImage(imageBytes, language) {
623
+ // Process image and return result
624
+ return {
625
+ content: 'extracted text',
626
+ mime_type: 'text/plain',
627
+ metadata: {},
628
+ tables: []
629
+ };
630
+ },
631
+ async shutdown() {
632
+ // Cleanup
633
+ }
634
+ });
635
+ ```
636
+
637
+ ### MIME Type Detection
638
+
639
+ ```typescript
640
+ import {
641
+ detectMimeFromBytes,
642
+ getMimeFromExtension,
643
+ getExtensionsForMime,
644
+ normalizeMimeType
645
+ } from '@kreuzberg/wasm';
646
+
647
+ // Auto-detect MIME type from file bytes
648
+ const mimeType = detectMimeFromBytes(fileBytes);
649
+
650
+ // Get MIME type from file extension
651
+ const mime = getMimeFromExtension('pdf'); // 'application/pdf'
652
+
653
+ // Get extensions for MIME type
654
+ const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
655
+
656
+ // Normalize MIME type
657
+ const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
658
+ ```
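Byte-based detection generally works by comparing the first few bytes of a file against known signatures. The sketch below illustrates that idea for a handful of formats; it is not how `detectMimeFromBytes` is implemented, and `sniffMime` and its signature table are hypothetical.

```typescript
// A few well-known file signatures ("magic numbers").
const SIGNATURES: ReadonlyArray<{ mime: string; magic: readonly number[] }> = [
  { mime: "application/pdf", magic: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { mime: "image/png", magic: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "image/jpeg", magic: [0xff, 0xd8, 0xff] },
  { mime: "application/zip", magic: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
];

// Return the first MIME type whose signature matches the start of `data`, or null.
function sniffMime(data: Uint8Array): string | null {
  for (const { mime, magic } of SIGNATURES) {
    if (data.length >= magic.length && magic.every((byte, i) => data[i] === byte)) {
      return mime;
    }
  }
  return null;
}
```

Note that container formats such as DOCX, XLSX, and EPUB are ZIP archives, so a signature check alone reports them as `application/zip`; distinguishing them requires inspecting the archive contents.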
659
+
660
+ ### Configuration Loading
661
+
662
+ ```typescript
663
+ import { loadConfigFromString } from '@kreuzberg/wasm';
664
+
665
+ // Load from YAML
666
+ const yamlConfig = `
667
+ extract_tables: true
668
+ enable_ocr: true
669
+ ocr_config:
670
+ languages: [eng, deu]
671
+ `;
672
+ const config = loadConfigFromString(yamlConfig, 'yaml');
673
+
674
+ // Load from JSON
675
+ const jsonConfig = '{"extract_tables":true}';
676
+ const config2 = loadConfigFromString(jsonConfig, 'json');
677
+
678
+ // Load from TOML
679
+ const tomlConfig = 'extract_tables = true';
680
+ const config3 = loadConfigFromString(tomlConfig, 'toml');
681
+ ```
682
+
683
+ ## API Reference
684
+
685
+ ### Extraction Functions
686
+
687
+ #### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
688
+ Extract content from a browser `File` object.
689
+
690
+ #### `extractBytes(data: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
691
+ Asynchronously extract content from a `Uint8Array`.
692
+
693
+ #### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
694
+ Synchronously extract content from a `Uint8Array`.
695
+
696
+ #### `batchExtractFiles(files: File[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
697
+ Extract multiple files in parallel.
698
+
699
+ #### `batchExtractBytes(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
700
+ Extract multiple byte arrays in parallel.
701
+
702
+ #### `batchExtractBytesSync(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): ExtractionResult[]`
703
+ Extract multiple byte arrays synchronously.
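The byte-based batch functions take two parallel arrays, which are easy to get out of step. One way to avoid that, sketched here with the hypothetical names `BatchItem` and `toBatchArgs`, is to keep each document's bytes and MIME type together and split them only at the call site.

```typescript
interface BatchItem {
  data: Uint8Array;
  mimeType: string;
}

// Turn a list of { data, mimeType } pairs into the parallel arrays the batch functions expect.
function toBatchArgs(items: readonly BatchItem[]): { dataList: Uint8Array[]; mimeTypes: string[] } {
  for (const [index, item] of items.entries()) {
    if (item.mimeType.trim() === "") {
      throw new Error(`Item ${index} has an empty MIME type`);
    }
  }
  return {
    dataList: items.map((item) => item.data),
    mimeTypes: items.map((item) => item.mimeType),
  };
}

// Hypothetical usage (not executed here):
// const { dataList, mimeTypes } = toBatchArgs(items);
// const results = await batchExtractBytes(dataList, mimeTypes);
```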
704
+
705
+ ### Plugin Management
706
+
707
+ #### Post-Processors
708
+
709
+ ```typescript
710
+ registerPostProcessor(processor: PostProcessorProtocol): void
711
+ unregisterPostProcessor(name: string): void
712
+ clearPostProcessors(): void
713
+ listPostProcessors(): string[]
714
+ ```
715
+
716
+ #### Validators
717
+
718
+ ```typescript
719
+ registerValidator(validator: ValidatorProtocol): void
720
+ unregisterValidator(name: string): void
721
+ clearValidators(): void
722
+ listValidators(): string[]
723
+ ```
724
+
725
+ #### OCR Backends
726
+
727
+ ```typescript
728
+ registerOcrBackend(backend: OcrBackendProtocol): void
729
+ unregisterOcrBackend(name: string): void
730
+ clearOcrBackends(): void
731
+ listOcrBackends(): string[]
732
+ ```
733
+
734
+ ### Document Extractors
735
+
736
+ ```typescript
737
+ listDocumentExtractors(): string[]
738
+ unregisterDocumentExtractor(name: string): void
739
+ clearDocumentExtractors(): void
740
+ ```
741
+
742
+ ### MIME Utilities
743
+
744
+ ```typescript
745
+ detectMimeFromBytes(data: Uint8Array): string
746
+ getMimeFromExtension(ext: string): string | null
747
+ getExtensionsForMime(mime: string): string[]
748
+ normalizeMimeType(mime: string): string
749
+ ```
750
+
751
+ ### Configuration
752
+
753
+ ```typescript
754
+ loadConfigFromString(content: string, format: 'yaml' | 'toml' | 'json'): ExtractionConfig
755
+ ```
756
+
757
+ ### Embeddings
758
+
759
+ ```typescript
760
+ listEmbeddingPresets(): string[]
761
+ getEmbeddingPreset(name: string): EmbeddingPreset | null
762
+ ```
763
+
764
+ ## Types
765
+
766
+ All types are shared via the `@kreuzberg/core` package:
767
+
768
+ ```typescript
769
+ import type {
770
+ ExtractionResult,
771
+ ExtractionConfig,
772
+ OcrConfig,
773
+ ChunkingConfig,
774
+ ImageConfig,
775
+ KeywordsConfig,
776
+ Table,
777
+ ExtractedImage,
778
+ Chunk,
779
+ Metadata,
780
+ PostProcessorProtocol,
781
+ ValidatorProtocol,
782
+ OcrBackendProtocol
783
+ } from '@kreuzberg/core';
784
+ ```
785
+
786
+ ### ExtractionResult
787
+
788
+ Main result object containing:
789
+ - `content: string` - Extracted text content
790
+ - `mime_type: string` - MIME type of the document
791
+ - `metadata?: Metadata` - Document metadata
792
+ - `tables?: Table[]` - Extracted tables
793
+ - `images?: ExtractedImage[]` - Extracted images
794
+ - `chunks?: Chunk[]` - Text chunks (if chunking enabled)
795
+ - `language?: LanguageInfo` - Detected language (if enabled)
796
+ - `keywords?: Keyword[]` - Extracted keywords (if enabled)
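Since most fields are optional, code that reads a result should not assume they are present. The following sketch uses a local `ResultLike` interface that mirrors only a few of the fields listed above (an assumption made so the example stands alone, instead of importing the real `ExtractionResult` type); `summarize` is a hypothetical helper.

```typescript
// Minimal structural stand-in for the fields this example reads.
interface ResultLike {
  content: string;
  mime_type: string;
  tables?: unknown[];
  images?: unknown[];
  chunks?: unknown[];
}

// Build a one-line summary, treating absent optional arrays as empty.
function summarize(result: ResultLike): string {
  const tables = result.tables?.length ?? 0;
  const images = result.images?.length ?? 0;
  const chunks = result.chunks?.length ?? 0;
  return `${result.mime_type}: ${result.content.length} chars, ${tables} tables, ${images} images, ${chunks} chunks`;
}
```

The same `?.` and `??` pattern applies anywhere a snippet reads `result.tables.length` directly.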
797
+
798
+ ### ExtractionConfig
799
+
800
+ Configuration object for extraction:
801
+ - `extract_tables?: boolean` - Extract tables as structured data
802
+ - `extract_images?: boolean` - Extract embedded images
803
+ - `extract_metadata?: boolean` - Extract document metadata
804
+ - `enable_ocr?: boolean` - Enable OCR for images
805
+ - `ocr_config?: OcrConfig` - OCR settings
806
+ - `enable_chunking?: boolean` - Split text into semantic chunks
807
+ - `chunking_config?: ChunkingConfig` - Text chunking settings
808
+ - `enable_language_detection?: boolean` - Detect document language
809
+ - `enable_quality?: boolean` - Encoding detection, normalization
810
+ - `extract_keywords?: boolean` - Extract important keywords
811
+ - `keywords_config?: KeywordsConfig` - Keyword extraction settings
812
+
813
+ ### Table
814
+
815
+ Extracted table structure:
816
+ - `markdown: string` - Table in Markdown format
817
+ - `cells: TableCell[][]` - 2D array of table cells
818
+ - `row_count: number` - Number of rows
819
+ - `column_count: number` - Number of columns
820
+
821
+ ## Supported Formats
822
+
823
+ | Category | Formats |
824
+ |----------|---------|
825
+ | **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
826
+ | **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
827
+ | **Web** | HTML, XHTML, XML, EPUB |
828
+ | **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
829
+ | **Email** | EML, MSG |
830
+ | **Archives** | ZIP, TAR, 7Z |
831
+ | **Other** | And 30+ more formats |
832
+
833
+ ## Build from Source
834
+
835
+ ### Prerequisites
836
+
837
+ - Rust 1.75+ with `wasm32-unknown-unknown` target
838
+ - Node.js 18+ with pnpm
839
+ - wasm-pack
840
+
841
+ ```bash
842
+ # Install Rust target
843
+ rustup target add wasm32-unknown-unknown
844
+
845
+ # Install wasm-pack
846
+ curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
847
+
848
+ # Build WASM package
849
+ cd crates/kreuzberg-wasm
850
+ pnpm install
851
+ pnpm run build
852
+
853
+ # Run tests
854
+ pnpm test
855
+ ```
856
+
857
+ ### Build Targets
858
+
859
+ ```bash
860
+ # For browsers (ESM modules)
861
+ pnpm run build:wasm:web
862
+
863
+ # For bundlers (webpack, rollup, vite)
864
+ pnpm run build:wasm:bundler
865
+
866
+ # For Node.js
867
+ pnpm run build:wasm:nodejs
868
+
869
+ # For Deno
870
+ pnpm run build:wasm:deno
871
+
872
+ # Build all targets
873
+ pnpm run build:all
874
+ ```
875
+
876
+ ## Limitations
877
+
878
+ ### No File System Access
879
+
880
+ The WASM binding cannot access the file system directly. Use file readers:
881
+
882
+ ```typescript
883
+ // ❌ Won't work
884
+ await extractFileSync('./document.pdf'); // Throws error
885
+
886
+ // ✅ Use file readers instead
887
+ const bytes = await Deno.readFile('./document.pdf'); // Deno
888
+ const bytes = await fs.readFile('./document.pdf'); // Node.js
889
+ const bytes = new Uint8Array(await file.arrayBuffer()); // Browser
890
+ ```
891
+
892
+ ### OCR Training Data
893
+
894
+ Tesseract training data (`.traineddata` files) are loaded from the jsDelivr CDN on first use. For offline usage or a custom CDN, see the [OCR documentation](https://kreuzberg.dev).
895
+
896
+ ### Size Constraints
897
+
898
+ Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
899
+
900
+ ## Troubleshooting
901
+
902
+ ### "WASM module failed to initialize"
903
+
904
+ Ensure your bundler is configured to handle WASM files:
905
+
906
+ **Vite:**
907
+ ```typescript
908
+ // vite.config.ts
909
+ export default {
910
+ optimizeDeps: {
911
+ exclude: ['@kreuzberg/wasm']
912
+ }
913
+ }
914
+ ```
915
+
916
+ **Webpack:**
917
+ ```javascript
918
+ // webpack.config.js
919
+ module.exports = {
920
+ experiments: {
921
+ asyncWebAssembly: true
922
+ }
923
+ }
924
+ ```
925
+
926
+ ### "Module not found: @kreuzberg/core"
927
+
928
+ The @kreuzberg/core package is a peer dependency. Install it:
929
+
930
+ ```bash
931
+ pnpm add @kreuzberg/core
932
+ ```
933
+
934
+ ### Memory Issues in Workers
935
+
936
+ For large documents in Cloudflare Workers, process in smaller chunks:
937
+
938
+ ```typescript
939
+ const result = await extractBytes(pdfBytes, 'application/pdf', {
940
+ enable_chunking: true,
+ chunking_config: { max_chars: 1000 }
941
+ });
942
+ ```
943
+
944
+ ### OCR Not Working
945
+
946
+ Check that tesseract-wasm is loading correctly. The training data is automatically fetched from CDN on first use.
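One quick diagnostic is to confirm that the training data for your language is reachable from the runtime. The sketch below assumes the jsDelivr URL pattern shown in the OCR section and that the CDN answers `HEAD` requests; `checkTrainedData` and `FetchLike` are hypothetical names, and the fetch function is passed in so the check can run in any environment.

```typescript
// The small part of the fetch API this check relies on.
type FetchLike = (
  url: string,
  init?: { method?: string }
) => Promise<{ ok: boolean; status: number }>;

// Report whether the training data file for `lang` can be reached.
async function checkTrainedData(lang: string, fetchFn: FetchLike): Promise<string> {
  const url = `https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/${lang}.traineddata`;
  try {
    const response = await fetchFn(url, { method: "HEAD" });
    return response.ok ? `ok: ${url}` : `HTTP ${response.status}: ${url}`;
  } catch (error) {
    // Typical causes: offline, blocked by CSP, or egress restrictions in the runtime.
    return `network error: ${error instanceof Error ? error.message : String(error)}`;
  }
}
```

In a browser or Worker you could pass the global `fetch` as the second argument and log the returned string; a non-`ok` result points to network or policy issues rather than to the OCR engine itself.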
947
+
948
+ ## Examples
949
+
950
+ See the [`examples/`](./examples/) directory for complete working examples:
951
+
952
+ - **Browser**: Vanilla JS file upload interface
953
+ - **Deno**: Command-line document extraction
954
+ - **Cloudflare Workers**: Document processing API
955
+ - **Node.js**: Batch processing script
956
+
957
+ ## Documentation
958
+
959
+ For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
960
+
961
+ ## Contributing
962
+
963
+ We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/docs/contributing.md) for details.
964
+
965
+ ## License
966
+
967
+ MIT
968
+
969
+ ## Links
970
+
971
+ - [Website](https://kreuzberg.dev)
972
+ - [Documentation](https://kreuzberg.dev)
973
+ - [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
974
+ - [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
975
+ - [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
976
+ - [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
977
+
978
+ ## Related Packages
979
+
980
+ - [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) - Native Node.js bindings (NAPI)
981
+ - [@kreuzberg/core](https://www.npmjs.com/package/@kreuzberg/core) - Shared TypeScript types
982
+ - [kreuzberg](https://crates.io/crates/kreuzberg) - Rust core library