@kreuzberg/wasm 4.0.0-rc.6 → 4.0.1

This diff compares publicly available package versions as they appear in their respective public registries. It is provided for informational purposes only.
Files changed (56)
  1. package/LICENSE +7 -0
  2. package/README.md +317 -801
  3. package/dist/adapters/wasm-adapter.d.ts +7 -10
  4. package/dist/adapters/wasm-adapter.d.ts.map +1 -0
  5. package/dist/adapters/wasm-adapter.js +53 -54
  6. package/dist/adapters/wasm-adapter.js.map +1 -1
  7. package/dist/index.d.ts +23 -67
  8. package/dist/index.d.ts.map +1 -0
  9. package/dist/index.js +1102 -104
  10. package/dist/index.js.map +1 -1
  11. package/dist/ocr/registry.d.ts +7 -10
  12. package/dist/ocr/registry.d.ts.map +1 -0
  13. package/dist/ocr/registry.js +9 -28
  14. package/dist/ocr/registry.js.map +1 -1
  15. package/dist/ocr/tesseract-wasm-backend.d.ts +3 -6
  16. package/dist/ocr/tesseract-wasm-backend.d.ts.map +1 -0
  17. package/dist/ocr/tesseract-wasm-backend.js +8 -83
  18. package/dist/ocr/tesseract-wasm-backend.js.map +1 -1
  19. package/dist/pdfium.js +77 -0
  20. package/dist/pkg/LICENSE +7 -0
  21. package/dist/pkg/README.md +498 -0
  22. package/dist/{kreuzberg_wasm.d.ts → pkg/kreuzberg_wasm.d.ts} +24 -12
  23. package/dist/{kreuzberg_wasm.js → pkg/kreuzberg_wasm.js} +224 -233
  24. package/dist/pkg/kreuzberg_wasm_bg.js +1871 -0
  25. package/dist/{kreuzberg_wasm_bg.wasm → pkg/kreuzberg_wasm_bg.wasm} +0 -0
  26. package/dist/{kreuzberg_wasm_bg.wasm.d.ts → pkg/kreuzberg_wasm_bg.wasm.d.ts} +10 -13
  27. package/dist/pkg/package.json +27 -0
  28. package/dist/plugin-registry.d.ts +246 -0
  29. package/dist/plugin-registry.d.ts.map +1 -0
  30. package/dist/runtime.d.ts +21 -22
  31. package/dist/runtime.d.ts.map +1 -0
  32. package/dist/runtime.js +21 -41
  33. package/dist/runtime.js.map +1 -1
  34. package/dist/types.d.ts +363 -0
  35. package/dist/types.d.ts.map +1 -0
  36. package/package.json +34 -51
  37. package/dist/adapters/wasm-adapter.d.mts +0 -121
  38. package/dist/adapters/wasm-adapter.mjs +0 -221
  39. package/dist/adapters/wasm-adapter.mjs.map +0 -1
  40. package/dist/index.d.mts +0 -466
  41. package/dist/index.mjs +0 -384
  42. package/dist/index.mjs.map +0 -1
  43. package/dist/kreuzberg_wasm.d.mts +0 -758
  44. package/dist/kreuzberg_wasm.mjs +0 -48
  45. package/dist/ocr/registry.d.mts +0 -102
  46. package/dist/ocr/registry.mjs +0 -70
  47. package/dist/ocr/registry.mjs.map +0 -1
  48. package/dist/ocr/tesseract-wasm-backend.d.mts +0 -257
  49. package/dist/ocr/tesseract-wasm-backend.mjs +0 -424
  50. package/dist/ocr/tesseract-wasm-backend.mjs.map +0 -1
  51. package/dist/runtime.d.mts +0 -256
  52. package/dist/runtime.mjs +0 -152
  53. package/dist/runtime.mjs.map +0 -1
  54. package/dist/snippets/wasm-bindgen-rayon-38edf6e439f6d70d/src/workerHelpers.js +0 -107
  55. package/dist/types-GJVIvbPy.d.mts +0 -221
  56. package/dist/types-GJVIvbPy.d.ts +0 -221
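The per-file counts in the list above can be totalled mechanically. The following is a minimal sketch, assuming each entry has the shape `<index>. <path> +<added> -<removed>` as shown in the list. The names `FileChange`, `parseEntry`, and `tally` are hypothetical helpers for illustration and are not part of `@kreuzberg/wasm` or of the diff viewer.

```typescript
// Hypothetical helper types and functions; nothing here comes from @kreuzberg/wasm.
interface FileChange {
  path: string;
  added: number;
  removed: number;
}

// Parse one summary entry such as "1. package/LICENSE +7 -0".
// Rename entries like "package/dist/{a → pkg/a} +24 -12" keep their full path text.
function parseEntry(entry: string): FileChange | null {
  const match = entry.trim().match(/^(?:\d+\.\s+)?(.+?)\s+\+(\d+)\s+-(\d+)$/);
  if (!match) return null;
  const [, path = "", added = "0", removed = "0"] = match;
  return { path, added: Number(added), removed: Number(removed) };
}

// Sum the added and removed line counts over all entries that parse.
function tally(entries: string[]): { files: number; added: number; removed: number } {
  const parsed = entries
    .map(parseEntry)
    .filter((change): change is FileChange => change !== null);
  return {
    files: parsed.length,
    added: parsed.reduce((sum, change) => sum + change.added, 0),
    removed: parsed.reduce((sum, change) => sum + change.removed, 0),
  };
}

// Three entries copied from the list above.
const sample = [
  "1. package/LICENSE +7 -0",
  "2. package/README.md +317 -801",
  "22. package/dist/{kreuzberg_wasm.d.ts → pkg/kreuzberg_wasm.d.ts} +24 -12",
];

const totals = tally(sample);
console.log(totals);
```

Entries that do not match the assumed shape are skipped rather than guessed at, so a reformatted list would lower the `files` count instead of silently skewing the totals.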
package/README.md CHANGED
@@ -1,982 +1,498 @@
1
- # Kreuzberg
1
+ # WebAssembly
2
+
3
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
4
+ <!-- Language Bindings -->
5
+ <a href="https://crates.io/crates/kreuzberg">
6
+ <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
7
+ </a>
8
+ <a href="https://hex.pm/packages/kreuzberg">
9
+ <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
10
+ </a>
11
+ <a href="https://pypi.org/project/kreuzberg/">
12
+ <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
13
+ </a>
14
+ <a href="https://www.npmjs.com/package/@kreuzberg/node">
15
+ <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
16
+ </a>
17
+ <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
18
+ <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
19
+ </a>
20
+
21
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
22
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
23
+ </a>
24
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
25
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
26
+ </a>
27
+ <a href="https://www.nuget.org/packages/Kreuzberg/">
28
+ <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
29
+ </a>
30
+ <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
31
+ <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
32
+ </a>
33
+ <a href="https://rubygems.org/gems/kreuzberg">
34
+ <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
35
+ </a>
36
+
37
+ <!-- Project Info -->
38
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
39
+ <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
40
+ </a>
41
+ <a href="https://docs.kreuzberg.dev">
42
+ <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
43
+ </a>
44
+ </div>
45
+
46
+ <img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
47
+
48
+ <div align="center" style="margin-top: 20px;">
49
+ <a href="https://discord.gg/pXxagNK2zN">
50
+ <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
51
+ </a>
52
+ </div>
53
+
54
+
55
+ Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
2
56
 
3
- [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
- [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
- [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
- [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
- [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
- [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
- [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
- [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
57
 
12
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
- [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
-
16
- High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
17
-
18
- Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
-
20
- > **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
21
- >
22
- > **This WASM package is designed for:**
23
- > - Browser applications (including web workers)
24
- > - Cloudflare Workers and edge runtimes
25
- > - Deno applications
26
- > - Environments without native build toolchain
27
-
28
- > **🚀 Version 4.0.0 Release Candidate**
29
- > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
30
-
31
- ## Features
32
-
33
- - **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
34
- - **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
35
- - **Table Extraction**: Advanced table detection and structured data extraction
36
- - **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
37
- - **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
38
- - **API Parity**: All extraction functions from the Node.js binding
39
- - **Plugin System**: Custom post-processors, validators, and OCR backends
40
- - **Optimized Bundle**: <5MB uncompressed, <2MB compressed
41
- - **Zero Configuration**: Works out of the box with sensible defaults
42
- - **Portable**: Runs anywhere WASM is supported without native dependencies
43
-
44
- ## Requirements
45
-
46
- - **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
47
- - **Node.js**: 18 or higher
48
- - **Deno**: 1.0 or higher
49
- - **Cloudflare Workers**: Compatible with Workers runtime
50
-
51
- ### Optional Dependencies
58
+ ## Installation
52
59
 
53
- - **tesseract-wasm**: Automatically loaded for OCR functionality (40+ language support)
60
+ ### Package Installation
54
61
 
55
- ## Installation
56
62
 
57
- ### Choosing the Right Package
63
+ Install via one of the supported package managers:
58
64
 
59
- | Use Case | Recommendation | Reason |
60
- |----------|---|---|
61
- | **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
62
- | **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
63
- | **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
64
- | **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
65
- | **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
66
65
 
67
- ### Install via npm/pnpm/yarn
68
66
 
67
+ **npm:**
69
68
  ```bash
70
69
  npm install @kreuzberg/wasm
71
70
  ```
72
71
 
73
- Or with pnpm:
74
72
 
75
- ```bash
76
- pnpm add @kreuzberg/wasm
77
- ```
78
73
 
79
- Or with yarn:
80
74
 
75
+ **pnpm:**
81
76
  ```bash
82
- yarn add @kreuzberg/wasm
83
- ```
84
-
85
- ### Deno
86
-
87
- ```typescript
88
- import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
89
- ```
90
-
91
- ## Quick Start
92
-
93
- ### Browser (ESM)
94
-
95
- ```typescript
96
- import { extractFile } from '@kreuzberg/wasm';
97
-
98
- async function handleFileUpload() {
99
- const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
100
- const file = fileInput.files[0];
101
-
102
- const result = await extractFile(file, {
103
- extract_tables: true,
104
- extract_images: true
105
- });
106
-
107
- console.log('Extracted text:', result.content);
108
- console.log('Tables found:', result.tables.length);
109
- }
110
- ```
111
-
112
- ### Node.js (ESM)
113
-
114
- ```typescript
115
- import { extractBytes } from '@kreuzberg/wasm';
116
- import { readFile } from 'fs/promises';
117
-
118
- const pdfBytes = await readFile('./document.pdf');
119
- const result = await extractBytes(
120
- new Uint8Array(pdfBytes),
121
- 'application/pdf',
122
- { extract_tables: true }
123
- );
124
-
125
- console.log(result.content);
126
- console.log('Found', result.tables.length, 'tables');
127
- ```
128
-
129
- ### Deno
130
-
131
- ```typescript
132
- import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
133
-
134
- const pdfBytes = await Deno.readFile("./document.pdf");
135
- const result = await extractBytes(pdfBytes, "application/pdf");
136
-
137
- console.log(result.content);
138
- ```
139
-
140
- ### Cloudflare Workers
141
-
142
- ```typescript
143
- import { extractBytes } from '@kreuzberg/wasm';
144
-
145
- export default {
146
- async fetch(request: Request): Promise<Response> {
147
- if (request.method === 'POST') {
148
- const formData = await request.formData();
149
- const file = formData.get('file') as File;
150
-
151
- const arrayBuffer = await file.arrayBuffer();
152
- const bytes = new Uint8Array(arrayBuffer);
153
-
154
- const result = await extractBytes(bytes, file.type);
155
-
156
- return Response.json({
157
- text: result.content,
158
- metadata: result.metadata,
159
- tables: result.tables
160
- });
161
- }
162
-
163
- return new Response('Upload a file', { status: 400 });
164
- }
165
- };
77
+ pnpm add @kreuzberg/wasm
166
78
  ```
167
79
 
168
- ## Performance Comparison
169
-
170
- Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
171
-
172
- | Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
173
- |--------|---|---|---|
174
- | **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
175
- | **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
176
- | **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
177
- | **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
178
- | **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
179
-
180
- ### When to Use WASM vs Native
181
-
182
- **Use WASM (@kreuzberg/wasm) when:**
183
- - Building browser applications (no choice, WASM required)
184
- - Targeting Cloudflare Workers or edge runtimes
185
- - Supporting Deno applications
186
- - You don't have a native build toolchain available
187
- - Portability across runtimes is critical
188
80
 
189
- **Use Native (@kreuzberg/node) when:**
190
- - Building Node.js or Bun applications (2-3x faster)
191
- - Performance is your primary concern
192
- - You're processing large volumes of documents
193
- - You have native build tools available
194
81
 
195
- ### Performance Tips for WASM
196
82
 
197
- 1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
198
- 2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
199
- 3. **Cache WASM module** by loading it once per application
200
- 4. **Preload OCR models** by calling extraction with OCR enabled early
201
-
202
- ## Examples
203
-
204
- Kreuzberg WASM includes complete working examples for different environments:
205
-
206
- - **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
207
- - **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
208
- - **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
209
-
210
- See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
211
-
212
- ## Multi-Threading with wasm-bindgen-rayon
213
-
214
- Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
215
-
216
- ### Initializing the Thread Pool
217
-
218
- To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
219
-
220
- ```typescript
221
- import { initThreadPool } from '@kreuzberg/wasm';
222
-
223
- // Initialize thread pool for multi-threaded extraction
224
- await initThreadPool(navigator.hardwareConcurrency);
225
-
226
- // Now extractions will use multiple threads for better performance
227
- const result = await extractBytes(pdfBytes, 'application/pdf');
83
+ **yarn:**
84
+ ```bash
85
+ yarn add @kreuzberg/wasm
228
86
  ```
229
87
 
230
- ### Required HTTP Headers for SharedArrayBuffer
231
-
232
- Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
233
88
 
234
- **Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
235
89
 
236
- Set these headers in your server configuration:
237
-
238
- ```
239
- Cross-Origin-Opener-Policy: same-origin
240
- Cross-Origin-Embedder-Policy: require-corp
241
- ```
242
90
 
243
- #### Server Configuration Examples
244
91
 
245
- **Express.js:**
246
- ```javascript
247
- app.use((req, res, next) => {
248
- res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
249
- res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
250
- next();
251
- });
252
- ```
92
+ ### System Requirements
253
93
 
254
- **Nginx:**
255
- ```nginx
256
- add_header 'Cross-Origin-Opener-Policy' 'same-origin';
257
- add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
258
- ```
94
+ - Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
95
+ - Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
259
96
 
260
- **Apache:**
261
- ```apache
262
- Header set Cross-Origin-Opener-Policy "same-origin"
263
- Header set Cross-Origin-Embedder-Policy "require-corp"
264
- ```
265
97
 
266
- **Cloudflare Workers:**
267
- ```javascript
268
- export default {
269
- async fetch(request: Request): Promise<Response> {
270
- const response = new Response(body);
271
- response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
272
- response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
273
- return response;
274
- }
275
- };
276
- ```
277
98
 
278
- ### Browser Compatibility
99
+ ## Quick Start
279
100
 
280
- Multi-threading with SharedArrayBuffer is available in:
101
+ ### Basic Extraction
281
102
 
282
- - **Chrome/Edge**: 74+
283
- - **Firefox**: 79+
284
- - **Safari**: 15.2+
285
- - **Opera**: 60+
103
+ Extract text, metadata, and structure from any supported document format:
286
104
 
287
- In unsupported browsers or when headers are not set, the library automatically degrades to single-threaded mode.
105
+ ```ts
106
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
288
107
 
289
- ### Graceful Degradation
108
+ async function main() {
109
+ await initWasm();
290
110
 
291
- The library handles thread pool initialization gracefully. If initialization fails or is unavailable:
111
+ const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
112
+ const bytes = new Uint8Array(buffer);
292
113
 
293
- ```typescript
294
- import { initThreadPool } from '@kreuzberg/wasm';
114
+ const result = await extractBytes(bytes, "application/pdf");
295
115
 
296
- try {
297
- await initThreadPool(navigator.hardwareConcurrency);
298
- console.log('Multi-threading enabled');
299
- } catch (error) {
300
- // Fall back to single-threaded processing
301
- console.warn('Multi-threading unavailable:', error);
302
- console.log('Using single-threaded extraction');
116
+ console.log("Extracted content:");
117
+ console.log(result.content);
118
+ console.log("MIME type:", result.mimeType);
119
+ console.log("Metadata:", result.metadata);
303
120
  }
304
121
 
305
- // Extraction will work in both cases
306
- const result = await extractBytes(pdfBytes, 'application/pdf');
122
+ main().catch(console.error);
307
123
  ```
308
124
 
309
- ### Complete Example with Thread Pool
310
-
311
- ```typescript
312
- import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
313
125
 
314
- async function initializeKreuzbergWithThreading() {
315
- try {
316
- // Initialize WASM module
317
- await initWasm();
126
+ ### Common Use Cases
318
127
 
319
- // Initialize multi-threading
320
- const cpuCount = navigator.hardwareConcurrency || 1;
321
- try {
322
- await initThreadPool(cpuCount);
323
- console.log(`Thread pool initialized with ${cpuCount} workers`);
324
- } catch (error) {
325
- console.warn('Could not initialize thread pool, using single-threaded mode');
326
- }
327
-
328
- } catch (error) {
329
- console.error('Failed to initialize Kreuzberg:', error);
330
- }
331
- }
128
+ #### Extract with Custom Configuration
332
129
 
333
- async function extractDocument(file: File) {
334
- const bytes = new Uint8Array(await file.arrayBuffer());
130
+ Most use cases benefit from configuration to control extraction behavior:
335
131
 
336
- // Extraction will use multiple threads if available
337
- const result = await extractBytes(bytes, file.type, {
338
- extract_tables: true,
339
- extract_images: true
340
- });
341
-
342
- return result;
343
- }
344
-
345
- // Initialize once at app startup
346
- await initializeKreuzbergWithThreading();
347
-
348
- // Later, handle file uploads
349
- fileInput.addEventListener('change', async (e) => {
350
- const file = e.target.files?.[0];
351
- if (file) {
352
- const result = await extractDocument(file);
353
- console.log('Extracted text:', result.content);
354
- }
355
- });
356
- ```
357
-
358
- ### Performance Considerations
359
-
360
- - **Thread Pool Size**: Generally, using `navigator.hardwareConcurrency` is optimal. For servers, use the number of available CPU cores.
361
- - **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
362
- - **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
363
-
364
- ## OCR Support
365
132
 
366
- The WASM binding integrates [tesseract-wasm](https://github.com/robertknight/tesseract-wasm) for OCR support with 40+ languages.
133
+ **With OCR (for scanned documents):**
367
134
 
368
- ### Basic OCR
135
+ ```ts
136
+ import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
369
137
 
370
- ```typescript
371
- import { extractBytes } from '@kreuzberg/wasm';
138
+ async function extractWithOcr() {
139
+ await initWasm();
372
140
 
373
- const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());
141
+ try {
142
+ await enableOcr();
143
+ console.log("OCR enabled successfully");
144
+ } catch (error) {
145
+ console.error("Failed to enable OCR:", error);
146
+ return;
147
+ }
374
148
 
375
- const result = await extractBytes(
376
- new Uint8Array(imageBytes),
377
- 'image/jpeg',
378
- {
379
- enable_ocr: true,
380
- ocr_config: {
381
- languages: ['eng'], // English
382
- backend: 'tesseract-wasm'
383
- }
384
- }
385
- );
149
+ const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
386
150
 
387
- console.log('OCR text:', result.content);
388
- ```
151
+ const result = await extractBytes(bytes, "image/png", {
152
+ ocr: {
153
+ backend: "tesseract-wasm",
154
+ language: "eng",
155
+ },
156
+ });
389
157
 
390
- ### Multi-Language OCR
158
+ console.log("Extracted text:");
159
+ console.log(result.content);
160
+ }
391
161
 
392
- ```typescript
393
- const result = await extractBytes(imageBytes, 'image/png', {
394
- enable_ocr: true,
395
- ocr_config: {
396
- languages: ['eng', 'deu', 'fra'], // English, German, French
397
- backend: 'tesseract-wasm'
398
- }
399
- });
162
+ extractWithOcr().catch(console.error);
400
163
  ```
401
164
 
402
- ### Supported Languages
403
165
 
404
- `eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
405
166
 
406
- Training data is automatically loaded from jsDelivr CDN:
407
- ```
408
- https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
409
- ```
410
167
 
411
- ## Configuration
168
+ #### Table Extraction
412
169
 
413
- ### Extract Tables
414
170
 
415
- ```typescript
416
- import { extractBytes } from '@kreuzberg/wasm';
171
+ See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
417
172
 
418
- const result = await extractBytes(pdfBytes, 'application/pdf', {
419
- extract_tables: true
420
- });
421
173
 
422
- if (result.tables) {
423
- for (const table of result.tables) {
424
- console.log('Table as Markdown:');
425
- console.log(table.markdown);
426
174
 
427
- console.log('Table cells:');
428
- console.log(JSON.stringify(table.cells, null, 2));
429
- }
430
- }
431
- ```
175
+ #### Processing Multiple Files
432
176
 
433
- ### Extract Images
434
177
 
435
- ```typescript
436
- import { extractBytes } from '@kreuzberg/wasm';
178
+ ```ts
179
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
437
180
 
438
- const result = await extractBytes(pdfBytes, 'application/pdf', {
439
- extract_images: true,
440
- image_config: {
441
- target_dpi: 300,
442
- max_image_dimension: 4096
443
- }
444
- });
445
-
446
- if (result.images) {
447
- for (const image of result.images) {
448
- console.log(`Image ${image.index}: ${image.format}`);
449
- // image.data is a Uint8Array
450
- }
181
+ interface DocumentJob {
182
+ name: string;
183
+ bytes: Uint8Array;
184
+ mimeType: string;
451
185
  }
452
- ```
453
186
 
454
- ### Text Chunking
455
-
456
- ```typescript
457
- import { extractBytes } from '@kreuzberg/wasm';
458
-
459
- const result = await extractBytes(pdfBytes, 'application/pdf', {
460
- enable_chunking: true,
461
- chunking_config: {
462
- max_chars: 1000,
463
- max_overlap: 200
464
- }
465
- });
466
-
467
- if (result.chunks) {
468
- for (const chunk of result.chunks) {
469
- console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
470
- }
187
+ async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
188
+ await initWasm();
189
+
190
+ const results: Record<string, string> = {};
191
+ const queue = [...documents];
192
+
193
+ const workers = Array(concurrency)
194
+ .fill(null)
195
+ .map(async () => {
196
+ while (queue.length > 0) {
197
+ const doc = queue.shift();
198
+ if (!doc) break;
199
+
200
+ try {
201
+ const result = await extractBytes(doc.bytes, doc.mimeType);
202
+ results[doc.name] = result.content;
203
+ } catch (error) {
204
+ console.error(`Failed to process ${doc.name}:`, error);
205
+ }
206
+ }
207
+ });
208
+
209
+ await Promise.all(workers);
210
+ return results;
471
211
  }
472
212
  ```
473
213
 
474
- ### Language Detection
475
214
 
476
- ```typescript
477
- import { extractBytes } from '@kreuzberg/wasm';
478
215
 
479
- const result = await extractBytes(pdfBytes, 'application/pdf', {
480
- enable_language_detection: true
481
- });
482
216
 
483
- if (result.language) {
484
- console.log(`Detected language: ${result.language.code}`);
485
- console.log(`Confidence: ${result.language.confidence}`);
486
- }
487
- ```
488
217
 
489
- ### Complete Configuration Example
490
-
491
- ```typescript
492
- import {
493
- extractBytes,
494
- type ExtractionConfig
495
- } from '@kreuzberg/wasm';
496
-
497
- const config: ExtractionConfig = {
498
- extract_tables: true,
499
- extract_images: true,
500
- extract_metadata: true,
501
-
502
- enable_ocr: true,
503
- ocr_config: {
504
- languages: ['eng'],
505
- backend: 'tesseract-wasm',
506
- dpi: 300,
507
- preprocessing: {
508
- deskew: true,
509
- denoise: true,
510
- binarize: true
511
- }
512
- },
513
-
514
- enable_chunking: true,
515
- chunking_config: {
516
- max_chars: 1000,
517
- max_overlap: 200
518
- },
519
-
520
- enable_language_detection: true,
521
-
522
- enable_quality: true,
523
-
524
- extract_keywords: true,
525
- keywords_config: {
526
- max_keywords: 10,
527
- method: 'yake'
528
- }
529
- };
530
-
531
- const result = await extractBytes(data, mimeType, config);
532
- ```
218
+ #### Async Processing
533
219
 
534
- ## Advanced Usage
220
+ For non-blocking document processing:
535
221
 
536
- ### Batch Processing
222
+ ```ts
223
+ import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
537
224
 
538
- ```typescript
539
- import { batchExtractFiles, batchExtractBytes } from '@kreuzberg/wasm';
225
+ async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
226
+ const caps = getWasmCapabilities();
227
+ if (!caps.hasWasm) {
228
+ throw new Error("WebAssembly not supported");
229
+ }
540
230
 
541
- // Browser: Process multiple files
542
- const fileInput = document.querySelector<HTMLInputElement>('#files');
543
- const files = Array.from(fileInput.files);
231
+ await initWasm();
544
232
 
545
- const results = await batchExtractFiles(files, {
546
- extract_tables: true
547
- });
233
+ const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
548
234
 
549
- for (const result of results) {
550
- console.log(`${result.mime_type}: ${result.content.length} characters`);
235
+ return results.map((r) => ({
236
+ content: r.content,
237
+ pageCount: r.metadata?.pageCount,
238
+ }));
551
239
  }
552
240
 
553
- // Or from Uint8Arrays
554
- const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
555
- const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
241
+ const fileBytes = [new Uint8Array([1, 2, 3])];
242
+ const mimes = ["application/pdf"];
556
243
 
557
- const results = await batchExtractBytes(dataList, mimeTypes);
244
+ extractDocuments(fileBytes, mimes)
245
+ .then((results) => console.log(results))
246
+ .catch(console.error);
558
247
  ```
559
248
 
560
- ### Synchronous Extraction
561
249
 
562
- ```typescript
563
- import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
564
250
 
565
- // Synchronous single extraction
566
- const result = extractBytesSync(data, 'application/pdf', config);
567
251
 
568
- // Synchronous batch extraction
569
- const results = batchExtractBytesSync(dataList, mimeTypes, config);
570
- ```
571
252
 
572
- ### Plugin System
573
253
 
574
- #### Custom Post-Processors
254
+ ### Next Steps
575
255
 
576
- ```typescript
577
- import { registerPostProcessor } from '@kreuzberg/wasm';
256
+ - **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
257
+ - **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
258
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
259
+ - **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)** - Advanced configuration options
578
260
 
579
- registerPostProcessor({
580
- name: 'uppercase',
581
- async process(result) {
582
- return {
583
- ...result,
584
- content: result.content.toUpperCase()
585
- };
586
- }
587
- });
588
261
 
589
- // Now all extractions will use this processor
590
- const result = await extractBytes(data, mimeType);
591
- console.log(result.content); // UPPERCASE TEXT
592
- ```
593
262
 
594
- #### Custom Validators
263
+ ## Features
595
264
 
596
- ```typescript
597
- import { registerValidator } from '@kreuzberg/wasm';
265
+ ### Supported File Formats (56+)
598
266
 
599
- registerValidator({
600
- name: 'min-length',
601
- async validate(result) {
602
- if (result.content.length < 100) {
603
- throw new Error('Document too short');
604
- }
605
- }
606
- });
607
- ```
267
+ 56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
608
268
 
609
- #### Custom OCR Backends
610
-
611
- ```typescript
612
- import { registerOcrBackend } from '@kreuzberg/wasm';
613
-
614
- registerOcrBackend({
615
- name: 'custom-ocr',
616
- supportedLanguages() {
617
- return ['eng', 'fra'];
618
- },
619
- async initialize() {
620
- // Initialize your OCR backend
621
- },
622
- async processImage(imageBytes, language) {
623
- // Process image and return result
624
- return {
625
- content: 'extracted text',
626
- mime_type: 'text/plain',
627
- metadata: {},
628
- tables: []
629
- };
630
- },
631
- async shutdown() {
632
- // Cleanup
633
- }
634
- });
635
- ```
269
+ #### Office Documents
636
270
 
637
- ### MIME Type Detection
271
+ | Category | Formats | Capabilities |
272
+ |----------|---------|--------------|
273
+ | **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
274
+ | **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
275
+ | **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
276
+ | **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
277
+ | **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
638
278
 
639
- ```typescript
640
- import {
641
- detectMimeFromBytes,
642
- getMimeFromExtension,
643
- getExtensionsForMime,
644
- normalizeMimeType
645
- } from '@kreuzberg/wasm';
279
+ #### Images (OCR-Enabled)
646
280
 
647
- // Auto-detect MIME type from file bytes
648
- const mimeType = detectMimeFromBytes(fileBytes);
281
+ | Category | Formats | Features |
282
+ |----------|---------|----------|
283
+ | **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
284
+ | **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
285
+ | **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
649
286
 
650
- // Get MIME type from file extension
651
- const mime = getMimeFromExtension('pdf'); // 'application/pdf'
287
+ #### Web & Data
652
288
 
653
- // Get extensions for MIME type
654
- const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
289
+ | Category | Formats | Features |
290
+ |----------|---------|----------|
291
+ | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
292
+ | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
293
+ | **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
655
294
 
656
- // Normalize MIME type
657
- const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
658
- ```
295
+ #### Email & Archives
659
296
 
660
- ### Configuration Loading
297
+ | Category | Formats | Features |
298
+ |----------|---------|----------|
299
+ | **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
300
+ | **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
661
301
 
662
- ```typescript
663
- import { loadConfigFromString } from '@kreuzberg/wasm';
302
+ #### Academic & Scientific
664
303
 
665
- // Load from YAML
666
- const yamlConfig = `
667
- extract_tables: true
668
- enable_ocr: true
669
- ocr_config:
670
- languages: [eng, deu]
671
- `;
672
- const config = loadConfigFromString(yamlConfig, 'yaml');
304
+ | Category | Formats | Features |
305
+ |----------|---------|----------|
306
+ | **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
307
+ | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
308
+ | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
673
309
 
674
- // Load from JSON
675
- const jsonConfig = '{"extract_tables":true}';
676
- const config2 = loadConfigFromString(jsonConfig, 'json');
310
+ **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
677
311
 
678
- // Load from TOML
679
- const tomlConfig = 'extract_tables = true';
680
- const config3 = loadConfigFromString(tomlConfig, 'toml');
681
- ```
312
+ ### Key Capabilities
682
313
 
683
- ## API Reference
314
+ - **Text Extraction** - Extract all text content with position and formatting information
315
+ - **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
316
+ - **Table Extraction** - Parse tables with structure and cell content preservation
317
+ - **Image Extraction** - Extract embedded images and render page previews
318
+ - **OCR Support** - Integrate multiple OCR backends for scanned documents
684
319
 
685
- ### Extraction Functions
320
+ - **Async/Await** - Non-blocking document processing with concurrent operations
686
321
 
687
- #### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
688
- Extract content from a browser `File` object.
689
322
 
690
- #### `extractBytes(data: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
691
- Asynchronously extract content from a `Uint8Array`.
323
+ - **Plugin System** - Extensible post-processing for custom text transformation
692
324
 
693
- #### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
694
- Synchronously extract content from a `Uint8Array`.
695
325
 
696
- #### `batchExtractFiles(files: File[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
697
- Extract multiple files in parallel.
326
+ - **Batch Processing** - Efficiently process multiple documents in parallel
327
+ - **Memory Efficient** - Stream large files without loading entirely into memory
328
+ - **Language Detection** - Detect and support multiple languages in documents
329
+ - **Configuration** - Fine-grained control over extraction behavior
698
330
 
699
- #### `batchExtractBytes(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
700
- Extract multiple byte arrays in parallel.
331
+ ### Performance Characteristics
701
332
 
702
- #### `batchExtractBytesSync(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): ExtractionResult[]`
703
- Extract multiple byte arrays synchronously.
333
+ | Format | Speed | Memory | Notes |
334
+ |--------|-------|--------|-------|
335
+ | **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
336
+ | **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
337
+ | **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
338
+ | **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
339
+ | **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
704
340
 
705
- ### Plugin Management
706
341
 
707
- #### Post-Processors
708
342
 
709
- ```typescript
710
- registerPostProcessor(processor: PostProcessorProtocol): void
711
- unregisterPostProcessor(name: string): void
712
- clearPostProcessors(): void
713
- listPostProcessors(): string[]
714
- ```
343
+ ## OCR Support
715
344
 
716
- #### Validators
345
+ Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
717
346
 
718
- ```typescript
719
- registerValidator(validator: ValidatorProtocol): void
720
- unregisterValidator(name: string): void
721
- clearValidators(): void
722
- listValidators(): string[]
723
- ```
724
347
 
725
- #### OCR Backends
348
+ - **Tesseract-Wasm**
726
349
 
727
- ```typescript
728
- registerOcrBackend(backend: OcrBackendProtocol): void
729
- unregisterOcrBackend(name: string): void
730
- clearOcrBackends(): void
731
- listOcrBackends(): string[]
732
- ```
733
350
 
734
- ### Document Extractors
351
+ ### OCR Configuration Example
735
352
 
736
- ```typescript
737
- listDocumentExtractors(): string[]
738
- unregisterDocumentExtractor(name: string): void
739
- clearDocumentExtractors(): void
740
- ```
353
+ ```ts
354
+ import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
741
355
 
742
- ### MIME Utilities
356
+ async function extractWithOcr() {
357
+ await initWasm();
743
358
 
744
- ```typescript
745
- detectMimeFromBytes(data: Uint8Array): string
746
- getMimeFromExtension(ext: string): string | null
747
- getExtensionsForMime(mime: string): string[]
748
- normalizeMimeType(mime: string): string
749
- ```
359
+ try {
360
+ await enableOcr();
361
+ console.log("OCR enabled successfully");
362
+ } catch (error) {
363
+ console.error("Failed to enable OCR:", error);
364
+ return;
365
+ }
750
366
 
751
- ### Configuration
367
+ const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
752
368
 
753
- ```typescript
754
- loadConfigFromString(content: string, format: 'yaml' | 'toml' | 'json'): ExtractionConfig
755
- ```
369
+ const result = await extractBytes(bytes, "image/png", {
370
+ ocr: {
371
+ backend: "tesseract-wasm",
372
+ language: "eng",
373
+ },
374
+ });
756
375
 
757
- ### Embeddings
376
+ console.log("Extracted text:");
377
+ console.log(result.content);
378
+ }
758
379
 
759
- ```typescript
760
- listEmbeddingPresets(): string[]
761
- getEmbeddingPreset(name: string): EmbeddingPreset | null
380
+ extractWithOcr().catch(console.error);
762
381
  ```
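Note on the OCR example above: it hands whatever `fetch` returns straight to `extractBytes`, so an HTTP error page would be treated as image bytes. The following is a minimal sketch of a guard, not part of `@kreuzberg/wasm`. The helper `fetchBytes` and the types `ResponseLike` and `FetchLike` are hypothetical names introduced here; the fetch function is injected so the helper can be exercised without a network.

```ts
// Hypothetical structural types: just the parts of the Fetch API this helper needs.
interface ResponseLike {
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}
type FetchLike = (url: string) => Promise<ResponseLike>;

// Download a resource and return its bytes, failing loudly on non-2xx responses.
async function fetchBytes(url: string, fetchImpl: FetchLike): Promise<Uint8Array> {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with HTTP ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}

// Fake fetch used only for this demonstration: ".png" URLs succeed, everything else is a 404.
const fakeFetch: FetchLike = async (url) =>
  url.endsWith(".png")
    ? { ok: true, status: 200, arrayBuffer: async () => new ArrayBuffer(4) }
    : { ok: false, status: 404, arrayBuffer: async () => new ArrayBuffer(0) };

fetchBytes("scanned-page.png", fakeFetch).then((bytes) => console.log("bytes:", bytes.length));
fetchBytes("missing.txt", fakeFetch).catch((error: Error) => console.log(error.message));
```

In a browser or recent Node.js runtime, the global `fetch` could presumably be passed as `fetchImpl`, and the returned bytes given to `extractBytes` as in the README example.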
763
382
 
764
- ## Types
765
-
766
- All types are shared via the `@kreuzberg/core` package:
767
-
768
- ```typescript
769
- import type {
770
- ExtractionResult,
771
- ExtractionConfig,
772
- OcrConfig,
773
- ChunkingConfig,
774
- ImageConfig,
775
- KeywordsConfig,
776
- Table,
777
- ExtractedImage,
778
- Chunk,
779
- Metadata,
780
- PostProcessorProtocol,
781
- ValidatorProtocol,
782
- OcrBackendProtocol
783
- } from '@kreuzberg/core';
784
- ```
785
383
 
786
- ### ExtractionResult
787
-
788
- Main result object containing:
789
- - `content: string` - Extracted text content
790
- - `mime_type: string` - MIME type of the document
791
- - `metadata?: Metadata` - Document metadata
792
- - `tables?: Table[]` - Extracted tables
793
- - `images?: ExtractedImage[]` - Extracted images
794
- - `chunks?: Chunk[]` - Text chunks (if chunking enabled)
795
- - `language?: LanguageInfo` - Detected language (if enabled)
796
- - `keywords?: Keyword[]` - Extracted keywords (if enabled)
797
-
798
- ### ExtractionConfig
799
-
800
- Configuration object for extraction:
801
- - `extract_tables?: boolean` - Extract tables as structured data
802
- - `extract_images?: boolean` - Extract embedded images
803
- - `extract_metadata?: boolean` - Extract document metadata
804
- - `enable_ocr?: boolean` - Enable OCR for images
805
- - `ocr_config?: OcrConfig` - OCR settings
806
- - `enable_chunking?: boolean` - Split text into semantic chunks
807
- - `chunking_config?: ChunkingConfig` - Text chunking settings
808
- - `enable_language_detection?: boolean` - Detect document language
809
- - `enable_quality?: boolean` - Encoding detection, normalization
810
- - `extract_keywords?: boolean` - Extract important keywords
811
- - `keywords_config?: KeywordsConfig` - Keyword extraction settings
812
-
813
- ### Table
814
-
815
- Extracted table structure:
816
- - `markdown: string` - Table in Markdown format
817
- - `cells: TableCell[][]` - 2D array of table cells
818
- - `row_count: number` - Number of rows
819
- - `column_count: number` - Number of columns
820
-
821
- ## Supported Formats
822
-
823
- | Category | Formats |
824
- |----------|---------|
825
- | **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
826
- | **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
827
- | **Web** | HTML, XHTML, XML, EPUB |
828
- | **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
829
- | **Email** | EML, MSG |
830
- | **Archives** | ZIP, TAR, 7Z |
831
- | **Other** | And 30+ more formats |
832
-
833
- ## Build from Source
834
-
835
- ### Prerequisites
836
-
837
- - Rust 1.75+ with `wasm32-unknown-unknown` target
838
- - Node.js 18+ with pnpm
839
- - wasm-pack
840
384
 
841
- ```bash
842
- # Install Rust target
843
- rustup target add wasm32-unknown-unknown
844
385
 
845
- # Install wasm-pack
846
- curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
386
+ ## Async Support
847
387
 
848
- # Build WASM package
849
- cd crates/kreuzberg-wasm
850
- pnpm install
851
- pnpm run build
388
+ This binding provides full async/await support for non-blocking document processing:
852
389
 
853
- # Run tests
854
- pnpm test
855
- ```
390
+ ```ts
391
+ import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
856
392
 
857
- ### Build Targets
393
+ async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
394
+ const caps = getWasmCapabilities();
395
+ if (!caps.hasWasm) {
396
+ throw new Error("WebAssembly not supported");
397
+ }
858
398
 
859
- ```bash
860
- # For browsers (ESM modules)
861
- pnpm run build:wasm:web
399
+ await initWasm();
862
400
 
863
- # For bundlers (webpack, rollup, vite)
864
- pnpm run build:wasm:bundler
401
+ const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
865
402
 
866
- # For Node.js
867
- pnpm run build:wasm:nodejs
403
+ return results.map((r) => ({
404
+ content: r.content,
405
+ pageCount: r.metadata?.pageCount,
406
+ }));
407
+ }
868
408
 
869
- # For Deno
870
- pnpm run build:wasm:deno
409
+ const fileBytes = [new Uint8Array([1, 2, 3])];
410
+ const mimes = ["application/pdf"];
871
411
 
872
- # Build all targets
873
- pnpm run build:all
412
+ extractDocuments(fileBytes, mimes)
413
+ .then((results) => console.log(results))
414
+ .catch(console.error);
874
415
  ```
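Note on the async example above: `Promise.all` rejects as soon as one extraction fails, so the results of the other documents are lost, and the two parallel arrays are never checked against each other. A sketch of one alternative follows. The function `extractEach`, the `Outcome` type, and the stub extractor are hypothetical names introduced here; the `Extractor` type stands in for `extractBytes` and assumes only the `content` field that the README itself uses.

```ts
// Stand-in for extractBytes so the sketch runs without the WASM module (assumption).
type Extractor = (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>;

type Outcome =
  | { index: number; ok: true; content: string }
  | { index: number; ok: false; error: string };

// Extract every document concurrently and report success or failure per input.
async function extractEach(
  files: Uint8Array[],
  mimeTypes: string[],
  extract: Extractor,
): Promise<Outcome[]> {
  if (files.length !== mimeTypes.length) {
    throw new Error(`Got ${files.length} files but ${mimeTypes.length} MIME types`);
  }
  return Promise.all(
    files.map(async (bytes, index): Promise<Outcome> => {
      const mimeType = mimeTypes[index];
      if (mimeType === undefined) {
        return { index, ok: false, error: "missing MIME type" };
      }
      try {
        const result = await extract(bytes, mimeType);
        return { index, ok: true, content: result.content };
      } catch (error) {
        // Convert the rejection into data so the other extractions still complete.
        return { index, ok: false, error: error instanceof Error ? error.message : String(error) };
      }
    }),
  );
}

// Stub extractor used only for this demonstration: rejects anything that is not text/plain.
const stubExtract: Extractor = async (bytes, mimeType) => {
  if (mimeType !== "text/plain") {
    throw new Error(`unsupported: ${mimeType}`);
  }
  return { content: `read ${bytes.length} bytes` };
};

extractEach(
  [new Uint8Array(2), new Uint8Array(5)],
  ["text/plain", "application/x-unknown"],
  stubExtract,
).then((outcomes) => console.log(outcomes));
```

Because each failure is caught inside its own task, the outer `Promise.all` only rejects for the up-front length mismatch, which signals a caller bug rather than a bad document.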
875
416
 
876
- ## Limitations
877
-
878
- ### No File System Access
879
417
 
880
- The WASM binding cannot access the file system directly. Use file readers:
881
418
 
882
- ```typescript
883
- // ❌ Won't work
884
- await extractFileSync('./document.pdf'); // Throws error
885
419
 
886
- // Use file readers instead
887
- const bytes = await Deno.readFile('./document.pdf'); // Deno
888
- const bytes = await fs.readFile('./document.pdf'); // Node.js
889
- const bytes = await file.arrayBuffer(); // Browser
890
- ```
420
+ ## Plugin System
891
421
 
892
- ### OCR Training Data
422
+ Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
893
423
 
894
- Tesseract training data (`.traineddata` files) are loaded from jsDelivr CDN on first use. For offline usage or custom CDN, see the [OCR documentation](https://kreuzberg.dev).
424
+ For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/guides/plugins/).
895
425
 
896
- ### Size Constraints
897
426
 
898
- Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
899
427
 
900
- ## Troubleshooting
901
428
 
902
- ### "WASM module failed to initialize"
903
429
 
904
- Ensure your bundler is configured to handle WASM files:
905
430
 
906
- **Vite:**
907
- ```typescript
908
- // vite.config.ts
909
- export default {
910
- optimizeDeps: {
911
- exclude: ['@kreuzberg/wasm']
912
- }
913
- }
914
- ```
431
+ ## Batch Processing
915
432
 
916
- **Webpack:**
917
- ```javascript
918
- // webpack.config.js
919
- module.exports = {
920
- experiments: {
921
- asyncWebAssembly: true
922
- }
923
- }
924
- ```
433
+ Process multiple documents efficiently:
925
434
 
926
- ### "Module not found: @kreuzberg/core"
435
+ ```ts
436
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
927
437
 
928
- The @kreuzberg/core package is a peer dependency. Install it:
438
+ interface DocumentJob {
439
+ name: string;
440
+ bytes: Uint8Array;
441
+ mimeType: string;
442
+ }
929
443
 
930
- ```bash
931
- pnpm add @kreuzberg/core
444
+ async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
445
+ await initWasm();
446
+
447
+ const results: Record<string, string> = {};
448
+ const queue = [...documents];
449
+
450
+ const workers = Array(concurrency)
451
+ .fill(null)
452
+ .map(async () => {
453
+ while (queue.length > 0) {
454
+ const doc = queue.shift();
455
+ if (!doc) break;
456
+
457
+ try {
458
+ const result = await extractBytes(doc.bytes, doc.mimeType);
459
+ results[doc.name] = result.content;
460
+ } catch (error) {
461
+ console.error(`Failed to process ${doc.name}:`, error);
462
+ }
463
+ }
464
+ });
465
+
466
+ await Promise.all(workers);
467
+ return results;
468
+ }
932
469
  ```
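Note on the batch example above: the README defines `_processBatch` but never calls it. The sketch below shows the same shared-queue pattern with the extractor injected, so that it runs without the WASM module and the concurrency bound can be observed. `processBatch`, `makeTrackedExtractor`, and `makeJobs` are hypothetical helpers introduced here; the fake extractor only simulates work with a short timer.

```ts
interface DocumentJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

// Stand-in for extractBytes (assumption: only the `content` field is used).
type ExtractFn = (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>;

// Workers pull jobs from one shared queue, so at most `concurrency` extractions run at once.
async function processBatch(
  documents: DocumentJob[],
  extract: ExtractFn,
  concurrency = 3,
): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  const queue = [...documents];

  const worker = async (): Promise<void> => {
    let doc = queue.shift();
    while (doc !== undefined) {
      try {
        const result = await extract(doc.bytes, doc.mimeType);
        results[doc.name] = result.content;
      } catch (error) {
        console.error(`Failed to process ${doc.name}:`, error);
      }
      doc = queue.shift();
    }
  };

  const workerCount = Math.max(1, Math.min(concurrency, documents.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Fake extractor that records how many calls were in flight at the same time.
function makeTrackedExtractor(): { extract: ExtractFn; peak: () => number } {
  let active = 0;
  let peak = 0;
  const extract: ExtractFn = async (bytes, mimeType) => {
    active += 1;
    peak = Math.max(peak, active);
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    active -= 1;
    return { content: `${mimeType}:${bytes.length}` };
  };
  return { extract, peak: () => peak };
}

// Build `count` placeholder jobs named doc-0, doc-1, ...
function makeJobs(count: number): DocumentJob[] {
  return Array.from({ length: count }, (_unused, i) => ({
    name: `doc-${i}`,
    bytes: new Uint8Array(i + 1),
    mimeType: "text/plain",
  }));
}

const demo = makeTrackedExtractor();
processBatch(makeJobs(7), demo.extract, 3).then((results) => {
  console.log(Object.keys(results).length, "documents, peak concurrency", demo.peak());
});
```

Results are keyed by `name`, as in the README example, so jobs with duplicate names would overwrite each other; keying by index is one way around that.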
933
470
 
934
- ### Memory Issues in Workers
935
-
936
- For large documents in Cloudflare Workers, process in smaller chunks:
937
-
938
- ```typescript
939
- const result = await extractBytes(pdfBytes, 'application/pdf', {
940
- chunking_config: { max_chars: 1000 }
941
- });
942
- ```
943
471
 
944
- ### OCR Not Working
945
472
 
946
- Check that tesseract-wasm is loading correctly. The training data is automatically fetched from CDN on first use.
947
473
 
948
- ## Examples
474
+ ## Configuration
949
475
 
950
- See the [`examples/`](./examples/) directory for complete working examples:
476
+ For advanced configuration options including language detection, table extraction, OCR settings, and more:
951
477
 
952
- - **Browser**: Vanilla JS file upload interface
953
- - **Deno**: Command-line document extraction
954
- - **Cloudflare Workers**: Document processing API
955
- - **Node.js**: Batch processing script
478
+ **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)**
956
479
 
957
480
  ## Documentation
958
481
 
959
- For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
482
+ - **[Official Documentation](https://kreuzberg.dev/)**
483
+ - **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
484
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)**
960
485
 
961
486
  ## Contributing
962
487
 
963
- We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/docs/contributing.md) for details.
488
+ Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
964
489
 
965
490
  ## License
966
491
 
967
- MIT
968
-
969
- ## Links
970
-
971
- - [Website](https://kreuzberg.dev)
972
- - [Documentation](https://kreuzberg.dev)
973
- - [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
974
- - [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
975
- - [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
976
- - [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
492
+ MIT License - see LICENSE file for details.
977
493
 
978
- ## Related Packages
494
+ ## Support
979
495
 
980
- - [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) - Native Node.js bindings (NAPI)
981
- - [@kreuzberg/core](https://www.npmjs.com/package/@kreuzberg/core) - Shared TypeScript types
982
- - [kreuzberg](https://crates.io/crates/kreuzberg) - Rust core library
496
+ - **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
497
+ - **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
498
+ - **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)