@kreuzberg/wasm 4.0.0-rc.20 → 4.0.0-rc.23

package/README.md CHANGED
@@ -1,1059 +1,742 @@
1
- # Kreuzberg
2
-
3
- [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
- [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
- [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
- [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
- [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
- [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
- [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
- [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
-
12
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
- [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
-
16
- High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
17
-
18
- Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
-
20
- > **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
21
- >
22
- > **This WASM package is designed for:**
23
- > - Browser applications (including web workers)
24
- > - Cloudflare Workers and edge runtimes
25
- > - Deno applications
26
- > - Environments without native build toolchain
27
-
28
- > **🚀 Version 4.0.0 Release Candidate**
29
- > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
30
-
31
- ## Features
32
-
33
- - **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
34
- - **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
35
- - **Table Extraction**: Advanced table detection and structured data extraction
36
- - **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
37
- - **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
38
- - **API Parity**: All extraction functions from the Node.js binding
39
- - **Plugin System**: Custom post-processors, validators, and OCR backends
40
- - **Optimized Bundle**: <5MB uncompressed, <2MB compressed
41
- - **Zero Configuration**: Works out of the box with sensible defaults
42
- - **Portable**: Runs anywhere WASM is supported without native dependencies
43
-
44
- ## Requirements
45
-
46
- - **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
47
- - **Node.js**: 18 or higher
48
- - **Deno**: 1.0 or higher
49
- - **Cloudflare Workers**: Compatible with Workers runtime
50
-
51
- ### Optional Dependencies
52
-
53
- - **tesseract-wasm**: Automatically loaded for OCR functionality (40+ language support)
1
+ # WebAssembly Bindings
2
+
3
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
4
+ <!-- Language Bindings -->
5
+ <a href="https://crates.io/crates/kreuzberg">
6
+ <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
7
+ </a>
8
+ <a href="https://hex.pm/packages/kreuzberg">
9
+ <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
10
+ </a>
11
+ <a href="https://pypi.org/project/kreuzberg/">
12
+ <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
13
+ </a>
14
+ <a href="https://www.npmjs.com/package/@kreuzberg/node">
15
+ <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
16
+ </a>
17
+ <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
18
+ <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
19
+ </a>
20
+
21
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
22
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
23
+ </a>
24
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
25
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
26
+ </a>
27
+ <a href="https://www.nuget.org/packages/Kreuzberg/">
28
+ <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
29
+ </a>
30
+ <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
31
+ <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
32
+ </a>
33
+ <a href="https://rubygems.org/gems/kreuzberg">
34
+ <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
35
+ </a>
36
+
37
+ <!-- Project Info -->
38
+
39
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
40
+ <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
41
+ </a>
42
+ <a href="https://docs.kreuzberg.dev">
43
+ <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
44
+ </a>
45
+ </div>
46
+
47
+ <img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
48
+
49
+ <div align="center" style="margin-top: 20px;">
50
+ <a href="https://discord.gg/pXxagNK2zN">
51
+ <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
52
+ </a>
53
+ </div>
54
+
55
+ Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Node.js, Deno, and Cloudflare Workers with portable deployment and optional multi-threading support.
56
+
57
+ > **Version 4.0.0 Release Candidate**
58
+ > Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
59
+ > This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
54
60
 
55
61
  ## Installation
56
62
 
57
- ### Choosing the Right Package
63
+ ### Package Installation
58
64
 
59
- | Use Case | Recommendation | Reason |
60
- |----------|---|---|
61
- | **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
62
- | **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
63
- | **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
64
- | **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
65
- | **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
65
+ Install via one of the supported package managers:
66
66
 
67
- ### Install via npm/pnpm/yarn
67
+ **npm:**
68
68
 
69
69
  ```bash
70
70
  npm install @kreuzberg/wasm
71
71
  ```
72
72
 
73
- Or with pnpm:
73
+ **pnpm:**
74
74
 
75
75
  ```bash
76
76
  pnpm add @kreuzberg/wasm
77
77
  ```
78
78
 
79
- Or with yarn:
79
+ **yarn:**
80
80
 
81
81
  ```bash
82
82
  yarn add @kreuzberg/wasm
83
83
  ```
84
84
 
85
- ### Deno
86
-
87
- ```typescript
88
- import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
89
- ```
90
-
91
- ## PDF Support and PDFium Initialization
92
-
93
- **IMPORTANT**: PDF extraction requires a one-time initialization step to load the PDFium WASM module.
85
+ ### Platform Support
94
86
 
95
- ### Why PDFium Initialization is Needed
87
+ Runs on:
88
+ - Modern browsers (Chrome, Firefox, Safari, Edge with WebAssembly support)
89
+ - Node.js 16+ (with WASM runtime)
90
+ - Deno 1.0+
91
+ - Cloudflare Workers
92
+ - Any JavaScript environment with WebAssembly support
96
93
 
97
- Kreuzberg uses the high-performance PDFium library (from Google Chrome) for PDF processing. In WASM environments, PDFium runs as a separate WASM module that must be loaded and bound to the main kreuzberg module before PDF extraction can work.
94
+ ### System Requirements
98
95
 
99
- ### How to Initialize PDFium
96
+ - WebAssembly support in runtime environment
97
+ - 50 MB minimum free memory for extraction
98
+ - Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
100
99
 
101
- ```javascript
102
- import init, { initialize_pdfium_render, extractBytes } from '@kreuzberg/wasm';
103
- import pdfiumModule from '@kreuzberg/wasm/pdfium.js';
100
+ ### Runtime Detection
104
101
 
105
- // Step 1: Initialize kreuzberg WASM
106
- await init();
102
+ Check platform capabilities before extraction:
107
103
 
108
- // Step 2: Load PDFium WASM module
109
- const pdfium = await pdfiumModule();
110
-
111
- // Step 3: Bind kreuzberg to PDFium (required before any PDF operations)
112
- const success = initialize_pdfium_render(pdfium, wasm, false);
113
- if (!success) {
114
- throw new Error('Failed to initialize PDFium');
115
- }
104
+ ```typescript
105
+ import { getWasmCapabilities } from '@kreuzberg/wasm';
116
106
 
117
- // Step 4: Now PDF extraction works
118
- const pdfBytes = new Uint8Array(await pdfFile.arrayBuffer());
119
- const result = await extractBytes(pdfBytes);
120
- console.log(result.text);
121
- ```
122
-
123
- ### Error: "PdfiumWASMModuleNotConfigured"
124
-
125
- If you see this error, it means `initialize_pdfium_render()` was not called before attempting PDF extraction. Make sure to follow the initialization sequence above.
126
-
127
- ### PDFium Files Location
128
-
129
- The PDFium WASM files (`pdfium.js`, `pdfium.wasm`) should be included in the `@kreuzberg/wasm` package. If they're missing:
130
-
131
- 1. Check your `node_modules/@kreuzberg/wasm/` directory
132
- 2. Ensure both `pdfium.js` and `pdfium.wasm` are present
133
- 3. If missing, reinstall the package
134
-
135
- For self-hosted builds, copy the files from:
136
- ```bash
137
- target/wasm32-unknown-unknown/release/build/kreuzberg-*/out/pdfium/release/node/
107
+ const caps = getWasmCapabilities();
108
+ console.log('WASM available:', caps.hasWasm);
109
+ console.log('Web Workers available:', caps.hasWorkers);
110
+ console.log('Module Workers available:', caps.hasModuleWorkers);
111
+ console.log('File API available:', caps.hasFileApi);
112
+ console.log('SharedArrayBuffer available:', caps.hasSharedArrayBuffer);
138
113
  ```
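The capability flags can also drive a simple decision about where extraction should run. The sketch below is illustrative only: `WasmCapabilities` mirrors just the five fields logged in the example (the real type may carry more), and `chooseStrategy` is a hypothetical helper, not part of `@kreuzberg/wasm`.

```typescript
// Assumed shape: only the fields shown in the logging example.
interface WasmCapabilities {
  hasWasm: boolean;
  hasWorkers: boolean;
  hasModuleWorkers: boolean;
  hasFileApi: boolean;
  hasSharedArrayBuffer: boolean;
}

type ExtractionStrategy =
  | "unsupported"
  | "main-thread"
  | "classic-worker"
  | "module-worker";

// Hypothetical helper: picks where extraction should run.
function chooseStrategy(caps: WasmCapabilities): ExtractionStrategy {
  if (!caps.hasWasm) return "unsupported"; // nothing can run without WebAssembly
  if (!caps.hasWorkers) return "main-thread"; // no workers: extract inline
  return caps.hasModuleWorkers ? "module-worker" : "classic-worker";
}

// Example with a hand-written capability report.
const exampleCaps: WasmCapabilities = {
  hasWasm: true,
  hasWorkers: true,
  hasModuleWorkers: false,
  hasFileApi: true,
  hasSharedArrayBuffer: false,
};
console.log("Strategy:", chooseStrategy(exampleCaps));
```

In an application, the object returned by `getWasmCapabilities()` would be passed in place of `exampleCaps`, assuming it exposes these fields.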
139
114
 
140
115
  ## Quick Start
141
116
 
142
- ### Browser (ESM)
143
-
144
- ```typescript
145
- import { extractFile } from '@kreuzberg/wasm';
146
-
147
- async function handleFileUpload() {
148
- const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
149
- const file = fileInput.files[0];
117
+ ### Basic Extraction
150
118
 
151
- const result = await extractFile(file, {
152
- extract_tables: true,
153
- extract_images: true
154
- });
155
-
156
- console.log('Extracted text:', result.content);
157
- console.log('Tables found:', result.tables.length);
158
- }
159
- ```
119
+ Extract text, metadata, and structure from any supported document format:
160
120
 
161
- ### Node.js (ESM)
121
+ ```ts
122
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
162
123
 
163
- ```typescript
164
- import { extractBytes } from '@kreuzberg/wasm';
165
- import { readFile } from 'fs/promises';
166
-
167
- const pdfBytes = await readFile('./document.pdf');
168
- const result = await extractBytes(
169
- new Uint8Array(pdfBytes),
170
- 'application/pdf',
171
- { extract_tables: true }
172
- );
173
-
174
- console.log(result.content);
175
- console.log('Found', result.tables.length, 'tables');
176
- ```
124
+ async function main() {
125
+ await initWasm();
177
126
 
178
- ### Deno
127
+ const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
128
+ const bytes = new Uint8Array(buffer);
179
129
 
180
- ```typescript
181
- import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
130
+ const result = await extractBytes(bytes, "application/pdf");
182
131
 
183
- const pdfBytes = await Deno.readFile("./document.pdf");
184
- const result = await extractBytes(pdfBytes, "application/pdf");
132
+ console.log("Extracted content:");
133
+ console.log(result.content);
134
+ console.log("MIME type:", result.mimeType);
135
+ console.log("Metadata:", result.metadata);
136
+ }
185
137
 
186
- console.log(result.content);
138
+ main().catch(console.error);
187
139
  ```
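`extractBytes` takes the MIME type as its second argument, so callers that only have a file name need some way to derive it. A minimal sketch, assuming a small hand-maintained table is enough: `guessMimeType` and `MIME_BY_EXTENSION` are hypothetical names, and the table covers only a few common formats rather than everything the library supports.

```typescript
// Hypothetical lookup table; extend it with the formats you actually handle.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
  ["pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
  ["html", "text/html"],
  ["txt", "text/plain"],
]);

// Returns the MIME type for a file name, or undefined when the
// extension is missing or not in the table.
function guessMimeType(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0 || dot === fileName.length - 1) return undefined;
  const extension = fileName.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION.get(extension);
}

console.log(guessMimeType("Report.PDF"));
console.log(guessMimeType("notes"));
```

Returning `undefined` instead of a default leaves the caller to decide whether to skip the file or report an error.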
188
140
 
189
- ### Cloudflare Workers
141
+ ### Common Use Cases
190
142
 
191
- ```typescript
192
- import { extractBytes } from '@kreuzberg/wasm';
143
+ #### Extract with Custom Configuration
193
144
 
194
- export default {
195
- async fetch(request: Request): Promise<Response> {
196
- if (request.method === 'POST') {
197
- const formData = await request.formData();
198
- const file = formData.get('file') as File;
145
+ Most use cases benefit from configuration to control extraction behavior:
199
146
 
200
- const arrayBuffer = await file.arrayBuffer();
201
- const bytes = new Uint8Array(arrayBuffer);
147
+ **With OCR (for scanned documents):**
202
148
 
203
- const result = await extractBytes(bytes, file.type);
149
+ ```ts
150
+ import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
204
151
 
205
- return Response.json({
206
- text: result.content,
207
- metadata: result.metadata,
208
- tables: result.tables
209
- });
210
- }
152
+ async function extractWithOcr() {
153
+ await initWasm();
211
154
 
212
- return new Response('Upload a file', { status: 400 });
155
+ try {
156
+ await enableOcr();
157
+ console.log("OCR enabled successfully");
158
+ } catch (error) {
159
+ console.error("Failed to enable OCR:", error);
160
+ return;
213
161
  }
214
- };
215
- ```
216
-
217
- ## Performance Comparison
218
-
219
- Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
220
-
221
- | Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
222
- |--------|---|---|---|
223
- | **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
224
- | **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
225
- | **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
226
- | **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
227
- | **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
228
-
229
- ### When to Use WASM vs Native
230
-
231
- **Use WASM (@kreuzberg/wasm) when:**
232
- - Building browser applications (no choice, WASM required)
233
- - Targeting Cloudflare Workers or edge runtimes
234
- - Supporting Deno applications
235
- - You don't have a native build toolchain available
236
- - Portability across runtimes is critical
237
-
238
- **Use Native (@kreuzberg/node) when:**
239
- - Building Node.js or Bun applications (2-3x faster)
240
- - Performance is your primary concern
241
- - You're processing large volumes of documents
242
- - You have native build tools available
243
-
244
- ### Performance Tips for WASM
245
-
246
- 1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
247
- 2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
248
- 3. **Cache WASM module** by loading it once per application
249
- 4. **Preload OCR models** by calling extraction with OCR enabled early
250
162
 
251
- ## Examples
163
+ const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
252
164
 
253
- Kreuzberg WASM includes complete working examples for different environments:
254
-
255
- - **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
256
- - **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
257
- - **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
258
-
259
- See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
260
-
261
- ## Multi-Threading with wasm-bindgen-rayon
262
-
263
- Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
264
-
265
- ### Initializing the Thread Pool
266
-
267
- To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
268
-
269
- ```typescript
270
- import { initThreadPool } from '@kreuzberg/wasm';
271
-
272
- // Initialize thread pool for multi-threaded extraction
273
- await initThreadPool(navigator.hardwareConcurrency);
274
-
275
- // Now extractions will use multiple threads for better performance
276
- const result = await extractBytes(pdfBytes, 'application/pdf');
277
- ```
278
-
279
- ### Required HTTP Headers for SharedArrayBuffer
280
-
281
- Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
282
-
283
- **Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
284
-
285
- Set these headers in your server configuration:
286
-
287
- ```
288
- Cross-Origin-Opener-Policy: same-origin
289
- Cross-Origin-Embedder-Policy: require-corp
290
- ```
291
-
292
- #### Server Configuration Examples
293
-
294
- **Express.js:**
295
- ```javascript
296
- app.use((req, res, next) => {
297
- res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
298
- res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
299
- next();
300
- });
301
- ```
302
-
303
- **Nginx:**
304
- ```nginx
305
- add_header 'Cross-Origin-Opener-Policy' 'same-origin';
306
- add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
307
- ```
165
+ const result = await extractBytes(bytes, "image/png", {
166
+ ocr: {
167
+ backend: "tesseract-wasm",
168
+ language: "eng",
169
+ },
170
+ });
308
171
 
309
- **Apache:**
310
- ```apache
311
- Header set Cross-Origin-Opener-Policy "same-origin"
312
- Header set Cross-Origin-Embedder-Policy "require-corp"
313
- ```
172
+ console.log("Extracted text:");
173
+ console.log(result.content);
174
+ }
314
175
 
315
- **Cloudflare Workers:**
316
- ```javascript
317
- export default {
318
- async fetch(request: Request): Promise<Response> {
319
- const response = new Response(body);
320
- response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
321
- response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
322
- return response;
323
- }
324
- };
176
+ extractWithOcr().catch(console.error);
325
177
  ```
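The example above stops when OCR cannot be enabled. For inputs that may still yield text without OCR (for instance, a PDF with an embedded text layer), an alternative is to fall back to plain extraction. This is a sketch under that assumption: `extractPreferOcr` is a hypothetical helper, and the two callbacks stand in for `enableOcr` and an `extractBytes` call so the snippet runs on its own.

```typescript
// Callbacks are injected so the sketch does not depend on the package.
interface OcrDeps<T> {
  enableOcr: () => Promise<void>;
  // Receives true when OCR is available, false when it must be skipped.
  extract: (useOcr: boolean) => Promise<T>;
}

// Hypothetical helper: try OCR first, fall back to plain extraction.
async function extractPreferOcr<T>(
  deps: OcrDeps<T>
): Promise<{ usedOcr: boolean; result: T }> {
  let usedOcr = true;
  try {
    await deps.enableOcr();
  } catch {
    usedOcr = false; // OCR unavailable: continue without it
  }
  const result = await deps.extract(usedOcr);
  return { usedOcr, result };
}

// Possible wiring, reusing the calls from the example above:
// extractPreferOcr({
//   enableOcr,
//   extract: (useOcr) =>
//     useOcr
//       ? extractBytes(bytes, mimeType, { ocr: { backend: "tesseract-wasm", language: "eng" } })
//       : extractBytes(bytes, mimeType),
// });
```

The `usedOcr` flag lets the caller tell an empty result caused by missing OCR apart from a genuinely empty document.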
326
178
 
327
- ### Browser Compatibility
328
-
329
- Multi-threading with SharedArrayBuffer is available in:
179
+ #### Table Extraction
330
180
 
331
- - **Chrome/Edge**: 74+
332
- - **Firefox**: 79+
333
- - **Safari**: 15.2+
334
- - **Opera**: 60+
181
+ See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
335
182
 
336
- In unsupported browsers or when headers are not set, the library automatically degrades to single-threaded mode.
183
+ #### Processing Multiple Files
337
184
 
338
- ### Graceful Degradation
185
+ ```ts
186
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
339
187
 
340
- The library handles thread pool initialization gracefully. If initialization fails or is unavailable:
341
-
342
- ```typescript
343
- import { initThreadPool } from '@kreuzberg/wasm';
344
-
345
- try {
346
- await initThreadPool(navigator.hardwareConcurrency);
347
- console.log('Multi-threading enabled');
348
- } catch (error) {
349
- // Fall back to single-threaded processing
350
- console.warn('Multi-threading unavailable:', error);
351
- console.log('Using single-threaded extraction');
188
+ interface DocumentJob {
189
+ name: string;
190
+ bytes: Uint8Array;
191
+ mimeType: string;
352
192
  }
353
193
 
354
- // Extraction will work in both cases
355
- const result = await extractBytes(pdfBytes, 'application/pdf');
194
+ async function processBatch(documents: DocumentJob[], concurrency: number = 3) {
195
+ await initWasm();
196
+
197
+ const results: Record<string, string> = {};
198
+ const queue = [...documents];
199
+
200
+ const workers = Array(concurrency)
201
+ .fill(null)
202
+ .map(async () => {
203
+ while (queue.length > 0) {
204
+ const doc = queue.shift();
205
+ if (!doc) break;
206
+
207
+ try {
208
+ const result = await extractBytes(doc.bytes, doc.mimeType);
209
+ results[doc.name] = result.content;
210
+ } catch (error) {
211
+ console.error(`Failed to process ${doc.name}:`, error);
212
+ }
213
+ }
214
+ });
215
+
216
+ await Promise.all(workers);
217
+ return results;
218
+ }
356
219
  ```
357
220
 
358
- ### Complete Example with Thread Pool
359
-
360
- ```typescript
361
- import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
221
+ #### Async Processing
362
222
 
363
- async function initializeKreuzbergWithThreading() {
364
- try {
365
- // Initialize WASM module
366
- await initWasm();
223
+ For non-blocking document processing:
367
224
 
368
- // Initialize multi-threading
369
- const cpuCount = navigator.hardwareConcurrency || 1;
370
- try {
371
- await initThreadPool(cpuCount);
372
- console.log(`Thread pool initialized with ${cpuCount} workers`);
373
- } catch (error) {
374
- console.warn('Could not initialize thread pool, using single-threaded mode');
375
- }
225
+ ```ts
226
+ import { extractBytes, initWasm, getWasmCapabilities } from "@kreuzberg/wasm";
376
227
 
377
- } catch (error) {
378
- console.error('Failed to initialize Kreuzberg:', error);
228
+ async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
229
+ const caps = getWasmCapabilities();
230
+ if (!caps.hasWasm) {
231
+ throw new Error("WebAssembly not supported");
379
232
  }
380
- }
381
233
 
382
- async function extractDocument(file: File) {
383
- const bytes = new Uint8Array(await file.arrayBuffer());
234
+ await initWasm();
384
235
 
385
- // Extraction will use multiple threads if available
386
- const result = await extractBytes(bytes, file.type, {
387
- extract_tables: true,
388
- extract_images: true
389
- });
236
+ const results = await Promise.all(
237
+ files.map((bytes, index) => extractBytes(bytes, mimeTypes[index]))
238
+ );
390
239
 
391
- return result;
240
+ return results.map((r) => ({
241
+ content: r.content,
242
+ pageCount: r.metadata?.pageCount,
243
+ }));
392
244
  }
393
245
 
394
- // Initialize once at app startup
395
- await initializeKreuzbergWithThreading();
246
+ const fileBytes = [new Uint8Array([1, 2, 3])];
247
+ const mimes = ["application/pdf"];
396
248
 
397
- // Later, handle file uploads
398
- fileInput.addEventListener('change', async (e) => {
399
- const file = e.target.files?.[0];
400
- if (file) {
401
- const result = await extractDocument(file);
402
- console.log('Extracted text:', result.content);
403
- }
404
- });
249
+ extractDocuments(fileBytes, mimes)
250
+ .then((results) => console.log(results))
251
+ .catch(console.error);
405
252
  ```
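`Promise.all` rejects as soon as any single extraction fails, which discards the results of the documents that succeeded. A minimal sketch of a per-document variant follows. It assumes job objects shaped like `DocumentJob` from the batch example, takes the extraction function as a parameter so it can run without the package, and uses the hypothetical names `extractEach` and `Outcome`.

```typescript
interface Job {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

// Only `content` is assumed on the result; the real result has more fields.
type ExtractFn = (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>;

interface Outcome {
  name: string;
  content?: string; // set when extraction succeeded
  error?: string;   // set when extraction failed
}

// Hypothetical helper: runs every job and records success or failure per job.
async function extractEach(extract: ExtractFn, jobs: Job[]): Promise<Outcome[]> {
  return Promise.all(
    jobs.map((job) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      Promise.resolve()
        .then(() => extract(job.bytes, job.mimeType))
        .then(
          (result): Outcome => ({ name: job.name, content: result.content }),
          (reason): Outcome => ({
            name: job.name,
            error: reason instanceof Error ? reason.message : String(reason),
          })
        )
    )
  );
}
```

Passing `extractBytes` as the first argument (after `initWasm()` has resolved) would give one outcome per input, in input order.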
406
253
 
407
- ### Performance Considerations
254
+ #### Worker Pool Usage
408
255
 
409
- - **Thread Pool Size**: Generally, using `navigator.hardwareConcurrency` is optimal. For servers, use the number of available CPU cores.
410
- - **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
411
- - **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
256
+ When Web Workers are available, use worker threads for parallel document processing without blocking the main thread:
412
257
 
413
- ## OCR Support
258
+ ```typescript
259
+ import { extractBytes, initWasm, hasWorkers, hasModuleWorkers } from '@kreuzberg/wasm';
414
260
 
415
- The WASM binding integrates [tesseract-wasm](https://github.com/robertknight/tesseract-wasm) for OCR support with 40+ languages.
261
+ class DocumentWorkerPool {
262
+ private workers: Worker[] = [];
263
+ private taskQueue: Array<{ id: number; data: Uint8Array; mimeType: string; resolve: Function; reject: Function }> = [];
264
+ private currentTaskId = 0;
416
265
 
417
- ### Basic OCR
266
+ constructor(workerCount: number = navigator.hardwareConcurrency || 4) {
267
+ // Module workers can import ES modules; classic workers are more widely supported
268
+ const useModuleWorkers = hasModuleWorkers();
418
269
 
419
- ```typescript
420
- import { extractBytes } from '@kreuzberg/wasm';
421
-
422
- const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());
423
-
424
- const result = await extractBytes(
425
- new Uint8Array(imageBytes),
426
- 'image/jpeg',
427
- {
428
- enable_ocr: true,
429
- ocr_config: {
430
- languages: ['eng'], // English
431
- backend: 'tesseract-wasm'
270
+ for (let i = 0; i < workerCount; i++) {
271
+ const worker = useModuleWorkers
272
+ ? new Worker(new URL('./extraction-worker.ts', import.meta.url), { type: 'module' })
273
+ : new Worker(new URL('./extraction-worker.js', import.meta.url));
274
+
275
+ worker.onmessage = (event) => this.handleWorkerMessage(event.data);
276
+ worker.onerror = (error) => this.handleWorkerError(error);
277
+ this.workers.push(worker);
432
278
  }
433
279
  }
434
- );
435
-
436
- console.log('OCR text:', result.content);
437
- ```
438
-
439
- ### Multi-Language OCR
440
280
 
441
- ```typescript
442
- const result = await extractBytes(imageBytes, 'image/png', {
443
- enable_ocr: true,
444
- ocr_config: {
445
- languages: ['eng', 'deu', 'fra'], // English, German, French
446
- backend: 'tesseract-wasm'
281
+ async extract(data: Uint8Array, mimeType: string): Promise<string> {
282
+ return new Promise((resolve, reject) => {
283
+ this.taskQueue.push({
284
+ id: this.currentTaskId++,
285
+ data,
286
+ mimeType,
287
+ resolve,
288
+ reject
289
+ });
290
+ this.processQueue();
291
+ });
447
292
  }
448
- });
449
- ```
450
-
451
- ### Supported Languages
452
-
453
- `eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
454
-
455
- Training data is automatically loaded from jsDelivr CDN:
456
- ```
457
- https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
458
- ```
459
-
460
- ## Configuration
461
-
462
- ### Extract Tables
463
-
464
- ```typescript
465
- import { extractBytes } from '@kreuzberg/wasm';
466
293
 
467
- const result = await extractBytes(pdfBytes, 'application/pdf', {
468
- extract_tables: true
469
- });
470
-
471
- if (result.tables) {
472
- for (const table of result.tables) {
473
- console.log('Table as Markdown:');
474
- console.log(table.markdown);
475
-
476
- console.log('Table cells:');
477
- console.log(JSON.stringify(table.cells, null, 2));
294
+ private processQueue(): void {
295
+ while (this.taskQueue.length > 0) {
296
+ const task = this.taskQueue.shift();
297
+ if (task) {
298
+ const worker = this.workers[task.id % this.workers.length];
299
+ worker.postMessage({ id: task.id, data: task.data, mimeType: task.mimeType });
300
+ }
301
+ }
478
302
  }
479
- }
480
- ```
481
-
482
- ### Extract Images
483
-
484
- ```typescript
485
- import { extractBytes } from '@kreuzberg/wasm';
486
303
 
487
- const result = await extractBytes(pdfBytes, 'application/pdf', {
488
- extract_images: true,
489
- image_config: {
490
- target_dpi: 300,
491
- max_image_dimension: 4096
304
+ private handleWorkerMessage(data: { id: number; result: string }): void {
305
+ const task = this.taskQueue.find(t => t.id === data.id);
306
+ if (task) {
307
+ task.resolve(data.result);
308
+ this.processQueue();
309
+ }
492
310
  }
493
- });
494
311
 
495
- if (result.images) {
496
- for (const image of result.images) {
497
- console.log(`Image ${image.index}: ${image.format}`);
498
- // image.data is a Uint8Array
312
+ private handleWorkerError(error: ErrorEvent): void {
313
+ console.error('Worker error:', error.message);
499
314
  }
500
- }
501
- ```
502
-
503
- ### Text Chunking
504
-
505
- ```typescript
506
- import { extractBytes } from '@kreuzberg/wasm';
507
315
 
508
- const result = await extractBytes(pdfBytes, 'application/pdf', {
509
- enable_chunking: true,
510
- chunking_config: {
511
- max_chars: 1000,
512
- max_overlap: 200
316
+ terminate(): void {
317
+ this.workers.forEach(w => w.terminate());
513
318
  }
514
- });
319
+ }
515
320
 
516
- if (result.chunks) {
517
- for (const chunk of result.chunks) {
518
- console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
321
+ // Usage
322
+ async function processDocumentsInParallel() {
323
+ if (!hasWorkers()) {
324
+ console.log('Web Workers not available, falling back to main thread');
325
+ return;
519
326
  }
520
- }
521
- ```
522
327
 
523
- ### Language Detection
328
+ await initWasm();
329
+ const pool = new DocumentWorkerPool(4);
524
330
 
525
- ```typescript
526
- import { extractBytes } from '@kreuzberg/wasm';
331
+ const documents = [
332
+ { data: new Uint8Array([/* ... */]), mimeType: 'application/pdf' },
333
+ { data: new Uint8Array([/* ... */]), mimeType: 'application/pdf' },
334
+ ];
527
335
 
528
- const result = await extractBytes(pdfBytes, 'application/pdf', {
529
- enable_language_detection: true
530
- });
336
+ const results = await Promise.all(
337
+ documents.map(doc => pool.extract(doc.data, doc.mimeType))
338
+ );
531
339
 
532
- if (result.language) {
533
- console.log(`Detected language: ${result.language.code}`);
534
- console.log(`Confidence: ${result.language.confidence}`);
340
+ pool.terminate();
341
+ return results;
535
342
  }
536
343
  ```
537
344
 
538
- ### Complete Configuration Example
345
+ Worker code (`extraction-worker.ts`):
539
346
 
540
347
  ```typescript
541
- import {
542
- extractBytes,
543
- type ExtractionConfig
544
- } from '@kreuzberg/wasm';
545
-
546
- const config: ExtractionConfig = {
547
- extract_tables: true,
548
- extract_images: true,
549
- extract_metadata: true,
550
-
551
- enable_ocr: true,
552
- ocr_config: {
553
- languages: ['eng'],
554
- backend: 'tesseract-wasm',
555
- dpi: 300,
556
- preprocessing: {
557
- deskew: true,
558
- denoise: true,
559
- binarize: true
560
- }
561
- },
562
-
563
- enable_chunking: true,
564
- chunking_config: {
565
- max_chars: 1000,
566
- max_overlap: 200
567
- },
348
+ import { extractBytes, initWasm } from '@kreuzberg/wasm';
568
349
 
569
- enable_language_detection: true,
350
+ let wasmInitialized = false;
570
351
 
571
- enable_quality: true,
352
+ self.onmessage = async (event) => {
353
+ if (!wasmInitialized) {
354
+ await initWasm();
355
+ wasmInitialized = true;
356
+ }
572
357
 
573
- extract_keywords: true,
574
- keywords_config: {
575
- max_keywords: 10,
576
- method: 'yake'
358
+ const { id, data, mimeType } = event.data;
359
+ try {
360
+ const result = await extractBytes(new Uint8Array(data), mimeType);
361
+ self.postMessage({ id, result: result.content });
362
+ } catch (error) {
363
+ self.postMessage({ id, error: (error as Error).message });
577
364
  }
578
365
  };
579
-
580
- const result = await extractBytes(data, mimeType, config);
581
366
  ```
582
367
 
583
- ## Advanced Usage
368
+ ### Memory Management
584
369
 
585
- ### Batch Processing
370
+ WASM memory is managed by the JavaScript garbage collector:
586
371
 
587
372
  ```typescript
588
- import { batchExtractFiles, batchExtractBytes } from '@kreuzberg/wasm';
373
+ import { initWasm, extractBytes } from '@kreuzberg/wasm';
589
374
 
590
- // Browser: Process multiple files
591
- const fileInput = document.querySelector<HTMLInputElement>('#files');
592
- const files = Array.from(fileInput.files);
375
+ async function extractWithMemoryAwareness() {
376
+ await initWasm();
593
377
 
594
- const results = await batchExtractFiles(files, {
595
- extract_tables: true
596
- });
378
+ // Process documents one at a time to control memory usage
379
+ const documents = [/* ... */];
597
380
 
598
- for (const result of results) {
599
- console.log(`${result.mime_type}: ${result.content.length} characters`);
600
- }
381
+ for (const doc of documents) {
382
+ const result = await extractBytes(doc, 'application/pdf');
601
383
 
602
- // Or from Uint8Arrays
603
- const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
604
- const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
384
+ // Process result immediately
385
+ console.log(result.content);
605
386
 
606
- const results = await batchExtractBytes(dataList, mimeTypes);
607
- ```
387
+ // Result will be garbage collected when no longer referenced
388
+ // Explicitly clear large objects if needed
389
+ // gc(); // Requires --expose-gc flag
390
+ }
391
+ }
608
392
 
609
- ### Synchronous Extraction
393
+ // Check JS heap usage (performance.memory is non-standard; Chromium-based browsers only)
394
+ if (performance.memory) {
395
+ console.log('Memory usage:', {
396
+ usedJSHeapSize: performance.memory.usedJSHeapSize,
397
+ totalJSHeapSize: performance.memory.totalJSHeapSize,
398
+ jsHeapSizeLimit: performance.memory.jsHeapSizeLimit
399
+ });
400
+ }
401
+ ```
610
402
 
611
- ```typescript
612
- import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
403
+ ### Next Steps
613
404
 
614
- // Synchronous single extraction
615
- const result = extractBytesSync(data, 'application/pdf', config);
405
+ - **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
406
+ - **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
407
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
408
+ - **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
409
+ - **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
616
410
 
617
- // Synchronous batch extraction
618
- const results = batchExtractBytesSync(dataList, mimeTypes, config);
619
- ```
411
+ ## WASM-Specific Implementation Details
620
412
 
621
- ### Plugin System
413
+ ### Initialization
622
414
 
623
- #### Custom Post-Processors
415
+ WASM binaries must be loaded before extraction:
624
416
 
625
417
  ```typescript
626
- import { registerPostProcessor } from '@kreuzberg/wasm';
627
-
628
- registerPostProcessor({
629
- name: 'uppercase',
630
- async process(result) {
631
- return {
632
- ...result,
633
- content: result.content.toUpperCase()
634
- };
635
- }
636
- });
418
+ import { initWasm } from '@kreuzberg/wasm';
637
419
 
638
- // Now all extractions will use this processor
639
- const result = await extractBytes(data, mimeType);
640
- console.log(result.content); // UPPERCASE TEXT
420
+ // Initialize once at application startup
421
+ await initWasm();
422
+
423
+ // Now extraction functions can be used
641
424
  ```
642
425
 
643
- #### Custom Validators
426
+ The init function:
427
+ - Downloads and instantiates the WASM binary
428
+ - Sets up the module's linear memory
429
+ - Prepares thread pools if available
430
+ - Throws if WASM is not supported in the environment
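Because initialization downloads and instantiates the binary, it can help to make sure concurrent callers share a single in-flight load. The sketch below is an illustration only: `once`, `fakeInitWasm`, and `ensureInitialized` are hypothetical names that do not come from the package, and `fakeInitWasm` stands in for `initWasm` so the snippet runs without it. Whether `initWasm` already guards against repeated calls is not stated here, so treat this as a defensive wrapper.

```typescript
// Hypothetical helper: wraps an async init function so it runs at most once,
// even when several callers request initialization at the same time.
function once(init: () => Promise<void>): () => Promise<void> {
  let pending: Promise<void> | null = null;
  return () => {
    if (pending === null) {
      pending = init().catch((error) => {
        // Reset so a later call can retry after a failed load.
        pending = null;
        throw error;
      });
    }
    return pending;
  };
}

// Stand-in for initWasm (assumption) so the sketch is self-contained.
let initCalls = 0;
async function fakeInitWasm(): Promise<void> {
  initCalls += 1;
}

const ensureInitialized = once(fakeInitWasm);
```

In application code you would pass the real `initWasm` to `once` and call `await ensureInitialized()` before each extraction.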
644
431
 
645
- ```typescript
646
- import { registerValidator } from '@kreuzberg/wasm';
432
+ ### Threading Model
647
433
 
648
- registerValidator({
649
- name: 'min-length',
650
- async validate(result) {
651
- if (result.content.length < 100) {
652
- throw new Error('Document too short');
653
- }
654
- }
655
- });
656
- ```
434
+ - Single-threaded by default (main thread execution)
435
+ - Web Workers optional for background processing
436
+ - Shared memory (SharedArrayBuffer) not required
437
+ - Message passing used for worker communication
438
+ - No blocking operations on main thread with worker pool
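The message-passing model above can be sketched on the main-thread side as a small client that tags each request with an id and resolves the matching promise when the worker replies. This is a minimal sketch, not the package's `DocumentWorkerPool`: `ExtractionClient` and `WorkerLike` are hypothetical names, and the message shapes are assumed to match the `{ id, data, mimeType }` request and `{ id, result | error }` reply used in the worker example.

```typescript
// Message shapes assumed from the worker example in this README.
interface ExtractRequest {
  id: number;
  data: Uint8Array;
  mimeType: string;
}

interface ExtractResponse {
  id: number;
  result?: string;
  error?: string;
}

// Minimal subset of the Worker interface that this sketch relies on.
interface WorkerLike {
  postMessage(message: ExtractRequest): void;
  onmessage: ((event: { data: ExtractResponse }) => void) | null;
}

interface PendingEntry {
  resolve: (text: string) => void;
  reject: (error: Error) => void;
}

class ExtractionClient {
  private worker: WorkerLike;
  private nextId = 0;
  private pending = new Map<number, PendingEntry>();

  constructor(worker: WorkerLike) {
    this.worker = worker;
    // Route every reply to the promise that is waiting for its id.
    this.worker.onmessage = (event) => this.settle(event.data);
  }

  extract(data: Uint8Array, mimeType: string): Promise<string> {
    const id = this.nextId++;
    return new Promise<string>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.worker.postMessage({ id, data, mimeType });
    });
  }

  // Number of requests that have been sent but not yet answered.
  get inFlight(): number {
    return this.pending.size;
  }

  private settle(response: ExtractResponse): void {
    const entry = this.pending.get(response.id);
    if (!entry) return; // Unknown or already settled id: ignore.
    this.pending.delete(response.id);
    if (response.error !== undefined) {
      entry.reject(new Error(response.error));
    } else {
      entry.resolve(response.result ?? "");
    }
  }
}
```

Correlating by id means replies can arrive in any order, which is what allows several extractions to be outstanding on one worker without shared memory.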
657
439
 
658
- #### Custom OCR Backends
440
+ ### Memory Considerations
659
441
 
660
- ```typescript
661
- import { registerOcrBackend } from '@kreuzberg/wasm';
662
-
663
- registerOcrBackend({
664
- name: 'custom-ocr',
665
- supportedLanguages() {
666
- return ['eng', 'fra'];
667
- },
668
- async initialize() {
669
- // Initialize your OCR backend
670
- },
671
- async processImage(imageBytes, language) {
672
- // Process image and return result
673
- return {
674
- content: 'extracted text',
675
- mime_type: 'text/plain',
676
- metadata: {},
677
- tables: []
678
- };
679
- },
680
- async shutdown() {
681
- // Cleanup
682
- }
683
- });
684
- ```
442
+ - Each WASM instance has its own linear memory, capped at 4 GB by the 32-bit address space
443
+ - Large documents (> 100 MB) may not fit in WASM memory
444
+ - Binary data is copied between JavaScript and WASM boundaries
445
+ - Garbage collection is handled by JavaScript runtime
446
+ - No manual memory management required
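Since input bytes are copied into WASM memory and very large documents may not fit, one option is to check the size before calling an extraction function. The following is a rough sketch under stated assumptions: `checkDocumentSize` and `MAX_DOCUMENT_BYTES` are hypothetical names, and the 100 MB threshold simply reuses the figure from the list above rather than a limit enforced by the library.

```typescript
// Assumed threshold, borrowed from the "> 100 MB" guidance above; tune it for your workload.
const MAX_DOCUMENT_BYTES = 100 * 1024 * 1024;

type SizeCheck = { ok: true } | { ok: false; reason: string };

// Returns ok: false with a readable reason when the input exceeds the limit.
function checkDocumentSize(
  bytes: Uint8Array,
  limit: number = MAX_DOCUMENT_BYTES
): SizeCheck {
  if (bytes.byteLength > limit) {
    const sizeMb = (bytes.byteLength / (1024 * 1024)).toFixed(1);
    const limitMb = (limit / (1024 * 1024)).toFixed(1);
    return {
      ok: false,
      reason: `Document is ${sizeMb} MB, above the ${limitMb} MB limit`,
    };
  }
  return { ok: true };
}
```

A caller could run this check first and route oversized inputs to a server-side path instead of attempting extraction in the browser.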
685
447
 
686
- ### MIME Type Detection
448
+ ### Supported Extraction Targets
687
449
 
688
- ```typescript
689
- import {
690
- detectMimeFromBytes,
691
- getMimeFromExtension,
692
- getExtensionsForMime,
693
- normalizeMimeType
694
- } from '@kreuzberg/wasm';
450
+ Different file formats have varying support in WASM:
695
451
 
696
- // Auto-detect MIME type from file bytes
697
- const mimeType = detectMimeFromBytes(fileBytes);
452
+ | Format | Support | Notes |
453
+ |--------|---------|-------|
454
+ | PDF | Full | Text, images, metadata extraction |
455
+ | Office (DOCX, XLSX, PPTX) | Full | All features supported |
456
+ | Images (PNG, JPG, etc) | Full | EXIF metadata extraction |
457
+ | Archives (ZIP, TAR) | Full | Listing and extraction |
458
+ | OCR | Limited | Tesseract WASM only, main thread only |
459
+ | Embeddings | Not Available | WASM has no ML model support |
698
460
 
699
- // Get MIME type from file extension
700
- const mime = getMimeFromExtension('pdf'); // 'application/pdf'
461
+ ### Platform Limitations
701
462
 
702
- // Get extensions for MIME type
703
- const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
463
+ **LibreOffice-Dependent Formats Not Available**
704
464
 
705
- // Normalize MIME type
706
- const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
707
- ```
465
+ WASM cannot load native LibreOffice binaries, so older Office formats are **not supported**:
708
466
 
709
- ### Configuration Loading
467
+ - ❌ **DOC** (Microsoft Word 97-2003) - Use DOCX instead
468
+ - ❌ **XLS** (Microsoft Excel 97-2003) - Use XLSX instead
469
+ - ❌ **PPT** (Microsoft PowerPoint 97-2003) - Use PPTX instead
470
+ - ❌ **RTF** (Rich Text Format with complex features)
471
+ - ❌ **ODT/ODS/ODP** (LibreOffice/OpenOffice formats)
710
472
 
711
- ```typescript
712
- import { loadConfigFromString } from '@kreuzberg/wasm';
713
-
714
- // Load from YAML
715
- const yamlConfig = `
716
- extract_tables: true
717
- enable_ocr: true
718
- ocr_config:
719
- languages: [eng, deu]
720
- `;
721
- const config = loadConfigFromString(yamlConfig, 'yaml');
722
-
723
- // Load from JSON
724
- const jsonConfig = '{"extract_tables":true}';
725
- const config2 = loadConfigFromString(jsonConfig, 'json');
726
-
727
- // Load from TOML
728
- const tomlConfig = 'extract_tables = true';
729
- const config3 = loadConfigFromString(tomlConfig, 'toml');
730
- ```
473
+ Modern Office formats (DOCX, XLSX, PPTX) are fully supported and don't require LibreOffice.
731
474
 
732
- ## API Reference
475
+ **Polars Integration Not Available**
733
476
 
734
- ### Extraction Functions
477
+ - ❌ Polars DataFrame extraction/conversion not available in WASM
478
+ - ❌ Structured data operations limited compared to Node.js binding
735
479
 
736
- #### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
737
- Extract content from a browser `File` object.
480
+ **Alternative: Use Node.js Binding**
738
481
 
739
- #### `extractBytes(data: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
740
- Asynchronously extract content from a `Uint8Array`.
482
+ If you need support for older Office formats or Polars integration, use the `@kreuzberg/node` package instead:
741
483
 
742
- #### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
743
- Synchronously extract content from a `Uint8Array`.
484
+ ```bash
485
+ npm install @kreuzberg/node
486
+ ```
744
487
 
745
- #### `batchExtractFiles(files: File[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
746
- Extract multiple files in parallel.
488
+ The Node.js binding provides:
489
+ - ✅ Full LibreOffice format support (DOC, XLS, PPT, RTF, ODT)
490
+ - ✅ Polars DataFrame integration
491
+ - ✅ All OCR backends (Tesseract, EasyOCR, PaddleOCR)
492
+ - ✅ Full embedding model support
747
493
 
748
- #### `batchExtractBytes(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
749
- Extract multiple byte arrays in parallel.
494
+ **Format Comparison Table**
750
495
 
751
- #### `batchExtractBytesSync(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): ExtractionResult[]`
752
- Extract multiple byte arrays synchronously.
496
+ | Format Type | WASM Support | Node.js Support |
497
+ |-------------|--------------|-----------------|
498
+ | Modern Office (DOCX/XLSX/PPTX) | ✅ Full | ✅ Full |
499
+ | Legacy Office (DOC/XLS/PPT) | ❌ Not Available | ✅ Requires LibreOffice |
500
+ | OpenOffice (ODT/ODS/ODP) | ❌ Not Available | ✅ Requires LibreOffice |
501
+ | PDF | ✅ Full | ✅ Full |
502
+ | Images | ✅ Full | ✅ Full |
503
+ | Embeddings | ❌ Not Available | ✅ With ONNX Runtime |
504
+ | Polars | ❌ Not Available | ✅ Available |
753
505
 
754
- ### Plugin Management
506
+ ### Sandbox Security
755
507
 
756
- #### Post-Processors
508
+ - WASM code runs in a sandbox with restricted capabilities
509
+ - File system access requires user interaction (File API)
510
+ - Network access follows CORS restrictions
511
+ - No access to Node.js native modules
512
+ - Content Security Policy (CSP) may restrict WASM loading
757
513
 
758
- ```typescript
759
- registerPostProcessor(processor: PostProcessorProtocol): void
760
- unregisterPostProcessor(name: string): void
761
- clearPostProcessors(): void
762
- listPostProcessors(): string[]
763
- ```
514
+ ## Features
764
515
 
765
- #### Validators
516
+ ### Supported File Formats (56+)
766
517
 
767
- ```typescript
768
- registerValidator(validator: ValidatorProtocol): void
769
- unregisterValidator(name: string): void
770
- clearValidators(): void
771
- listValidators(): string[]
772
- ```
518
+ 56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
773
519
 
774
- #### OCR Backends
520
+ #### Office Documents
775
521
 
776
- ```typescript
777
- registerOcrBackend(backend: OcrBackendProtocol): void
778
- unregisterOcrBackend(name: string): void
779
- clearOcrBackends(): void
780
- listOcrBackends(): string[]
781
- ```
522
+ | Category | Formats | Capabilities |
523
+ |----------|---------|--------------|
524
+ | **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
525
+ | **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
526
+ | **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
527
+ | **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
528
+ | **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
782
529
 
783
- ### Document Extractors
530
+ #### Images (OCR-Enabled)
784
531
 
785
- ```typescript
786
- listDocumentExtractors(): string[]
787
- unregisterDocumentExtractor(name: string): void
788
- clearDocumentExtractors(): void
789
- ```
532
+ | Category | Formats | Features |
533
+ |----------|---------|----------|
534
+ | **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
535
+ | **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
536
+ | **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
790
537
 
791
- ### MIME Utilities
538
+ #### Web & Data
792
539
 
793
- ```typescript
794
- detectMimeFromBytes(data: Uint8Array): string
795
- getMimeFromExtension(ext: string): string | null
796
- getExtensionsForMime(mime: string): string[]
797
- normalizeMimeType(mime: string): string
798
- ```
540
+ | Category | Formats | Features |
541
+ |----------|---------|----------|
542
+ | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
543
+ | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
544
+ | **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
799
545
 
800
- ### Configuration
546
+ #### Email & Archives
801
547
 
802
- ```typescript
803
- loadConfigFromString(content: string, format: 'yaml' | 'toml' | 'json'): ExtractionConfig
804
- ```
548
+ | Category | Formats | Features |
549
+ |----------|---------|----------|
550
+ | **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
551
+ | **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
805
552
 
806
- ### Embeddings
553
+ #### Academic & Scientific
807
554
 
808
- ```typescript
809
- listEmbeddingPresets(): string[]
810
- getEmbeddingPreset(name: string): EmbeddingPreset | null
811
- ```
555
+ | Category | Formats | Features |
556
+ |----------|---------|----------|
557
+ | **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
558
+ | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
559
+ | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
812
560
 
813
- ## Types
561
+ **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
814
562
 
815
- All types are shared via the `@kreuzberg/core` package:
563
+ ### Key Capabilities
816
564
 
817
- ```typescript
818
- import type {
819
- ExtractionResult,
820
- ExtractionConfig,
821
- OcrConfig,
822
- ChunkingConfig,
823
- ImageConfig,
824
- KeywordsConfig,
825
- Table,
826
- ExtractedImage,
827
- Chunk,
828
- Metadata,
829
- PostProcessorProtocol,
830
- ValidatorProtocol,
831
- OcrBackendProtocol
832
- } from '@kreuzberg/core';
833
- ```
565
+ - **Text Extraction** - Extract all text content with position and formatting information
834
566
 
835
- ### ExtractionResult
836
-
837
- Main result object containing:
838
- - `content: string` - Extracted text content
839
- - `mime_type: string` - MIME type of the document
840
- - `metadata?: Metadata` - Document metadata
841
- - `tables?: Table[]` - Extracted tables
842
- - `images?: ExtractedImage[]` - Extracted images
843
- - `chunks?: Chunk[]` - Text chunks (if chunking enabled)
844
- - `language?: LanguageInfo` - Detected language (if enabled)
845
- - `keywords?: Keyword[]` - Extracted keywords (if enabled)
846
-
847
- ### ExtractionConfig
848
-
849
- Configuration object for extraction:
850
- - `extract_tables?: boolean` - Extract tables as structured data
851
- - `extract_images?: boolean` - Extract embedded images
852
- - `extract_metadata?: boolean` - Extract document metadata
853
- - `enable_ocr?: boolean` - Enable OCR for images
854
- - `ocr_config?: OcrConfig` - OCR settings
855
- - `enable_chunking?: boolean` - Split text into semantic chunks
856
- - `chunking_config?: ChunkingConfig` - Text chunking settings
857
- - `enable_language_detection?: boolean` - Detect document language
858
- - `enable_quality?: boolean` - Encoding detection, normalization
859
- - `extract_keywords?: boolean` - Extract important keywords
860
- - `keywords_config?: KeywordsConfig` - Keyword extraction settings
861
-
862
- ### Table
863
-
864
- Extracted table structure:
865
- - `markdown: string` - Table in Markdown format
866
- - `cells: TableCell[][]` - 2D array of table cells
867
- - `row_count: number` - Number of rows
868
- - `column_count: number` - Number of columns
869
-
870
- ## Supported Formats
871
-
872
- | Category | Formats |
873
- |----------|---------|
874
- | **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
875
- | **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
876
- | **Web** | HTML, XHTML, XML, EPUB |
877
- | **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
878
- | **Email** | EML, MSG |
879
- | **Archives** | ZIP, TAR, 7Z |
880
- | **Other** | And 30+ more formats |
881
-
882
- ## Build from Source
883
-
884
- ### Prerequisites
885
-
886
- - Rust 1.75+ with `wasm32-unknown-unknown` target
887
- - Node.js 18+ with pnpm
888
- - wasm-pack
567
+ - **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
889
568
 
890
- ```bash
891
- # Install Rust target
892
- rustup target add wasm32-unknown-unknown
569
+ - **Table Extraction** - Parse tables with structure and cell content preservation
893
570
 
894
- # Install wasm-pack
895
- curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
571
+ - **Image Extraction** - Extract embedded images and render page previews
896
572
 
897
- # Build WASM package
898
- cd crates/kreuzberg-wasm
899
- pnpm install
900
- pnpm run build
573
+ - **OCR Support** - Integrate multiple OCR backends for scanned documents
901
574
 
902
- # Run tests
903
- pnpm test
904
- ```
575
+ - **Async/Await** - Non-blocking document processing with concurrent operations
905
576
 
906
- ### Build Targets
577
+ - **Plugin System** - Extensible post-processing for custom text transformation
907
578
 
908
- ```bash
909
- # For browsers (ESM modules)
910
- pnpm run build:wasm:web
579
+ - **Batch Processing** - Efficiently process multiple documents in parallel
911
580
 
912
- # For bundlers (webpack, rollup, vite)
913
- pnpm run build:wasm:bundler
581
+ - **Memory Efficient** - Stream large files without loading entirely into memory
914
582
 
915
- # For Node.js
916
- pnpm run build:wasm:nodejs
583
+ - **Language Detection** - Detect and support multiple languages in documents
917
584
 
918
- # For Deno
919
- pnpm run build:wasm:deno
585
+ - **Configuration** - Fine-grained control over extraction behavior
920
586
 
921
- # Build all targets
922
- pnpm run build:all
923
- ```
587
+ ### Performance Characteristics
924
588
 
925
- ## Limitations
589
+ | Format | Speed | Memory | Notes |
590
+ |--------|-------|--------|-------|
591
+ | **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
592
+ | **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
593
+ | **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
594
+ | **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
595
+ | **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
926
596
 
927
- ### No File System Access
597
+ ## OCR Support
928
598
 
929
- The WASM binding cannot access the file system directly. Use file readers:
599
+ Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
930
600
 
931
- ```typescript
932
- // ❌ Won't work
933
- await extractFileSync('./document.pdf'); // Throws error
601
+ - **Tesseract-Wasm**
934
602
 
935
- // Use file readers instead
936
- const bytes = await Deno.readFile('./document.pdf'); // Deno
937
- const bytes = await fs.readFile('./document.pdf'); // Node.js
938
- const bytes = await file.arrayBuffer(); // Browser
939
- ```
603
+ ### OCR Configuration Example
940
604
 
941
- ### OCR Training Data
605
+ ```ts
606
+ import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
942
607
 
943
- Tesseract training data (`.traineddata` files) are loaded from jsDelivr CDN on first use. For offline usage or custom CDN, see the [OCR documentation](https://kreuzberg.dev).
608
+ async function extractWithOcr() {
609
+ await initWasm();
944
610
 
945
- ### Size Constraints
611
+ try {
612
+ await enableOcr();
613
+ console.log("OCR enabled successfully");
614
+ } catch (error) {
615
+ console.error("Failed to enable OCR:", error);
616
+ return;
617
+ }
946
618
 
947
- Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
619
+ const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
948
620
 
949
- ### HTML File Size Limits
621
+ const result = await extractBytes(bytes, "image/png", {
622
+ ocr: {
623
+ backend: "tesseract-wasm",
624
+ language: "eng",
625
+ },
626
+ });
950
627
 
951
- **WASM builds have a 2MB limit for HTML files** due to limited stack space. HTML files larger than 2MB will be rejected with a validation error to prevent stack overflow.
628
+ console.log("Extracted text:");
629
+ console.log(result.content);
630
+ }
952
631
 
953
- ```typescript
954
- // Files > 2MB will throw an error in WASM builds
955
- const largeHtml = new Uint8Array(3 * 1024 * 1024); // 3MB
956
- await extractBytes(largeHtml, 'text/html');
957
- // ❌ Throws: "HTML file size exceeds WASM limit of 2MB"
632
+ extractWithOcr().catch(console.error);
958
633
  ```
959
634
 
960
- For large HTML files, use the native [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) binding which has no size limits.
635
+ ## Async Support
961
636
 
962
- ### PDF Extraction in Non-Browser Environments
637
+ This binding provides full async/await support for non-blocking document processing:
963
638
 
964
- PDF extraction requires PDFium, which is only available in browser environments. In Deno, Node.js, and Cloudflare Workers, PDF extraction will fail with an error.
639
+ ```ts
640
+ import { extractBytes, initWasm, getWasmCapabilities } from "@kreuzberg/wasm";
965
641
 
966
- ```typescript
967
- // Won't work in Deno/Node.js/Workers
968
- await extractBytes(pdfBytes, 'application/pdf');
969
- // Throws: "PDF extraction requires proper WASM module initialization"
970
- ```
642
+ async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
643
+ const caps = getWasmCapabilities();
644
+ if (!caps.hasWasm) {
645
+ throw new Error("WebAssembly not supported");
646
+ }
971
647
 
972
- **Solutions:**
973
- - **Browser**: PDF extraction works out of the box
974
- - **Deno/Node.js**: Use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) with native PDFium bindings
975
- - **Cloudflare Workers**: PDF extraction is not currently supported
648
+ await initWasm();
976
649
 
977
- ## Troubleshooting
650
+ const results = await Promise.all(
651
+ files.map((bytes, index) => extractBytes(bytes, mimeTypes[index]))
652
+ );
978
653
 
979
- ### "WASM module failed to initialize"
654
+ return results.map((r) => ({
655
+ content: r.content,
656
+ pageCount: r.metadata?.pageCount,
657
+ }));
658
+ }
980
659
 
981
- Ensure your bundler is configured to handle WASM files:
660
+ const fileBytes = [new Uint8Array([1, 2, 3])];
661
+ const mimes = ["application/pdf"];
982
662
 
983
- **Vite:**
984
- ```typescript
985
- // vite.config.ts
986
- export default {
987
- optimizeDeps: {
988
- exclude: ['@kreuzberg/wasm']
989
- }
990
- }
663
+ extractDocuments(fileBytes, mimes)
664
+ .then((results) => console.log(results))
665
+ .catch(console.error);
991
666
  ```
992
667
 
993
- **Webpack:**
994
- ```javascript
995
- // webpack.config.js
996
- module.exports = {
997
- experiments: {
998
- asyncWebAssembly: true
999
- }
1000
- }
1001
- ```
668
+ ## Plugin System
1002
669
 
1003
- ### "Module not found: @kreuzberg/core"
670
+ Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
1004
671
 
1005
- The @kreuzberg/core package is a peer dependency. Install it:
672
+ For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/plugins/).
1006
673
 
1007
- ```bash
1008
- pnpm add @kreuzberg/core
1009
- ```
674
+ ## Batch Processing
1010
675
 
1011
- ### Memory Issues in Workers
676
+ Process multiple documents efficiently:
1012
677
 
1013
- For large documents in Cloudflare Workers, process in smaller chunks:
678
+ ```ts
679
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
1014
680
 
1015
- ```typescript
1016
- const result = await extractBytes(pdfBytes, 'application/pdf', {
1017
- chunking_config: { max_chars: 1000 }
1018
- });
681
+ interface DocumentJob {
682
+ name: string;
683
+ bytes: Uint8Array;
684
+ mimeType: string;
685
+ }
686
+
687
+ async function processBatch(documents: DocumentJob[], concurrency: number = 3) {
688
+ await initWasm();
689
+
690
+ const results: Record<string, string> = {};
691
+ const queue = [...documents];
692
+
693
+ const workers = Array(concurrency)
694
+ .fill(null)
695
+ .map(async () => {
696
+ while (queue.length > 0) {
697
+ const doc = queue.shift();
698
+ if (!doc) break;
699
+
700
+ try {
701
+ const result = await extractBytes(doc.bytes, doc.mimeType);
702
+ results[doc.name] = result.content;
703
+ } catch (error) {
704
+ console.error(`Failed to process ${doc.name}:`, error);
705
+ }
706
+ }
707
+ });
708
+
709
+ await Promise.all(workers);
710
+ return results;
711
+ }
1019
712
  ```
1020
713
 
1021
- ### OCR Not Working
714
+ ## Configuration
1022
715
 
1023
- Check that tesseract-wasm is loading correctly. The training data is automatically fetched from CDN on first use.
716
+ For advanced configuration options including language detection, table extraction, OCR settings, and more:
1024
717
 
1025
- ## Examples
718
+ **[Configuration Guide](https://kreuzberg.dev/configuration/)**
1026
719
 
1027
- See the [`examples/`](./examples/) directory for complete working examples:
720
+ ## Documentation
1028
721
 
1029
- - **Browser**: Vanilla JS file upload interface
1030
- - **Deno**: Command-line document extraction
1031
- - **Cloudflare Workers**: Document processing API
1032
- - **Node.js**: Batch processing script
722
+ - **[Official Documentation](https://kreuzberg.dev/)**
723
+ - **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
724
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)**
1033
725
 
1034
- ## Documentation
726
+ ## Troubleshooting
1035
727
 
1036
- For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
728
+ For common issues and solutions, visit [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
1037
729
 
1038
730
  ## Contributing
1039
731
 
1040
- We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/docs/contributing.md) for details.
732
+ Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
1041
733
 
1042
734
  ## License
1043
735
 
1044
- MIT
1045
-
1046
- ## Links
1047
-
1048
- - [Website](https://kreuzberg.dev)
1049
- - [Documentation](https://kreuzberg.dev)
1050
- - [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
1051
- - [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
1052
- - [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
1053
- - [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
736
+ MIT License - see LICENSE file for details.
1054
737
 
1055
- ## Related Packages
738
+ ## Support
1056
739
 
1057
- - [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) - Native Node.js bindings (NAPI)
1058
- - [@kreuzberg/core](https://www.npmjs.com/package/@kreuzberg/core) - Shared TypeScript types
1059
- - [kreuzberg](https://crates.io/crates/kreuzberg) - Rust core library
740
+ - **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
741
+ - **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
742
+ - **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)