@kreuzberg/wasm 4.0.0-rc.6 → 4.0.0

Files changed (56)
  1. package/LICENSE +7 -0
  2. package/README.md +321 -800
  3. package/dist/adapters/wasm-adapter.d.ts +7 -10
  4. package/dist/adapters/wasm-adapter.d.ts.map +1 -0
  5. package/dist/adapters/wasm-adapter.js +53 -54
  6. package/dist/adapters/wasm-adapter.js.map +1 -1
  7. package/dist/index.d.ts +23 -67
  8. package/dist/index.d.ts.map +1 -0
  9. package/dist/index.js +1102 -104
  10. package/dist/index.js.map +1 -1
  11. package/dist/ocr/registry.d.ts +7 -10
  12. package/dist/ocr/registry.d.ts.map +1 -0
  13. package/dist/ocr/registry.js +9 -28
  14. package/dist/ocr/registry.js.map +1 -1
  15. package/dist/ocr/tesseract-wasm-backend.d.ts +3 -6
  16. package/dist/ocr/tesseract-wasm-backend.d.ts.map +1 -0
  17. package/dist/ocr/tesseract-wasm-backend.js +8 -83
  18. package/dist/ocr/tesseract-wasm-backend.js.map +1 -1
  19. package/dist/pdfium.js +77 -0
  20. package/dist/pkg/LICENSE +7 -0
  21. package/dist/pkg/README.md +503 -0
  22. package/dist/{kreuzberg_wasm.d.ts → pkg/kreuzberg_wasm.d.ts} +24 -12
  23. package/dist/{kreuzberg_wasm.js → pkg/kreuzberg_wasm.js} +224 -233
  24. package/dist/pkg/kreuzberg_wasm_bg.js +1871 -0
  25. package/dist/{kreuzberg_wasm_bg.wasm → pkg/kreuzberg_wasm_bg.wasm} +0 -0
  26. package/dist/{kreuzberg_wasm_bg.wasm.d.ts → pkg/kreuzberg_wasm_bg.wasm.d.ts} +10 -13
  27. package/dist/pkg/package.json +27 -0
  28. package/dist/plugin-registry.d.ts +246 -0
  29. package/dist/plugin-registry.d.ts.map +1 -0
  30. package/dist/runtime.d.ts +21 -22
  31. package/dist/runtime.d.ts.map +1 -0
  32. package/dist/runtime.js +21 -41
  33. package/dist/runtime.js.map +1 -1
  34. package/dist/types.d.ts +363 -0
  35. package/dist/types.d.ts.map +1 -0
  36. package/package.json +34 -51
  37. package/dist/adapters/wasm-adapter.d.mts +0 -121
  38. package/dist/adapters/wasm-adapter.mjs +0 -221
  39. package/dist/adapters/wasm-adapter.mjs.map +0 -1
  40. package/dist/index.d.mts +0 -466
  41. package/dist/index.mjs +0 -384
  42. package/dist/index.mjs.map +0 -1
  43. package/dist/kreuzberg_wasm.d.mts +0 -758
  44. package/dist/kreuzberg_wasm.mjs +0 -48
  45. package/dist/ocr/registry.d.mts +0 -102
  46. package/dist/ocr/registry.mjs +0 -70
  47. package/dist/ocr/registry.mjs.map +0 -1
  48. package/dist/ocr/tesseract-wasm-backend.d.mts +0 -257
  49. package/dist/ocr/tesseract-wasm-backend.mjs +0 -424
  50. package/dist/ocr/tesseract-wasm-backend.mjs.map +0 -1
  51. package/dist/runtime.d.mts +0 -256
  52. package/dist/runtime.mjs +0 -152
  53. package/dist/runtime.mjs.map +0 -1
  54. package/dist/snippets/wasm-bindgen-rayon-38edf6e439f6d70d/src/workerHelpers.js +0 -107
  55. package/dist/types-GJVIvbPy.d.mts +0 -221
  56. package/dist/types-GJVIvbPy.d.ts +0 -221
package/README.md CHANGED
@@ -1,982 +1,503 @@
1
- # Kreuzberg
1
+ # WebAssembly
2
+
3
+ <div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
4
+ <!-- Language Bindings -->
5
+ <a href="https://crates.io/crates/kreuzberg">
6
+ <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
7
+ </a>
8
+ <a href="https://hex.pm/packages/kreuzberg">
9
+ <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
10
+ </a>
11
+ <a href="https://pypi.org/project/kreuzberg/">
12
+ <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
13
+ </a>
14
+ <a href="https://www.npmjs.com/package/@kreuzberg/node">
15
+ <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
16
+ </a>
17
+ <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
18
+ <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
19
+ </a>
20
+
21
+ <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
22
+ <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
23
+ </a>
24
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
25
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
26
+ </a>
27
+ <a href="https://www.nuget.org/packages/Kreuzberg/">
28
+ <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
29
+ </a>
30
+ <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
31
+ <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
32
+ </a>
33
+ <a href="https://rubygems.org/gems/kreuzberg">
34
+ <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
35
+ </a>
36
+
37
+ <!-- Project Info -->
38
+ <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
39
+ <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
40
+ </a>
41
+ <a href="https://docs.kreuzberg.dev">
42
+ <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
43
+ </a>
44
+ </div>
45
+
46
+ <img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
47
+
48
+ <div align="center" style="margin-top: 20px;">
49
+ <a href="https://discord.gg/pXxagNK2zN">
50
+ <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
51
+ </a>
52
+ </div>
53
+
54
+
55
+ Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
2
56
 
3
- [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
- [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
- [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
- [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
- [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
- [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
- [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
- [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
57
 
12
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
- [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
- [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
-
16
- High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
17
-
18
- Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
-
20
- > **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
21
- >
22
- > **This WASM package is designed for:**
23
- > - Browser applications (including web workers)
24
- > - Cloudflare Workers and edge runtimes
25
- > - Deno applications
26
- > - Environments without native build toolchain
27
-
28
- > **🚀 Version 4.0.0 Release Candidate**
29
- > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
30
-
31
- ## Features
32
-
33
- - **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
34
- - **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
35
- - **Table Extraction**: Advanced table detection and structured data extraction
36
- - **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
37
- - **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
38
- - **API Parity**: All extraction functions from the Node.js binding
39
- - **Plugin System**: Custom post-processors, validators, and OCR backends
40
- - **Optimized Bundle**: <5MB uncompressed, <2MB compressed
41
- - **Zero Configuration**: Works out of the box with sensible defaults
42
- - **Portable**: Runs anywhere WASM is supported without native dependencies
43
-
44
- ## Requirements
45
-
46
- - **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
47
- - **Node.js**: 18 or higher
48
- - **Deno**: 1.0 or higher
49
- - **Cloudflare Workers**: Compatible with Workers runtime
50
-
51
- ### Optional Dependencies
58
+ ## Installation
52
59
 
53
- - **tesseract-wasm**: Automatically loaded for OCR functionality (40+ language support)
60
+ ### Package Installation
54
61
 
55
- ## Installation
56
62
 
57
- ### Choosing the Right Package
63
+ Install via one of the supported package managers:
58
64
 
59
- | Use Case | Recommendation | Reason |
60
- |----------|---|---|
61
- | **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
62
- | **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
63
- | **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
64
- | **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
65
- | **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
66
65
 
67
- ### Install via npm/pnpm/yarn
68
66
 
67
+ **npm:**
69
68
  ```bash
70
69
  npm install @kreuzberg/wasm
71
70
  ```
72
71
 
73
- Or with pnpm:
74
72
 
75
- ```bash
76
- pnpm add @kreuzberg/wasm
77
- ```
78
73
 
79
- Or with yarn:
80
74
 
75
+ **pnpm:**
81
76
  ```bash
82
- yarn add @kreuzberg/wasm
83
- ```
84
-
85
- ### Deno
86
-
87
- ```typescript
88
- import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
89
- ```
90
-
91
- ## Quick Start
92
-
93
- ### Browser (ESM)
94
-
95
- ```typescript
96
- import { extractFile } from '@kreuzberg/wasm';
97
-
98
- async function handleFileUpload() {
99
- const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
100
- const file = fileInput.files[0];
101
-
102
- const result = await extractFile(file, {
103
- extract_tables: true,
104
- extract_images: true
105
- });
106
-
107
- console.log('Extracted text:', result.content);
108
- console.log('Tables found:', result.tables.length);
109
- }
110
- ```
111
-
112
- ### Node.js (ESM)
113
-
114
- ```typescript
115
- import { extractBytes } from '@kreuzberg/wasm';
116
- import { readFile } from 'fs/promises';
117
-
118
- const pdfBytes = await readFile('./document.pdf');
119
- const result = await extractBytes(
120
- new Uint8Array(pdfBytes),
121
- 'application/pdf',
122
- { extract_tables: true }
123
- );
124
-
125
- console.log(result.content);
126
- console.log('Found', result.tables.length, 'tables');
127
- ```
128
-
129
- ### Deno
130
-
131
- ```typescript
132
- import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
133
-
134
- const pdfBytes = await Deno.readFile("./document.pdf");
135
- const result = await extractBytes(pdfBytes, "application/pdf");
136
-
137
- console.log(result.content);
138
- ```
139
-
140
- ### Cloudflare Workers
141
-
142
- ```typescript
143
- import { extractBytes } from '@kreuzberg/wasm';
144
-
145
- export default {
146
- async fetch(request: Request): Promise<Response> {
147
- if (request.method === 'POST') {
148
- const formData = await request.formData();
149
- const file = formData.get('file') as File;
150
-
151
- const arrayBuffer = await file.arrayBuffer();
152
- const bytes = new Uint8Array(arrayBuffer);
153
-
154
- const result = await extractBytes(bytes, file.type);
155
-
156
- return Response.json({
157
- text: result.content,
158
- metadata: result.metadata,
159
- tables: result.tables
160
- });
161
- }
162
-
163
- return new Response('Upload a file', { status: 400 });
164
- }
165
- };
77
+ pnpm add @kreuzberg/wasm
166
78
  ```
167
79
 
168
- ## Performance Comparison
169
-
170
- Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
171
-
172
- | Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
173
- |--------|---|---|---|
174
- | **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
175
- | **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
176
- | **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
177
- | **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
178
- | **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
179
-
180
- ### When to Use WASM vs Native
181
-
182
- **Use WASM (@kreuzberg/wasm) when:**
183
- - Building browser applications (no choice, WASM required)
184
- - Targeting Cloudflare Workers or edge runtimes
185
- - Supporting Deno applications
186
- - You don't have a native build toolchain available
187
- - Portability across runtimes is critical
188
80
 
189
- **Use Native (@kreuzberg/node) when:**
190
- - Building Node.js or Bun applications (2-3x faster)
191
- - Performance is your primary concern
192
- - You're processing large volumes of documents
193
- - You have native build tools available
194
81
 
195
- ### Performance Tips for WASM
196
82
 
197
- 1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
198
- 2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
199
- 3. **Cache WASM module** by loading it once per application
200
- 4. **Preload OCR models** by calling extraction with OCR enabled early
201
-
202
- ## Examples
203
-
204
- Kreuzberg WASM includes complete working examples for different environments:
205
-
206
- - **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
207
- - **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
208
- - **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
209
-
210
- See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
211
-
212
- ## Multi-Threading with wasm-bindgen-rayon
213
-
214
- Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
215
-
216
- ### Initializing the Thread Pool
217
-
218
- To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
219
-
220
- ```typescript
221
- import { initThreadPool } from '@kreuzberg/wasm';
222
-
223
- // Initialize thread pool for multi-threaded extraction
224
- await initThreadPool(navigator.hardwareConcurrency);
225
-
226
- // Now extractions will use multiple threads for better performance
227
- const result = await extractBytes(pdfBytes, 'application/pdf');
83
+ **yarn:**
84
+ ```bash
85
+ yarn add @kreuzberg/wasm
228
86
  ```
229
87
 
230
- ### Required HTTP Headers for SharedArrayBuffer
231
88
 
232
- Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
233
89
 
234
- **Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
235
-
236
- Set these headers in your server configuration:
237
-
238
- ```
239
- Cross-Origin-Opener-Policy: same-origin
240
- Cross-Origin-Embedder-Policy: require-corp
241
- ```
242
90
 
243
- #### Server Configuration Examples
244
91
 
245
- **Express.js:**
246
- ```javascript
247
- app.use((req, res, next) => {
248
- res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
249
- res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
250
- next();
251
- });
252
- ```
92
+ ### System Requirements
253
93
 
254
- **Nginx:**
255
- ```nginx
256
- add_header 'Cross-Origin-Opener-Policy' 'same-origin';
257
- add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
258
- ```
94
+ - Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
95
+ - Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
259
96
 
260
- **Apache:**
261
- ```apache
262
- Header set Cross-Origin-Opener-Policy "same-origin"
263
- Header set Cross-Origin-Embedder-Policy "require-corp"
264
- ```
265
97
 
266
- **Cloudflare Workers:**
267
- ```javascript
268
- export default {
269
- async fetch(request: Request): Promise<Response> {
270
- const response = new Response(body);
271
- response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
272
- response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
273
- return response;
274
- }
275
- };
276
- ```
277
98
 
278
- ### Browser Compatibility
99
+ ## Quick Start
279
100
 
280
- Multi-threading with SharedArrayBuffer is available in:
101
+ ### Basic Extraction
281
102
 
282
- - **Chrome/Edge**: 74+
283
- - **Firefox**: 79+
284
- - **Safari**: 15.2+
285
- - **Opera**: 60+
103
+ Extract text, metadata, and structure from any supported document format:
286
104
 
287
- In unsupported browsers or when headers are not set, the library automatically degrades to single-threaded mode.
105
+ ```ts
106
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
288
107
 
289
- ### Graceful Degradation
108
+ async function main() {
109
+ await initWasm();
290
110
 
291
- The library handles thread pool initialization gracefully. If initialization fails or is unavailable:
111
+ const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
112
+ const bytes = new Uint8Array(buffer);
292
113
 
293
- ```typescript
294
- import { initThreadPool } from '@kreuzberg/wasm';
114
+ const result = await extractBytes(bytes, "application/pdf");
295
115
 
296
- try {
297
- await initThreadPool(navigator.hardwareConcurrency);
298
- console.log('Multi-threading enabled');
299
- } catch (error) {
300
- // Fall back to single-threaded processing
301
- console.warn('Multi-threading unavailable:', error);
302
- console.log('Using single-threaded extraction');
116
+ console.log("Extracted content:");
117
+ console.log(result.content);
118
+ console.log("MIME type:", result.mimeType);
119
+ console.log("Metadata:", result.metadata);
303
120
  }
304
121
 
305
- // Extraction will work in both cases
306
- const result = await extractBytes(pdfBytes, 'application/pdf');
122
+ main().catch(console.error);
307
123
  ```
308
124
 
309
- ### Complete Example with Thread Pool
310
125
 
311
- ```typescript
312
- import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
126
+ ### Common Use Cases
313
127
 
314
- async function initializeKreuzbergWithThreading() {
315
- try {
316
- // Initialize WASM module
317
- await initWasm();
128
+ #### Extract with Custom Configuration
318
129
 
319
- // Initialize multi-threading
320
- const cpuCount = navigator.hardwareConcurrency || 1;
321
- try {
322
- await initThreadPool(cpuCount);
323
- console.log(`Thread pool initialized with ${cpuCount} workers`);
324
- } catch (error) {
325
- console.warn('Could not initialize thread pool, using single-threaded mode');
326
- }
130
+ Most use cases benefit from configuration to control extraction behavior:
327
131
 
328
- } catch (error) {
329
- console.error('Failed to initialize Kreuzberg:', error);
330
- }
331
- }
332
-
333
- async function extractDocument(file: File) {
334
- const bytes = new Uint8Array(await file.arrayBuffer());
335
132
 
336
- // Extraction will use multiple threads if available
337
- const result = await extractBytes(bytes, file.type, {
338
- extract_tables: true,
339
- extract_images: true
340
- });
133
+ **With OCR (for scanned documents):**
341
134
 
342
- return result;
343
- }
135
+ ```ts
136
+ import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
344
137
 
345
- // Initialize once at app startup
346
- await initializeKreuzbergWithThreading();
347
-
348
- // Later, handle file uploads
349
- fileInput.addEventListener('change', async (e) => {
350
- const file = e.target.files?.[0];
351
- if (file) {
352
- const result = await extractDocument(file);
353
- console.log('Extracted text:', result.content);
354
- }
355
- });
356
- ```
138
+ async function extractWithOcr() {
139
+ await initWasm();
357
140
 
358
- ### Performance Considerations
141
+ try {
142
+ await enableOcr();
143
+ console.log("OCR enabled successfully");
144
+ } catch (error) {
145
+ console.error("Failed to enable OCR:", error);
146
+ return;
147
+ }
359
148
 
360
- - **Thread Pool Size**: Generally, using `navigator.hardwareConcurrency` is optimal. For servers, use the number of available CPU cores.
361
- - **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
362
- - **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
149
+ const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
363
150
 
364
- ## OCR Support
365
-
366
- The WASM binding integrates [tesseract-wasm](https://github.com/robertknight/tesseract-wasm) for OCR support with 40+ languages.
367
-
368
- ### Basic OCR
369
-
370
- ```typescript
371
- import { extractBytes } from '@kreuzberg/wasm';
372
-
373
- const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());
374
-
375
- const result = await extractBytes(
376
- new Uint8Array(imageBytes),
377
- 'image/jpeg',
378
- {
379
- enable_ocr: true,
380
- ocr_config: {
381
- languages: ['eng'], // English
382
- backend: 'tesseract-wasm'
383
- }
384
- }
385
- );
386
-
387
- console.log('OCR text:', result.content);
388
- ```
151
+ const result = await extractBytes(bytes, "image/png", {
152
+ ocr: {
153
+ backend: "tesseract-wasm",
154
+ language: "eng",
155
+ },
156
+ });
389
157
 
390
- ### Multi-Language OCR
158
+ console.log("Extracted text:");
159
+ console.log(result.content);
160
+ }
391
161
 
392
- ```typescript
393
- const result = await extractBytes(imageBytes, 'image/png', {
394
- enable_ocr: true,
395
- ocr_config: {
396
- languages: ['eng', 'deu', 'fra'], // English, German, French
397
- backend: 'tesseract-wasm'
398
- }
399
- });
162
+ extractWithOcr().catch(console.error);
400
163
  ```
401
164
 
402
- ### Supported Languages
403
165
 
404
- `eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
405
166
 
406
- Training data is automatically loaded from jsDelivr CDN:
407
- ```
408
- https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
409
- ```
410
167
 
411
- ## Configuration
168
+ #### Table Extraction
412
169
 
413
- ### Extract Tables
414
170
 
415
- ```typescript
416
- import { extractBytes } from '@kreuzberg/wasm';
171
+ See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
417
172
 
418
- const result = await extractBytes(pdfBytes, 'application/pdf', {
419
- extract_tables: true
420
- });
421
173
 
422
- if (result.tables) {
423
- for (const table of result.tables) {
424
- console.log('Table as Markdown:');
425
- console.log(table.markdown);
426
174
 
427
- console.log('Table cells:');
428
- console.log(JSON.stringify(table.cells, null, 2));
429
- }
430
- }
431
- ```
432
-
433
- ### Extract Images
175
+ #### Processing Multiple Files
434
176
 
435
- ```typescript
436
- import { extractBytes } from '@kreuzberg/wasm';
437
177
 
438
- const result = await extractBytes(pdfBytes, 'application/pdf', {
439
- extract_images: true,
440
- image_config: {
441
- target_dpi: 300,
442
- max_image_dimension: 4096
443
- }
444
- });
178
+ ```ts
179
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
445
180
 
446
- if (result.images) {
447
- for (const image of result.images) {
448
- console.log(`Image ${image.index}: ${image.format}`);
449
- // image.data is a Uint8Array
450
- }
181
+ interface DocumentJob {
182
+ name: string;
183
+ bytes: Uint8Array;
184
+ mimeType: string;
451
185
  }
452
- ```
453
-
454
- ### Text Chunking
455
186
 
456
- ```typescript
457
- import { extractBytes } from '@kreuzberg/wasm';
458
-
459
- const result = await extractBytes(pdfBytes, 'application/pdf', {
460
- enable_chunking: true,
461
- chunking_config: {
462
- max_chars: 1000,
463
- max_overlap: 200
464
- }
465
- });
466
-
467
- if (result.chunks) {
468
- for (const chunk of result.chunks) {
469
- console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
470
- }
187
+ async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
188
+ await initWasm();
189
+
190
+ const results: Record<string, string> = {};
191
+ const queue = [...documents];
192
+
193
+ const workers = Array(concurrency)
194
+ .fill(null)
195
+ .map(async () => {
196
+ while (queue.length > 0) {
197
+ const doc = queue.shift();
198
+ if (!doc) break;
199
+
200
+ try {
201
+ const result = await extractBytes(doc.bytes, doc.mimeType);
202
+ results[doc.name] = result.content;
203
+ } catch (error) {
204
+ console.error(`Failed to process ${doc.name}:`, error);
205
+ }
206
+ }
207
+ });
208
+
209
+ await Promise.all(workers);
210
+ return results;
471
211
  }
472
212
  ```
473
213
 
474
- ### Language Detection
475
214
 
476
- ```typescript
477
- import { extractBytes } from '@kreuzberg/wasm';
478
215
 
479
- const result = await extractBytes(pdfBytes, 'application/pdf', {
480
- enable_language_detection: true
481
- });
482
216
 
483
- if (result.language) {
484
- console.log(`Detected language: ${result.language.code}`);
485
- console.log(`Confidence: ${result.language.confidence}`);
486
- }
487
- ```
488
217
 
489
- ### Complete Configuration Example
490
-
491
- ```typescript
492
- import {
493
- extractBytes,
494
- type ExtractionConfig
495
- } from '@kreuzberg/wasm';
496
-
497
- const config: ExtractionConfig = {
498
- extract_tables: true,
499
- extract_images: true,
500
- extract_metadata: true,
501
-
502
- enable_ocr: true,
503
- ocr_config: {
504
- languages: ['eng'],
505
- backend: 'tesseract-wasm',
506
- dpi: 300,
507
- preprocessing: {
508
- deskew: true,
509
- denoise: true,
510
- binarize: true
511
- }
512
- },
513
-
514
- enable_chunking: true,
515
- chunking_config: {
516
- max_chars: 1000,
517
- max_overlap: 200
518
- },
519
-
520
- enable_language_detection: true,
521
-
522
- enable_quality: true,
523
-
524
- extract_keywords: true,
525
- keywords_config: {
526
- max_keywords: 10,
527
- method: 'yake'
528
- }
529
- };
530
-
531
- const result = await extractBytes(data, mimeType, config);
532
- ```
218
+ #### Async Processing
533
219
 
534
- ## Advanced Usage
220
+ For non-blocking document processing:
535
221
 
536
- ### Batch Processing
222
+ ```ts
223
+ import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
537
224
 
538
- ```typescript
539
- import { batchExtractFiles, batchExtractBytes } from '@kreuzberg/wasm';
225
+ async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
226
+ const caps = getWasmCapabilities();
227
+ if (!caps.hasWasm) {
228
+ throw new Error("WebAssembly not supported");
229
+ }
540
230
 
541
- // Browser: Process multiple files
542
- const fileInput = document.querySelector<HTMLInputElement>('#files');
543
- const files = Array.from(fileInput.files);
231
+ await initWasm();
544
232
 
545
- const results = await batchExtractFiles(files, {
546
- extract_tables: true
547
- });
233
+ const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
548
234
 
549
- for (const result of results) {
550
- console.log(`${result.mime_type}: ${result.content.length} characters`);
235
+ return results.map((r) => ({
236
+ content: r.content,
237
+ pageCount: r.metadata?.pageCount,
238
+ }));
551
239
  }
552
240
 
553
- // Or from Uint8Arrays
554
- const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
555
- const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
241
+ const fileBytes = [new Uint8Array([1, 2, 3])];
242
+ const mimes = ["application/pdf"];
556
243
 
557
- const results = await batchExtractBytes(dataList, mimeTypes);
244
+ extractDocuments(fileBytes, mimes)
245
+ .then((results) => console.log(results))
246
+ .catch(console.error);
558
247
  ```
559
248
 
560
- ### Synchronous Extraction
561
249
 
562
- ```typescript
563
- import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
564
250
 
565
- // Synchronous single extraction
566
- const result = extractBytesSync(data, 'application/pdf', config);
567
251
 
568
- // Synchronous batch extraction
569
- const results = batchExtractBytesSync(dataList, mimeTypes, config);
570
- ```
571
252
 
572
- ### Plugin System
573
253
 
574
- #### Custom Post-Processors
254
+ ### Next Steps
575
255
 
576
- ```typescript
577
- import { registerPostProcessor } from '@kreuzberg/wasm';
256
+ - **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
257
+ - **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
258
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
259
+ - **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
260
+ - **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
578
261
 
579
- registerPostProcessor({
580
- name: 'uppercase',
581
- async process(result) {
582
- return {
583
- ...result,
584
- content: result.content.toUpperCase()
585
- };
586
- }
587
- });
588
262
 
589
- // Now all extractions will use this processor
590
- const result = await extractBytes(data, mimeType);
591
- console.log(result.content); // UPPERCASE TEXT
592
- ```
593
263
 
594
- #### Custom Validators
264
+ ## Features
595
265
 
596
- ```typescript
597
- import { registerValidator } from '@kreuzberg/wasm';
266
+ ### Supported File Formats (56+)
598
267
 
599
- registerValidator({
600
- name: 'min-length',
601
- async validate(result) {
602
- if (result.content.length < 100) {
603
- throw new Error('Document too short');
604
- }
605
- }
606
- });
607
- ```
268
+ 56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
608
269
 
609
- #### Custom OCR Backends
610
-
611
- ```typescript
612
- import { registerOcrBackend } from '@kreuzberg/wasm';
613
-
614
- registerOcrBackend({
615
- name: 'custom-ocr',
616
- supportedLanguages() {
617
- return ['eng', 'fra'];
618
- },
619
- async initialize() {
620
- // Initialize your OCR backend
621
- },
622
- async processImage(imageBytes, language) {
623
- // Process image and return result
624
- return {
625
- content: 'extracted text',
626
- mime_type: 'text/plain',
627
- metadata: {},
628
- tables: []
629
- };
630
- },
631
- async shutdown() {
632
- // Cleanup
633
- }
634
- });
635
- ```
270
+ #### Office Documents
636
271
 
637
- ### MIME Type Detection
272
+ | Category | Formats | Capabilities |
273
+ |----------|---------|--------------|
274
+ | **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
275
+ | **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
276
+ | **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
277
+ | **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
278
+ | **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
638
279
 
639
- ```typescript
640
- import {
641
- detectMimeFromBytes,
642
- getMimeFromExtension,
643
- getExtensionsForMime,
644
- normalizeMimeType
645
- } from '@kreuzberg/wasm';
280
+ #### Images (OCR-Enabled)
646
281
 
647
- // Auto-detect MIME type from file bytes
648
- const mimeType = detectMimeFromBytes(fileBytes);
282
+ | Category | Formats | Features |
283
+ |----------|---------|----------|
284
+ | **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
285
+ | **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
286
+ | **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
649
287
 
650
- // Get MIME type from file extension
651
- const mime = getMimeFromExtension('pdf'); // 'application/pdf'
288
+ #### Web & Data
652
289
 
653
- // Get extensions for MIME type
654
- const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
290
+ | Category | Formats | Features |
291
+ |----------|---------|----------|
292
+ | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
293
+ | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
294
+ | **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
655
295
 
656
- // Normalize MIME type
657
- const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
658
- ```
296
+ #### Email & Archives
659
297
 
660
- ### Configuration Loading
298
+ | Category | Formats | Features |
299
+ |----------|---------|----------|
300
+ | **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
301
+ | **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
661
302
 
662
- ```typescript
663
- import { loadConfigFromString } from '@kreuzberg/wasm';
303
+ #### Academic & Scientific
664
304
 
665
- // Load from YAML
666
- const yamlConfig = `
667
- extract_tables: true
668
- enable_ocr: true
669
- ocr_config:
670
- languages: [eng, deu]
671
- `;
672
- const config = loadConfigFromString(yamlConfig, 'yaml');
305
+ | Category | Formats | Features |
306
+ |----------|---------|----------|
307
+ | **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
308
+ | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
309
+ | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
673
310
 
674
- // Load from JSON
675
- const jsonConfig = '{"extract_tables":true}';
676
- const config2 = loadConfigFromString(jsonConfig, 'json');
311
+ **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
677
312
 
678
- // Load from TOML
679
- const tomlConfig = 'extract_tables = true';
680
- const config3 = loadConfigFromString(tomlConfig, 'toml');
681
- ```
313
+ ### Key Capabilities
682
314
 
683
- ## API Reference
315
+ - **Text Extraction** - Extract all text content with position and formatting information
316
+ - **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
317
+ - **Table Extraction** - Parse tables with structure and cell content preservation
318
+ - **Image Extraction** - Extract embedded images and render page previews
319
+ - **OCR Support** - Integrate multiple OCR backends for scanned documents
684
320
 
685
- ### Extraction Functions
321
+ - **Async/Await** - Non-blocking document processing with concurrent operations
686
322
 
687
- #### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
688
- Extract content from a browser `File` object.
689
323
 
690
- #### `extractBytes(data: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
691
- Asynchronously extract content from a `Uint8Array`.
324
+ - **Plugin System** - Extensible post-processing for custom text transformation
692
325
 
693
- #### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
694
- Synchronously extract content from a `Uint8Array`.
695
326
 
696
- #### `batchExtractFiles(files: File[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
697
- Extract multiple files in parallel.
327
+ - **Batch Processing** - Efficiently process multiple documents in parallel
328
+ - **Memory Efficient** - Stream large files without loading them entirely into memory
329
+ - **Language Detection** - Detect and support multiple languages in documents
330
+ - **Configuration** - Fine-grained control over extraction behavior
698
331
 
699
- #### `batchExtractBytes(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
700
- Extract multiple byte arrays in parallel.
332
+ ### Performance Characteristics
701
333
 
702
- #### `batchExtractBytesSync(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): ExtractionResult[]`
703
- Extract multiple byte arrays synchronously.
334
+ | Format | Speed | Memory | Notes |
335
+ |--------|-------|--------|-------|
336
+ | **PDF (text)** | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
337
+ | **Office docs** | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
338
+ | **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
339
+ | **Archives** | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
340
+ | **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
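As a rough reading of the table above, dividing a file's size by the quoted throughput range gives a best-case and worst-case extraction time. The sketch below is illustrative only: the helper name `estimateSeconds` and the 50 MB example size are assumptions, not part of the package, and real timings depend on the runtime and the document.

```ts
// Throughput range in MB/s; the PDF figures are copied from the table above.
interface ThroughputRange {
  minMBps: number;
  maxMBps: number;
}

const PDF_TEXT: ThroughputRange = { minMBps: 10, maxMBps: 100 };

// Hypothetical helper: best- and worst-case time in seconds for a file of `sizeMB` megabytes.
function estimateSeconds(sizeMB: number, range: ThroughputRange): { best: number; worst: number } {
  return {
    best: sizeMB / range.maxMBps, // fastest quoted throughput
    worst: sizeMB / range.minMBps, // slowest quoted throughput
  };
}

const estimate = estimateSeconds(50, PDF_TEXT);
console.log(estimate); // { best: 0.5, worst: 5 }
```

Under these figures, a 50 MB text PDF would take somewhere between half a second and five seconds.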
704
341
 
705
- ### Plugin Management
706
342
 
707
- #### Post-Processors
708
343
 
709
- ```typescript
710
- registerPostProcessor(processor: PostProcessorProtocol): void
711
- unregisterPostProcessor(name: string): void
712
- clearPostProcessors(): void
713
- listPostProcessors(): string[]
714
- ```
344
+ ## OCR Support
715
345
 
716
- #### Validators
346
+ Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
717
347
 
718
- ```typescript
719
- registerValidator(validator: ValidatorProtocol): void
720
- unregisterValidator(name: string): void
721
- clearValidators(): void
722
- listValidators(): string[]
723
- ```
724
348
 
725
- #### OCR Backends
349
+ - **Tesseract-Wasm**
726
350
 
727
- ```typescript
728
- registerOcrBackend(backend: OcrBackendProtocol): void
729
- unregisterOcrBackend(name: string): void
730
- clearOcrBackends(): void
731
- listOcrBackends(): string[]
732
- ```
733
351
 
734
- ### Document Extractors
352
+ ### OCR Configuration Example
735
353
 
736
- ```typescript
737
- listDocumentExtractors(): string[]
738
- unregisterDocumentExtractor(name: string): void
739
- clearDocumentExtractors(): void
740
- ```
354
+ ```ts
355
+ import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
741
356
 
742
- ### MIME Utilities
357
+ async function extractWithOcr() {
358
+ await initWasm();
743
359
 
744
- ```typescript
745
- detectMimeFromBytes(data: Uint8Array): string
746
- getMimeFromExtension(ext: string): string | null
747
- getExtensionsForMime(mime: string): string[]
748
- normalizeMimeType(mime: string): string
749
- ```
360
+ try {
361
+ await enableOcr();
362
+ console.log("OCR enabled successfully");
363
+ } catch (error) {
364
+ console.error("Failed to enable OCR:", error);
365
+ return;
366
+ }
750
367
 
751
- ### Configuration
368
+ const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
752
369
 
753
- ```typescript
754
- loadConfigFromString(content: string, format: 'yaml' | 'toml' | 'json'): ExtractionConfig
755
- ```
370
+ const result = await extractBytes(bytes, "image/png", {
371
+ ocr: {
372
+ backend: "tesseract-wasm",
373
+ language: "eng",
374
+ },
375
+ });
756
376
 
757
- ### Embeddings
377
+ console.log("Extracted text:");
378
+ console.log(result.content);
379
+ }
758
380
 
759
- ```typescript
760
- listEmbeddingPresets(): string[]
761
- getEmbeddingPreset(name: string): EmbeddingPreset | null
381
+ extractWithOcr().catch(console.error);
762
382
  ```
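The example above returns early when `enableOcr()` fails. An alternative, sketched here under stated assumptions, is to build the extraction options conditionally so the caller can still attempt extraction without OCR. Only the `{ ocr: { backend, language } }` shape comes from the example; the `OcrOptions` and `ExtractOptions` types and the `buildExtractOptions` helper are hypothetical, and the enable function is injected so the sketch runs without the WASM module.

```ts
// Hypothetical option types; only the `{ ocr: { backend, language } }` shape
// is taken from the README example.
interface OcrOptions {
  backend: string;
  language: string;
}

interface ExtractOptions {
  ocr?: OcrOptions;
}

// `enable` stands in for the package's `enableOcr`; it is passed in so this
// sketch has no dependency on the WASM module.
async function buildExtractOptions(
  enable: () => Promise<void>,
  language: string = "eng",
): Promise<ExtractOptions> {
  try {
    await enable();
    return { ocr: { backend: "tesseract-wasm", language } };
  } catch (error) {
    // OCR could not be enabled: fall back to options without an `ocr` entry.
    console.warn("OCR unavailable, extracting without it:", error);
    return {};
  }
}
```

A caller might then write `const options = await buildExtractOptions(enableOcr)` and pass `options` as the third argument to `extractBytes`. Whether extraction without OCR returns useful content for an image depends on the library, so treat the fallback as a design option rather than documented behavior.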
763
383
 
764
- ## Types
765
-
766
- All types are shared via the `@kreuzberg/core` package:
767
-
768
- ```typescript
769
- import type {
770
- ExtractionResult,
771
- ExtractionConfig,
772
- OcrConfig,
773
- ChunkingConfig,
774
- ImageConfig,
775
- KeywordsConfig,
776
- Table,
777
- ExtractedImage,
778
- Chunk,
779
- Metadata,
780
- PostProcessorProtocol,
781
- ValidatorProtocol,
782
- OcrBackendProtocol
783
- } from '@kreuzberg/core';
784
- ```
785
384
 
786
- ### ExtractionResult
787
-
788
- Main result object containing:
789
- - `content: string` - Extracted text content
790
- - `mime_type: string` - MIME type of the document
791
- - `metadata?: Metadata` - Document metadata
792
- - `tables?: Table[]` - Extracted tables
793
- - `images?: ExtractedImage[]` - Extracted images
794
- - `chunks?: Chunk[]` - Text chunks (if chunking enabled)
795
- - `language?: LanguageInfo` - Detected language (if enabled)
796
- - `keywords?: Keyword[]` - Extracted keywords (if enabled)
797
-
798
- ### ExtractionConfig
799
-
800
- Configuration object for extraction:
801
- - `extract_tables?: boolean` - Extract tables as structured data
802
- - `extract_images?: boolean` - Extract embedded images
803
- - `extract_metadata?: boolean` - Extract document metadata
804
- - `enable_ocr?: boolean` - Enable OCR for images
805
- - `ocr_config?: OcrConfig` - OCR settings
806
- - `enable_chunking?: boolean` - Split text into semantic chunks
807
- - `chunking_config?: ChunkingConfig` - Text chunking settings
808
- - `enable_language_detection?: boolean` - Detect document language
809
- - `enable_quality?: boolean` - Encoding detection, normalization
810
- - `extract_keywords?: boolean` - Extract important keywords
811
- - `keywords_config?: KeywordsConfig` - Keyword extraction settings
812
-
813
- ### Table
814
-
815
- Extracted table structure:
816
- - `markdown: string` - Table in Markdown format
817
- - `cells: TableCell[][]` - 2D array of table cells
818
- - `row_count: number` - Number of rows
819
- - `column_count: number` - Number of columns
820
-
821
- ## Supported Formats
822
-
823
- | Category | Formats |
824
- |----------|---------|
825
- | **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
826
- | **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
827
- | **Web** | HTML, XHTML, XML, EPUB |
828
- | **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
829
- | **Email** | EML, MSG |
830
- | **Archives** | ZIP, TAR, 7Z |
831
- | **Other** | And 30+ more formats |
832
-
833
- ## Build from Source
834
-
835
- ### Prerequisites
836
-
837
- - Rust 1.75+ with `wasm32-unknown-unknown` target
838
- - Node.js 18+ with pnpm
839
- - wasm-pack
840
385
 
841
- ```bash
842
- # Install Rust target
843
- rustup target add wasm32-unknown-unknown
844
386
 
845
- # Install wasm-pack
846
- curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
387
+ ## Async Support
847
388
 
848
- # Build WASM package
849
- cd crates/kreuzberg-wasm
850
- pnpm install
851
- pnpm run build
389
+ This binding provides full async/await support for non-blocking document processing:
852
390
 
853
- # Run tests
854
- pnpm test
855
- ```
391
+ ```ts
392
+ import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
856
393
 
857
- ### Build Targets
394
+ async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
395
+ const caps = getWasmCapabilities();
396
+ if (!caps.hasWasm) {
397
+ throw new Error("WebAssembly not supported");
398
+ }
858
399
 
859
- ```bash
860
- # For browsers (ESM modules)
861
- pnpm run build:wasm:web
400
+ await initWasm();
862
401
 
863
- # For bundlers (webpack, rollup, vite)
864
- pnpm run build:wasm:bundler
402
+ const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
865
403
 
866
- # For Node.js
867
- pnpm run build:wasm:nodejs
404
+ return results.map((r) => ({
405
+ content: r.content,
406
+ pageCount: r.metadata?.pageCount,
407
+ }));
408
+ }
868
409
 
869
- # For Deno
870
- pnpm run build:wasm:deno
410
+ const fileBytes = [new Uint8Array([1, 2, 3])]; // placeholder bytes; replace with real PDF data
411
+ const mimes = ["application/pdf"];
871
412
 
872
- # Build all targets
873
- pnpm run build:all
413
+ extractDocuments(fileBytes, mimes)
414
+ .then((results) => console.log(results))
415
+ .catch(console.error);
874
416
  ```
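The example above indexes `mimeTypes` by position, so a shorter `mimeTypes` array would silently pass `undefined` as a MIME type. A minimal sketch of one way to guard against that is to pair the inputs up front. The `ExtractionInput` type and `pairInputs` helper are hypothetical names introduced for illustration, not part of `@kreuzberg/wasm`.

```ts
// Hypothetical pairing of a document's bytes with its MIME type.
interface ExtractionInput {
  bytes: Uint8Array;
  mimeType: string;
}

// Pairs each file with the MIME type at the same index and fails fast
// when the two arrays do not line up.
function pairInputs(files: Uint8Array[], mimeTypes: string[]): ExtractionInput[] {
  if (files.length !== mimeTypes.length) {
    throw new RangeError(`Got ${files.length} files but ${mimeTypes.length} MIME types`);
  }
  return files.map((bytes, index) => {
    const mimeType = mimeTypes[index];
    if (mimeType === undefined) {
      throw new RangeError(`Missing MIME type for file at index ${index}`);
    }
    return { bytes, mimeType };
  });
}
```

The paired inputs can then be mapped over with `extractBytes(input.bytes, input.mimeType)` in place of the index lookup.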
875
417
 
876
- ## Limitations
877
418
 
878
- ### No File System Access
879
419
 
880
- The WASM binding cannot access the file system directly. Use file readers:
881
420
 
882
- ```typescript
883
- // ❌ Won't work
884
- await extractFileSync('./document.pdf'); // Throws error
421
+ ## Plugin System
885
422
 
886
- // Use file readers instead
887
- const bytes = await Deno.readFile('./document.pdf'); // Deno
888
- const bytes = await fs.readFile('./document.pdf'); // Node.js
889
- const bytes = await file.arrayBuffer(); // Browser
890
- ```
423
+ Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
891
424
 
892
- ### OCR Training Data
425
+ For detailed plugin documentation, visit the [Plugin System Guide](https://kreuzberg.dev/plugins/).
893
426
 
894
- Tesseract training data (`.traineddata` files) are loaded from jsDelivr CDN on first use. For offline usage or custom CDN, see the [OCR documentation](https://kreuzberg.dev).
895
427
 
896
- ### Size Constraints
897
428
 
898
- Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
899
429
 
900
- ## Troubleshooting
901
430
 
902
- ### "WASM module failed to initialize"
903
431
 
904
- Ensure your bundler is configured to handle WASM files:
432
+ ## Batch Processing
905
433
 
906
- **Vite:**
907
- ```typescript
908
- // vite.config.ts
909
- export default {
910
- optimizeDeps: {
911
- exclude: ['@kreuzberg/wasm']
912
- }
913
- }
914
- ```
434
+ Process multiple documents efficiently:
915
435
 
916
- **Webpack:**
917
- ```javascript
918
- // webpack.config.js
919
- module.exports = {
920
- experiments: {
921
- asyncWebAssembly: true
922
- }
923
- }
924
- ```
925
-
926
- ### "Module not found: @kreuzberg/core"
436
+ ```ts
437
+ import { extractBytes, initWasm } from "@kreuzberg/wasm";
927
438
 
928
- The @kreuzberg/core package is a peer dependency. Install it:
439
+ interface DocumentJob {
440
+ name: string;
441
+ bytes: Uint8Array;
442
+ mimeType: string;
443
+ }
929
444
 
930
- ```bash
931
- pnpm add @kreuzberg/core
445
+ async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
446
+ await initWasm();
447
+
448
+ const results: Record<string, string> = {};
449
+ const queue = [...documents];
450
+
451
+ const workers = Array(concurrency)
452
+ .fill(null)
453
+ .map(async () => {
454
+ while (queue.length > 0) {
455
+ const doc = queue.shift();
456
+ if (!doc) break;
457
+
458
+ try {
459
+ const result = await extractBytes(doc.bytes, doc.mimeType);
460
+ results[doc.name] = result.content;
461
+ } catch (error) {
462
+ console.error(`Failed to process ${doc.name}:`, error);
463
+ }
464
+ }
465
+ });
466
+
467
+ await Promise.all(workers);
468
+ return results;
469
+ }
932
470
  ```
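The worker-pool pattern above logs failures and drops them from the result. A variation, sketched here under stated assumptions, returns failures alongside successes so the caller can retry or report them. The `BatchOutcome` type and `processBatchWith` helper are hypothetical names, and the extractor is injected as a parameter (standing in for a call such as `extractBytes(...).then((r) => r.content)`) so the sketch runs without the WASM module.

```ts
interface DocumentJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

// Hypothetical result shape: extracted text and error messages, keyed by document name.
interface BatchOutcome {
  contents: Record<string, string>;
  failures: Record<string, string>;
}

async function processBatchWith(
  documents: DocumentJob[],
  extract: (job: DocumentJob) => Promise<string>,
  concurrency: number = 3,
): Promise<BatchOutcome> {
  const outcome: BatchOutcome = { contents: {}, failures: {} };
  const queue = [...documents];

  // Each worker pulls the next job until the shared queue is empty.
  const worker = async (): Promise<void> => {
    while (true) {
      const job = queue.shift();
      if (job === undefined) break;
      try {
        outcome.contents[job.name] = await extract(job);
      } catch (error) {
        outcome.failures[job.name] = error instanceof Error ? error.message : String(error);
      }
    }
  };

  // Never start more workers than there are jobs, and always start at least one.
  const workerCount = Math.max(1, Math.min(concurrency, documents.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcome;
}
```

Because JavaScript runs each worker on a single thread and `queue.shift()` is synchronous, two workers cannot take the same job, so no locking is needed.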
933
471
 
934
- ### Memory Issues in Workers
935
472
 
936
- For large documents in Cloudflare Workers, process in smaller chunks:
937
473
 
938
- ```typescript
939
- const result = await extractBytes(pdfBytes, 'application/pdf', {
940
- chunking_config: { max_chars: 1000 }
941
- });
942
- ```
943
474
 
944
- ### OCR Not Working
475
+ ## Configuration
945
476
 
946
- Check that tesseract-wasm is loading correctly. The training data is automatically fetched from CDN on first use.
477
+ For advanced configuration options including language detection, table extraction, OCR settings, and more:
947
478
 
948
- ## Examples
479
+ **[Configuration Guide](https://kreuzberg.dev/configuration/)**
949
480
 
950
- See the [`examples/`](./examples/) directory for complete working examples:
481
+ ## Documentation
951
482
 
952
- - **Browser**: Vanilla JS file upload interface
953
- - **Deno**: Command-line document extraction
954
- - **Cloudflare Workers**: Document processing API
955
- - **Node.js**: Batch processing script
483
+ - **[Official Documentation](https://kreuzberg.dev/)**
484
+ - **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
485
+ - **[Examples & Guides](https://kreuzberg.dev/guides/)**
956
486
 
957
- ## Documentation
487
+ ## Troubleshooting
958
488
 
959
- For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
489
+ For common issues and solutions, visit the [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
960
490
 
961
491
  ## Contributing
962
492
 
963
- We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/docs/contributing.md) for details.
493
+ Contributions are welcome! See the [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
964
494
 
965
495
  ## License
966
496
 
967
- MIT
968
-
969
- ## Links
970
-
971
- - [Website](https://kreuzberg.dev)
972
- - [Documentation](https://kreuzberg.dev)
973
- - [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
974
- - [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
975
- - [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
976
- - [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
497
+ MIT License - see the LICENSE file for details.
977
498
 
978
- ## Related Packages
499
+ ## Support
979
500
 
980
- - [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) - Native Node.js bindings (NAPI)
981
- - [@kreuzberg/core](https://www.npmjs.com/package/@kreuzberg/core) - Shared TypeScript types
982
- - [kreuzberg](https://crates.io/crates/kreuzberg) - Rust core library
501
+ - **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
502
+ - **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
503
+ - **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)