@kreuzberg/wasm 4.0.0-rc.21 → 4.0.0-rc.23
- package/README.md +520 -837
- package/dist/adapters/wasm-adapter.cjs.map +1 -1
- package/dist/adapters/wasm-adapter.d.cts +1 -1
- package/dist/adapters/wasm-adapter.d.ts +1 -1
- package/dist/adapters/wasm-adapter.js.map +1 -1
- package/dist/index.cjs +192 -48
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +219 -3
- package/dist/index.d.ts +219 -3
- package/dist/index.js +199 -48
- package/dist/index.js.map +1 -1
- package/dist/ocr/registry.cjs.map +1 -1
- package/dist/ocr/registry.d.cts +1 -1
- package/dist/ocr/registry.d.ts +1 -1
- package/dist/ocr/registry.js.map +1 -1
- package/dist/ocr/tesseract-wasm-backend.cjs +0 -46
- package/dist/ocr/tesseract-wasm-backend.cjs.map +1 -1
- package/dist/ocr/tesseract-wasm-backend.d.cts +1 -1
- package/dist/ocr/tesseract-wasm-backend.d.ts +1 -1
- package/dist/ocr/tesseract-wasm-backend.js +0 -46
- package/dist/ocr/tesseract-wasm-backend.js.map +1 -1
- package/dist/pdfium.js +0 -5
- package/dist/runtime.cjs +0 -1
- package/dist/runtime.cjs.map +1 -1
- package/dist/runtime.js +0 -1
- package/dist/runtime.js.map +1 -1
- package/dist/{types-CKjcIYcX.d.cts → types-wVLLDHkl.d.cts} +73 -3
- package/dist/{types-CKjcIYcX.d.ts → types-wVLLDHkl.d.ts} +73 -3
- package/package.json +162 -162
package/README.md
CHANGED
|
@@ -1,1059 +1,742 @@
|
|
|
1
|
-
#
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
>
|
|
22
|
-
|
|
23
|
-
>
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
>
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
>
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
1
|
+
# WebAssembly Bindings
|
|
2
|
+
|
|
3
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
|
|
4
|
+
<!-- Language Bindings -->
|
|
5
|
+
<a href="https://crates.io/crates/kreuzberg">
|
|
6
|
+
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
|
|
7
|
+
</a>
|
|
8
|
+
<a href="https://hex.pm/packages/kreuzberg">
|
|
9
|
+
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
|
|
10
|
+
</a>
|
|
11
|
+
<a href="https://pypi.org/project/kreuzberg/">
|
|
12
|
+
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
|
|
13
|
+
</a>
|
|
14
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/node">
|
|
15
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
|
|
16
|
+
</a>
|
|
17
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
|
|
18
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
|
|
19
|
+
</a>
|
|
20
|
+
|
|
21
|
+
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
|
|
22
|
+
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
|
|
23
|
+
</a>
|
|
24
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
|
|
25
|
+
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
|
|
26
|
+
</a>
|
|
27
|
+
<a href="https://www.nuget.org/packages/Kreuzberg/">
|
|
28
|
+
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
|
|
29
|
+
</a>
|
|
30
|
+
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
|
|
31
|
+
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
|
|
32
|
+
</a>
|
|
33
|
+
<a href="https://rubygems.org/gems/kreuzberg">
|
|
34
|
+
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
|
|
35
|
+
</a>
|
|
36
|
+
|
|
37
|
+
<!-- Project Info -->
|
|
38
|
+
|
|
39
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
|
|
40
|
+
<img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
|
|
41
|
+
</a>
|
|
42
|
+
<a href="https://docs.kreuzberg.dev">
|
|
43
|
+
<img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
|
|
44
|
+
</a>
|
|
45
|
+
</div>
|
|
46
|
+
|
|
47
|
+
<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
|
|
48
|
+
|
|
49
|
+
<div align="center" style="margin-top: 20px;">
|
|
50
|
+
<a href="https://discord.gg/pXxagNK2zN">
|
|
51
|
+
<img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
|
|
52
|
+
</a>
|
|
53
|
+
</div>
|
|
54
|
+
|
|
55
|
+
Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Node.js, Deno, and Cloudflare Workers with portable deployment and optional multi-threading support.
|
|
56
|
+
|
|
57
|
+
> **Version 4.0.0 Release Candidate**
|
|
58
|
+
> Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
|
|
59
|
+
> This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
|
|
54
60
|
|
|
55
61
|
## Installation
|
|
56
62
|
|
|
57
|
-
###
|
|
63
|
+
### Package Installation
|
|
58
64
|
|
|
59
|
-
|
|
60
|
-
|----------|---|---|
|
|
61
|
-
| **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
|
|
62
|
-
| **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
|
|
63
|
-
| **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
|
|
64
|
-
| **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
|
|
65
|
-
| **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
|
|
65
|
+
Install via one of the supported package managers:
|
|
66
66
|
|
|
67
|
-
|
|
67
|
+
**npm:**
|
|
68
68
|
|
|
69
69
|
```bash
|
|
70
70
|
npm install @kreuzberg/wasm
|
|
71
71
|
```
|
|
72
72
|
|
|
73
|
-
|
|
73
|
+
**pnpm:**
|
|
74
74
|
|
|
75
75
|
```bash
|
|
76
76
|
pnpm add @kreuzberg/wasm
|
|
77
77
|
```
|
|
78
78
|
|
|
79
|
-
|
|
79
|
+
**yarn:**
|
|
80
80
|
|
|
81
81
|
```bash
|
|
82
82
|
yarn add @kreuzberg/wasm
|
|
83
83
|
```
|
|
84
84
|
|
|
85
|
-
###
|
|
86
|
-
|
|
87
|
-
```typescript
|
|
88
|
-
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
|
|
89
|
-
```
|
|
90
|
-
|
|
91
|
-
## PDF Support and PDFium Initialization
|
|
92
|
-
|
|
93
|
-
**IMPORTANT**: PDF extraction requires a one-time initialization step to load the PDFium WASM module.
|
|
85
|
+
### Platform Support
|
|
94
86
|
|
|
95
|
-
|
|
87
|
+
Runs on:
|
|
88
|
+
- Modern browsers (Chrome, Firefox, Safari, Edge with WebAssembly support)
|
|
89
|
+
- Node.js 16+ (with WASM runtime)
|
|
90
|
+
- Deno 1.0+
|
|
91
|
+
- Cloudflare Workers
|
|
92
|
+
- Any JavaScript environment with WebAssembly support
|
|
96
93
|
|
|
97
|
-
|
|
94
|
+
### System Requirements
|
|
98
95
|
|
|
99
|
-
|
|
96
|
+
- WebAssembly support in runtime environment
|
|
97
|
+
- 50 MB minimum free memory for extraction
|
|
98
|
+
- Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
|
|
100
99
|
|
|
101
|
-
|
|
102
|
-
import init, { initialize_pdfium_render, extractBytes } from '@kreuzberg/wasm';
|
|
103
|
-
import pdfiumModule from '@kreuzberg/wasm/pdfium.js';
|
|
100
|
+
### Runtime Detection
|
|
104
101
|
|
|
105
|
-
|
|
106
|
-
await init();
|
|
102
|
+
Check platform capabilities before extraction:
|
|
107
103
|
|
|
108
|
-
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
// Step 3: Bind kreuzberg to PDFium (required before any PDF operations)
|
|
112
|
-
const success = initialize_pdfium_render(pdfium, wasm, false);
|
|
113
|
-
if (!success) {
|
|
114
|
-
throw new Error('Failed to initialize PDFium');
|
|
115
|
-
}
|
|
104
|
+
```typescript
|
|
105
|
+
import { getWasmCapabilities } from '@kreuzberg/wasm';
|
|
116
106
|
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
console.log(
|
|
121
|
-
|
|
122
|
-
|
|
123
|
-
### Error: "PdfiumWASMModuleNotConfigured"
|
|
124
|
-
|
|
125
|
-
If you see this error, it means `initialize_pdfium_render()` was not called before attempting PDF extraction. Make sure to follow the initialization sequence above.
|
|
126
|
-
|
|
127
|
-
### PDFium Files Location
|
|
128
|
-
|
|
129
|
-
The PDFium WASM files (`pdfium.js`, `pdfium.wasm`) should be included in the `@kreuzberg/wasm` package. If they're missing:
|
|
130
|
-
|
|
131
|
-
1. Check your `node_modules/@kreuzberg/wasm/` directory
|
|
132
|
-
2. Ensure both `pdfium.js` and `pdfium.wasm` are present
|
|
133
|
-
3. If missing, reinstall the package
|
|
134
|
-
|
|
135
|
-
For self-hosted builds, copy the files from:
|
|
136
|
-
```bash
|
|
137
|
-
target/wasm32-unknown-unknown/release/build/kreuzberg-*/out/pdfium/release/node/
|
|
107
|
+
const caps = getWasmCapabilities();
|
|
108
|
+
console.log('WASM available:', caps.hasWasm);
|
|
109
|
+
console.log('Web Workers available:', caps.hasWorkers);
|
|
110
|
+
console.log('Module Workers available:', caps.hasModuleWorkers);
|
|
111
|
+
console.log('File API available:', caps.hasFileApi);
|
|
112
|
+
console.log('SharedArrayBuffer available:', caps.hasSharedArrayBuffer);
|
|
138
113
|
```
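
The capability flags logged above can also drive a decision about where extraction runs. The following is a minimal sketch, not part of the package: the `Capabilities` interface is a local mirror of the fields shown in the example, and `chooseStrategy` with its strategy names is hypothetical.

```typescript
// Local mirror of the flags logged in the example above (an assumption:
// the object returned by getWasmCapabilities() may carry more fields).
interface Capabilities {
  hasWasm: boolean;
  hasWorkers: boolean;
  hasModuleWorkers: boolean;
  hasFileApi: boolean;
  hasSharedArrayBuffer: boolean;
}

// Hypothetical strategy names, used only for this illustration.
type Strategy = "unsupported" | "main-thread" | "classic-worker" | "module-worker";

// Pick the most capable execution strategy the runtime allows.
function chooseStrategy(caps: Capabilities): Strategy {
  if (!caps.hasWasm) return "unsupported";
  if (caps.hasWorkers && caps.hasModuleWorkers) return "module-worker";
  if (caps.hasWorkers) return "classic-worker";
  return "main-thread";
}
```

In an application, the result of `getWasmCapabilities()` would be passed to `chooseStrategy` once at startup, before any call to `initWasm()`.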
|
|
139
114
|
|
|
140
115
|
## Quick Start
|
|
141
116
|
|
|
142
|
-
###
|
|
143
|
-
|
|
144
|
-
```typescript
|
|
145
|
-
import { extractFile } from '@kreuzberg/wasm';
|
|
146
|
-
|
|
147
|
-
async function handleFileUpload() {
|
|
148
|
-
const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
|
|
149
|
-
const file = fileInput.files[0];
|
|
117
|
+
### Basic Extraction
|
|
150
118
|
|
|
151
|
-
|
|
152
|
-
extract_tables: true,
|
|
153
|
-
extract_images: true
|
|
154
|
-
});
|
|
155
|
-
|
|
156
|
-
console.log('Extracted text:', result.content);
|
|
157
|
-
console.log('Tables found:', result.tables.length);
|
|
158
|
-
}
|
|
159
|
-
```
|
|
119
|
+
Extract text, metadata, and structure from any supported document format:
|
|
160
120
|
|
|
161
|
-
|
|
121
|
+
```ts
|
|
122
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
162
123
|
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
import { readFile } from 'fs/promises';
|
|
166
|
-
|
|
167
|
-
const pdfBytes = await readFile('./document.pdf');
|
|
168
|
-
const result = await extractBytes(
|
|
169
|
-
new Uint8Array(pdfBytes),
|
|
170
|
-
'application/pdf',
|
|
171
|
-
{ extract_tables: true }
|
|
172
|
-
);
|
|
173
|
-
|
|
174
|
-
console.log(result.content);
|
|
175
|
-
console.log('Found', result.tables.length, 'tables');
|
|
176
|
-
```
|
|
124
|
+
async function main() {
|
|
125
|
+
await initWasm();
|
|
177
126
|
|
|
178
|
-
|
|
127
|
+
const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
|
|
128
|
+
const bytes = new Uint8Array(buffer);
|
|
179
129
|
|
|
180
|
-
|
|
181
|
-
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
|
|
130
|
+
const result = await extractBytes(bytes, "application/pdf");
|
|
182
131
|
|
|
183
|
-
|
|
184
|
-
|
|
132
|
+
console.log("Extracted content:");
|
|
133
|
+
console.log(result.content);
|
|
134
|
+
console.log("MIME type:", result.mimeType);
|
|
135
|
+
console.log("Metadata:", result.metadata);
|
|
136
|
+
}
|
|
185
137
|
|
|
186
|
-
|
|
138
|
+
main().catch(console.error);
|
|
187
139
|
```
|
|
188
140
|
|
|
189
|
-
###
|
|
141
|
+
### Common Use Cases
|
|
190
142
|
|
|
191
|
-
|
|
192
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
143
|
+
#### Extract with Custom Configuration
|
|
193
144
|
|
|
194
|
-
|
|
195
|
-
async fetch(request: Request): Promise<Response> {
|
|
196
|
-
if (request.method === 'POST') {
|
|
197
|
-
const formData = await request.formData();
|
|
198
|
-
const file = formData.get('file') as File;
|
|
145
|
+
Most use cases benefit from configuration to control extraction behavior:
|
|
199
146
|
|
|
200
|
-
|
|
201
|
-
const bytes = new Uint8Array(arrayBuffer);
|
|
147
|
+
**With OCR (for scanned documents):**
|
|
202
148
|
|
|
203
|
-
|
|
149
|
+
```ts
|
|
150
|
+
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
204
151
|
|
|
205
|
-
|
|
206
|
-
|
|
207
|
-
metadata: result.metadata,
|
|
208
|
-
tables: result.tables
|
|
209
|
-
});
|
|
210
|
-
}
|
|
152
|
+
async function extractWithOcr() {
|
|
153
|
+
await initWasm();
|
|
211
154
|
|
|
212
|
-
|
|
155
|
+
try {
|
|
156
|
+
await enableOcr();
|
|
157
|
+
console.log("OCR enabled successfully");
|
|
158
|
+
} catch (error) {
|
|
159
|
+
console.error("Failed to enable OCR:", error);
|
|
160
|
+
return;
|
|
213
161
|
}
|
|
214
|
-
};
|
|
215
|
-
```
|
|
216
|
-
|
|
217
|
-
## Performance Comparison
|
|
218
|
-
|
|
219
|
-
Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
|
|
220
|
-
|
|
221
|
-
| Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
|
|
222
|
-
|--------|---|---|---|
|
|
223
|
-
| **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
|
|
224
|
-
| **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
|
|
225
|
-
| **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
|
|
226
|
-
| **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
|
|
227
|
-
| **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
|
|
228
|
-
|
|
229
|
-
### When to Use WASM vs Native
|
|
230
|
-
|
|
231
|
-
**Use WASM (@kreuzberg/wasm) when:**
|
|
232
|
-
- Building browser applications (no choice, WASM required)
|
|
233
|
-
- Targeting Cloudflare Workers or edge runtimes
|
|
234
|
-
- Supporting Deno applications
|
|
235
|
-
- You don't have a native build toolchain available
|
|
236
|
-
- Portability across runtimes is critical
|
|
237
|
-
|
|
238
|
-
**Use Native (@kreuzberg/node) when:**
|
|
239
|
-
- Building Node.js or Bun applications (2-3x faster)
|
|
240
|
-
- Performance is your primary concern
|
|
241
|
-
- You're processing large volumes of documents
|
|
242
|
-
- You have native build tools available
|
|
243
|
-
|
|
244
|
-
### Performance Tips for WASM
|
|
245
|
-
|
|
246
|
-
1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
|
|
247
|
-
2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
|
|
248
|
-
3. **Cache WASM module** by loading it once per application
|
|
249
|
-
4. **Preload OCR models** by calling extraction with OCR enabled early
|
|
250
162
|
|
|
251
|
-
|
|
163
|
+
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
|
|
252
164
|
|
|
253
|
-
|
|
254
|
-
|
|
255
|
-
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
|
|
260
|
-
|
|
261
|
-
## Multi-Threading with wasm-bindgen-rayon
|
|
262
|
-
|
|
263
|
-
Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
|
|
264
|
-
|
|
265
|
-
### Initializing the Thread Pool
|
|
266
|
-
|
|
267
|
-
To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
|
|
268
|
-
|
|
269
|
-
```typescript
|
|
270
|
-
import { initThreadPool } from '@kreuzberg/wasm';
|
|
271
|
-
|
|
272
|
-
// Initialize thread pool for multi-threaded extraction
|
|
273
|
-
await initThreadPool(navigator.hardwareConcurrency);
|
|
274
|
-
|
|
275
|
-
// Now extractions will use multiple threads for better performance
|
|
276
|
-
const result = await extractBytes(pdfBytes, 'application/pdf');
|
|
277
|
-
```
|
|
278
|
-
|
|
279
|
-
### Required HTTP Headers for SharedArrayBuffer
|
|
280
|
-
|
|
281
|
-
Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
|
|
282
|
-
|
|
283
|
-
**Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
|
|
284
|
-
|
|
285
|
-
Set these headers in your server configuration:
|
|
286
|
-
|
|
287
|
-
```
|
|
288
|
-
Cross-Origin-Opener-Policy: same-origin
|
|
289
|
-
Cross-Origin-Embedder-Policy: require-corp
|
|
290
|
-
```
|
|
291
|
-
|
|
292
|
-
#### Server Configuration Examples
|
|
293
|
-
|
|
294
|
-
**Express.js:**
|
|
295
|
-
```javascript
|
|
296
|
-
app.use((req, res, next) => {
|
|
297
|
-
res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
|
|
298
|
-
res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
|
|
299
|
-
next();
|
|
300
|
-
});
|
|
301
|
-
```
|
|
302
|
-
|
|
303
|
-
**Nginx:**
|
|
304
|
-
```nginx
|
|
305
|
-
add_header 'Cross-Origin-Opener-Policy' 'same-origin';
|
|
306
|
-
add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
|
|
307
|
-
```
|
|
165
|
+
const result = await extractBytes(bytes, "image/png", {
|
|
166
|
+
ocr: {
|
|
167
|
+
backend: "tesseract-wasm",
|
|
168
|
+
language: "eng",
|
|
169
|
+
},
|
|
170
|
+
});
|
|
308
171
|
|
|
309
|
-
|
|
310
|
-
|
|
311
|
-
|
|
312
|
-
Header set Cross-Origin-Embedder-Policy "require-corp"
|
|
313
|
-
```
|
|
172
|
+
console.log("Extracted text:");
|
|
173
|
+
console.log(result.content);
|
|
174
|
+
}
|
|
314
175
|
|
|
315
|
-
|
|
316
|
-
```javascript
|
|
317
|
-
export default {
|
|
318
|
-
async fetch(request: Request): Promise<Response> {
|
|
319
|
-
const response = new Response(body);
|
|
320
|
-
response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
|
|
321
|
-
response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
|
|
322
|
-
return response;
|
|
323
|
-
}
|
|
324
|
-
};
|
|
176
|
+
extractWithOcr().catch(console.error);
|
|
325
177
|
```
|
|
326
178
|
|
|
327
|
-
|
|
328
|
-
|
|
329
|
-
Multi-threading with SharedArrayBuffer is available in:
|
|
179
|
+
#### Table Extraction
|
|
330
180
|
|
|
331
|
-
|
|
332
|
-
- **Firefox**: 79+
|
|
333
|
-
- **Safari**: 15.2+
|
|
334
|
-
- **Opera**: 60+
|
|
181
|
+
See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
|
|
335
182
|
|
|
336
|
-
|
|
183
|
+
#### Processing Multiple Files
|
|
337
184
|
|
|
338
|
-
|
|
185
|
+
```ts
|
|
186
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
339
187
|
|
|
340
|
-
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
|
|
345
|
-
try {
|
|
346
|
-
await initThreadPool(navigator.hardwareConcurrency);
|
|
347
|
-
console.log('Multi-threading enabled');
|
|
348
|
-
} catch (error) {
|
|
349
|
-
// Fall back to single-threaded processing
|
|
350
|
-
console.warn('Multi-threading unavailable:', error);
|
|
351
|
-
console.log('Using single-threaded extraction');
|
|
188
|
+
interface DocumentJob {
|
|
189
|
+
name: string;
|
|
190
|
+
bytes: Uint8Array;
|
|
191
|
+
mimeType: string;
|
|
352
192
|
}
|
|
353
193
|
|
|
354
|
-
|
|
355
|
-
|
|
194
|
+
async function processBatch(documents: DocumentJob[], concurrency: number = 3) {
|
|
195
|
+
await initWasm();
|
|
196
|
+
|
|
197
|
+
const results: Record<string, string> = {};
|
|
198
|
+
const queue = [...documents];
|
|
199
|
+
|
|
200
|
+
const workers = Array(concurrency)
|
|
201
|
+
.fill(null)
|
|
202
|
+
.map(async () => {
|
|
203
|
+
while (queue.length > 0) {
|
|
204
|
+
const doc = queue.shift();
|
|
205
|
+
if (!doc) break;
|
|
206
|
+
|
|
207
|
+
try {
|
|
208
|
+
const result = await extractBytes(doc.bytes, doc.mimeType);
|
|
209
|
+
results[doc.name] = result.content;
|
|
210
|
+
} catch (error) {
|
|
211
|
+
console.error(`Failed to process ${doc.name}:`, error);
|
|
212
|
+
}
|
|
213
|
+
}
|
|
214
|
+
});
|
|
215
|
+
|
|
216
|
+
await Promise.all(workers);
|
|
217
|
+
return results;
|
|
218
|
+
}
|
|
356
219
|
```
|
|
357
220
|
|
|
358
|
-
|
|
359
|
-
|
|
360
|
-
```typescript
|
|
361
|
-
import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
|
|
221
|
+
#### Async Processing
|
|
362
222
|
|
|
363
|
-
|
|
364
|
-
try {
|
|
365
|
-
// Initialize WASM module
|
|
366
|
-
await initWasm();
|
|
223
|
+
For non-blocking document processing:
|
|
367
224
|
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
try {
|
|
371
|
-
await initThreadPool(cpuCount);
|
|
372
|
-
console.log(`Thread pool initialized with ${cpuCount} workers`);
|
|
373
|
-
} catch (error) {
|
|
374
|
-
console.warn('Could not initialize thread pool, using single-threaded mode');
|
|
375
|
-
}
|
|
225
|
+
```ts
|
|
226
|
+
import { extractBytes, initWasm, getWasmCapabilities } from "@kreuzberg/wasm";
|
|
376
227
|
|
|
377
|
-
|
|
378
|
-
|
|
228
|
+
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
|
|
229
|
+
const caps = getWasmCapabilities();
|
|
230
|
+
if (!caps.hasWasm) {
|
|
231
|
+
throw new Error("WebAssembly not supported");
|
|
379
232
|
}
|
|
380
|
-
}
|
|
381
233
|
|
|
382
|
-
|
|
383
|
-
const bytes = new Uint8Array(await file.arrayBuffer());
|
|
234
|
+
await initWasm();
|
|
384
235
|
|
|
385
|
-
|
|
386
|
-
|
|
387
|
-
|
|
388
|
-
extract_images: true
|
|
389
|
-
});
|
|
236
|
+
const results = await Promise.all(
|
|
237
|
+
files.map((bytes, index) => extractBytes(bytes, mimeTypes[index]))
|
|
238
|
+
);
|
|
390
239
|
|
|
391
|
-
return
|
|
240
|
+
return results.map((r) => ({
|
|
241
|
+
content: r.content,
|
|
242
|
+
pageCount: r.metadata?.pageCount,
|
|
243
|
+
}));
|
|
392
244
|
}
|
|
393
245
|
|
|
394
|
-
|
|
395
|
-
|
|
246
|
+
const fileBytes = [new Uint8Array([1, 2, 3])];
|
|
247
|
+
const mimes = ["application/pdf"];
|
|
396
248
|
|
|
397
|
-
|
|
398
|
-
|
|
399
|
-
|
|
400
|
-
if (file) {
|
|
401
|
-
const result = await extractDocument(file);
|
|
402
|
-
console.log('Extracted text:', result.content);
|
|
403
|
-
}
|
|
404
|
-
});
|
|
249
|
+
extractDocuments(fileBytes, mimes)
|
|
250
|
+
.then((results) => console.log(results))
|
|
251
|
+
.catch(console.error);
|
|
405
252
|
```
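
The example above relies on `files` and `mimeTypes` having the same length; if they differ, `mimeTypes[index]` is `undefined` for some entries. A small guard can make that mismatch fail early. This is a sketch under that assumption, and `ExtractionInput` and `pairInputs` are hypothetical names, not package exports.

```typescript
// Hypothetical helper type: one document together with its MIME type.
interface ExtractionInput {
  bytes: Uint8Array;
  mimeType: string;
}

// Zip the two parallel arrays, rejecting mismatched lengths up front.
function pairInputs(files: Uint8Array[], mimeTypes: string[]): ExtractionInput[] {
  if (files.length !== mimeTypes.length) {
    throw new RangeError(
      `Got ${files.length} files but ${mimeTypes.length} MIME types`
    );
  }
  const pairs: ExtractionInput[] = [];
  for (let i = 0; i < files.length; i++) {
    const bytes = files[i];
    const mimeType = mimeTypes[i];
    // Also guards against sparse arrays with holes.
    if (bytes === undefined || mimeType === undefined) {
      throw new RangeError(`Missing file or MIME type at index ${i}`);
    }
    pairs.push({ bytes, mimeType });
  }
  return pairs;
}
```

Each resulting pair can then be passed to `extractBytes(pair.bytes, pair.mimeType)`.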
|
|
406
253
|
|
|
407
|
-
|
|
254
|
+
#### Worker Pool Usage
|
|
408
255
|
|
|
409
|
-
|
|
410
|
-
- **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
|
|
411
|
-
- **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
|
|
256
|
+
When Web Workers are available, use worker threads for parallel document processing without blocking the main thread:
|
|
412
257
|
|
|
413
|
-
|
|
258
|
+
```typescript
|
|
259
|
+
import { extractBytes, initWasm, hasWorkers, hasModuleWorkers } from '@kreuzberg/wasm';
|
|
414
260
|
|
|
415
|
-
|
|
261
|
+
class DocumentWorkerPool {
|
|
262
|
+
private workers: Worker[] = [];
|
|
263
|
+
private taskQueue: Array<{ id: number; data: Uint8Array; mimeType: string; resolve: Function; reject: Function }> = [];
|
|
264
|
+
private currentTaskId = 0;
|
|
416
265
|
|
|
417
|
-
|
|
266
|
+
constructor(workerCount: number = navigator.hardwareConcurrency || 4) {
|
|
267
|
+
// Module workers allow importing ES modules, standard workers are more compatible
|
|
268
|
+
const useModuleWorkers = hasModuleWorkers();
|
|
418
269
|
|
|
419
|
-
|
|
420
|
-
|
|
421
|
-
|
|
422
|
-
|
|
423
|
-
|
|
424
|
-
|
|
425
|
-
|
|
426
|
-
|
|
427
|
-
{
|
|
428
|
-
enable_ocr: true,
|
|
429
|
-
ocr_config: {
|
|
430
|
-
languages: ['eng'], // English
|
|
431
|
-
backend: 'tesseract-wasm'
|
|
270
|
+
for (let i = 0; i < workerCount; i++) {
|
|
271
|
+
const worker = useModuleWorkers
|
|
272
|
+
? new Worker(new URL('./extraction-worker.ts', import.meta.url), { type: 'module' })
|
|
273
|
+
: new Worker(new URL('./extraction-worker.js', import.meta.url));
|
|
274
|
+
|
|
275
|
+
worker.onmessage = (event) => this.handleWorkerMessage(event.data);
|
|
276
|
+
worker.onerror = (error) => this.handleWorkerError(error);
|
|
277
|
+
this.workers.push(worker);
|
|
432
278
|
}
|
|
433
279
|
}
|
|
434
|
-
);
|
|
435
|
-
|
|
436
|
-
console.log('OCR text:', result.content);
|
|
437
|
-
```
|
|
438
|
-
|
|
439
|
-
### Multi-Language OCR
|
|
440
280
|
|
|
441
|
-
|
|
442
|
-
|
|
443
|
-
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
|
|
281
|
+
async extract(data: Uint8Array, mimeType: string): Promise<string> {
|
|
282
|
+
return new Promise((resolve, reject) => {
|
|
283
|
+
this.taskQueue.push({
|
|
284
|
+
id: this.currentTaskId++,
|
|
285
|
+
data,
|
|
286
|
+
mimeType,
|
|
287
|
+
resolve,
|
|
288
|
+
reject
|
|
289
|
+
});
|
|
290
|
+
this.processQueue();
|
|
291
|
+
});
|
|
447
292
|
}
|
|
448
|
-
});
|
|
449
|
-
```
|
|
450
|
-
|
|
451
|
-
### Supported Languages
|
|
452
|
-
|
|
453
|
-
`eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
|
|
454
|
-
|
|
455
|
-
Training data is automatically loaded from jsDelivr CDN:
|
|
456
|
-
```
|
|
457
|
-
https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
|
|
458
|
-
```
|
|
459
|
-
|
|
460
|
-
## Configuration
|
|
461
|
-
|
|
462
|
-
### Extract Tables
|
|
463
|
-
|
|
464
|
-
```typescript
|
|
465
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
466
293
|
|
|
467
|
-
|
|
468
|
-
|
|
469
|
-
|
|
470
|
-
|
|
471
|
-
|
|
472
|
-
|
|
473
|
-
|
|
474
|
-
|
|
475
|
-
|
|
476
|
-
console.log('Table cells:');
|
|
477
|
-
console.log(JSON.stringify(table.cells, null, 2));
|
|
294
|
+
private processQueue(): void {
|
|
295
|
+
while (this.taskQueue.length > 0) {
|
|
296
|
+
const task = this.taskQueue.shift();
|
|
297
|
+
if (task) {
|
|
298
|
+
const worker = this.workers[task.id % this.workers.length];
|
|
299
|
+
worker.postMessage({ id: task.id, data: task.data, mimeType: task.mimeType });
|
|
300
|
+
}
|
|
301
|
+
}
|
|
478
302
|
}
|
|
479
|
-
}
|
|
480
|
-
```
|
|
481
|
-
|
|
482
|
-
### Extract Images
|
|
483
|
-
|
|
484
|
-
```typescript
|
|
485
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
486
303
|
|
|
487
|
-
|
|
488
|
-
|
|
489
|
-
|
|
490
|
-
|
|
491
|
-
|
|
304
|
+
private handleWorkerMessage(data: { id: number; result: string }): void {
|
|
305
|
+
const task = this.taskQueue.find(t => t.id === data.id);
|
|
306
|
+
if (task) {
|
|
307
|
+
task.resolve(data.result);
|
|
308
|
+
this.processQueue();
|
|
309
|
+
}
|
|
492
310
|
}
|
|
493
|
-
});
|
|
494
311
|
|
|
495
|
-
|
|
496
|
-
|
|
497
|
-
console.log(`Image ${image.index}: ${image.format}`);
|
|
498
|
-
// image.data is a Uint8Array
|
|
312
|
+
private handleWorkerError(error: ErrorEvent): void {
|
|
313
|
+
console.error('Worker error:', error.message);
|
|
499
314
|
}
|
|
500
|
-
}
|
|
501
|
-
```
|
|
502
|
-
|
|
503
|
-
### Text Chunking
|
|
504
|
-
|
|
505
|
-
```typescript
|
|
506
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
507
315
|
|
|
508
|
-
|
|
509
|
-
|
|
510
|
-
chunking_config: {
|
|
511
|
-
max_chars: 1000,
|
|
512
|
-
max_overlap: 200
|
|
316
|
+
terminate(): void {
|
|
317
|
+
this.workers.forEach(w => w.terminate());
|
|
513
318
|
}
|
|
514
|
-
}
|
|
319
|
+
}
|
|
515
320
|
|
|
516
|
-
|
|
517
|
-
|
|
518
|
-
|
|
321
|
+
// Usage
|
|
322
|
+
async function processDocumentsInParallel() {
|
|
323
|
+
if (!hasWorkers()) {
|
|
324
|
+
console.log('Web Workers not available, falling back to main thread');
|
|
325
|
+
return;
|
|
519
326
|
}
|
|
520
|
-
}
|
|
521
|
-
```
|
|
522
327
|
|
|
523
|
-
|
|
328
|
+
await initWasm();
|
|
329
|
+
const pool = new DocumentWorkerPool(4);
|
|
524
330
|
|
|
525
|
-
|
|
526
|
-
|
|
331
|
+
const documents = [
|
|
332
|
+
{ data: new Uint8Array([...]), mimeType: 'application/pdf' },
|
|
333
|
+
{ data: new Uint8Array([...]), mimeType: 'application/pdf' },
|
|
334
|
+
];
|
|
527
335
|
|
|
528
|
-
const
|
|
529
|
-
|
|
530
|
-
|
|
336
|
+
const results = await Promise.all(
|
|
337
|
+
documents.map(doc => pool.extract(doc.data, doc.mimeType))
|
|
338
|
+
);
|
|
531
339
|
|
|
532
|
-
|
|
533
|
-
|
|
534
|
-
console.log(`Confidence: ${result.language.confidence}`);
|
|
340
|
+
pool.terminate();
|
|
341
|
+
return results;
|
|
535
342
|
}
|
|
536
343
|
```
|
|
537
344
|
|
|
538
|
-
|
|
345
|
+
Worker code (`extraction-worker.ts`):
|
|
539
346
|
|
|
540
347
|
```typescript
|
|
541
|
-
import {
|
|
542
|
-
extractBytes,
|
|
543
|
-
type ExtractionConfig
|
|
544
|
-
} from '@kreuzberg/wasm';
|
|
545
|
-
|
|
546
|
-
const config: ExtractionConfig = {
|
|
547
|
-
extract_tables: true,
|
|
548
|
-
extract_images: true,
|
|
549
|
-
extract_metadata: true,
|
|
550
|
-
|
|
551
|
-
enable_ocr: true,
|
|
552
|
-
ocr_config: {
|
|
553
|
-
languages: ['eng'],
|
|
554
|
-
backend: 'tesseract-wasm',
|
|
555
|
-
dpi: 300,
|
|
556
|
-
preprocessing: {
|
|
557
|
-
deskew: true,
|
|
558
|
-
denoise: true,
|
|
559
|
-
binarize: true
|
|
560
|
-
}
|
|
561
|
-
},
|
|
562
|
-
|
|
563
|
-
enable_chunking: true,
|
|
564
|
-
chunking_config: {
|
|
565
|
-
max_chars: 1000,
|
|
566
|
-
max_overlap: 200
|
|
567
|
-
},
|
|
348
|
+
import { extractBytes, initWasm } from '@kreuzberg/wasm';
|
|
568
349
|
|
|
569
|
-
|
|
350
|
+
let wasmInitialized = false;
|
|
570
351
|
|
|
571
|
-
|
|
352
|
+
self.onmessage = async (event) => {
|
|
353
|
+
if (!wasmInitialized) {
|
|
354
|
+
await initWasm();
|
|
355
|
+
wasmInitialized = true;
|
|
356
|
+
}
|
|
572
357
|
|
|
573
|
-
|
|
574
|
-
|
|
575
|
-
|
|
576
|
-
|
|
358
|
+
const { id, data, mimeType } = event.data;
|
|
359
|
+
try {
|
|
360
|
+
const result = await extractBytes(new Uint8Array(data), mimeType);
|
|
361
|
+
self.postMessage({ id, result: result.content });
|
|
362
|
+
} catch (error) {
|
|
363
|
+
self.postMessage({ id, error: (error as Error).message });
|
|
577
364
|
}
|
|
578
365
|
};
|
|
579
|
-
|
|
580
|
-
const result = await extractBytes(data, mimeType, config);
|
|
581
366
|
```
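
One detail worth noting in the pool above: `processQueue()` shifts each task off `taskQueue` before the worker replies, so the later `taskQueue.find(...)` in `handleWorkerMessage` would not locate it, and the `{ id, error }` reply sent by the worker has no handler. One way to address this is to keep the promise callbacks in a map keyed by task id until the reply arrives. The sketch below is an assumption about how that could look; `PendingTasks`, `Settlers`, and `WorkerReply` are hypothetical names.

```typescript
// Reply shape posted by the worker above: `result` on success, `error` on failure.
interface WorkerReply {
  id: number;
  result?: string;
  error?: string;
}

// Promise callbacks kept until the matching reply arrives.
interface Settlers {
  resolve: (content: string) => void;
  reject: (reason: Error) => void;
}

class PendingTasks {
  private readonly pending = new Map<number, Settlers>();
  private nextId = 0;

  // Store the callbacks and return the id to send along with postMessage.
  register(settlers: Settlers): number {
    const id = this.nextId++;
    this.pending.set(id, settlers);
    return id;
  }

  // Settle the matching promise; returns false for unknown or repeated ids.
  settle(reply: WorkerReply): boolean {
    const settlers = this.pending.get(reply.id);
    if (!settlers) return false;
    this.pending.delete(reply.id);
    if (reply.error !== undefined) {
      settlers.reject(new Error(reply.error));
    } else {
      settlers.resolve(reply.result ?? "");
    }
    return true;
  }

  // Number of tasks still waiting for a reply.
  get size(): number {
    return this.pending.size;
  }
}
```

In the pool, `extract()` would call `register({ resolve, reject })` to obtain the id it posts to a worker, and `worker.onmessage` would forward `event.data` to `settle()`.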
|
|
582
367
|
|
|
583
|
-
|
|
368
|
+
### Memory Management
|
|
584
369
|
|
|
585
|
-
|
|
370
|
+
WASM memory is managed by the JavaScript garbage collector:
|
|
586
371
|
|
|
587
372
|
```typescript
|
|
588
|
-
import {
|
|
373
|
+
import { initWasm, extractBytes } from '@kreuzberg/wasm';
|
|
589
374
|
|
|
590
|
-
|
|
591
|
-
|
|
592
|
-
const files = Array.from(fileInput.files);
|
|
375
|
+
async function extractWithMemoryAwareness() {
|
|
376
|
+
await initWasm();
|
|
593
377
|
|
|
594
|
-
|
|
595
|
-
|
|
596
|
-
});
|
|
378
|
+
// Process documents one at a time to control memory usage
|
|
379
|
+
const documents = [/* ... */];
|
|
597
380
|
|
|
598
|
-
for (const
|
|
599
|
-
|
|
600
|
-
}
|
|
381
|
+
for (const doc of documents) {
|
|
382
|
+
const result = await extractBytes(doc, 'application/pdf');
|
|
601
383
|
|
|
602
|
-
//
|
|
603
|
-
|
|
604
|
-
const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
|
|
384
|
+
// Process result immediately
|
|
385
|
+
console.log(result.content);
|
|
605
386
|
|
|
606
|
-
|
|
607
|
-
|
|
387
|
+
// Result will be garbage collected when no longer referenced
|
|
388
|
+
// Explicitly clear large objects if needed
|
|
389
|
+
// gc(); // Requires --expose-gc flag
|
|
390
|
+
}
|
|
391
|
+
}
|
|
608
392
|
|
|
609
|
-
|
|
393
|
+
// Check available memory (browser only)
|
|
394
|
+
if (performance.memory) {
|
|
395
|
+
console.log('Memory usage:', {
|
|
396
|
+
usedJSHeapSize: performance.memory.usedJSHeapSize,
|
|
397
|
+
totalJSHeapSize: performance.memory.totalJSHeapSize,
|
|
398
|
+
jsHeapSizeLimit: performance.memory.jsHeapSizeLimit
|
|
399
|
+
});
|
|
400
|
+
}
|
|
401
|
+
```
|
|
610
402
|
|
|
611
|
-
|
|
612
|
-
import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
|
|
403
|
+
### Next Steps
|
|
613
404
|
|
|
614
|
-
|
|
615
|
-
|
|
405
|
+
- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
|
|
406
|
+
- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
|
|
407
|
+
- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
|
|
408
|
+
- **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
|
|
409
|
+
- **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
|
|
616
410
|
|
|
617
|
-
|
|
618
|
-
const results = batchExtractBytesSync(dataList, mimeTypes, config);
|
|
619
|
-
```
|
|
411
|
+
## WASM-Specific Implementation Details
|
|
620
412
|
|
|
621
|
-
###
|
|
413
|
+
### Initialization
|
|
622
414
|
|
|
623
|
-
|
|
415
|
+
WASM binaries must be loaded before extraction:
|
|
624
416
|
|
|
625
417
|
```typescript
|
|
626
|
-
import {
|
|
627
|
-
|
|
628
|
-
registerPostProcessor({
|
|
629
|
-
name: 'uppercase',
|
|
630
|
-
async process(result) {
|
|
631
|
-
return {
|
|
632
|
-
...result,
|
|
633
|
-
content: result.content.toUpperCase()
|
|
634
|
-
};
|
|
635
|
-
}
|
|
636
|
-
});
|
|
418
|
+
import { initWasm } from '@kreuzberg/wasm';
|
|
637
419
|
|
|
638
|
-
//
|
|
639
|
-
|
|
640
|
-
|
|
420
|
+
// Initialize once at application startup
|
|
421
|
+
await initWasm();
|
|
422
|
+
|
|
423
|
+
// Now extraction functions can be used
|
|
641
424
|
```
|
|
642
425
|
|
|
643
|
-
|
|
426
|
+
The init function:
|
|
427
|
+
- Downloads and instantiates the WASM binary
|
|
428
|
+
- Initializes the module's linear memory
|
|
429
|
+
- Prepares thread pools if available
|
|
430
|
+
- Throws if WASM is not supported in the environment
|
|
644
431
|
|
|
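Because initialization should happen once, several code paths that each need WASM can share a single guarded initializer. The following is a minimal sketch of that idea; `once` is a hypothetical helper rather than a package export, and this README does not state whether `initWasm` already guards against repeated calls.

```typescript
// Wraps an async initializer so that it runs at most once; later calls reuse
// the same promise. `once` is a hypothetical helper, not a package export.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      // Reset on failure so that a later call can retry initialization.
      cached = init().catch((error) => {
        cached = undefined;
        throw error;
      });
    }
    return cached;
  };
}

// Usage sketch (assumes `initWasm` imported from '@kreuzberg/wasm' as above):
// const ensureWasm = once(initWasm);
// await ensureWasm(); // safe to call from several code paths
```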
645
|
-
|
|
646
|
-
import { registerValidator } from '@kreuzberg/wasm';
|
|
432
|
+
### Threading Model
|
|
647
433
|
|
|
648
|
-
|
|
649
|
-
|
|
650
|
-
|
|
651
|
-
|
|
652
|
-
|
|
653
|
-
}
|
|
654
|
-
}
|
|
655
|
-
});
|
|
656
|
-
```
|
|
434
|
+
- Single-threaded by default (main thread execution)
|
|
435
|
+
- Web Workers optional for background processing
|
|
436
|
+
- Shared memory (SharedArrayBuffer) not required
|
|
437
|
+
- Message passing used for worker communication
|
|
438
|
+
- With a worker pool, extraction does not block the main thread
|
|
657
439
|
|
|
658
|
-
|
|
440
|
+
### Memory Considerations
|
|
659
441
|
|
|
660
|
-
|
|
661
|
-
|
|
662
|
-
|
|
663
|
-
|
|
664
|
-
|
|
665
|
-
supportedLanguages() {
|
|
666
|
-
return ['eng', 'fra'];
|
|
667
|
-
},
|
|
668
|
-
async initialize() {
|
|
669
|
-
// Initialize your OCR backend
|
|
670
|
-
},
|
|
671
|
-
async processImage(imageBytes, language) {
|
|
672
|
-
// Process image and return result
|
|
673
|
-
return {
|
|
674
|
-
content: 'extracted text',
|
|
675
|
-
mime_type: 'text/plain',
|
|
676
|
-
metadata: {},
|
|
677
|
-
tables: []
|
|
678
|
-
};
|
|
679
|
-
},
|
|
680
|
-
async shutdown() {
|
|
681
|
-
// Cleanup
|
|
682
|
-
}
|
|
683
|
-
});
|
|
684
|
-
```
|
|
442
|
+
- Each WASM instance has its own linear memory, with an address space of at most 4 GB
|
|
443
|
+
- Large documents (> 100 MB) may not fit in WASM memory
|
|
444
|
+
- Binary data is copied between JavaScript and WASM boundaries
|
|
445
|
+
- Garbage collection is handled by JavaScript runtime
|
|
446
|
+
- No manual memory management required
|
|
685
447
|
|
|
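The considerations above suggest checking a document's size before handing it to WASM. A minimal pre-flight sketch follows. The 100 MB default mirrors the "> 100 MB" guidance in the list; the 2x peak estimate is an assumption (one copy on the JavaScript side and one inside WASM memory), and `checkDocumentBudget` is a hypothetical helper, not part of the package.

```typescript
// Hypothetical pre-flight check based on the memory considerations above.
// The 100 MB default mirrors the README's "> 100 MB" guidance; the 2x copy
// factor is an assumption, not a measured figure.
const DEFAULT_MAX_DOCUMENT_BYTES = 100 * 1024 * 1024;

interface BudgetCheck {
  ok: boolean;
  estimatedPeakBytes: number;
}

function checkDocumentBudget(
  byteLength: number,
  maxDocumentBytes: number = DEFAULT_MAX_DOCUMENT_BYTES,
): BudgetCheck {
  return {
    // True when the document is within the configured size limit.
    ok: byteLength <= maxDocumentBytes,
    // Rough peak: the original buffer plus its copy across the WASM boundary.
    estimatedPeakBytes: byteLength * 2,
  };
}
```

A caller could skip or reroute documents for which `ok` is false instead of letting extraction fail partway through.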
686
|
-
###
|
|
448
|
+
### Supported Extraction Targets
|
|
687
449
|
|
|
688
|
-
|
|
689
|
-
import {
|
|
690
|
-
detectMimeFromBytes,
|
|
691
|
-
getMimeFromExtension,
|
|
692
|
-
getExtensionsForMime,
|
|
693
|
-
normalizeMimeType
|
|
694
|
-
} from '@kreuzberg/wasm';
|
|
450
|
+
Different file formats have varying support in WASM:
|
|
695
451
|
|
|
696
|
-
|
|
697
|
-
|
|
452
|
+
| Format | Support | Notes |
|
|
453
|
+
|--------|---------|-------|
|
|
454
|
+
| PDF | Full | Text, images, metadata extraction |
|
|
455
|
+
| Office (DOCX, XLSX, PPTX) | Full | All features supported |
|
|
456
|
+
| Images (PNG, JPG, etc.) | Full | EXIF metadata extraction |
|
|
457
|
+
| Archives (ZIP, TAR) | Full | Listing and extraction |
|
|
458
|
+
| OCR | Limited | Tesseract WASM only, main thread only |
|
|
459
|
+
| Embeddings | Not Available | WASM has no ML model support |
|
|
698
460
|
|
|
699
|
-
|
|
700
|
-
const mime = getMimeFromExtension('pdf'); // 'application/pdf'
|
|
461
|
+
### Platform Limitations
|
|
701
462
|
|
|
702
|
-
|
|
703
|
-
const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
|
|
463
|
+
**LibreOffice-Dependent Formats Not Available**
|
|
704
464
|
|
|
705
|
-
|
|
706
|
-
const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
|
|
707
|
-
```
|
|
465
|
+
WASM cannot load native LibreOffice binaries, so older Office formats are **not supported**:
|
|
708
466
|
|
|
709
|
-
|
|
467
|
+
- ❌ **DOC** (Microsoft Word 97-2003) - Use DOCX instead
|
|
468
|
+
- ❌ **XLS** (Microsoft Excel 97-2003) - Use XLSX instead
|
|
469
|
+
- ❌ **PPT** (Microsoft PowerPoint 97-2003) - Use PPTX instead
|
|
470
|
+
- ❌ **RTF** (Rich Text Format with complex features)
|
|
471
|
+
- ❌ **ODT/ODS/ODP** (LibreOffice/OpenOffice formats)
|
|
710
472
|
|
|
711
|
-
|
|
712
|
-
import { loadConfigFromString } from '@kreuzberg/wasm';
|
|
713
|
-
|
|
714
|
-
// Load from YAML
|
|
715
|
-
const yamlConfig = `
|
|
716
|
-
extract_tables: true
|
|
717
|
-
enable_ocr: true
|
|
718
|
-
ocr_config:
|
|
719
|
-
languages: [eng, deu]
|
|
720
|
-
`;
|
|
721
|
-
const config = loadConfigFromString(yamlConfig, 'yaml');
|
|
722
|
-
|
|
723
|
-
// Load from JSON
|
|
724
|
-
const jsonConfig = '{"extract_tables":true}';
|
|
725
|
-
const config2 = loadConfigFromString(jsonConfig, 'json');
|
|
726
|
-
|
|
727
|
-
// Load from TOML
|
|
728
|
-
const tomlConfig = 'extract_tables = true';
|
|
729
|
-
const config3 = loadConfigFromString(tomlConfig, 'toml');
|
|
730
|
-
```
|
|
473
|
+
Modern Office formats (DOCX, XLSX, PPTX) are fully supported and don't require LibreOffice.
|
|
731
474
|
|
|
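An application can detect these legacy formats up front and point users at a modern format or at the Node.js binding. The sketch below follows the list in this section only; `legacyFormatHint` and its return shape are hypothetical, and detection is by file extension, which is an assumption of this example.

```typescript
// Extensions this section lists as unavailable in WASM, mapped to the modern
// replacement where the README names one (null when it names none).
const LEGACY_FORMATS: Record<string, string | null> = {
  doc: "docx",
  xls: "xlsx",
  ppt: "pptx",
  rtf: null,
  odt: null,
  ods: null,
  odp: null,
};

interface LegacyFormatHint {
  needsNodeBinding: boolean;
  convertTo: string | null;
}

// Hypothetical helper: classifies a file name by its extension.
function legacyFormatHint(filename: string): LegacyFormatHint {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  if (Object.prototype.hasOwnProperty.call(LEGACY_FORMATS, ext)) {
    return { needsNodeBinding: true, convertTo: LEGACY_FORMATS[ext] ?? null };
  }
  return { needsNodeBinding: false, convertTo: null };
}
```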
732
|
-
|
|
475
|
+
**Polars Integration Not Available**
|
|
733
476
|
|
|
734
|
-
|
|
477
|
+
- ❌ Polars DataFrame extraction/conversion not available in WASM
|
|
478
|
+
- ❌ Structured data operations limited compared to Node.js binding
|
|
735
479
|
|
|
736
|
-
|
|
737
|
-
Extract content from a browser `File` object.
|
|
480
|
+
**Alternative: Use Node.js Binding**
|
|
738
481
|
|
|
739
|
-
|
|
740
|
-
Asynchronously extract content from a `Uint8Array`.
|
|
482
|
+
If you need support for older Office formats or Polars integration, use the `@kreuzberg/node` package instead:
|
|
741
483
|
|
|
742
|
-
|
|
743
|
-
|
|
484
|
+
```bash
|
|
485
|
+
npm install @kreuzberg/node
|
|
486
|
+
```
|
|
744
487
|
|
|
745
|
-
|
|
746
|
-
|
|
488
|
+
The Node.js binding provides:
|
|
489
|
+
- ✅ Full LibreOffice format support (DOC, XLS, PPT, RTF, ODT)
|
|
490
|
+
- ✅ Polars DataFrame integration
|
|
491
|
+
- ✅ All OCR backends (Tesseract, EasyOCR, PaddleOCR)
|
|
492
|
+
- ✅ Full embedding model support
|
|
747
493
|
|
|
748
|
-
|
|
749
|
-
Extract multiple byte arrays in parallel.
|
|
494
|
+
**Format Comparison Table**
|
|
750
495
|
|
|
751
|
-
|
|
752
|
-
|
|
496
|
+
| Format Type | WASM Support | Node.js Support |
|
|
497
|
+
|-------------|--------------|-----------------|
|
|
498
|
+
| Modern Office (DOCX/XLSX/PPTX) | ✅ Full | ✅ Full |
|
|
499
|
+
| Legacy Office (DOC/XLS/PPT) | ❌ Not Available | ✅ Requires LibreOffice |
|
|
500
|
+
| OpenOffice (ODT/ODS/ODP) | ❌ Not Available | ✅ Requires LibreOffice |
|
|
501
|
+
| PDF | ✅ Full | ✅ Full |
|
|
502
|
+
| Images | ✅ Full | ✅ Full |
|
|
503
|
+
| Embeddings | ❌ Not Available | ✅ With ONNX Runtime |
|
|
504
|
+
| Polars | ❌ Not Available | ✅ Available |
|
|
753
505
|
|
|
754
|
-
###
|
|
506
|
+
### Sandbox Security
|
|
755
507
|
|
|
756
|
-
|
|
508
|
+
- WASM code runs in a sandbox with restricted capabilities
|
|
509
|
+
- File system access requires user interaction (File API)
|
|
510
|
+
- Network access follows CORS restrictions
|
|
511
|
+
- No access to Node.js native modules
|
|
512
|
+
- Content Security Policy (CSP) may restrict WASM loading
|
|
757
513
|
|
|
758
|
-
|
|
759
|
-
registerPostProcessor(processor: PostProcessorProtocol): void
|
|
760
|
-
unregisterPostProcessor(name: string): void
|
|
761
|
-
clearPostProcessors(): void
|
|
762
|
-
listPostProcessors(): string[]
|
|
763
|
-
```
|
|
514
|
+
## Features
|
|
764
515
|
|
|
765
|
-
|
|
516
|
+
### Supported File Formats (56+)
|
|
766
517
|
|
|
767
|
-
|
|
768
|
-
registerValidator(validator: ValidatorProtocol): void
|
|
769
|
-
unregisterValidator(name: string): void
|
|
770
|
-
clearValidators(): void
|
|
771
|
-
listValidators(): string[]
|
|
772
|
-
```
|
|
518
|
+
56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
|
|
773
519
|
|
|
774
|
-
####
|
|
520
|
+
#### Office Documents
|
|
775
521
|
|
|
776
|
-
|
|
777
|
-
|
|
778
|
-
|
|
779
|
-
|
|
780
|
-
|
|
781
|
-
|
|
522
|
+
| Category | Formats | Capabilities |
|
|
523
|
+
|----------|---------|--------------|
|
|
524
|
+
| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
|
|
525
|
+
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
|
|
526
|
+
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
|
|
527
|
+
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
|
|
528
|
+
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
|
|
782
529
|
|
|
783
|
-
|
|
530
|
+
#### Images (OCR-Enabled)
|
|
784
531
|
|
|
785
|
-
|
|
786
|
-
|
|
787
|
-
|
|
788
|
-
|
|
789
|
-
|
|
532
|
+
| Category | Formats | Features |
|
|
533
|
+
|----------|---------|----------|
|
|
534
|
+
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
|
|
535
|
+
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
|
|
536
|
+
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
|
|
790
537
|
|
|
791
|
-
|
|
538
|
+
#### Web & Data
|
|
792
539
|
|
|
793
|
-
|
|
794
|
-
|
|
795
|
-
|
|
796
|
-
|
|
797
|
-
|
|
798
|
-
```
|
|
540
|
+
| Category | Formats | Features |
|
|
541
|
+
|----------|---------|----------|
|
|
542
|
+
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
|
|
543
|
+
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
|
|
544
|
+
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
|
|
799
545
|
|
|
800
|
-
|
|
546
|
+
#### Email & Archives
|
|
801
547
|
|
|
802
|
-
|
|
803
|
-
|
|
804
|
-
|
|
548
|
+
| Category | Formats | Features |
|
|
549
|
+
|----------|---------|----------|
|
|
550
|
+
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
|
|
551
|
+
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
|
|
805
552
|
|
|
806
|
-
|
|
553
|
+
#### Academic & Scientific
|
|
807
554
|
|
|
808
|
-
|
|
809
|
-
|
|
810
|
-
|
|
811
|
-
|
|
555
|
+
| Category | Formats | Features |
|
|
556
|
+
|----------|---------|----------|
|
|
557
|
+
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
|
|
558
|
+
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
|
|
559
|
+
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
|
|
812
560
|
|
|
813
|
-
|
|
561
|
+
**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
|
|
814
562
|
|
|
815
|
-
|
|
563
|
+
### Key Capabilities
|
|
816
564
|
|
|
817
|
-
|
|
818
|
-
import type {
|
|
819
|
-
ExtractionResult,
|
|
820
|
-
ExtractionConfig,
|
|
821
|
-
OcrConfig,
|
|
822
|
-
ChunkingConfig,
|
|
823
|
-
ImageConfig,
|
|
824
|
-
KeywordsConfig,
|
|
825
|
-
Table,
|
|
826
|
-
ExtractedImage,
|
|
827
|
-
Chunk,
|
|
828
|
-
Metadata,
|
|
829
|
-
PostProcessorProtocol,
|
|
830
|
-
ValidatorProtocol,
|
|
831
|
-
OcrBackendProtocol
|
|
832
|
-
} from '@kreuzberg/core';
|
|
833
|
-
```
|
|
565
|
+
- **Text Extraction** - Extract all text content with position and formatting information
|
|
834
566
|
|
|
835
|
-
|
|
836
|
-
|
|
837
|
-
Main result object containing:
|
|
838
|
-
- `content: string` - Extracted text content
|
|
839
|
-
- `mime_type: string` - MIME type of the document
|
|
840
|
-
- `metadata?: Metadata` - Document metadata
|
|
841
|
-
- `tables?: Table[]` - Extracted tables
|
|
842
|
-
- `images?: ExtractedImage[]` - Extracted images
|
|
843
|
-
- `chunks?: Chunk[]` - Text chunks (if chunking enabled)
|
|
844
|
-
- `language?: LanguageInfo` - Detected language (if enabled)
|
|
845
|
-
- `keywords?: Keyword[]` - Extracted keywords (if enabled)
|
|
846
|
-
|
|
847
|
-
### ExtractionConfig
|
|
848
|
-
|
|
849
|
-
Configuration object for extraction:
|
|
850
|
-
- `extract_tables?: boolean` - Extract tables as structured data
|
|
851
|
-
- `extract_images?: boolean` - Extract embedded images
|
|
852
|
-
- `extract_metadata?: boolean` - Extract document metadata
|
|
853
|
-
- `enable_ocr?: boolean` - Enable OCR for images
|
|
854
|
-
- `ocr_config?: OcrConfig` - OCR settings
|
|
855
|
-
- `enable_chunking?: boolean` - Split text into semantic chunks
|
|
856
|
-
- `chunking_config?: ChunkingConfig` - Text chunking settings
|
|
857
|
-
- `enable_language_detection?: boolean` - Detect document language
|
|
858
|
-
- `enable_quality?: boolean` - Encoding detection, normalization
|
|
859
|
-
- `extract_keywords?: boolean` - Extract important keywords
|
|
860
|
-
- `keywords_config?: KeywordsConfig` - Keyword extraction settings
|
|
861
|
-
|
|
862
|
-
### Table
|
|
863
|
-
|
|
864
|
-
Extracted table structure:
|
|
865
|
-
- `markdown: string` - Table in Markdown format
|
|
866
|
-
- `cells: TableCell[][]` - 2D array of table cells
|
|
867
|
-
- `row_count: number` - Number of rows
|
|
868
|
-
- `column_count: number` - Number of columns
|
|
869
|
-
|
|
870
|
-
## Supported Formats
|
|
871
|
-
|
|
872
|
-
| Category | Formats |
|
|
873
|
-
|----------|---------|
|
|
874
|
-
| **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
|
|
875
|
-
| **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
|
|
876
|
-
| **Web** | HTML, XHTML, XML, EPUB |
|
|
877
|
-
| **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
|
|
878
|
-
| **Email** | EML, MSG |
|
|
879
|
-
| **Archives** | ZIP, TAR, 7Z |
|
|
880
|
-
| **Other** | And 30+ more formats |
|
|
881
|
-
|
|
882
|
-
## Build from Source
|
|
883
|
-
|
|
884
|
-
### Prerequisites
|
|
885
|
-
|
|
886
|
-
- Rust 1.75+ with `wasm32-unknown-unknown` target
|
|
887
|
-
- Node.js 18+ with pnpm
|
|
888
|
-
- wasm-pack
|
|
567
|
+
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
|
|
889
568
|
|
|
890
|
-
|
|
891
|
-
# Install Rust target
|
|
892
|
-
rustup target add wasm32-unknown-unknown
|
|
569
|
+
- **Table Extraction** - Parse tables with structure and cell content preservation
|
|
893
570
|
|
|
894
|
-
|
|
895
|
-
curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
|
|
571
|
+
- **Image Extraction** - Extract embedded images and render page previews
|
|
896
572
|
|
|
897
|
-
|
|
898
|
-
cd crates/kreuzberg-wasm
|
|
899
|
-
pnpm install
|
|
900
|
-
pnpm run build
|
|
573
|
+
- **OCR Support** - Integrate multiple OCR backends for scanned documents
|
|
901
574
|
|
|
902
|
-
|
|
903
|
-
pnpm test
|
|
904
|
-
```
|
|
575
|
+
- **Async/Await** - Non-blocking document processing with concurrent operations
|
|
905
576
|
|
|
906
|
-
|
|
577
|
+
- **Plugin System** - Extensible post-processing for custom text transformation
|
|
907
578
|
|
|
908
|
-
|
|
909
|
-
# For browsers (ESM modules)
|
|
910
|
-
pnpm run build:wasm:web
|
|
579
|
+
- **Batch Processing** - Efficiently process multiple documents in parallel
|
|
911
580
|
|
|
912
|
-
|
|
913
|
-
pnpm run build:wasm:bundler
|
|
581
|
+
- **Memory Efficient** - Stream large files without loading entirely into memory
|
|
914
582
|
|
|
915
|
-
|
|
916
|
-
pnpm run build:wasm:nodejs
|
|
583
|
+
- **Language Detection** - Detect and support multiple languages in documents
|
|
917
584
|
|
|
918
|
-
|
|
919
|
-
pnpm run build:wasm:deno
|
|
585
|
+
- **Configuration** - Fine-grained control over extraction behavior
|
|
920
586
|
|
|
921
|
-
|
|
922
|
-
pnpm run build:all
|
|
923
|
-
```
|
|
587
|
+
### Performance Characteristics
|
|
924
588
|
|
|
925
|
-
|
|
589
|
+
| Format | Speed | Memory | Notes |
|
|
590
|
+
|--------|-------|--------|-------|
|
|
591
|
+
| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
|
|
592
|
+
| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
|
|
593
|
+
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
|
|
594
|
+
| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
|
|
595
|
+
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
|
|
926
596
|
|
|
927
|
-
|
|
597
|
+
## OCR Support
|
|
928
598
|
|
|
929
|
-
|
|
599
|
+
Kreuzberg supports OCR for extracting text from scanned documents and images. In WASM builds the available backend is:
|
|
930
600
|
|
|
931
|
-
|
|
932
|
-
// ❌ Won't work
|
|
933
|
-
await extractFileSync('./document.pdf'); // Throws error
|
|
601
|
+
- **Tesseract-Wasm**
|
|
934
602
|
|
|
935
|
-
|
|
936
|
-
const bytes = await Deno.readFile('./document.pdf'); // Deno
|
|
937
|
-
const bytes = await fs.readFile('./document.pdf'); // Node.js
|
|
938
|
-
const bytes = await file.arrayBuffer(); // Browser
|
|
939
|
-
```
|
|
603
|
+
### OCR Configuration Example
|
|
940
604
|
|
|
941
|
-
|
|
605
|
+
```ts
|
|
606
|
+
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
942
607
|
|
|
943
|
-
|
|
608
|
+
async function extractWithOcr() {
|
|
609
|
+
await initWasm();
|
|
944
610
|
|
|
945
|
-
|
|
611
|
+
try {
|
|
612
|
+
await enableOcr();
|
|
613
|
+
console.log("OCR enabled successfully");
|
|
614
|
+
} catch (error) {
|
|
615
|
+
console.error("Failed to enable OCR:", error);
|
|
616
|
+
return;
|
|
617
|
+
}
|
|
946
618
|
|
|
947
|
-
|
|
619
|
+
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
|
|
948
620
|
|
|
949
|
-
|
|
621
|
+
const result = await extractBytes(bytes, "image/png", {
|
|
622
|
+
ocr: {
|
|
623
|
+
backend: "tesseract-wasm",
|
|
624
|
+
language: "eng",
|
|
625
|
+
},
|
|
626
|
+
});
|
|
950
627
|
|
|
951
|
-
|
|
628
|
+
console.log("Extracted text:");
|
|
629
|
+
console.log(result.content);
|
|
630
|
+
}
|
|
952
631
|
|
|
953
|
-
|
|
954
|
-
// Files > 2MB will throw an error in WASM builds
|
|
955
|
-
const largeHtml = new Uint8Array(3 * 1024 * 1024); // 3MB
|
|
956
|
-
await extractBytes(largeHtml, 'text/html');
|
|
957
|
-
// ❌ Throws: "HTML file size exceeds WASM limit of 2MB"
|
|
632
|
+
extractWithOcr().catch(console.error);
|
|
958
633
|
```
|
|
959
634
|
|
|
960
|
-
|
|
635
|
+
## Async Support
|
|
961
636
|
|
|
962
|
-
|
|
637
|
+
This binding provides full async/await support for non-blocking document processing:
|
|
963
638
|
|
|
964
|
-
|
|
639
|
+
```ts
|
|
640
|
+
import { extractBytes, initWasm, getWasmCapabilities } from "@kreuzberg/wasm";
|
|
965
641
|
|
|
966
|
-
|
|
967
|
-
|
|
968
|
-
|
|
969
|
-
|
|
970
|
-
|
|
642
|
+
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
|
|
643
|
+
const caps = getWasmCapabilities();
|
|
644
|
+
if (!caps.hasWasm) {
|
|
645
|
+
throw new Error("WebAssembly not supported");
|
|
646
|
+
}
|
|
971
647
|
|
|
972
|
-
|
|
973
|
-
- **Browser**: PDF extraction works out of the box
|
|
974
|
-
- **Deno/Node.js**: Use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) with native PDFium bindings
|
|
975
|
-
- **Cloudflare Workers**: PDF extraction is not currently supported
|
|
648
|
+
await initWasm();
|
|
976
649
|
|
|
977
|
-
|
|
650
|
+
const results = await Promise.all(
|
|
651
|
+
files.map((bytes, index) => extractBytes(bytes, mimeTypes[index]))
|
|
652
|
+
);
|
|
978
653
|
|
|
979
|
-
|
|
654
|
+
return results.map((r) => ({
|
|
655
|
+
content: r.content,
|
|
656
|
+
pageCount: r.metadata?.pageCount,
|
|
657
|
+
}));
|
|
658
|
+
}
|
|
980
659
|
|
|
981
|
-
|
|
660
|
+
const fileBytes = [new Uint8Array([1, 2, 3])];
|
|
661
|
+
const mimes = ["application/pdf"];
|
|
982
662
|
|
|
983
|
-
|
|
984
|
-
|
|
985
|
-
|
|
986
|
-
export default {
|
|
987
|
-
optimizeDeps: {
|
|
988
|
-
exclude: ['@kreuzberg/wasm']
|
|
989
|
-
}
|
|
990
|
-
}
|
|
663
|
+
extractDocuments(fileBytes, mimes)
|
|
664
|
+
.then((results) => console.log(results))
|
|
665
|
+
.catch(console.error);
|
|
991
666
|
```
|
|
992
667
|
|
|
993
|
-
|
|
994
|
-
```javascript
|
|
995
|
-
// webpack.config.js
|
|
996
|
-
module.exports = {
|
|
997
|
-
experiments: {
|
|
998
|
-
asyncWebAssembly: true
|
|
999
|
-
}
|
|
1000
|
-
}
|
|
1001
|
-
```
|
|
668
|
+
## Plugin System
|
|
1002
669
|
|
|
1003
|
-
|
|
670
|
+
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
|
|
1004
671
|
|
|
1005
|
-
|
|
672
|
+
For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/plugins/).
|
|
1006
673
|
|
|
1007
|
-
|
|
1008
|
-
pnpm add @kreuzberg/core
|
|
1009
|
-
```
|
|
674
|
+
## Batch Processing
|
|
1010
675
|
|
|
1011
|
-
|
|
676
|
+
Process multiple documents efficiently:
|
|
1012
677
|
|
|
1013
|
-
|
|
678
|
+
```ts
|
|
679
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
1014
680
|
|
|
1015
|
-
|
|
1016
|
-
|
|
1017
|
-
|
|
1018
|
-
|
|
681
|
+
interface DocumentJob {
|
|
682
|
+
name: string;
|
|
683
|
+
bytes: Uint8Array;
|
|
684
|
+
mimeType: string;
|
|
685
|
+
}
|
|
686
|
+
|
|
687
|
+
async function processBatch(documents: DocumentJob[], concurrency: number = 3) {
|
|
688
|
+
await initWasm();
|
|
689
|
+
|
|
690
|
+
const results: Record<string, string> = {};
|
|
691
|
+
const queue = [...documents];
|
|
692
|
+
|
|
693
|
+
const workers = Array(concurrency)
|
|
694
|
+
.fill(null)
|
|
695
|
+
.map(async () => {
|
|
696
|
+
while (queue.length > 0) {
|
|
697
|
+
const doc = queue.shift();
|
|
698
|
+
if (!doc) break;
|
|
699
|
+
|
|
700
|
+
try {
|
|
701
|
+
const result = await extractBytes(doc.bytes, doc.mimeType);
|
|
702
|
+
results[doc.name] = result.content;
|
|
703
|
+
} catch (error) {
|
|
704
|
+
console.error(`Failed to process ${doc.name}:`, error);
|
|
705
|
+
}
|
|
706
|
+
}
|
|
707
|
+
});
|
|
708
|
+
|
|
709
|
+
await Promise.all(workers);
|
|
710
|
+
return results;
|
|
711
|
+
}
|
|
1019
712
|
```
|
|
1020
713
|
|
|
1021
|
-
|
|
714
|
+
## Configuration
|
|
1022
715
|
|
|
1023
|
-
|
|
716
|
+
For advanced configuration options including language detection, table extraction, OCR settings, and more:
|
|
1024
717
|
|
|
1025
|
-
|
|
718
|
+
**[Configuration Guide](https://kreuzberg.dev/configuration/)**
|
|
1026
719
|
|
|
1027
|
-
|
|
720
|
+
## Documentation
|
|
1028
721
|
|
|
1029
|
-
- **
|
|
1030
|
-
- **
|
|
1031
|
-
- **
|
|
1032
|
-
- **Node.js**: Batch processing script
|
|
722
|
+
- **[Official Documentation](https://kreuzberg.dev/)**
|
|
723
|
+
- **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
|
|
724
|
+
- **[Examples & Guides](https://kreuzberg.dev/guides/)**
|
|
1033
725
|
|
|
1034
|
-
##
|
|
726
|
+
## Troubleshooting
|
|
1035
727
|
|
|
1036
|
-
For
|
|
728
|
+
For common issues and solutions, visit [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
|
|
1037
729
|
|
|
1038
730
|
## Contributing
|
|
1039
731
|
|
|
1040
|
-
|
|
732
|
+
Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
|
|
1041
733
|
|
|
1042
734
|
## License
|
|
1043
735
|
|
|
1044
|
-
MIT
|
|
1045
|
-
|
|
1046
|
-
## Links
|
|
1047
|
-
|
|
1048
|
-
- [Website](https://kreuzberg.dev)
|
|
1049
|
-
- [Documentation](https://kreuzberg.dev)
|
|
1050
|
-
- [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
|
|
1051
|
-
- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
|
1052
|
-
- [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
|
|
1053
|
-
- [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
|
|
736
|
+
MIT License - see LICENSE file for details.
|
|
1054
737
|
|
|
1055
|
-
##
|
|
738
|
+
## Support
|
|
1056
739
|
|
|
1057
|
-
- [
|
|
1058
|
-
- [
|
|
1059
|
-
- [
|
|
740
|
+
- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
|
|
741
|
+
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
|
742
|
+
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
|