@kreuzberg/wasm 4.0.0-rc.10
- package/README.md +982 -0
- package/dist/adapters/wasm-adapter.cjs +245 -0
- package/dist/adapters/wasm-adapter.cjs.map +1 -0
- package/dist/adapters/wasm-adapter.d.cts +121 -0
- package/dist/adapters/wasm-adapter.d.ts +121 -0
- package/dist/adapters/wasm-adapter.js +224 -0
- package/dist/adapters/wasm-adapter.js.map +1 -0
- package/dist/index.cjs +4335 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +466 -0
- package/dist/index.d.ts +466 -0
- package/dist/index.js +4308 -0
- package/dist/index.js.map +1 -0
- package/dist/ocr/registry.cjs +92 -0
- package/dist/ocr/registry.cjs.map +1 -0
- package/dist/ocr/registry.d.cts +102 -0
- package/dist/ocr/registry.d.ts +102 -0
- package/dist/ocr/registry.js +71 -0
- package/dist/ocr/registry.js.map +1 -0
- package/dist/ocr/tesseract-wasm-backend.cjs +3566 -0
- package/dist/ocr/tesseract-wasm-backend.cjs.map +1 -0
- package/dist/ocr/tesseract-wasm-backend.d.cts +257 -0
- package/dist/ocr/tesseract-wasm-backend.d.ts +257 -0
- package/dist/ocr/tesseract-wasm-backend.js +3551 -0
- package/dist/ocr/tesseract-wasm-backend.js.map +1 -0
- package/dist/runtime.cjs +174 -0
- package/dist/runtime.cjs.map +1 -0
- package/dist/runtime.d.cts +256 -0
- package/dist/runtime.d.ts +256 -0
- package/dist/runtime.js +153 -0
- package/dist/runtime.js.map +1 -0
- package/dist/types-CKjcIYcX.d.cts +294 -0
- package/dist/types-CKjcIYcX.d.ts +294 -0
- package/package.json +140 -0
package/README.md ADDED
@@ -0,0 +1,982 @@

# Kreuzberg

[crates.io](https://crates.io/crates/kreuzberg)
[PyPI](https://pypi.org/project/kreuzberg/)
[npm: @kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node)
[npm: @kreuzberg/wasm](https://www.npmjs.com/package/@kreuzberg/wasm)
[RubyGems](https://rubygems.org/gems/kreuzberg)
[Maven Central](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
[Go](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
[NuGet](https://www.nuget.org/packages/Goldziher.Kreuzberg/)

[License: MIT](https://opensource.org/licenses/MIT)
[Documentation](https://kreuzberg.dev/)
[Discord](https://discord.gg/pXxagNK2zN)

High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.

Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.

> **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
>
> **This WASM package is designed for:**
> - Browser applications (including web workers)
> - Cloudflare Workers and edge runtimes
> - Deno applications
> - Environments without native build toolchain

> **🚀 Version 4.0.0 Release Candidate**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.

## Features

- **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
- **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
- **Table Extraction**: Advanced table detection and structured data extraction
- **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
- **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
- **API Parity**: All extraction functions from the Node.js binding
- **Plugin System**: Custom post-processors, validators, and OCR backends
- **Optimized Bundle**: <5MB uncompressed, <2MB compressed
- **Zero Configuration**: Works out of the box with sensible defaults
- **Portable**: Runs anywhere WASM is supported without native dependencies

## Requirements

- **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
- **Node.js**: 18 or higher
- **Deno**: 1.0 or higher
- **Cloudflare Workers**: Compatible with Workers runtime

### Optional Dependencies

- **tesseract-wasm**: Automatically loaded for OCR functionality (40+ language support)

## Installation

### Choosing the Right Package

| Use Case | Recommendation | Reason |
|----------|---|---|
| **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
| **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
| **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
| **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
| **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |

### Install via npm/pnpm/yarn

```bash
npm install @kreuzberg/wasm
```

Or with pnpm:

```bash
pnpm add @kreuzberg/wasm
```

Or with yarn:

```bash
yarn add @kreuzberg/wasm
```

### Deno

```typescript
// Note: a ^4.0.0 range does not match pre-releases such as 4.0.0-rc.10;
// pin the exact pre-release version while 4.0.0 is still a release candidate.
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
```

## Quick Start

### Browser (ESM)

```typescript
import { extractFile } from '@kreuzberg/wasm';

async function handleFileUpload() {
  const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
  const file = fileInput?.files?.[0];
  if (!file) return; // no input element or no file selected

  // Documented signature: extractFile(file, mimeType?, config?)
  const result = await extractFile(file, file.type, {
    extract_tables: true,
    extract_images: true
  });

  console.log('Extracted text:', result.content);
  // tables is optional on ExtractionResult
  console.log('Tables found:', result.tables?.length ?? 0);
}
```

### Node.js (ESM)

```typescript
import { extractBytes } from '@kreuzberg/wasm';
import { readFile } from 'fs/promises';

const pdfBytes = await readFile('./document.pdf');
const result = await extractBytes(
  new Uint8Array(pdfBytes),
  'application/pdf',
  { extract_tables: true }
);

console.log(result.content);
// tables is optional on ExtractionResult
console.log('Found', result.tables?.length ?? 0, 'tables');
```

### Deno

```typescript
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";

const pdfBytes = await Deno.readFile("./document.pdf");
const result = await extractBytes(pdfBytes, "application/pdf");

console.log(result.content);
```

### Cloudflare Workers

```typescript
import { extractBytes } from '@kreuzberg/wasm';

export default {
  async fetch(request: Request): Promise<Response> {
    if (request.method === 'POST') {
      const formData = await request.formData();
      const file = formData.get('file');

      // formData.get() may return null or a string, so check before use
      if (!(file instanceof File)) {
        return new Response('Missing "file" field', { status: 400 });
      }

      const arrayBuffer = await file.arrayBuffer();
      const bytes = new Uint8Array(arrayBuffer);

      const result = await extractBytes(bytes, file.type);

      return Response.json({
        text: result.content,
        metadata: result.metadata,
        tables: result.tables
      });
    }

    return new Response('Upload a file', { status: 400 });
  }
};
```

## Performance Comparison

Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:

| Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
|--------|---|---|---|
| **PDF extraction** | 100ms (baseline) | 120-170ms (~60-80% of native speed) | WASM slower due to JS/WASM boundary calls |
| **OCR processing** | ~500ms | ~600-700ms (~60-80% of native speed) | Performance gap increases with image size |
| **Table extraction** | 50ms | 70-90ms (~60-80% of native speed) | Consistent overhead from WASM compilation |
| **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
| **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |

### When to Use WASM vs Native

**Use WASM (@kreuzberg/wasm) when:**
- Building browser applications (no choice, WASM required)
- Targeting Cloudflare Workers or edge runtimes
- Supporting Deno applications
- You don't have a native build toolchain available
- Portability across runtimes is critical

**Use Native (@kreuzberg/node) when:**
- Building Node.js or Bun applications (2-3x faster)
- Performance is your primary concern
- You're processing large volumes of documents
- You have native build tools available

### Performance Tips for WASM

1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
3. **Cache WASM module** by loading it once per application
4. **Preload OCR models** by calling extraction with OCR enabled early

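The batching tip above can be sketched as a small grouping helper. This is an illustration only: `toBatches` and `ByteBatch` are hypothetical names, not part of `@kreuzberg/wasm`, and a good batch size depends on document sizes and available memory.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): split parallel arrays of
// document bytes and MIME types into fixed-size batches.
interface ByteBatch {
  dataList: Uint8Array[];
  mimeTypes: string[];
}

function toBatches(
  dataList: Uint8Array[],
  mimeTypes: string[],
  batchSize: number
): ByteBatch[] {
  if (dataList.length !== mimeTypes.length) {
    throw new Error('dataList and mimeTypes must have the same length');
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }

  const batches: ByteBatch[] = [];
  for (let start = 0; start < dataList.length; start += batchSize) {
    // slice() clamps at the end of the array, so the last batch may be shorter
    batches.push({
      dataList: dataList.slice(start, start + batchSize),
      mimeTypes: mimeTypes.slice(start, start + batchSize),
    });
  }
  return batches;
}
```

Each batch can then be passed to `batchExtractBytes(batch.dataList, batch.mimeTypes)`, so that one call covers several documents.
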
## Examples

Kreuzberg WASM includes complete working examples for different environments:

- **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
- **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
- **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.

See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.

## Multi-Threading with wasm-bindgen-rayon

Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.

### Initializing the Thread Pool

To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:

```typescript
import { initThreadPool, extractBytes } from '@kreuzberg/wasm';

// Initialize thread pool for multi-threaded extraction
await initThreadPool(navigator.hardwareConcurrency);

// Now extractions will use multiple threads for better performance
// (pdfBytes is a Uint8Array holding the PDF contents)
const result = await extractBytes(pdfBytes, 'application/pdf');
```

### Required HTTP Headers for SharedArrayBuffer

Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers.

**Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.

Set these headers in your server configuration:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

#### Server Configuration Examples

**Express.js:**
```javascript
app.use((req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
});
```

**Nginx:**
```nginx
add_header 'Cross-Origin-Opener-Policy' 'same-origin';
add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
```

**Apache:**
```apache
Header set Cross-Origin-Opener-Policy "same-origin"
Header set Cross-Origin-Embedder-Policy "require-corp"
```

**Cloudflare Workers:**
```typescript
export default {
  async fetch(request: Request): Promise<Response> {
    // Placeholder body; return your own HTML or asset content here
    const body = 'Hello from Kreuzberg';
    const response = new Response(body);
    response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
    response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
    return response;
  }
};
```

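When the response comes from an upstream `fetch()` rather than being constructed in the handler, its headers are immutable, so the two headers have to be applied to a copy. A minimal sketch using only the standard Fetch API; `withCrossOriginIsolation` is a hypothetical helper name, not something the package exports:

```typescript
// Hypothetical helper: return a copy of a response with the two
// cross-origin isolation headers set. Uses only standard Fetch API types.
function withCrossOriginIsolation(response: Response): Response {
  // Copy the existing headers so the original response is left untouched
  const headers = new Headers(response.headers);
  headers.set('Cross-Origin-Opener-Policy', 'same-origin');
  headers.set('Cross-Origin-Embedder-Policy', 'require-corp');

  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}
```

In a Worker this could wrap an asset response, for example `return withCrossOriginIsolation(await fetch(request));`.
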
### Browser Compatibility

Multi-threading with SharedArrayBuffer is available in:

- **Chrome/Edge**: 74+
- **Firefox**: 79+
- **Safari**: 15.2+
- **Opera**: 60+

In unsupported browsers or when headers are not set, the library automatically degrades to single-threaded mode.

### Graceful Degradation

Extraction works with or without a thread pool. If `initThreadPool()` fails or threading is unavailable, catch the error and continue in single-threaded mode:

```typescript
import { initThreadPool, extractBytes } from '@kreuzberg/wasm';

try {
  await initThreadPool(navigator.hardwareConcurrency);
  console.log('Multi-threading enabled');
} catch (error) {
  // Fall back to single-threaded processing
  console.warn('Multi-threading unavailable:', error);
  console.log('Using single-threaded extraction');
}

// Extraction will work in both cases
// (pdfBytes is a Uint8Array holding the PDF contents)
const result = await extractBytes(pdfBytes, 'application/pdf');
```

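Whether threading can work at all can also be checked before calling `initThreadPool()`. The sketch below is an assumption about how an application might do this, not the library's own detection logic; `canUseThreads` and `ThreadingEnv` are hypothetical names, while `SharedArrayBuffer` and `crossOriginIsolated` are the standard globals.

```typescript
// Hypothetical pre-check, not part of @kreuzberg/wasm.
interface ThreadingEnv {
  SharedArrayBuffer?: unknown;
  crossOriginIsolated?: boolean;
}

function canUseThreads(
  env: ThreadingEnv = globalThis as unknown as ThreadingEnv
): boolean {
  // Threads need shared memory between workers
  const hasSharedMemory = typeof env.SharedArrayBuffer === 'function';
  // Browsers set crossOriginIsolated to false when the COOP/COEP headers are
  // missing; runtimes that do not define it at all are not treated as blocked.
  const notBlocked = env.crossOriginIsolated !== false;
  return hasSharedMemory && notBlocked;
}
```

An application could skip the `initThreadPool()` call entirely when `canUseThreads()` returns `false`, and still keep the `try`/`catch` for the cases this check cannot predict.
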
### Complete Example with Thread Pool

```typescript
import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';

async function initializeKreuzbergWithThreading() {
  try {
    // Initialize WASM module
    await initWasm();

    // Initialize multi-threading
    const cpuCount = navigator.hardwareConcurrency || 1;
    try {
      await initThreadPool(cpuCount);
      console.log(`Thread pool initialized with ${cpuCount} workers`);
    } catch (error) {
      console.warn('Could not initialize thread pool, using single-threaded mode');
    }

  } catch (error) {
    console.error('Failed to initialize Kreuzberg:', error);
  }
}

async function extractDocument(file: File) {
  const bytes = new Uint8Array(await file.arrayBuffer());

  // Extraction will use multiple threads if available
  const result = await extractBytes(bytes, file.type, {
    extract_tables: true,
    extract_images: true
  });

  return result;
}

// Initialize once at app startup
await initializeKreuzbergWithThreading();

// Later, handle file uploads
const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
fileInput?.addEventListener('change', async () => {
  const file = fileInput.files?.[0];
  if (file) {
    const result = await extractDocument(file);
    console.log('Extracted text:', result.content);
  }
});
```

### Performance Considerations

- **Thread Pool Size**: `navigator.hardwareConcurrency` is usually a good default. For servers, use the number of available CPU cores.
- **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
- **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.

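Because each thread carries its own memory context, an application may want to cap the pool size rather than always use every reported core. A minimal sketch; `pickThreadCount` is a hypothetical helper and the default cap of 8 is an arbitrary assumption, not a value recommended by the library:

```typescript
// Hypothetical helper: choose a thread count from the reported core count,
// falling back to 1 and clamping to a caller-supplied maximum.
function pickThreadCount(reported: number | undefined, maxThreads = 8): number {
  const cores =
    typeof reported === 'number' && Number.isFinite(reported) && reported >= 1
      ? Math.floor(reported)
      : 1; // hardwareConcurrency may be missing or zero in some environments
  return Math.max(1, Math.min(cores, Math.floor(maxThreads)));
}
```

The result can be passed straight to `initThreadPool(pickThreadCount(navigator.hardwareConcurrency))`.
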
## OCR Support

The WASM binding integrates [tesseract-wasm](https://github.com/robertknight/tesseract-wasm) for OCR support with 40+ languages.

### Basic OCR

```typescript
import { extractBytes } from '@kreuzberg/wasm';

const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());

const result = await extractBytes(
  new Uint8Array(imageBytes),
  'image/jpeg',
  {
    enable_ocr: true,
    ocr_config: {
      languages: ['eng'], // English
      backend: 'tesseract-wasm'
    }
  }
);

console.log('OCR text:', result.content);
```

### Multi-Language OCR

```typescript
// imageBytes must be a Uint8Array holding a PNG image
// (wrap an ArrayBuffer with new Uint8Array(...) first)
const result = await extractBytes(imageBytes, 'image/png', {
  enable_ocr: true,
  ocr_config: {
    languages: ['eng', 'deu', 'fra'], // English, German, French
    backend: 'tesseract-wasm'
  }
});
```

### Supported Languages

`eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.

Training data is automatically loaded from jsDelivr CDN:
```
https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
```

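To see which URL a given language code maps to (for example when adding the CDN to a Content Security Policy or pre-warming a cache), the template above can be expanded by hand. This is a sketch: `trainedDataUrl` is a hypothetical helper, the base URL is copied from the template above, and the code-format check is an assumption about what language codes look like.

```typescript
// Base path copied from the URL template in this README
const TESSDATA_BASE = 'https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist';

// Hypothetical helper: expand the {lang} placeholder for one language code.
function trainedDataUrl(lang: string): string {
  // Accept codes shaped like "eng" or "chi_sim"; reject anything else so that
  // path separators or dots cannot end up in the URL.
  if (!/^[a-z]{3}(_[a-z]+)?$/.test(lang)) {
    throw new Error(`Unexpected language code: ${lang}`);
  }
  return `${TESSDATA_BASE}/${lang}.traineddata`;
}
```
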
## Configuration

### Extract Tables

```typescript
import { extractBytes } from '@kreuzberg/wasm';

const result = await extractBytes(pdfBytes, 'application/pdf', {
  extract_tables: true
});

if (result.tables) {
  for (const table of result.tables) {
    console.log('Table as Markdown:');
    console.log(table.markdown);

    console.log('Table cells:');
    console.log(JSON.stringify(table.cells, null, 2));
  }
}
```

### Extract Images

```typescript
import { extractBytes } from '@kreuzberg/wasm';

const result = await extractBytes(pdfBytes, 'application/pdf', {
  extract_images: true,
  image_config: {
    target_dpi: 300,
    max_image_dimension: 4096
  }
});

if (result.images) {
  for (const image of result.images) {
    console.log(`Image ${image.index}: ${image.format}`);
    // image.data is a Uint8Array
  }
}
```

### Text Chunking

```typescript
import { extractBytes } from '@kreuzberg/wasm';

const result = await extractBytes(pdfBytes, 'application/pdf', {
  enable_chunking: true,
  chunking_config: {
    max_chars: 1000,
    max_overlap: 200
  }
});

if (result.chunks) {
  for (const chunk of result.chunks) {
    console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
  }
}
```

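To build intuition for how `max_chars` and `max_overlap` interact, here is a deliberately naive fixed-width chunker. It is not the library's algorithm (which may split on sentence or paragraph boundaries); `chunkText` is a hypothetical function that only illustrates the windowing arithmetic.

```typescript
// Naive illustration only: fixed-width windows of maxChars characters,
// where each window starts (maxChars - maxOverlap) characters after the last.
function chunkText(text: string, maxChars: number, maxOverlap: number): string[] {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new Error('maxChars must be a positive integer');
  }
  if (!Number.isInteger(maxOverlap) || maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new Error('maxOverlap must be in the range [0, maxChars)');
  }

  const step = maxChars - maxOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    // Stop once a window has reached the end of the text
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}
```

With `max_chars: 1000` and `max_overlap: 200`, each window in this sketch starts 800 characters after the previous one, so consecutive chunks share up to 200 characters.
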
### Language Detection

```typescript
import { extractBytes } from '@kreuzberg/wasm';

const result = await extractBytes(pdfBytes, 'application/pdf', {
  enable_language_detection: true
});

if (result.language) {
  console.log(`Detected language: ${result.language.code}`);
  console.log(`Confidence: ${result.language.confidence}`);
}
```

### Complete Configuration Example

```typescript
import {
  extractBytes,
  type ExtractionConfig
} from '@kreuzberg/wasm';

const config: ExtractionConfig = {
  extract_tables: true,
  extract_images: true,
  extract_metadata: true,

  enable_ocr: true,
  ocr_config: {
    languages: ['eng'],
    backend: 'tesseract-wasm',
    dpi: 300,
    preprocessing: {
      deskew: true,
      denoise: true,
      binarize: true
    }
  },

  enable_chunking: true,
  chunking_config: {
    max_chars: 1000,
    max_overlap: 200
  },

  enable_language_detection: true,

  enable_quality: true,

  extract_keywords: true,
  keywords_config: {
    max_keywords: 10,
    method: 'yake'
  }
};

const result = await extractBytes(data, mimeType, config);
```

## Advanced Usage

### Batch Processing

```typescript
import { batchExtractFiles, batchExtractBytes } from '@kreuzberg/wasm';

// Browser: Process multiple files
const fileInput = document.querySelector<HTMLInputElement>('#files');
const files = Array.from(fileInput?.files ?? []);

const results = await batchExtractFiles(files, {
  extract_tables: true
});

for (const result of results) {
  console.log(`${result.mime_type}: ${result.content.length} characters`);
}

// Or from Uint8Arrays
const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];

const byteResults = await batchExtractBytes(dataList, mimeTypes);
```

### Synchronous Extraction

```typescript
import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';

// Synchronous single extraction
const result = extractBytesSync(data, 'application/pdf', config);

// Synchronous batch extraction
const results = batchExtractBytesSync(dataList, mimeTypes, config);
```

### Plugin System

#### Custom Post-Processors

```typescript
import { registerPostProcessor, extractBytes } from '@kreuzberg/wasm';

registerPostProcessor({
  name: 'uppercase',
  async process(result) {
    return {
      ...result,
      content: result.content.toUpperCase()
    };
  }
});

// Now all extractions will use this processor
const result = await extractBytes(data, mimeType);
console.log(result.content); // UPPERCASE TEXT
```

#### Custom Validators

```typescript
import { registerValidator } from '@kreuzberg/wasm';

registerValidator({
  name: 'min-length',
  async validate(result) {
    if (result.content.length < 100) {
      throw new Error('Document too short');
    }
  }
});
```

#### Custom OCR Backends

```typescript
import { registerOcrBackend } from '@kreuzberg/wasm';

registerOcrBackend({
  name: 'custom-ocr',
  supportedLanguages() {
    return ['eng', 'fra'];
  },
  async initialize() {
    // Initialize your OCR backend
  },
  async processImage(imageBytes, language) {
    // Process image and return result
    return {
      content: 'extracted text',
      mime_type: 'text/plain',
      metadata: {},
      tables: []
    };
  },
  async shutdown() {
    // Cleanup
  }
});
```

### MIME Type Detection

```typescript
import {
  detectMimeFromBytes,
  getMimeFromExtension,
  getExtensionsForMime,
  normalizeMimeType
} from '@kreuzberg/wasm';

// Auto-detect MIME type from file bytes
const mimeType = detectMimeFromBytes(fileBytes);

// Get MIME type from file extension
const mime = getMimeFromExtension('pdf'); // 'application/pdf'

// Get extensions for MIME type
const extensions = getExtensionsForMime('application/pdf'); // ['pdf']

// Normalize MIME type
const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
```

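Byte-based detection generally works by comparing the first bytes of a file against known signatures ("magic numbers"). The sketch below shows that idea for a handful of formats; it is not how `detectMimeFromBytes` is implemented, and `sniffMime` and `SIGNATURES` are hypothetical names.

```typescript
// A few well-known file signatures, for illustration only
const SIGNATURES: ReadonlyArray<{ mime: string; bytes: number[] }> = [
  { mime: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { mime: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
  // ZIP container; DOCX, XLSX and PPTX files start with the same bytes
  { mime: 'application/zip', bytes: [0x50, 0x4b, 0x03, 0x04] },
];

// Hypothetical helper: return the first signature that matches the start of
// the data, or null when nothing matches.
function sniffMime(data: Uint8Array): string | null {
  for (const { mime, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return mime;
    }
  }
  return null;
}
```

As the ZIP entry shows, a signature alone cannot distinguish Office documents from plain archives, which is one reason to pass an explicit MIME type to `extractBytes` when it is known.
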
### Configuration Loading

```typescript
import { loadConfigFromString } from '@kreuzberg/wasm';

// Load from YAML
const yamlConfig = `
extract_tables: true
enable_ocr: true
ocr_config:
  languages: [eng, deu]
`;
const config = loadConfigFromString(yamlConfig, 'yaml');

// Load from JSON
const jsonConfig = '{"extract_tables":true}';
const config2 = loadConfigFromString(jsonConfig, 'json');

// Load from TOML
const tomlConfig = 'extract_tables = true';
const config3 = loadConfigFromString(tomlConfig, 'toml');
```

## API Reference

### Extraction Functions

#### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
Extract content from a browser `File` object.

#### `extractBytes(data: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
Asynchronously extract content from a `Uint8Array`.

#### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
Synchronously extract content from a `Uint8Array`.

#### `batchExtractFiles(files: File[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
Extract multiple files in parallel.

#### `batchExtractBytes(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
Extract multiple byte arrays in parallel.

#### `batchExtractBytesSync(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): ExtractionResult[]`
Extract multiple byte arrays synchronously.

### Plugin Management

#### Post-Processors

```typescript
registerPostProcessor(processor: PostProcessorProtocol): void
unregisterPostProcessor(name: string): void
clearPostProcessors(): void
listPostProcessors(): string[]
```

#### Validators

```typescript
registerValidator(validator: ValidatorProtocol): void
unregisterValidator(name: string): void
clearValidators(): void
listValidators(): string[]
```

#### OCR Backends

```typescript
registerOcrBackend(backend: OcrBackendProtocol): void
unregisterOcrBackend(name: string): void
clearOcrBackends(): void
listOcrBackends(): string[]
```

### Document Extractors

```typescript
listDocumentExtractors(): string[]
unregisterDocumentExtractor(name: string): void
clearDocumentExtractors(): void
```

### MIME Utilities

```typescript
detectMimeFromBytes(data: Uint8Array): string
getMimeFromExtension(ext: string): string | null
getExtensionsForMime(mime: string): string[]
normalizeMimeType(mime: string): string
```

### Configuration

```typescript
loadConfigFromString(content: string, format: 'yaml' | 'toml' | 'json'): ExtractionConfig
```

### Embeddings

```typescript
listEmbeddingPresets(): string[]
getEmbeddingPreset(name: string): EmbeddingPreset | null
```

## Types
|
|
765
|
+
|
|
766
|
+
All types are shared via the `@kreuzberg/core` package:
|
|
767
|
+
|
|
768
|
+
```typescript
|
|
769
|
+
import type {
|
|
770
|
+
ExtractionResult,
|
|
771
|
+
ExtractionConfig,
|
|
772
|
+
OcrConfig,
|
|
773
|
+
ChunkingConfig,
|
|
774
|
+
ImageConfig,
|
|
775
|
+
KeywordsConfig,
|
|
776
|
+
Table,
|
|
777
|
+
ExtractedImage,
|
|
778
|
+
Chunk,
|
|
779
|
+
Metadata,
|
|
780
|
+
PostProcessorProtocol,
|
|
781
|
+
ValidatorProtocol,
|
|
782
|
+
OcrBackendProtocol
|
|
783
|
+
} from '@kreuzberg/core';
|
|
784
|
+
```

### ExtractionResult

Main result object containing:
- `content: string` - Extracted text content
- `mime_type: string` - MIME type of the document
- `metadata?: Metadata` - Document metadata
- `tables?: Table[]` - Extracted tables
- `images?: ExtractedImage[]` - Extracted images
- `chunks?: Chunk[]` - Text chunks (if chunking enabled)
- `language?: LanguageInfo` - Detected language (if enabled)
- `keywords?: Keyword[]` - Extracted keywords (if enabled)
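
Because most fields are optional, code that consumes a result generally needs to handle their absence. A minimal sketch, using a local `ResultSubset` interface that mirrors only some of the fields listed above (both `ResultSubset` and `summarizeResult` are hypothetical names, not part of the package):

```typescript
// Local stand-in for a few ExtractionResult fields; element types are left opaque.
interface ResultSubset {
  content: string;
  mime_type: string;
  tables?: unknown[];
  images?: unknown[];
  chunks?: unknown[];
}

// Build a one-line summary, treating missing optional arrays as empty.
function summarizeResult(result: ResultSubset): string {
  const parts = [
    `${result.mime_type}: ${result.content.length} chars`,
    `${result.tables?.length ?? 0} tables`,
    `${result.images?.length ?? 0} images`,
    `${result.chunks?.length ?? 0} chunks`,
  ];
  return parts.join(', ');
}
```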

### ExtractionConfig

Configuration object for extraction:
- `extract_tables?: boolean` - Extract tables as structured data
- `extract_images?: boolean` - Extract embedded images
- `extract_metadata?: boolean` - Extract document metadata
- `enable_ocr?: boolean` - Enable OCR for images
- `ocr_config?: OcrConfig` - OCR settings
- `enable_chunking?: boolean` - Split text into semantic chunks
- `chunking_config?: ChunkingConfig` - Text chunking settings
- `enable_language_detection?: boolean` - Detect document language
- `enable_quality?: boolean` - Encoding detection, normalization
- `extract_keywords?: boolean` - Extract important keywords
- `keywords_config?: KeywordsConfig` - Keyword extraction settings

### Table

Extracted table structure:
- `markdown: string` - Table in Markdown format
- `cells: TableCell[][]` - 2D array of table cells
- `row_count: number` - Number of rows
- `column_count: number` - Number of columns
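
The relationship between `cells`, `row_count`, and `column_count` can be sketched as a simple shape check. This assumes `row_count` counts every row in `cells` (header included) and that all rows have `column_count` entries, neither of which is stated above; `TableShape` and `hasConsistentShape` are hypothetical names, and the cell type is left opaque because `TableCell` is not described in this section.

```typescript
// Local stand-in for the Table fields listed above; Cell is intentionally opaque.
interface TableShape<Cell = unknown> {
  markdown: string;
  cells: Cell[][];
  row_count: number;
  column_count: number;
}

// True when the 2D cell array matches the declared row and column counts.
function hasConsistentShape(table: TableShape): boolean {
  return (
    table.cells.length === table.row_count &&
    table.cells.every((row) => row.length === table.column_count)
  );
}
```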

## Supported Formats

| Category | Formats |
|----------|---------|
| **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| **Web** | HTML, XHTML, XML, EPUB |
| **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
| **Email** | EML, MSG |
| **Archives** | ZIP, TAR, 7Z |
| **Other** | And 30+ more formats |

## Build from Source

### Prerequisites

- Rust 1.75+ with `wasm32-unknown-unknown` target
- Node.js 18+ with pnpm
- wasm-pack

```bash
# Install Rust target
rustup target add wasm32-unknown-unknown

# Install wasm-pack
curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh

# Build WASM package
cd crates/kreuzberg-wasm
pnpm install
pnpm run build

# Run tests
pnpm test
```

### Build Targets

```bash
# For browsers (ESM modules)
pnpm run build:wasm:web

# For bundlers (webpack, rollup, vite)
pnpm run build:wasm:bundler

# For Node.js
pnpm run build:wasm:nodejs

# For Deno
pnpm run build:wasm:deno

# Build all targets
pnpm run build:all
```

## Limitations

### No File System Access

The WASM binding cannot access the file system directly. Read the file with your runtime's own API and pass the bytes in:

```typescript
// ❌ Won't work
extractFileSync('./document.pdf'); // Throws error

// ✅ Use file readers instead (pick the line for your runtime)
const bytes = await Deno.readFile('./document.pdf'); // Deno
const bytes = await fs.readFile('./document.pdf'); // Node.js (fs from 'node:fs/promises')
const bytes = new Uint8Array(await file.arrayBuffer()); // Browser
```
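
The three readers return different types (`Uint8Array` in Deno, `Buffer` in Node.js, `ArrayBuffer` in the browser), so a small normalizing helper can keep calling code uniform. A minimal sketch, assuming the extraction functions expect a `Uint8Array`; `toUint8Array` is a hypothetical helper, not part of the package:

```typescript
// Normalize the byte containers returned by the different runtimes to a Uint8Array.
function toUint8Array(input: ArrayBuffer | ArrayBufferView): Uint8Array {
  // Already a Uint8Array (this includes Node.js Buffer, which subclasses it).
  if (input instanceof Uint8Array) return input;
  // Other views (DataView, Int16Array, ...): wrap the same memory region without copying.
  if (ArrayBuffer.isView(input)) {
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  // Plain ArrayBuffer, e.g. from file.arrayBuffer() in the browser.
  return new Uint8Array(input);
}
```

The helper never copies data, so the returned array shares memory with its input.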

### OCR Training Data

Tesseract training data (`.traineddata` files) is loaded from the jsDelivr CDN on first use. For offline use or a custom CDN, see the [OCR documentation](https://kreuzberg.dev).

### Size Constraints

Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.

## Troubleshooting

### "WASM module failed to initialize"

Ensure your bundler is configured to handle WASM files:

**Vite:**
```typescript
// vite.config.ts
export default {
  optimizeDeps: {
    exclude: ['@kreuzberg/wasm']
  }
}
```

**Webpack:**
```javascript
// webpack.config.js
module.exports = {
  experiments: {
    asyncWebAssembly: true
  }
}
```

### "Module not found: @kreuzberg/core"

The `@kreuzberg/core` package is a peer dependency, so it must be installed alongside `@kreuzberg/wasm`:

```bash
pnpm add @kreuzberg/core
```

### Memory Issues in Workers

For large documents in Cloudflare Workers, process in smaller chunks:

```typescript
const result = await extractBytes(pdfBytes, 'application/pdf', {
  chunking_config: { max_chars: 1000 }
});
```

### OCR Not Working

Check that tesseract-wasm is loading correctly. The training data is fetched automatically from the CDN on first use.

## Examples

See the [`examples/`](./examples/) directory for complete working examples:

- **Browser**: Vanilla JS file upload interface
- **Deno**: Command-line document extraction
- **Cloudflare Workers**: Document processing API
- **Node.js**: Batch processing script

## Documentation

For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)

## Contributing

We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/docs/contributing.md) for details.

## License

MIT

## Links

- [Website](https://kreuzberg.dev)
- [Documentation](https://kreuzberg.dev)
- [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
- [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
- [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)

## Related Packages

- [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) - Native Node.js bindings (NAPI)
- [@kreuzberg/core](https://www.npmjs.com/package/@kreuzberg/core) - Shared TypeScript types
- [kreuzberg](https://crates.io/crates/kreuzberg) - Rust core library