@kreuzberg/wasm 4.0.0-rc.6 → 4.0.1
- package/LICENSE +7 -0
- package/README.md +317 -801
- package/dist/adapters/wasm-adapter.d.ts +7 -10
- package/dist/adapters/wasm-adapter.d.ts.map +1 -0
- package/dist/adapters/wasm-adapter.js +53 -54
- package/dist/adapters/wasm-adapter.js.map +1 -1
- package/dist/index.d.ts +23 -67
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +1102 -104
- package/dist/index.js.map +1 -1
- package/dist/ocr/registry.d.ts +7 -10
- package/dist/ocr/registry.d.ts.map +1 -0
- package/dist/ocr/registry.js +9 -28
- package/dist/ocr/registry.js.map +1 -1
- package/dist/ocr/tesseract-wasm-backend.d.ts +3 -6
- package/dist/ocr/tesseract-wasm-backend.d.ts.map +1 -0
- package/dist/ocr/tesseract-wasm-backend.js +8 -83
- package/dist/ocr/tesseract-wasm-backend.js.map +1 -1
- package/dist/pdfium.js +77 -0
- package/dist/pkg/LICENSE +7 -0
- package/dist/pkg/README.md +498 -0
- package/dist/{kreuzberg_wasm.d.ts → pkg/kreuzberg_wasm.d.ts} +24 -12
- package/dist/{kreuzberg_wasm.js → pkg/kreuzberg_wasm.js} +224 -233
- package/dist/pkg/kreuzberg_wasm_bg.js +1871 -0
- package/dist/{kreuzberg_wasm_bg.wasm → pkg/kreuzberg_wasm_bg.wasm} +0 -0
- package/dist/{kreuzberg_wasm_bg.wasm.d.ts → pkg/kreuzberg_wasm_bg.wasm.d.ts} +10 -13
- package/dist/pkg/package.json +27 -0
- package/dist/plugin-registry.d.ts +246 -0
- package/dist/plugin-registry.d.ts.map +1 -0
- package/dist/runtime.d.ts +21 -22
- package/dist/runtime.d.ts.map +1 -0
- package/dist/runtime.js +21 -41
- package/dist/runtime.js.map +1 -1
- package/dist/types.d.ts +363 -0
- package/dist/types.d.ts.map +1 -0
- package/package.json +34 -51
- package/dist/adapters/wasm-adapter.d.mts +0 -121
- package/dist/adapters/wasm-adapter.mjs +0 -221
- package/dist/adapters/wasm-adapter.mjs.map +0 -1
- package/dist/index.d.mts +0 -466
- package/dist/index.mjs +0 -384
- package/dist/index.mjs.map +0 -1
- package/dist/kreuzberg_wasm.d.mts +0 -758
- package/dist/kreuzberg_wasm.mjs +0 -48
- package/dist/ocr/registry.d.mts +0 -102
- package/dist/ocr/registry.mjs +0 -70
- package/dist/ocr/registry.mjs.map +0 -1
- package/dist/ocr/tesseract-wasm-backend.d.mts +0 -257
- package/dist/ocr/tesseract-wasm-backend.mjs +0 -424
- package/dist/ocr/tesseract-wasm-backend.mjs.map +0 -1
- package/dist/runtime.d.mts +0 -256
- package/dist/runtime.mjs +0 -152
- package/dist/runtime.mjs.map +0 -1
- package/dist/snippets/wasm-bindgen-rayon-38edf6e439f6d70d/src/workerHelpers.js +0 -107
- package/dist/types-GJVIvbPy.d.mts +0 -221
- package/dist/types-GJVIvbPy.d.ts +0 -221
package/README.md
CHANGED
|
@@ -1,982 +1,498 @@
|
|
|
1
|
-
#
|
|
1
|
+
# WebAssembly
|
|
2
|
+
|
|
3
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
|
|
4
|
+
<!-- Language Bindings -->
|
|
5
|
+
<a href="https://crates.io/crates/kreuzberg">
|
|
6
|
+
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
|
|
7
|
+
</a>
|
|
8
|
+
<a href="https://hex.pm/packages/kreuzberg">
|
|
9
|
+
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
|
|
10
|
+
</a>
|
|
11
|
+
<a href="https://pypi.org/project/kreuzberg/">
|
|
12
|
+
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
|
|
13
|
+
</a>
|
|
14
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/node">
|
|
15
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
|
|
16
|
+
</a>
|
|
17
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
|
|
18
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
|
|
19
|
+
</a>
|
|
20
|
+
|
|
21
|
+
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
|
|
22
|
+
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
|
|
23
|
+
</a>
|
|
24
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
|
|
25
|
+
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
|
|
26
|
+
</a>
|
|
27
|
+
<a href="https://www.nuget.org/packages/Kreuzberg/">
|
|
28
|
+
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
|
|
29
|
+
</a>
|
|
30
|
+
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
|
|
31
|
+
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
|
|
32
|
+
</a>
|
|
33
|
+
<a href="https://rubygems.org/gems/kreuzberg">
|
|
34
|
+
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
|
|
35
|
+
</a>
|
|
36
|
+
|
|
37
|
+
<!-- Project Info -->
|
|
38
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
|
|
39
|
+
<img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
|
|
40
|
+
</a>
|
|
41
|
+
<a href="https://docs.kreuzberg.dev">
|
|
42
|
+
<img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
|
|
43
|
+
</a>
|
|
44
|
+
</div>
|
|
45
|
+
|
|
46
|
+
<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
|
|
47
|
+
|
|
48
|
+
<div align="center" style="margin-top: 20px;">
|
|
49
|
+
<a href="https://discord.gg/pXxagNK2zN">
|
|
50
|
+
<img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
|
|
51
|
+
</a>
|
|
52
|
+
</div>
|
|
53
|
+
|
|
54
|
+
|
|
55
|
+
Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
|
|
2
56
|
|
|
3
|
-
[](https://crates.io/crates/kreuzberg)
|
|
4
|
-
[](https://pypi.org/project/kreuzberg/)
|
|
5
|
-
[](https://www.npmjs.com/package/@kreuzberg/node)
|
|
6
|
-
[](https://www.npmjs.com/package/@kreuzberg/wasm)
|
|
7
|
-
[](https://rubygems.org/gems/kreuzberg)
|
|
8
|
-
[](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
|
|
9
|
-
[](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
|
|
10
|
-
[](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
|
|
11
57
|
|
|
12
|
-
|
|
13
|
-
[](https://kreuzberg.dev/)
|
|
14
|
-
[](https://discord.gg/pXxagNK2zN)
|
|
15
|
-
|
|
16
|
-
High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
|
|
17
|
-
|
|
18
|
-
Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
|
|
19
|
-
|
|
20
|
-
> **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
|
|
21
|
-
>
|
|
22
|
-
> **This WASM package is designed for:**
|
|
23
|
-
> - Browser applications (including web workers)
|
|
24
|
-
> - Cloudflare Workers and edge runtimes
|
|
25
|
-
> - Deno applications
|
|
26
|
-
> - Environments without native build toolchain
|
|
27
|
-
|
|
28
|
-
> **🚀 Version 4.0.0 Release Candidate**
|
|
29
|
-
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
|
|
30
|
-
|
|
31
|
-
## Features
|
|
32
|
-
|
|
33
|
-
- **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
|
|
34
|
-
- **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
|
|
35
|
-
- **Table Extraction**: Advanced table detection and structured data extraction
|
|
36
|
-
- **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
|
|
37
|
-
- **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
|
|
38
|
-
- **API Parity**: All extraction functions from the Node.js binding
|
|
39
|
-
- **Plugin System**: Custom post-processors, validators, and OCR backends
|
|
40
|
-
- **Optimized Bundle**: <5MB uncompressed, <2MB compressed
|
|
41
|
-
- **Zero Configuration**: Works out of the box with sensible defaults
|
|
42
|
-
- **Portable**: Runs anywhere WASM is supported without native dependencies
|
|
43
|
-
|
|
44
|
-
## Requirements
|
|
45
|
-
|
|
46
|
-
- **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
|
|
47
|
-
- **Node.js**: 18 or higher
|
|
48
|
-
- **Deno**: 1.0 or higher
|
|
49
|
-
- **Cloudflare Workers**: Compatible with Workers runtime
|
|
50
|
-
|
|
51
|
-
### Optional Dependencies
|
|
58
|
+
## Installation
|
|
52
59
|
|
|
53
|
-
|
|
60
|
+
### Package Installation
|
|
54
61
|
|
|
55
|
-
## Installation
|
|
56
62
|
|
|
57
|
-
|
|
63
|
+
Install via one of the supported package managers:
|
|
58
64
|
|
|
59
|
-
| Use Case | Recommendation | Reason |
|
|
60
|
-
|----------|---|---|
|
|
61
|
-
| **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
|
|
62
|
-
| **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
|
|
63
|
-
| **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
|
|
64
|
-
| **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
|
|
65
|
-
| **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
|
|
66
65
|
|
|
67
|
-
### Install via npm/pnpm/yarn
|
|
68
66
|
|
|
67
|
+
**npm:**
|
|
69
68
|
```bash
|
|
70
69
|
npm install @kreuzberg/wasm
|
|
71
70
|
```
|
|
72
71
|
|
|
73
|
-
Or with pnpm:
|
|
74
72
|
|
|
75
|
-
```bash
|
|
76
|
-
pnpm add @kreuzberg/wasm
|
|
77
|
-
```
|
|
78
73
|
|
|
79
|
-
Or with yarn:
|
|
80
74
|
|
|
75
|
+
**pnpm:**
|
|
81
76
|
```bash
|
|
82
|
-
|
|
83
|
-
```
|
|
84
|
-
|
|
85
|
-
### Deno
|
|
86
|
-
|
|
87
|
-
```typescript
|
|
88
|
-
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
|
|
89
|
-
```
|
|
90
|
-
|
|
91
|
-
## Quick Start
|
|
92
|
-
|
|
93
|
-
### Browser (ESM)
|
|
94
|
-
|
|
95
|
-
```typescript
|
|
96
|
-
import { extractFile } from '@kreuzberg/wasm';
|
|
97
|
-
|
|
98
|
-
async function handleFileUpload() {
|
|
99
|
-
const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
|
|
100
|
-
const file = fileInput.files[0];
|
|
101
|
-
|
|
102
|
-
const result = await extractFile(file, {
|
|
103
|
-
extract_tables: true,
|
|
104
|
-
extract_images: true
|
|
105
|
-
});
|
|
106
|
-
|
|
107
|
-
console.log('Extracted text:', result.content);
|
|
108
|
-
console.log('Tables found:', result.tables.length);
|
|
109
|
-
}
|
|
110
|
-
```
|
|
111
|
-
|
|
112
|
-
### Node.js (ESM)
|
|
113
|
-
|
|
114
|
-
```typescript
|
|
115
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
116
|
-
import { readFile } from 'fs/promises';
|
|
117
|
-
|
|
118
|
-
const pdfBytes = await readFile('./document.pdf');
|
|
119
|
-
const result = await extractBytes(
|
|
120
|
-
new Uint8Array(pdfBytes),
|
|
121
|
-
'application/pdf',
|
|
122
|
-
{ extract_tables: true }
|
|
123
|
-
);
|
|
124
|
-
|
|
125
|
-
console.log(result.content);
|
|
126
|
-
console.log('Found', result.tables.length, 'tables');
|
|
127
|
-
```
|
|
128
|
-
|
|
129
|
-
### Deno
|
|
130
|
-
|
|
131
|
-
```typescript
|
|
132
|
-
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
|
|
133
|
-
|
|
134
|
-
const pdfBytes = await Deno.readFile("./document.pdf");
|
|
135
|
-
const result = await extractBytes(pdfBytes, "application/pdf");
|
|
136
|
-
|
|
137
|
-
console.log(result.content);
|
|
138
|
-
```
|
|
139
|
-
|
|
140
|
-
### Cloudflare Workers
|
|
141
|
-
|
|
142
|
-
```typescript
|
|
143
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
144
|
-
|
|
145
|
-
export default {
|
|
146
|
-
async fetch(request: Request): Promise<Response> {
|
|
147
|
-
if (request.method === 'POST') {
|
|
148
|
-
const formData = await request.formData();
|
|
149
|
-
const file = formData.get('file') as File;
|
|
150
|
-
|
|
151
|
-
const arrayBuffer = await file.arrayBuffer();
|
|
152
|
-
const bytes = new Uint8Array(arrayBuffer);
|
|
153
|
-
|
|
154
|
-
const result = await extractBytes(bytes, file.type);
|
|
155
|
-
|
|
156
|
-
return Response.json({
|
|
157
|
-
text: result.content,
|
|
158
|
-
metadata: result.metadata,
|
|
159
|
-
tables: result.tables
|
|
160
|
-
});
|
|
161
|
-
}
|
|
162
|
-
|
|
163
|
-
return new Response('Upload a file', { status: 400 });
|
|
164
|
-
}
|
|
165
|
-
};
|
|
77
|
+
pnpm add @kreuzberg/wasm
|
|
166
78
|
```
|
|
167
79
|
|
|
168
|
-
## Performance Comparison
|
|
169
|
-
|
|
170
|
-
Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
|
|
171
|
-
|
|
172
|
-
| Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
|
|
173
|
-
|--------|---|---|---|
|
|
174
|
-
| **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
|
|
175
|
-
| **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
|
|
176
|
-
| **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
|
|
177
|
-
| **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
|
|
178
|
-
| **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
|
|
179
|
-
|
|
180
|
-
### When to Use WASM vs Native
|
|
181
|
-
|
|
182
|
-
**Use WASM (@kreuzberg/wasm) when:**
|
|
183
|
-
- Building browser applications (no choice, WASM required)
|
|
184
|
-
- Targeting Cloudflare Workers or edge runtimes
|
|
185
|
-
- Supporting Deno applications
|
|
186
|
-
- You don't have a native build toolchain available
|
|
187
|
-
- Portability across runtimes is critical
|
|
188
80
|
|
|
189
|
-
**Use Native (@kreuzberg/node) when:**
|
|
190
|
-
- Building Node.js or Bun applications (2-3x faster)
|
|
191
|
-
- Performance is your primary concern
|
|
192
|
-
- You're processing large volumes of documents
|
|
193
|
-
- You have native build tools available
|
|
194
81
|
|
|
195
|
-
### Performance Tips for WASM
|
|
196
82
|
|
|
197
|
-
|
|
198
|
-
|
|
199
|
-
|
|
200
|
-
4. **Preload OCR models** by calling extraction with OCR enabled early
|
|
201
|
-
|
|
202
|
-
## Examples
|
|
203
|
-
|
|
204
|
-
Kreuzberg WASM includes complete working examples for different environments:
|
|
205
|
-
|
|
206
|
-
- **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
|
|
207
|
-
- **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
|
|
208
|
-
- **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
|
|
209
|
-
|
|
210
|
-
See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
|
|
211
|
-
|
|
212
|
-
## Multi-Threading with wasm-bindgen-rayon
|
|
213
|
-
|
|
214
|
-
Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
|
|
215
|
-
|
|
216
|
-
### Initializing the Thread Pool
|
|
217
|
-
|
|
218
|
-
To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
|
|
219
|
-
|
|
220
|
-
```typescript
|
|
221
|
-
import { initThreadPool } from '@kreuzberg/wasm';
|
|
222
|
-
|
|
223
|
-
// Initialize thread pool for multi-threaded extraction
|
|
224
|
-
await initThreadPool(navigator.hardwareConcurrency);
|
|
225
|
-
|
|
226
|
-
// Now extractions will use multiple threads for better performance
|
|
227
|
-
const result = await extractBytes(pdfBytes, 'application/pdf');
|
|
83
|
+
**yarn:**
|
|
84
|
+
```bash
|
|
85
|
+
yarn add @kreuzberg/wasm
|
|
228
86
|
```
|
|
229
87
|
|
|
230
|
-
### Required HTTP Headers for SharedArrayBuffer
|
|
231
|
-
|
|
232
|
-
Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
|
|
233
88
|
|
|
234
|
-
**Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
|
|
235
89
|
|
|
236
|
-
Set these headers in your server configuration:
|
|
237
|
-
|
|
238
|
-
```
|
|
239
|
-
Cross-Origin-Opener-Policy: same-origin
|
|
240
|
-
Cross-Origin-Embedder-Policy: require-corp
|
|
241
|
-
```
|
|
242
90
|
|
|
243
|
-
#### Server Configuration Examples
|
|
244
91
|
|
|
245
|
-
|
|
246
|
-
```javascript
|
|
247
|
-
app.use((req, res, next) => {
|
|
248
|
-
res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
|
|
249
|
-
res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
|
|
250
|
-
next();
|
|
251
|
-
});
|
|
252
|
-
```
|
|
92
|
+
### System Requirements
|
|
253
93
|
|
|
254
|
-
|
|
255
|
-
|
|
256
|
-
add_header 'Cross-Origin-Opener-Policy' 'same-origin';
|
|
257
|
-
add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
|
|
258
|
-
```
|
|
94
|
+
- Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
|
|
95
|
+
- Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
|
|
259
96
|
|
|
260
|
-
**Apache:**
|
|
261
|
-
```apache
|
|
262
|
-
Header set Cross-Origin-Opener-Policy "same-origin"
|
|
263
|
-
Header set Cross-Origin-Embedder-Policy "require-corp"
|
|
264
|
-
```
|
|
265
97
|
|
|
266
|
-
**Cloudflare Workers:**
|
|
267
|
-
```javascript
|
|
268
|
-
export default {
|
|
269
|
-
async fetch(request: Request): Promise<Response> {
|
|
270
|
-
const response = new Response(body);
|
|
271
|
-
response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
|
|
272
|
-
response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
|
|
273
|
-
return response;
|
|
274
|
-
}
|
|
275
|
-
};
|
|
276
|
-
```
|
|
277
98
|
|
|
278
|
-
|
|
99
|
+
## Quick Start
|
|
279
100
|
|
|
280
|
-
|
|
101
|
+
### Basic Extraction
|
|
281
102
|
|
|
282
|
-
|
|
283
|
-
- **Firefox**: 79+
|
|
284
|
-
- **Safari**: 15.2+
|
|
285
|
-
- **Opera**: 60+
|
|
103
|
+
Extract text, metadata, and structure from any supported document format:
|
|
286
104
|
|
|
287
|
-
|
|
105
|
+
```ts
|
|
106
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
288
107
|
|
|
289
|
-
|
|
108
|
+
async function main() {
|
|
109
|
+
await initWasm();
|
|
290
110
|
|
|
291
|
-
|
|
111
|
+
const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
|
|
112
|
+
const bytes = new Uint8Array(buffer);
|
|
292
113
|
|
|
293
|
-
|
|
294
|
-
import { initThreadPool } from '@kreuzberg/wasm';
|
|
114
|
+
const result = await extractBytes(bytes, "application/pdf");
|
|
295
115
|
|
|
296
|
-
|
|
297
|
-
|
|
298
|
-
|
|
299
|
-
|
|
300
|
-
// Fall back to single-threaded processing
|
|
301
|
-
console.warn('Multi-threading unavailable:', error);
|
|
302
|
-
console.log('Using single-threaded extraction');
|
|
116
|
+
console.log("Extracted content:");
|
|
117
|
+
console.log(result.content);
|
|
118
|
+
console.log("MIME type:", result.mimeType);
|
|
119
|
+
console.log("Metadata:", result.metadata);
|
|
303
120
|
}
|
|
304
121
|
|
|
305
|
-
|
|
306
|
-
const result = await extractBytes(pdfBytes, 'application/pdf');
|
|
122
|
+
main().catch(console.error);
|
|
307
123
|
```
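The Basic Extraction example in the new README hard-codes `"application/pdf"` as the second argument to `extractBytes`. As a minimal sketch, assuming the caller only knows a file name, the MIME string could be chosen from a small lookup table. `guessMimeType` and `MIME_BY_EXTENSION` are hypothetical names used for illustration; they are not part of `@kreuzberg/wasm`, and the table covers only a handful of formats.

```typescript
// Hypothetical lookup table (an assumption, not a package export):
// maps a lowercase file extension to a standard MIME type string.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  xlsx: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  pptx: "application/vnd.openxmlformats-officedocument.presentationml.presentation",
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
};

// Hypothetical helper: returns the MIME type for a file name,
// or undefined when the name has no extension or an unknown one.
function guessMimeType(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0 || dot === fileName.length - 1) {
    return undefined;
  }
  const extension = fileName.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION[extension];
}
```

An `undefined` result signals that the caller should supply an explicit type rather than guess.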
|
|
308
124
|
|
|
309
|
-
### Complete Example with Thread Pool
|
|
310
|
-
|
|
311
|
-
```typescript
|
|
312
|
-
import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
|
|
313
125
|
|
|
314
|
-
|
|
315
|
-
try {
|
|
316
|
-
// Initialize WASM module
|
|
317
|
-
await initWasm();
|
|
126
|
+
### Common Use Cases
|
|
318
127
|
|
|
319
|
-
|
|
320
|
-
const cpuCount = navigator.hardwareConcurrency || 1;
|
|
321
|
-
try {
|
|
322
|
-
await initThreadPool(cpuCount);
|
|
323
|
-
console.log(`Thread pool initialized with ${cpuCount} workers`);
|
|
324
|
-
} catch (error) {
|
|
325
|
-
console.warn('Could not initialize thread pool, using single-threaded mode');
|
|
326
|
-
}
|
|
327
|
-
|
|
328
|
-
} catch (error) {
|
|
329
|
-
console.error('Failed to initialize Kreuzberg:', error);
|
|
330
|
-
}
|
|
331
|
-
}
|
|
128
|
+
#### Extract with Custom Configuration
|
|
332
129
|
|
|
333
|
-
|
|
334
|
-
const bytes = new Uint8Array(await file.arrayBuffer());
|
|
130
|
+
Most use cases benefit from configuration to control extraction behavior:
|
|
335
131
|
|
|
336
|
-
// Extraction will use multiple threads if available
|
|
337
|
-
const result = await extractBytes(bytes, file.type, {
|
|
338
|
-
extract_tables: true,
|
|
339
|
-
extract_images: true
|
|
340
|
-
});
|
|
341
|
-
|
|
342
|
-
return result;
|
|
343
|
-
}
|
|
344
|
-
|
|
345
|
-
// Initialize once at app startup
|
|
346
|
-
await initializeKreuzbergWithThreading();
|
|
347
|
-
|
|
348
|
-
// Later, handle file uploads
|
|
349
|
-
fileInput.addEventListener('change', async (e) => {
|
|
350
|
-
const file = e.target.files?.[0];
|
|
351
|
-
if (file) {
|
|
352
|
-
const result = await extractDocument(file);
|
|
353
|
-
console.log('Extracted text:', result.content);
|
|
354
|
-
}
|
|
355
|
-
});
|
|
356
|
-
```
|
|
357
|
-
|
|
358
|
-
### Performance Considerations
|
|
359
|
-
|
|
360
|
-
- **Thread Pool Size**: Generally, using `navigator.hardwareConcurrency` is optimal. For servers, use the number of available CPU cores.
|
|
361
|
-
- **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
|
|
362
|
-
- **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
|
|
363
|
-
|
|
364
|
-
## OCR Support
|
|
365
132
|
|
|
366
|
-
|
|
133
|
+
**With OCR (for scanned documents):**
|
|
367
134
|
|
|
368
|
-
|
|
135
|
+
```ts
|
|
136
|
+
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
369
137
|
|
|
370
|
-
|
|
371
|
-
|
|
138
|
+
async function extractWithOcr() {
|
|
139
|
+
await initWasm();
|
|
372
140
|
|
|
373
|
-
|
|
141
|
+
try {
|
|
142
|
+
await enableOcr();
|
|
143
|
+
console.log("OCR enabled successfully");
|
|
144
|
+
} catch (error) {
|
|
145
|
+
console.error("Failed to enable OCR:", error);
|
|
146
|
+
return;
|
|
147
|
+
}
|
|
374
148
|
|
|
375
|
-
const
|
|
376
|
-
new Uint8Array(imageBytes),
|
|
377
|
-
'image/jpeg',
|
|
378
|
-
{
|
|
379
|
-
enable_ocr: true,
|
|
380
|
-
ocr_config: {
|
|
381
|
-
languages: ['eng'], // English
|
|
382
|
-
backend: 'tesseract-wasm'
|
|
383
|
-
}
|
|
384
|
-
}
|
|
385
|
-
);
|
|
149
|
+
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
|
|
386
150
|
|
|
387
|
-
|
|
388
|
-
|
|
151
|
+
const result = await extractBytes(bytes, "image/png", {
|
|
152
|
+
ocr: {
|
|
153
|
+
backend: "tesseract-wasm",
|
|
154
|
+
language: "eng",
|
|
155
|
+
},
|
|
156
|
+
});
|
|
389
157
|
|
|
390
|
-
|
|
158
|
+
console.log("Extracted text:");
|
|
159
|
+
console.log(result.content);
|
|
160
|
+
}
|
|
391
161
|
|
|
392
|
-
|
|
393
|
-
const result = await extractBytes(imageBytes, 'image/png', {
|
|
394
|
-
enable_ocr: true,
|
|
395
|
-
ocr_config: {
|
|
396
|
-
languages: ['eng', 'deu', 'fra'], // English, German, French
|
|
397
|
-
backend: 'tesseract-wasm'
|
|
398
|
-
}
|
|
399
|
-
});
|
|
162
|
+
extractWithOcr().catch(console.error);
|
|
400
163
|
```
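The OCR example in the new README passes an `ocr` option with `backend` and `language` fields. A minimal sketch of building that option only for image inputs, assuming the option shape shown in that example. `OcrOptions`, `ExtractionOptions`, and `optionsFor` are hypothetical names for illustration, not types or functions exported by the package.

```typescript
// Hypothetical local types that mirror the option shape used in the README example.
interface OcrOptions {
  backend: string;
  language: string;
}

interface ExtractionOptions {
  ocr?: OcrOptions;
}

// Hypothetical helper: returns OCR options for image MIME types
// and an empty options object for everything else.
function optionsFor(mimeType: string, language: string = "eng"): ExtractionOptions {
  const isImage = mimeType.toLowerCase().indexOf("image/") === 0;
  if (!isImage) {
    return {};
  }
  return { ocr: { backend: "tesseract-wasm", language: language } };
}
```

Scanned PDFs would need their own rule, since the MIME type alone does not reveal whether a text layer exists.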
|
|
401
164
|
|
|
402
|
-
### Supported Languages
|
|
403
165
|
|
|
404
|
-
`eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
|
|
405
166
|
|
|
406
|
-
Training data is automatically loaded from jsDelivr CDN:
|
|
407
|
-
```
|
|
408
|
-
https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
|
|
409
|
-
```
|
|
410
167
|
|
|
411
|
-
|
|
168
|
+
#### Table Extraction
|
|
412
169
|
|
|
413
|
-
### Extract Tables
|
|
414
170
|
|
|
415
|
-
|
|
416
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
171
|
+
See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
|
|
417
172
|
|
|
418
|
-
const result = await extractBytes(pdfBytes, 'application/pdf', {
|
|
419
|
-
extract_tables: true
|
|
420
|
-
});
|
|
421
173
|
|
|
422
|
-
if (result.tables) {
|
|
423
|
-
for (const table of result.tables) {
|
|
424
|
-
console.log('Table as Markdown:');
|
|
425
|
-
console.log(table.markdown);
|
|
426
174
|
|
|
427
|
-
|
|
428
|
-
console.log(JSON.stringify(table.cells, null, 2));
|
|
429
|
-
}
|
|
430
|
-
}
|
|
431
|
-
```
|
|
175
|
+
#### Processing Multiple Files
|
|
432
176
|
|
|
433
|
-
### Extract Images
|
|
434
177
|
|
|
435
|
-
```
|
|
436
|
-
import { extractBytes } from
|
|
178
|
+
```ts
|
|
179
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
437
180
|
|
|
438
|
-
|
|
439
|
-
|
|
440
|
-
|
|
441
|
-
|
|
442
|
-
max_image_dimension: 4096
|
|
443
|
-
}
|
|
444
|
-
});
|
|
445
|
-
|
|
446
|
-
if (result.images) {
|
|
447
|
-
for (const image of result.images) {
|
|
448
|
-
console.log(`Image ${image.index}: ${image.format}`);
|
|
449
|
-
// image.data is a Uint8Array
|
|
450
|
-
}
|
|
181
|
+
interface DocumentJob {
|
|
182
|
+
name: string;
|
|
183
|
+
bytes: Uint8Array;
|
|
184
|
+
mimeType: string;
|
|
451
185
|
}
|
|
452
|
-
```
|
|
453
186
|
|
|
454
|
-
|
|
455
|
-
|
|
456
|
-
|
|
457
|
-
|
|
458
|
-
|
|
459
|
-
|
|
460
|
-
|
|
461
|
-
|
|
462
|
-
|
|
463
|
-
|
|
464
|
-
|
|
465
|
-
|
|
466
|
-
|
|
467
|
-
|
|
468
|
-
|
|
469
|
-
|
|
470
|
-
|
|
187
|
+
async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
|
|
188
|
+
await initWasm();
|
|
189
|
+
|
|
190
|
+
const results: Record<string, string> = {};
|
|
191
|
+
const queue = [...documents];
|
|
192
|
+
|
|
193
|
+
const workers = Array(concurrency)
|
|
194
|
+
.fill(null)
|
|
195
|
+
.map(async () => {
|
|
196
|
+
while (queue.length > 0) {
|
|
197
|
+
const doc = queue.shift();
|
|
198
|
+
if (!doc) break;
|
|
199
|
+
|
|
200
|
+
try {
|
|
201
|
+
const result = await extractBytes(doc.bytes, doc.mimeType);
|
|
202
|
+
results[doc.name] = result.content;
|
|
203
|
+
} catch (error) {
|
|
204
|
+
console.error(`Failed to process ${doc.name}:`, error);
|
|
205
|
+
}
|
|
206
|
+
}
|
|
207
|
+
});
|
|
208
|
+
|
|
209
|
+
await Promise.all(workers);
|
|
210
|
+
return results;
|
|
471
211
|
}
|
|
472
212
|
```
|
|
473
213
|
|
|
474
|
-
### Language Detection
|
|
475
214
|
|
|
476
|
-
```typescript
|
|
477
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
478
215
|
|
|
479
|
-
const result = await extractBytes(pdfBytes, 'application/pdf', {
|
|
480
|
-
enable_language_detection: true
|
|
481
|
-
});
|
|
482
216
|
|
|
483
|
-
if (result.language) {
|
|
484
|
-
console.log(`Detected language: ${result.language.code}`);
|
|
485
|
-
console.log(`Confidence: ${result.language.confidence}`);
|
|
486
|
-
}
|
|
487
|
-
```
|
|
488
217
|
|
|
489
|
-
|
|
490
|
-
|
|
491
|
-
```typescript
|
|
492
|
-
import {
|
|
493
|
-
extractBytes,
|
|
494
|
-
type ExtractionConfig
|
|
495
|
-
} from '@kreuzberg/wasm';
|
|
496
|
-
|
|
497
|
-
const config: ExtractionConfig = {
|
|
498
|
-
extract_tables: true,
|
|
499
|
-
extract_images: true,
|
|
500
|
-
extract_metadata: true,
|
|
501
|
-
|
|
502
|
-
enable_ocr: true,
|
|
503
|
-
ocr_config: {
|
|
504
|
-
languages: ['eng'],
|
|
505
|
-
backend: 'tesseract-wasm',
|
|
506
|
-
dpi: 300,
|
|
507
|
-
preprocessing: {
|
|
508
|
-
deskew: true,
|
|
509
|
-
denoise: true,
|
|
510
|
-
binarize: true
|
|
511
|
-
}
|
|
512
|
-
},
|
|
513
|
-
|
|
514
|
-
enable_chunking: true,
|
|
515
|
-
chunking_config: {
|
|
516
|
-
max_chars: 1000,
|
|
517
|
-
max_overlap: 200
|
|
518
|
-
},
|
|
519
|
-
|
|
520
|
-
enable_language_detection: true,
|
|
521
|
-
|
|
522
|
-
enable_quality: true,
|
|
523
|
-
|
|
524
|
-
extract_keywords: true,
|
|
525
|
-
keywords_config: {
|
|
526
|
-
max_keywords: 10,
|
|
527
|
-
method: 'yake'
|
|
528
|
-
}
|
|
529
|
-
};
|
|
530
|
-
|
|
531
|
-
const result = await extractBytes(data, mimeType, config);
|
|
532
|
-
```
|
|
218
|
+
#### Async Processing
|
|
533
219
|
|
|
534
|
-
|
|
220
|
+
For non-blocking document processing:
|
|
535
221
|
|
|
536
|
-
|
|
222
|
+
```ts
|
|
223
|
+
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
|
|
537
224
|
|
|
538
|
-
|
|
539
|
-
|
|
225
|
+
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
|
|
226
|
+
const caps = getWasmCapabilities();
|
|
227
|
+
if (!caps.hasWasm) {
|
|
228
|
+
throw new Error("WebAssembly not supported");
|
|
229
|
+
}
|
|
540
230
|
|
|
541
|
-
|
|
542
|
-
const fileInput = document.querySelector<HTMLInputElement>('#files');
|
|
543
|
-
const files = Array.from(fileInput.files);
|
|
231
|
+
await initWasm();
|
|
544
232
|
|
|
545
|
-
const results = await
|
|
546
|
-
extract_tables: true
|
|
547
|
-
});
|
|
233
|
+
const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
|
|
548
234
|
|
|
549
|
-
|
|
550
|
-
|
|
235
|
+
return results.map((r) => ({
|
|
236
|
+
content: r.content,
|
|
237
|
+
pageCount: r.metadata?.pageCount,
|
|
238
|
+
}));
|
|
551
239
|
}
|
|
552
240
|
|
|
553
|
-
|
|
554
|
-
const
|
|
555
|
-
const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
|
|
241
|
+
const fileBytes = [new Uint8Array([1, 2, 3])];
|
|
242
|
+
const mimes = ["application/pdf"];
|
|
556
243
|
|
|
557
|
-
|
|
244
|
+
extractDocuments(fileBytes, mimes)
|
|
245
|
+
.then((results) => console.log(results))
|
|
246
|
+
.catch(console.error);
|
|
558
247
|
```
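The Async Processing example in the new README uses `Promise.all`, which rejects as soon as any single extraction fails. A minimal sketch of collecting per-document outcomes instead, assuming an extractor with the same `(bytes, mimeType)` call shape as `extractBytes`. The extractor is injected so the block stays self-contained; `Job`, `Outcome`, and `extractEach` are hypothetical names, not package exports.

```typescript
// Hypothetical job shape: pairs each document's bytes with its MIME type.
interface Job {
  bytes: Uint8Array;
  mimeType: string;
}

// Hypothetical result shape: either a value or the error for one document.
type Outcome<T> =
  | { ok: true; index: number; value: T }
  | { ok: false; index: number; error: unknown };

// Runs every job and records success or failure per document,
// so one bad file does not discard the other results.
async function extractEach<T>(
  jobs: Job[],
  extract: (bytes: Uint8Array, mimeType: string) => Promise<T>,
): Promise<Outcome<T>[]> {
  return Promise.all(
    jobs.map(async (job, index): Promise<Outcome<T>> => {
      try {
        const value = await extract(job.bytes, job.mimeType);
        return { ok: true, index, value };
      } catch (error) {
        return { ok: false, index, error };
      }
    }),
  );
}
```

With the package installed, `extractBytes` could presumably be passed as the `extract` argument once `initWasm()` has resolved. Pairing bytes and MIME type in one `Job` also avoids the two parallel arrays drifting out of step.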
|
|
559
248
|
|
|
560
|
-
### Synchronous Extraction
|
|
561
249
|
|
|
562
|
-
```typescript
|
|
563
|
-
import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
|
|
564
250
|
|
|
565
|
-
// Synchronous single extraction
|
|
566
|
-
const result = extractBytesSync(data, 'application/pdf', config);
|
|
567
251
|
|
|
568
|
-
// Synchronous batch extraction
|
|
569
|
-
const results = batchExtractBytesSync(dataList, mimeTypes, config);
|
|
570
|
-
```
|
|
571
252
|
|
|
572
|
-
### Plugin System
|
|
573
253
|
|
|
574
|
-
|
|
254
|
+
### Next Steps
|
|
575
255
|
|
|
576
|
-
|
|
577
|
-
|
|
256
|
+
- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
|
|
257
|
+
- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
|
|
258
|
+
- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
|
|
259
|
+
- **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)** - Advanced configuration options
|
|
578
260
|
|
|
579
|
-
registerPostProcessor({
|
|
580
|
-
name: 'uppercase',
|
|
581
|
-
async process(result) {
|
|
582
|
-
return {
|
|
583
|
-
...result,
|
|
584
|
-
content: result.content.toUpperCase()
|
|
585
|
-
};
|
|
586
|
-
}
|
|
587
|
-
});
|
|
588
261
|
|
|
589
|
-
// Now all extractions will use this processor
|
|
590
|
-
const result = await extractBytes(data, mimeType);
|
|
591
|
-
console.log(result.content); // UPPERCASE TEXT
|
|
592
|
-
```
|
|
593
262
|
|
|
594
|
-
|
|
263
|
+
## Features
|
|
595
264
|
|
|
596
|
-
|
|
597
|
-
import { registerValidator } from '@kreuzberg/wasm';
|
|
265
|
+
### Supported File Formats (56+)
|
|
598
266
|
|
|
599
|
-
|
|
600
|
-
name: 'min-length',
|
|
601
|
-
async validate(result) {
|
|
602
|
-
if (result.content.length < 100) {
|
|
603
|
-
throw new Error('Document too short');
|
|
604
|
-
}
|
|
605
|
-
}
|
|
606
|
-
});
|
|
607
|
-
```
|
|
267
|
+
56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
|
|
608
268
|
|
|
609
|
-
####
|
|
610
|
-
|
|
611
|
-
```typescript
|
|
612
|
-
import { registerOcrBackend } from '@kreuzberg/wasm';
|
|
613
|
-
|
|
614
|
-
registerOcrBackend({
|
|
615
|
-
name: 'custom-ocr',
|
|
616
|
-
supportedLanguages() {
|
|
617
|
-
return ['eng', 'fra'];
|
|
618
|
-
},
|
|
619
|
-
async initialize() {
|
|
620
|
-
// Initialize your OCR backend
|
|
621
|
-
},
|
|
622
|
-
async processImage(imageBytes, language) {
|
|
623
|
-
// Process image and return result
|
|
624
|
-
return {
|
|
625
|
-
content: 'extracted text',
|
|
626
|
-
mime_type: 'text/plain',
|
|
627
|
-
metadata: {},
|
|
628
|
-
tables: []
|
|
629
|
-
};
|
|
630
|
-
},
|
|
631
|
-
async shutdown() {
|
|
632
|
-
// Cleanup
|
|
633
|
-
}
|
|
634
|
-
});
|
|
635
|
-
```
|
|
269
|
+
#### Office Documents
|
|
636
270
|
|
|
637
|
-
|
|
271
|
+
| Category | Formats | Capabilities |
|
|
272
|
+
|----------|---------|--------------|
|
|
273
|
+
| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
|
|
274
|
+
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
|
|
275
|
+
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
|
|
276
|
+
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
|
|
277
|
+
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
|
|
638
278
|
|
|
639
|
-
|
|
640
|
-
import {
|
|
641
|
-
detectMimeFromBytes,
|
|
642
|
-
getMimeFromExtension,
|
|
643
|
-
getExtensionsForMime,
|
|
644
|
-
normalizeMimeType
|
|
645
|
-
} from '@kreuzberg/wasm';
|
|
279
|
+
#### Images (OCR-Enabled)
|
|
646
280
|
|
|
647
|
-
|
|
648
|
-
|
|
281
|
+
| Category | Formats | Features |
|
|
282
|
+
|----------|---------|----------|
|
|
283
|
+
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
|
|
284
|
+
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
|
|
285
|
+
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
|
|
649
286
|
|
|
650
|
-
|
|
651
|
-
const mime = getMimeFromExtension('pdf'); // 'application/pdf'
|
|
287
|
+
#### Web & Data
|
|
652
288
|
|
|
653
|
-
|
|
654
|
-
|
|
289
|
+
| Category | Formats | Features |
|
|
290
|
+
|----------|---------|----------|
|
|
291
|
+
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
|
|
292
|
+
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
|
|
293
|
+
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
|
|
655
294
|
|
|
656
|
-
|
|
657
|
-
const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
|
|
658
|
-
```
|
|
295
|
+
#### Email & Archives
|
|
659
296
|
|
|
660
|
-
|
|
297
|
+
| Category | Formats | Features |
|
|
298
|
+
|----------|---------|----------|
|
|
299
|
+
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
|
|
300
|
+
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
|
|
661
301
|
|
|
662
|
-
|
|
663
|
-
import { loadConfigFromString } from '@kreuzberg/wasm';
|
|
302
|
+
#### Academic & Scientific
|
|
664
303
|
|
|
665
|
-
|
|
666
|
-
|
|
667
|
-
|
|
668
|
-
|
|
669
|
-
|
|
670
|
-
languages: [eng, deu]
|
|
671
|
-
`;
|
|
672
|
-
const config = loadConfigFromString(yamlConfig, 'yaml');
|
|
304
|
+
| Category | Formats | Features |
|
|
305
|
+
|----------|---------|----------|
|
|
306
|
+
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
|
|
307
|
+
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
|
|
308
|
+
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
|
|
673
309
|
|
|
674
|
-
|
|
675
|
-
const jsonConfig = '{"extract_tables":true}';
|
|
676
|
-
const config2 = loadConfigFromString(jsonConfig, 'json');
|
|
310
|
+
**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
|
|
677
311
|
|
|
678
|
-
|
|
679
|
-
const tomlConfig = 'extract_tables = true';
|
|
680
|
-
const config3 = loadConfigFromString(tomlConfig, 'toml');
|
|
681
|
-
```
|
|
312
|
+
### Key Capabilities
|
|
682
313
|
|
|
683
|
-
|
|
314
|
+
- **Text Extraction** - Extract all text content with position and formatting information
|
|
315
|
+
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
|
|
316
|
+
- **Table Extraction** - Parse tables with structure and cell content preservation
|
|
317
|
+
- **Image Extraction** - Extract embedded images and render page previews
|
|
318
|
+
- **OCR Support** - Integrate multiple OCR backends for scanned documents
|
|
684
319
|
|
|
685
|
-
|
|
320
|
+
- **Async/Await** - Non-blocking document processing with concurrent operations
|
|
686
321
|
|
|
687
|
-
#### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
|
|
688
|
-
Extract content from a browser `File` object.
|
|
689
322
|
|
|
690
|
-
|
|
691
|
-
Asynchronously extract content from a `Uint8Array`.
|
|
323
|
+
- **Plugin System** - Extensible post-processing for custom text transformation
|
|
692
324
|
|
|
693
|
-
#### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
|
|
694
|
-
Synchronously extract content from a `Uint8Array`.
|
|
695
325
|
|
|
696
|
-
|
|
697
|
-
|
|
326
|
+
- **Batch Processing** - Efficiently process multiple documents in parallel
|
|
327
|
+
- **Memory Efficient** - Stream large files without loading them entirely into memory
|
|
328
|
+
- **Language Detection** - Detect and support multiple languages in documents
|
|
329
|
+
- **Configuration** - Fine-grained control over extraction behavior
|
|
698
330
|
|
|
699
|
-
|
|
700
|
-
Extract multiple byte arrays in parallel.
|
|
331
|
+
### Performance Characteristics
|
|
701
332
|
|
|
702
|
-
|
|
703
|
-
|
|
333
|
+
| Format | Speed | Memory | Notes |
|
|
334
|
+
|--------|-------|--------|-------|
|
|
335
|
+
| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
|
|
336
|
+
| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
|
|
337
|
+
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
|
|
338
|
+
| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
|
|
339
|
+
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
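
As a rough way to read the throughput ranges in the table above, the sketch below turns a document size and an MB/s range into a best-case and worst-case duration. The names `ThroughputRange` and `estimateSeconds` are hypothetical and not part of `@kreuzberg/wasm`; the numbers are the README's stated ranges, not measurements.

```ts
// Hypothetical helper for back-of-the-envelope estimates; not a library API.
interface ThroughputRange {
  minMBps: number;
  maxMBps: number;
}

function estimateSeconds(sizeMB: number, range: ThroughputRange): { best: number; worst: number } {
  if (sizeMB < 0 || range.minMBps <= 0 || range.maxMBps < range.minMBps) {
    throw new RangeError("size must be >= 0 and the range must satisfy 0 < min <= max");
  }
  // The best case uses the upper throughput bound, the worst case the lower bound.
  return { best: sizeMB / range.maxMBps, worst: sizeMB / range.minMBps };
}

// "PDF (text)" row from the table: 10-100 MB/s.
const pdfTextRange: ThroughputRange = { minMBps: 10, maxMBps: 100 };
const pdfEstimate = estimateSeconds(50, pdfTextRange);
console.log(pdfEstimate); // best: 0.5 s, worst: 5 s for a 50 MB file
```

Actual timings depend on the runtime, the document structure, and whether OCR is involved, so treat the result as an order-of-magnitude guide only.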
|
|
704
340
|
|
|
705
|
-
### Plugin Management
|
|
706
341
|
|
|
707
|
-
#### Post-Processors
|
|
708
342
|
|
|
709
|
-
|
|
710
|
-
registerPostProcessor(processor: PostProcessorProtocol): void
|
|
711
|
-
unregisterPostProcessor(name: string): void
|
|
712
|
-
clearPostProcessors(): void
|
|
713
|
-
listPostProcessors(): string[]
|
|
714
|
-
```
|
|
343
|
+
## OCR Support
|
|
715
344
|
|
|
716
|
-
|
|
345
|
+
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
|
|
717
346
|
|
|
718
|
-
```typescript
|
|
719
|
-
registerValidator(validator: ValidatorProtocol): void
|
|
720
|
-
unregisterValidator(name: string): void
|
|
721
|
-
clearValidators(): void
|
|
722
|
-
listValidators(): string[]
|
|
723
|
-
```
|
|
724
347
|
|
|
725
|
-
|
|
348
|
+
- **Tesseract-Wasm**
|
|
726
349
|
|
|
727
|
-
```typescript
|
|
728
|
-
registerOcrBackend(backend: OcrBackendProtocol): void
|
|
729
|
-
unregisterOcrBackend(name: string): void
|
|
730
|
-
clearOcrBackends(): void
|
|
731
|
-
listOcrBackends(): string[]
|
|
732
|
-
```
|
|
733
350
|
|
|
734
|
-
###
|
|
351
|
+
### OCR Configuration Example
|
|
735
352
|
|
|
736
|
-
```
|
|
737
|
-
|
|
738
|
-
unregisterDocumentExtractor(name: string): void
|
|
739
|
-
clearDocumentExtractors(): void
|
|
740
|
-
```
|
|
353
|
+
```ts
|
|
354
|
+
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
741
355
|
|
|
742
|
-
|
|
356
|
+
async function extractWithOcr() {
|
|
357
|
+
await initWasm();
|
|
743
358
|
|
|
744
|
-
|
|
745
|
-
|
|
746
|
-
|
|
747
|
-
|
|
748
|
-
|
|
749
|
-
|
|
359
|
+
try {
|
|
360
|
+
await enableOcr();
|
|
361
|
+
console.log("OCR enabled successfully");
|
|
362
|
+
} catch (error) {
|
|
363
|
+
console.error("Failed to enable OCR:", error);
|
|
364
|
+
return;
|
|
365
|
+
}
|
|
750
366
|
|
|
751
|
-
|
|
367
|
+
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
|
|
752
368
|
|
|
753
|
-
|
|
754
|
-
|
|
755
|
-
|
|
369
|
+
const result = await extractBytes(bytes, "image/png", {
|
|
370
|
+
ocr: {
|
|
371
|
+
backend: "tesseract-wasm",
|
|
372
|
+
language: "eng",
|
|
373
|
+
},
|
|
374
|
+
});
|
|
756
375
|
|
|
757
|
-
|
|
376
|
+
console.log("Extracted text:");
|
|
377
|
+
console.log(result.content);
|
|
378
|
+
}
|
|
758
379
|
|
|
759
|
-
|
|
760
|
-
listEmbeddingPresets(): string[]
|
|
761
|
-
getEmbeddingPreset(name: string): EmbeddingPreset | null
|
|
380
|
+
extractWithOcr().catch(console.error);
|
|
762
381
|
```
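
The `ocr` options in the example above are only useful for inputs that need text recognition. A minimal sketch of choosing the config per input, assuming the `ocr: { backend, language }` shape shown above is what `extractBytes` accepts as its third argument. `ocrConfigFor` and the two interfaces are hypothetical names, and treating raster `image/*` types as the only OCR candidates is an assumption for illustration (scanned PDFs, for instance, would need separate handling).

```ts
// Hypothetical types mirroring the config shape used in the README example.
interface OcrOptions {
  backend: string;
  language: string;
}

interface SketchConfig {
  ocr?: OcrOptions;
}

function ocrConfigFor(mimeType: string, language: string = "eng"): SketchConfig {
  const normalized = mimeType.trim().toLowerCase();
  // SVG is listed under vector formats (DOM parsing), so it is skipped here.
  const isRasterImage = normalized.startsWith("image/") && normalized !== "image/svg+xml";
  return isRasterImage ? { ocr: { backend: "tesseract-wasm", language } } : {};
}

console.log(ocrConfigFor("image/png"));
console.log(ocrConfigFor("application/pdf"));
```

The returned object could then be passed as the config argument, for example `extractBytes(bytes, mimeType, ocrConfigFor(mimeType))`, after `enableOcr()` has succeeded.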
|
|
763
382
|
|
|
764
|
-
## Types
|
|
765
|
-
|
|
766
|
-
All types are shared via the `@kreuzberg/core` package:
|
|
767
|
-
|
|
768
|
-
```typescript
|
|
769
|
-
import type {
|
|
770
|
-
ExtractionResult,
|
|
771
|
-
ExtractionConfig,
|
|
772
|
-
OcrConfig,
|
|
773
|
-
ChunkingConfig,
|
|
774
|
-
ImageConfig,
|
|
775
|
-
KeywordsConfig,
|
|
776
|
-
Table,
|
|
777
|
-
ExtractedImage,
|
|
778
|
-
Chunk,
|
|
779
|
-
Metadata,
|
|
780
|
-
PostProcessorProtocol,
|
|
781
|
-
ValidatorProtocol,
|
|
782
|
-
OcrBackendProtocol
|
|
783
|
-
} from '@kreuzberg/core';
|
|
784
|
-
```
|
|
785
383
|
|
|
786
|
-
### ExtractionResult
|
|
787
|
-
|
|
788
|
-
Main result object containing:
|
|
789
|
-
- `content: string` - Extracted text content
|
|
790
|
-
- `mime_type: string` - MIME type of the document
|
|
791
|
-
- `metadata?: Metadata` - Document metadata
|
|
792
|
-
- `tables?: Table[]` - Extracted tables
|
|
793
|
-
- `images?: ExtractedImage[]` - Extracted images
|
|
794
|
-
- `chunks?: Chunk[]` - Text chunks (if chunking enabled)
|
|
795
|
-
- `language?: LanguageInfo` - Detected language (if enabled)
|
|
796
|
-
- `keywords?: Keyword[]` - Extracted keywords (if enabled)
|
|
797
|
-
|
|
798
|
-
### ExtractionConfig
|
|
799
|
-
|
|
800
|
-
Configuration object for extraction:
|
|
801
|
-
- `extract_tables?: boolean` - Extract tables as structured data
|
|
802
|
-
- `extract_images?: boolean` - Extract embedded images
|
|
803
|
-
- `extract_metadata?: boolean` - Extract document metadata
|
|
804
|
-
- `enable_ocr?: boolean` - Enable OCR for images
|
|
805
|
-
- `ocr_config?: OcrConfig` - OCR settings
|
|
806
|
-
- `enable_chunking?: boolean` - Split text into semantic chunks
|
|
807
|
-
- `chunking_config?: ChunkingConfig` - Text chunking settings
|
|
808
|
-
- `enable_language_detection?: boolean` - Detect document language
|
|
809
|
-
- `enable_quality?: boolean` - Encoding detection, normalization
|
|
810
|
-
- `extract_keywords?: boolean` - Extract important keywords
|
|
811
|
-
- `keywords_config?: KeywordsConfig` - Keyword extraction settings
|
|
812
|
-
|
|
813
|
-
### Table
|
|
814
|
-
|
|
815
|
-
Extracted table structure:
|
|
816
|
-
- `markdown: string` - Table in Markdown format
|
|
817
|
-
- `cells: TableCell[][]` - 2D array of table cells
|
|
818
|
-
- `row_count: number` - Number of rows
|
|
819
|
-
- `column_count: number` - Number of columns
|
|
820
|
-
|
|
821
|
-
## Supported Formats
|
|
822
|
-
|
|
823
|
-
| Category | Formats |
|
|
824
|
-
|----------|---------|
|
|
825
|
-
| **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
|
|
826
|
-
| **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
|
|
827
|
-
| **Web** | HTML, XHTML, XML, EPUB |
|
|
828
|
-
| **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
|
|
829
|
-
| **Email** | EML, MSG |
|
|
830
|
-
| **Archives** | ZIP, TAR, 7Z |
|
|
831
|
-
| **Other** | And 30+ more formats |
|
|
832
|
-
|
|
833
|
-
## Build from Source
|
|
834
|
-
|
|
835
|
-
### Prerequisites
|
|
836
|
-
|
|
837
|
-
- Rust 1.75+ with `wasm32-unknown-unknown` target
|
|
838
|
-
- Node.js 18+ with pnpm
|
|
839
|
-
- wasm-pack
|
|
840
384
|
|
|
841
|
-
```bash
|
|
842
|
-
# Install Rust target
|
|
843
|
-
rustup target add wasm32-unknown-unknown
|
|
844
385
|
|
|
845
|
-
|
|
846
|
-
curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
|
|
386
|
+
## Async Support
|
|
847
387
|
|
|
848
|
-
|
|
849
|
-
cd crates/kreuzberg-wasm
|
|
850
|
-
pnpm install
|
|
851
|
-
pnpm run build
|
|
388
|
+
This binding provides full async/await support for non-blocking document processing:
|
|
852
389
|
|
|
853
|
-
|
|
854
|
-
|
|
855
|
-
```
|
|
390
|
+
```ts
|
|
391
|
+
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
|
|
856
392
|
|
|
857
|
-
|
|
393
|
+
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
|
|
394
|
+
const caps = getWasmCapabilities();
|
|
395
|
+
if (!caps.hasWasm) {
|
|
396
|
+
throw new Error("WebAssembly not supported");
|
|
397
|
+
}
|
|
858
398
|
|
|
859
|
-
|
|
860
|
-
# For browsers (ESM modules)
|
|
861
|
-
pnpm run build:wasm:web
|
|
399
|
+
await initWasm();
|
|
862
400
|
|
|
863
|
-
|
|
864
|
-
pnpm run build:wasm:bundler
|
|
401
|
+
const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
|
|
865
402
|
|
|
866
|
-
|
|
867
|
-
|
|
403
|
+
return results.map((r) => ({
|
|
404
|
+
content: r.content,
|
|
405
|
+
pageCount: r.metadata?.pageCount,
|
|
406
|
+
}));
|
|
407
|
+
}
|
|
868
408
|
|
|
869
|
-
|
|
870
|
-
|
|
409
|
+
const fileBytes = [new Uint8Array([1, 2, 3])];
|
|
410
|
+
const mimes = ["application/pdf"];
|
|
871
411
|
|
|
872
|
-
|
|
873
|
-
|
|
412
|
+
extractDocuments(fileBytes, mimes)
|
|
413
|
+
.then((results) => console.log(results))
|
|
414
|
+
.catch(console.error);
|
|
874
415
|
```
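
`extractDocuments` above looks up `mimeTypes[index]` by position, so a `mimeTypes` array that is shorter than `files` would pass `undefined` as the MIME type. A small sketch of a guard that could run before the `Promise.all` call; `ExtractionInput` and `pairInputs` are hypothetical names, not part of `@kreuzberg/wasm`.

```ts
// Hypothetical helper: pairs each byte array with its MIME type and fails fast on a mismatch.
interface ExtractionInput {
  bytes: Uint8Array;
  mimeType: string;
}

function pairInputs(files: Uint8Array[], mimeTypes: string[]): ExtractionInput[] {
  if (files.length !== mimeTypes.length) {
    throw new RangeError(`Expected ${files.length} MIME types, received ${mimeTypes.length}`);
  }
  return files.map((bytes, index) => {
    const mimeType = mimeTypes[index];
    // Guards against sparse arrays and satisfies strict index checks.
    if (mimeType === undefined) {
      throw new RangeError(`Missing MIME type at index ${index}`);
    }
    return { bytes, mimeType };
  });
}

const pairedInputs = pairInputs([new Uint8Array([1, 2, 3])], ["application/pdf"]);
console.log(pairedInputs.length); // 1
```

Each pair can then be handed to `extractBytes(input.bytes, input.mimeType)` exactly as in the example above.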
|
|
875
416
|
|
|
876
|
-
## Limitations
|
|
877
|
-
|
|
878
|
-
### No File System Access
|
|
879
417
|
|
|
880
|
-
The WASM binding cannot access the file system directly. Use file readers:
|
|
881
418
|
|
|
882
|
-
```typescript
|
|
883
|
-
// ❌ Won't work
|
|
884
|
-
await extractFileSync('./document.pdf'); // Throws error
|
|
885
419
|
|
|
886
|
-
|
|
887
|
-
const bytes = await Deno.readFile('./document.pdf'); // Deno
|
|
888
|
-
const bytes = await fs.readFile('./document.pdf'); // Node.js
|
|
889
|
-
const bytes = await file.arrayBuffer(); // Browser
|
|
890
|
-
```
|
|
420
|
+
## Plugin System
|
|
891
421
|
|
|
892
|
-
|
|
422
|
+
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
|
|
893
423
|
|
|
894
|
-
|
|
424
|
+
For detailed plugin documentation, visit [Plugin System Guide](https://kreuzberg.dev/guides/plugins/).
|
|
895
425
|
|
|
896
|
-
### Size Constraints
|
|
897
426
|
|
|
898
|
-
Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
|
|
899
427
|
|
|
900
|
-
## Troubleshooting
|
|
901
428
|
|
|
902
|
-
### "WASM module failed to initialize"
|
|
903
429
|
|
|
904
|
-
Ensure your bundler is configured to handle WASM files:
|
|
905
430
|
|
|
906
|
-
|
|
907
|
-
```typescript
|
|
908
|
-
// vite.config.ts
|
|
909
|
-
export default {
|
|
910
|
-
optimizeDeps: {
|
|
911
|
-
exclude: ['@kreuzberg/wasm']
|
|
912
|
-
}
|
|
913
|
-
}
|
|
914
|
-
```
|
|
431
|
+
## Batch Processing
|
|
915
432
|
|
|
916
|
-
|
|
917
|
-
```javascript
|
|
918
|
-
// webpack.config.js
|
|
919
|
-
module.exports = {
|
|
920
|
-
experiments: {
|
|
921
|
-
asyncWebAssembly: true
|
|
922
|
-
}
|
|
923
|
-
}
|
|
924
|
-
```
|
|
433
|
+
Process multiple documents efficiently:
|
|
925
434
|
|
|
926
|
-
|
|
435
|
+
```ts
|
|
436
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
927
437
|
|
|
928
|
-
|
|
438
|
+
interface DocumentJob {
|
|
439
|
+
name: string;
|
|
440
|
+
bytes: Uint8Array;
|
|
441
|
+
mimeType: string;
|
|
442
|
+
}
|
|
929
443
|
|
|
930
|
-
|
|
931
|
-
|
|
444
|
+
async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
|
|
445
|
+
await initWasm();
|
|
446
|
+
|
|
447
|
+
const results: Record<string, string> = {};
|
|
448
|
+
const queue = [...documents];
|
|
449
|
+
|
|
450
|
+
const workers = Array(concurrency)
|
|
451
|
+
.fill(null)
|
|
452
|
+
.map(async () => {
|
|
453
|
+
while (queue.length > 0) {
|
|
454
|
+
const doc = queue.shift();
|
|
455
|
+
if (!doc) break;
|
|
456
|
+
|
|
457
|
+
try {
|
|
458
|
+
const result = await extractBytes(doc.bytes, doc.mimeType);
|
|
459
|
+
results[doc.name] = result.content;
|
|
460
|
+
} catch (error) {
|
|
461
|
+
console.error(`Failed to process ${doc.name}:`, error);
|
|
462
|
+
}
|
|
463
|
+
}
|
|
464
|
+
});
|
|
465
|
+
|
|
466
|
+
await Promise.all(workers);
|
|
467
|
+
return results;
|
|
468
|
+
}
|
|
932
469
|
```
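
The worker-queue example above logs failures and returns only the successful results. A variant sketch that also reports which documents failed, assuming the extractor is passed in as a function so the pool can be exercised without loading the WASM module. `Extractor`, `BatchOutcome`, `processBatchWithFailures`, and `fakeExtract` are hypothetical names; in real use, `extractBytes` would take the place of the stub after `initWasm()` has resolved.

```ts
// Hypothetical types; the extractor signature mirrors how extractBytes is called above.
type Extractor = (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>;

interface BatchJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

interface BatchOutcome {
  contents: Record<string, string>;
  failures: Record<string, string>;
}

async function processBatchWithFailures(
  jobs: BatchJob[],
  extract: Extractor,
  concurrency: number = 3,
): Promise<BatchOutcome> {
  const outcome: BatchOutcome = { contents: {}, failures: {} };
  const queue = [...jobs];
  // Never start more workers than there are jobs, and always start at least one.
  const workerCount = Math.max(1, Math.min(concurrency, jobs.length));

  const workers = Array.from({ length: workerCount }, async () => {
    // Each worker pulls the next job until the shared queue is empty.
    for (let job = queue.shift(); job !== undefined; job = queue.shift()) {
      try {
        const result = await extract(job.bytes, job.mimeType);
        outcome.contents[job.name] = result.content;
      } catch (error) {
        outcome.failures[job.name] = error instanceof Error ? error.message : String(error);
      }
    }
  });

  await Promise.all(workers);
  return outcome;
}

// Stub extractor for demonstration only; it rejects empty inputs.
const fakeExtract: Extractor = async (bytes, mimeType) => {
  if (bytes.length === 0) throw new Error("empty input");
  return { content: `${mimeType}: ${bytes.length} bytes` };
};

processBatchWithFailures(
  [
    { name: "a.pdf", bytes: new Uint8Array([1, 2, 3]), mimeType: "application/pdf" },
    { name: "b.png", bytes: new Uint8Array([]), mimeType: "image/png" },
  ],
  fakeExtract,
  2,
).then((outcome) => console.log(outcome));
```

Returning failures alongside results lets the caller decide whether to retry, skip, or abort, rather than relying on console output.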
|
|
933
470
|
|
|
934
|
-
### Memory Issues in Workers
|
|
935
|
-
|
|
936
|
-
For large documents in Cloudflare Workers, process in smaller chunks:
|
|
937
|
-
|
|
938
|
-
```typescript
|
|
939
|
-
const result = await extractBytes(pdfBytes, 'application/pdf', {
|
|
940
|
-
chunking_config: { max_chars: 1000 }
|
|
941
|
-
});
|
|
942
|
-
```
|
|
943
471
|
|
|
944
|
-
### OCR Not Working
|
|
945
472
|
|
|
946
|
-
Check that tesseract-wasm is loading correctly. The training data is automatically fetched from CDN on first use.
|
|
947
473
|
|
|
948
|
-
##
|
|
474
|
+
## Configuration
|
|
949
475
|
|
|
950
|
-
|
|
476
|
+
For advanced configuration options including language detection, table extraction, OCR settings, and more:
|
|
951
477
|
|
|
952
|
-
|
|
953
|
-
- **Deno**: Command-line document extraction
|
|
954
|
-
- **Cloudflare Workers**: Document processing API
|
|
955
|
-
- **Node.js**: Batch processing script
|
|
478
|
+
**[Configuration Guide](https://kreuzberg.dev/guides/configuration/)**
|
|
956
479
|
|
|
957
480
|
## Documentation
|
|
958
481
|
|
|
959
|
-
|
|
482
|
+
- **[Official Documentation](https://kreuzberg.dev/)**
|
|
483
|
+
- **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
|
|
484
|
+
- **[Examples & Guides](https://kreuzberg.dev/guides/)**
|
|
960
485
|
|
|
961
486
|
## Contributing
|
|
962
487
|
|
|
963
|
-
|
|
488
|
+
Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
|
|
964
489
|
|
|
965
490
|
## License
|
|
966
491
|
|
|
967
|
-
MIT
|
|
968
|
-
|
|
969
|
-
## Links
|
|
970
|
-
|
|
971
|
-
- [Website](https://kreuzberg.dev)
|
|
972
|
-
- [Documentation](https://kreuzberg.dev)
|
|
973
|
-
- [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
|
|
974
|
-
- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
|
975
|
-
- [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
|
|
976
|
-
- [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
|
|
492
|
+
MIT License - see LICENSE file for details.
|
|
977
493
|
|
|
978
|
-
##
|
|
494
|
+
## Support
|
|
979
495
|
|
|
980
|
-
- [
|
|
981
|
-
- [
|
|
982
|
-
- [
|
|
496
|
+
- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
|
|
497
|
+
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
|
498
|
+
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
|