@kreuzberg/wasm 4.0.0-rc.6 → 4.0.0
- package/LICENSE +7 -0
- package/README.md +321 -800
- package/dist/adapters/wasm-adapter.d.ts +7 -10
- package/dist/adapters/wasm-adapter.d.ts.map +1 -0
- package/dist/adapters/wasm-adapter.js +53 -54
- package/dist/adapters/wasm-adapter.js.map +1 -1
- package/dist/index.d.ts +23 -67
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +1102 -104
- package/dist/index.js.map +1 -1
- package/dist/ocr/registry.d.ts +7 -10
- package/dist/ocr/registry.d.ts.map +1 -0
- package/dist/ocr/registry.js +9 -28
- package/dist/ocr/registry.js.map +1 -1
- package/dist/ocr/tesseract-wasm-backend.d.ts +3 -6
- package/dist/ocr/tesseract-wasm-backend.d.ts.map +1 -0
- package/dist/ocr/tesseract-wasm-backend.js +8 -83
- package/dist/ocr/tesseract-wasm-backend.js.map +1 -1
- package/dist/pdfium.js +77 -0
- package/dist/pkg/LICENSE +7 -0
- package/dist/pkg/README.md +503 -0
- package/dist/{kreuzberg_wasm.d.ts → pkg/kreuzberg_wasm.d.ts} +24 -12
- package/dist/{kreuzberg_wasm.js → pkg/kreuzberg_wasm.js} +224 -233
- package/dist/pkg/kreuzberg_wasm_bg.js +1871 -0
- package/dist/{kreuzberg_wasm_bg.wasm → pkg/kreuzberg_wasm_bg.wasm} +0 -0
- package/dist/{kreuzberg_wasm_bg.wasm.d.ts → pkg/kreuzberg_wasm_bg.wasm.d.ts} +10 -13
- package/dist/pkg/package.json +27 -0
- package/dist/plugin-registry.d.ts +246 -0
- package/dist/plugin-registry.d.ts.map +1 -0
- package/dist/runtime.d.ts +21 -22
- package/dist/runtime.d.ts.map +1 -0
- package/dist/runtime.js +21 -41
- package/dist/runtime.js.map +1 -1
- package/dist/types.d.ts +363 -0
- package/dist/types.d.ts.map +1 -0
- package/package.json +34 -51
- package/dist/adapters/wasm-adapter.d.mts +0 -121
- package/dist/adapters/wasm-adapter.mjs +0 -221
- package/dist/adapters/wasm-adapter.mjs.map +0 -1
- package/dist/index.d.mts +0 -466
- package/dist/index.mjs +0 -384
- package/dist/index.mjs.map +0 -1
- package/dist/kreuzberg_wasm.d.mts +0 -758
- package/dist/kreuzberg_wasm.mjs +0 -48
- package/dist/ocr/registry.d.mts +0 -102
- package/dist/ocr/registry.mjs +0 -70
- package/dist/ocr/registry.mjs.map +0 -1
- package/dist/ocr/tesseract-wasm-backend.d.mts +0 -257
- package/dist/ocr/tesseract-wasm-backend.mjs +0 -424
- package/dist/ocr/tesseract-wasm-backend.mjs.map +0 -1
- package/dist/runtime.d.mts +0 -256
- package/dist/runtime.mjs +0 -152
- package/dist/runtime.mjs.map +0 -1
- package/dist/snippets/wasm-bindgen-rayon-38edf6e439f6d70d/src/workerHelpers.js +0 -107
- package/dist/types-GJVIvbPy.d.mts +0 -221
- package/dist/types-GJVIvbPy.d.ts +0 -221
package/README.md
CHANGED
|
@@ -1,982 +1,503 @@
|
|
|
1
|
-
#
|
|
1
|
+
# WebAssembly
|
|
2
|
+
|
|
3
|
+
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
|
|
4
|
+
<!-- Language Bindings -->
|
|
5
|
+
<a href="https://crates.io/crates/kreuzberg">
|
|
6
|
+
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
|
|
7
|
+
</a>
|
|
8
|
+
<a href="https://hex.pm/packages/kreuzberg">
|
|
9
|
+
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
|
|
10
|
+
</a>
|
|
11
|
+
<a href="https://pypi.org/project/kreuzberg/">
|
|
12
|
+
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
|
|
13
|
+
</a>
|
|
14
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/node">
|
|
15
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
|
|
16
|
+
</a>
|
|
17
|
+
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
|
|
18
|
+
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
|
|
19
|
+
</a>
|
|
20
|
+
|
|
21
|
+
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
|
|
22
|
+
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
|
|
23
|
+
</a>
|
|
24
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
|
|
25
|
+
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
|
|
26
|
+
</a>
|
|
27
|
+
<a href="https://www.nuget.org/packages/Kreuzberg/">
|
|
28
|
+
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
|
|
29
|
+
</a>
|
|
30
|
+
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
|
|
31
|
+
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
|
|
32
|
+
</a>
|
|
33
|
+
<a href="https://rubygems.org/gems/kreuzberg">
|
|
34
|
+
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
|
|
35
|
+
</a>
|
|
36
|
+
|
|
37
|
+
<!-- Project Info -->
|
|
38
|
+
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
|
|
39
|
+
<img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
|
|
40
|
+
</a>
|
|
41
|
+
<a href="https://docs.kreuzberg.dev">
|
|
42
|
+
<img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
|
|
43
|
+
</a>
|
|
44
|
+
</div>
|
|
45
|
+
|
|
46
|
+
<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
|
|
47
|
+
|
|
48
|
+
<div align="center" style="margin-top: 20px;">
|
|
49
|
+
<a href="https://discord.gg/pXxagNK2zN">
|
|
50
|
+
<img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
|
|
51
|
+
</a>
|
|
52
|
+
</div>
|
|
53
|
+
|
|
54
|
+
|
|
55
|
+
Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
|
|
2
56
|
|
|
3
|
-
[](https://crates.io/crates/kreuzberg)
|
|
4
|
-
[](https://pypi.org/project/kreuzberg/)
|
|
5
|
-
[](https://www.npmjs.com/package/@kreuzberg/node)
|
|
6
|
-
[](https://www.npmjs.com/package/@kreuzberg/wasm)
|
|
7
|
-
[](https://rubygems.org/gems/kreuzberg)
|
|
8
|
-
[](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
|
|
9
|
-
[](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
|
|
10
|
-
[](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
|
|
11
57
|
|
|
12
|
-
|
|
13
|
-
[](https://kreuzberg.dev/)
|
|
14
|
-
[](https://discord.gg/pXxagNK2zN)
|
|
15
|
-
|
|
16
|
-
High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
|
|
17
|
-
|
|
18
|
-
Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
|
|
19
|
-
|
|
20
|
-
> **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
|
|
21
|
-
>
|
|
22
|
-
> **This WASM package is designed for:**
|
|
23
|
-
> - Browser applications (including web workers)
|
|
24
|
-
> - Cloudflare Workers and edge runtimes
|
|
25
|
-
> - Deno applications
|
|
26
|
-
> - Environments without native build toolchain
|
|
27
|
-
|
|
28
|
-
> **🚀 Version 4.0.0 Release Candidate**
|
|
29
|
-
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
|
|
30
|
-
|
|
31
|
-
## Features
|
|
32
|
-
|
|
33
|
-
- **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
|
|
34
|
-
- **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
|
|
35
|
-
- **Table Extraction**: Advanced table detection and structured data extraction
|
|
36
|
-
- **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
|
|
37
|
-
- **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
|
|
38
|
-
- **API Parity**: All extraction functions from the Node.js binding
|
|
39
|
-
- **Plugin System**: Custom post-processors, validators, and OCR backends
|
|
40
|
-
- **Optimized Bundle**: <5MB uncompressed, <2MB compressed
|
|
41
|
-
- **Zero Configuration**: Works out of the box with sensible defaults
|
|
42
|
-
- **Portable**: Runs anywhere WASM is supported without native dependencies
|
|
43
|
-
|
|
44
|
-
## Requirements
|
|
45
|
-
|
|
46
|
-
- **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
|
|
47
|
-
- **Node.js**: 18 or higher
|
|
48
|
-
- **Deno**: 1.0 or higher
|
|
49
|
-
- **Cloudflare Workers**: Compatible with Workers runtime
|
|
50
|
-
|
|
51
|
-
### Optional Dependencies
|
|
58
|
+
## Installation
|
|
52
59
|
|
|
53
|
-
|
|
60
|
+
### Package Installation
|
|
54
61
|
|
|
55
|
-
## Installation
|
|
56
62
|
|
|
57
|
-
|
|
63
|
+
Install via one of the supported package managers:
|
|
58
64
|
|
|
59
|
-
| Use Case | Recommendation | Reason |
|
|
60
|
-
|----------|---|---|
|
|
61
|
-
| **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
|
|
62
|
-
| **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
|
|
63
|
-
| **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
|
|
64
|
-
| **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
|
|
65
|
-
| **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
|
|
66
65
|
|
|
67
|
-
### Install via npm/pnpm/yarn
|
|
68
66
|
|
|
67
|
+
**npm:**
|
|
69
68
|
```bash
|
|
70
69
|
npm install @kreuzberg/wasm
|
|
71
70
|
```
|
|
72
71
|
|
|
73
|
-
Or with pnpm:
|
|
74
72
|
|
|
75
|
-
```bash
|
|
76
|
-
pnpm add @kreuzberg/wasm
|
|
77
|
-
```
|
|
78
73
|
|
|
79
|
-
Or with yarn:
|
|
80
74
|
|
|
75
|
+
**pnpm:**
|
|
81
76
|
```bash
|
|
82
|
-
|
|
83
|
-
```
|
|
84
|
-
|
|
85
|
-
### Deno
|
|
86
|
-
|
|
87
|
-
```typescript
|
|
88
|
-
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
|
|
89
|
-
```
|
|
90
|
-
|
|
91
|
-
## Quick Start
|
|
92
|
-
|
|
93
|
-
### Browser (ESM)
|
|
94
|
-
|
|
95
|
-
```typescript
|
|
96
|
-
import { extractFile } from '@kreuzberg/wasm';
|
|
97
|
-
|
|
98
|
-
async function handleFileUpload() {
|
|
99
|
-
const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
|
|
100
|
-
const file = fileInput.files[0];
|
|
101
|
-
|
|
102
|
-
const result = await extractFile(file, {
|
|
103
|
-
extract_tables: true,
|
|
104
|
-
extract_images: true
|
|
105
|
-
});
|
|
106
|
-
|
|
107
|
-
console.log('Extracted text:', result.content);
|
|
108
|
-
console.log('Tables found:', result.tables.length);
|
|
109
|
-
}
|
|
110
|
-
```
|
|
111
|
-
|
|
112
|
-
### Node.js (ESM)
|
|
113
|
-
|
|
114
|
-
```typescript
|
|
115
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
116
|
-
import { readFile } from 'fs/promises';
|
|
117
|
-
|
|
118
|
-
const pdfBytes = await readFile('./document.pdf');
|
|
119
|
-
const result = await extractBytes(
|
|
120
|
-
new Uint8Array(pdfBytes),
|
|
121
|
-
'application/pdf',
|
|
122
|
-
{ extract_tables: true }
|
|
123
|
-
);
|
|
124
|
-
|
|
125
|
-
console.log(result.content);
|
|
126
|
-
console.log('Found', result.tables.length, 'tables');
|
|
127
|
-
```
|
|
128
|
-
|
|
129
|
-
### Deno
|
|
130
|
-
|
|
131
|
-
```typescript
|
|
132
|
-
import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
|
|
133
|
-
|
|
134
|
-
const pdfBytes = await Deno.readFile("./document.pdf");
|
|
135
|
-
const result = await extractBytes(pdfBytes, "application/pdf");
|
|
136
|
-
|
|
137
|
-
console.log(result.content);
|
|
138
|
-
```
|
|
139
|
-
|
|
140
|
-
### Cloudflare Workers
|
|
141
|
-
|
|
142
|
-
```typescript
|
|
143
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
144
|
-
|
|
145
|
-
export default {
|
|
146
|
-
async fetch(request: Request): Promise<Response> {
|
|
147
|
-
if (request.method === 'POST') {
|
|
148
|
-
const formData = await request.formData();
|
|
149
|
-
const file = formData.get('file') as File;
|
|
150
|
-
|
|
151
|
-
const arrayBuffer = await file.arrayBuffer();
|
|
152
|
-
const bytes = new Uint8Array(arrayBuffer);
|
|
153
|
-
|
|
154
|
-
const result = await extractBytes(bytes, file.type);
|
|
155
|
-
|
|
156
|
-
return Response.json({
|
|
157
|
-
text: result.content,
|
|
158
|
-
metadata: result.metadata,
|
|
159
|
-
tables: result.tables
|
|
160
|
-
});
|
|
161
|
-
}
|
|
162
|
-
|
|
163
|
-
return new Response('Upload a file', { status: 400 });
|
|
164
|
-
}
|
|
165
|
-
};
|
|
77
|
+
pnpm add @kreuzberg/wasm
|
|
166
78
|
```
|
|
167
79
|
|
|
168
|
-
## Performance Comparison
|
|
169
|
-
|
|
170
|
-
Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
|
|
171
|
-
|
|
172
|
-
| Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
|
|
173
|
-
|--------|---|---|---|
|
|
174
|
-
| **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
|
|
175
|
-
| **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
|
|
176
|
-
| **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
|
|
177
|
-
| **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
|
|
178
|
-
| **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
|
|
179
|
-
|
|
180
|
-
### When to Use WASM vs Native
|
|
181
|
-
|
|
182
|
-
**Use WASM (@kreuzberg/wasm) when:**
|
|
183
|
-
- Building browser applications (no choice, WASM required)
|
|
184
|
-
- Targeting Cloudflare Workers or edge runtimes
|
|
185
|
-
- Supporting Deno applications
|
|
186
|
-
- You don't have a native build toolchain available
|
|
187
|
-
- Portability across runtimes is critical
|
|
188
80
|
|
|
189
|
-
**Use Native (@kreuzberg/node) when:**
|
|
190
|
-
- Building Node.js or Bun applications (2-3x faster)
|
|
191
|
-
- Performance is your primary concern
|
|
192
|
-
- You're processing large volumes of documents
|
|
193
|
-
- You have native build tools available
|
|
194
81
|
|
|
195
|
-
### Performance Tips for WASM
|
|
196
82
|
|
|
197
|
-
|
|
198
|
-
|
|
199
|
-
|
|
200
|
-
4. **Preload OCR models** by calling extraction with OCR enabled early
|
|
201
|
-
|
|
202
|
-
## Examples
|
|
203
|
-
|
|
204
|
-
Kreuzberg WASM includes complete working examples for different environments:
|
|
205
|
-
|
|
206
|
-
- **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
|
|
207
|
-
- **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
|
|
208
|
-
- **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
|
|
209
|
-
|
|
210
|
-
See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
|
|
211
|
-
|
|
212
|
-
## Multi-Threading with wasm-bindgen-rayon
|
|
213
|
-
|
|
214
|
-
Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
|
|
215
|
-
|
|
216
|
-
### Initializing the Thread Pool
|
|
217
|
-
|
|
218
|
-
To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
|
|
219
|
-
|
|
220
|
-
```typescript
|
|
221
|
-
import { initThreadPool } from '@kreuzberg/wasm';
|
|
222
|
-
|
|
223
|
-
// Initialize thread pool for multi-threaded extraction
|
|
224
|
-
await initThreadPool(navigator.hardwareConcurrency);
|
|
225
|
-
|
|
226
|
-
// Now extractions will use multiple threads for better performance
|
|
227
|
-
const result = await extractBytes(pdfBytes, 'application/pdf');
|
|
83
|
+
**yarn:**
|
|
84
|
+
```bash
|
|
85
|
+
yarn add @kreuzberg/wasm
|
|
228
86
|
```
|
|
229
87
|
|
|
230
|
-
### Required HTTP Headers for SharedArrayBuffer
|
|
231
88
|
|
|
232
|
-
Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
|
|
233
89
|
|
|
234
|
-
**Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
|
|
235
|
-
|
|
236
|
-
Set these headers in your server configuration:
|
|
237
|
-
|
|
238
|
-
```
|
|
239
|
-
Cross-Origin-Opener-Policy: same-origin
|
|
240
|
-
Cross-Origin-Embedder-Policy: require-corp
|
|
241
|
-
```
|
|
242
90
|
|
|
243
|
-
#### Server Configuration Examples
|
|
244
91
|
|
|
245
|
-
|
|
246
|
-
```javascript
|
|
247
|
-
app.use((req, res, next) => {
|
|
248
|
-
res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
|
|
249
|
-
res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
|
|
250
|
-
next();
|
|
251
|
-
});
|
|
252
|
-
```
|
|
92
|
+
### System Requirements
|
|
253
93
|
|
|
254
|
-
|
|
255
|
-
|
|
256
|
-
add_header 'Cross-Origin-Opener-Policy' 'same-origin';
|
|
257
|
-
add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
|
|
258
|
-
```
|
|
94
|
+
- Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
|
|
95
|
+
- Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
|
|
259
96
|
|
|
260
|
-
**Apache:**
|
|
261
|
-
```apache
|
|
262
|
-
Header set Cross-Origin-Opener-Policy "same-origin"
|
|
263
|
-
Header set Cross-Origin-Embedder-Policy "require-corp"
|
|
264
|
-
```
|
|
265
97
|
|
|
266
|
-
**Cloudflare Workers:**
|
|
267
|
-
```javascript
|
|
268
|
-
export default {
|
|
269
|
-
async fetch(request: Request): Promise<Response> {
|
|
270
|
-
const response = new Response(body);
|
|
271
|
-
response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
|
|
272
|
-
response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
|
|
273
|
-
return response;
|
|
274
|
-
}
|
|
275
|
-
};
|
|
276
|
-
```
|
|
277
98
|
|
|
278
|
-
|
|
99
|
+
## Quick Start
|
|
279
100
|
|
|
280
|
-
|
|
101
|
+
### Basic Extraction
|
|
281
102
|
|
|
282
|
-
|
|
283
|
-
- **Firefox**: 79+
|
|
284
|
-
- **Safari**: 15.2+
|
|
285
|
-
- **Opera**: 60+
|
|
103
|
+
Extract text, metadata, and structure from any supported document format:
|
|
286
104
|
|
|
287
|
-
|
|
105
|
+
```ts
|
|
106
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
288
107
|
|
|
289
|
-
|
|
108
|
+
async function main() {
|
|
109
|
+
await initWasm();
|
|
290
110
|
|
|
291
|
-
|
|
111
|
+
const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
|
|
112
|
+
const bytes = new Uint8Array(buffer);
|
|
292
113
|
|
|
293
|
-
|
|
294
|
-
import { initThreadPool } from '@kreuzberg/wasm';
|
|
114
|
+
const result = await extractBytes(bytes, "application/pdf");
|
|
295
115
|
|
|
296
|
-
|
|
297
|
-
|
|
298
|
-
|
|
299
|
-
|
|
300
|
-
// Fall back to single-threaded processing
|
|
301
|
-
console.warn('Multi-threading unavailable:', error);
|
|
302
|
-
console.log('Using single-threaded extraction');
|
|
116
|
+
console.log("Extracted content:");
|
|
117
|
+
console.log(result.content);
|
|
118
|
+
console.log("MIME type:", result.mimeType);
|
|
119
|
+
console.log("Metadata:", result.metadata);
|
|
303
120
|
}
|
|
304
121
|
|
|
305
|
-
|
|
306
|
-
const result = await extractBytes(pdfBytes, 'application/pdf');
|
|
122
|
+
main().catch(console.error);
|
|
307
123
|
```
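
The basic example above passes the MIME type to `extractBytes` by hand. When only a file name is available, one possible approach is a small extension lookup. This is a minimal sketch: `guessMimeType` and `MIME_BY_EXTENSION` are hypothetical names for illustration, not part of `@kreuzberg/wasm`, and the table covers only a handful of the supported formats.

```ts
// Hypothetical helper for illustration; not an @kreuzberg/wasm API.
// Maps a lowercase file extension to a standard MIME type.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
  ["pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
]);

// Returns the MIME type for a file name, or undefined when the name has
// no extension or the extension is not in the table.
function guessMimeType(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0 || dot === fileName.length - 1) {
    return undefined;
  }
  const extension = fileName.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION.get(extension);
}
```

A caller could check the result for `undefined` before calling `extractBytes(bytes, mimeType)`, and fall back to asking the user or rejecting the file.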
|
|
308
124
|
|
|
309
|
-
### Complete Example with Thread Pool
|
|
310
125
|
|
|
311
|
-
|
|
312
|
-
import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
|
|
126
|
+
### Common Use Cases
|
|
313
127
|
|
|
314
|
-
|
|
315
|
-
try {
|
|
316
|
-
// Initialize WASM module
|
|
317
|
-
await initWasm();
|
|
128
|
+
#### Extract with Custom Configuration
|
|
318
129
|
|
|
319
|
-
|
|
320
|
-
const cpuCount = navigator.hardwareConcurrency || 1;
|
|
321
|
-
try {
|
|
322
|
-
await initThreadPool(cpuCount);
|
|
323
|
-
console.log(`Thread pool initialized with ${cpuCount} workers`);
|
|
324
|
-
} catch (error) {
|
|
325
|
-
console.warn('Could not initialize thread pool, using single-threaded mode');
|
|
326
|
-
}
|
|
130
|
+
Most use cases benefit from configuration to control extraction behavior:
|
|
327
131
|
|
|
328
|
-
} catch (error) {
|
|
329
|
-
console.error('Failed to initialize Kreuzberg:', error);
|
|
330
|
-
}
|
|
331
|
-
}
|
|
332
|
-
|
|
333
|
-
async function extractDocument(file: File) {
|
|
334
|
-
const bytes = new Uint8Array(await file.arrayBuffer());
|
|
335
132
|
|
|
336
|
-
|
|
337
|
-
const result = await extractBytes(bytes, file.type, {
|
|
338
|
-
extract_tables: true,
|
|
339
|
-
extract_images: true
|
|
340
|
-
});
|
|
133
|
+
**With OCR (for scanned documents):**
|
|
341
134
|
|
|
342
|
-
|
|
343
|
-
}
|
|
135
|
+
```ts
|
|
136
|
+
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
344
137
|
|
|
345
|
-
|
|
346
|
-
await
|
|
347
|
-
|
|
348
|
-
// Later, handle file uploads
|
|
349
|
-
fileInput.addEventListener('change', async (e) => {
|
|
350
|
-
const file = e.target.files?.[0];
|
|
351
|
-
if (file) {
|
|
352
|
-
const result = await extractDocument(file);
|
|
353
|
-
console.log('Extracted text:', result.content);
|
|
354
|
-
}
|
|
355
|
-
});
|
|
356
|
-
```
|
|
138
|
+
async function extractWithOcr() {
|
|
139
|
+
await initWasm();
|
|
357
140
|
|
|
358
|
-
|
|
141
|
+
try {
|
|
142
|
+
await enableOcr();
|
|
143
|
+
console.log("OCR enabled successfully");
|
|
144
|
+
} catch (error) {
|
|
145
|
+
console.error("Failed to enable OCR:", error);
|
|
146
|
+
return;
|
|
147
|
+
}
|
|
359
148
|
|
|
360
|
-
|
|
361
|
-
- **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
|
|
362
|
-
- **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
|
|
149
|
+
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
|
|
363
150
|
|
|
364
|
-
|
|
365
|
-
|
|
366
|
-
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
```typescript
|
|
371
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
372
|
-
|
|
373
|
-
const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());
|
|
374
|
-
|
|
375
|
-
const result = await extractBytes(
|
|
376
|
-
new Uint8Array(imageBytes),
|
|
377
|
-
'image/jpeg',
|
|
378
|
-
{
|
|
379
|
-
enable_ocr: true,
|
|
380
|
-
ocr_config: {
|
|
381
|
-
languages: ['eng'], // English
|
|
382
|
-
backend: 'tesseract-wasm'
|
|
383
|
-
}
|
|
384
|
-
}
|
|
385
|
-
);
|
|
386
|
-
|
|
387
|
-
console.log('OCR text:', result.content);
|
|
388
|
-
```
|
|
151
|
+
const result = await extractBytes(bytes, "image/png", {
|
|
152
|
+
ocr: {
|
|
153
|
+
backend: "tesseract-wasm",
|
|
154
|
+
language: "eng",
|
|
155
|
+
},
|
|
156
|
+
});
|
|
389
157
|
|
|
390
|
-
|
|
158
|
+
console.log("Extracted text:");
|
|
159
|
+
console.log(result.content);
|
|
160
|
+
}
|
|
391
161
|
|
|
392
|
-
|
|
393
|
-
const result = await extractBytes(imageBytes, 'image/png', {
|
|
394
|
-
enable_ocr: true,
|
|
395
|
-
ocr_config: {
|
|
396
|
-
languages: ['eng', 'deu', 'fra'], // English, German, French
|
|
397
|
-
backend: 'tesseract-wasm'
|
|
398
|
-
}
|
|
399
|
-
});
|
|
162
|
+
extractWithOcr().catch(console.error);
|
|
400
163
|
```
|
|
401
164
|
|
|
402
|
-
### Supported Languages
|
|
403
165
|
|
|
404
|
-
`eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
|
|
405
166
|
|
|
406
|
-
Training data is automatically loaded from jsDelivr CDN:
|
|
407
|
-
```
|
|
408
|
-
https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
|
|
409
|
-
```
|
|
410
167
|
|
|
411
|
-
|
|
168
|
+
#### Table Extraction
|
|
412
169
|
|
|
413
|
-
### Extract Tables
|
|
414
170
|
|
|
415
|
-
|
|
416
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
171
|
+
See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
|
|
417
172
|
|
|
418
|
-
const result = await extractBytes(pdfBytes, 'application/pdf', {
|
|
419
|
-
extract_tables: true
|
|
420
|
-
});
|
|
421
173
|
|
|
422
|
-
if (result.tables) {
|
|
423
|
-
for (const table of result.tables) {
|
|
424
|
-
console.log('Table as Markdown:');
|
|
425
|
-
console.log(table.markdown);
|
|
426
174
|
|
|
427
|
-
|
|
428
|
-
console.log(JSON.stringify(table.cells, null, 2));
|
|
429
|
-
}
|
|
430
|
-
}
|
|
431
|
-
```
|
|
432
|
-
|
|
433
|
-
### Extract Images
|
|
175
|
+
#### Processing Multiple Files
|
|
434
176
|
|
|
435
|
-
```typescript
|
|
436
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
437
177
|
|
|
438
|
-
|
|
439
|
-
|
|
440
|
-
image_config: {
|
|
441
|
-
target_dpi: 300,
|
|
442
|
-
max_image_dimension: 4096
|
|
443
|
-
}
|
|
444
|
-
});
|
|
178
|
+
```ts
|
|
179
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
445
180
|
|
|
446
|
-
|
|
447
|
-
|
|
448
|
-
|
|
449
|
-
|
|
450
|
-
}
|
|
181
|
+
interface DocumentJob {
|
|
182
|
+
name: string;
|
|
183
|
+
bytes: Uint8Array;
|
|
184
|
+
mimeType: string;
|
|
451
185
|
}
|
|
452
|
-
```
|
|
453
|
-
|
|
454
|
-
### Text Chunking
|
|
455
186
|
|
|
456
|
-
|
|
457
|
-
|
|
458
|
-
|
|
459
|
-
const
|
|
460
|
-
|
|
461
|
-
|
|
462
|
-
|
|
463
|
-
|
|
464
|
-
|
|
465
|
-
|
|
466
|
-
|
|
467
|
-
if (
|
|
468
|
-
|
|
469
|
-
|
|
470
|
-
|
|
187
|
+
async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
|
|
188
|
+
await initWasm();
|
|
189
|
+
|
|
190
|
+
const results: Record<string, string> = {};
|
|
191
|
+
const queue = [...documents];
|
|
192
|
+
|
|
193
|
+
const workers = Array(concurrency)
|
|
194
|
+
.fill(null)
|
|
195
|
+
.map(async () => {
|
|
196
|
+
while (queue.length > 0) {
|
|
197
|
+
const doc = queue.shift();
|
|
198
|
+
if (!doc) break;
|
|
199
|
+
|
|
200
|
+
try {
|
|
201
|
+
const result = await extractBytes(doc.bytes, doc.mimeType);
|
|
202
|
+
results[doc.name] = result.content;
|
|
203
|
+
} catch (error) {
|
|
204
|
+
console.error(`Failed to process ${doc.name}:`, error);
|
|
205
|
+
}
|
|
206
|
+
}
|
|
207
|
+
});
|
|
208
|
+
|
|
209
|
+
await Promise.all(workers);
|
|
210
|
+
return results;
|
|
471
211
|
}
|
|
472
212
|
```
|
|
473
213
|
|
|
474
|
-
### Language Detection
|
|
475
214
|
|
|
476
|
-
```typescript
|
|
477
|
-
import { extractBytes } from '@kreuzberg/wasm';
|
|
478
215
|
|
|
479
|
-
const result = await extractBytes(pdfBytes, 'application/pdf', {
|
|
480
|
-
enable_language_detection: true
|
|
481
|
-
});
|
|
482
216
|
|
|
483
|
-
if (result.language) {
|
|
484
|
-
console.log(`Detected language: ${result.language.code}`);
|
|
485
|
-
console.log(`Confidence: ${result.language.confidence}`);
|
|
486
|
-
}
|
|
487
|
-
```
|
|
488
217
|
|
|
489
|
-
|
|
490
|
-
|
|
491
|
-
```typescript
|
|
492
|
-
import {
|
|
493
|
-
extractBytes,
|
|
494
|
-
type ExtractionConfig
|
|
495
|
-
} from '@kreuzberg/wasm';
|
|
496
|
-
|
|
497
|
-
const config: ExtractionConfig = {
|
|
498
|
-
extract_tables: true,
|
|
499
|
-
extract_images: true,
|
|
500
|
-
extract_metadata: true,
|
|
501
|
-
|
|
502
|
-
enable_ocr: true,
|
|
503
|
-
ocr_config: {
|
|
504
|
-
languages: ['eng'],
|
|
505
|
-
backend: 'tesseract-wasm',
|
|
506
|
-
dpi: 300,
|
|
507
|
-
preprocessing: {
|
|
508
|
-
deskew: true,
|
|
509
|
-
denoise: true,
|
|
510
|
-
binarize: true
|
|
511
|
-
}
|
|
512
|
-
},
|
|
513
|
-
|
|
514
|
-
enable_chunking: true,
|
|
515
|
-
chunking_config: {
|
|
516
|
-
max_chars: 1000,
|
|
517
|
-
max_overlap: 200
|
|
518
|
-
},
|
|
519
|
-
|
|
520
|
-
enable_language_detection: true,
|
|
521
|
-
|
|
522
|
-
enable_quality: true,
|
|
523
|
-
|
|
524
|
-
extract_keywords: true,
|
|
525
|
-
keywords_config: {
|
|
526
|
-
max_keywords: 10,
|
|
527
|
-
method: 'yake'
|
|
528
|
-
}
|
|
529
|
-
};
|
|
530
|
-
|
|
531
|
-
const result = await extractBytes(data, mimeType, config);
|
|
532
|
-
```
|
|
218
|
+
#### Async Processing
|
|
533
219
|
|
|
534
|
-
|
|
220
|
+
For non-blocking document processing:
|
|
535
221
|
|
|
536
|
-
|
|
222
|
+
```ts
|
|
223
|
+
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
|
|
537
224
|
|
|
538
|
-
|
|
539
|
-
|
|
225
|
+
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
|
|
226
|
+
const caps = getWasmCapabilities();
|
|
227
|
+
if (!caps.hasWasm) {
|
|
228
|
+
throw new Error("WebAssembly not supported");
|
|
229
|
+
}
|
|
540
230
|
|
|
541
|
-
|
|
542
|
-
const fileInput = document.querySelector<HTMLInputElement>('#files');
|
|
543
|
-
const files = Array.from(fileInput.files);
|
|
231
|
+
await initWasm();
|
|
544
232
|
|
|
545
|
-
const results = await
|
|
546
|
-
extract_tables: true
|
|
547
|
-
});
|
|
233
|
+
const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
|
|
548
234
|
|
|
549
|
-
|
|
550
|
-
|
|
235
|
+
return results.map((r) => ({
|
|
236
|
+
content: r.content,
|
|
237
|
+
pageCount: r.metadata?.pageCount,
|
|
238
|
+
}));
|
|
551
239
|
}
|
|
552
240
|
|
|
553
|
-
|
|
554
|
-
const
|
|
555
|
-
const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
|
|
241
|
+
const fileBytes = [new Uint8Array([1, 2, 3])];
|
|
242
|
+
const mimes = ["application/pdf"];
|
|
556
243
|
|
|
557
|
-
|
|
244
|
+
extractDocuments(fileBytes, mimes)
|
|
245
|
+
.then((results) => console.log(results))
|
|
246
|
+
.catch(console.error);
|
|
558
247
|
```
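
The async example above relies on `files` and `mimeTypes` having the same length, and `Promise.all` rejects as soon as any single extraction fails. The following sketch shows one way to validate the pairing up front and collect a per-file outcome instead. The names `pairInputs`, `extractEach`, `ExtractionInput`, and `Outcome` are assumptions made for illustration. The extractor is passed in as a parameter so the block stays self-contained; it is assumed to take `(bytes, mimeType)` like the `extractBytes` call shown above.

```ts
// Illustrative types and helpers; none of these are @kreuzberg/wasm APIs.
interface ExtractionInput {
  bytes: Uint8Array;
  mimeType: string;
}

type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

type Extractor<T> = (bytes: Uint8Array, mimeType: string) => Promise<T>;

// Pairs each file with its MIME type and fails early on a length mismatch,
// instead of silently passing `undefined` as a MIME type.
function pairInputs(files: Uint8Array[], mimeTypes: string[]): ExtractionInput[] {
  if (files.length !== mimeTypes.length) {
    throw new RangeError(`Got ${files.length} files but ${mimeTypes.length} MIME types`);
  }
  return files.map((bytes, index) => {
    const mimeType = mimeTypes[index];
    if (mimeType === undefined) {
      throw new RangeError(`Missing MIME type for file at index ${index}`);
    }
    return { bytes, mimeType };
  });
}

// Runs every extraction concurrently and records success or failure per
// input, so one bad document does not discard the other results.
async function extractEach<T>(
  inputs: ExtractionInput[],
  extract: Extractor<T>,
): Promise<Outcome<T>[]> {
  return Promise.all(
    inputs.map(async ({ bytes, mimeType }): Promise<Outcome<T>> => {
      try {
        return { ok: true, value: await extract(bytes, mimeType) };
      } catch (error) {
        return { ok: false, error };
      }
    }),
  );
}
```

With the package installed, a call along the lines of `extractEach(pairInputs(files, mimeTypes), extractBytes)` after `await initWasm()` would be the expected usage, assuming `extractBytes` accepts those two arguments as in the example above.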
|
|
559
248
|
|
|
560
|
-
### Synchronous Extraction
|
|
561
249
|
|
|
562
|
-
```typescript
|
|
563
|
-
import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
|
|
564
250
|
|
|
565
|
-
// Synchronous single extraction
|
|
566
|
-
const result = extractBytesSync(data, 'application/pdf', config);
|
|
567
251
|
|
|
568
|
-
// Synchronous batch extraction
|
|
569
|
-
const results = batchExtractBytesSync(dataList, mimeTypes, config);
|
|
570
|
-
```
|
|
571
252
|
|
|
572
|
-
### Plugin System
|
|
573
253
|
|
|
574
|
-
|
|
254
|
+
### Next Steps
|
|
575
255
|
|
|
576
|
-
|
|
577
|
-
|
|
256
|
+
- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
|
|
257
|
+
- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
|
|
258
|
+
- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
|
|
259
|
+
- **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
|
|
260
|
+
- **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
|
|
578
261
|
|
|
579
|
-
registerPostProcessor({
|
|
580
|
-
name: 'uppercase',
|
|
581
|
-
async process(result) {
|
|
582
|
-
return {
|
|
583
|
-
...result,
|
|
584
|
-
content: result.content.toUpperCase()
|
|
585
|
-
};
|
|
586
|
-
}
|
|
587
|
-
});
|
|
588
262
|
|
|
589
|
-
// Now all extractions will use this processor
|
|
590
|
-
const result = await extractBytes(data, mimeType);
|
|
591
|
-
console.log(result.content); // UPPERCASE TEXT
|
|
592
|
-
```
|
|
593
263
|
|
|
594
|
-
|
|
264
|
+
## Features
|
|
595
265
|
|
|
596
|
-
|
|
597
|
-
import { registerValidator } from '@kreuzberg/wasm';
|
|
266
|
+
### Supported File Formats (56+)
|
|
598
267
|
|
|
599
|
-
|
|
600
|
-
name: 'min-length',
|
|
601
|
-
async validate(result) {
|
|
602
|
-
if (result.content.length < 100) {
|
|
603
|
-
throw new Error('Document too short');
|
|
604
|
-
}
|
|
605
|
-
}
|
|
606
|
-
});
|
|
607
|
-
```
|
|
268
|
+
56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
|
|
608
269
|
|
|
609
|
-
####
|
|
610
|
-
|
|
611
|
-
```typescript
|
|
612
|
-
import { registerOcrBackend } from '@kreuzberg/wasm';
|
|
613
|
-
|
|
614
|
-
registerOcrBackend({
|
|
615
|
-
name: 'custom-ocr',
|
|
616
|
-
supportedLanguages() {
|
|
617
|
-
return ['eng', 'fra'];
|
|
618
|
-
},
|
|
619
|
-
async initialize() {
|
|
620
|
-
// Initialize your OCR backend
|
|
621
|
-
},
|
|
622
|
-
async processImage(imageBytes, language) {
|
|
623
|
-
// Process image and return result
|
|
624
|
-
return {
|
|
625
|
-
content: 'extracted text',
|
|
626
|
-
mime_type: 'text/plain',
|
|
627
|
-
metadata: {},
|
|
628
|
-
tables: []
|
|
629
|
-
};
|
|
630
|
-
},
|
|
631
|
-
async shutdown() {
|
|
632
|
-
// Cleanup
|
|
633
|
-
}
|
|
634
|
-
});
|
|
635
|
-
```
|
|
270
|
+
#### Office Documents
|
|
636
271
|
|
|
637
|
-
|
|
272
|
+
| Category | Formats | Capabilities |
|
|
273
|
+
|----------|---------|--------------|
|
|
274
|
+
| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
|
|
275
|
+
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
|
|
276
|
+
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
|
|
277
|
+
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
|
|
278
|
+
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
|
|
638
279
|
|
|
639
|
-
|
|
640
|
-
import {
|
|
641
|
-
detectMimeFromBytes,
|
|
642
|
-
getMimeFromExtension,
|
|
643
|
-
getExtensionsForMime,
|
|
644
|
-
normalizeMimeType
|
|
645
|
-
} from '@kreuzberg/wasm';
|
|
280
|
+
#### Images (OCR-Enabled)
|
|
646
281
|
|
|
647
|
-
|
|
648
|
-
|
|
282
|
+
| Category | Formats | Features |
|
|
283
|
+
|----------|---------|----------|
|
|
284
|
+
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
|
|
285
|
+
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
|
|
286
|
+
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
|
|
649
287
|
|
|
650
|
-
|
|
651
|
-
const mime = getMimeFromExtension('pdf'); // 'application/pdf'
|
|
288
|
+
#### Web & Data
|
|
652
289
|
|
|
653
|
-
|
|
654
|
-
|
|
290
|
+
| Category | Formats | Features |
|
|
291
|
+
|----------|---------|----------|
|
|
292
|
+
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
|
|
293
|
+
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
|
|
294
|
+
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
|
|
655
295
|
|
|
656
|
-
|
|
657
|
-
const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
|
|
658
|
-
```
|
|
296
|
+
#### Email & Archives
|
|
659
297
|
|
|
660
|
-
|
|
298
|
+
| Category | Formats | Features |
|
|
299
|
+
|----------|---------|----------|
|
|
300
|
+
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
|
|
301
|
+
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
|
|
661
302
|
|
|
662
|
-
|
|
663
|
-
import { loadConfigFromString } from '@kreuzberg/wasm';
|
|
303
|
+
#### Academic & Scientific
|
|
664
304
|
|
|
665
|
-
|
|
666
|
-
|
|
667
|
-
|
|
668
|
-
|
|
669
|
-
|
|
670
|
-
languages: [eng, deu]
|
|
671
|
-
`;
|
|
672
|
-
const config = loadConfigFromString(yamlConfig, 'yaml');
|
|
305
|
+
| Category | Formats | Features |
|
|
306
|
+
|----------|---------|----------|
|
|
307
|
+
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
|
|
308
|
+
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
|
|
309
|
+
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
|
|
673
310
|
|
|
674
|
-
|
|
675
|
-
const jsonConfig = '{"extract_tables":true}';
|
|
676
|
-
const config2 = loadConfigFromString(jsonConfig, 'json');
|
|
311
|
+
**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
|
|
677
312
|
|
|
678
|
-
|
|
679
|
-
const tomlConfig = 'extract_tables = true';
|
|
680
|
-
const config3 = loadConfigFromString(tomlConfig, 'toml');
|
|
681
|
-
```
|
|
313
|
+
### Key Capabilities
|
|
682
314
|
|
|
683
|
-
|
|
315
|
+
- **Text Extraction** - Extract all text content with position and formatting information
|
|
316
|
+
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
|
|
317
|
+
- **Table Extraction** - Parse tables with structure and cell content preservation
|
|
318
|
+
- **Image Extraction** - Extract embedded images and render page previews
|
|
319
|
+
- **OCR Support** - Integrate multiple OCR backends for scanned documents
|
|
684
320
|
|
|
685
|
-
|
|
321
|
+
- **Async/Await** - Non-blocking document processing with concurrent operations
|
|
686
322
|
|
|
687
|
-
#### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
|
|
688
|
-
Extract content from a browser `File` object.
|
|
689
323
|
|
|
690
|
-
|
|
691
|
-
Asynchronously extract content from a `Uint8Array`.
|
|
324
|
+
- **Plugin System** - Extensible post-processing for custom text transformation
|
|
692
325
|
|
|
693
|
-
#### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
|
|
694
|
-
Synchronously extract content from a `Uint8Array`.
|
|
695
326
|
|
|
696
|
-
|
|
697
|
-
|
|
327
|
+
- **Batch Processing** - Efficiently process multiple documents in parallel
|
|
328
|
+
- **Memory Efficient** - Stream large files without loading entirely into memory
|
|
329
|
+
- **Language Detection** - Detect and support multiple languages in documents
|
|
330
|
+
- **Configuration** - Fine-grained control over extraction behavior
|
|
698
331
|
|
|
699
|
-
|
|
700
|
-
Extract multiple byte arrays in parallel.
|
|
332
|
+
### Performance Characteristics
|
|
701
333
|
|
|
702
|
-
|
|
703
|
-
|
|
334
|
+
| Format | Speed | Memory | Notes |
|
|
335
|
+
|--------|-------|--------|-------|
|
|
336
|
+
| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
|
|
337
|
+
| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
|
|
338
|
+
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
|
|
339
|
+
| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
|
|
340
|
+
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
|
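The throughput ranges in the table can be turned into rough time estimates. The sketch below is illustrative arithmetic only: `THROUGHPUT_MB_PER_S` and `estimateSeconds` are hypothetical names that are not part of `@kreuzberg/wasm`, and the figures are copied from the table rather than measured.

```typescript
// Throughput ranges in MB/s, copied from the table above (not measured here).
type Range = { min: number; max: number };

const THROUGHPUT_MB_PER_S: Record<string, Range> = {
  "pdf-text": { min: 10, max: 100 },
  office: { min: 20, max: 200 },
  "image-ocr": { min: 1, max: 5 },
  archive: { min: 5, max: 50 },
  web: { min: 50, max: 200 },
};

// Hypothetical helper: best-case and worst-case seconds for a file of `sizeMb` megabytes.
function estimateSeconds(kind: string, sizeMb: number): { best: number; worst: number } {
  const range = THROUGHPUT_MB_PER_S[kind];
  if (!range) throw new Error(`Unknown format kind: ${kind}`);
  return { best: sizeMb / range.max, worst: sizeMb / range.min };
}

const estimate = estimateSeconds("pdf-text", 25);
console.log(estimate); // { best: 0.25, worst: 2.5 }
```

Real numbers depend on the runtime, the document, and (for OCR) the backend, so treat the output as an order-of-magnitude guide.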
|
704
341
|
|
|
705
|
-
### Plugin Management
|
|
706
342
|
|
|
707
|
-
#### Post-Processors
|
|
708
343
|
|
|
709
|
-
|
|
710
|
-
registerPostProcessor(processor: PostProcessorProtocol): void
|
|
711
|
-
unregisterPostProcessor(name: string): void
|
|
712
|
-
clearPostProcessors(): void
|
|
713
|
-
listPostProcessors(): string[]
|
|
714
|
-
```
|
|
344
|
+
## OCR Support
|
|
715
345
|
|
|
716
|
-
|
|
346
|
+
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
|
|
717
347
|
|
|
718
|
-
```typescript
|
|
719
|
-
registerValidator(validator: ValidatorProtocol): void
|
|
720
|
-
unregisterValidator(name: string): void
|
|
721
|
-
clearValidators(): void
|
|
722
|
-
listValidators(): string[]
|
|
723
|
-
```
|
|
724
348
|
|
|
725
|
-
|
|
349
|
+
- **Tesseract-Wasm**
|
|
726
350
|
|
|
727
|
-
```typescript
|
|
728
|
-
registerOcrBackend(backend: OcrBackendProtocol): void
|
|
729
|
-
unregisterOcrBackend(name: string): void
|
|
730
|
-
clearOcrBackends(): void
|
|
731
|
-
listOcrBackends(): string[]
|
|
732
|
-
```
|
|
733
351
|
|
|
734
|
-
###
|
|
352
|
+
### OCR Configuration Example
|
|
735
353
|
|
|
736
|
-
```
|
|
737
|
-
|
|
738
|
-
unregisterDocumentExtractor(name: string): void
|
|
739
|
-
clearDocumentExtractors(): void
|
|
740
|
-
```
|
|
354
|
+
```ts
|
|
355
|
+
import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
741
356
|
|
|
742
|
-
|
|
357
|
+
async function extractWithOcr() {
|
|
358
|
+
await initWasm();
|
|
743
359
|
|
|
744
|
-
|
|
745
|
-
|
|
746
|
-
|
|
747
|
-
|
|
748
|
-
|
|
749
|
-
|
|
360
|
+
try {
|
|
361
|
+
await enableOcr();
|
|
362
|
+
console.log("OCR enabled successfully");
|
|
363
|
+
} catch (error) {
|
|
364
|
+
console.error("Failed to enable OCR:", error);
|
|
365
|
+
return;
|
|
366
|
+
}
|
|
750
367
|
|
|
751
|
-
|
|
368
|
+
const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
|
|
752
369
|
|
|
753
|
-
|
|
754
|
-
|
|
755
|
-
|
|
370
|
+
const result = await extractBytes(bytes, "image/png", {
|
|
371
|
+
ocr: {
|
|
372
|
+
backend: "tesseract-wasm",
|
|
373
|
+
language: "eng",
|
|
374
|
+
},
|
|
375
|
+
});
|
|
756
376
|
|
|
757
|
-
|
|
377
|
+
console.log("Extracted text:");
|
|
378
|
+
console.log(result.content);
|
|
379
|
+
}
|
|
758
380
|
|
|
759
|
-
|
|
760
|
-
listEmbeddingPresets(): string[]
|
|
761
|
-
getEmbeddingPreset(name: string): EmbeddingPreset | null
|
|
381
|
+
extractWithOcr().catch(console.error);
|
|
762
382
|
```
|
|
763
383
|
|
|
764
|
-
## Types
|
|
765
|
-
|
|
766
|
-
All types are shared via the `@kreuzberg/core` package:
|
|
767
|
-
|
|
768
|
-
```typescript
|
|
769
|
-
import type {
|
|
770
|
-
ExtractionResult,
|
|
771
|
-
ExtractionConfig,
|
|
772
|
-
OcrConfig,
|
|
773
|
-
ChunkingConfig,
|
|
774
|
-
ImageConfig,
|
|
775
|
-
KeywordsConfig,
|
|
776
|
-
Table,
|
|
777
|
-
ExtractedImage,
|
|
778
|
-
Chunk,
|
|
779
|
-
Metadata,
|
|
780
|
-
PostProcessorProtocol,
|
|
781
|
-
ValidatorProtocol,
|
|
782
|
-
OcrBackendProtocol
|
|
783
|
-
} from '@kreuzberg/core';
|
|
784
|
-
```
|
|
785
384
|
|
|
786
|
-
### ExtractionResult
|
|
787
|
-
|
|
788
|
-
Main result object containing:
|
|
789
|
-
- `content: string` - Extracted text content
|
|
790
|
-
- `mime_type: string` - MIME type of the document
|
|
791
|
-
- `metadata?: Metadata` - Document metadata
|
|
792
|
-
- `tables?: Table[]` - Extracted tables
|
|
793
|
-
- `images?: ExtractedImage[]` - Extracted images
|
|
794
|
-
- `chunks?: Chunk[]` - Text chunks (if chunking enabled)
|
|
795
|
-
- `language?: LanguageInfo` - Detected language (if enabled)
|
|
796
|
-
- `keywords?: Keyword[]` - Extracted keywords (if enabled)
|
|
797
|
-
|
|
798
|
-
### ExtractionConfig
|
|
799
|
-
|
|
800
|
-
Configuration object for extraction:
|
|
801
|
-
- `extract_tables?: boolean` - Extract tables as structured data
|
|
802
|
-
- `extract_images?: boolean` - Extract embedded images
|
|
803
|
-
- `extract_metadata?: boolean` - Extract document metadata
|
|
804
|
-
- `enable_ocr?: boolean` - Enable OCR for images
|
|
805
|
-
- `ocr_config?: OcrConfig` - OCR settings
|
|
806
|
-
- `enable_chunking?: boolean` - Split text into semantic chunks
|
|
807
|
-
- `chunking_config?: ChunkingConfig` - Text chunking settings
|
|
808
|
-
- `enable_language_detection?: boolean` - Detect document language
|
|
809
|
-
- `enable_quality?: boolean` - Encoding detection, normalization
|
|
810
|
-
- `extract_keywords?: boolean` - Extract important keywords
|
|
811
|
-
- `keywords_config?: KeywordsConfig` - Keyword extraction settings
|
|
812
|
-
|
|
813
|
-
### Table
|
|
814
|
-
|
|
815
|
-
Extracted table structure:
|
|
816
|
-
- `markdown: string` - Table in Markdown format
|
|
817
|
-
- `cells: TableCell[][]` - 2D array of table cells
|
|
818
|
-
- `row_count: number` - Number of rows
|
|
819
|
-
- `column_count: number` - Number of columns
|
|
820
|
-
|
|
821
|
-
## Supported Formats
|
|
822
|
-
|
|
823
|
-
| Category | Formats |
|
|
824
|
-
|----------|---------|
|
|
825
|
-
| **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
|
|
826
|
-
| **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
|
|
827
|
-
| **Web** | HTML, XHTML, XML, EPUB |
|
|
828
|
-
| **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
|
|
829
|
-
| **Email** | EML, MSG |
|
|
830
|
-
| **Archives** | ZIP, TAR, 7Z |
|
|
831
|
-
| **Other** | And 30+ more formats |
|
|
832
|
-
|
|
833
|
-
## Build from Source
|
|
834
|
-
|
|
835
|
-
### Prerequisites
|
|
836
|
-
|
|
837
|
-
- Rust 1.75+ with `wasm32-unknown-unknown` target
|
|
838
|
-
- Node.js 18+ with pnpm
|
|
839
|
-
- wasm-pack
|
|
840
385
|
|
|
841
|
-
```bash
|
|
842
|
-
# Install Rust target
|
|
843
|
-
rustup target add wasm32-unknown-unknown
|
|
844
386
|
|
|
845
|
-
|
|
846
|
-
curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
|
|
387
|
+
## Async Support
|
|
847
388
|
|
|
848
|
-
|
|
849
|
-
cd crates/kreuzberg-wasm
|
|
850
|
-
pnpm install
|
|
851
|
-
pnpm run build
|
|
389
|
+
This binding provides full async/await support for non-blocking document processing:
|
|
852
390
|
|
|
853
|
-
|
|
854
|
-
|
|
855
|
-
```
|
|
391
|
+
```ts
|
|
392
|
+
import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";
|
|
856
393
|
|
|
857
|
-
|
|
394
|
+
async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
|
|
395
|
+
const caps = getWasmCapabilities();
|
|
396
|
+
if (!caps.hasWasm) {
|
|
397
|
+
throw new Error("WebAssembly not supported");
|
|
398
|
+
}
|
|
858
399
|
|
|
859
|
-
|
|
860
|
-
# For browsers (ESM modules)
|
|
861
|
-
pnpm run build:wasm:web
|
|
400
|
+
await initWasm();
|
|
862
401
|
|
|
863
|
-
|
|
864
|
-
pnpm run build:wasm:bundler
|
|
402
|
+
const results = await Promise.all(files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])));
|
|
865
403
|
|
|
866
|
-
|
|
867
|
-
|
|
404
|
+
return results.map((r) => ({
|
|
405
|
+
content: r.content,
|
|
406
|
+
pageCount: r.metadata?.pageCount,
|
|
407
|
+
}));
|
|
408
|
+
}
|
|
868
409
|
|
|
869
|
-
|
|
870
|
-
|
|
410
|
+
const fileBytes = [new Uint8Array([1, 2, 3])];
|
|
411
|
+
const mimes = ["application/pdf"];
|
|
871
412
|
|
|
872
|
-
|
|
873
|
-
|
|
413
|
+
extractDocuments(fileBytes, mimes)
|
|
414
|
+
.then((results) => console.log(results))
|
|
415
|
+
.catch(console.error);
|
|
874
416
|
```
|
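`Promise.all` rejects as soon as any single extraction fails, which discards the results of the documents that succeeded. If partial results are preferable, `Promise.allSettled` is one option. The following is a minimal sketch under stated assumptions: `extractAllSettled`, `Extractor`, `Job`, and `Outcome` are hypothetical names, and the extractor is passed in as a parameter (in real use it would presumably wrap `extractBytes`) so the snippet runs without loading the WASM module.

```typescript
// Hypothetical shape: any async function that turns bytes + MIME type into text.
type Extractor = (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>;

type Job = { bytes: Uint8Array; mimeType: string };

type Outcome =
  | { ok: true; index: number; content: string }
  | { ok: false; index: number; error: string };

// Start every extraction at once and report per-document success or failure.
async function extractAllSettled(extract: Extractor, jobs: Job[]): Promise<Outcome[]> {
  const settled = await Promise.allSettled(
    jobs.map((job) => extract(job.bytes, job.mimeType)),
  );
  return settled.map((result, index): Outcome =>
    result.status === "fulfilled"
      ? { ok: true, index, content: result.value.content }
      : { ok: false, index, error: String(result.reason) },
  );
}
```

Callers can then filter on `ok` to separate extracted documents from failures instead of wrapping the whole batch in a single `catch`.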
|
875
417
|
|
|
876
|
-
## Limitations
|
|
877
418
|
|
|
878
|
-
### No File System Access
|
|
879
419
|
|
|
880
|
-
The WASM binding cannot access the file system directly. Use file readers:
|
|
881
420
|
|
|
882
|
-
|
|
883
|
-
// ❌ Won't work
|
|
884
|
-
await extractFileSync('./document.pdf'); // Throws error
|
|
421
|
+
## Plugin System
|
|
885
422
|
|
|
886
|
-
|
|
887
|
-
const bytes = await Deno.readFile('./document.pdf'); // Deno
|
|
888
|
-
const bytes = await fs.readFile('./document.pdf'); // Node.js
|
|
889
|
-
const bytes = await file.arrayBuffer(); // Browser
|
|
890
|
-
```
|
|
423
|
+
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
|
|
891
424
|
|
|
892
|
-
|
|
425
|
+
For detailed plugin documentation, visit the [Plugin System Guide](https://kreuzberg.dev/plugins/).
|
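As a rough illustration of what post-processing means here: the 4.0.0-rc.6 README described a post-processor as an object with a `name` and an async `process(result)` method that returns a modified result. The sketch below mimics that shape with local stand-in types. `ResultLike`, `PostProcessorLike`, and `runPostProcessors` are hypothetical and are not the package's API; whether 4.0.0 keeps the same plugin shape should be checked against the plugin guide.

```typescript
// Local stand-ins for illustration; the real types live in the package.
interface ResultLike {
  content: string;
}

interface PostProcessorLike {
  name: string;
  process(result: ResultLike): Promise<ResultLike>;
}

// Apply processors in order, feeding each one the previous output.
async function runPostProcessors(
  result: ResultLike,
  processors: PostProcessorLike[],
): Promise<ResultLike> {
  let current = result;
  for (const processor of processors) {
    current = await processor.process(current);
  }
  return current;
}

// Example processor: collapse runs of whitespace into single spaces.
const collapseWhitespace: PostProcessorLike = {
  name: "collapse-whitespace",
  async process(result) {
    return { ...result, content: result.content.replace(/\s+/g, " ").trim() };
  },
};
```

Returning a new object rather than mutating `result` keeps each processor independent of the others in the chain.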
|
893
426
|
|
|
894
|
-
Tesseract training data (`.traineddata` files) are loaded from jsDelivr CDN on first use. For offline usage or custom CDN, see the [OCR documentation](https://kreuzberg.dev).
|
|
895
427
|
|
|
896
|
-
### Size Constraints
|
|
897
428
|
|
|
898
|
-
Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
|
|
899
429
|
|
|
900
|
-
## Troubleshooting
|
|
901
430
|
|
|
902
|
-
### "WASM module failed to initialize"
|
|
903
431
|
|
|
904
|
-
|
|
432
|
+
## Batch Processing
|
|
905
433
|
|
|
906
|
-
|
|
907
|
-
```typescript
|
|
908
|
-
// vite.config.ts
|
|
909
|
-
export default {
|
|
910
|
-
optimizeDeps: {
|
|
911
|
-
exclude: ['@kreuzberg/wasm']
|
|
912
|
-
}
|
|
913
|
-
}
|
|
914
|
-
```
|
|
434
|
+
Process multiple documents efficiently:
|
|
915
435
|
|
|
916
|
-
|
|
917
|
-
|
|
918
|
-
// webpack.config.js
|
|
919
|
-
module.exports = {
|
|
920
|
-
experiments: {
|
|
921
|
-
asyncWebAssembly: true
|
|
922
|
-
}
|
|
923
|
-
}
|
|
924
|
-
```
|
|
925
|
-
|
|
926
|
-
### "Module not found: @kreuzberg/core"
|
|
436
|
+
```ts
|
|
437
|
+
import { extractBytes, initWasm } from "@kreuzberg/wasm";
|
|
927
438
|
|
|
928
|
-
|
|
439
|
+
interface DocumentJob {
|
|
440
|
+
name: string;
|
|
441
|
+
bytes: Uint8Array;
|
|
442
|
+
mimeType: string;
|
|
443
|
+
}
|
|
929
444
|
|
|
930
|
-
|
|
931
|
-
|
|
445
|
+
async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
|
|
446
|
+
await initWasm();
|
|
447
|
+
|
|
448
|
+
const results: Record<string, string> = {};
|
|
449
|
+
const queue = [...documents];
|
|
450
|
+
|
|
451
|
+
const workers = Array(concurrency)
|
|
452
|
+
.fill(null)
|
|
453
|
+
.map(async () => {
|
|
454
|
+
while (queue.length > 0) {
|
|
455
|
+
const doc = queue.shift();
|
|
456
|
+
if (!doc) break;
|
|
457
|
+
|
|
458
|
+
try {
|
|
459
|
+
const result = await extractBytes(doc.bytes, doc.mimeType);
|
|
460
|
+
results[doc.name] = result.content;
|
|
461
|
+
} catch (error) {
|
|
462
|
+
console.error(`Failed to process ${doc.name}:`, error);
|
|
463
|
+
}
|
|
464
|
+
}
|
|
465
|
+
});
|
|
466
|
+
|
|
467
|
+
await Promise.all(workers);
|
|
468
|
+
return results;
|
|
469
|
+
}
|
|
932
470
|
```
|
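The worker-pool pattern above can also be separated from the extraction call, which makes it testable without loading WASM and lets failures be returned to the caller rather than only logged. This is a sketch, not part of `@kreuzberg/wasm`: `mapWithConcurrency` and `Settled` are hypothetical names, and the `task` callback stands in for a call such as `extractBytes(doc.bytes, doc.mimeType)`.

```typescript
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Hypothetical helper: run `task` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<Settled<R>[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: Settled<R>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two workers take the same item
      try {
        results[index] = { ok: true, value: await task(items[index] as T, index) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because each worker claims its index before awaiting, results stay aligned with the input array even when documents finish out of order.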
|
933
471
|
|
|
934
|
-
### Memory Issues in Workers
|
|
935
472
|
|
|
936
|
-
For large documents in Cloudflare Workers, process in smaller chunks:
|
|
937
473
|
|
|
938
|
-
```typescript
|
|
939
|
-
const result = await extractBytes(pdfBytes, 'application/pdf', {
|
|
940
|
-
chunking_config: { max_chars: 1000 }
|
|
941
|
-
});
|
|
942
|
-
```
|
|
943
474
|
|
|
944
|
-
|
|
475
|
+
## Configuration
|
|
945
476
|
|
|
946
|
-
|
|
477
|
+
For advanced configuration options including language detection, table extraction, OCR settings, and more:
|
|
947
478
|
|
|
948
|
-
|
|
479
|
+
**[Configuration Guide](https://kreuzberg.dev/configuration/)**
|
|
949
480
|
|
|
950
|
-
|
|
481
|
+
## Documentation
|
|
951
482
|
|
|
952
|
-
- **
|
|
953
|
-
- **
|
|
954
|
-
- **
|
|
955
|
-
- **Node.js**: Batch processing script
|
|
483
|
+
- **[Official Documentation](https://kreuzberg.dev/)**
|
|
484
|
+
- **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
|
|
485
|
+
- **[Examples & Guides](https://kreuzberg.dev/guides/)**
|
|
956
486
|
|
|
957
|
-
##
|
|
487
|
+
## Troubleshooting
|
|
958
488
|
|
|
959
|
-
For
|
|
489
|
+
For common issues and solutions, visit the [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
|
|
960
490
|
|
|
961
491
|
## Contributing
|
|
962
492
|
|
|
963
|
-
|
|
493
|
+
Contributions are welcome! See the [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
|
|
964
494
|
|
|
965
495
|
## License
|
|
966
496
|
|
|
967
|
-
MIT
|
|
968
|
-
|
|
969
|
-
## Links
|
|
970
|
-
|
|
971
|
-
- [Website](https://kreuzberg.dev)
|
|
972
|
-
- [Documentation](https://kreuzberg.dev)
|
|
973
|
-
- [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
|
|
974
|
-
- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
|
975
|
-
- [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
|
|
976
|
-
- [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
|
|
497
|
+
MIT License - see LICENSE file for details.
|
|
977
498
|
|
|
978
|
-
##
|
|
499
|
+
## Support
|
|
979
500
|
|
|
980
|
-
- [
|
|
981
|
-
- [
|
|
982
|
-
- [
|
|
501
|
+
- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
|
|
502
|
+
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
|
503
|
+
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
|