npm - @kreuzberg/wasm - Versions diffs - 4.0.0-rc.20 → 4.0.0-rc.23 - Mend

@kreuzberg/wasm 4.0.0-rc.20 → 4.0.0-rc.23

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29)

package/README.md +520 -837
package/dist/adapters/wasm-adapter.cjs.map +1 -1
package/dist/adapters/wasm-adapter.d.cts +1 -1
package/dist/adapters/wasm-adapter.d.ts +1 -1
package/dist/adapters/wasm-adapter.js.map +1 -1
package/dist/index.cjs +192 -48
package/dist/index.cjs.map +1 -1
package/dist/index.d.cts +219 -3
package/dist/index.d.ts +219 -3
package/dist/index.js +199 -48
package/dist/index.js.map +1 -1
package/dist/ocr/registry.cjs.map +1 -1
package/dist/ocr/registry.d.cts +1 -1
package/dist/ocr/registry.d.ts +1 -1
package/dist/ocr/registry.js.map +1 -1
package/dist/ocr/tesseract-wasm-backend.cjs +0 -46
package/dist/ocr/tesseract-wasm-backend.cjs.map +1 -1
package/dist/ocr/tesseract-wasm-backend.d.cts +1 -1
package/dist/ocr/tesseract-wasm-backend.d.ts +1 -1
package/dist/ocr/tesseract-wasm-backend.js +0 -46
package/dist/ocr/tesseract-wasm-backend.js.map +1 -1
package/dist/pdfium.js +0 -5
package/dist/runtime.cjs +0 -1
package/dist/runtime.cjs.map +1 -1
package/dist/runtime.js +0 -1
package/dist/runtime.js.map +1 -1
package/dist/{types-CKjcIYcX.d.cts → types-wVLLDHkl.d.cts} +73 -3
package/dist/{types-CKjcIYcX.d.ts → types-wVLLDHkl.d.ts} +73 -3
package/package.json +162 -162

package/README.md CHANGED

@@ -1,1059 +1,742 @@
-# Kreuzberg
-[![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
-[![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
-[![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
-[![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
-[![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
-[![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
-[![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
-[![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
-[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
-High-performance document intelligence for browsers, Deno, and Cloudflare Workers, powered by WebAssembly.
-Extract text, tables, images, and metadata from 50+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
-> **Note for Node.js/Bun Users:** If you're building for Node.js or Bun, use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) instead for ~2-3x better performance with native NAPI-RS bindings.
->
-> **This WASM package is designed for:**
-> - Browser applications (including web workers)
-> - Cloudflare Workers and edge runtimes
-> - Deno applications
-> - Environments without native build toolchain
-> **🚀 Version 4.0.0 Release Candidate**
-> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
-## Features
-- **50+ File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
-- **OCR Support**: Built-in tesseract-wasm with 40+ languages for scanned documents
-- **Table Extraction**: Advanced table detection and structured data extraction
-- **Cross-Runtime**: Browser, Deno, Cloudflare Workers, and other edge runtimes
-- **Type-Safe**: Full TypeScript definitions from shared @kreuzberg/core package
-- **API Parity**: All extraction functions from the Node.js binding
-- **Plugin System**: Custom post-processors, validators, and OCR backends
-- **Optimized Bundle**: <5MB uncompressed, <2MB compressed
-- **Zero Configuration**: Works out of the box with sensible defaults
-- **Portable**: Runs anywhere WASM is supported without native dependencies
-## Requirements
-- **Browser**: Modern browsers with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
-- **Node.js**: 18 or higher
-- **Deno**: 1.0 or higher
-- **Cloudflare Workers**: Compatible with Workers runtime
-### Optional Dependencies
-- **tesseract-wasm**: Automatically loaded for OCR functionality (40+ language support)
+# WebAssembly Bindings
+<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
+  <!-- Language Bindings -->
+  <a href="https://crates.io/crates/kreuzberg">
+    <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
+  </a>
+  <a href="https://hex.pm/packages/kreuzberg">
+    <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
+  </a>
+  <a href="https://pypi.org/project/kreuzberg/">
+    <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
+  </a>
+  <a href="https://www.npmjs.com/package/@kreuzberg/node">
+    <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
+  </a>
+  <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
+    <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
+  </a>
+<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
+    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
+  </a>
+  <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
+    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0-*" alt="Go">
+  </a>
+  <a href="https://www.nuget.org/packages/Kreuzberg/">
+    <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
+  </a>
+  <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
+    <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
+  </a>
+  <a href="https://rubygems.org/gems/kreuzberg">
+    <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
+  </a>
+<!-- Project Info -->
+<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
+    <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
+  </a>
+  <a href="https://docs.kreuzberg.dev">
+    <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
+  </a>
+</div>
+<img width="1128" height="191" alt="Banner2" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
+<div align="center" style="margin-top: 20px;">
+  <a href="https://discord.gg/pXxagNK2zN">
+      <img height="22" src="https://img.shields.io/badge/Discord-Join%20our%20community-7289da?logo=discord&logoColor=white" alt="Discord">
+  </a>
+</div>
+Extract text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Node.js, Deno, and Cloudflare Workers with portable deployment and optional multi-threading support.
+> **Version 4.0.0 Release Candidate**
+> Kreuzberg v4.0.0 is in **Release Candidate** stage. Bugs and breaking changes are expected.
+> This is a pre-release version. Please test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
 ## Installation
-### Choosing the Right Package
+### Package Installation
-| Use Case | Recommendation | Reason |
-|----------|---|---|
-| **Node.js/Bun runtime** | [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) | 2-3x faster native bindings |
-| **Browser/Web Worker** | @kreuzberg/wasm (this package) | Required for browser environments |
-| **Cloudflare Workers** | @kreuzberg/wasm (this package) | Only WASM option for Workers |
-| **Deno** | @kreuzberg/wasm (this package) | Full WASM support via npm packages |
-| **Edge runtime** | @kreuzberg/wasm (this package) | Portable across all edge platforms |
+Install via one of the supported package managers:
-### Install via npm/pnpm/yarn
+**npm:**
 ```bash
 npm install @kreuzberg/wasm
 ```
-Or with pnpm:
+**pnpm:**
 ```bash
 pnpm add @kreuzberg/wasm
 ```
-Or with yarn:
+**yarn:**
 ```bash
 yarn add @kreuzberg/wasm
 ```
-### Deno
-```typescript
-import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
-```
-## PDF Support and PDFium Initialization
-**IMPORTANT**: PDF extraction requires a one-time initialization step to load the PDFium WASM module.
+### Platform Support
-### Why PDFium Initialization is Needed
+Runs on:
+- Modern browsers (Chrome, Firefox, Safari, Edge with WebAssembly support)
+- Node.js 16+ (with WASM runtime)
+- Deno 1.0+
+- Cloudflare Workers
+- Any JavaScript environment with WebAssembly support
-Kreuzberg uses the high-performance PDFium library (from Google Chrome) for PDF processing. In WASM environments, PDFium runs as a separate WASM module that must be loaded and bound to the main kreuzberg module before PDF extraction can work.
+### System Requirements
-### How to Initialize PDFium
+- WebAssembly support in runtime environment
+- 50 MB minimum free memory for extraction
+- Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
-```javascript
-import init, { initialize_pdfium_render, extractBytes } from '@kreuzberg/wasm';
-import pdfiumModule from '@kreuzberg/wasm/pdfium.js';
+### Runtime Detection
-// Step 1: Initialize kreuzberg WASM
-await init();
+Check platform capabilities before extraction:
-// Step 2: Load PDFium WASM module
-const pdfium = await pdfiumModule();
-// Step 3: Bind kreuzberg to PDFium (required before any PDF operations)
-const success = initialize_pdfium_render(pdfium, wasm, false);
-if (!success) {
-    throw new Error('Failed to initialize PDFium');
-}
+```typescript
+import { getWasmCapabilities } from '@kreuzberg/wasm';
-// Step 4: Now PDF extraction works
-const pdfBytes = new Uint8Array(await pdfFile.arrayBuffer());
-const result = await extractBytes(pdfBytes);
-console.log(result.text);
-```
-### Error: "PdfiumWASMModuleNotConfigured"
-If you see this error, it means `initialize_pdfium_render()` was not called before attempting PDF extraction. Make sure to follow the initialization sequence above.
-### PDFium Files Location
-The PDFium WASM files (`pdfium.js`, `pdfium.wasm`) should be included in the `@kreuzberg/wasm` package. If they're missing:
-1. Check your `node_modules/@kreuzberg/wasm/` directory
-2. Ensure both `pdfium.js` and `pdfium.wasm` are present
-3. If missing, reinstall the package
-For self-hosted builds, copy the files from:
-```bash
-target/wasm32-unknown-unknown/release/build/kreuzberg-*/out/pdfium/release/node/
+const caps = getWasmCapabilities();
+console.log('WASM available:', caps.hasWasm);
+console.log('Web Workers available:', caps.hasWorkers);
+console.log('Module Workers available:', caps.hasModuleWorkers);
+console.log('File API available:', caps.hasFileApi);
+console.log('SharedArrayBuffer available:', caps.hasSharedArrayBuffer);
 ```
 ## Quick Start
-### Browser (ESM)
-```typescript
-import { extractFile } from '@kreuzberg/wasm';
-async function handleFileUpload() {
-  const fileInput = document.querySelector<HTMLInputElement>('#file-upload');
-  const file = fileInput.files[0];
+### Basic Extraction
-  const result = await extractFile(file, {
-    extract_tables: true,
-    extract_images: true
-  });
-  console.log('Extracted text:', result.content);
-  console.log('Tables found:', result.tables.length);
-}
-```
+Extract text, metadata, and structure from any supported document format:
-### Node.js (ESM)
+```ts
+import { extractBytes, initWasm } from "@kreuzberg/wasm";
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
-import { readFile } from 'fs/promises';
-const pdfBytes = await readFile('./document.pdf');
-const result = await extractBytes(
-  new Uint8Array(pdfBytes),
-  'application/pdf',
-  { extract_tables: true }
-);
-console.log(result.content);
-console.log('Found', result.tables.length, 'tables');
-```
+async function main() {
+  await initWasm();
-### Deno
+  const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
+  const bytes = new Uint8Array(buffer);
-```typescript
-import { extractBytes } from "npm:@kreuzberg/wasm@^4.0.0";
+  const result = await extractBytes(bytes, "application/pdf");
-const pdfBytes = await Deno.readFile("./document.pdf");
-const result = await extractBytes(pdfBytes, "application/pdf");
+  console.log("Extracted content:");
+  console.log(result.content);
+  console.log("MIME type:", result.mimeType);
+  console.log("Metadata:", result.metadata);
+}
-console.log(result.content);
+main().catch(console.error);
 ```
-### Cloudflare Workers
+### Common Use Cases
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
+#### Extract with Custom Configuration
-export default {
-  async fetch(request: Request): Promise<Response> {
-    if (request.method === 'POST') {
-      const formData = await request.formData();
-      const file = formData.get('file') as File;
+Most use cases benefit from configuration to control extraction behavior:
-      const arrayBuffer = await file.arrayBuffer();
-      const bytes = new Uint8Array(arrayBuffer);
+**With OCR (for scanned documents):**
-      const result = await extractBytes(bytes, file.type);
+```ts
+import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
-      return Response.json({
-        text: result.content,
-        metadata: result.metadata,
-        tables: result.tables
-      });
-    }
+async function extractWithOcr() {
+  await initWasm();
-    return new Response('Upload a file', { status: 400 });
+  try {
+    await enableOcr();
+    console.log("OCR enabled successfully");
+  } catch (error) {
+    console.error("Failed to enable OCR:", error);
+    return;
   }
-};
-```
-## Performance Comparison
-Kreuzberg WASM provides excellent portability but trades some performance for this flexibility. Here's how it compares to native bindings:
-| Metric | Native (@kreuzberg/node) | WASM (@kreuzberg/wasm) | Notes |
-|--------|---|---|---|
-| **PDF extraction** | 100ms (baseline) | 120-170ms (60-80%) | WASM slower due to JS/WASM boundary calls |
-| **OCR processing** | ~500ms | ~600-700ms (60-80%) | Performance gap increases with image size |
-| **Table extraction** | 50ms | 70-90ms (60-80%) | Consistent overhead from WASM compilation |
-| **Bundle size** | N/A (native) | <2MB gzip | WASM compresses extremely well |
-| **Runtime flexibility** | Node.js/Bun only | Browsers/Edge/Deno | Different use cases, not directly comparable |
-### When to Use WASM vs Native
-**Use WASM (@kreuzberg/wasm) when:**
-- Building browser applications (no choice, WASM required)
-- Targeting Cloudflare Workers or edge runtimes
-- Supporting Deno applications
-- You don't have a native build toolchain available
-- Portability across runtimes is critical
-**Use Native (@kreuzberg/node) when:**
-- Building Node.js or Bun applications (2-3x faster)
-- Performance is your primary concern
-- You're processing large volumes of documents
-- You have native build tools available
-### Performance Tips for WASM
-1. **Enable multi-threading** with `initThreadPool()` for better CPU utilization
-2. **Batch operations** with `batchExtractBytes()` to amortize WASM boundary overhead
-3. **Cache WASM module** by loading it once per application
-4. **Preload OCR models** by calling extraction with OCR enabled early
-## Examples
+  const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
-Kreuzberg WASM includes complete working examples for different environments:
-- **[Deno](../../examples/wasm-deno)** - Server-side document extraction with Deno runtime. Demonstrates basic extraction, batch processing, and OCR capabilities.
-- **[Cloudflare Workers](../../examples/wasm-cloudflare-workers)** - Serverless API for document processing on the edge. Includes file upload endpoint, error handling, and production-ready configuration.
-- **[Browser](../../examples/wasm-browser)** - Interactive web application with drag-and-drop file upload, progress tracking, and multi-threaded extraction using Vite.
-See the [examples documentation](../../examples/wasm/README.md) for a comprehensive overview and comparison of all examples.
-## Multi-Threading with wasm-bindgen-rayon
-Kreuzberg WASM leverages [wasm-bindgen-rayon](https://docs.rs/wasm-bindgen-rayon/latest/wasm_bindgen_rayon/) to enable multi-threaded document processing in browsers and server environments with SharedArrayBuffer support.
-### Initializing the Thread Pool
-To unlock multi-threaded performance, initialize the thread pool with the available CPU cores:
-```typescript
-import { initThreadPool } from '@kreuzberg/wasm';
-// Initialize thread pool for multi-threaded extraction
-await initThreadPool(navigator.hardwareConcurrency);
-// Now extractions will use multiple threads for better performance
-const result = await extractBytes(pdfBytes, 'application/pdf');
-```
-### Required HTTP Headers for SharedArrayBuffer
-Multi-threading requires specific HTTP headers to enable SharedArrayBuffer in browsers:
-**Important:** These headers are required for the thread pool to function. Without them, the library will fall back to single-threaded processing.
-Set these headers in your server configuration:
-```
-Cross-Origin-Opener-Policy: same-origin
-Cross-Origin-Embedder-Policy: require-corp
-```
-#### Server Configuration Examples
-**Express.js:**
-```javascript
-app.use((req, res, next) => {
-  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
-  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
-  next();
-});
-```
-**Nginx:**
-```nginx
-add_header 'Cross-Origin-Opener-Policy' 'same-origin';
-add_header 'Cross-Origin-Embedder-Policy' 'require-corp';
-```
+  const result = await extractBytes(bytes, "image/png", {
+    ocr: {
+      backend: "tesseract-wasm",
+      language: "eng",
+    },
+  });
-**Apache:**
-```apache
-Header set Cross-Origin-Opener-Policy "same-origin"
-Header set Cross-Origin-Embedder-Policy "require-corp"
-```
+  console.log("Extracted text:");
+  console.log(result.content);
+}
-**Cloudflare Workers:**
-```javascript
-export default {
-  async fetch(request: Request): Promise<Response> {
-    const response = new Response(body);
-    response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
-    response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
-    return response;
-  }
-};
+extractWithOcr().catch(console.error);
 ```
-### Browser Compatibility
-Multi-threading with SharedArrayBuffer is available in:
+#### Table Extraction
-- **Chrome/Edge**: 74+
-- **Firefox**: 79+
-- **Safari**: 15.2+
-- **Opera**: 60+
+See [Table Extraction Guide](https://kreuzberg.dev/features/table-extraction/) for detailed examples.
-In unsupported browsers or when headers are not set, the library automatically degrades to single-threaded mode.
+#### Processing Multiple Files
-### Graceful Degradation
+```ts
+import { extractBytes, initWasm } from "@kreuzberg/wasm";
-The library handles thread pool initialization gracefully. If initialization fails or is unavailable:
-```typescript
-import { initThreadPool } from '@kreuzberg/wasm';
-try {
-  await initThreadPool(navigator.hardwareConcurrency);
-  console.log('Multi-threading enabled');
-} catch (error) {
-  // Fall back to single-threaded processing
-  console.warn('Multi-threading unavailable:', error);
-  console.log('Using single-threaded extraction');
+interface DocumentJob {
+  name: string;
+  bytes: Uint8Array;
+  mimeType: string;
 }
-// Extraction will work in both cases
-const result = await extractBytes(pdfBytes, 'application/pdf');
+async function processBatch(documents: DocumentJob[], concurrency: number = 3) {
+  await initWasm();
+  const results: Record<string, string> = {};
+  const queue = [...documents];
+  const workers = Array(concurrency)
+    .fill(null)
+    .map(async () => {
+      while (queue.length > 0) {
+        const doc = queue.shift();
+        if (!doc) break;
+        try {
+          const result = await extractBytes(doc.bytes, doc.mimeType);
+          results[doc.name] = result.content;
+        } catch (error) {
+          console.error(`Failed to process ${doc.name}:`, error);
+        }
+      }
+    });
+  await Promise.all(workers);
+  return results;
+}
 ```
-### Complete Example with Thread Pool
-```typescript
-import { initWasm, initThreadPool, extractBytes } from '@kreuzberg/wasm';
+#### Async Processing
-async function initializeKreuzbergWithThreading() {
-  try {
-    // Initialize WASM module
-    await initWasm();
+For non-blocking document processing:
-    // Initialize multi-threading
-    const cpuCount = navigator.hardwareConcurrency || 1;
-    try {
-      await initThreadPool(cpuCount);
-      console.log(`Thread pool initialized with ${cpuCount} workers`);
-    } catch (error) {
-      console.warn('Could not initialize thread pool, using single-threaded mode');
-    }
+```ts
+import { extractBytes, initWasm, getWasmCapabilities } from "@kreuzberg/wasm";
-  } catch (error) {
-    console.error('Failed to initialize Kreuzberg:', error);
+async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
+  const caps = getWasmCapabilities();
+  if (!caps.hasWasm) {
+    throw new Error("WebAssembly not supported");
   }
-}
-async function extractDocument(file: File) {
-  const bytes = new Uint8Array(await file.arrayBuffer());
+  await initWasm();
-  // Extraction will use multiple threads if available
-  const result = await extractBytes(bytes, file.type, {
-    extract_tables: true,
-    extract_images: true
-  });
+  const results = await Promise.all(
+    files.map((bytes, index) => extractBytes(bytes, mimeTypes[index]))
+  );
-  return result;
+  return results.map((r) => ({
+    content: r.content,
+    pageCount: r.metadata?.pageCount,
+  }));
 }
-// Initialize once at app startup
-await initializeKreuzbergWithThreading();
+const fileBytes = [new Uint8Array([1, 2, 3])];
+const mimes = ["application/pdf"];
-// Later, handle file uploads
-fileInput.addEventListener('change', async (e) => {
-  const file = e.target.files?.[0];
-  if (file) {
-    const result = await extractDocument(file);
-    console.log('Extracted text:', result.content);
-  }
-});
+extractDocuments(fileBytes, mimes)
+  .then((results) => console.log(results))
+  .catch(console.error);
 ```
-### Performance Considerations
+#### Worker Pool Usage
-- **Thread Pool Size**: Generally, using `navigator.hardwareConcurrency` is optimal. For servers, use the number of available CPU cores.
-- **Memory Usage**: Each thread has its own memory context. Large documents may require significant heap space.
-- **Network Requests**: Training data and models are cached locally, so subsequent extractions are faster.
+When Web Workers are available, use worker threads for parallel document processing without blocking the main thread:
-## OCR Support
+```typescript
+import { extractBytes, initWasm, hasWorkers, hasModuleWorkers } from '@kreuzberg/wasm';
-The WASM binding integrates [tesseract-wasm](https://github.com/robertknight/tesseract-wasm) for OCR support with 40+ languages.
+class DocumentWorkerPool {
+  private workers: Worker[] = [];
+  private taskQueue: Array<{ id: number; data: Uint8Array; mimeType: string; resolve: Function; reject: Function }> = [];
+  private currentTaskId = 0;
-### Basic OCR
+  constructor(workerCount: number = navigator.hardwareConcurrency || 4) {
+    // Module workers allow importing ES modules, standard workers are more compatible
+    const useModuleWorkers = hasModuleWorkers();
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
-const imageBytes = await fetch('./scan.jpg').then(r => r.arrayBuffer());
-const result = await extractBytes(
-  new Uint8Array(imageBytes),
-  'image/jpeg',
-  {
-    enable_ocr: true,
-    ocr_config: {
-      languages: ['eng'],  // English
-      backend: 'tesseract-wasm'
+    for (let i = 0; i < workerCount; i++) {
+      const worker = useModuleWorkers
+        ? new Worker(new URL('./extraction-worker.ts', import.meta.url), { type: 'module' })
+        : new Worker(new URL('./extraction-worker.js', import.meta.url));
+      worker.onmessage = (event) => this.handleWorkerMessage(event.data);
+      worker.onerror = (error) => this.handleWorkerError(error);
+      this.workers.push(worker);
     }
   }
-);
-console.log('OCR text:', result.content);
-```
-### Multi-Language OCR
-```typescript
-const result = await extractBytes(imageBytes, 'image/png', {
-  enable_ocr: true,
-  ocr_config: {
-    languages: ['eng', 'deu', 'fra'],  // English, German, French
-    backend: 'tesseract-wasm'
+  async extract(data: Uint8Array, mimeType: string): Promise<string> {
+    return new Promise((resolve, reject) => {
+      this.taskQueue.push({
+        id: this.currentTaskId++,
+        data,
+        mimeType,
+        resolve,
+        reject
+      });
+      this.processQueue();
+    });
   }
-});
-```
-### Supported Languages
-`eng`, `deu`, `fra`, `spa`, `ita`, `por`, `nld`, `pol`, `rus`, `jpn`, `chi_sim`, `chi_tra`, `kor`, `ara`, `hin`, `tha`, `vie`, and 25+ more.
-Training data is automatically loaded from jsDelivr CDN:
-```
-https://cdn.jsdelivr.net/npm/tesseract-wasm@0.11.0/dist/{lang}.traineddata
-```
-## Configuration
-### Extract Tables
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
-const result = await extractBytes(pdfBytes, 'application/pdf', {
-  extract_tables: true
-});
-if (result.tables) {
-  for (const table of result.tables) {
-    console.log('Table as Markdown:');
-    console.log(table.markdown);
-    console.log('Table cells:');
-    console.log(JSON.stringify(table.cells, null, 2));
+  private processQueue(): void {
+    while (this.taskQueue.length > 0) {
+      const task = this.taskQueue.shift();
+      if (task) {
+        const worker = this.workers[task.id % this.workers.length];
+        worker.postMessage({ id: task.id, data: task.data, mimeType: task.mimeType });
+      }
+    }
   }
-}
-```
-### Extract Images
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
-const result = await extractBytes(pdfBytes, 'application/pdf', {
-  extract_images: true,
-  image_config: {
-    target_dpi: 300,
-    max_image_dimension: 4096
+  private handleWorkerMessage(data: { id: number; result: string }): void {
+    const task = this.taskQueue.find(t => t.id === data.id);
+    if (task) {
+      task.resolve(data.result);
+      this.processQueue();
+    }
   }
-});
-if (result.images) {
-  for (const image of result.images) {
-    console.log(`Image ${image.index}: ${image.format}`);
-    // image.data is a Uint8Array
+  private handleWorkerError(error: ErrorEvent): void {
+    console.error('Worker error:', error.message);
   }
-}
-```
-### Text Chunking
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
-const result = await extractBytes(pdfBytes, 'application/pdf', {
-  enable_chunking: true,
-  chunking_config: {
-    max_chars: 1000,
-    max_overlap: 200
+  terminate(): void {
+    this.workers.forEach(w => w.terminate());
   }
-});
+}
-if (result.chunks) {
-  for (const chunk of result.chunks) {
-    console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
+// Usage
+async function processDocumentsInParallel() {
+  if (!hasWorkers()) {
+    console.log('Web Workers not available, falling back to main thread');
+    return;
   }
-}
-```
-### Language Detection
+  await initWasm();
+  const pool = new DocumentWorkerPool(4);
-```typescript
-import { extractBytes } from '@kreuzberg/wasm';
+  const documents = [
+    { data: new Uint8Array([...]), mimeType: 'application/pdf' },
+    { data: new Uint8Array([...]), mimeType: 'application/pdf' },
+  ];
-const result = await extractBytes(pdfBytes, 'application/pdf', {
-  enable_language_detection: true
-});
+  const results = await Promise.all(
+    documents.map(doc => pool.extract(doc.data, doc.mimeType))
+  );
-if (result.language) {
-  console.log(`Detected language: ${result.language.code}`);
-  console.log(`Confidence: ${result.language.confidence}`);
+  pool.terminate();
+  return results;
 }
 ```
-### Complete Configuration Example
+Worker code (`extraction-worker.ts`):
 ```typescript
-import {
-  extractBytes,
-  type ExtractionConfig
-} from '@kreuzberg/wasm';
-const config: ExtractionConfig = {
-  extract_tables: true,
-  extract_images: true,
-  extract_metadata: true,
-  enable_ocr: true,
-  ocr_config: {
-    languages: ['eng'],
-    backend: 'tesseract-wasm',
-    dpi: 300,
-    preprocessing: {
-      deskew: true,
-      denoise: true,
-      binarize: true
-    }
-  },
-  enable_chunking: true,
-  chunking_config: {
-    max_chars: 1000,
-    max_overlap: 200
-  },
+import { extractBytes, initWasm } from '@kreuzberg/wasm';
-  enable_language_detection: true,
+let wasmInitialized = false;
-  enable_quality: true,
+self.onmessage = async (event) => {
+  if (!wasmInitialized) {
+    await initWasm();
+    wasmInitialized = true;
+  }
-  extract_keywords: true,
-  keywords_config: {
-    max_keywords: 10,
-    method: 'yake'
+  const { id, data, mimeType } = event.data;
+  try {
+    const result = await extractBytes(new Uint8Array(data), mimeType);
+    self.postMessage({ id, result: result.content });
+  } catch (error) {
+    self.postMessage({ id, error: (error as Error).message });
   }
 };
-const result = await extractBytes(data, mimeType, config);
 ```
-## Advanced Usage
+### Memory Management
-### Batch Processing
+WASM memory is managed by the JavaScript garbage collector:
 ```typescript
-import { batchExtractFiles, batchExtractBytes } from '@kreuzberg/wasm';
+import { initWasm, extractBytes } from '@kreuzberg/wasm';
-// Browser: Process multiple files
-const fileInput = document.querySelector<HTMLInputElement>('#files');
-const files = Array.from(fileInput.files);
+async function extractWithMemoryAwareness() {
+  await initWasm();
-const results = await batchExtractFiles(files, {
-  extract_tables: true
-});
+  // Process documents one at a time to control memory usage
+  const documents = [/* ... */];
-for (const result of results) {
-  console.log(`${result.mime_type}: ${result.content.length} characters`);
-}
+  for (const doc of documents) {
+    const result = await extractBytes(doc, 'application/pdf');
-// Or from Uint8Arrays
-const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
-const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
+    // Process result immediately
+    console.log(result.content);
-const results = await batchExtractBytes(dataList, mimeTypes);
-```
+    // Result will be garbage collected when no longer referenced
+    // Explicitly clear large objects if needed
+    // gc(); // Requires --expose-gc flag
+  }
+}
-### Synchronous Extraction
+// Check available memory (browser only)
+if (performance.memory) {
+  console.log('Memory usage:', {
+    usedJSHeapSize: performance.memory.usedJSHeapSize,
+    totalJSHeapSize: performance.memory.totalJSHeapSize,
+    jsHeapSizeLimit: performance.memory.jsHeapSizeLimit
+  });
+}
+```
-```typescript
-import { extractBytesSync, batchExtractBytesSync } from '@kreuzberg/wasm';
+### Next Steps
-// Synchronous single extraction
-const result = extractBytesSync(data, 'application/pdf', config);
+- **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
+- **[API Documentation](https://kreuzberg.dev/api/)** - Complete API reference
+- **[Examples & Guides](https://kreuzberg.dev/guides/)** - Full code examples and usage guides
+- **[Configuration Guide](https://kreuzberg.dev/configuration/)** - Advanced configuration options
+- **[Troubleshooting](https://kreuzberg.dev/troubleshooting/)** - Common issues and solutions
-// Synchronous batch extraction
-const results = batchExtractBytesSync(dataList, mimeTypes, config);
-```
+## WASM-Specific Implementation Details
-### Plugin System
+### Initialization
-#### Custom Post-Processors
+WASM binaries must be loaded before extraction:
 ```typescript
-import { registerPostProcessor } from '@kreuzberg/wasm';
-registerPostProcessor({
-  name: 'uppercase',
-  async process(result) {
-    return {
-      ...result,
-      content: result.content.toUpperCase()
-    };
-  }
-});
+import { initWasm } from '@kreuzberg/wasm';
-// Now all extractions will use this processor
-const result = await extractBytes(data, mimeType);
-console.log(result.content); // UPPERCASE TEXT
+// Initialize once at application startup
+await initWasm();
+// Now extraction functions can be used
 ```
-#### Custom Validators
+The init function:
+- Downloads and instantiates the WASM binary
+- Initializes the module's linear memory
+- Prepares thread pools if available
+- Throws if WASM is not supported in the environment
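Because initialization should happen once at startup, callers that cannot guarantee ordering may want to guard it. The following is a minimal sketch, assuming only that the initializer is an async function; `once`, `fakeInit`, and `ensureReady` are hypothetical names used for illustration and are not part of `@kreuzberg/wasm`.

```typescript
// Hypothetical helper: wraps an async initializer so it runs at most once.
// Concurrent callers share the same in-flight promise.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = init().catch((error: unknown) => {
        // Forget the failed attempt so a later call can retry.
        pending = undefined;
        throw error;
      });
    }
    return pending;
  };
}

// Stand-in initializer for demonstration; in an application this would be
// the library's own init function (for example, `once(initWasm)`).
let initCalls = 0;
const fakeInit = async (): Promise<void> => {
  initCalls += 1;
};

const ensureReady = once(fakeInit);
const firstCall = ensureReady();
const secondCall = ensureReady();
```

Both calls return the same promise, so the initializer body runs a single time even when several extraction paths request it concurrently.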
-```typescript
-import { registerValidator } from '@kreuzberg/wasm';
+### Threading Model
-registerValidator({
-  name: 'min-length',
-  async validate(result) {
-    if (result.content.length < 100) {
-      throw new Error('Document too short');
-    }
-  }
-});
-```
+- Single-threaded by default (main thread execution)
+- Web Workers optional for background processing
+- Shared memory (SharedArrayBuffer) not required
+- Message passing used for worker communication
+- Extraction does not block the main thread when it runs in a worker pool
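The message-passing model above can be sketched from the main-thread side as a small request tracker. This is an illustrative sketch, not the library's API: `ExtractionClient` is a hypothetical class, the reply shape (`id` plus `result` or `error`) mirrors the worker example earlier in this README, and the request shape (`id`, `data`, `mimeType`) is an assumption.

```typescript
// Assumed message shapes; adjust to match the actual worker script.
type WorkerRequest = { id: number; data: ArrayBuffer; mimeType: string };
type WorkerReply = { id: number; result?: string; error?: string };

type PendingEntry = {
  resolve: (text: string) => void;
  reject: (error: Error) => void;
};

// Hypothetical helper: correlates worker replies with requests by id.
class ExtractionClient {
  private nextId = 0;
  private readonly pending = new Map<number, PendingEntry>();
  private readonly post: (message: WorkerRequest) => void;

  constructor(post: (message: WorkerRequest) => void) {
    this.post = post;
  }

  // Sends one request and returns a promise settled by the matching reply.
  request(data: ArrayBuffer, mimeType: string): Promise<string> {
    const id = this.nextId++;
    return new Promise<string>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.post({ id, data, mimeType });
    });
  }

  // Call this from the worker's message handler.
  handleReply(reply: WorkerReply): void {
    const entry = this.pending.get(reply.id);
    if (entry === undefined) return; // unknown or already settled id
    this.pending.delete(reply.id);
    if (reply.error !== undefined) {
      entry.reject(new Error(reply.error));
    } else {
      entry.resolve(reply.result ?? "");
    }
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

In a browser this would typically be wired up with `new ExtractionClient((m) => worker.postMessage(m))` and `worker.onmessage = (e) => client.handleReply(e.data)`. Keying replies by `id` lets several extractions be in flight at once without relying on reply order.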
-#### Custom OCR Backends
+### Memory Considerations
-```typescript
-import { registerOcrBackend } from '@kreuzberg/wasm';
-registerOcrBackend({
-  name: 'custom-ocr',
-  supportedLanguages() {
-    return ['eng', 'fra'];
-  },
-  async initialize() {
-    // Initialize your OCR backend
-  },
-  async processImage(imageBytes, language) {
-    // Process image and return result
-    return {
-      content: 'extracted text',
-      mime_type: 'text/plain',
-      metadata: {},
-      tables: []
-    };
-  },
-  async shutdown() {
-    // Cleanup
-  }
-});
-```
+- Each WASM instance has its own linear memory, capped at a 4 GB (32-bit) address space
+- Large documents (> 100 MB) may not fit in WASM memory
+- Binary data is copied between JavaScript and WASM boundaries
+- Garbage collection is handled by the JavaScript runtime
+- No manual memory management required
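Given these constraints, it can help to check a document's size before handing it to the extractor. The sketch below assumes the 100 MB figure from the list above as a soft threshold; `MAX_DOCUMENT_BYTES` and `checkDocumentSize` are hypothetical names, and the threshold is not a limit enforced by the library.

```typescript
// Soft threshold taken from the guidance above (assumption, not an API limit).
const MAX_DOCUMENT_BYTES = 100 * 1024 * 1024;

interface SizeCheck {
  ok: boolean;
  sizeMb: number;
}

// Hypothetical helper: reports whether a buffer is under the given threshold.
function checkDocumentSize(
  bytes: Uint8Array,
  limit: number = MAX_DOCUMENT_BYTES
): SizeCheck {
  const sizeMb = bytes.byteLength / (1024 * 1024);
  return { ok: bytes.byteLength <= limit, sizeMb };
}
```

Because input bytes are copied across the JavaScript/WASM boundary, peak usage during extraction is at least twice the document size, which is worth keeping in mind when choosing a threshold.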
-### MIME Type Detection
+### Supported Extraction Targets
-```typescript
-import {
-  detectMimeFromBytes,
-  getMimeFromExtension,
-  getExtensionsForMime,
-  normalizeMimeType
-} from '@kreuzberg/wasm';
+Different file formats have varying support in WASM:
-// Auto-detect MIME type from file bytes
-const mimeType = detectMimeFromBytes(fileBytes);
+| Format | Support | Notes |
+|--------|---------|-------|
+| PDF | Full | Text, images, metadata extraction |
+| Office (DOCX, XLSX, PPTX) | Full | All features supported |
+| Images (PNG, JPG, etc) | Full | EXIF metadata extraction |
+| Archives (ZIP, TAR) | Full | Listing and extraction |
+| OCR | Limited | Tesseract WASM only, main thread only |
+| Embeddings | Not Available | WASM has no ML model support |
-// Get MIME type from file extension
-const mime = getMimeFromExtension('pdf'); // 'application/pdf'
+### Platform Limitations
-// Get extensions for MIME type
-const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
+**LibreOffice-Dependent Formats Not Available**
-// Normalize MIME type
-const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
-```
+WASM cannot load native LibreOffice binaries, so older Office formats are **not supported**:
-### Configuration Loading
+- ❌ **DOC** (Microsoft Word 97-2003) - Use DOCX instead
+- ❌ **XLS** (Microsoft Excel 97-2003) - Use XLSX instead
+- ❌ **PPT** (Microsoft PowerPoint 97-2003) - Use PPTX instead
+- ❌ **RTF** (Rich Text Format with complex features)
+- ❌ **ODT/ODS/ODP** (LibreOffice/OpenOffice formats)
-```typescript
-import { loadConfigFromString } from '@kreuzberg/wasm';
-// Load from YAML
-const yamlConfig = `
-extract_tables: true
-enable_ocr: true
-ocr_config:
-  languages: [eng, deu]
-`;
-const config = loadConfigFromString(yamlConfig, 'yaml');
-// Load from JSON
-const jsonConfig = '{"extract_tables":true}';
-const config2 = loadConfigFromString(jsonConfig, 'json');
-// Load from TOML
-const tomlConfig = 'extract_tables = true';
-const config3 = loadConfigFromString(tomlConfig, 'toml');
-```
+Modern Office formats (DOCX, XLSX, PPTX) are fully supported and don't require LibreOffice.
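An application can detect these legacy formats up front and tell the user what to convert to, rather than waiting for extraction to fail. This is a minimal sketch based only on the list above; `LEGACY_FORMATS` and `checkLegacyFormat` are hypothetical names and not part of `@kreuzberg/wasm`.

```typescript
// Legacy extensions from the list above, mapped to the modern format the
// README recommends (null where no replacement is named). Illustrative only.
const LEGACY_FORMATS = new Map<string, string | null>([
  ["doc", "docx"],
  ["xls", "xlsx"],
  ["ppt", "pptx"],
  ["rtf", null],
  ["odt", null],
  ["ods", null],
  ["odp", null],
]);

interface LegacyCheck {
  legacy: boolean;
  convertTo: string | null;
}

// Hypothetical helper: classifies a file name by its extension.
function checkLegacyFormat(fileName: string): LegacyCheck {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (!LEGACY_FORMATS.has(ext)) {
    return { legacy: false, convertTo: null };
  }
  return { legacy: true, convertTo: LEGACY_FORMATS.get(ext) ?? null };
}
```

For example, `checkLegacyFormat("report.DOC")` flags the file as legacy and suggests `docx`, while `checkLegacyFormat("slides.pptx")` does not flag it.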
-## API Reference
+**Polars Integration Not Available**
-### Extraction Functions
+- ❌ Polars DataFrame extraction/conversion not available in WASM
+- ❌ Structured data operations limited compared to Node.js binding
-#### `extractFile(file: File, mimeType?: string, config?: ExtractionConfig): Promise<ExtractionResult>`
-Extract content from a browser `File` object.
+**Alternative: Use Node.js Binding**
-#### `extractBytes(data: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
-Asynchronously extract content from a `Uint8Array`.
+If you need support for older Office formats or Polars integration, use the `@kreuzberg/node` package instead:
-#### `extractBytesSync(data: Uint8Array, mimeType: string, config?: ExtractionConfig): ExtractionResult`
-Synchronously extract content from a `Uint8Array`.
+```bash
+npm install @kreuzberg/node
+```
-#### `batchExtractFiles(files: File[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
-Extract multiple files in parallel.
+The Node.js binding provides:
+- ✅ Full LibreOffice format support (DOC, XLS, PPT, RTF, ODT)
+- ✅ Polars DataFrame integration
+- ✅ All OCR backends (Tesseract, EasyOCR, PaddleOCR)
+- ✅ Full embedding model support
-#### `batchExtractBytes(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
-Extract multiple byte arrays in parallel.
+**Format Comparison Table**
-#### `batchExtractBytesSync(dataList: Uint8Array[], mimeTypes: string[], config?: ExtractionConfig): ExtractionResult[]`
-Extract multiple byte arrays synchronously.
+| Format Type | WASM Support | Node.js Support |
+|-------------|--------------|-----------------|
+| Modern Office (DOCX/XLSX/PPTX) | ✅ Full | ✅ Full |
+| Legacy Office (DOC/XLS/PPT) | ❌ Not Available | ✅ Requires LibreOffice |
+| OpenOffice (ODT/ODS/ODP) | ❌ Not Available | ✅ Requires LibreOffice |
+| PDF | ✅ Full | ✅ Full |
+| Images | ✅ Full | ✅ Full |
+| Embeddings | ❌ Not Available | ✅ With ONNX Runtime |
+| Polars | ❌ Not Available | ✅ Available |
-### Plugin Management
+### Sandbox Security
-#### Post-Processors
+- WASM code runs in a sandbox with restricted capabilities
+- File system access requires user interaction (File API)
+- Network access follows CORS restrictions
+- No access to Node.js native modules
+- Content Security Policy (CSP) may restrict WASM loading
-```typescript
-registerPostProcessor(processor: PostProcessorProtocol): void
-unregisterPostProcessor(name: string): void
-clearPostProcessors(): void
-listPostProcessors(): string[]
-```
+## Features
-#### Validators
+### Supported File Formats (56+)
-```typescript
-registerValidator(validator: ValidatorProtocol): void
-unregisterValidator(name: string): void
-clearValidators(): void
-listValidators(): string[]
-```
+Kreuzberg supports 56 file formats across 8 major categories, with intelligent format detection and comprehensive metadata extraction.
-#### OCR Backends
+#### Office Documents
-```typescript
-registerOcrBackend(backend: OcrBackendProtocol): void
-unregisterOcrBackend(name: string): void
-clearOcrBackends(): void
-listOcrBackends(): string[]
-```
+| Category | Formats | Capabilities |
+|----------|---------|--------------|
+| **Word Processing** | `.docx`, `.odt` | Full text, tables, images, metadata, styles |
+| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` | Sheet data, formulas, cell metadata, charts |
+| **Presentations** | `.pptx`, `.ppt`, `.ppsx` | Slides, speaker notes, images, metadata |
+| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
+| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
-### Document Extractors
+#### Images (OCR-Enabled)
-```typescript
-listDocumentExtractors(): string[]
-unregisterDocumentExtractor(name: string): void
-clearDocumentExtractors(): void
-```
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
+| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR, table detection, format-specific metadata |
+| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
-### MIME Utilities
+#### Web & Data
-```typescript
-detectMimeFromBytes(data: Uint8Array): string
-getMimeFromExtension(ext: string): string | null
-getExtensionsForMime(mime: string): string[]
-normalizeMimeType(mime: string): string
-```
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
+| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
+| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
-### Configuration
+#### Email & Archives
-```typescript
-loadConfigFromString(content: string, format: 'yaml' | 'toml' | 'json'): ExtractionConfig
-```
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
+| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
-### Embeddings
+#### Academic & Scientific
-```typescript
-listEmbeddingPresets(): string[]
-getEmbeddingPreset(name: string): EmbeddingPreset | null
-```
+| Category | Formats | Features |
+|----------|---------|----------|
+| **Citations** | `.bib`, `.biblatex`, `.ris`, `.enw`, `.csl` | Bibliography parsing, citation extraction |
+| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
+| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
-## Types
+**[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
-All types are shared via the `@kreuzberg/core` package:
+### Key Capabilities
-```typescript
-import type {
-  ExtractionResult,
-  ExtractionConfig,
-  OcrConfig,
-  ChunkingConfig,
-  ImageConfig,
-  KeywordsConfig,
-  Table,
-  ExtractedImage,
-  Chunk,
-  Metadata,
-  PostProcessorProtocol,
-  ValidatorProtocol,
-  OcrBackendProtocol
-} from '@kreuzberg/core';
-```
+- **Text Extraction** - Extract all text content with position and formatting information
-### ExtractionResult
-Main result object containing:
-- `content: string` - Extracted text content
-- `mime_type: string` - MIME type of the document
-- `metadata?: Metadata` - Document metadata
-- `tables?: Table[]` - Extracted tables
-- `images?: ExtractedImage[]` - Extracted images
-- `chunks?: Chunk[]` - Text chunks (if chunking enabled)
-- `language?: LanguageInfo` - Detected language (if enabled)
-- `keywords?: Keyword[]` - Extracted keywords (if enabled)
-### ExtractionConfig
-Configuration object for extraction:
-- `extract_tables?: boolean` - Extract tables as structured data
-- `extract_images?: boolean` - Extract embedded images
-- `extract_metadata?: boolean` - Extract document metadata
-- `enable_ocr?: boolean` - Enable OCR for images
-- `ocr_config?: OcrConfig` - OCR settings
-- `enable_chunking?: boolean` - Split text into semantic chunks
-- `chunking_config?: ChunkingConfig` - Text chunking settings
-- `enable_language_detection?: boolean` - Detect document language
-- `enable_quality?: boolean` - Encoding detection, normalization
-- `extract_keywords?: boolean` - Extract important keywords
-- `keywords_config?: KeywordsConfig` - Keyword extraction settings
-### Table
-Extracted table structure:
-- `markdown: string` - Table in Markdown format
-- `cells: TableCell[][]` - 2D array of table cells
-- `row_count: number` - Number of rows
-- `column_count: number` - Number of columns
-## Supported Formats
-| Category | Formats |
-|----------|---------|
-| **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
-| **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
-| **Web** | HTML, XHTML, XML, EPUB |
-| **Text** | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TYP, FB2 |
-| **Email** | EML, MSG |
-| **Archives** | ZIP, TAR, 7Z |
-| **Other** | And 30+ more formats |
-## Build from Source
-### Prerequisites
-- Rust 1.75+ with `wasm32-unknown-unknown` target
-- Node.js 18+ with pnpm
-- wasm-pack
+- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
-```bash
-# Install Rust target
-rustup target add wasm32-unknown-unknown
+- **Table Extraction** - Parse tables with structure and cell content preservation
-# Install wasm-pack
-curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
+- **Image Extraction** - Extract embedded images and render page previews
-# Build WASM package
-cd crates/kreuzberg-wasm
-pnpm install
-pnpm run build
+- **OCR Support** - Integrate multiple OCR backends for scanned documents
-# Run tests
-pnpm test
-```
+- **Async/Await** - Non-blocking document processing with concurrent operations
-### Build Targets
+- **Plugin System** - Extensible post-processing for custom text transformation
-```bash
-# For browsers (ESM modules)
-pnpm run build:wasm:web
+- **Batch Processing** - Efficiently process multiple documents in parallel
-# For bundlers (webpack, rollup, vite)
-pnpm run build:wasm:bundler
+- **Memory Efficient** - Stream large files without loading entirely into memory
-# For Node.js
-pnpm run build:wasm:nodejs
+- **Language Detection** - Detect and support multiple languages in documents
-# For Deno
-pnpm run build:wasm:deno
+- **Configuration** - Fine-grained control over extraction behavior
-# Build all targets
-pnpm run build:all
-```
+### Performance Characteristics
-## Limitations
+| Format | Speed | Memory | Notes |
+|--------|-------|--------|-------|
+| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
+| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
+| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
+| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
+| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
-### No File System Access
+## OCR Support
-The WASM binding cannot access the file system directly. Use file readers:
+In WASM, Kreuzberg extracts text from scanned documents and images through a single OCR backend:
-```typescript
-// ❌ Won't work
-await extractFileSync('./document.pdf');  // Throws error
+- **Tesseract-Wasm**
-// ✅ Use file readers instead
-const bytes = await Deno.readFile('./document.pdf');  // Deno
-const bytes = await fs.readFile('./document.pdf');    // Node.js
-const bytes = await file.arrayBuffer();                // Browser
-```
+### OCR Configuration Example
-### OCR Training Data
+```ts
+import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";
-Tesseract training data (`.traineddata` files) are loaded from jsDelivr CDN on first use. For offline usage or custom CDN, see the [OCR documentation](https://kreuzberg.dev).
+async function extractWithOcr() {
+  await initWasm();
-### Size Constraints
+  try {
+    await enableOcr();
+    console.log("OCR enabled successfully");
+  } catch (error) {
+    console.error("Failed to enable OCR:", error);
+    return;
+  }
-Cloudflare Workers has a 10MB bundle size limit (compressed). The WASM binary is ~2MB compressed, leaving room for your application code.
+  const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));
-### HTML File Size Limits
+  const result = await extractBytes(bytes, "image/png", {
+    ocr: {
+      backend: "tesseract-wasm",
+      language: "eng",
+    },
+  });
-**WASM builds have a 2MB limit for HTML files** due to limited stack space. HTML files larger than 2MB will be rejected with a validation error to prevent stack overflow.
+  console.log("Extracted text:");
+  console.log(result.content);
+}
-```typescript
-// Files > 2MB will throw an error in WASM builds
-const largeHtml = new Uint8Array(3 * 1024 * 1024); // 3MB
-await extractBytes(largeHtml, 'text/html');
-// ❌ Throws: "HTML file size exceeds WASM limit of 2MB"
+extractWithOcr().catch(console.error);
 ```
-For large HTML files, use the native [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) binding which has no size limits.
+## Async Support
-### PDF Extraction in Non-Browser Environments
+This binding provides full async/await support for non-blocking document processing:
-PDF extraction requires PDFium, which is only available in browser environments. In Deno, Node.js, and Cloudflare Workers, PDF extraction will fail with an error.
+```ts
+import { extractBytes, initWasm, getWasmCapabilities } from "@kreuzberg/wasm";
-```typescript
-// ❌ Won't work in Deno/Node.js/Workers
-await extractBytes(pdfBytes, 'application/pdf');
-// Throws: "PDF extraction requires proper WASM module initialization"
-```
+async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
+  const caps = getWasmCapabilities();
+  if (!caps.hasWasm) {
+    throw new Error("WebAssembly not supported");
+  }
-**Solutions:**
-- **Browser**: PDF extraction works out of the box
-- **Deno/Node.js**: Use [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) with native PDFium bindings
-- **Cloudflare Workers**: PDF extraction is not currently supported
+  await initWasm();
-## Troubleshooting
+  const results = await Promise.all(
+    files.map((bytes, index) => extractBytes(bytes, mimeTypes[index]))
+  );
-### "WASM module failed to initialize"
+  return results.map((r) => ({
+    content: r.content,
+    pageCount: r.metadata?.pageCount,
+  }));
+}
-Ensure your bundler is configured to handle WASM files:
+const fileBytes = [new Uint8Array([1, 2, 3])];
+const mimes = ["application/pdf"];
-**Vite:**
-```typescript
-// vite.config.ts
-export default {
-  optimizeDeps: {
-    exclude: ['@kreuzberg/wasm']
-  }
-}
+extractDocuments(fileBytes, mimes)
+  .then((results) => console.log(results))
+  .catch(console.error);
 ```
-**Webpack:**
-```javascript
-// webpack.config.js
-module.exports = {
-  experiments: {
-    asyncWebAssembly: true
-  }
-}
-```
+## Plugin System
-### "Module not found: @kreuzberg/core"
+Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
-The @kreuzberg/core package is a peer dependency. Install it:
+For detailed plugin documentation, visit the [Plugin System Guide](https://kreuzberg.dev/plugins/).
-```bash
-pnpm add @kreuzberg/core
-```
+## Batch Processing
-### Memory Issues in Workers
+Process multiple documents efficiently:
-For large documents in Cloudflare Workers, process in smaller chunks:
+```ts
+import { extractBytes, initWasm } from "@kreuzberg/wasm";
-```typescript
-const result = await extractBytes(pdfBytes, 'application/pdf', {
-  chunking_config: { max_chars: 1000 }
-});
+interface DocumentJob {
+  name: string;
+  bytes: Uint8Array;
+  mimeType: string;
+}
+async function processBatch(documents: DocumentJob[], concurrency: number = 3) {
+  await initWasm();
+  const results: Record<string, string> = {};
+  const queue = [...documents];
+  const workers = Array(concurrency)
+    .fill(null)
+    .map(async () => {
+      while (queue.length > 0) {
+        const doc = queue.shift();
+        if (!doc) break;
+        try {
+          const result = await extractBytes(doc.bytes, doc.mimeType);
+          results[doc.name] = result.content;
+        } catch (error) {
+          console.error(`Failed to process ${doc.name}:`, error);
+        }
+      }
+    });
+  await Promise.all(workers);
+  return results;
+}
 ```
-### OCR Not Working
+## Configuration
-Check that tesseract-wasm is loading correctly. The training data is automatically fetched from CDN on first use.
+For advanced configuration options including language detection, table extraction, OCR settings, and more:
-## Examples
+**[Configuration Guide](https://kreuzberg.dev/configuration/)**
-See the [`examples/`](./examples/) directory for complete working examples:
+## Documentation
-- **Browser**: Vanilla JS file upload interface
-- **Deno**: Command-line document extraction
-- **Cloudflare Workers**: Document processing API
-- **Node.js**: Batch processing script
+- **[Official Documentation](https://kreuzberg.dev/)**
+- **[API Reference](https://kreuzberg.dev/reference/api-wasm/)**
+- **[Examples & Guides](https://kreuzberg.dev/guides/)**
-## Documentation
+## Troubleshooting
-For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
+For common issues and solutions, visit the [Troubleshooting Guide](https://kreuzberg.dev/troubleshooting/).
 ## Contributing
-We welcome contributions! Please see our [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/docs/contributing.md) for details.
+Contributions are welcome! See the [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
 ## License
-MIT
-## Links
-- [Website](https://kreuzberg.dev)
-- [Documentation](https://kreuzberg.dev)
-- [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
-- [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
-- [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
-- [npm Package](https://www.npmjs.com/package/@kreuzberg/wasm)
+MIT License - see the LICENSE file for details.
-## Related Packages
+## Support
-- [@kreuzberg/node](https://www.npmjs.com/package/@kreuzberg/node) - Native Node.js bindings (NAPI)
-- [@kreuzberg/core](https://www.npmjs.com/package/@kreuzberg/core) - Shared TypeScript types
-- [kreuzberg](https://crates.io/crates/kreuzberg) - Rust core library
+- **Discord Community**: [Join our Discord](https://discord.gg/pXxagNK2zN)
+- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
+- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)