@kreuzberg/wasm 4.6.2 → 4.7.0

package/README.md CHANGED
@@ -22,7 +22,7 @@
  <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
- <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.6.2" alt="Go">
+ <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v4.0.0" alt="Go">
  </a>
  <a href="https://www.nuget.org/packages/Kreuzberg/">
  <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
@@ -42,13 +42,16 @@
 
  <!-- Project Info -->
  <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
- <img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License">
+ <img src="https://img.shields.io/badge/License-MIT-007ec6" alt="License">
  </a>
  <a href="https://docs.kreuzberg.dev">
- <img src="https://img.shields.io/badge/docs-kreuzberg.dev-blue" alt="Documentation">
+ <img src="https://img.shields.io/badge/docs-kreuzberg.dev-007ec6" alt="Documentation">
+ </a>
+ <a href="https://docs.kreuzberg.dev/demo.html">
+ <img src="https://img.shields.io/badge/%E2%96%B6%EF%B8%8F_Live_Demo-007ec6" alt="Live Demo">
  </a>
  <a href="https://huggingface.co/Kreuzberg">
- <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow" alt="Hugging Face">
+ <img src="https://img.shields.io/badge/%F0%9F%A4%97_Hugging_Face-007ec6" alt="Hugging Face">
  </a>
  </div>
 
@@ -61,7 +64,7 @@
  </div>
 
 
- Extract text, tables, images, and metadata from 91+ file formats including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
+ Extract text, tables, images, and metadata from 91+ file formats and 248 programming languages including PDF, Office documents, and images. WebAssembly bindings for browsers, Deno, and Cloudflare Workers with portable deployment and multi-threading support.
 
 
  ## Installation
@@ -74,6 +77,7 @@ Install via one of the supported package managers:
 
 
  **npm:**
+
  ```bash
  npm install @kreuzberg/wasm
  ```
@@ -82,6 +86,7 @@ npm install @kreuzberg/wasm
 
 
  **pnpm:**
+
  ```bash
  pnpm add @kreuzberg/wasm
  ```
@@ -90,6 +95,7 @@ pnpm add @kreuzberg/wasm
 
 
  **yarn:**
+
  ```bash
  yarn add @kreuzberg/wasm
  ```
@@ -318,6 +324,19 @@ extractDocuments(fileBytes, mimes)
  | **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
  | **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
 
+ #### Code Intelligence (248 Languages)
+
+ | Feature | Description |
+ |---------|-------------|
+ | **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
+ | **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
+ | **Symbol Extraction** | Variables, constants, type aliases, properties |
+ | **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
+ | **Diagnostics** | Parse errors with line/column positions |
+ | **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
+
+ Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — [documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev).
+
  **[Complete Format Reference](https://kreuzberg.dev/reference/formats/)**
 
  ### Key Capabilities
@@ -337,6 +356,9 @@ extractDocuments(fileBytes, mimes)
  - **Batch Processing** - Efficiently process multiple documents in parallel
  - **Memory Efficient** - Stream large files without loading entirely into memory
  - **Language Detection** - Detect and support multiple languages in documents
+
+ - **Code Intelligence** - Extract structure, imports, exports, symbols, and docstrings from [248 programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter
+
  - **Configuration** - Fine-grained control over extraction behavior
 
  ### Performance Characteristics
@@ -488,30 +510,6 @@ For advanced configuration options including language detection, table extractio
 
  **[Configuration Guide](https://kreuzberg.dev/guides/configuration/)**
 
- ## Platform Limitations
-
- WASM runs in single-threaded environments without access to ONNX Runtime, which constrains some features:
-
- ### Unsupported Features
-
- - **Layout Detection** – Requires RT-DETR model inference via ONNX Runtime, which is unavailable in WebAssembly
- - **Hardware Acceleration** – No GPU support (AccelerationConfig is not applicable)
- - **Concurrency Configuration** – Single-threaded WASM environment (ConcurrencyConfig does not apply)
- - **Email Codepage Configuration** – EmailConfig is not supported in WASM
-
- ### Supported Features
-
- - **Text Extraction** – Full text content from all supported formats
- - **OCR via Tesseract WASM** – Scanned document and image OCR using browser-native Tesseract
- - **Embeddings** – FastEmbed-based vector generation
- - **Chunking** – Text segmentation for RAG pipelines
- - **Metadata Extraction** – Document properties, creation dates, page counts
- - **Table Extraction** – Structured table data from PDFs and spreadsheets
- - **Language Detection** – Identify document language
- - **Image Extraction** – Embedded images from documents
-
- All 91+ file formats supported by Kreuzberg are available in WASM, with the exception that features requiring ONNX Runtime (layout detection) will fail gracefully with an unsupported error.
-
  ## Documentation
 
  - **[Official Documentation](https://kreuzberg.dev/)**