@kreuzberg/node 4.0.0-rc.10

package/README.md ADDED
@@ -0,0 +1,705 @@
1
+ # Kreuzberg
2
+
3
+ [![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust)](https://crates.io/crates/kreuzberg)
4
+ [![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python)](https://pypi.org/project/kreuzberg/)
5
+ [![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript)](https://www.npmjs.com/package/@kreuzberg/node)
6
+ [![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM)](https://www.npmjs.com/package/@kreuzberg/wasm)
7
+ [![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby)](https://rubygems.org/gems/kreuzberg)
8
+ [![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
9
+ [![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg)
10
+ [![C#](https://img.shields.io/nuget/v/Goldziher.Kreuzberg?label=C%23)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
11
+
12
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
13
+ [![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
14
+ [![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
15
+
16
+ High-performance document intelligence for Node.js and TypeScript, powered by Rust.
17
+
18
+ Extract text, tables, images, and metadata from 56 file formats including PDF, DOCX, PPTX, XLSX, images, and more.
19
+
20
+ > **Recommended for Node.js and Bun** - Native NAPI-RS bindings provide the best performance (2-3x faster than WASM).
21
+ >
22
+ > For browser, Deno, or Cloudflare Workers, use [@kreuzberg/wasm](https://www.npmjs.com/package/@kreuzberg/wasm) instead.
23
+
24
+ > **Version 4.0.0 Release Candidate**
25
+ > This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
26
+
27
+ ## Features
28
+
29
+ - **56 File Formats**: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, XML, JSON, and more
30
+ - **OCR Support**: Built-in Tesseract, EasyOCR, and PaddleOCR backends for scanned documents
31
+ - **Table Extraction**: Advanced table detection and structured data extraction
32
+ - **Native Performance**: 2-3x faster than WASM; 10-50x faster than pure JavaScript
33
+ - **Zero-Copy Operations**: Direct system calls and minimal data copying
34
+ - **Type-Safe**: Full TypeScript definitions for all methods, configurations, and return types
35
+ - **Async/Sync APIs**: Both asynchronous and synchronous extraction methods
36
+ - **Batch Processing**: Process multiple documents in parallel with optimized concurrency
37
+ - **Language Detection**: Automatic language detection for extracted text
38
+ - **Text Chunking**: Split long documents into manageable chunks for LLM processing
39
+ - **Caching**: Built-in result caching for faster repeated extractions
40
+ - **Zero Configuration**: Works out of the box with sensible defaults
41
+
42
+ ## Why Use This Package?
43
+
44
+ Choose `@kreuzberg/node` if you're building with:
45
+
46
+ - **Node.js 18+** - Native bindings provide direct access to system resources
47
+ - **Bun** - Full compatibility with Bun's Node.js API
48
+ - **Performance-critical applications** - Processing large document batches or real-time extraction
49
+ - **Server-side extraction** - APIs, microservices, document processing pipelines
50
+
51
+ ### Comparison with @kreuzberg/wasm
52
+
53
+ | Aspect | `@kreuzberg/node` | `@kreuzberg/wasm` |
54
+ |--------|------------------|-------------------|
55
+ | **Performance** | 2-3x faster than WASM (native) | Baseline |
56
+ | **Environment** | Node.js, Bun | Browser, Deno, Workers, Node.js |
57
+ | **Bundle Size** | 10-15 MB (prebuilt binary) | 2-4 MB (WASM module) |
58
+ | **System Access** | Direct system calls | Sandboxed via WASM |
59
+ | **Best For** | Server-side, batch processing | Client-side, edge computing |
60
+
61
+ Use `@kreuzberg/wasm` for browser applications, Cloudflare Workers, Deno, or when you need a smaller bundle size.
62
+
63
+ ## Requirements
64
+
65
+ - Node.js 18 or higher
66
+ - Native bindings are prebuilt for:
67
+ - macOS (x64, arm64)
68
+ - Linux (x64, arm64, armv7)
69
+ - Windows (x64, arm64)
70
+
71
+ ### Optional System Dependencies
72
+
73
+ - **ONNX Runtime**: For embeddings functionality
74
+ - macOS: `brew install onnxruntime`
75
+ - Ubuntu: `sudo apt-get install libonnxruntime libonnxruntime-dev`
76
+ - Windows: `scoop install onnxruntime` or download from [GitHub](https://github.com/microsoft/onnxruntime/releases)
77
+
78
+ - **Tesseract**: For OCR functionality
79
+ - macOS: `brew install tesseract`
80
+ - Ubuntu: `sudo apt-get install tesseract-ocr`
81
+ - Windows: Download from [GitHub](https://github.com/tesseract-ocr/tesseract)
82
+
83
+ - **LibreOffice**: For legacy MS Office formats (.doc, .ppt)
84
+ - macOS: `brew install libreoffice`
85
+ - Ubuntu: `sudo apt-get install libreoffice`
86
+
87
+ - **Pandoc**: For advanced document conversion
88
+ - macOS: `brew install pandoc`
89
+ - Ubuntu: `sudo apt-get install pandoc`
90
+
91
+ ## Installation
92
+
93
+ ```bash
94
+ npm install @kreuzberg/node
95
+ ```
96
+
97
+ Or with pnpm:
98
+
99
+ ```bash
100
+ pnpm add @kreuzberg/node
101
+ ```
102
+
103
+ Or with yarn:
104
+
105
+ ```bash
106
+ yarn add @kreuzberg/node
107
+ ```
108
+
109
+ The package includes prebuilt native binaries for major platforms. No additional build steps are required.
110
+
111
+ ## Quick Start
112
+
113
+ ### Basic Extraction
114
+
115
+ ```typescript
116
+ import { extractFileSync } from '@kreuzberg/node';
117
+
118
+ // Synchronous extraction
119
+ const result = extractFileSync('document.pdf');
120
+ console.log(result.content);
121
+ console.log(result.metadata);
122
+ ```
123
+
124
+ ### Async Extraction (Recommended)
125
+
126
+ ```typescript
127
+ import { extractFile } from '@kreuzberg/node';
128
+
129
+ // Asynchronous extraction
130
+ const result = await extractFile('document.pdf');
131
+ console.log(result.content);
132
+ console.log(result.tables);
133
+ ```
134
+
135
+ ### With Full Type Safety
136
+
137
+ ```typescript
138
+ import {
139
+ extractFile,
140
+ type ExtractionConfig,
141
+ type ExtractionResult
142
+ } from '@kreuzberg/node';
143
+
144
+ const config: ExtractionConfig = {
145
+ useCache: true,
146
+ enableQualityProcessing: true
147
+ };
148
+
149
+ const result: ExtractionResult = await extractFile('invoice.pdf', config);
150
+
151
+ // Type-safe access to all properties
152
+ console.log(result.content);
153
+ console.log(result.mimeType);
154
+ console.log(result.metadata);
155
+
156
+ if (result.tables) {
157
+ for (const table of result.tables) {
158
+ console.log(table.markdown);
159
+ }
160
+ }
161
+ ```
162
+
163
+ ## Configuration
164
+
165
+ ### OCR Configuration
166
+
167
+ ```typescript
168
+ import { extractFile, type ExtractionConfig, type OcrConfig } from '@kreuzberg/node';
169
+
170
+ const config: ExtractionConfig = {
171
+ ocr: {
172
+ backend: 'tesseract',
173
+ language: 'eng',
174
+ tesseractConfig: {
175
+ enableTableDetection: true,
176
+ psm: 6,
177
+ minConfidence: 50.0
178
+ }
179
+ } as OcrConfig
180
+ };
181
+
182
+ const result = await extractFile('scanned.pdf', config);
183
+ console.log(result.content);
184
+ ```
185
+
186
+ ### PDF Password Protection
187
+
188
+ ```typescript
189
+ import { extractFile, type PdfConfig } from '@kreuzberg/node';
190
+
191
+ const config = {
192
+ pdfOptions: {
193
+ passwords: ['password1', 'password2'],
194
+ extractImages: true,
195
+ extractMetadata: true
196
+ } as PdfConfig
197
+ };
198
+
199
+ const result = await extractFile('protected.pdf', config);
200
+ ```
201
+
202
+ ### Extract Tables
203
+
204
+ ```typescript
205
+ import { extractFile } from '@kreuzberg/node';
206
+
207
+ const result = await extractFile('financial-report.pdf');
208
+
209
+ if (result.tables) {
210
+ for (const table of result.tables) {
211
+ console.log('Table as Markdown:');
212
+ console.log(table.markdown);
213
+
214
+ console.log('Table cells:');
215
+ console.log(JSON.stringify(table.cells, null, 2));
216
+ }
217
+ }
218
+ ```
219
+
220
+ ### Text Chunking
221
+
222
+ ```typescript
223
+ import { extractFile, type ChunkingConfig } from '@kreuzberg/node';
224
+
225
+ const config = {
226
+ chunking: {
227
+ maxChars: 1000,
228
+ maxOverlap: 200
229
+ } as ChunkingConfig
230
+ };
231
+
232
+ const result = await extractFile('long-document.pdf', config);
233
+
234
+ if (result.chunks) {
235
+ for (const chunk of result.chunks) {
236
+ console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 100)}...`);
237
+ }
238
+ }
239
+ ```
240
+
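To build intuition for how `maxChars` and `maxOverlap` interact, here is a minimal sketch of a naive sliding-window chunker. It is not Kreuzberg's chunking algorithm (which may respect sentence or paragraph boundaries); `chunkWithOverlap` and `SketchChunk` are hypothetical names used only for illustration.

```typescript
// Illustrative only: a naive character-window chunker, not Kreuzberg's implementation.
interface SketchChunk {
  index: number;
  text: string;
}

function chunkWithOverlap(text: string, maxChars: number, maxOverlap: number): SketchChunk[] {
  if (maxChars <= 0) throw new RangeError('maxChars must be positive');
  if (maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new RangeError('maxOverlap must be in [0, maxChars)');
  }
  const step = maxChars - maxOverlap; // how far the window advances each time
  const chunks: SketchChunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, text: text.slice(start, start + maxChars) });
    if (start + maxChars >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// Each chunk repeats the last 2 characters of the previous one.
const demoChunks = chunkWithOverlap('abcdefghij', 4, 2);
console.log(demoChunks.map((c) => c.text)); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks for the same input.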
241
+ ### Language Detection
242
+
243
+ ```typescript
244
+ import { extractFile, type LanguageDetectionConfig } from '@kreuzberg/node';
245
+
246
+ const config = {
247
+ languageDetection: {
248
+ enabled: true,
249
+ minConfidence: 0.8,
250
+ detectMultiple: false
251
+ } as LanguageDetectionConfig
252
+ };
253
+
254
+ const result = await extractFile('multilingual.pdf', config);
255
+
256
+ if (result.language) {
257
+ console.log(`Detected language: ${result.language.code}`);
258
+ console.log(`Confidence: ${result.language.confidence}`);
259
+ }
260
+ ```
261
+
262
+ ### Image Extraction
263
+
264
+ ```typescript
265
+ import { extractFile, type ImageExtractionConfig } from '@kreuzberg/node';
266
+ import { writeFile } from 'fs/promises';
267
+
268
+ const config = {
269
+ images: {
270
+ extractImages: true,
271
+ targetDpi: 300,
272
+ maxImageDimension: 4096,
273
+ autoAdjustDpi: true
274
+ } as ImageExtractionConfig
275
+ };
276
+
277
+ const result = await extractFile('document-with-images.pdf', config);
278
+
279
+ if (result.images) {
280
+ for (let i = 0; i < result.images.length; i++) {
281
+ const image = result.images[i];
282
+ await writeFile(`image-${i}.${image.format}`, Buffer.from(image.data));
283
+ }
284
+ }
285
+ ```
286
+
287
+ ### Complete Configuration Example
288
+
289
+ ```typescript
290
+ import {
291
+ extractFile,
292
+ type ExtractionConfig,
293
+ type OcrConfig,
294
+ type ChunkingConfig,
295
+ type ImageExtractionConfig,
296
+ type PdfConfig,
297
+ type TokenReductionConfig,
298
+ type LanguageDetectionConfig
299
+ } from '@kreuzberg/node';
300
+
301
+ const config: ExtractionConfig = {
302
+ useCache: true,
303
+ enableQualityProcessing: true,
304
+ forceOcr: false,
305
+ maxConcurrentExtractions: 8,
306
+
307
+ ocr: {
308
+ backend: 'tesseract',
309
+ language: 'eng',
310
+ preprocessing: true,
311
+ tesseractConfig: {
312
+ enableTableDetection: true,
313
+ psm: 6,
314
+ oem: 3,
315
+ minConfidence: 50.0
316
+ }
317
+ } as OcrConfig,
318
+
319
+ chunking: {
320
+ maxChars: 1000,
321
+ maxOverlap: 200
322
+ } as ChunkingConfig,
323
+
324
+ images: {
325
+ extractImages: true,
326
+ targetDpi: 300,
327
+ maxImageDimension: 4096,
328
+ autoAdjustDpi: true
329
+ } as ImageExtractionConfig,
330
+
331
+ pdfOptions: {
332
+ extractImages: true,
333
+ passwords: [],
334
+ extractMetadata: true
335
+ } as PdfConfig,
336
+
337
+ tokenReduction: {
338
+ mode: 'moderate',
339
+ preserveImportantWords: true
340
+ } as TokenReductionConfig,
341
+
342
+ languageDetection: {
343
+ enabled: true,
344
+ minConfidence: 0.8,
345
+ detectMultiple: false
346
+ } as LanguageDetectionConfig
347
+ };
348
+
349
+ const result = await extractFile('document.pdf', config);
350
+ ```
351
+
352
+ ## Advanced Usage
353
+
354
+ ### Extract from Buffer
355
+
356
+ ```typescript
357
+ import { extractBytes } from '@kreuzberg/node';
358
+ import { readFile } from 'fs/promises';
359
+
360
+ const buffer = await readFile('document.pdf');
361
+ const result = await extractBytes(buffer, 'application/pdf');
362
+ console.log(result.content);
363
+ ```
364
+
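Because `extractBytes` takes the MIME type as an argument, callers that start from a file name need some way to derive it. The sketch below is one possible approach using a small lookup table; `mimeTypeForPath` and `MIME_BY_EXTENSION` are hypothetical helpers, not part of the package, and whether Kreuzberg accepts every string listed here is an assumption (the example above only confirms `application/pdf`).

```typescript
import { extname } from 'node:path';

// Well-known media types. Acceptance of each value by `extractBytes` is assumed, not verified.
const MIME_BY_EXTENSION: Record<string, string> = {
  '.pdf': 'application/pdf',
  '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  '.html': 'text/html',
  '.txt': 'text/plain',
  '.csv': 'text/csv',
  '.json': 'application/json',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
};

// Returns undefined for unknown extensions so the caller can decide how to handle them.
function mimeTypeForPath(filePath: string): string | undefined {
  return MIME_BY_EXTENSION[extname(filePath).toLowerCase()];
}

console.log(mimeTypeForPath('report.PDF')); // application/pdf
console.log(mimeTypeForPath('notes.unknown')); // undefined
```

The result could then be passed as the second argument to `extractBytes` once the `undefined` case has been handled.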
365
+ ### Batch Processing
366
+
367
+ ```typescript
368
+ import { batchExtractFiles } from '@kreuzberg/node';
369
+
370
+ const files = [
371
+ 'document1.pdf',
372
+ 'document2.docx',
373
+ 'document3.xlsx'
374
+ ];
375
+
376
+ const results = await batchExtractFiles(files);
377
+
378
+ for (const result of results) {
379
+ console.log(`${result.mimeType}: ${result.content.length} characters`);
380
+ }
381
+ ```
382
+
383
+ ### Batch Processing with Custom Concurrency
384
+
385
+ ```typescript
386
+ import { batchExtractFiles } from '@kreuzberg/node';
387
+
388
+ const config = {
389
+ maxConcurrentExtractions: 4 // Process 4 files at a time
390
+ };
391
+
392
+ const files = Array.from({ length: 20 }, (_, i) => `file-${i}.pdf`);
393
+ const results = await batchExtractFiles(files, config);
394
+
395
+ console.log(`Processed ${results.length} files`);
396
+ ```
397
+
398
+ ### Extract with Metadata
399
+
400
+ ```typescript
401
+ import { extractFile } from '@kreuzberg/node';
402
+
403
+ const result = await extractFile('document.pdf');
404
+
405
+ if (result.metadata) {
406
+ console.log('Title:', result.metadata.title);
407
+ console.log('Author:', result.metadata.author);
408
+ console.log('Creation Date:', result.metadata.creationDate);
409
+ console.log('Page Count:', result.metadata.pageCount);
410
+ console.log('Word Count:', result.metadata.wordCount);
411
+ }
412
+ ```
413
+
414
+ ### Token Reduction for LLM Processing
415
+
416
+ ```typescript
417
+ import { extractFile, type TokenReductionConfig } from '@kreuzberg/node';
418
+
419
+ const config = {
420
+ tokenReduction: {
421
+ mode: 'aggressive', // Options: 'light', 'moderate', 'aggressive'
422
+ preserveImportantWords: true
423
+ } as TokenReductionConfig
424
+ };
425
+
426
+ const result = await extractFile('long-document.pdf', config);
427
+
428
+ // Reduced token count while preserving meaning
429
+ console.log(`Content length: ${result.content.length}`);
430
+ console.log(`Processed for LLM context window`);
431
+ ```
432
+
433
+ ## Error Handling
434
+
435
+ ```typescript
436
+ import {
437
+ extractFile,
438
+ KreuzbergError,
439
+ ValidationError,
440
+ ParsingError,
441
+ OCRError,
442
+ MissingDependencyError
443
+ } from '@kreuzberg/node';
444
+
445
+ try {
446
+ const result = await extractFile('document.pdf');
447
+ console.log(result.content);
448
+ } catch (error) {
449
+ if (error instanceof ValidationError) {
450
+ console.error('Invalid configuration or input:', error.message);
451
+ } else if (error instanceof ParsingError) {
452
+ console.error('Failed to parse document:', error.message);
453
+ } else if (error instanceof OCRError) {
454
+ console.error('OCR processing failed:', error.message);
455
+ } else if (error instanceof MissingDependencyError) {
456
+ console.error(`Missing dependency: ${error.dependency}`);
457
+ console.error('Installation instructions:', error.message);
458
+ } else if (error instanceof KreuzbergError) {
459
+ console.error('Kreuzberg error:', error.message);
460
+ } else {
461
+ throw error;
462
+ }
463
+ }
464
+ ```
465
+
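The order of the `instanceof` checks above matters: since every specific error extends `KreuzbergError`, the base-class check has to come last. The sketch below demonstrates this with local stand-in classes so it runs without the native package installed; the stand-ins and `classifyFailure` are hypothetical and only mirror the hierarchy described in this README.

```typescript
// Stand-in classes for illustration; in real code, import these from '@kreuzberg/node'.
class KreuzbergError extends Error {
  constructor(message?: string) {
    super(message);
    // Keeps `instanceof` reliable even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class ValidationError extends KreuzbergError {}
class ParsingError extends KreuzbergError {}
class OCRError extends KreuzbergError {}

type FailureKind = 'validation' | 'parsing' | 'ocr' | 'kreuzberg' | 'unknown';

function classifyFailure(error: unknown): FailureKind {
  // Most specific classes first, base class last.
  if (error instanceof ValidationError) return 'validation';
  if (error instanceof ParsingError) return 'parsing';
  if (error instanceof OCRError) return 'ocr';
  if (error instanceof KreuzbergError) return 'kreuzberg';
  return 'unknown';
}

console.log(classifyFailure(new OCRError('low confidence'))); // ocr
console.log(classifyFailure(new Error('something else'))); // unknown
```

If the `KreuzbergError` check were moved to the top, every specific error would be reported as the generic kind.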
466
+ ## API Reference
467
+
468
+ ### Extraction Functions
469
+
470
+ #### `extractFile(filePath: string, config?: ExtractionConfig): Promise<ExtractionResult>`
471
+ Asynchronously extract content from a file.
472
+
473
+ #### `extractFileSync(filePath: string, config?: ExtractionConfig): ExtractionResult`
474
+ Synchronously extract content from a file.
475
+
476
+ #### `extractBytes(data: Buffer, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult>`
477
+ Asynchronously extract content from a buffer.
478
+
479
+ #### `extractBytesSync(data: Buffer, mimeType: string, config?: ExtractionConfig): ExtractionResult`
480
+ Synchronously extract content from a buffer.
481
+
482
+ #### `batchExtractFiles(paths: string[], config?: ExtractionConfig): Promise<ExtractionResult[]>`
483
+ Asynchronously extract content from multiple files in parallel.
484
+
485
+ #### `batchExtractFilesSync(paths: string[], config?: ExtractionConfig): ExtractionResult[]`
486
+ Synchronously extract content from multiple files.
487
+
488
+ ### Types
489
+
490
+ #### `ExtractionResult`
491
+ Main result object containing:
492
+ - `content: string` - Extracted text content
493
+ - `mimeType: string` - MIME type of the document
494
+ - `metadata?: Metadata` - Document metadata
495
+ - `tables?: Table[]` - Extracted tables
496
+ - `images?: ImageData[]` - Extracted images
497
+ - `chunks?: Chunk[]` - Text chunks (if chunking enabled)
498
+ - `language?: LanguageInfo` - Detected language (if enabled)
499
+
500
+ #### `ExtractionConfig`
501
+ Configuration object for extraction:
502
+ - `useCache?: boolean` - Enable result caching
503
+ - `enableQualityProcessing?: boolean` - Enable text quality improvements
504
+ - `forceOcr?: boolean` - Force OCR even for text-based PDFs
505
+ - `maxConcurrentExtractions?: number` - Max parallel extractions
506
+ - `ocr?: OcrConfig` - OCR settings
507
+ - `chunking?: ChunkingConfig` - Text chunking settings
508
+ - `images?: ImageExtractionConfig` - Image extraction settings
509
+ - `pdfOptions?: PdfConfig` - PDF-specific options
510
+ - `tokenReduction?: TokenReductionConfig` - Token reduction settings
511
+ - `languageDetection?: LanguageDetectionConfig` - Language detection settings
512
+
513
+ #### `OcrConfig`
514
+ OCR configuration:
515
+ - `backend: string` - OCR backend ('tesseract', 'easyocr', 'paddleocr')
516
+ - `language: string` - Language code (e.g., 'eng', 'fra', 'deu')
517
+ - `preprocessing?: boolean` - Enable image preprocessing
518
+ - `tesseractConfig?: TesseractConfig` - Tesseract-specific options
519
+
520
+ #### `Table`
521
+ Extracted table structure:
522
+ - `markdown: string` - Table in Markdown format
523
+ - `cells: TableCell[][]` - 2D array of table cells
524
+ - `rowCount: number` - Number of rows
525
+ - `columnCount: number` - Number of columns
526
+
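As a rough illustration of working with the `cells` grid, the sketch below serializes a 2D array to CSV. The shape of `TableCell` is not documented here, so the helper accepts `unknown` values and renders them with `String()`; `cellsToCsv` is a hypothetical helper, not part of the package.

```typescript
// Quote a value only when it contains a comma, quote, or line break (RFC 4180 style).
function quoteCsvValue(value: unknown): string {
  const text = value == null ? '' : String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Assumes each cell can be rendered meaningfully with String().
function cellsToCsv(cells: unknown[][]): string {
  return cells.map((row) => row.map(quoteCsvValue).join(',')).join('\n');
}

const csv = cellsToCsv([
  ['Item', 'Price'],
  ['Widget, large', '9.50'],
  ['He said "hi"', 3],
]);
console.log(csv);
```

If `TableCell` turns out to be an object rather than a string, the mapping step would need to pick the relevant text field before quoting.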
527
+ ### Exceptions
528
+
529
+ All Kreuzberg exceptions extend the base `KreuzbergError` class:
530
+
531
+ - `KreuzbergError` - Base error class for all Kreuzberg errors
532
+ - `ValidationError` - Invalid configuration, missing required fields, or invalid input
533
+ - `ParsingError` - Document parsing failure or corrupted file
534
+ - `OCRError` - OCR processing failure
535
+ - `MissingDependencyError` - Missing optional system dependency (includes installation instructions)
536
+
537
+ ## Supported Formats
538
+
539
+ | Category | Formats |
540
+ |----------|---------|
541
+ | **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
542
+ | **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
543
+ | **Web** | HTML, XHTML, XML |
544
+ | **Text** | TXT, MD, CSV, TSV, JSON, YAML, TOML |
545
+ | **Email** | EML, MSG |
546
+ | **Archives** | ZIP, TAR, 7Z |
547
+ | **Other** | Additional formats (56 supported in total) |
548
+
549
+ ## Performance
550
+
551
+ Kreuzberg is built with a native Rust core, providing significant performance improvements over pure JavaScript solutions:
552
+
553
+ - **10-50x faster** text extraction compared to pure Node.js libraries
554
+ - **Native multithreading** for batch processing
555
+ - **Optimized memory usage** with streaming for large files
556
+ - **Zero-copy operations** where possible
557
+ - **Efficient caching** to avoid redundant processing
558
+
559
+ ### Benchmarks
560
+
561
+ Processing 100 mixed documents (PDF, DOCX, XLSX):
562
+
563
+ | Library | Time | Memory |
564
+ |---------|------|--------|
565
+ | Kreuzberg | 2.3s | 145 MB |
566
+ | pdf-parse + mammoth | 23.1s | 890 MB |
567
+ | textract | 45.2s | 1.2 GB |
568
+
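To read the table as ratios rather than absolute numbers, the following sketch divides each row by the Kreuzberg row. The figures are copied from the table above, with 1.2 GB taken as 1200 MB; `relativeTo` and `BenchmarkRow` are illustrative names.

```typescript
interface BenchmarkRow {
  library: string;
  seconds: number;
  memoryMb: number;
}

const benchmarkRows: BenchmarkRow[] = [
  { library: 'Kreuzberg', seconds: 2.3, memoryMb: 145 },
  { library: 'pdf-parse + mammoth', seconds: 23.1, memoryMb: 890 },
  { library: 'textract', seconds: 45.2, memoryMb: 1200 }, // 1.2 GB taken as 1200 MB
];

// How many times slower and heavier `other` is compared with `baseline`.
function relativeTo(baseline: BenchmarkRow, other: BenchmarkRow) {
  return {
    library: other.library,
    timeRatio: other.seconds / baseline.seconds,
    memoryRatio: other.memoryMb / baseline.memoryMb,
  };
}

const baselineRow = benchmarkRows[0];
for (const row of benchmarkRows.slice(1)) {
  const r = relativeTo(baselineRow, row);
  console.log(`${r.library}: ${r.timeRatio.toFixed(1)}x time, ${r.memoryRatio.toFixed(1)}x memory`);
}
// pdf-parse + mammoth: 10.0x time, 6.1x memory
// textract: 19.7x time, 8.3x memory
```

For this particular workload the time ratios come out at roughly 10x and 20x, and the memory ratios at roughly 6x and 8x.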
569
+ ## Troubleshooting
570
+
571
+ ### Native Module Not Found
572
+
573
+ If you encounter errors about missing native modules:
574
+
575
+ ```bash
576
+ npm rebuild @kreuzberg/node
577
+ ```
578
+
579
+ ### OCR Not Working
580
+
581
+ Ensure Tesseract is installed and available in PATH:
582
+
583
+ ```bash
584
+ tesseract --version
585
+ ```
586
+
587
+ If Tesseract is not found:
588
+ - macOS: `brew install tesseract`
589
+ - Ubuntu: `sudo apt-get install tesseract-ocr`
590
+ - Windows: Download from [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
591
+
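The same PATH check can be done from Node.js before attempting OCR, which can give a clearer message than a failed extraction. This is a minimal sketch using `spawnSync` from `node:child_process`; `isCommandAvailable` is a hypothetical helper, and it only verifies that the binary starts, not that language data is installed.

```typescript
import { spawnSync } from 'node:child_process';

// Returns true when the command can be spawned and exits with status 0.
function isCommandAvailable(command: string, args: string[] = ['--version']): boolean {
  const result = spawnSync(command, args, { stdio: 'ignore' });
  // `error` is set when the process could not be started (for example ENOENT).
  return result.error === undefined && result.status === 0;
}

if (!isCommandAvailable('tesseract')) {
  console.warn('tesseract was not found in PATH; OCR with the Tesseract backend will likely fail.');
}
```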
592
+ ### Memory Issues with Large PDFs
593
+
594
+ For very large PDFs, use chunking to reduce memory usage:
595
+
596
+ ```typescript
597
+ import { extractFile } from '@kreuzberg/node';
+
+ const config = {
598
+ chunking: { maxChars: 1000 }
599
+ };
600
+ const result = await extractFile('large.pdf', config);
601
+ ```
602
+
603
+ ### TypeScript Types Not Resolving
604
+
605
+ Make sure you're using:
606
+ - Node.js 18 or higher
607
+ - TypeScript 5.0 or higher
608
+
609
+ The package includes built-in type definitions.
610
+
611
+ ### Performance Optimization
612
+
613
+ For maximum performance when processing many files:
614
+
615
+ ```typescript
616
+ import { batchExtractFiles } from '@kreuzberg/node';
+
+ // Use batch processing instead of sequential calls; `files` is an array of file paths
617
+ const results = await batchExtractFiles(files, {
618
+ maxConcurrentExtractions: 8 // Tune based on CPU cores
619
+ });
620
+ ```
621
+
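One way to "tune based on CPU cores" is to derive the value from the machine at startup. The heuristic below (leave one core free, cap at 8) is an assumption for illustration, not a recommendation from the Kreuzberg project; `pickConcurrency` is a hypothetical helper.

```typescript
import { cpus } from 'node:os';

// Heuristic: use all cores but one, never fewer than 1, never more than `cap`.
function pickConcurrency(coreCount: number, cap = 8): number {
  const usable = Math.max(1, Math.floor(coreCount) - 1);
  return Math.min(cap, usable);
}

const maxConcurrentExtractions = pickConcurrency(cpus().length);
console.log({ maxConcurrentExtractions });
```

The resulting number can be passed as `maxConcurrentExtractions` in the config; the right value for a given workload is best confirmed by measuring.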
622
+ ## Examples
623
+
624
+ ### Extract Invoice Data
625
+
626
+ ```typescript
627
+ import { extractFile } from '@kreuzberg/node';
628
+
629
+ const result = await extractFile('invoice.pdf');
630
+
631
+ // Access tables for line items
632
+ if (result.tables && result.tables.length > 0) {
633
+ const lineItems = result.tables[0];
634
+ console.log(lineItems.markdown);
635
+ }
636
+
637
+ // Access metadata for invoice details
638
+ if (result.metadata) {
639
+ console.log('Invoice Date:', result.metadata.creationDate);
640
+ }
641
+ ```
642
+
643
+ ### Process Scanned Documents
644
+
645
+ ```typescript
646
+ import { extractFile } from '@kreuzberg/node';
647
+
648
+ const config = {
649
+ forceOcr: true,
650
+ ocr: {
651
+ backend: 'tesseract',
652
+ language: 'eng',
653
+ preprocessing: true
654
+ }
655
+ };
656
+
657
+ const result = await extractFile('scanned-contract.pdf', config);
658
+ console.log(result.content);
659
+ ```
660
+
661
+ ### Build a Document Search Index
662
+
663
+ ```typescript
664
+ import { batchExtractFiles } from '@kreuzberg/node';
665
+ import { glob } from 'glob';
666
+
667
+ // Find all documents
668
+ const files = await glob('documents/**/*.{pdf,docx,xlsx}');
669
+
670
+ // Extract in batches
671
+ const results = await batchExtractFiles(files, {
672
+ maxConcurrentExtractions: 8,
673
+ enableQualityProcessing: true
674
+ });
675
+
676
+ // Build search index
677
+ const searchIndex = results.map((result, i) => ({
678
+ path: files[i],
679
+ content: result.content,
680
+ metadata: result.metadata
681
+ }));
682
+
683
+ console.log(`Indexed ${searchIndex.length} documents`);
684
+ ```
685
+
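The `searchIndex` array above stores documents but does not yet support lookups. As a minimal sketch of the next step, the code below builds an inverted index over `{ path, content }` entries and answers all-terms-must-match queries. The tokenizer is deliberately simple (lowercase ASCII letters and digits only), and all names here are hypothetical.

```typescript
interface IndexedDocument {
  path: string;
  content: string;
}

// Simplified tokenizer: lowercase runs of ASCII letters and digits.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Maps each term to the set of document paths that contain it.
function buildInvertedIndex(docs: IndexedDocument[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const term of tokenize(doc.content)) {
      let paths = index.get(term);
      if (!paths) {
        paths = new Set<string>();
        index.set(term, paths);
      }
      paths.add(doc.path);
    }
  }
  return index;
}

// Returns the sorted paths of documents that contain every query term.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let hits = Array.from(index.get(terms[0]) ?? []);
  for (const term of terms.slice(1)) {
    const paths = index.get(term);
    hits = hits.filter((p) => paths !== undefined && paths.has(p));
  }
  return hits.sort();
}

const sampleIndex = buildInvertedIndex([
  { path: 'a.pdf', content: 'Quarterly revenue report' },
  { path: 'b.docx', content: 'Revenue forecast and budget' },
]);
console.log(search(sampleIndex, 'revenue budget')); // [ 'b.docx' ]
```

For real workloads a dedicated search library or database would usually be a better fit than this in-memory structure.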
686
+ ## Documentation
687
+
688
+ For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev).
689
+
690
+ ## Contributing
691
+
692
+ We welcome contributions! Please see our [Contributing Guide](../../CONTRIBUTING.md) for details.
693
+
694
+ ## License
695
+
696
+ MIT
697
+
698
+ ## Links
699
+
700
+ - [Website](https://kreuzberg.dev)
701
+ - [Documentation](https://kreuzberg.dev)
702
+ - [GitHub](https://github.com/kreuzberg-dev/kreuzberg)
703
+ - [Issue Tracker](https://github.com/kreuzberg-dev/kreuzberg/issues)
704
+ - [Changelog](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md)
705
+ - [npm Package](https://www.npmjs.com/package/@kreuzberg/node)
package/dist/cli.d.mts ADDED
@@ -0,0 +1,9 @@
1
+ #!/usr/bin/env node
2
+ /**
3
+ * Proxy entry point that forwards to the Rust-based Kreuzberg CLI.
4
+ *
5
+ * This keeps `npx kreuzberg` working without shipping an additional TypeScript CLI implementation.
6
+ */
7
+ declare function main(argv: string[]): number;
8
+
9
+ export { main };