@dcyfr/ai-rag 0.2.0 → 1.1.0

package/docs/API.md ADDED
@@ -0,0 +1,1267 @@
1
+ # API Reference
2
+
3
+ **@dcyfr/ai-rag** - RAG (Retrieval-Augmented Generation) framework for Node.js and TypeScript
4
+
5
+ Version: 1.0.0 (Production Ready)
6
+ Last Updated: February 7, 2026
7
+
8
+ ---
9
+
10
+ ## Table of Contents
11
+
12
+ - [Overview](#overview)
13
+ - [Installation](#installation)
14
+ - [Document Loaders](#document-loaders)
15
+ - [TextLoader](#textloader)
16
+ - [MarkdownLoader](#markdownloader)
17
+ - [HTMLLoader](#htmlloader)
18
+ - [Embedding Generators](#embedding-generators)
19
+ - [SimpleEmbeddingGenerator](#simpleembeddinggenerator)
20
+ - [Vector Stores](#vector-stores)
21
+ - [InMemoryVectorStore](#inmemoryvectorstore)
22
+ - [Pipelines](#pipelines)
23
+ - [IngestionPipeline](#ingestionpipeline)
24
+ - [RetrievalPipeline](#retrievalpipeline)
25
+ - [EmbeddingPipeline](#embeddingpipeline)
26
+ - [Core Types](#core-types)
27
+ - [Document Types](#document-types)
28
+ - [Vector Store Types](#vector-store-types)
29
+ - [Pipeline Types](#pipeline-types)
30
+ - [Configuration](#configuration)
31
+ - [Error Handling](#error-handling)
32
+ - [Semantic Versioning Commitment](#semantic-versioning-commitment)
33
+
34
+ ---
35
+
36
+ ## Overview
37
+
38
+ `@dcyfr/ai-rag` provides a comprehensive framework for building production-ready RAG (Retrieval-Augmented Generation) systems, including:
39
+
40
+ - **Document loaders** for text, markdown, HTML with intelligent chunking
41
+ - **Embedding generators** with pluggable providers (OpenAI, Cohere, Anthropic, local)
42
+ - **Vector stores** for semantic search (in-memory, Chroma, Pinecone, Weaviate)
43
+ - **Pipelines** for streamlined ingestion and retrieval workflows
44
+ - **Metadata filtering** with complex queries (AND/OR, nested, temporal)
45
+ - **Multiple distance metrics** (cosine, dot product, euclidean)
46
+
47
+ ### Key Features
48
+
49
+ - ✅ TypeScript-first with complete type safety
50
+ - ✅ Zero-config defaults with deep customization
51
+ - ✅ Tree-shakeable ESM modules
52
+ - ✅ Comprehensive test coverage (97.67% lines, 86.15% branch)
53
+ - ✅ Production-ready error handling and retry logic
54
+ - ✅ Semantic versioning for API stability
55
+
56
+ ---
57
+
58
+ ## Installation
59
+
60
+ ```bash
61
+ npm install @dcyfr/ai-rag
62
+
63
+ # Or with yarn
64
+ yarn add @dcyfr/ai-rag
65
+
66
+ # Or with pnpm
67
+ pnpm add @dcyfr/ai-rag
68
+ ```
69
+
70
+ ### Optional Dependencies
71
+
72
+ ```bash
73
+ # For production embeddings
74
+ npm install openai # OpenAI embeddings
75
+ npm install @anthropic-ai/sdk # Anthropic embeddings
76
+ npm install cohere-ai # Cohere embeddings
77
+
78
+ # For persistent vector storage
79
+ npm install chromadb # Chroma vector database
80
+ npm install @pinecone-database/pinecone # Pinecone vector database
81
+ npm install weaviate-client # Weaviate vector database
82
+ ```
83
+
84
+ ### Requirements
85
+
86
+ - Node.js >= 20.0.0
87
+ - TypeScript >= 5.0.0 (for type definitions)
88
+
89
+ ---
90
+
91
+ ## Document Loaders
92
+
93
+ ### TextLoader
94
+
95
+ Load plain text documents with configurable chunking strategies.
96
+
97
+ #### API
98
+
99
+ ```typescript
100
+ import { TextLoader } from '@dcyfr/ai-rag';
101
+
102
+ const loader = new TextLoader();
103
+
104
+ const documents = await loader.load('./document.txt', {
105
+ chunkSize: 1000,
106
+ chunkOverlap: 200,
107
+ preserveFormatting: false,
108
+ metadata: {
109
+ author: 'John Doe',
110
+ category: 'technical',
111
+ },
112
+ });
113
+
114
+ console.log(documents);
115
+ // [
116
+ // {
117
+ // id: 'text-...',
118
+ // content: '...',
119
+ // metadata: { source: './document.txt', type: 'text', ... }
120
+ // }
121
+ // ]
122
+ ```
123
+
124
+ #### Options
125
+
126
+ | Option | Type | Default | Description |
127
+ |--------|------|---------|-------------|
128
+ | `chunkSize` | `number` | `1000` | Maximum chunk size in characters |
129
+ | `chunkOverlap` | `number` | `200` | Overlap between chunks (prevents context loss) |
130
+ | `preserveFormatting` | `boolean` | `false` | Keep original whitespace and formatting |
131
+ | `metadata` | `Record<string, unknown>` | `{}` | Custom metadata to attach to documents |
132
+
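The interaction between `chunkSize` and `chunkOverlap` can be sketched as a sliding window over the text. The helper below is hypothetical (`chunkText` is not part of the documented API) and assumes plain fixed-size character windows; the loader's actual splitting rules may differ.

```typescript
// Hypothetical helper: fixed-size character chunking with overlap.
// Illustrates the chunkSize/chunkOverlap semantics only; not the library's code.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new Error('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than chunkSize');
  }

  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// 10 characters, windows of 4 with 2 characters shared between neighbours
const sampleChunks = chunkText('abcdefghij', 4, 2);
console.log(sampleChunks);
```

Each chunk repeats the last `chunkOverlap` characters of the previous one, which is why a larger overlap produces more chunks for the same input.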
133
+ #### Supported Extensions
134
+
135
+ - `.txt` - Plain text files
136
+
137
+ #### Example: Custom Chunking
138
+
139
+ ```typescript
140
+ const loader = new TextLoader();
141
+
142
+ // Large overlap for narrative text
143
+ const docs = await loader.load('./story.txt', {
144
+ chunkSize: 500,
145
+ chunkOverlap: 100, // 20% overlap
146
+ });
147
+
148
+ // No overlap for structured data
149
+ const structuredDocs = await loader.load('./data.csv', {
150
+ chunkSize: 1000,
151
+ chunkOverlap: 0,
152
+ preserveFormatting: true,
153
+ });
154
+ ```
155
+
156
+ ---
157
+
158
+ ### MarkdownLoader
159
+
160
+ Load markdown documents with section-aware chunking and formatting options.
161
+
162
+ #### API
163
+
164
+ ```typescript
165
+ import { MarkdownLoader } from '@dcyfr/ai-rag';
166
+
167
+ const loader = new MarkdownLoader();
168
+
169
+ const documents = await loader.load('./README.md', {
170
+ chunkSize: 800,
171
+ chunkOverlap: 150,
172
+ preserveFormatting: false,
173
+ });
174
+
175
+ // Each document will have section metadata
176
+ documents.forEach(doc => {
177
+ console.log(doc.metadata.section); // Heading title
178
+ console.log(doc.metadata.title); // Document title (from # H1)
179
+ });
180
+ ```
181
+
182
+ #### Options
183
+
184
+ | Option | Type | Default | Description |
185
+ |--------|------|---------|-------------|
186
+ | `chunkSize` | `number` | `1000` | Maximum chunk size in characters |
187
+ | `chunkOverlap` | `number` | `200` | Overlap between chunks |
188
+ | `preserveFormatting` | `boolean` | `false` | Keep markdown syntax (`**`, `*`, `#`, etc.) |
189
+ | `metadata` | `Record<string, unknown>` | `{}` | Custom metadata |
190
+
191
+ #### Supported Extensions
192
+
193
+ - `.md` - Markdown files
194
+ - `.markdown` - Markdown files
195
+
196
+ #### Features
197
+
198
+ - **Title extraction** - Automatically extracts title from first `# Heading`
199
+ - **Section-aware chunking** - Splits by headings (`## Section`), preserves context
200
+ - **Format cleaning** - Removes markdown syntax (bold, italic, links, code blocks)
201
+ - **Large section handling** - Automatically sub-chunks sections exceeding `chunkSize`
202
+
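Section-aware chunking as described above can be approximated by splitting on heading lines. This is a minimal sketch under stated assumptions: `splitByHeadings` and `MarkdownSection` are hypothetical names, every ATX heading level starts a new section, and `#` lines inside fenced code blocks are not treated specially.

```typescript
// Hypothetical sketch of heading-based splitting; not the loader's implementation.
interface MarkdownSection {
  section: string; // heading text, '' for content before the first heading
  content: string; // body text under that heading
}

function splitByHeadings(markdown: string): MarkdownSection[] {
  const sections: MarkdownSection[] = [];
  let current: MarkdownSection = { section: '', content: '' };

  const flush = (): void => {
    // Skip the empty placeholder that exists before any heading or text is seen
    if (current.section !== '' || current.content.trim() !== '') {
      sections.push({ section: current.section, content: current.content.trim() });
    }
  };

  for (const line of markdown.split('\n')) {
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flush();
      current = { section: heading[2].trim(), content: '' };
    } else {
      current.content += line + '\n';
    }
  }
  flush();
  return sections;
}

const mdSections = splitByHeadings(
  '# Title\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it.'
);
console.log(mdSections.map(s => s.section));
```

A section longer than `chunkSize` would then need a second pass that sub-chunks its `content`.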
203
+ #### Example: Preserve Formatting
204
+
205
+ ```typescript
206
+ const loader = new MarkdownLoader();
207
+
208
+ // Keep markdown syntax for LLM prompts
209
+ const docs = await loader.load('./tutorial.md', {
210
+ preserveFormatting: true, // Keeps **, *, `, etc.
211
+ });
212
+
213
+ console.log(docs[0].content);
214
+ // "## Introduction\n\nLearn **TypeScript** with *examples*..."
215
+ ```
216
+
217
+ ---
218
+
219
+ ### HTMLLoader
220
+
221
+ Load HTML documents with tag cleaning and content extraction.
222
+
223
+ #### API
224
+
225
+ ```typescript
226
+ import { HTMLLoader } from '@dcyfr/ai-rag';
227
+
228
+ const loader = new HTMLLoader();
229
+
230
+ const documents = await loader.load('./page.html', {
231
+ chunkSize: 600,
232
+ chunkOverlap: 100,
233
+ preserveFormatting: false,
234
+ });
235
+
236
+ // Metadata includes extracted title from <title> tag
237
+ console.log(documents[0].metadata.title);
238
+ ```
239
+
240
+ #### Options
241
+
242
+ | Option | Type | Default | Description |
243
+ |--------|------|---------|-------------|
244
+ | `chunkSize` | `number` | `1000` | Maximum chunk size in characters |
245
+ | `chunkOverlap` | `number` | `200` | Overlap between chunks |
246
+ | `preserveFormatting` | `boolean` | `false` | Preserve some HTML structure |
247
+ | `metadata` | `Record<string, unknown>` | `{}` | Custom metadata |
248
+
249
+ #### Supported Extensions
250
+
251
+ - `.html` - HTML files
252
+ - `.htm` - HTML files
253
+
254
+ #### Features
255
+
256
+ - **Title extraction** - Extracts title from `<title>` tag
257
+ - **Script/style removal** - Strips `<script>` and `<style>` tags
258
+ - **HTML entity decoding** - Converts `&nbsp;`, `&lt;`, `&gt;`, etc.
259
+ - **Comment removal** - Removes HTML comments
260
+ - **Tag stripping** - Extracts plain text from HTML structure
261
+ - **Whitespace normalization** - Cleans excessive whitespace
262
+
263
+ #### Example: Web Scraping
264
+
265
+ ```typescript
266
+ const loader = new HTMLLoader();
267
+
268
+ // Load scraped web pages
269
+ const docs = await loader.load('./scraped-article.html', {
270
+ chunkSize: 500,
271
+ metadata: {
272
+ url: 'https://example.com/article',
273
+ scrapedAt: new Date().toISOString(),
274
+ },
275
+ });
276
+ ```
277
+
278
+ ---
279
+
280
+ ## Embedding Generators
281
+
282
+ ### SimpleEmbeddingGenerator
283
+
284
+ Lightweight embedding generator using random vectors (for testing/development).
285
+
286
+ #### API
287
+
288
+ ```typescript
289
+ import { SimpleEmbeddingGenerator } from '@dcyfr/ai-rag';
290
+
291
+ const embedder = new SimpleEmbeddingGenerator({
292
+ dimensions: 384,
293
+ model: 'simple-v1', // For logging/tracking
294
+ });
295
+
296
+ // Generate embeddings
297
+ const embeddings = await embedder.embed([
298
+ 'First document',
299
+ 'Second document',
300
+ 'Third document',
301
+ ]);
302
+
303
+ console.log(embeddings.length); // 3
304
+ console.log(embeddings[0].length); // 384
305
+
306
+ // Get embedding dimensions
307
+ const dims = embedder.getDimensions(); // 384
308
+ ```
309
+
310
+ #### Options
311
+
312
+ ```typescript
313
+ interface SimpleEmbeddingOptions {
314
+ /** Embedding dimensions (default: 384) */
315
+ dimensions?: number;
316
+
317
+ /** Model identifier for logging (default: 'simple') */
318
+ model?: string;
319
+
320
+ /** Random seed for reproducibility (default: Date.now()) */
321
+ seed?: number;
322
+ }
323
+ ```
324
+
325
+ #### Use Cases
326
+
327
+ - **Local development** - No API keys required
328
+ - **Testing** - Fast, deterministic embeddings with seed
329
+ - **Prototyping** - Quick setup before production embeddings
330
+ - **CI/CD** - Unit tests without external dependencies
331
+
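The "deterministic embeddings with seed" use case can be illustrated with a small seeded generator. The sketch below uses the well-known mulberry32 PRNG; `mulberry32` and `seededVector` are illustrative names, and nothing here is claimed to match how `SimpleEmbeddingGenerator` derives its vectors internally.

```typescript
// Hypothetical sketch: seeded pseudo-random vectors that repeat across runs.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // keep the state as an unsigned 32-bit integer
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

function seededVector(dimensions: number, seed: number): number[] {
  const next = mulberry32(seed);
  // Map [0, 1) to [-1, 1) so components are centred on zero
  return Array.from({ length: dimensions }, () => next() * 2 - 1);
}

const vecA = seededVector(384, 42);
const vecB = seededVector(384, 42); // same seed: identical vector
const vecC = seededVector(384, 43); // different seed: different vector
console.log(vecA.length, vecA.every((x, i) => x === vecB[i]));
```

Because the default seed is `Date.now()`, tests that need repeatable vectors should pass an explicit `seed`.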
332
+ #### Example: Production Embeddings
333
+
334
+ ```typescript
335
+ // For production, use real embedding providers
336
+ import OpenAI from 'openai';
337
+
338
+ class OpenAIEmbeddingGenerator {
339
+ private client: OpenAI;
340
+
341
+ constructor(apiKey: string) {
342
+ this.client = new OpenAI({ apiKey });
343
+ }
344
+
345
+ async embed(texts: string[]): Promise<number[][]> {
346
+ const response = await this.client.embeddings.create({
347
+ model: 'text-embedding-3-large',
348
+ input: texts,
349
+ });
350
+
351
+ return response.data.map(item => item.embedding);
352
+ }
353
+
354
+ getDimensions(): number {
355
+ return 3072; // text-embedding-3-large dimensions
356
+ }
357
+ }
358
+
359
+ const embedder = new OpenAIEmbeddingGenerator(process.env.OPENAI_API_KEY!); // assumes OPENAI_API_KEY is set
360
+ ```
361
+
362
+ ---
363
+
364
+ ## Vector Stores
365
+
366
+ ### InMemoryVectorStore
367
+
368
+ Fast in-memory vector store with metadata filtering and multiple distance metrics.
369
+
370
+ #### API
371
+
372
+ ```typescript
373
+ import { InMemoryVectorStore } from '@dcyfr/ai-rag';
374
+
375
+ const store = new InMemoryVectorStore({
376
+ collectionName: 'my-docs',
377
+ embeddingDimensions: 384,
378
+ distanceMetric: 'cosine', // or 'dot' or 'euclidean'
379
+ });
380
+
381
+ // Add documents
382
+ await store.addDocuments([
383
+ {
384
+ id: 'doc-1',
385
+ documentId: 'parent-1',
386
+ content: 'Machine learning is...',
387
+ index: 0,
388
+ metadata: { category: 'AI', author: 'John' },
389
+ embedding: [0.1, 0.2, /* ... */],
390
+ },
391
+ ]);
392
+
393
+ // Search by embedding vector
394
+ const results = await store.search([0.15, 0.18, /* ... */], 5);
395
+
396
+ // Search with metadata filter
397
+ const filteredResults = await store.search([0.15, 0.18, /* ... */], 5, {
398
+ field: 'category',
399
+ operator: 'eq',
400
+ value: 'AI',
401
+ });
402
+
403
+ // Update document
404
+ await store.updateDocument('doc-1', {
405
+ metadata: { category: 'AI', author: 'John', updated: true },
406
+ });
407
+
408
+ // Delete documents
409
+ await store.deleteDocuments(['doc-1', 'doc-2']);
410
+
411
+ // Get store statistics
412
+ const stats = await store.getStats();
413
+ console.log(stats);
414
+ // { collectionName: 'my-docs', dimensions: 384, documentCount: 100 }
415
+ ```
416
+
417
+ #### Configuration
418
+
419
+ ```typescript
420
+ interface VectorStoreConfig {
421
+ /** Collection/index name */
422
+ collectionName: string;
423
+
424
+ /** Embedding vector dimensions */
425
+ embeddingDimensions: number;
426
+
427
+ /** Distance metric for similarity (default: 'cosine') */
428
+ distanceMetric?: 'cosine' | 'dot' | 'euclidean';
429
+
430
+ /** Storage path for persistent stores (optional) */
431
+ storagePath?: string;
432
+ }
433
+ ```
434
+
435
+ #### Distance Metrics
436
+
437
+ | Metric | Use Case | Range | Notes |
438
+ |--------|----------|-------|-------|
439
+ | `cosine` | General text similarity | 0-1 (higher = more similar) | Normalized, handles different lengths well |
440
+ | `dot` | Fast similarity when vectors normalized | -∞ to +∞ | Faster than cosine, assumes normalized vectors |
441
+ | `euclidean` | Spatial distance | 0 to +∞ (lower = more similar) | L2 distance, sensitive to magnitude |
442
+
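The three metrics in the table can be written out directly. This is a minimal sketch with hypothetical function names, not the store's internal code. Note that raw cosine similarity ranges from -1 to 1; the 0-1 range listed above suggests the store rescales or clamps the value, which is not shown here.

```typescript
// Hypothetical reference implementations of the three metrics.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function cosineSimilarity(a: number[], b: number[]): number {
  // Dot product divided by the product of the vector lengths
  const denom = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
  return denom === 0 ? 0 : dotProduct(a, b) / denom;
}

function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

const u = [1, 0];
const v = [2, 0]; // same direction as u, twice the magnitude
const w = [0, 1]; // orthogonal to u

console.log(cosineSimilarity(u, v), dotProduct(u, v), euclideanDistance(u, v));
console.log(cosineSimilarity(u, w), dotProduct(u, w), euclideanDistance(u, w));
```

For `u` and `v`, cosine ignores the difference in magnitude while dot product and euclidean distance do not, which is the practical meaning of "sensitive to magnitude" in the table.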
443
+ #### Metadata Filtering
444
+
445
+ ```typescript
446
+ interface MetadataFilter {
447
+ field: string;
448
+ operator: 'eq' | 'ne' | 'gt' | 'gte' | 'lt' | 'lte' | 'in' | 'nin';
449
+ value: unknown;
450
+ }
451
+
452
+ // Equality
453
+ await store.search(query, 10, {
454
+ field: 'category',
455
+ operator: 'eq',
456
+ value: 'AI',
457
+ });
458
+
459
+ // Numeric comparison
460
+ await store.search(query, 10, {
461
+ field: 'publishedYear',
462
+ operator: 'gte',
463
+ value: 2023,
464
+ });
465
+
466
+ // Array membership
467
+ await store.search(query, 10, {
468
+ field: 'tags',
469
+ operator: 'in',
470
+ value: ['typescript', 'nodejs'],
471
+ });
472
+
473
+ // Not in array
474
+ await store.search(query, 10, {
475
+ field: 'status',
476
+ operator: 'nin',
477
+ value: ['draft', 'archived'],
478
+ });
479
+ ```
480
+
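One way to read the operators above is as a predicate applied to each chunk's metadata. The sketch below is an assumption-laden illustration: `matchesFilter` and `MetadataFilterSketch` are hypothetical names, comparisons are limited to numbers, and `in`/`nin` test whether a scalar field value appears in the filter's array. How the store treats array-valued fields such as `tags` is not specified in this document.

```typescript
// Hypothetical evaluation of a single metadata filter; not the store's code.
type FilterOperator = 'eq' | 'ne' | 'gt' | 'gte' | 'lt' | 'lte' | 'in' | 'nin';

interface MetadataFilterSketch {
  field: string;
  operator: FilterOperator;
  value: unknown;
}

function matchesFilter(
  metadata: Record<string, unknown>,
  filter: MetadataFilterSketch
): boolean {
  const actual = metadata[filter.field];
  const expected = filter.value;

  switch (filter.operator) {
    case 'eq':
      return actual === expected;
    case 'ne':
      return actual !== expected;
    case 'gt':
    case 'gte':
    case 'lt':
    case 'lte': {
      // Assumption: ordering operators only apply to numeric values
      if (typeof actual !== 'number' || typeof expected !== 'number') return false;
      if (filter.operator === 'gt') return actual > expected;
      if (filter.operator === 'gte') return actual >= expected;
      if (filter.operator === 'lt') return actual < expected;
      return actual <= expected;
    }
    case 'in':
      return Array.isArray(expected) && expected.includes(actual);
    case 'nin':
      return Array.isArray(expected) && !expected.includes(actual);
    default:
      return false;
  }
}

const exampleMetadata = { category: 'AI', publishedYear: 2024, status: 'published' };
console.log(matchesFilter(exampleMetadata, { field: 'publishedYear', operator: 'gte', value: 2023 }));
```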
481
+ #### Example: Multi-Metric Store
482
+
483
+ ```typescript
484
+ // Compare different distance metrics
485
+ const cosineStore = new InMemoryVectorStore({
486
+ collectionName: 'cosine',
487
+ embeddingDimensions: 384,
488
+ distanceMetric: 'cosine',
489
+ });
490
+
491
+ const euclideanStore = new InMemoryVectorStore({
492
+ collectionName: 'euclidean',
493
+ embeddingDimensions: 384,
494
+ distanceMetric: 'euclidean',
495
+ });
496
+
497
+ // Results may differ based on metric
498
+ const cosineResults = await cosineStore.search(queryVector, 5);
499
+ const euclideanResults = await euclideanStore.search(queryVector, 5);
500
+ ```
501
+
502
+ ---
503
+
504
+ ## Pipelines
505
+
506
+ ### IngestionPipeline
507
+
508
+ Streamlined pipeline for loading, chunking, embedding, and storing documents.
509
+
510
+ #### API
511
+
512
+ ```typescript
513
+ import {
514
+ IngestionPipeline,
515
+ TextLoader,
516
+ SimpleEmbeddingGenerator,
517
+ InMemoryVectorStore,
518
+ } from '@dcyfr/ai-rag';
519
+
520
+ const loader = new TextLoader();
521
+ const embedder = new SimpleEmbeddingGenerator({ dimensions: 384 });
522
+ const store = new InMemoryVectorStore({
523
+ collectionName: 'docs',
524
+ embeddingDimensions: 384,
525
+ });
526
+
527
+ const pipeline = new IngestionPipeline(loader, embedder, store);
528
+
529
+ // Ingest single file
530
+ const result = await pipeline.ingest(['./document.txt'], {
531
+ batchSize: 10,
532
+ loaderConfig: {
533
+ chunkSize: 1000,
534
+ chunkOverlap: 200,
535
+ },
536
+ onProgress: (current, total, details) => {
537
+ console.log(`Processing ${current}/${total}: ${details.currentFile}`);
538
+ },
539
+ });
540
+
541
+ console.log(result);
542
+ // {
543
+ // documentsProcessed: 1,
544
+ // chunksGenerated: 5,
545
+ // errors: [],
546
+ // durationMs: 1234
547
+ // }
548
+ ```
549
+
550
+ #### Options
551
+
552
+ ```typescript
553
+ interface IngestionOptions {
554
+ /** Batch size for embedding generation (default: 10) */
555
+ batchSize?: number;
556
+
557
+ /** Loader configuration (passed to loader.load()) */
558
+ loaderConfig?: LoaderConfig;
559
+
560
+ /** Progress callback for monitoring */
561
+ onProgress?: (
562
+ current: number,
563
+ total: number,
564
+ details?: {
565
+ currentFile: string;
566
+ documentsProcessed: number;
567
+ totalDocuments: number;
568
+ chunksGenerated: number;
569
+ }
570
+ ) => void;
571
+ }
572
+ ```
573
+
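The effect of `batchSize` can be pictured as slicing the list of chunks into groups before each embedding call. The helper below is a hypothetical sketch (`toBatches` is not a documented export) and makes no claim about how the pipeline schedules its batches.

```typescript
// Hypothetical helper: split a list of items into consecutive batches.
function toBatches<T>(items: T[], batchSize = 10): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 25 chunk ids with the default-sized batches of 10
const chunkIds = Array.from({ length: 25 }, (_, i) => `chunk-${i}`);
const idBatches = toBatches(chunkIds, 10);
console.log(idBatches.map(batch => batch.length));
```

With 25 chunks and a batch size of 10, the embedder would be called three times, the last time with the 5 remaining chunks.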
574
+ #### Example: Batch Ingestion
575
+
576
+ ```typescript
577
+ const pipeline = new IngestionPipeline(loader, embedder, store);
578
+
579
+ // Ingest multiple files in one call
580
+ const files = [
581
+ './docs/intro.md',
582
+ './docs/guide.md',
583
+ './docs/api.md',
584
+ './docs/examples.md',
585
+ ];
586
+
587
+ let totalChunks = 0;
588
+
589
+ const result = await pipeline.ingest(files, {
590
+ batchSize: 20, // Process 20 chunks at a time
591
+ loaderConfig: {
592
+ chunkSize: 800,
593
+ chunkOverlap: 150,
594
+ },
595
+ onProgress: (current, total, details) => {
596
+ if (details) {
597
+ totalChunks = details.chunksGenerated;
598
+ console.log(
599
+ `[${current}/${total}] ${details.currentFile} - ` +
600
+ `${details.chunksGenerated} chunks generated`
601
+ );
602
+ }
603
+ },
604
+ });
605
+
606
+ console.log(`✅ Ingested ${result.documentsProcessed} documents`);
607
+ console.log(`✅ Generated ${result.chunksGenerated} chunks`);
608
+ console.log(`⏱️ Duration: ${result.durationMs}ms`);
609
+ ```
610
+
611
+ ---
612
+
613
+ ### RetrievalPipeline
614
+
615
+ Streamlined pipeline for querying vector stores and assembling context.
616
+
617
+ #### API
618
+
619
+ ```typescript
620
+ import {
621
+ RetrievalPipeline,
622
+ InMemoryVectorStore,
623
+ SimpleEmbeddingGenerator,
624
+ } from '@dcyfr/ai-rag';
625
+
626
+ const store = new InMemoryVectorStore({
627
+ collectionName: 'docs',
628
+ embeddingDimensions: 384,
629
+ });
630
+
631
+ const embedder = new SimpleEmbeddingGenerator({ dimensions: 384 });
632
+
633
+ const pipeline = new RetrievalPipeline(store, embedder);
634
+
635
+ // Query for relevant context
636
+ const result = await pipeline.query('What is machine learning?', {
637
+ limit: 5,
638
+ threshold: 0.7,
639
+ includeMetadata: true,
640
+ });
641
+
642
+ console.log(result.query); // Original query
643
+ console.log(result.context); // Assembled context string
644
+ console.log(result.results); // Array of SearchResult
645
+ console.log(result.metadata); // Query statistics
646
+ ```
647
+
648
+ #### Options
649
+
650
+ ```typescript
651
+ interface QueryOptions {
652
+ /** Number of results to retrieve (default: 5) */
653
+ limit?: number;
654
+
655
+ /** Minimum similarity threshold 0-1 (default: 0.0) */
656
+ threshold?: number;
657
+
658
+ /** Metadata filter */
659
+ filter?: MetadataFilter;
660
+
661
+ /** Include metadata in context (default: false) */
662
+ includeMetadata?: boolean;
663
+
664
+ /** Re-rank results (default: false) */
665
+ rerank?: boolean;
666
+ }
667
+ ```
668
+
669
+ #### Query Result
670
+
671
+ ```typescript
672
+ interface QueryResult {
673
+ /** Original query */
674
+ query: string;
675
+
676
+ /** Search results with scores */
677
+ results: SearchResult[];
678
+
679
+ /** Assembled context (ready for LLM prompt) */
680
+ context: string;
681
+
682
+ /** Query metadata */
683
+ metadata: {
684
+ totalResults: number;
685
+ durationMs: number;
686
+ averageScore: number;
687
+ };
688
+ }
689
+ ```
690
+
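The relationship between `limit`, `threshold`, `context`, and `averageScore` can be sketched as a small post-processing step over scored chunks. Everything below is hypothetical: `assembleContext` and `ScoredChunk` are illustrative names, and the separator and source-label format are assumptions rather than the pipeline's actual output format.

```typescript
// Hypothetical post-processing of search results; not the pipeline's code.
interface ScoredChunk {
  content: string;
  score: number;
  source?: string;
}

function assembleContext(
  results: ScoredChunk[],
  limit = 5,
  threshold = 0,
  includeMetadata = false
): { context: string; totalResults: number; averageScore: number } {
  const kept = results
    .filter(r => r.score >= threshold) // drop results below the similarity threshold
    .sort((a, b) => b.score - a.score) // most similar first
    .slice(0, limit);

  const context = kept
    .map(r => (includeMetadata && r.source ? `[${r.source}]\n${r.content}` : r.content))
    .join('\n\n');

  const averageScore =
    kept.length === 0 ? 0 : kept.reduce((sum, r) => sum + r.score, 0) / kept.length;

  return { context, totalResults: kept.length, averageScore };
}

const assembled = assembleContext(
  [
    { content: 'B', score: 0.8, source: 'b.md' },
    { content: 'A', score: 0.9, source: 'a.md' },
    { content: 'C', score: 0.5, source: 'c.md' },
  ],
  5,
  0.7
);
console.log(assembled);
```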
691
+ #### Example: Advanced Retrieval
692
+
693
+ ```typescript
694
+ const pipeline = new RetrievalPipeline(store, embedder);
695
+
696
+ // Query with metadata filtering
697
+ const result = await pipeline.query(
698
+ 'Explain TypeScript generics',
699
+ {
700
+ limit: 10,
701
+ threshold: 0.75, // Only highly relevant results
702
+ filter: {
703
+ field: 'category',
704
+ operator: 'eq',
705
+ value: 'typescript',
706
+ },
707
+ includeMetadata: true, // Add source references
708
+ }
709
+ );
710
+
711
+ // Use context in LLM prompt
712
+ const prompt = `
713
+ Context:
714
+ ${result.context}
715
+
716
+ Question: ${result.query}
717
+
718
+ Answer:
719
+ `;
720
+
721
+ console.log(`Found ${result.results.length} relevant chunks`);
722
+ console.log(`Average relevance: ${result.metadata.averageScore.toFixed(2)}`);
723
+ ```
724
+
725
+ ---
726
+
727
+ ### EmbeddingPipeline
728
+
729
+ Dedicated pipeline for batch embedding generation with progress tracking.
730
+
731
+ #### API
732
+
733
+ ```typescript
734
+ import { EmbeddingPipeline, SimpleEmbeddingGenerator } from '@dcyfr/ai-rag';
735
+
736
+ const embedder = new SimpleEmbeddingGenerator({ dimensions: 384 });
737
+ const pipeline = new EmbeddingPipeline(embedder);
738
+
739
+ // Generate embeddings for documents
740
+ const documents = [
741
+ { id: '1', content: 'First document', /* ... */ },
742
+ { id: '2', content: 'Second document', /* ... */ },
743
+ ];
744
+
745
+ const embeddedDocs = await pipeline.embedDocuments(documents, {
746
+ batchSize: 100,
747
+ onProgress: (current, total) => {
748
+ console.log(`Embedding ${current}/${total}`);
749
+ },
750
+ });
751
+
752
+ console.log(embeddedDocs[0].embedding); // [0.1, 0.2, ...]
753
+ ```
754
+
755
+ #### Options
756
+
757
+ ```typescript
758
+ interface EmbeddingPipelineOptions {
759
+ /** Batch size for API calls (default: 100) */
760
+ batchSize?: number;
761
+
762
+ /** Progress callback */
763
+ onProgress?: (current: number, total: number) => void;
764
+ }
765
+ ```
766
+
767
+ #### Example: Rate-Limited Embedding
768
+
769
+ ```typescript
770
+ import { EmbeddingPipeline } from '@dcyfr/ai-rag';
771
+ import OpenAI from 'openai';
772
+
773
+ class RateLimitedEmbedder {
774
+ private client: OpenAI;
775
+ private lastCall = 0;
776
+ private minDelay = 1000; // 1 second between calls
777
+
778
+ constructor(apiKey: string) {
779
+ this.client = new OpenAI({ apiKey });
780
+ }
781
+
782
+ async embed(texts: string[]): Promise<number[][]> {
783
+ // Rate limiting
784
+ const now = Date.now();
785
+ const timeSinceLastCall = now - this.lastCall;
786
+ if (timeSinceLastCall < this.minDelay) {
787
+ await new Promise(resolve =>
788
+ setTimeout(resolve, this.minDelay - timeSinceLastCall)
789
+ );
790
+ }
791
+
792
+ const response = await this.client.embeddings.create({
793
+ model: 'text-embedding-3-small',
794
+ input: texts,
795
+ });
796
+
797
+ this.lastCall = Date.now();
798
+ return response.data.map(item => item.embedding);
799
+ }
800
+
801
+ getDimensions(): number {
802
+ return 1536;
803
+ }
804
+ }
805
+
806
+ const embedder = new RateLimitedEmbedder(process.env.OPENAI_API_KEY!); // assumes OPENAI_API_KEY is set
807
+ const pipeline = new EmbeddingPipeline(embedder);
808
+
809
+ // Safe batch processing with rate limiting
810
+ const result = await pipeline.embedDocuments(documents, {
811
+ batchSize: 50, // Small batches to respect rate limits
812
+ });
813
+ ```
814
+
815
+ ---
816
+
817
+ ## Core Types
818
+
819
+ ### Document Types
820
+
821
+ #### Document
822
+
823
+ Represents a loaded document before embedding.
824
+
825
+ ```typescript
826
+ interface Document {
827
+ /** Unique identifier */
828
+ id: string;
829
+
830
+ /** Document content */
831
+ content: string;
832
+
833
+ /** Document metadata */
834
+ metadata: DocumentMetadata;
835
+
836
+ /** Embedding vector (if generated) */
837
+ embedding?: number[];
838
+ }
839
+ ```
840
+
841
+ #### DocumentMetadata
842
+
843
+ ```typescript
844
+ interface DocumentMetadata {
845
+ /** Source file path or URL */
846
+ source: string;
847
+
848
+ /** Document type */
849
+ type: 'pdf' | 'markdown' | 'html' | 'text' | 'json' | 'other';
850
+
851
+ /** Creation timestamp */
852
+ createdAt: Date;
853
+
854
+ /** Last modified timestamp */
855
+ updatedAt?: Date;
856
+
857
+ /** Author information */
858
+ author?: string;
859
+
860
+ /** Document title */
861
+ title?: string;
862
+
863
+ /** Additional custom metadata */
864
+ [key: string]: unknown;
865
+ }
866
+ ```
867
+
868
+ #### DocumentChunk
869
+
870
+ Represents a chunk of a document for vector storage.
871
+
872
+ ```typescript
873
+ interface DocumentChunk {
874
+ /** Chunk identifier */
875
+ id: string;
876
+
877
+ /** Parent document ID */
878
+ documentId: string;
879
+
880
+ /** Chunk content */
881
+ content: string;
882
+
883
+ /** Chunk index in document */
884
+ index: number;
885
+
886
+ /** Chunk metadata */
887
+ metadata: ChunkMetadata;
888
+
889
+ /** Embedding vector */
890
+ embedding?: number[];
891
+ }
892
+ ```
893
+
894
+ #### ChunkMetadata
895
+
896
+ ```typescript
897
+ interface ChunkMetadata {
898
+ /** Chunk index */
899
+ chunkIndex: number;
900
+
901
+ /** Total chunks in document */
902
+ chunkCount: number;
903
+
904
+ /** Character start position in original document */
905
+ startChar?: number;
906
+
907
+ /** Character end position in original document */
908
+ endChar?: number;
909
+
910
+ /** Parent document ID */
911
+ parentDocumentId?: string;
912
+
913
+ /** Section title (for markdown/HTML) */
914
+ section?: string;
915
+
916
+ /** Token count (if available) */
917
+ tokenCount?: number;
918
+
919
+ /** Additional metadata from parent document */
920
+ [key: string]: unknown;
921
+ }
922
+ ```
923
+
924
+ ### Vector Store Types
925
+
926
+ #### SearchResult
927
+
928
+ ```typescript
929
+ interface SearchResult {
930
+ /** Matching document chunk */
931
+ document: DocumentChunk;
932
+
933
+ /** Similarity score (0-1, higher = more similar) */
934
+ score: number;
935
+
936
+ /** Distance from query (lower = more similar) */
937
+ distance?: number;
938
+ }
939
+ ```
940
+
941
+ #### MetadataFilter
942
+
943
+ ```typescript
944
+ interface MetadataFilter {
945
+ /** Field to filter on */
946
+ field: string;
947
+
948
+ /** Comparison operator */
949
+ operator: 'eq' | 'ne' | 'gt' | 'gte' | 'lt' | 'lte' | 'in' | 'nin';
950
+
951
+ /** Value to compare */
952
+ value: unknown;
953
+ }
954
+ ```
955
+
956
+ ### Pipeline Types
957
+
958
+ #### IngestionResult
959
+
960
+ ```typescript
961
+ interface IngestionResult {
962
+ /** Number of documents processed */
963
+ documentsProcessed: number;
964
+
965
+ /** Number of chunks generated */
966
+ chunksGenerated: number;
967
+
968
+ /** Errors during ingestion */
969
+ errors: Array<{ file: string; error: string }>;
970
+
971
+ /** Total duration in milliseconds */
972
+ durationMs: number;
973
+ }
974
+ ```
975
+
976
+ #### QueryResult
977
+
978
+ ```typescript
979
+ interface QueryResult {
980
+ /** Original query */
981
+ query: string;
982
+
983
+ /** Search results */
984
+ results: SearchResult[];
985
+
986
+ /** Assembled context */
987
+ context: string;
988
+
989
+ /** Metadata about the query */
990
+ metadata: {
991
+ /** Total results returned */
992
+ totalResults: number;
993
+
994
+ /** Query execution time in ms */
995
+ durationMs: number;
996
+
997
+ /** Average relevance score */
998
+ averageScore: number;
999
+ };
1000
+ }
1001
+ ```
1002
+
1003
+ ---
1004
+
1005
+ ## Configuration
1006
+
1007
+ ### LoaderConfig
1008
+
1009
+ Configuration for all document loaders.
1010
+
1011
+ ```typescript
1012
+ interface LoaderConfig {
1013
+ /** Chunk size in characters (default: 1000) */
1014
+ chunkSize?: number;
1015
+
1016
+ /** Chunk overlap in characters (default: 200) */
1017
+ chunkOverlap?: number;
1018
+
1019
+ /** Whether to preserve formatting (default: false) */
1020
+ preserveFormatting?: boolean;
1021
+
1022
+ /** Custom metadata to add (default: {}) */
1023
+ metadata?: Record<string, unknown>;
1024
+ }
1025
+ ```
1026
+
1027
+ ### EmbeddingConfig
1028
+
1029
+ Configuration for embedding generators.
1030
+
1031
+ ```typescript
1032
+ interface EmbeddingConfig {
1033
+ /** Model name or identifier */
1034
+ model?: string;
1035
+
1036
+ /** Embedding dimensions */
1037
+ dimensions?: number;
1038
+
1039
+ /** Batch size for processing (default: 100) */
1040
+ batchSize?: number;
1041
+ }
1042
+ ```
1043
+
1044
+ ### VectorStoreConfig
1045
+
1046
+ Configuration for vector stores.
1047
+
1048
+ ```typescript
1049
+ interface VectorStoreConfig {
1050
+ /** Collection/index name */
1051
+ collectionName: string;
1052
+
1053
+ /** Embedding dimensions */
1054
+ embeddingDimensions: number;
1055
+
1056
+ /** Distance metric (default: 'cosine') */
1057
+ distanceMetric?: 'cosine' | 'euclidean' | 'dot';
1058
+
1059
+ /** Storage path (for file-based stores) */
1060
+ storagePath?: string;
1061
+ }
1062
+ ```
1063
+
1064
+ ### RAGConfig
1065
+
1066
+ Complete RAG system configuration.
1067
+
1068
+ ```typescript
1069
+ interface RAGConfig {
1070
+ /** Vector store configuration */
1071
+ vectorStore: VectorStoreConfig;
1072
+
1073
+ /** Embedding configuration */
1074
+ embedding?: EmbeddingConfig;
1075
+
1076
+ /** Loader configuration */
1077
+ loader?: LoaderConfig;
1078
+
1079
+ /** Retrieval configuration */
1080
+ retrieval?: RetrievalConfig;
1081
+ }
1082
+ ```
1083
+
1084
+ ### RetrievalConfig
1085
+
1086
+ Configuration for retrieval pipelines.
1087
+
1088
+ ```typescript
1089
+ interface RetrievalConfig {
1090
+ /** Number of results to retrieve (default: 5) */
1091
+ topK?: number;
1092
+
1093
+ /** Minimum similarity score threshold (default: 0.0) */
1094
+ scoreThreshold?: number;
1095
+
1096
+ /** Whether to rerank results (default: false) */
1097
+ rerank?: boolean;
1098
+
1099
+ /** Maximum tokens in context (default: 4096) */
1100
+ maxTokens?: number;
1101
+ }
1102
+ ```
1103
+
1104
+ ---
1105
+
1106
+ ## Error Handling
1107
+
1108
+ ### Common Errors
1109
+
1110
+ ```typescript
1111
+ // File not found
1112
+ try {
1113
+ await loader.load('./nonexistent.txt');
1114
+ } catch (error) {
1115
+ console.error((error as Error).message);
1116
+ // "Failed to load text file ./nonexistent.txt: ENOENT"
1117
+ }
1118
+
1119
+ // Dimension mismatch
1120
+ try {
1121
+ await store.addDocuments([{
1122
+ ...chunk,
1123
+ embedding: [1, 2, 3], // Wrong dimensions
1124
+ }]);
1125
+ } catch (error) {
1126
+ console.error((error as Error).message);
1127
+ // "Embedding dimensions mismatch: expected 384, got 3"
1128
+ }
1129
+
1130
+ // Missing embedding
1131
+ try {
1132
+ await store.addDocuments([{
1133
+ ...chunk,
1134
+ embedding: undefined, // No embedding
1135
+ }]);
1136
+ } catch (error) {
1137
+ console.error((error as Error).message);
1138
+ // "Document missing embedding: doc-1"
1139
+ }
1140
+
1141
+ // String query to vector store
1142
+ try {
1143
+ await store.search('text query', 10);
1144
+ } catch (error) {
1145
+ console.error((error as Error).message);
1146
+ // "Query must be an embedding vector"
1147
+ }
1148
+
1149
+ // Document not found
1150
+ try {
1151
+ await store.updateDocument('nonexistent', { content: 'new' });
1152
+ } catch (error) {
1153
+ console.error((error as Error).message);
1154
+ // "Document nonexistent not found"
1155
+ }
1156
+ ```
1157
+
1158
+ ### Error Recovery Patterns
1159
+
1160
+ ```typescript
1161
+ // Graceful ingestion with error tracking
1162
+ const result = await pipeline.ingest(files, {
1163
+ batchSize: 10,
1164
+ onProgress: (current, total, details) => {
1165
+ if (details) {
1166
+ console.log(`Processed ${current}/${total}: ${details.currentFile}`); // `result` is not assigned until ingest() resolves
1167
+ }
1168
+ },
1169
+ });
1170
+
1171
+ if (result.errors.length > 0) {
1172
+ console.error('Ingestion completed with errors:');
1173
+ result.errors.forEach(({ file, error }) => {
1174
+ console.error(` ${file}: ${error}`);
1175
+ });
1176
+ }
1177
+
1178
+ // Retry logic for embeddings
1179
+ async function embedWithRetry(
1180
+ embedder: EmbeddingGenerator,
1181
+ texts: string[],
1182
+ maxRetries = 3
1183
+ ): Promise<number[][]> {
1184
+ for (let attempt = 1; attempt <= maxRetries; attempt++) {
1185
+ try {
1186
+ return await embedder.embed(texts);
1187
+ } catch (error) {
1188
+ if (attempt === maxRetries) throw error;
1189
+
1190
+ const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
1191
+ console.warn(`Retry ${attempt}/${maxRetries} after ${delay}ms`);
1192
+ await new Promise(resolve => setTimeout(resolve, delay));
1193
+ }
1194
+ }
1195
+
1196
+ throw new Error('Max retries exceeded');
1197
+ }
1198
+ ```
1199
+
1200
+ ---
1201
+
1202
+ ## Semantic Versioning Commitment
1203
+
1204
+ ### Version 1.0.0 Guarantees
1205
+
1206
+ `@dcyfr/ai-rag` follows [Semantic Versioning 2.0.0](https://semver.org/) strictly:
1207
+
1208
+ - **Patch releases (1.0.x)** - Bug fixes, no API changes
1209
+ - **Minor releases (1.x.0)** - New features, backward compatible
1210
+ - **Major releases (x.0.0)** - Breaking changes (with migration guide)
1211
+
1212
+ ### Deprecation Policy
1213
+
1214
+ - **6-month notice** - Deprecated features remain functional for 6 months
1215
+ - **Clear warnings** - TypeScript `@deprecated` JSDoc tags with alternatives
1216
+ - **Migration guides** - Step-by-step guides for breaking changes
1217
+
1218
+ ### API Stability
1219
+
1220
+ The following APIs are stable and will not change in minor/patch releases:
1221
+
1222
+ - ✅ All exported classes (TextLoader, MarkdownLoader, HTMLLoader, etc.)
1223
+ - ✅ All interfaces and types
1224
+ - ✅ Public method signatures
1225
+ - ✅ Configuration option structures
1226
+ - ✅ Default behavior
1227
+
1228
+ ### Example: Deprecation
1229
+
1230
+ ```typescript
1231
+ /**
1232
+ * @deprecated Use `embedDocuments()` instead. Will be removed in v2.0.0.
1233
+ */
1234
+ async function legacyEmbed(texts: string[]): Promise<number[][]> {
1235
+ console.warn(
1236
+ 'Warning: legacyEmbed() is deprecated. ' +
1237
+ 'Use embedDocuments() instead. ' +
1238
+ 'This method will be removed in v2.0.0.'
1239
+ );
1240
+
1241
+ // Forward to new implementation
1242
+ return this.embedDocuments(texts);
1243
+ }
1244
+ ```
1245
+
1246
+ ### Breaking Change Process
1247
+
1248
+ 1. **Proposal** - GitHub Discussion with rationale
1249
+ 2. **Deprecation** - Mark old API `@deprecated` in current minor release
1250
+ 3. **Documentation** - Migration guide published
1251
+ 4. **Timeline** - Minimum 6 months before removal
1252
+ 5. **Major Release** - Breaking change in next major version
1253
+
1254
+ ---
1255
+
1256
+ ## Support
1257
+
1258
+ - **Documentation:** https://github.com/dcyfr/dcyfr-ai-rag/tree/main/docs
1259
+ - **Issues:** https://github.com/dcyfr/dcyfr-ai-rag/issues
1260
+ - **Discussions:** https://github.com/dcyfr/dcyfr-ai-rag/discussions
1261
+ - **Email:** hello@dcyfr.ai
1262
+
1263
+ ---
1264
+
1265
+ **Version:** 1.0.0
1266
+ **License:** MIT
1267
+ **Last Updated:** February 7, 2026