@operor/knowledge 0.1.0

# @operor/knowledge

Knowledge base system with vector search, retrieval pipeline, and document ingestors for Operor.

## Overview

`@operor/knowledge` provides a complete knowledge base solution for AI agents, enabling them to retrieve relevant context from documents, FAQs, and web content. The system uses vector embeddings for semantic search and includes specialized pipelines for ingestion and retrieval.

Inspired by [Astra Agent KB design](https://github.com/datastax/astra-assistants-api) and [OpenClaw's extensible plugin architecture](https://github.com/openclaw/openclaw), this package provides a modular, production-ready knowledge base with pluggable ingestors.

## Quick Start

### Installation

```bash
pnpm add @operor/knowledge
```

### Basic Usage

```typescript
import {
  SQLiteKnowledgeStore,
  EmbeddingService,
  TextChunker,
  IngestionPipeline,
  RetrievalPipeline,
} from '@operor/knowledge';

// 1. Initialize components
const store = new SQLiteKnowledgeStore('./knowledge.db', 1536);
await store.initialize();

const embedder = new EmbeddingService({
  provider: 'openai',
  apiKey: process.env.OPENAI_API_KEY,
  model: 'text-embedding-3-small',
});

const chunker = new TextChunker({ chunkSize: 500, chunkOverlap: 50 });
const ingestion = new IngestionPipeline(store, embedder, chunker);
const retrieval = new RetrievalPipeline(store, embedder, 0.85);

// 2. Ingest documents
await ingestion.ingest({
  sourceType: 'url',
  content: 'Your document content here...',
  title: 'Getting Started Guide',
});

// 3. Query the knowledge base
const result = await retrieval.retrieve('How do I get started?', { limit: 5 });
console.log(result.context); // Formatted context for LLM injection
console.log(result.isFaqMatch); // true if FAQ fast-path matched
```

## Architecture

### Core Components

- **KnowledgeStore**: SQLite-backed storage with `sqlite-vec` for vector search
- **EmbeddingService**: Multi-provider embeddings via Vercel AI SDK (OpenAI, Google, Mistral, Cohere, Ollama)
- **TextChunker**: Document chunking using LangChain text splitters
- **IngestionPipeline**: Orchestrates document ingestion (chunk → embed → store)
- **RetrievalPipeline**: Query-time retrieval with FAQ fast-path optimization
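The chunk → embed → store flow of the `IngestionPipeline` can be sketched as a small function. This is an illustration with injected stand-ins, not the actual pipeline source:

```typescript
// Minimal sketch of the chunk → embed → store flow. The real
// IngestionPipeline also creates a KBDocument record; this only
// shows the chunk-level orchestration, with dependencies injected.
type StoredChunk = { content: string; embedding: number[] };

async function ingestSketch(
  content: string,
  chunk: (text: string) => Promise<string[]>,          // stand-in for TextChunker.chunk
  embedMany: (texts: string[]) => Promise<number[][]>, // stand-in for EmbeddingService.embedMany
  addChunks: (chunks: StoredChunk[]) => Promise<void>, // stand-in for KnowledgeStore.addChunks
): Promise<number> {
  const pieces = await chunk(content);     // 1. split the document into chunks
  const vectors = await embedMany(pieces); // 2. embed each chunk in one batch
  await addChunks(
    pieces.map((c, i) => ({ content: c, embedding: vectors[i] })),
  );                                       // 3. persist chunks with their vectors
  return pieces.length;
}
```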
### Ingestors (Extensible)

Following OpenClaw's plugin pattern, ingestors are modular and extensible:

- **UrlIngestor**: Web crawling with Readability extraction
- **FileIngestor**: PDF, DOCX, XLSX, CSV, TXT, MD, HTML parsing
- **WatiFaqSync**: FAQ extraction from WATI conversations via LLM

## Technology Stack

Key dependencies, with the versions this package pins:

- **Vector Search**: `sqlite-vec` (0.1.7-alpha.2) — Fast vector search in SQLite
- **Embeddings**: `ai` (6.x) + `@ai-sdk/*` (3.x) — Vercel AI SDK with multi-provider support
- **Text Splitting**: `@langchain/textsplitters` (1.x) — Recursive and markdown splitters
- **Document Parsing**:
  - PDF: `unpdf` (1.4.0)
  - DOCX: `mammoth` (1.11.0)
  - XLSX: `xlsx` (0.18.5)
  - HTML: `@mozilla/readability` (0.6.0) + `linkedom` (0.18.12)
- **Storage**: `better-sqlite3` (12.x) — Fast, synchronous SQLite

## Embedding Providers

The `EmbeddingService` supports multiple providers via the Vercel AI SDK:

| Provider | Model (default) | Dimensions | API Key Required |
|----------|-----------------|------------|------------------|
| OpenAI | `text-embedding-3-small` | 1536 | Yes |
| Google | `text-embedding-004` | 768 | Yes |
| Mistral | `mistral-embed` | 1024 | Yes |
| Cohere | `embed-english-v3.0` | 1024 | Yes |
| Ollama | `nomic-embed-text` | 768 | No (local) |

**Important**: When switching embedding providers, you must re-ingest all documents. The vector dimensions must match what the store was initialized with, or search will fail.

### Provider Examples

```typescript
// OpenAI (default)
const openaiEmbedder = new EmbeddingService({
  provider: 'openai',
  apiKey: process.env.OPENAI_API_KEY,
  model: 'text-embedding-3-small', // optional
  dimensions: 1536, // optional, defaults per provider
});

// Google Gemini
const googleEmbedder = new EmbeddingService({
  provider: 'google',
  apiKey: process.env.GOOGLE_API_KEY,
  model: 'text-embedding-004',
});

// Mistral
const mistralEmbedder = new EmbeddingService({
  provider: 'mistral',
  apiKey: process.env.MISTRAL_API_KEY,
});

// Cohere
const cohereEmbedder = new EmbeddingService({
  provider: 'cohere',
  apiKey: process.env.COHERE_API_KEY,
});

// Ollama (local, no API key)
const ollamaEmbedder = new EmbeddingService({
  provider: 'ollama',
  model: 'nomic-embed-text',
  baseURL: 'http://localhost:11434/v1', // optional
});
```
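The per-provider defaults in the table can be expressed as a lookup. This is a sketch of the kind of mapping `EmbeddingService.defaultDimensions` might encode; the real implementation may also key on `model`:

```typescript
// Default embedding dimensions per provider, mirroring the table above.
// Sketch only; the actual EmbeddingService.defaultDimensions may differ.
const DEFAULT_DIMENSIONS: Record<string, number> = {
  openai: 1536,  // text-embedding-3-small
  google: 768,   // text-embedding-004
  mistral: 1024, // mistral-embed
  cohere: 1024,  // embed-english-v3.0
  ollama: 768,   // nomic-embed-text
};

function defaultDimensions(provider: string): number {
  const dims = DEFAULT_DIMENSIONS[provider];
  if (dims === undefined) throw new Error(`Unknown provider: ${provider}`);
  return dims;
}
```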
## Usage Guide

### Ingesting Documents

#### From URL

```typescript
import { UrlIngestor } from '@operor/knowledge';

const urlIngestor = new UrlIngestor(ingestion);

// Single URL
await urlIngestor.ingestUrl('https://docs.example.com/guide');

// Sitemap (batch ingest)
await urlIngestor.ingestSitemap('https://example.com/sitemap.xml', {
  maxPages: 50,
});

// Crawl website (BFS traversal)
await urlIngestor.crawl('https://docs.example.com', {
  maxPages: 20,
  maxDepth: 2,
});
```

#### From File

```typescript
import { FileIngestor } from '@operor/knowledge';

const fileIngestor = new FileIngestor(ingestion);

// Supports: PDF, DOCX, XLSX, CSV, TXT, MD, HTML
await fileIngestor.ingestFile('./docs/manual.pdf', 'User Manual');
await fileIngestor.ingestFile('./data/products.xlsx', 'Product Catalog');
```

#### FAQ Entries

```typescript
import { WatiFaqSync } from '@operor/knowledge';

// Manual FAQ entry
await ingestion.ingestFaq(
  'What is the return policy?',
  'You can return items within 30 days of purchase.'
);

// Batch FAQ sync from WATI conversations
const faqSync = new WatiFaqSync(ingestion, async (conversation) => {
  // Your LLM extraction logic here
  return [
    { question: 'How do I reset my password?', answer: 'Click Forgot Password...' },
  ];
});

await faqSync.syncFromConversations(conversations, {
  minAnswerLength: 20,
  maxPairs: 100,
});
```

### Retrieving Context

```typescript
// Basic retrieval
const result = await retrieval.retrieve('What is the return policy?', {
  limit: 5,
  scoreThreshold: 0.7,
});

console.log(result.isFaqMatch); // true if FAQ fast-path matched
console.log(result.context); // Formatted context for LLM injection
console.log(result.results); // Raw search results with scores

// Filter by source type
const filtered = await retrieval.retrieve('pricing info', {
  limit: 3,
  sourceTypes: ['url', 'file'], // exclude FAQs
});
```

### FAQ Fast-Path

The retrieval pipeline includes an optimization for FAQs:

1. First searches FAQ documents only
2. If the top score is ≥ 0.85 (configurable), returns immediately (fast-path)
3. Otherwise, searches the full knowledge base

This keeps responses fast for common questions while preserving comprehensive search for complex queries.

```typescript
// Adjust the FAQ threshold (default: 0.85)
const retrieval = new RetrievalPipeline(store, embedder, 0.90);
```
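The fast-path reduces to a small decision over the FAQ-only results. Here is a self-contained sketch of the assumed logic, not the actual `RetrievalPipeline` source:

```typescript
// Sketch of the FAQ fast-path decision described above (assumed logic,
// not the package source). faqHits is the FAQ-only search, best first;
// searchAll is the fallback full-knowledge-base search.
type Hit = { score: number };

function chooseResults<T extends Hit>(
  faqHits: T[],
  searchAll: () => T[],
  faqThreshold = 0.85,
): { results: T[]; isFaqMatch: boolean } {
  const best = faqHits[0];
  if (best !== undefined && best.score >= faqThreshold) {
    return { results: faqHits, isFaqMatch: true }; // fast-path: confident FAQ hit
  }
  return { results: searchAll(), isFaqMatch: false }; // fall back to full search
}
```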
## CLI Usage

The `@operor/cli` package provides commands for managing the knowledge base:

```bash
# Add documents
operor kb add-url https://docs.example.com/guide
operor kb add-file ./manual.pdf
operor kb add-faq "What are your hours?" "We are open 9-5 Mon-Fri"

# Search
operor kb search "return policy" -n 5

# List all documents
operor kb list

# Delete a document
operor kb delete <doc-id>

# Show statistics
operor kb stats
```

## API Reference

### EmbeddingService

```typescript
class EmbeddingService {
  constructor(config: EmbeddingServiceConfig);

  embed(text: string): Promise<number[]>;
  embedMany(texts: string[]): Promise<number[][]>;

  get dimensions(): number;
  get provider(): string;

  static defaultDimensions(provider: string, model?: string): number;
}

interface EmbeddingServiceConfig {
  provider: 'openai' | 'google' | 'mistral' | 'cohere' | 'ollama';
  apiKey?: string;
  model?: string;
  baseURL?: string;
  dimensions?: number;
}
```

### SQLiteKnowledgeStore

```typescript
class SQLiteKnowledgeStore implements KnowledgeStore {
  constructor(dbPath?: string, dimensions?: number);

  initialize(): Promise<void>;
  close(): Promise<void>;

  addDocument(doc: KBDocument): Promise<void>;
  getDocument(id: string): Promise<KBDocument | null>;
  listDocuments(): Promise<KBDocument[]>;
  deleteDocument(id: string): Promise<void>;

  addChunks(chunks: KBChunk[]): Promise<void>;
  search(query: string, embedding: number[], options?: KBSearchOptions): Promise<KBSearchResult[]>;
  searchByEmbedding(embedding: number[], options?: KBSearchOptions): Promise<KBSearchResult[]>;

  getDimensions(): number;
}
```

### TextChunker

```typescript
class TextChunker {
  constructor(options?: ChunkOptions);

  chunk(text: string, options?: ChunkOptions): Promise<string[]>;
  chunkMarkdown(markdown: string, options?: ChunkOptions): Promise<string[]>;
}

interface ChunkOptions {
  chunkSize?: number; // default: 500
  chunkOverlap?: number; // default: 50
}
```
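To see how `chunkSize` and `chunkOverlap` relate, here is a naive character-window illustration. The real `TextChunker` delegates to LangChain's recursive splitter, which splits on separators, so actual chunk boundaries will differ:

```typescript
// Naive character-window illustration of chunkSize/chunkOverlap.
// Not the real splitter: LangChain splits on separators (paragraphs,
// sentences, words); this only shows how overlap offsets windows.
function windows(text: string, chunkSize = 500, chunkOverlap = 50): string[] {
  const step = chunkSize - chunkOverlap; // each window starts `step` chars after the last
  const out: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    out.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break; // last window reached the end of the text
  }
  return out;
}
```

With the defaults, consecutive chunks share 50 characters, so a sentence cut at a boundary still appears whole in one of its two chunks.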
### IngestionPipeline

```typescript
class IngestionPipeline {
  constructor(store: KnowledgeStore, embedder: EmbeddingService, chunker: TextChunker);

  ingest(input: IngestInput): Promise<KBDocument>;
  ingestFaq(question: string, answer: string, metadata?: Record<string, any>): Promise<KBDocument>;
}

interface IngestInput {
  sourceType: 'url' | 'file' | 'faq' | 'annotation';
  content: string;
  title?: string;
  sourceUrl?: string;
  fileName?: string;
  metadata?: Record<string, any>;
}
```

### RetrievalPipeline

```typescript
class RetrievalPipeline {
  constructor(store: KnowledgeStore, embedder: EmbeddingService, faqThreshold?: number);

  retrieve(query: string, options?: KBSearchOptions): Promise<RetrievalResult>;
}

interface RetrievalResult {
  results: KBSearchResult[];
  context: string; // Formatted for LLM injection
  isFaqMatch: boolean; // true if FAQ fast-path matched
}
```

### Types

```typescript
interface KBDocument {
  id: string;
  sourceType: 'url' | 'file' | 'faq' | 'annotation';
  sourceUrl?: string;
  fileName?: string;
  title?: string;
  content: string;
  metadata?: Record<string, any>;
  createdAt: number;
  updatedAt: number;
}

interface KBChunk {
  id: string;
  documentId: string;
  content: string;
  chunkIndex: number;
  embedding?: number[];
  metadata?: Record<string, any>;
}

interface KBSearchResult {
  chunk: KBChunk;
  document: KBDocument;
  score: number; // 0-1, higher is better
  distance: number; // raw vector distance
}

interface KBSearchOptions {
  limit?: number;
  scoreThreshold?: number;
  sourceTypes?: ('url' | 'file' | 'faq' | 'annotation')[];
  metadata?: Record<string, any>;
}
```
## Extending with Custom Ingestors

Following OpenClaw's plugin architecture pattern, you can create custom ingestors:

```typescript
export class CustomIngestor {
  private pipeline: IngestionPipeline;

  constructor(pipeline: IngestionPipeline) {
    this.pipeline = pipeline;
  }

  async ingestCustomSource(source: string): Promise<KBDocument> {
    // 1. Extract content from your source
    const content = await this.extractContent(source);

    // 2. Use the pipeline to ingest
    return this.pipeline.ingest({
      sourceType: 'url', // or 'file', 'annotation'
      content,
      title: 'Custom Source',
      metadata: { source: 'custom' },
    });
  }

  private async extractContent(source: string): Promise<string> {
    // Your extraction logic
    return 'extracted content';
  }
}
```

## Troubleshooting

### Dimension Mismatch Errors

Errors like "Dimension mismatch for inserted vector" mean one of two things:

- You switched embedding providers without re-ingesting documents
- The store was initialized with different dimensions than the embedder produces

**Solution**: Delete the KB database and re-ingest all documents with the new provider.

```bash
rm knowledge.db knowledge.db-shm knowledge.db-wal
operor kb add-url https://docs.example.com # re-ingest
```
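To catch the problem before any vectors are inserted, you can compare the store's and the embedder's dimensions up front, using the documented `getDimensions()` and `dimensions` accessors. A minimal guard:

```typescript
// Fail fast on mismatched dimensions before ingesting anything.
// Pass in store.getDimensions() and embedder.dimensions, both of which
// are part of the documented API above.
function assertDimensionsMatch(storeDims: number, embedderDims: number): void {
  if (storeDims !== embedderDims) {
    throw new Error(
      `Dimension mismatch: store expects ${storeDims}, ` +
      `embedder produces ${embedderDims}. Re-ingest after switching providers.`,
    );
  }
}

// e.g. assertDimensionsMatch(store.getDimensions(), embedder.dimensions);
```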
447
+
448
+ ### Performance Tips
449
+
450
+ - Use batch ingestion for multiple documents (sitemap, crawl)
451
+ - Adjust chunk size based on your content (smaller for FAQs, larger for docs)
452
+ - Set appropriate `scoreThreshold` to filter low-quality results
453
+ - Use `sourceTypes` filter to narrow search scope
454
+
455
+ ## License
456
+
457
+ MIT