vectra 0.12.1 → 0.12.3

Files changed (81)
  1. package/LICENSE +1 -1
  2. package/README.draft.md +499 -0
  3. package/README.draft.outline.md +160 -0
  4. package/README.research.md +2159 -0
  5. package/lib/FileFetcher.d.ts +5 -0
  6. package/lib/FileFetcher.d.ts.map +1 -0
  7. package/lib/FileFetcher.js +79 -0
  8. package/lib/FileFetcher.js.map +1 -0
  9. package/lib/GPT3Tokenizer.d.ts +9 -0
  10. package/lib/GPT3Tokenizer.d.ts.map +1 -0
  11. package/lib/GPT3Tokenizer.js +17 -0
  12. package/lib/GPT3Tokenizer.js.map +1 -0
  13. package/lib/ItemSelector.d.ts +41 -0
  14. package/lib/ItemSelector.d.ts.map +1 -0
  15. package/lib/ItemSelector.js +168 -0
  16. package/lib/ItemSelector.js.map +1 -0
  17. package/lib/LocalDocument.d.ts +54 -0
  18. package/lib/LocalDocument.d.ts.map +1 -0
  19. package/lib/LocalDocument.js +156 -0
  20. package/lib/LocalDocument.js.map +1 -0
  21. package/lib/LocalDocumentIndex.d.ts +132 -0
  22. package/lib/LocalDocumentIndex.d.ts.map +1 -0
  23. package/lib/LocalDocumentIndex.js +456 -0
  24. package/lib/LocalDocumentIndex.js.map +1 -0
  25. package/lib/LocalDocumentResult.d.ts +45 -0
  26. package/lib/LocalDocumentResult.d.ts.map +1 -0
  27. package/lib/LocalDocumentResult.js +328 -0
  28. package/lib/LocalDocumentResult.js.map +1 -0
  29. package/lib/LocalIndex.d.ts +150 -0
  30. package/lib/LocalIndex.d.ts.map +1 -0
  31. package/lib/LocalIndex.js +515 -0
  32. package/lib/LocalIndex.js.map +1 -0
  33. package/lib/LocalIndex.spec.d.ts +2 -0
  34. package/lib/LocalIndex.spec.js +218 -7
  35. package/lib/LocalIndex.spec.js.map +1 -1
  36. package/lib/OpenAIEmbeddings.d.ts +126 -0
  37. package/lib/OpenAIEmbeddings.d.ts.map +1 -0
  38. package/lib/OpenAIEmbeddings.js +174 -0
  39. package/lib/OpenAIEmbeddings.js.map +1 -0
  40. package/lib/TextSplitter.d.ts +19 -0
  41. package/lib/TextSplitter.d.ts.map +1 -0
  42. package/lib/TextSplitter.js +457 -0
  43. package/lib/TextSplitter.js.map +1 -0
  44. package/lib/TextSplitter.spec.d.ts +2 -0
  45. package/lib/TextSplitter.spec.d.ts.map +1 -0
  46. package/lib/TextSplitter.spec.js +109 -0
  47. package/lib/TextSplitter.spec.js.map +1 -0
  48. package/lib/WebFetcher.d.ts +15 -0
  49. package/lib/WebFetcher.d.ts.map +1 -0
  50. package/lib/WebFetcher.js +234 -0
  51. package/lib/WebFetcher.js.map +1 -0
  52. package/lib/index.d.ts +12 -0
  53. package/lib/index.d.ts.map +1 -0
  54. package/lib/index.js +28 -0
  55. package/lib/index.js.map +1 -0
  56. package/lib/internals/Colorize.d.ts +14 -0
  57. package/lib/internals/Colorize.d.ts.map +1 -0
  58. package/lib/internals/Colorize.js +64 -0
  59. package/lib/internals/Colorize.js.map +1 -0
  60. package/lib/internals/index.d.ts +3 -0
  61. package/lib/internals/index.d.ts.map +1 -0
  62. package/lib/internals/index.js +19 -0
  63. package/lib/internals/index.js.map +1 -0
  64. package/lib/internals/types.d.ts +43 -0
  65. package/lib/internals/types.d.ts.map +1 -0
  66. package/lib/internals/types.js +3 -0
  67. package/lib/internals/types.js.map +1 -0
  68. package/lib/types.d.ts +146 -0
  69. package/lib/types.d.ts.map +1 -0
  70. package/lib/types.js +3 -0
  71. package/lib/types.js.map +1 -0
  72. package/lib/vectra-cli.d.ts +2 -0
  73. package/lib/vectra-cli.d.ts.map +1 -0
  74. package/lib/vectra-cli.js +323 -0
  75. package/lib/vectra-cli.js.map +1 -0
  76. package/package.json +5 -3
  77. package/src/GPT3Tokenizer.ts +1 -1
  78. package/src/LocalIndex.spec.ts +265 -8
  79. package/src/LocalIndex.ts +1 -0
  80. package/src/TextSplitter.spec.ts +87 -0
  81. package/src/TextSplitter.ts +459 -531
package/LICENSE CHANGED
@@ -1,6 +1,6 @@
1
1
  MIT License
2
2
 
3
- Copyright (c) 2023 Steven Ickman
3
+ Copyright (c) 2023-2026 Steven Ickman
4
4
 
5
5
  Permission is hereby granted, free of charge, to any person obtaining a copy
6
6
  of this software and associated documentation files (the "Software"), to deal
package/README.draft.md ADDED
@@ -0,0 +1,499 @@
1
+ # Vectra
2
+
3
+ - A local, file-backed vector database for Node.js with Pinecone-like features and zero external infrastructure. Each index is a folder on disk, loaded into memory for ultra-fast queries.
4
+
5
+ - Key features
6
+ - Local, file-backed vector database for Node.js
7
+ - In-memory search with cosine similarity (pre-normalized vectors for speed)
8
+ - Metadata filtering (Pinecone-compatible Mongo-style operators)
9
+ - Document indexing with chunking and optional hybrid BM25 keyword search
10
+ - Simple CLI and TypeScript API
11
+
12
+ - When to use Vectra (and when not)
13
+ - Great for small, mostly static corpora; few-shot examples; single-document Q&A
14
+ - Not suited for long-term, ever-growing chat memory (entire index loads into RAM)
15
+ - Mimic namespaces by using separate folders (one index per folder)
16
+
17
+ - Language-agnostic file format
18
+ - Indexes are plain JSON and text files on disk; while this package targets Node.js, any language can read/write the folder format.
19
+
20
+ ## Requirements
21
+
22
+ - Node.js >= 20.x
23
+ - NPM or Yarn
24
+ - An embeddings provider (OpenAI, Azure OpenAI, or any OpenAI-compatible OSS endpoint)
25
+
26
+ ## Installation
27
+
28
+ - Library
29
+ - npm install vectra
30
+ - CLI
31
+ - Use via npx: npx vectra --help
32
+ - Or install globally: npm install -g vectra
33
+
34
+ ## Quick Start (5 minutes)
35
+
36
+ ### Choose your path
37
+
38
+ - Option A: Vector Item Index (LocalIndex) — store your own vectors + metadata; run similarity + metadata filters
39
+ - Option B: Document Index (LocalDocumentIndex) — chunk raw documents, store on disk, query via embeddings; render relevant sections
40
+
41
+ ### A. LocalIndex (items + metadata)
42
+
43
+ - Steps
44
+ 1) Create an index folder and initialize
45
+ 2) Generate embeddings (any provider) and insert items with metadata
46
+ 3) Query by vector with optional metadata filter; get topK sorted by similarity
47
+
48
+ - Example (code)
49
+
50
+ ```ts
+ import path from 'node:path';
+ import { LocalIndex } from 'vectra';
+ import { OpenAI } from 'openai';
+
+ const indexPath = path.join(process.cwd(), 'my-localindex');
+ const index = new LocalIndex(indexPath);
+
+ async function ensureIndex() {
+   if (!(await index.isIndexCreated())) {
+     await index.createIndex({
+       version: 1,
+       metadata_config: { indexed: ['category'] } // index only fields you need to filter on
+     });
+   }
+ }
+
+ const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
+
+ async function getVector(text: string) {
+   const res = await openai.embeddings.create({
+     model: 'text-embedding-3-small',
+     input: text
+   });
+   return res.data[0].embedding;
+ }
+
+ async function addItem(id: string, text: string, category?: string) {
+   await index.insertItem({
+     id,
+     vector: await getVector(text),
+     metadata: { text, category }
+   });
+ }
+
+ async function main() {
+   await ensureIndex();
+
+   await addItem('1', 'apple', 'food');
+   await addItem('2', 'oranges', 'food');
+   await addItem('3', 'red', 'color');
+   await addItem('4', 'blue', 'color');
+
+   const queryVec = await getVector('banana');
+   const results = await index.queryItems(queryVec, 'banana', 3); // vector, query, topK, [optional filter]
+
+   for (const r of results) {
+     console.log(`[${r.score.toFixed(3)}] ${r.item.metadata.text}`);
+   }
+
+   // With metadata filter (e.g., only colors)
+   const colorResults = await index.queryItems(queryVec, '', 3, { category: { $eq: 'color' } });
+   console.log('Only colors:');
+   for (const r of colorResults) {
+     console.log(`[${r.score.toFixed(3)}] ${r.item.metadata.text}`);
+   }
+ }
+
+ main().catch(console.error);
+ ```
110
+
111
+ ### B. LocalDocumentIndex (documents + chunking + retrieval)
112
+
113
+ - Steps
114
+ 1) Configure embeddings via OpenAIEmbeddings (OpenAI, Azure OpenAI, or OSS)
115
+ 2) Create index and add documents (from strings, files, or web pages)
116
+ 3) Query documents and render top sections
117
+
118
+ - Example (code)
119
+
120
+ ```ts
+ import path from 'node:path';
+ import { LocalDocumentIndex, OpenAIEmbeddings } from 'vectra';
+
+ const folderPath = path.join(process.cwd(), 'my-docindex');
+
+ const embeddings = new OpenAIEmbeddings({
+   apiKey: process.env.OPENAI_API_KEY!,
+   model: 'text-embedding-3-small',
+   // optional: dimensions, requestConfig, retryPolicy, etc.
+ });
+
+ const docIndex = new LocalDocumentIndex({
+   folderPath,
+   embeddings,
+   chunkingConfig: { chunkSize: 512 } // tokens per chunk
+ });
+
+ async function setup() {
+   await docIndex.createIndex({ version: 1, deleteIfExists: true });
+   await docIndex.upsertDocument(
+     'doc://getting-started',
+     `
+ Vectra is a local vector DB for Node.js.
+ It supports metadata filtering and blazing-fast in-memory search.
+ Great for small, mostly static corpora.
+ `,
+     'md'
+   );
+ }
+
+ async function search() {
+   const results = await docIndex.queryDocuments('How do I use Vectra for small corpora?', {
+     maxDocuments: 5,
+     maxChunks: 50,
+     isBm25: false // set true for hybrid keyword+semantic retrieval
+   });
+
+   for (const doc of results) {
+     console.log(`\nURI: ${doc.uri} (score: ${doc.score.toFixed(3)})`);
+     const sections = await doc.renderSections(800, 1, true); // tokens, section count, overlap
+     for (const s of sections) {
+       console.log(`Tokens: ${s.tokenCount}, Section score: ${s.score.toFixed(3)}`);
+       console.log(s.text.trim());
+     }
+   }
+ }
+
+ setup().then(search).catch(console.error);
+ ```
170
+
171
+ ## CLI
172
+
173
+ - Installation
174
+ - Global: npm install -g vectra
175
+ - One-off: npx vectra --help
176
+
177
+ - keys.json formats
178
+ - OpenAI
179
+ {
180
+ "apiKey": "sk-...",
181
+ "model": "text-embedding-3-small"
182
+ // optional: "organization": "org_...", "endpoint": "https://api.openai.com",
183
+ // optional: "dimensions": 1536, "logRequests": false, "maxTokens": 8000,
184
+ // optional: "retryPolicy": [2000, 5000], "requestConfig": { "timeout": 60000 }
185
+ }
186
+ - Note: If you omit model when using the CLI, it defaults to "text-embedding-ada-002" (with maxTokens 8000).
187
+ - Azure OpenAI
188
+ {
189
+ "azureApiKey": "<YOUR_AZURE_OPENAI_KEY>",
190
+ "azureEndpoint": "https://<your-resource-name>.openai.azure.com",
191
+ "azureDeployment": "<your-deployment-name>",
192
+ "azureApiVersion": "2023-05-15",
193
+ // optional: "dimensions": 1536, "logRequests": false, "maxTokens": 8000,
194
+ // optional: "retryPolicy": [2000, 5000], "requestConfig": { "timeout": 60000 }
195
+ }
196
+ - OSS (OpenAI-compatible)
197
+ {
198
+ "ossEndpoint": "https://api.your-oss-endpoint.com",
199
+ "ossModel": "text-embedding-3-small",
200
+ // optional: "dimensions": 1536, "logRequests": false, "maxTokens": 8000,
201
+ // optional: "retryPolicy": [2000, 5000], "requestConfig": { "timeout": 60000 }
202
+ }
203
+
204
+ - Commands
205
+ - vectra create <index>
206
+ - Create a new local document index (folder). The CLI create command always creates a fresh catalog; to overwrite an existing index from the API, pass deleteIfExists: true to createIndex().
207
+ - Example: npx vectra create ./my-docindex
208
+ - vectra delete <index>
209
+ - Delete an existing document index folder.
210
+ - Example: npx vectra delete ./my-docindex
211
+ - vectra add <index> --keys keys.json [--uri <url-or-file> ...] [--list file] [--cookie str] [--chunk-size N]
212
+ - Add one or more web pages or local files to the index. Auto-detects http/https vs file path.
213
+ - Example (single URL): npx vectra add ./my-docindex --keys keys.json --uri https://example.com/docs/intro
214
+ - Example (local file): npx vectra add ./my-docindex --keys keys.json --uri ./docs/guide.md
215
+ - Example (list file): npx vectra add ./my-docindex --keys keys.json --list urls.txt
216
+ - Example (with cookie): npx vectra add ./my-docindex --keys keys.json --uri https://site.com/protected --cookie "sessionid=abc; other=xyz"
217
+ - Example (custom chunk size): npx vectra add ./my-docindex --keys keys.json --uri https://example.com --chunk-size 512
218
+ - vectra remove <index> --uri <uri> [--list file]
219
+ - Remove one or more documents (by stored URI) from the index.
220
+ - Example: npx vectra remove ./my-docindex --uri https://example.com/docs/intro
221
+ - Example (list): npx vectra remove ./my-docindex --list uris-to-remove.txt
222
+ - vectra stats <index>
223
+ - Print catalog stats (version, doc count, etc.).
224
+ - Example: npx vectra stats ./my-docindex
225
+ - vectra query <index> "<query>" --keys keys.json [--document-count N] [--chunk-count N] [--section-count N] [--tokens N] [--format sections|stats|chunks] [--overlap] [--bm25]
226
+ - Query the index and render results.
227
+ - Example (default sections view): npx vectra query ./my-docindex "how do I get started?" --keys keys.json
228
+ - Example (limit docs/sections/tokens): npx vectra query ./my-docindex "hybrid search" --keys keys.json --document-count 5 --section-count 2 --tokens 800
229
+ - Example (show chunks): npx vectra query ./my-docindex "metadata filtering" --keys keys.json --format chunks
230
+ - Example (enable hybrid keyword+semantic): npx vectra query ./my-docindex "install steps" --keys keys.json --bm25
231
+
232
+ ## Data Model & On-Disk Layout
233
+
234
+ - Index folder structure overview
235
+ - A Vectra index is a single folder on disk you choose.
236
+ - Core files
237
+ - index.json — the in-memory index snapshot (vectors + selected metadata + config).
238
+ - For item/document payloads
239
+ - <id>.json — non-indexed metadata for an item or document.
240
+ - <id>.txt — raw document text when using LocalDocumentIndex.
241
+
242
+ - LocalIndex
243
+ - index.json contents (high level)
244
+ - version — schema/versioning number.
245
+ - metadata_config — which metadata fields are stored in-memory for filtering (indexed).
246
+ - items — array of items:
247
+ - id — your ID (or auto-generated if omitted).
248
+ - vector — numeric array embedding.
249
+ - norm — precomputed vector norm for fast cosine similarity.
250
+ - metadata — only the fields listed in metadata_config.indexed.
251
+ - Non-indexed metadata
252
+ - Stored separately as <id>.json on disk.
253
+ - At query time, Vectra filters first by the in-memory indexed metadata. If a filter refers to a field not present in memory, Vectra may read the item’s metadata file to evaluate the filter.
254
+ - Namespaces
255
+ - Not directly supported; create a separate folder per “namespace”.
256
+
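As a rough illustration of the index.json layout listed above, the sketch below models those fields as TypeScript types and summarizes a parsed file. The names `IndexFileSketch`, `IndexItemSketch`, and `summarizeIndex` are hypothetical, and the shape is inferred only from the field list in this section; the real file may carry additional fields.

```ts
// Hypothetical types, inferred from the field list above (not Vectra's own definitions).
interface IndexItemSketch {
  id: string;
  vector: number[];
  norm: number;
  metadata: Record<string, unknown>;
}

interface IndexFileSketch {
  version: number;
  metadata_config: { indexed?: string[] };
  items: IndexItemSketch[];
}

// Parse the JSON text of an index file and report a few basic facts about it.
function summarizeIndex(jsonText: string): { version: number; itemCount: number; dimensions: number } {
  const parsed = JSON.parse(jsonText) as IndexFileSketch;
  const first = parsed.items[0];
  return {
    version: parsed.version,
    itemCount: parsed.items.length,
    dimensions: first ? first.vector.length : 0,
  };
}

// Inline sample so the sketch runs without touching the disk.
const sampleIndexJson = JSON.stringify({
  version: 1,
  metadata_config: { indexed: ['category'] },
  items: [{ id: '1', vector: [0.6, 0.8], norm: 1, metadata: { category: 'food' } }],
});

console.log(summarizeIndex(sampleIndexJson));
```

Because the format is plain JSON, the same summary could be produced from any language with a JSON parser.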
257
+ - LocalDocumentIndex
258
+ - What’s stored
259
+ - index.json — embedding vectors and metadata for document chunks and catalog info.
260
+ - <id>.txt — full document text (enables section rendering and context extraction).
261
+ - <id>.json — additional document-level metadata you provide (optional).
262
+ - Document identity
263
+ - Documents are addressed by a URI you supply (e.g., https://example.com/page or doc://my-doc).
264
+ - Internally, Vectra uses an ID to store files (<id>.txt/.json) and tracks the URI↔ID mapping in the index.
265
+ - Chunk metadata
266
+ - startPos and endPos — byte or character offsets into the <id>.txt content for the chunk.
267
+ - Optional flags (e.g., isBm25) used for hybrid retrieval.
268
+ - Chunking
269
+ - Documents are split into token-based chunks using a configurable chunk size; overlap is optional and can be applied when rendering sections.
270
+
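The chunking idea above can be sketched in a few lines. This is a simplified illustration, not the package's TextSplitter: whitespace-separated words stand in for real tokens, `chunkByTokens` and `ChunkSketch` are hypothetical names, and the offsets are assumed to be character positions with an inclusive end.

```ts
interface ChunkSketch {
  text: string;
  startPos: number; // character offset of the chunk's first character
  endPos: number;   // character offset of the chunk's last character (inclusive)
}

// Group consecutive "tokens" into chunks of at most `chunkSize` tokens each.
function chunkByTokens(text: string, chunkSize: number): ChunkSketch[] {
  if (chunkSize < 1) throw new Error('chunkSize must be at least 1');
  const chunks: ChunkSketch[] = [];
  const tokenPattern = /\S+/g; // stand-in tokenizer: runs of non-whitespace
  let tokensInChunk = 0;
  let start = 0;
  let end = 0;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(text)) !== null) {
    if (tokensInChunk === 0) start = match.index;
    end = match.index + match[0].length - 1;
    tokensInChunk++;
    if (tokensInChunk === chunkSize) {
      chunks.push({ text: text.slice(start, end + 1), startPos: start, endPos: end });
      tokensInChunk = 0;
    }
  }
  // Flush the final, possibly shorter, chunk.
  if (tokensInChunk > 0) {
    chunks.push({ text: text.slice(start, end + 1), startPos: start, endPos: end });
  }
  return chunks;
}

console.log(chunkByTokens('a bb ccc dddd eeeee', 2));
```

Storing offsets rather than copies of the text is what lets a chunk be mapped back to a span of the stored document body later.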
271
+ ## Search & Filtering
272
+
273
+ - Similarity
274
+ - Cosine similarity with pre-normalized vectors for speed.
275
+ - For LocalIndex: all items are filtered by metadata first, then scored and returned sorted by similarity.
276
+ - For LocalDocumentIndex: chunk-level scoring aggregates into document-level results.
277
+
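To make the "filter first, then score" flow concrete, here is a minimal sketch of a linear scan that uses precomputed norms. It is an illustration under stated assumptions rather than the library's code: `StoredItem`, `vectorNorm`, `cosineWithNorms`, and `queryTopK` are hypothetical names, and the metadata filter is reduced to a plain predicate.

```ts
// A stored item keeps its norm so it is not recomputed on every query.
interface StoredItem {
  id: string;
  vector: number[];
  norm: number;
  metadata: Record<string, unknown>;
}

function vectorNorm(v: number[]): number {
  let sumOfSquares = 0;
  for (const x of v) sumOfSquares += x * x;
  return Math.sqrt(sumOfSquares);
}

// Cosine similarity given norms computed ahead of time (norms must be non-zero).
function cosineWithNorms(a: number[], normA: number, b: number[], normB: number): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot / (normA * normB);
}

// Filter by metadata, score every remaining item, sort descending, keep the top k.
function queryTopK(
  items: StoredItem[],
  query: number[],
  k: number,
  keep: (item: StoredItem) => boolean = () => true
): Array<{ id: string; score: number }> {
  const queryNorm = vectorNorm(query);
  return items
    .filter(keep)
    .map((item) => ({ id: item.id, score: cosineWithNorms(query, queryNorm, item.vector, item.norm) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const sampleItems: StoredItem[] = [
  { id: 'a', vector: [1, 0], norm: vectorNorm([1, 0]), metadata: { category: 'food' } },
  { id: 'b', vector: [0, 1], norm: vectorNorm([0, 1]), metadata: { category: 'color' } },
  { id: 'c', vector: [1, 1], norm: vectorNorm([1, 1]), metadata: { category: 'color' } },
];

console.log(queryTopK(sampleItems, [1, 0], 2));
console.log(queryTopK(sampleItems, [1, 0], 2, (item) => item.metadata.category === 'color'));
```

Filtering before scoring means the dot products are only computed for items that can appear in the result.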
278
+ - Metadata filters (Pinecone-compatible subset)
279
+ - Logical operators: $and, $or
280
+ - Comparison operators: $eq, $ne, $gt, $gte, $lt, $lte
281
+ - Set operators: $in, $nin
282
+ - Filters apply to fields defined in metadata_config.indexed; non-indexed fields are stored per-item/per-doc in <id>.json and may be read during filtering when needed.
283
+
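The operators listed above can be illustrated with a small evaluator. This is a sketch of how such Mongo-style filters are commonly interpreted, not the package's ItemSelector implementation; `matchesFilter`, `matchesCondition`, and the type names are hypothetical, and treating a bare value as shorthand for `$eq` is an assumption.

```ts
type Scalar = string | number | boolean;
type Metadata = Record<string, Scalar>;
type Condition =
  | Scalar
  | {
      $eq?: Scalar; $ne?: Scalar;
      $gt?: number; $gte?: number; $lt?: number; $lte?: number;
      $in?: Scalar[]; $nin?: Scalar[];
    };
interface Filter {
  $and?: Filter[];
  $or?: Filter[];
  [field: string]: Condition | Filter[] | undefined;
}

// Check one field value against one condition (all operators must hold).
function matchesCondition(value: Scalar | undefined, condition: Condition): boolean {
  if (typeof condition !== 'object') return value === condition; // assumed shorthand for $eq
  const ops = condition as Record<string, unknown>;
  return Object.keys(ops).every((op) => {
    const operand = ops[op];
    switch (op) {
      case '$eq': return value === operand;
      case '$ne': return value !== operand;
      case '$gt': return typeof value === 'number' && value > (operand as number);
      case '$gte': return typeof value === 'number' && value >= (operand as number);
      case '$lt': return typeof value === 'number' && value < (operand as number);
      case '$lte': return typeof value === 'number' && value <= (operand as number);
      case '$in': return value !== undefined && (operand as Scalar[]).indexOf(value) !== -1;
      case '$nin': return value === undefined || (operand as Scalar[]).indexOf(value) === -1;
      default: throw new Error(`Unsupported operator: ${op}`);
    }
  });
}

// Every top-level key must match; $and / $or recurse into nested filters.
function matchesFilter(metadata: Metadata, filter: Filter): boolean {
  return Object.keys(filter).every((key) => {
    const clause = filter[key];
    if (key === '$and') return (clause as Filter[]).every((f) => matchesFilter(metadata, f));
    if (key === '$or') return (clause as Filter[]).some((f) => matchesFilter(metadata, f));
    return matchesCondition(metadata[key], clause as Condition);
  });
}

const sampleMetadata: Metadata = { category: 'color', rank: 3 };
console.log(matchesFilter(sampleMetadata, { category: { $eq: 'color' } }));
console.log(matchesFilter(sampleMetadata, { $and: [{ category: 'color' }, { rank: { $gte: 5 } }] }));
```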
284
+ - Hybrid search (BM25) for documents
285
+ - Optional keyword scoring combined with semantic matches to improve recall.
286
+ - Enable via CLI flag --bm25 or corresponding API options when querying LocalDocumentIndex.
287
+
288
+ - Result rendering
289
+ - LocalDocumentResult
290
+ - renderSections(maxTokens, maxSections, overlap?): returns top sections with aggregated scores; can optionally include overlapping chunks.
291
+ - renderAllSections(maxTokens): renders all matched spans split into sections up to maxTokens each.
292
+ - Sections include token counts and per-section scores, enabling easy prompt assembly.
293
+
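As an example of the prompt assembly mentioned above, the sketch below packs rendered sections into a context string under a token budget. It assumes each section exposes `text`, `tokenCount`, and `score`, as described in this section; `packSections` and `SectionSketch` are hypothetical names, and the greedy highest-score-first strategy is one possible choice among many.

```ts
interface SectionSketch {
  text: string;
  tokenCount: number;
  score: number;
}

// Greedily take the highest-scoring sections that still fit in the budget.
function packSections(sections: SectionSketch[], tokenBudget: number): string {
  const picked: string[] = [];
  let used = 0;
  const byScore = sections.slice().sort((a, b) => b.score - a.score);
  for (const section of byScore) {
    if (used + section.tokenCount > tokenBudget) continue; // skip sections that do not fit
    picked.push(section.text.trim());
    used += section.tokenCount;
  }
  return picked.join('\n\n');
}

const sampleSections: SectionSketch[] = [
  { text: 'Section A', tokenCount: 300, score: 0.9 },
  { text: 'Section B', tokenCount: 600, score: 0.8 },
  { text: 'Section C', tokenCount: 200, score: 0.7 },
];

console.log(packSections(sampleSections, 600));
```

Relying on the reported token counts avoids re-tokenizing the text while building the prompt.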
294
+ ## API Overview
295
+
296
+ - Core exports
297
+ - LocalIndex
298
+ - LocalDocumentIndex
299
+ - LocalDocument
300
+ - LocalDocumentResult
301
+ - OpenAIEmbeddings
302
+ - TextSplitter
303
+ - ItemSelector
304
+ - FileFetcher, WebFetcher
305
+ - GPT3Tokenizer
306
+ - types (shared type definitions)
307
+
308
+ - LocalIndex (vectors + metadata)
309
+ - Purpose: Store your own vectors and metadata; run cosine similarity + metadata filters in-memory.
310
+ - Key methods
311
+ - createIndex(options?: { version?: number; deleteIfExists?: boolean; metadata_config?: { indexed?: string[] } })
312
+ - isIndexCreated(): Promise<boolean>
313
+ - getIndexStats(): Promise<{ version: number; metadata_config: object; items: number }>
314
+ - insertItem(item: { id?: string; vector: number[]; metadata?: Record<string, any> }): Promise<IndexItem>
315
+ - batchInsertItems(items: Partial<IndexItem>[]): Promise<IndexItem[]>
316
+ - deleteItem(id: string): Promise<void>
317
+ - listItems(): Promise<IndexItem[]>
318
+ - listItemsByMetadata(filter: MetadataFilter): Promise<IndexItem[]>
319
+ - getItem(id: string): Promise<IndexItem | undefined>
320
+ - queryItems(vector: number[], query: string, topK: number, filter?: MetadataFilter): Promise<Array<{ item: IndexItem; score: number }>>
321
+ - beginUpdate(): Promise<void> / endUpdate(): Promise<void> (optional batching and atomic save)
322
+ - Notes
323
+ - metadata_config.indexed controls which fields are kept in-memory for fast filtering.
324
+ - Non-indexed metadata is stored as <id>.json and may be read during filtering.
325
+
326
+ - LocalDocumentIndex (document chunking + retrieval)
327
+ - Purpose: Ingest raw documents (strings, files, web pages), chunk and embed them, then query by text.
328
+ - Constructor options
329
+ - { folderPath: string; embeddings?: EmbeddingsModel; chunkingConfig?: { chunkSize?: number; chunkOverlap?: number; docType?: string } }
330
+ - Key methods
331
+ - createIndex(options?: { version?: number; deleteIfExists?: boolean }): Promise<void>
332
+ - deleteIndex(): Promise<void>
333
+ - getCatalogStats(): Promise<any>
334
+ - upsertDocument(uri: string, text: string, docType?: string): Promise<void>
335
+ - deleteDocument(uri: string): Promise<void>
336
+ - queryDocuments(query: string, options?: { maxDocuments?: number; maxChunks?: number; isBm25?: boolean }): Promise<LocalDocumentResult[]>
337
+
338
+ - LocalDocument
339
+ - Properties: id, uri, folderPath
340
+ - Methods
341
+ - getLength(): Promise<number>
342
+ - hasMetadata(): Promise<boolean>
343
+ - loadMetadata(): Promise<Record<string, any>>
344
+ - loadText(): Promise<string>
345
+
346
+ - LocalDocumentResult (extends LocalDocument)
347
+ - Properties
348
+ - chunks: QueryResult<DocumentChunkMetadata>[]
349
+ - score: number (average score across matching chunks)
350
+ - Methods
351
+ - renderSections(maxTokens: number, maxSections: number, overlap?: boolean): Promise<DocumentTextSection[]>
352
+ - renderAllSections(maxTokens: number): Promise<DocumentTextSection[]>
353
+
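The document-level score described above (an average over matching chunks) can be sketched as a simple aggregation. This is an illustration only: `ChunkHit`, `DocumentScore`, and `aggregateByDocument` are hypothetical names, and the real result objects carry more information than shown here.

```ts
interface ChunkHit {
  documentId: string;
  score: number;
}

interface DocumentScore {
  documentId: string;
  score: number;      // average of the document's chunk scores
  chunkCount: number; // how many chunks matched
}

// Group chunk hits by document, average their scores, and sort best-first.
function aggregateByDocument(hits: ChunkHit[]): DocumentScore[] {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const hit of hits) {
    const entry = totals[hit.documentId] || (totals[hit.documentId] = { sum: 0, count: 0 });
    entry.sum += hit.score;
    entry.count += 1;
  }
  return Object.keys(totals)
    .map((documentId) => ({
      documentId,
      score: totals[documentId].sum / totals[documentId].count,
      chunkCount: totals[documentId].count,
    }))
    .sort((a, b) => b.score - a.score);
}

const sampleHits: ChunkHit[] = [
  { documentId: 'doc-a', score: 0.9 },
  { documentId: 'doc-a', score: 0.7 },
  { documentId: 'doc-b', score: 0.85 },
];

console.log(aggregateByDocument(sampleHits));
```

With averaging, a document with one strong chunk can outrank a document with one strong and one weak chunk, which is worth keeping in mind when choosing maxChunks.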
354
+ - OpenAIEmbeddings
355
+ - Purpose: Generate embeddings via OpenAI, Azure OpenAI, or an OSS OpenAI-compatible endpoint.
356
+ - Constructors
357
+ - OpenAI: { apiKey: string; model: string; organization?, endpoint?, dimensions?, logRequests?, maxTokens?, retryPolicy?, requestConfig? }
358
+ - Azure: { azureApiKey: string; azureEndpoint: string; azureDeployment: string; azureApiVersion?, dimensions?, logRequests?, maxTokens?, retryPolicy?, requestConfig? }
359
+ - OSS: { ossEndpoint: string; ossModel: string; dimensions?, logRequests?, maxTokens?, retryPolicy?, requestConfig? }
360
+ - Methods
361
+ - createEmbeddings(input: string | string[]): Promise<{ status: 'success'|'error'|'rate_limited'; output?: number[][]; message?: string }>
362
+
363
+ - TextSplitter
364
+ - Purpose: Token-aware splitting by separators with configurable chunk size and overlap.
365
+ - Constructor config
366
+ - { separators?: string[]; keepSeparators?: boolean; chunkSize?: number; chunkOverlap?: number; tokenizer?: Tokenizer; docType?: string }
367
+ - Methods
368
+ - split(text: string): TextChunk[]
369
+
370
+ - ItemSelector
371
+ - Static helpers
372
+ - cosineSimilarity(a: number[], b: number[]): number
373
+ - normalizedCosineSimilarity(a: number[], normA: number, b: number[], normB: number): number
374
+ - select(metadata: Record<string, any>, filter: MetadataFilter): boolean
375
+
376
+ - Fetchers and utilities
377
+ - FileFetcher: Read local files and infer docType.
378
+ - WebFetcher: Fetch and clean webpages (supports custom headers like cookies).
379
+ - GPT3Tokenizer: Default tokenizer used for chunking.
380
+
381
+ - Types (high level)
382
+ - IndexItem: { id: string; vector: number[]; norm: number; metadata: Record<string, any>; metadataFile?: string }
383
+ - MetadataFilter: Mongo/Pinecone-style filter object ($and, $or, $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin)
384
+ - TextChunk: { text: string; tokens: number[]; startPos: number; endPos: number; startOverlap: number[]; endOverlap: number[] }
385
+ - QueryResult<T>: { item: { id: string; metadata: T }; score: number }
386
+ - DocumentChunkMetadata: { startPos: number; endPos: number; isBm25?: boolean }
387
+ - DocumentTextSection: { text: string; tokenCount: number; score: number; isBm25: boolean }
388
+
389
+ ## Performance & Limits
390
+
391
+ - In-memory design
392
+ - Entire index.json is loaded into memory for ultra-fast filtering and cosine scoring.
393
+ - Linear scan with pre-normalized vectors keeps per-query latency low for small to medium corpora.
394
+
395
+ - Typical latency
396
+ - Small indexes: often sub-millisecond on a modern laptop.
397
+ - Medium indexes: commonly 1–2ms per query.
398
+ - Note: No ANN/approximate indexing; performance scales linearly with item count and vector dimension.
399
+
400
+ - Memory footprint (rule of thumb)
401
+ - Roughly items × dims × 8 bytes for vectors (JavaScript numbers are 64-bit) + per-item metadata overhead + norms.
402
+ - Example: 50k items × 1536 dims × 8 bytes ≈ 614 MB just for vectors (plus overhead).
403
+ - Keep indexes modest; consider separate folders to partition data.
404
+
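The rule of thumb above is straightforward to compute. The helper below is a back-of-the-envelope sketch with a hypothetical name (`estimateVectorBytes`); it counts only the raw vector numbers and ignores norms, metadata, and JavaScript object overhead.

```ts
// Raw size of the vectors alone: one 64-bit number per dimension per item.
function estimateVectorBytes(itemCount: number, dimensions: number): number {
  const BYTES_PER_NUMBER = 8;
  return itemCount * dimensions * BYTES_PER_NUMBER;
}

const exampleBytes = estimateVectorBytes(50_000, 1536);
console.log(`${exampleBytes} bytes ≈ ${(exampleBytes / 1e6).toFixed(1)} MB`);
// prints "614400000 bytes ≈ 614.4 MB"
```

Running the same calculation for a planned corpus before ingesting it gives a quick check on whether the index will fit comfortably in RAM.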
405
+ - DocumentIndex specifics
406
+ - Adds .txt bodies on disk and chunk metadata in index.json.
407
+ - Query time aggregates chunk scores to document results and supports optional BM25.
408
+
409
+ - Concurrency and durability
410
+ - beginUpdate/endUpdate guards against concurrent writes; endUpdate writes atomically to index.json.
411
+ - Batch operations are faster and safer than many small writes.
412
+
413
+ - Not for growing chat memory
414
+ - Because everything lives in RAM, use Vectra for small, mostly static corpora. For large or ever-growing datasets, use a hosted vector DB.
415
+
416
+ ## Best Practices
417
+
418
+ - Use separate folders to mimic namespaces
419
+ - Create one index folder per logical dataset (e.g., ./indexes/support, ./indexes/blog). This keeps memory usage predictable and lets you target queries precisely.
420
+
421
+ - Index only metadata fields you need for filtering
422
+ - Configure metadata_config.indexed with the minimal set of fields you’ll filter on. This keeps index.json small and speeds up filtering; store everything else in the per-item .json files.
423
+
424
+ - Batch inserts and use beginUpdate/endUpdate for bulk changes
425
+ - For many writes, call beginUpdate(), perform your inserts/deletes, then endUpdate() once. This reduces disk I/O and ensures atomic saves. Avoid concurrent writes; the lock prevents overlapping updates.
426
+
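The batching pattern above might look like the following helper. It is a sketch written against a minimal interface (`BatchCapableIndex`, a hypothetical name) that only assumes the beginUpdate, insertItem, and endUpdate methods listed in the API overview; error handling is deliberately left to the caller.

```ts
interface NewItem {
  id?: string;
  vector: number[];
  metadata?: Record<string, unknown>;
}

// Minimal surface assumed from the API overview; a real index exposes more.
interface BatchCapableIndex {
  beginUpdate(): Promise<void>;
  insertItem(item: NewItem): Promise<unknown>;
  endUpdate(): Promise<void>;
}

// Open one update, apply every insert, then save once at the end.
async function insertAll(index: BatchCapableIndex, items: NewItem[]): Promise<void> {
  await index.beginUpdate();
  for (const item of items) {
    await index.insertItem(item);
  }
  await index.endUpdate(); // single save for the whole batch
}

// Stand-in index that only records the call order, for demonstration.
const recordedCalls: string[] = [];
const recordingIndex: BatchCapableIndex = {
  async beginUpdate() { recordedCalls.push('begin'); },
  async insertItem(item) { recordedCalls.push(`insert:${item.id}`); return item; },
  async endUpdate() { recordedCalls.push('end'); },
};

insertAll(recordingIndex, [
  { id: '1', vector: [1, 0] },
  { id: '2', vector: [0, 1] },
]).then(() => console.log(recordedCalls.join(' -> ')));
```

If an insert throws, endUpdate() is never reached in this sketch, so nothing from the batch is saved; how to release or retry the pending update is left out here.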
427
+ - Choose appropriate chunk size/overlap for documents
428
+ - For LocalDocumentIndex, start with chunkSize ~512 tokens and overlap during rendering only (overlap=true in renderSections). If documents are highly structured or short, smaller chunks (256–384) can help precision; for long prose, larger chunks (768–1024) can improve continuity.
429
+
430
+ ## Troubleshooting
431
+
432
+ - Common issues
433
+ - Missing keys.json or API keys
434
+ - Symptom: CLI add/query fails or embeddings return error.
435
+ - Fix: Provide --keys keys.json with the correct fields for your provider. Ensure environment variables are loaded if you construct OpenAIEmbeddings in code.
436
+ - Invalid endpoint or deployment (Azure/OSS)
437
+ - Symptom: “Client created with an invalid endpoint…” or 404/401 from API.
438
+ - Fix: Use a valid https:// endpoint. For Azure, set azureEndpoint, azureDeployment, and (optionally) azureApiVersion correctly.
439
+ - File permissions or locked files
440
+ - Symptom: Error creating/saving index, or reading .txt/.json files.
441
+ - Fix: Ensure the index folder exists and is writable. Avoid opening the same index folder with multiple processes for writes.
442
+ - Rate limits
443
+ - Symptom: Embeddings API returns 429.
444
+ - Fix: OpenAIEmbeddings retries per retryPolicy (default [2000, 5000] ms). Increase backoff or reduce concurrency. Consider caching embeddings.
445
+ - Update lock errors
446
+ - Error: “Update already in progress”
447
+ - Cause: A write is already in flight between beginUpdate() and endUpdate().
448
+ - Fix: Avoid concurrent writes. Use a single critical section for batch updates.
449
+ - Error: “No update in progress”
450
+ - Cause: endUpdate() called without a matching beginUpdate().
451
+ - Fix: Ensure you pair beginUpdate()/endUpdate() calls.
452
+ - Index already exists / create vs. recreate
453
+ - Error: “Index already exists”
454
+ - Fix: Pass deleteIfExists: true to createIndex() if you intend to recreate. For CLI, you can delete and re-create: npx vectra delete ./index && npx vectra create ./index
455
+ - Partial writes or index corruption
456
+ - Symptom: Errors reading index.json after a failed write.
457
+ - Fix: Recreate the index folder and re-ingest data. Batch operations reduce risk: beginUpdate() … endUpdate().
458
+ - Metadata filters not matching
459
+ - Symptom: listItemsByMetadata or queryItems returns no results.
460
+ - Fix: Ensure the fields you filter on are included in metadata_config.indexed or present in the per-item .json. Verify filter syntax ($eq, $in, etc.).
461
+ - Node version mismatch
462
+ - Symptom: Build/runtime errors.
463
+ - Fix: Use Node.js >= 20.x as required.
464
+
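The retry behavior mentioned under "Rate limits" can be pictured with a small helper. This is a sketch, not the OpenAIEmbeddings implementation: `withRetry`, `AttemptResult`, and `sleep` are hypothetical names, and retrying only on a `rate_limited` status (one retry per delay in the policy) is an assumption based on the status values and default policy quoted in this document.

```ts
type AttemptResult<T> =
  | { status: 'success'; value: T }
  | { status: 'rate_limited' }
  | { status: 'error'; message: string };

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Try once, then retry after each delay in the policy while rate limited.
async function withRetry<T>(
  attempt: () => Promise<AttemptResult<T>>,
  retryPolicy: number[] = [2000, 5000]
): Promise<AttemptResult<T>> {
  let result = await attempt();
  for (const delayMs of retryPolicy) {
    if (result.status !== 'rate_limited') break;
    await sleep(delayMs);
    result = await attempt();
  }
  return result;
}

// Demonstration with a fake call that is rate limited twice, then succeeds.
let demoCalls = 0;
const flakyCall = async (): Promise<AttemptResult<string>> => {
  demoCalls++;
  return demoCalls < 3 ? { status: 'rate_limited' } : { status: 'success', value: 'ok' };
};

withRetry(flakyCall, [10, 20]).then((result) => console.log(result.status, 'after', demoCalls, 'calls'));
```

With the default policy of [2000, 5000] a call is attempted at most three times, so a caller that still sees `rate_limited` should back off further or lower its concurrency.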
465
+ ## Contributing
466
+
467
+ - Getting started
468
+ - Requirements: Node.js >= 20.x, Yarn or NPM
469
+ - Clone the repo: git clone https://github.com/Stevenic/vectra.git && cd vectra
470
+ - Install dependencies: yarn install (or npm install)
471
+
472
+ - Build, test, lint
473
+ - Build: yarn build
474
+ - Run tests: yarn test
475
+ - Lint and auto-fix: yarn lint
476
+ - Clean: yarn clean
477
+
478
+ - Submitting changes
479
+ - Fork the repository and create a feature/fix branch from main (e.g., feature/add-bm25-option, fix/metadata-filter).
480
+ - Write focused, self-contained commits; include tests for new features or bug fixes.
481
+ - Ensure all tests pass and lint issues are resolved.
482
+ - Open a Pull Request with a clear description and reference related issues (e.g., Closes #123).
483
+
484
+ - Reporting bugs and requesting features
485
+ - Open an issue with steps to reproduce, expected behavior, and environment details (OS, Node.js version).
486
+ - For enhancements, describe the use case and proposed solution.
487
+
488
+ - Code of Conduct
489
+ - Please be respectful and follow our community guidelines.
490
+
491
+ ## License
492
+
493
+ - MIT License
494
+ - See the LICENSE file in this repository for full text.
495
+
496
+ ## Acknowledgements
497
+
498
+ - Inspiration from Pinecone and Qdrant for vector database concepts and APIs.
499
+ - Thanks to the open-source ecosystem and libraries used in this project, including (but not limited to): axios, openai, gpt-tokenizer, wink-bm25-text-search, wink-nlp, cheerio, turndown, yargs, uuid, json-colorizer.
package/README.draft.outline.md ADDED
@@ -0,0 +1,160 @@
1
+ # Vectra
2
+
3
+ - One-line description
4
+ - Key features
5
+ - Local, file-backed vector database for Node.js
6
+ - In-memory search with cosine similarity
7
+ - Metadata filtering (Pinecone-compatible Mongo-style operators)
8
+ - Document indexing with chunking and optional hybrid BM25
9
+ - Simple CLI and TypeScript API
10
+ - When to use Vectra (and when not)
11
+ - Great for small, mostly static corpora; few-shot examples; single-doc QA
12
+ - Not suited for long-term, ever-growing chat memory (index fully in memory)
13
+ - Mimic namespaces by using separate folders (one index per folder)
14
+ - Language agnostic file format note (indices can be read/written by any language)
15
+
16
+ ## Requirements
17
+
18
+ - Node.js >= 20.x
19
+ - NPM or Yarn
20
+ - An embeddings model/provider (OpenAI, Azure OpenAI, or OSS OpenAI-compatible)
21
+
22
+ ## Installation
23
+
24
+ - npm install vectra
25
+ - Optional global CLI install or npx usage
26
+
27
+ ## Quick Start (5 minutes)
28
+
29
+ ### Choose your path
30
+
31
+ - Option A: Vector Item Index (LocalIndex) — store your own vectors + metadata; run similarity + metadata filters
32
+ - Option B: Document Index (LocalDocumentIndex) — chunk raw documents, store on disk, query via embeddings; render relevant sections
33
+
34
+ ### A. LocalIndex (items + metadata)
35
+
36
+ - Steps
37
+ 1) Create an index folder and initialize
38
+ 2) Generate embeddings (any provider) and insert items with metadata
39
+ 3) Query by vector with optional metadata filter; get topK sorted by similarity
40
+ - Example (code)
41
+ - Create index
42
+ - Insert items with vector + metadata
43
+ - Query with and without filter
44
+
45
+ ### B. LocalDocumentIndex (documents + chunking + retrieval)
46
+
47
+ - Steps
48
+ 1) Configure embeddings via OpenAIEmbeddings (OpenAI, Azure OpenAI, or OSS)
49
+ 2) Create index and add documents (from strings, files, or web pages)
50
+ 3) Query documents and render top sections
51
+ - Example (code)
52
+ - Initialize embeddings
53
+ - Create index with chunking config
54
+ - Upsert documents (uri, text, docType)
55
+ - Query and render sections (with overlap option)
56
+
57
+ ## CLI
58
+
59
+ - Installation
60
+ - Global install or use npx
61
+ - keys.json formats
62
+ - OpenAI (apiKey, model)
63
+ - Azure OpenAI (azureApiKey, azureEndpoint, azureDeployment, optional api-version)
64
+ - OSS (ossEndpoint, ossModel)
65
+ - Commands
66
+ - vectra create <index>
67
+ - vectra delete <index>
68
+ - vectra add <index> --keys keys.json --uri <url-or-file> [--list file] [--cookie str] [--chunk-size N]
69
+ - vectra remove <index> --uri <uri> [--list file]
70
+ - vectra stats <index>
71
+ - vectra query <index> "<query>" --keys keys.json [--document-count N] [--chunk-count N] [--section-count N] [--tokens N] [--format sections|stats|chunks] [--overlap] [--bm25]
72
+ - Usage examples
73
+ - Create, add web pages, query, render sections
74
+
75
+ ## Data Model & On-Disk Layout
76
+
77
+ - Index folder structure overview
78
+ - index.json
79
+ - Per-item or per-document files
80
+ - LocalIndex
81
+ - Stored vectors
82
+ - Indexed vs non-indexed metadata (metadata_config)
83
+ - Unindexed metadata file-by-id
84
+ - LocalDocumentIndex
85
+ - Document .txt and .json files
86
+ - Chunk metadata (startPos, endPos, overlaps)
87
+ - Catalog and index management
88
+
89
+ ## Search & Filtering
90
+
91
+ - Similarity
92
+ - Cosine similarity with pre-normalized vectors
93
+ - Metadata filters (Pinecone-compatible subset)
94
+ - $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or
95
+ - Hybrid search (BM25) for documents
96
+ - Optional keyword scoring and combination with semantic matches
97
+ - Result rendering
98
+ - Render sections with token limits and optional overlap
99
+ - Sorting by score; multiple sections per document
100
+
101
+ ## API Overview
102
+
103
+ - Core exports
104
+ - LocalIndex
105
+ - createIndex, isIndexCreated, getIndexStats
106
+ - insertItem, batchInsertItems, deleteItem, listItems, listItemsByMetadata, getItem
107
+ - queryItems(vector, topK, filter?)
108
+ - beginUpdate/endUpdate (batched changes)
109
+ - LocalDocumentIndex
110
+ - createIndex/deleteIndex/getCatalogStats
111
+ - upsertDocument(uri, text, docType?)
112
+ - deleteDocument(uri)
113
+ - queryDocuments(query, { maxDocuments, maxChunks, isBm25 })
114
+ - LocalDocumentResult
115
+ - chunks, score
116
+ - loadText, loadMetadata
117
+ - renderSections(maxTokens, maxSections, overlap?)
118
+ - renderAllSections(maxTokens)
119
+ - OpenAIEmbeddings (OpenAI, Azure OpenAI, OSS)
120
+ - TextSplitter, FileFetcher, WebFetcher
121
+ - Tokenizer utilities (GPT3Tokenizer)
122
+ - ItemSelector (cosine similarity, metadata selection)
123
+ - Types summary (high level)
124
+ - IndexItem, MetadataFilter, Embeddings options
125
+
126
+ ## Performance & Limits
127
+
128
+ - Entire index loaded in memory for ultra-fast filtering + scoring
129
+ - Typical latency expectations for small to medium corpora
130
+ - Guidance on index size and memory footprint
131
+
132
+ ## Best Practices
133
+
134
+ - Use separate folders to mimic namespaces
135
+ - Index only metadata fields you need for filtering
136
+ - Batch inserts and use beginUpdate/endUpdate for bulk changes
137
+ - Choose appropriate chunk size/overlap for documents
138
+
139
+ ## Troubleshooting
140
+
141
+ - Common issues (missing keys.json, invalid endpoint, file permissions)
142
+ - Rate limiting and retry behavior
143
+ - Index corruption or partial updates (how to recreate)
144
+
145
+ ## Contributing
146
+
147
+ - How to build, test, and lint
148
+ - yarn install, yarn build, yarn test, yarn lint
149
+ - Open issues and PR guidelines
150
+ - Code of Conduct
151
+ - Link to CONTRIBUTING.md
152
+
153
+ ## License
154
+
155
+ - MIT License
156
+
157
+ ## Acknowledgements
158
+
159
+ - Inspiration from Pinecone and Qdrant
160
+ - Libraries used in this repo