vectra 0.12.1 → 0.12.3
- package/LICENSE +1 -1
- package/README.draft.md +499 -0
- package/README.draft.outline.md +160 -0
- package/README.research.md +2159 -0
- package/lib/FileFetcher.d.ts +5 -0
- package/lib/FileFetcher.d.ts.map +1 -0
- package/lib/FileFetcher.js +79 -0
- package/lib/FileFetcher.js.map +1 -0
- package/lib/GPT3Tokenizer.d.ts +9 -0
- package/lib/GPT3Tokenizer.d.ts.map +1 -0
- package/lib/GPT3Tokenizer.js +17 -0
- package/lib/GPT3Tokenizer.js.map +1 -0
- package/lib/ItemSelector.d.ts +41 -0
- package/lib/ItemSelector.d.ts.map +1 -0
- package/lib/ItemSelector.js +168 -0
- package/lib/ItemSelector.js.map +1 -0
- package/lib/LocalDocument.d.ts +54 -0
- package/lib/LocalDocument.d.ts.map +1 -0
- package/lib/LocalDocument.js +156 -0
- package/lib/LocalDocument.js.map +1 -0
- package/lib/LocalDocumentIndex.d.ts +132 -0
- package/lib/LocalDocumentIndex.d.ts.map +1 -0
- package/lib/LocalDocumentIndex.js +456 -0
- package/lib/LocalDocumentIndex.js.map +1 -0
- package/lib/LocalDocumentResult.d.ts +45 -0
- package/lib/LocalDocumentResult.d.ts.map +1 -0
- package/lib/LocalDocumentResult.js +328 -0
- package/lib/LocalDocumentResult.js.map +1 -0
- package/lib/LocalIndex.d.ts +150 -0
- package/lib/LocalIndex.d.ts.map +1 -0
- package/lib/LocalIndex.js +515 -0
- package/lib/LocalIndex.js.map +1 -0
- package/lib/LocalIndex.spec.d.ts +2 -0
- package/lib/LocalIndex.spec.js +218 -7
- package/lib/LocalIndex.spec.js.map +1 -1
- package/lib/OpenAIEmbeddings.d.ts +126 -0
- package/lib/OpenAIEmbeddings.d.ts.map +1 -0
- package/lib/OpenAIEmbeddings.js +174 -0
- package/lib/OpenAIEmbeddings.js.map +1 -0
- package/lib/TextSplitter.d.ts +19 -0
- package/lib/TextSplitter.d.ts.map +1 -0
- package/lib/TextSplitter.js +457 -0
- package/lib/TextSplitter.js.map +1 -0
- package/lib/TextSplitter.spec.d.ts +2 -0
- package/lib/TextSplitter.spec.d.ts.map +1 -0
- package/lib/TextSplitter.spec.js +109 -0
- package/lib/TextSplitter.spec.js.map +1 -0
- package/lib/WebFetcher.d.ts +15 -0
- package/lib/WebFetcher.d.ts.map +1 -0
- package/lib/WebFetcher.js +234 -0
- package/lib/WebFetcher.js.map +1 -0
- package/lib/index.d.ts +12 -0
- package/lib/index.d.ts.map +1 -0
- package/lib/index.js +28 -0
- package/lib/index.js.map +1 -0
- package/lib/internals/Colorize.d.ts +14 -0
- package/lib/internals/Colorize.d.ts.map +1 -0
- package/lib/internals/Colorize.js +64 -0
- package/lib/internals/Colorize.js.map +1 -0
- package/lib/internals/index.d.ts +3 -0
- package/lib/internals/index.d.ts.map +1 -0
- package/lib/internals/index.js +19 -0
- package/lib/internals/index.js.map +1 -0
- package/lib/internals/types.d.ts +43 -0
- package/lib/internals/types.d.ts.map +1 -0
- package/lib/internals/types.js +3 -0
- package/lib/internals/types.js.map +1 -0
- package/lib/types.d.ts +146 -0
- package/lib/types.d.ts.map +1 -0
- package/lib/types.js +3 -0
- package/lib/types.js.map +1 -0
- package/lib/vectra-cli.d.ts +2 -0
- package/lib/vectra-cli.d.ts.map +1 -0
- package/lib/vectra-cli.js +323 -0
- package/lib/vectra-cli.js.map +1 -0
- package/package.json +5 -3
- package/src/GPT3Tokenizer.ts +1 -1
- package/src/LocalIndex.spec.ts +265 -8
- package/src/LocalIndex.ts +1 -0
- package/src/TextSplitter.spec.ts +87 -0
- package/src/TextSplitter.ts +459 -531
package/LICENSE
CHANGED
package/README.draft.md
ADDED
|
@@ -0,0 +1,499 @@
# Vectra

- A local, file-backed vector database for Node.js with Pinecone-like features and zero external infrastructure. Each index is a folder on disk, loaded into memory for ultra-fast queries.

- Key features
  - Local, file-backed vector database for Node.js
  - In-memory search with cosine similarity (pre-normalized vectors for speed)
  - Metadata filtering (Pinecone-compatible Mongo-style operators)
  - Document indexing with chunking and optional hybrid BM25 keyword search
  - Simple CLI and TypeScript API

- When to use Vectra (and when not)
  - Great for small, mostly static corpora; few-shot examples; single-document Q&A
  - Not suited for long-term, ever-growing chat memory (entire index loads into RAM)
  - Mimic namespaces by using separate folders (one index per folder)

- Language agnostic file format note (indices can be read/written by any language)
  - Indexes are plain JSON and text files on disk; while this package targets Node.js, any language can read/write the folder format.

## Requirements

- Node.js >= 20.x
- NPM or Yarn
- An embeddings provider (OpenAI, Azure OpenAI, or any OpenAI-compatible OSS endpoint)

## Installation

- Library
  - npm install vectra
- CLI
  - Use via npx: npx vectra --help
  - Or install globally: npm install -g vectra

## Quick Start (5 minutes)

### Choose your path

- Option A: Vector Item Index (LocalIndex). Store your own vectors + metadata; run similarity + metadata filters
- Option B: Document Index (LocalDocumentIndex). Chunk raw documents, store on disk, query via embeddings; render relevant sections

### A. LocalIndex (items + metadata)

- Steps
  1) Create an index folder and initialize
  2) Generate embeddings (any provider) and insert items with metadata
  3) Query by vector with optional metadata filter; get topK sorted by similarity

- Example (code)

```ts
import path from 'node:path';
import { LocalIndex } from 'vectra';
import { OpenAI } from 'openai';

const indexPath = path.join(process.cwd(), 'my-localindex');
const index = new LocalIndex(indexPath);

async function ensureIndex() {
  if (!(await index.isIndexCreated())) {
    await index.createIndex({
      version: 1,
      metadata_config: { indexed: ['category'] } // index only fields you need to filter on
    });
  }
}

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

async function getVector(text: string) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return res.data[0].embedding;
}

async function addItem(id: string, text: string, category?: string) {
  await index.insertItem({
    id,
    vector: await getVector(text),
    metadata: { text, category }
  });
}

async function main() {
  await ensureIndex();

  await addItem('1', 'apple', 'food');
  await addItem('2', 'oranges', 'food');
  await addItem('3', 'red', 'color');
  await addItem('4', 'blue', 'color');

  const queryVec = await getVector('banana');
  const results = await index.queryItems(queryVec, 'banana', 3); // vector, query, topK, [optional filter]

  for (const r of results) {
    console.log(`[${r.score.toFixed(3)}] ${r.item.metadata.text}`);
  }

  // With metadata filter (e.g., only colors)
  const colorResults = await index.queryItems(queryVec, '', 3, { category: { $eq: 'color' } });
  console.log('Only colors:');
  for (const r of colorResults) {
    console.log(`[${r.score.toFixed(3)}] ${r.item.metadata.text}`);
  }
}

main().catch(console.error);
```

### B. LocalDocumentIndex (documents + chunking + retrieval)

- Steps
  1) Configure embeddings via OpenAIEmbeddings (OpenAI, Azure OpenAI, or OSS)
  2) Create index and add documents (from strings, files, or web pages)
  3) Query documents and render top sections

- Example (code)

```ts
import path from 'node:path';
import { LocalDocumentIndex, OpenAIEmbeddings } from 'vectra';

const folderPath = path.join(process.cwd(), 'my-docindex');

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'text-embedding-3-small',
  // optional: dimensions, requestConfig, retryPolicy, etc.
});

const docIndex = new LocalDocumentIndex({
  folderPath,
  embeddings,
  chunkingConfig: { chunkSize: 512 } // tokens per chunk
});

async function setup() {
  await docIndex.createIndex({ version: 1, deleteIfExists: true });
  await docIndex.upsertDocument(
    'doc://getting-started',
    `
Vectra is a local vector DB for Node.js.
It supports metadata filtering and blazing-fast in-memory search.
Great for small, mostly static corpora.
`,
    'md'
  );
}

async function search() {
  const results = await docIndex.queryDocuments('How do I use Vectra for small corpora?', {
    maxDocuments: 5,
    maxChunks: 50,
    isBm25: false // set true for hybrid keyword+semantic retrieval
  });

  for (const doc of results) {
    console.log(`\nURI: ${doc.uri} (score: ${doc.score.toFixed(3)})`);
    const sections = await doc.renderSections(800, 1, true); // tokens, section count, overlap
    for (const s of sections) {
      console.log(`Tokens: ${s.tokenCount}, Section score: ${s.score.toFixed(3)}`);
      console.log(s.text.trim());
    }
  }
}

setup().then(search).catch(console.error);
```

## CLI

- Installation
  - Global: npm install -g vectra
  - One-off: npx vectra --help

- keys.json formats
  - Note: the // lines below list optional fields for reference. Standard JSON has no comment syntax, so leave them out of a real keys.json and add any optional fields as ordinary comma-separated properties.
  - OpenAI
    {
      "apiKey": "sk-...",
      "model": "text-embedding-3-small"
      // optional: "organization": "org_...", "endpoint": "https://api.openai.com",
      // optional: "dimensions": 1536, "logRequests": false, "maxTokens": 8000,
      // optional: "retryPolicy": [2000, 5000], "requestConfig": { "timeout": 60000 }
    }
    - Note: If you omit model when using the CLI, it defaults to "text-embedding-ada-002" (with maxTokens 8000).
  - Azure OpenAI
    {
      "azureApiKey": "<YOUR_AZURE_OPENAI_KEY>",
      "azureEndpoint": "https://<your-resource-name>.openai.azure.com",
      "azureDeployment": "<your-deployment-name>",
      "azureApiVersion": "2023-05-15",
      // optional: "dimensions": 1536, "logRequests": false, "maxTokens": 8000,
      // optional: "retryPolicy": [2000, 5000], "requestConfig": { "timeout": 60000 }
    }
  - OSS (OpenAI-compatible)
    {
      "ossEndpoint": "https://api.your-oss-endpoint.com",
      "ossModel": "text-embedding-3-small",
      // optional: "dimensions": 1536, "logRequests": false, "maxTokens": 8000,
      // optional: "retryPolicy": [2000, 5000], "requestConfig": { "timeout": 60000 }
    }

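The three keys.json shapes can be told apart by which fields are present. A minimal sketch of that check follows. `detectProvider` is a hypothetical helper written for this example, not part of Vectra's API, and the order of the checks is an assumption; only the field names come from the formats listed above.

```ts
// Hypothetical helper, not part of Vectra: classify a parsed keys.json
// object by the provider-specific fields documented above.
type KeysFile = Record<string, unknown>;
type Provider = 'openai' | 'azure' | 'oss';

function detectProvider(keys: KeysFile): Provider {
  // A field counts as present only if it is a non-empty string.
  const has = (name: string): boolean => {
    const value = keys[name];
    return typeof value === 'string' && value.length > 0;
  };

  // Azure needs a key, an endpoint, and a deployment name together.
  if (has('azureApiKey') && has('azureEndpoint') && has('azureDeployment')) {
    return 'azure';
  }
  // OSS endpoints need an endpoint and a model name.
  if (has('ossEndpoint') && has('ossModel')) {
    return 'oss';
  }
  // Plain OpenAI needs at least an API key (the CLI defaults the model).
  if (has('apiKey')) {
    return 'openai';
  }
  throw new Error('keys.json does not match any documented provider format');
}
```

A check like this can run right after `JSON.parse` so that a malformed keys file fails early with a readable message instead of a provider error later on.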
- Commands
  - vectra create <index>
    - Create a new local document index (folder). The CLI create command always creates a fresh catalog; to overwrite an existing index from code, pass deleteIfExists: true to createIndex().
    - Example: npx vectra create ./my-docindex
  - vectra delete <index>
    - Delete an existing document index folder.
    - Example: npx vectra delete ./my-docindex
  - vectra add <index> --keys keys.json [--uri <url-or-file> ...] [--list file] [--cookie str] [--chunk-size N]
    - Add one or more web pages or local files to the index. Auto-detects http/https vs file path.
    - Example (single URL): npx vectra add ./my-docindex --keys keys.json --uri https://example.com/docs/intro
    - Example (local file): npx vectra add ./my-docindex --keys keys.json --uri ./docs/guide.md
    - Example (list file): npx vectra add ./my-docindex --keys keys.json --list urls.txt
    - Example (with cookie): npx vectra add ./my-docindex --keys keys.json --uri https://site.com/protected --cookie "sessionid=abc; other=xyz"
    - Example (custom chunk size): npx vectra add ./my-docindex --keys keys.json --uri https://example.com --chunk-size 512
  - vectra remove <index> --uri <uri> [--list file]
    - Remove one or more documents (by stored URI) from the index.
    - Example: npx vectra remove ./my-docindex --uri https://example.com/docs/intro
    - Example (list): npx vectra remove ./my-docindex --list uris-to-remove.txt
  - vectra stats <index>
    - Print catalog stats (version, doc count, etc.).
    - Example: npx vectra stats ./my-docindex
  - vectra query <index> "<query>" --keys keys.json [--document-count N] [--chunk-count N] [--section-count N] [--tokens N] [--format sections|stats|chunks] [--overlap] [--bm25]
    - Query the index and render results.
    - Example (default sections view): npx vectra query ./my-docindex "how do I get started?" --keys keys.json
    - Example (limit docs/sections/tokens): npx vectra query ./my-docindex "hybrid search" --keys keys.json --document-count 5 --section-count 2 --tokens 800
    - Example (show chunks): npx vectra query ./my-docindex "metadata filtering" --keys keys.json --format chunks
    - Example (enable hybrid keyword+semantic): npx vectra query ./my-docindex "install steps" --keys keys.json --bm25

## Data Model & On-Disk Layout

- Index folder structure overview
  - A Vectra index is a single folder on disk you choose.
  - Core files
    - index.json: the in-memory index snapshot (vectors + selected metadata + config).
  - For item/document payloads
    - <id>.json: non-indexed metadata for an item or document.
    - <id>.txt: raw document text when using LocalDocumentIndex.

- LocalIndex
  - index.json contents (high level)
    - version: schema/versioning number.
    - metadata_config: which metadata fields are stored in-memory for filtering (indexed).
    - items: array of items, each with
      - id: your ID (or auto-generated if omitted).
      - vector: numeric array embedding.
      - norm: precomputed vector norm for fast cosine similarity.
      - metadata: only the fields listed in metadata_config.indexed.
  - Non-indexed metadata
    - Stored separately as <id>.json on disk.
    - At query time, Vectra filters first by the in-memory indexed metadata. If a filter refers to a field not present in memory, Vectra may read the item’s metadata file to evaluate the filter.
  - Namespaces
    - Not directly supported; create a separate folder per “namespace”.

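To illustrate how metadata_config.indexed divides an item's metadata between index.json and the per-item .json file, here is a minimal sketch. `splitMetadata` is a hypothetical helper written for this example, not Vectra's internal code; it only mirrors the split as this section describes it.

```ts
type Metadata = Record<string, string | number | boolean>;

interface SplitResult {
  indexed: Metadata; // kept in index.json, available in memory for filtering
  rest: Metadata;    // per this section, stored in the per-item .json file
}

// Hypothetical helper: partition metadata by the configured indexed fields.
function splitMetadata(metadata: Metadata, indexedFields: string[]): SplitResult {
  const indexed: Metadata = {};
  const rest: Metadata = {};
  for (const key of Object.keys(metadata)) {
    if (indexedFields.indexOf(key) >= 0) {
      indexed[key] = metadata[key];
    } else {
      rest[key] = metadata[key];
    }
  }
  return { indexed, rest };
}

const parts = splitMetadata({ text: 'apple', category: 'food' }, ['category']);
console.log(parts.indexed); // { category: 'food' }
console.log(parts.rest);    // { text: 'apple' }
```

With `indexed: ['category']`, only `category` has to live in memory; a long `text` field stays on disk until it is actually needed.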
- LocalDocumentIndex
  - What’s stored
    - index.json: embedding vectors and metadata for document chunks and catalog info.
    - <id>.txt: full document text (enables section rendering and context extraction).
    - <id>.json: additional document-level metadata you provide (optional).
  - Document identity
    - Documents are addressed by a URI you supply (e.g., https://example.com/page or doc://my-doc).
    - Internally, Vectra uses an ID to store files (<id>.txt/.json) and tracks the URI↔ID mapping in the index.
  - Chunk metadata
    - startPos and endPos: byte or character offsets into the <id>.txt content for the chunk.
    - Optional flags (e.g., isBm25) used for hybrid retrieval.
  - Chunking
    - Documents are split into token-based chunks using a configurable chunk size and optional overlap logic when rendering.

## Search & Filtering

- Similarity
  - Cosine similarity with pre-normalized vectors for speed.
  - For LocalIndex: all items are filtered by metadata first, then scored and returned sorted by similarity.
  - For LocalDocumentIndex: chunk-level scoring aggregates into document-level results.

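A minimal sketch of cosine scoring with precomputed norms and a linear scan, assuming plain number[] vectors. The function and type names here are illustrative, written for this example rather than copied from Vectra's ItemSelector, and returning 0 for a zero vector is a choice made for the sketch.

```ts
// Euclidean norm of a vector; computed once per item at insert time.
function vectorNorm(v: number[]): number {
  let sum = 0;
  for (let i = 0; i < v.length; i++) sum += v[i] * v[i];
  return Math.sqrt(sum);
}

// Cosine similarity when both norms are already known:
// only the dot product has to be computed per query.
function cosineWithNorms(a: number[], normA: number, b: number[], normB: number): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  if (normA === 0 || normB === 0) return 0; // zero vector: no similarity (sketch-only choice)
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot / (normA * normB);
}

interface StoredItem { id: string; vector: number[]; norm: number; }

// Linear scan: score every stored item against the query, highest score first.
function topKByCosine(items: StoredItem[], query: number[], k: number): Array<{ id: string; score: number }> {
  const queryNorm = vectorNorm(query);
  return items
    .map((item) => ({ id: item.id, score: cosineWithNorms(query, queryNorm, item.vector, item.norm) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because each item's norm is stored next to its vector, a query costs one dot product per item, which is why latency grows linearly with item count and dimension.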
- Metadata filters (Pinecone-compatible subset)
  - Logical operators: $and, $or
  - Comparison operators: $eq, $ne, $gt, $gte, $lt, $lte
  - Set operators: $in, $nin
  - Filters apply to fields defined in metadata_config.indexed; non-indexed fields are stored per-item/per-doc in <id>.json and may be read during filtering when needed.

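To show how these operators combine, here is a small, self-contained matcher for this style of filter. It is a sketch for illustration, not Vectra's ItemSelector: the type names and `matchesFilter` are hypothetical, and treating a bare value as shorthand for $eq is an assumption of this example.

```ts
type Scalar = string | number | boolean;
type FieldCondition =
  | Scalar
  | { $eq?: Scalar; $ne?: Scalar; $gt?: number; $gte?: number; $lt?: number; $lte?: number; $in?: Scalar[]; $nin?: Scalar[] };

interface Filter {
  $and?: Filter[];
  $or?: Filter[];
  [field: string]: FieldCondition | Filter[] | undefined;
}

// Evaluate the operators attached to a single field.
function matchesCondition(value: Scalar | undefined, cond: FieldCondition): boolean {
  if (typeof cond !== 'object') return value === cond; // bare value: treated as $eq in this sketch
  if (cond.$eq !== undefined && value !== cond.$eq) return false;
  if (cond.$ne !== undefined && value === cond.$ne) return false;
  if (cond.$gt !== undefined && !(typeof value === 'number' && value > cond.$gt)) return false;
  if (cond.$gte !== undefined && !(typeof value === 'number' && value >= cond.$gte)) return false;
  if (cond.$lt !== undefined && !(typeof value === 'number' && value < cond.$lt)) return false;
  if (cond.$lte !== undefined && !(typeof value === 'number' && value <= cond.$lte)) return false;
  if (cond.$in !== undefined && (value === undefined || cond.$in.indexOf(value) < 0)) return false;
  if (cond.$nin !== undefined && value !== undefined && cond.$nin.indexOf(value) >= 0) return false;
  return true;
}

// All top-level keys must match; $and and $or recurse into sub-filters.
function matchesFilter(metadata: Record<string, Scalar>, filter: Filter): boolean {
  for (const key of Object.keys(filter)) {
    const cond = filter[key];
    if (cond === undefined) continue;
    if (key === '$and') {
      if (!(cond as Filter[]).every((f) => matchesFilter(metadata, f))) return false;
    } else if (key === '$or') {
      if (!(cond as Filter[]).some((f) => matchesFilter(metadata, f))) return false;
    } else if (!matchesCondition(metadata[key], cond as FieldCondition)) {
      return false;
    }
  }
  return true;
}
```

For example, `matchesFilter({ category: 'color', rank: 2 }, { $and: [{ category: { $eq: 'color' } }, { rank: { $lt: 5 } }] })` evaluates both sub-filters against the same metadata object and returns true only if each one matches.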
- Hybrid search (BM25) for documents
  - Optional keyword scoring combined with semantic matches to improve recall.
  - Enable via CLI flag --bm25 or corresponding API options when querying LocalDocumentIndex.

- Result rendering
  - LocalDocumentResult
    - renderSections(maxTokens, maxSections, overlap?): returns top sections with aggregated scores; can optionally include overlapping chunks.
    - renderAllSections(maxTokens): renders all matched spans split into sections up to maxTokens each.
  - Sections include token counts and per-section scores, enabling easy prompt assembly.

## API Overview

- Core exports
  - LocalIndex
  - LocalDocumentIndex
  - LocalDocument
  - LocalDocumentResult
  - OpenAIEmbeddings
  - TextSplitter
  - ItemSelector
  - FileFetcher, WebFetcher
  - GPT3Tokenizer
  - types (shared type definitions)

- LocalIndex (vectors + metadata)
  - Purpose: Store your own vectors and metadata; run cosine similarity + metadata filters in-memory.
  - Key methods
    - createIndex(options?: { version?: number; deleteIfExists?: boolean; metadata_config?: { indexed?: string[] } })
    - isIndexCreated(): Promise<boolean>
    - getIndexStats(): Promise<{ version: number; metadata_config: object; items: number }>
    - insertItem(item: { id?: string; vector: number[]; metadata?: Record<string, any> }): Promise<IndexItem>
    - batchInsertItems(items: Partial<IndexItem>[]): Promise<IndexItem[]>
    - deleteItem(id: string): Promise<void>
    - listItems(): Promise<IndexItem[]>
    - listItemsByMetadata(filter: MetadataFilter): Promise<IndexItem[]>
    - getItem(id: string): Promise<IndexItem | undefined>
    - queryItems(vector: number[], query: string, topK: number, filter?: MetadataFilter): Promise<Array<{ item: IndexItem; score: number }>>
    - beginUpdate(): Promise<void> / endUpdate(): Promise<void> (optional batching and atomic save)
  - Notes
    - metadata_config.indexed controls which fields are kept in-memory for fast filtering.
    - Non-indexed metadata is stored as <id>.json and may be read during filtering.

- LocalDocumentIndex (document chunking + retrieval)
  - Purpose: Ingest raw documents (strings, files, web pages), chunk and embed them, then query by text.
  - Constructor options
    - { folderPath: string; embeddings?: EmbeddingsModel; chunkingConfig?: { chunkSize?: number; chunkOverlap?: number; docType?: string } }
  - Key methods
    - createIndex(options?: { version?: number; deleteIfExists?: boolean }): Promise<void>
    - deleteIndex(): Promise<void>
    - getCatalogStats(): Promise<any>
    - upsertDocument(uri: string, text: string, docType?: string): Promise<void>
    - deleteDocument(uri: string): Promise<void>
    - queryDocuments(query: string, options?: { maxDocuments?: number; maxChunks?: number; isBm25?: boolean }): Promise<LocalDocumentResult[]>

- LocalDocument
  - Properties: id, uri, folderPath
  - Methods
    - getLength(): Promise<number>
    - hasMetadata(): Promise<boolean>
    - loadMetadata(): Promise<Record<string, any>>
    - loadText(): Promise<string>

- LocalDocumentResult (extends LocalDocument)
  - Properties
    - chunks: QueryResult<DocumentChunkMetadata>[]
    - score: number (average score across matching chunks)
  - Methods
    - renderSections(maxTokens: number, maxSections: number, overlap?: boolean): Promise<DocumentTextSection[]>
    - renderAllSections(maxTokens: number): Promise<DocumentTextSection[]>

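The document-level score is described above as the average score across matching chunks. A minimal sketch of that aggregation follows, assuming a simplified chunk shape (`documentId` and `score`) invented for this example; Vectra's own result types differ, and sorting by score and truncating to `maxDocuments` are assumptions of the sketch.

```ts
interface ChunkHit { documentId: string; score: number; }
interface DocumentScore { documentId: string; score: number; chunkCount: number; }

// Group chunk hits by document, average their scores, and keep the best documents.
function aggregateByDocument(hits: ChunkHit[], maxDocuments: number): DocumentScore[] {
  const sums: { [id: string]: { total: number; count: number } } = {};
  for (const hit of hits) {
    const entry = sums[hit.documentId] || (sums[hit.documentId] = { total: 0, count: 0 });
    entry.total += hit.score;
    entry.count += 1;
  }
  return Object.keys(sums)
    .map((id) => ({
      documentId: id,
      score: sums[id].total / sums[id].count, // average across matching chunks
      chunkCount: sums[id].count
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxDocuments);
}
```

One consequence of averaging is that a document with a single strong chunk can outrank a document with one strong and several weak chunks, which is worth keeping in mind when tuning maxChunks.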
- OpenAIEmbeddings
  - Purpose: Generate embeddings via OpenAI, Azure OpenAI, or an OSS OpenAI-compatible endpoint.
  - Constructors
    - OpenAI: { apiKey: string; model: string; organization?, endpoint?, dimensions?, logRequests?, maxTokens?, retryPolicy?, requestConfig? }
    - Azure: { azureApiKey: string; azureEndpoint: string; azureDeployment: string; azureApiVersion?, dimensions?, logRequests?, maxTokens?, retryPolicy?, requestConfig? }
    - OSS: { ossEndpoint: string; ossModel: string; dimensions?, logRequests?, maxTokens?, retryPolicy?, requestConfig? }
  - Methods
    - createEmbeddings(input: string | string[]): Promise<{ status: 'success'|'error'|'rate_limited'; output?: number[][]; message?: string }>

- TextSplitter
  - Purpose: Token-aware splitting by separators with configurable chunk size and overlap.
  - Constructor config
    - { separators?: string[]; keepSeparators?: boolean; chunkSize?: number; chunkOverlap?: number; tokenizer?: Tokenizer; docType?: string }
  - Methods
    - split(text: string): TextChunk[]

- ItemSelector
  - Static helpers
    - cosineSimilarity(a: number[], b: number[]): number
    - normalizedCosineSimilarity(a: number[], normA: number, b: number[], normB: number): number
    - select(metadata: Record<string, any>, filter: MetadataFilter): boolean

- Fetchers and utilities
  - FileFetcher: Read local files and infer docType.
  - WebFetcher: Fetch and clean webpages (supports custom headers like cookies).
  - GPT3Tokenizer: Default tokenizer used for chunking.

- Types (high level)
  - IndexItem: { id: string; vector: number[]; norm: number; metadata: Record<string, any>; metadataFile?: string }
  - MetadataFilter: Mongo/Pinecone-style filter object ($and, $or, $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin)
  - TextChunk: { text: string; tokens: number[]; startPos: number; endPos: number; startOverlap: number[]; endOverlap: number[] }
  - QueryResult<T>: { item: { id: string; metadata: T }; score: number }
  - DocumentChunkMetadata: { startPos: number; endPos: number; isBm25?: boolean }
  - DocumentTextSection: { text: string; tokenCount: number; score: number; isBm25: boolean }

## Performance & Limits

- In-memory design
  - Entire index.json is loaded into memory for ultra-fast filtering and cosine scoring.
  - Linear scan with pre-normalized vectors keeps per-query latency low for small to medium corpora.

- Typical latency
  - Small indexes: often sub-millisecond on a modern laptop.
  - Medium indexes: commonly 1–2ms per query.
  - Note: No ANN/approximate indexing; performance scales linearly with item count and vector dimension.

- Memory footprint (rule of thumb)
  - Roughly items × dims × 8 bytes for vectors (JavaScript numbers are 64-bit) + per-item metadata overhead + norms.
  - Example: 50k items × 1536 dims × 8 bytes ≈ 614 MB (about 586 MiB) just for vectors (plus overhead).
  - Keep indexes modest; consider separate folders to partition data.

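The rule of thumb above is simple enough to turn into a quick estimate. The helper names below are hypothetical and written for this example; the estimate covers vector storage only and ignores metadata, norms, and JavaScript object overhead.

```ts
// Rough estimate of vector storage: items × dims × 8 bytes (64-bit numbers).
function estimateVectorBytes(itemCount: number, dimensions: number): number {
  const BYTES_PER_NUMBER = 8;
  return itemCount * dimensions * BYTES_PER_NUMBER;
}

// Format a byte count in decimal megabytes (1 MB = 1,000,000 bytes).
function formatMegabytes(bytes: number): string {
  return (bytes / 1000000).toFixed(1) + ' MB';
}

console.log(formatMegabytes(estimateVectorBytes(50000, 1536))); // 614.4 MB
```

Running the same estimate per planned folder is a cheap way to decide where to partition data before memory becomes a problem.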
- DocumentIndex specifics
  - Adds .txt bodies on disk and chunk metadata in index.json.
  - Query time aggregates chunk scores to document results and supports optional BM25.

- Concurrency and durability
  - beginUpdate/endUpdate guards against concurrent writes; endUpdate writes atomically to index.json.
  - Batch operations are faster and safer than many small writes.

- Not for growing chat memory
  - Because everything lives in RAM, use Vectra for small, mostly static corpora. For large or ever-growing datasets, use a hosted vector DB.

## Best Practices

- Use separate folders to mimic namespaces
  - Create one index folder per logical dataset (e.g., ./indexes/support, ./indexes/blog). This keeps memory usage predictable and lets you target queries precisely.

- Index only metadata fields you need for filtering
  - Configure metadata_config.indexed with the minimal set of fields you’ll filter on. This keeps index.json small and speeds up filtering; store everything else in the per-item .json files.

- Batch inserts and use beginUpdate/endUpdate for bulk changes
  - For many writes, call beginUpdate(), perform your inserts/deletes, then endUpdate() once. This reduces disk I/O and ensures atomic saves. Avoid concurrent writes; the lock prevents overlapping updates.

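The beginUpdate()/endUpdate() pairing can be pictured as a guard around a working copy that is saved once at the end. The class below is a toy model written for this example: the class name, the in-memory `saved` array, and the synchronous method bodies are assumptions, not Vectra's implementation (whose methods are asynchronous). Only the method names and the two error messages come from this document.

```ts
class UpdateGuardDemo {
  private saved: string[] = [];          // stands in for index.json on disk
  private pending: string[] | undefined; // working copy while an update is open

  beginUpdate(): void {
    if (this.pending) throw new Error('Update already in progress');
    this.pending = this.saved.slice();
  }

  insert(id: string): void {
    if (!this.pending) throw new Error('No update in progress');
    this.pending.push(id); // changes accumulate in the working copy only
  }

  endUpdate(): void {
    if (!this.pending) throw new Error('No update in progress');
    this.saved = this.pending; // one "save" for the whole batch
    this.pending = undefined;
  }

  list(): string[] {
    return this.saved.slice();
  }
}

const demo = new UpdateGuardDemo();
demo.beginUpdate();
demo.insert('1');
demo.insert('2');
demo.endUpdate(); // both items become visible together
```

The point of the pattern is that readers never observe a half-applied batch, and a second writer fails fast instead of interleaving its changes.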
- Choose appropriate chunk size/overlap for documents
  - For LocalDocumentIndex, start with chunkSize ~512 tokens and overlap during rendering only (overlap=true in renderSections). If documents are highly structured or short, smaller chunks (256–384) can help precision; for long prose, larger chunks (768–1024) can improve continuity.

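To make the chunk size and overlap trade-off concrete, here is a simplified sketch of split-time overlap (the chunkOverlap idea, as opposed to the render-time overlap flag). It counts whitespace-separated words instead of model tokens and ignores separators, so it is an approximation for illustration only and not how TextSplitter works internally; `chunkWords` is a hypothetical name.

```ts
interface WordChunk { text: string; start: number; end: number; } // word indexes, end exclusive

// Split text into chunks of up to chunkSize words; consecutive chunks share chunkOverlap words.
function chunkWords(text: string, chunkSize: number, chunkOverlap: number): WordChunk[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be at least 0 and smaller than chunkSize');
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: WordChunk[] = [];
  const step = chunkSize - chunkOverlap; // how far each new chunk advances
  for (let start = 0; start < words.length; start += step) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push({ text: words.slice(start, end).join(' '), start, end });
    if (end === words.length) break; // the last chunk reached the end of the text
  }
  return chunks;
}
```

With `chunkSize = 3` and `chunkOverlap = 1`, the text `a b c d e f g` yields `a b c`, `c d e`, and `e f g`: smaller chunks give more precise matches, while the shared word at each boundary preserves some continuity.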
## Troubleshooting

- Common issues
  - Missing keys.json or API keys
    - Symptom: CLI add/query fails or embeddings return error.
    - Fix: Provide --keys keys.json with the correct fields for your provider. Ensure environment variables are loaded if you construct OpenAIEmbeddings in code.
  - Invalid endpoint or deployment (Azure/OSS)
    - Symptom: “Client created with an invalid endpoint…” or 404/401 from API.
    - Fix: Use a valid https:// endpoint. For Azure, set azureEndpoint, azureDeployment, and (optionally) azureApiVersion correctly.
  - File permissions or locked files
    - Symptom: Error creating/saving index, or reading .txt/.json files.
    - Fix: Ensure the index folder exists and is writable. Avoid opening the same index folder with multiple processes for writes.
  - Rate limits
    - Symptom: Embeddings API returns 429.
    - Fix: OpenAIEmbeddings retries per retryPolicy (default [2000, 5000] ms). Increase backoff or reduce concurrency. Consider caching embeddings.
  - Update lock errors
    - Error: “Update already in progress”
      - Cause: A write is already in flight between beginUpdate() and endUpdate().
      - Fix: Avoid concurrent writes. Use a single critical section for batch updates.
    - Error: “No update in progress”
      - Cause: endUpdate() called without a matching beginUpdate().
      - Fix: Ensure you pair beginUpdate()/endUpdate() calls.
  - Index already exists / create vs. recreate
    - Error: “Index already exists”
    - Fix: Pass deleteIfExists: true to createIndex() if you intend to recreate. For CLI, you can delete and re-create: npx vectra delete ./index && npx vectra create ./index
  - Partial writes or index corruption
    - Symptom: Errors reading index.json after a failed write.
    - Fix: Recreate the index folder and re-ingest data. Batch operations reduce risk: beginUpdate() … endUpdate().
  - Metadata filters not matching
    - Symptom: listItemsByMetadata or queryItems returns no results.
    - Fix: Ensure the fields you filter on are included in metadata_config.indexed or present in the per-item .json. Verify filter syntax ($eq, $in, etc.).
  - Node version mismatch
    - Symptom: Build/runtime errors.
    - Fix: Use Node.js >= 20.x as required.

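The retry behaviour mentioned under rate limits can be sketched as a loop over a list of delays. This is a generic illustration with hypothetical names (`callWithRetries`, `isRateLimited`, `Sleep`), not the code inside OpenAIEmbeddings; only the default delays of [2000, 5000] ms are taken from the text above.

```ts
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run an operation, waiting retryPolicy[i] ms before the (i + 1)-th retry
// whenever the failure is classified as a rate limit.
async function callWithRetries<T>(
  operation: () => Promise<T>,
  isRateLimited: (error: unknown) => boolean,
  retryPolicy: number[] = [2000, 5000],
  sleep: Sleep = defaultSleep
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on other errors, or once every configured delay has been used.
      if (!isRateLimited(error) || attempt >= retryPolicy.length) throw error;
      await sleep(retryPolicy[attempt]);
    }
  }
}
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a longer `retryPolicy` array is the simplest way to increase backoff.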
## Contributing

- Getting started
  - Requirements: Node.js >= 20.x, Yarn or NPM
  - Clone the repo: git clone https://github.com/Stevenic/vectra.git && cd vectra
  - Install dependencies: yarn install (or npm install)

- Build, test, lint
  - Build: yarn build
  - Run tests: yarn test
  - Lint and auto-fix: yarn lint
  - Clean: yarn clean

- Submitting changes
  - Fork the repository and create a feature/fix branch from main (e.g., feature/add-bm25-option, fix/metadata-filter).
  - Write focused, self-contained commits; include tests for new features or bug fixes.
  - Ensure all tests pass and lint issues are resolved.
  - Open a Pull Request with a clear description and reference related issues (e.g., Closes #123).

- Reporting bugs and requesting features
  - Open an issue with steps to reproduce, expected behavior, and environment details (OS, Node.js version).
  - For enhancements, describe the use case and proposed solution.

- Code of Conduct
  - Please be respectful and follow our community guidelines.

## License

- MIT License
- See the LICENSE file in this repository for full text.

## Acknowledgements

- Inspiration from Pinecone and Qdrant for vector database concepts and APIs.
- Thanks to the open-source ecosystem and libraries used in this project, including (but not limited to): axios, openai, gpt-tokenizer, wink-bm25-text-search, wink-nlp, cheerio, turndown, yargs, uuid, json-colorizer.

@@ -0,0 +1,160 @@
|
|
|
1
|
+
# Vectra

- One-line description
- Key features
  - Local, file-backed vector database for Node.js
  - In-memory search with cosine similarity
  - Metadata filtering (Pinecone-compatible Mongo-style operators)
  - Document indexing with chunking and optional hybrid BM25
  - Simple CLI and TypeScript API
- When to use Vectra (and when not)
  - Great for small, mostly static corpora; few-shot examples; single-doc QA
  - Not suited for long-term, ever-growing chat memory (index fully in memory)
  - Mimic namespaces by using separate folders (one index per folder)
- Language agnostic file format note (indices can be read/written by any language)

## Requirements

- Node.js >= 20.x
- NPM or Yarn
- An embeddings model/provider (OpenAI, Azure OpenAI, or OSS OpenAI-compatible)

## Installation

- npm install vectra
- Optional global CLI install or npx usage

## Quick Start (5 minutes)

### Choose your path

- Option A: Vector Item Index (LocalIndex) - store your own vectors + metadata; run similarity + metadata filters
- Option B: Document Index (LocalDocumentIndex) - chunk raw documents, store on disk, query via embeddings; render relevant sections

### A. LocalIndex (items + metadata)

- Steps
  1) Create an index folder and initialize
  2) Generate embeddings (any provider) and insert items with metadata
  3) Query by vector with optional metadata filter; get topK sorted by similarity
- Example (code)
  - Create index
  - Insert items with vector + metadata
  - Query with and without filter
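
The insert and query steps above can be sketched without any dependency. This is a minimal in-memory stand-in, not Vectra's LocalIndex: `TinyIndex`, `insert`, `query`, and the toy two-dimensional vectors are all hypothetical names and values chosen for illustration, and the metadata filter is a plain predicate rather than a Mongo-style filter object.

```typescript
// Hypothetical in-memory stand-in for the steps above; not Vectra's LocalIndex.
type Metadata = Record<string, string | number | boolean>;

interface Item {
  vector: number[];
  metadata: Metadata;
}

interface ScoredItem {
  item: Item;
  score: number;
}

// Scale a vector to unit length so a dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

class TinyIndex {
  private items: Item[] = [];

  // Step 2: store the normalized vector alongside its metadata.
  insert(vector: number[], metadata: Metadata): void {
    this.items.push({ vector: normalize(vector), metadata });
  }

  // Step 3: apply the optional metadata predicate, score, sort, keep topK.
  query(vector: number[], topK: number, filter?: (m: Metadata) => boolean): ScoredItem[] {
    const q = normalize(vector);
    return this.items
      .filter((item) => (filter ? filter(item.metadata) : true))
      .map((item) => ({ item, score: dot(q, item.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

const index = new TinyIndex();
index.insert([1, 0], { text: "apple", category: "fruit" });
index.insert([0, 1], { text: "carrot", category: "vegetable" });
index.insert([1, 1], { text: "tomato", category: "fruit" });

// Query without a filter, then with one.
const unfiltered = index.query([1, 0.1], 2);
const fruitOnly = index.query([0, 1], 1, (m) => m.category === "fruit");
console.log(unfiltered.map((r) => r.item.metadata.text));
console.log(fruitOnly.map((r) => r.item.metadata.text));
```

Filtering before scoring keeps the similarity loop short when the filter is selective, which is one reason to hold metadata in memory next to the vectors.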

### B. LocalDocumentIndex (documents + chunking + retrieval)

- Steps
  1) Configure embeddings via OpenAIEmbeddings (OpenAI, Azure OpenAI, or OSS)
  2) Create index and add documents (from strings, files, or web pages)
  3) Query documents and render top sections
- Example (code)
  - Initialize embeddings
  - Create index with chunking config
  - Upsert documents (uri, text, docType)
  - Query and render sections (with overlap option)
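
To make the chunking step concrete, here is a rough sketch of fixed-size chunking with overlap. It is an illustration only: `chunkText` is a hypothetical helper, sizes are counted in characters for simplicity (a real splitter would more likely count tokens), and treating `endPos` as an inclusive offset is an assumption of this sketch.

```typescript
// Hypothetical chunker; sizes are in characters, not tokens.
interface Chunk {
  text: string;
  startPos: number; // offset of the chunk's first character
  endPos: number;   // offset of the chunk's last character (inclusive, by assumption)
}

function chunkText(text: string, chunkSize: number, overlap: number): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("chunkSize must be positive and overlap must be in [0, chunkSize)");
  }
  const chunks: Chunk[] = [];
  const step = chunkSize - overlap; // how far each new chunk advances
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ text: text.slice(start, end), startPos: start, endPos: end - 1 });
    if (end === text.length) break; // last chunk reached the end of the text
  }
  return chunks;
}

// Chunks of 4 characters, each sharing 1 character with its predecessor.
const chunks = chunkText("abcdefghij", 4, 1);
console.log(chunks);
```

Larger overlap repeats more text across chunks, which costs storage and embedding calls but reduces the chance that a relevant passage is split across a boundary.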

## CLI

- Installation
  - Global install or use npx
- keys.json formats
  - OpenAI (apiKey, model)
  - Azure OpenAI (azureApiKey, azureEndpoint, azureDeployment, optional api-version)
  - OSS (ossEndpoint, ossModel)
- Commands
  - vectra create <index>
  - vectra delete <index>
  - vectra add <index> --keys keys.json --uri <url-or-file> [--list file] [--cookie str] [--chunk-size N]
  - vectra remove <index> --uri <uri> [--list file]
  - vectra stats <index>
  - vectra query <index> "<query>" --keys keys.json [--document-count N] [--chunk-count N] [--section-count N] [--tokens N] [--format sections|stats|chunks] [--overlap] [--bm25]
- Usage examples
  - Create, add web pages, query, render sections
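
The three keys.json shapes listed above can be written down as TypeScript types. This is a sketch based only on the field names in this outline: `azureApiVersion` is a hypothetical name for the "optional api-version" setting, `detectProvider` is an illustrative helper rather than part of the CLI, and the example values are placeholders.

```typescript
// Field names follow the outline; azureApiVersion is a hypothetical name.
interface OpenAIKeys {
  apiKey: string;
  model: string;
}

interface AzureOpenAIKeys {
  azureApiKey: string;
  azureEndpoint: string;
  azureDeployment: string;
  azureApiVersion?: string;
}

interface OSSKeys {
  ossEndpoint: string;
  ossModel: string;
}

type Keys = OpenAIKeys | AzureOpenAIKeys | OSSKeys;
type Provider = "openai" | "azure" | "oss";

// Decide which provider a keys.json payload describes by its required fields.
function detectProvider(raw: string): Provider {
  const keys = JSON.parse(raw) as Record<string, unknown>;
  const has = (...names: string[]): boolean =>
    names.every((n) => typeof keys[n] === "string" && keys[n] !== "");
  if (has("azureApiKey", "azureEndpoint", "azureDeployment")) return "azure";
  if (has("ossEndpoint", "ossModel")) return "oss";
  if (has("apiKey", "model")) return "openai";
  throw new Error("keys.json does not match any known provider shape");
}

// Placeholder values; substitute your own endpoint and model.
const example: Keys = { ossEndpoint: "http://localhost:8080", ossModel: "my-embedding-model" };
console.log(detectProvider(JSON.stringify(example)));
```

Checking the Azure and OSS shapes first avoids misreading a file that happens to contain a stray `apiKey` field.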

## Data Model & On-Disk Layout

- Index folder structure overview
  - index.json
  - Per-item or per-document files
- LocalIndex
  - Stored vectors
  - Indexed vs non-indexed metadata (metadata_config)
  - Unindexed metadata file-by-id
- LocalDocumentIndex
  - Document .txt and .json files
  - Chunk metadata (startPos, endPos, overlaps)
  - Catalog and index management

## Search & Filtering

- Similarity
  - Cosine similarity with pre-normalized vectors
- Metadata filters (Pinecone-compatible subset)
  - $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or
- Hybrid search (BM25) for documents
  - Optional keyword scoring and combination with semantic matches
- Result rendering
  - Render sections with token limits and optional overlap
  - Sorting by score; multiple sections per document
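
As an illustration of how the filter operators listed above could be evaluated against an item's metadata, here is a small matcher. It is a sketch of the general Mongo-style technique, not the library's own code: `matches` and `matchesCondition` are hypothetical names, and edge-case behavior (for example, how `$ne` treats a missing field) is an assumption of this sketch.

```typescript
// Hypothetical evaluator for the operator subset named above.
type Scalar = string | number | boolean;
type Metadata = Record<string, Scalar>;

type Condition =
  | Scalar
  | {
      $eq?: Scalar; $ne?: Scalar;
      $gt?: number; $gte?: number; $lt?: number; $lte?: number;
      $in?: Scalar[]; $nin?: Scalar[];
    };

interface Filter {
  $and?: Filter[];
  $or?: Filter[];
  [field: string]: Condition | Filter[] | undefined;
}

function matchesCondition(value: Scalar | undefined, cond: Condition): boolean {
  if (typeof cond !== "object") return value === cond; // bare value means equality
  if (cond.$eq !== undefined && value !== cond.$eq) return false;
  if (cond.$ne !== undefined && value === cond.$ne) return false;
  if (cond.$in !== undefined && (value === undefined || !cond.$in.includes(value))) return false;
  if (cond.$nin !== undefined && value !== undefined && cond.$nin.includes(value)) return false;
  // Range operators only succeed on numeric values in this sketch.
  const n = typeof value === "number" ? value : undefined;
  if (cond.$gt !== undefined && !(n !== undefined && n > cond.$gt)) return false;
  if (cond.$gte !== undefined && !(n !== undefined && n >= cond.$gte)) return false;
  if (cond.$lt !== undefined && !(n !== undefined && n < cond.$lt)) return false;
  if (cond.$lte !== undefined && !(n !== undefined && n <= cond.$lte)) return false;
  return true;
}

// All top-level keys must hold; $and and $or recurse into sub-filters.
function matches(metadata: Metadata, filter: Filter): boolean {
  for (const [key, cond] of Object.entries(filter)) {
    if (cond === undefined) continue;
    if (key === "$and") {
      if (!(cond as Filter[]).every((f) => matches(metadata, f))) return false;
    } else if (key === "$or") {
      if (!(cond as Filter[]).some((f) => matches(metadata, f))) return false;
    } else if (!matchesCondition(metadata[key], cond as Condition)) {
      return false;
    }
  }
  return true;
}

const item: Metadata = { category: "fruit", price: 2 };
const cheapFruit = matches(item, { category: "fruit", price: { $lt: 3 } });
const vegOrPricey = matches(item, { $or: [{ category: "vegetable" }, { price: { $gte: 5 } }] });
console.log(cheapFruit, vegOrPricey);
```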

## API Overview

- Core exports
  - LocalIndex
    - createIndex, isIndexCreated, getIndexStats
    - insertItem, batchInsertItems, deleteItem, listItems, listItemsByMetadata, getItem
    - queryItems(vector, topK, filter?)
    - beginUpdate/endUpdate (batched changes)
  - LocalDocumentIndex
    - createIndex/deleteIndex/getCatalogStats
    - upsertDocument(uri, text, docType?)
    - deleteDocument(uri)
    - queryDocuments(query, { maxDocuments, maxChunks, isBm25 })
  - LocalDocumentResult
    - chunks, score
    - loadText, loadMetadata
    - renderSections(maxTokens, maxSections, overlap?)
    - renderAllSections(maxTokens)
  - OpenAIEmbeddings (OpenAI, Azure OpenAI, OSS)
  - TextSplitter, FileFetcher, WebFetcher
  - Tokenizer utilities (GPT3Tokenizer)
  - ItemSelector (cosine similarity, metadata selection)
- Types summary (high level)
  - IndexItem, MetadataFilter, Embeddings options

## Performance & Limits

- Entire index loaded in memory for ultra-fast filtering + scoring
- Typical latency expectations for small to medium corpora
- Guidance on index size and memory footprint
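
For a rough sense of memory footprint, a back-of-the-envelope estimate may help. The sketch below assumes each vector component occupies 8 bytes (a JavaScript number held as a 64-bit float) and ignores metadata, object overhead, and the JSON text itself, so real usage will be higher. `estimateVectorBytes` is a hypothetical helper; 1536 is used as an example dimension because several OpenAI embedding models produce vectors of that size.

```typescript
// Lower-bound estimate: vectors only, 8 bytes per component by assumption.
function estimateVectorBytes(itemCount: number, dimensions: number, bytesPerComponent = 8): number {
  return itemCount * dimensions * bytesPerComponent;
}

const bytes = estimateVectorBytes(10_000, 1536);
const mebibytes = bytes / (1024 * 1024);
console.log(`${bytes} bytes is about ${mebibytes.toFixed(1)} MiB`);
// prints "122880000 bytes is about 117.2 MiB"
```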

## Best Practices

- Use separate folders to mimic namespaces
- Index only metadata fields you need for filtering
- Batch inserts and use beginUpdate/endUpdate for bulk changes
- Choose appropriate chunk size/overlap for documents

## Troubleshooting

- Common issues (missing keys.json, invalid endpoint, file permissions)
- Rate limiting and retry behavior
- Index corruption or partial updates (how to recreate)

## Contributing

- How to build, test, and lint
  - yarn install, yarn build, yarn test, yarn lint
- Open issues and PR guidelines
- Code of Conduct
- Link to CONTRIBUTING.md

## License

- MIT License

## Acknowledgements

- Inspiration from Pinecone and Qdrant
- Libraries used in this repo