npm - vecito - Versions diffs - 0.1.0 - Mend

vecito 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Jeka Kiselyov
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,176 @@
+# vecito
+Tiny **hybrid semantic search** for Node and the browser — dense embeddings
+([`@huggingface/transformers`](https://github.com/huggingface/transformers.js); default
+`Xenova/all-MiniLM-L6-v2`, but any feature-extraction model works) fused with **BM25** sparse
+lexical scoring over an [`altor-vec`](https://github.com/altor-lab/altor-vec) WASM HNSW index. No server,
+no API keys.
+**Where to build the index.** Building a snapshot means embedding every document and constructing the HNSW graph — the expensive part. It's usually best to do this once on the server or with the CLI (`vecito index`), then serve the resulting `.vecito` file and load it in the browser with `Vecito.loadFromUrl()`, which restores the pre-built graph in milliseconds. Building, indexing, and adding documents directly in the browser is fully supported too — it just runs that same per-document embedding client-side, which is slow for large corpora.
+## Install
+```bash
+npm install vecito
+```
+The first run downloads the embedding model to the Hugging Face cache. By default it loads
+**quantized** weights (`dtype: 'q8'`) — about **22 MB** for the default model. Pass
+`dtype: 'fp32'` for full precision (~87 MB) if you need maximum quality.
+## Library usage
+The core library indexes **any data** — plain strings or raw JSON objects — with no
+pre-formatting. By default the searchable text is the item's flattened string values, and the
+whole object comes back as metadata on each hit.
+```js
+import { Vecito } from 'vecito';
+const v = new Vecito();
+await v.addDocuments([
+  { id: 'a', title: 'Animals', body: 'The quick brown fox jumps over the lazy dog.' },
+  { id: 'b', title: 'Search', body: 'BM25 ranks documents by term frequency and rarity.' },
+  { id: 'c', title: 'Botany', body: 'Photosynthesis converts sunlight into chemical energy.' },
+]);
+const hits = await v.search('how do plants make food?', { mode: 'hybrid', top: 3 });
+// → [{ score, metadata: { id, title, body }, dense_rank?, sparse_rank? }, ...]
+```
+Plain strings work too (`addDocuments(['some text', ...])`). For custom shapes, pass extractor
+functions:
+```js
+await v.addDocuments(rows, {
+  text: r => `${r.headline}\n${r.summary}`,  // what gets embedded + BM25-scored
+  metadata: r => ({ id: r.id }),             // what's returned with hits
+});
+```
+`mode` is `'hybrid'` (default, reciprocal-rank fusion), `'dense'` (vectors only), or `'sparse'`
+(BM25-weighted). All modes support a `filter` predicate to post-filter results by metadata.
+If the query has no in-vocabulary terms, hybrid/sparse automatically fall back to dense.
+### Options & models
+```js
+new Vecito({ model, dtype, k1, b });
+```
+Pass any transformers.js feature-extraction model as `model` — the embedding width is detected
+automatically (e.g. 384 for MiniLM/BGE-small, 768 for MPNet/GTE-base). `dtype` picks the weight
+precision: `'q8'` (quantized, the default, ~4× smaller download) or `'fp32'` (full precision);
+the chosen dtype is stored in the snapshot so loads stay consistent. `k1` / `b` tune BM25.
+`v.model`, `v.dtype`, `v.dimensions`, and `v.count` expose index state.
+`addDocuments` fits BM25 on the **first** call, then freezes it, so you can add more documents
+later (including to a loaded snapshot — see below). Dense search covers new documents fully;
+sparse scoring only sees terms already in the frozen vocabulary, so pass your whole corpus up
+front for best lexical recall.
+### Persistence
+```js
+// Node — single self-contained file (vectors + metadata + sparse + BM25 + model)
+await v.save('data.vecito');
+const loaded = await Vecito.load('data.vecito');
+// Universal — in-memory bytes (use in the browser, or to store anywhere)
+const bytes = v.exportBytes();             // Uint8Array
+const fromBytes = await Vecito.loadFromBytes(bytes);
+const fromUrl = await Vecito.loadFromUrl('https://example.com/data.vecito');
+```
+A snapshot is self-describing — it stores the model it was built with, so `load` always searches
+with the right embedder. You can keep extending a loaded index and re-save it:
+```js
+const loaded = await Vecito.load('data.vecito');
+await loaded.addDocuments(moreItems);
+await loaded.save('data.vecito');
+```
+The primitives are exported too if you want to wire them yourself:
+`import { Embedder, BM25, VecStore } from 'vecito'`.
+## File indexing (`vecito/file`)
+A thin Node-only layer on top of the core turns files and directories into indexed documents.
+It's a separate subpath import, so the core stays browser-safe.
+```js
+import { indexDirectory } from 'vecito/file';
+const v = await indexDirectory('./docs');   // → a ready Vecito
+await v.save('docs.vecito');
+const hits = await v.search('renewable energy sources', { top: 5 });
+```
+Each file becomes one document: `.json`/`.jsonl` are parsed to objects (then flattened),
+everything else is indexed as raw text; metadata is `{ path, name }`. Options: `ext` (extension
+allowlist), `hidden` (include dotfiles, off by default), `limit`, `model`. Also exported:
+`indexFiles(paths, opts)`, `walk(dir, opts)`, and `DEFAULT_EXTENSIONS`.
+## Browser
+The core (`Embedder` + `BM25` + `VecStore` + `Vecito`) runs in the browser unchanged — it only
+needs a dev server that resolves bare imports and serves the two WASM deps. A ready-to-run smoke
+test lives in [`browser/index.html`](browser/index.html):
+```bash
+pnpm install
+pnpm dev:browser   # vite — opens browser/index.html
+```
+It imports `vecito`, loads pre-built snapshots or embeds documents from scratch in the altor-vec
+WASM HNSW store, and runs hybrid search entirely client-side (the only network call is the
+one-time ~22 MB quantized model download from the Hugging Face CDN). Don't import `vecito/file`
+in the browser — it's the Node-only layer.
+The included [`vite.config.js`](vite.config.js) keeps `@huggingface/transformers` and `altor-vec`
+out of dependency pre-bundling (`optimizeDeps.exclude`) so their `import.meta.url`-relative
+`.wasm` URLs resolve correctly. Any bundler works as long as it does the same.
+## CLI
+Install globally to get the `vecito` command on your `PATH`:
+```bash
+pnpm add -g vecito
+```
+Or run it without installing via `pnpm dlx vecito …`.
+```bash
+# Index the current directory into ./data.vecito
+vecito index
+# …or a specific directory / output file
+vecito index ./docs -o docs.vecito
+# Search (path is optional; defaults to data.vecito in the current directory)
+vecito search "renewable energy sources" --mode hybrid --top 5
+vecito search "renewable energy sources" docs.vecito --top 5
+```
+`index` recursively walks the directory, indexing a broad set of text/data/code extensions
+(`.md`, `.txt`, `.json`, `.yaml`, `.js`, `.py`, … — override with `--ext .md,.txt`).
+**Dotfiles and dot-directories are skipped by default** (pass `--hidden` to include them).
+`.json` files are flattened to their string values before indexing.
+The trailing path is optional and **defaults to the current directory** — `index` scans cwd, and
+`search` loads `data.vecito` from cwd (a directory path resolves to `data.vecito` inside it).
+```
+vecito index [dir] [-o data.vecito] [--mode dense|hybrid] [--ext .md,.txt,...] [--hidden] [--limit N]
+vecito search <query> [path] [--mode dense|sparse|hybrid] [--top N] [--filter <expr>]
+```
+## Sample data
+The pre-built snapshots in `sampledata/` are derived from the [Books Dataset](https://www.kaggle.com/datasets/saurabhbagchi/books-dataset/data) on Kaggle.
+## License
+MIT © Jeka Kiselyov
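
Note on `mode: 'hybrid'`: the README names reciprocal-rank fusion but the diff does not show how the dense and BM25 rankings are combined. The sketch below is the textbook form of the technique, not vecito's implementation. The helper name `rrfFuse` and the constant `k = 60` are assumptions chosen for illustration; the package may use a different constant or weighting.

```js
// Reciprocal-rank fusion (RRF): every ranked list contributes 1 / (k + rank)
// to each document it contains, and documents are re-sorted by the summed score.
// `rrfFuse` and k = 60 are illustrative assumptions, not part of vecito's API.
function rrfFuse(denseIds, sparseIds, k = 60) {
  const scores = new Map();
  for (const ids of [denseIds, sparseIds]) {
    ids.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// 'b' ranks 2nd (dense) and 1st (sparse); 'a' ranks 1st and 3rd.
const fused = rrfFuse(['a', 'b', 'c'], ['b', 'd', 'a']);
console.log(fused.map(r => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

Because only ranks enter the formula, the two lists do not need comparable score scales, which is presumably why the README pairs it with cosine-style dense scores and BM25 weights.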

package/bin/cli.js ADDED

@@ -0,0 +1,181 @@
+#!/usr/bin/env node
+import { existsSync, statSync } from 'fs';
+import { join } from 'path';
+import { Vecito } from '../lib/vecito.js';
+import { walk, indexFiles } from '../lib/file-index.js';
+/** Default index filename, used for both output and directory-relative lookup. */
+const DEFAULT_INDEX = 'data.vecito';
+/** Flags that consume the following argv token as their value. */
+const VALUE_FLAGS = new Set(['-o', '--out', '--ext', '--limit', '--mode', '--top', '--filter']);
+/**
+ * Read the value following a flag in argv.
+ * @param {string} name Flag, e.g. '--top'.
+ * @param {*} fallback Value returned when the flag is absent.
+ * @returns {string|*} The argument after the flag, or `fallback`.
+ */
+function flag(name, fallback) {
+	const i = process.argv.indexOf(name);
+	return i !== -1 && i + 1 < process.argv.length ? process.argv[i + 1] : fallback;
+}
+/**
+ * Whether a boolean flag is present in argv.
+ * @param {string} name Flag, e.g. '--hidden'.
+ * @returns {boolean}
+ */
+function hasFlag(name) {
+	return process.argv.includes(name);
+}
+/**
+ * Extract positional arguments for the current command, skipping flags and the
+ * values they consume.
+ * @returns {string[]} Positional args in order (after the command word).
+ */
+function positionals() {
+	const args = process.argv.slice(3); // after `node cli.js <command>`
+	const out = [];
+	for (let i = 0; i < args.length; i++) {
+		const a = args[i];
+		if (a.startsWith('-')) {
+			if (VALUE_FLAGS.has(a)) i++; // skip this flag's value
+			continue;
+		}
+		out.push(a);
+	}
+	return out;
+}
+/**
+ * Resolve a search path to an index file: a directory (or the cwd default)
+ * resolves to `data.vecito` inside it; a file path is used as-is.
+ * @param {string} [p] Path argument; defaults to the current directory.
+ * @returns {string} Path to the index file.
+ */
+function resolveIndexPath(p) {
+	const target = p || '.';
+	if (existsSync(target) && statSync(target).isDirectory()) return join(target, DEFAULT_INDEX);
+	return target;
+}
+/**
+ * Print CLI usage to stderr.
+ * @returns {void}
+ */
+function usage() {
+	console.error(`vecito — hybrid (dense + BM25) semantic search
+Usage:
+  vecito index [dir] [-o data.vecito] [--mode dense|hybrid] [--ext .md,.txt,...] [--hidden] [--limit N]
+  vecito search <query> [path] [--mode dense|sparse|hybrid] [--top N] [--filter <expr>]
+The trailing path is optional and defaults to the current directory.
+Index options:
+  -o, --out <file>   Output index file (default: data.vecito)
+  --mode <m>         Index mode: hybrid (default, dense+BM25) or dense (vectors only, smaller file)
+  --ext <list>       Comma-separated extensions to index (default: broad text set)
+  --hidden           Include dotfiles and dot-directories (skipped by default)
+  --limit <n>        Index at most n files
+Search options:
+  --mode <m>         Search mode: hybrid (default), dense, or sparse
+  --top <n>          Number of results (default: 10)
+  --filter <expr>    JS expression over metadata, e.g. 'meta.category === "science"'`);
+}
+/**
+ * `vecito index [dir]` — walk a directory (default: cwd) and write a
+ * self-contained index. Thin wrapper over the file layer (lib/file-index.js).
+ * @returns {Promise<void>}
+ */
+async function cmdIndex() {
+	const dir = positionals().pop() || '.';
+	const out = flag('-o', flag('--out', DEFAULT_INDEX));
+	const extArg = flag('--ext', null);
+	const ext = extArg ? extArg.split(',') : undefined;
+	const hidden = hasFlag('--hidden');
+	const limitArg = flag('--limit', null);
+	const limit = limitArg ? parseInt(limitArg, 10) : undefined;
+	const mode = flag('--mode', 'hybrid');
+	if (mode !== 'hybrid' && mode !== 'dense') {
+		console.error(`Unknown index mode "${mode}". Use hybrid (default) or dense.`);
+		process.exit(1);
+	}
+	const files = walk(dir, { ext, hidden, limit });
+	console.log(`Found ${files.length} file(s) to index in ${dir}`);
+	const label = mode === 'dense' ? 'Embedding (dense only)...' : 'Embedding + fitting BM25...';
+	console.log(label);
+	const vecito = await indexFiles(files, { base: dir, mode });
+	await vecito.save(out);
+	console.log(`Indexed ${vecito.count} document(s) [${vecito.indexMode}] → ${out}`);
+}
+/**
+ * `vecito search <query> [path]` — load an index (default: data.vecito in the
+ * current directory) and print ranked results.
+ * @returns {Promise<void>}
+ */
+async function cmdSearch() {
+	const pos = positionals();
+	const query = pos[0];
+	if (!query) { usage(); process.exit(1); }
+	const indexFile = resolveIndexPath(pos[1]);
+	const mode = flag('--mode', 'hybrid');
+	const top = parseInt(flag('--top', '10'), 10);
+	const filterExpr = flag('--filter', undefined);
+	let filter;
+	if (filterExpr) {
+		try {
+			filter = new Function('meta', `return (${filterExpr})`);
+		} catch (e) {
+			console.error(`Invalid --filter expression: ${e.message}`);
+			process.exit(1);
+		}
+	}
+	if (!existsSync(indexFile)) {
+		console.error(`Index not found: ${indexFile}`);
+		process.exit(1);
+	}
+	const vecito = await Vecito.load(indexFile);
+	const effectiveMode = vecito.indexMode === 'dense' ? 'dense' : mode;
+	console.log(`Loaded ${vecito.count} doc(s) from ${indexFile} [index: ${vecito.indexMode}, search: ${effectiveMode}]\n`);
+	console.log(`Results for "${query}"${filterExpr ? ` (filter: ${filterExpr})` : ''}:\n`);
+	const results = await vecito.search(query, { mode: effectiveMode, top, filter });
+	if (results.length === 0) {
+		console.log('(no results)');
+		return;
+	}
+	for (let i = 0; i < results.length; i++) {
+		const r = results[i];
+		const m = r.metadata || {};
+		const score = r.score?.toFixed(4) ?? '-';
+		const ranks = [];
+		if (r.dense_rank) ranks.push(`dense:#${r.dense_rank}`);
+		if (r.sparse_rank) ranks.push(`sparse:#${r.sparse_rank}`);
+		const rankStr = ranks.length ? ` (${ranks.join(', ')})` : '';
+		console.log(`${i + 1}. [${score}] ${m.name || m.path || '?'}${rankStr}`);
+		if (m.path && m.path !== m.name) console.log(`   ${m.path}`);
+	}
+}
+const cmd = process.argv[2];
+if (cmd === 'index') {
+	await cmdIndex();
+} else if (cmd === 'search') {
+	await cmdSearch();
+} else {
+	usage();
+	process.exit(1);
+}
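
Note on `--filter`: `cmdSearch` turns the expression into a predicate with `new Function('meta', ...)`, so the string runs as arbitrary JavaScript and should only come from a trusted source. The sketch below isolates that step. The helper name `compileFilter`, the boolean coercion, and the choice to treat a runtime error as "no match" are assumptions for illustration; the CLI itself passes the raw function to `search`.

```js
// Compile a --filter expression into a predicate over a hit's metadata.
// `compileFilter` is a hypothetical helper; only the `new Function('meta', ...)`
// call mirrors cmdSearch in bin/cli.js.
function compileFilter(expr) {
  // Throws SyntaxError here if the expression does not parse.
  const fn = new Function('meta', `return (${expr})`);
  return meta => {
    try {
      return Boolean(fn(meta)); // coerce truthy/falsy results to a boolean
    } catch {
      return false; // assumption: a runtime error (e.g. missing nested field) means "no match"
    }
  };
}

const isScience = compileFilter('meta.category === "science"');
console.log(isScience({ category: 'science' })); // true
console.log(isScience({ category: 'art' }));     // false
```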

package/file.d.ts ADDED

@@ -0,0 +1,27 @@
+import { Vecito } from './index';
+export const DEFAULT_EXTENSIONS: string[];
+export interface WalkOptions {
+	ext?: string[];
+	hidden?: boolean;
+	limit?: number;
+}
+export interface IndexOptions extends WalkOptions {
+	model?: string;
+	dtype?: string;
+	mode?: 'hybrid' | 'dense';
+}
+/** Recursively collect matching file paths, skipping dotfiles by default. */
+export function walk(dir: string, opts?: WalkOptions): string[];
+/** Index an explicit list of files into a fresh Vecito. */
+export function indexFiles(
+	paths: string[],
+	opts?: { model?: string; dtype?: string; mode?: 'hybrid' | 'dense'; base?: string }
+): Promise<Vecito>;
+/** Walk a directory and index every matching file into a fresh Vecito. */
+export function indexDirectory(dir: string, opts?: IndexOptions): Promise<Vecito>;
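
Note on JSON flattening: the README says `.json`/`.jsonl` files are parsed and flattened to their string values before indexing, but `lib/file-index.js` is not part of this excerpt. The sketch below shows one plausible reading of "flattened". The name `flattenStrings`, the depth-first order, the newline joiner, and the decision to skip numbers and keys are all assumptions, not the package's confirmed behavior.

```js
// Collect every string value in a parsed JSON value, depth-first.
// `flattenStrings` is a hypothetical helper, not an export of vecito/file.
function flattenStrings(value, out = []) {
  if (typeof value === 'string') {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (const v of value) flattenStrings(v, out);
  } else if (value && typeof value === 'object') {
    for (const v of Object.values(value)) flattenStrings(v, out);
  }
  return out; // numbers, booleans, and null are ignored
}

const doc = { id: 7, title: 'Botany', tags: ['plants', 'energy'], meta: { note: 'draft' } };
console.log(flattenStrings(doc).join('\n'));
```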

package/index.d.ts ADDED

@@ -0,0 +1,113 @@
+export interface SparseVector {
+	indices: Uint32Array;
+	values: Float32Array;
+	dim: number;
+}
+export interface SearchResult {
+	id?: number;
+	score?: number;
+	dense_rank?: number;
+	sparse_rank?: number;
+	metadata: Record<string, any>;
+	/** BM25-matched (hybrid) or tokenized (dense) terms. Present only when search was called with `{ matchedTerms: true }`. */
+	matchedTerms?: string[];
+}
+export class Highlighter {
+	/** Escape HTML special characters in a plain-text string. */
+	static escape(s: string): string;
+	/** Tokenize a query string for dense-mode fallback highlighting. */
+	static tokenize(text: string): string[];
+	/** Wrap occurrences of `terms` in `text` with `<mark>` tags. Matching is stem-aware: "run" matches "running", "adventure" matches "adventures". */
+	static highlight(text: string, terms: string[] | Set<string>): string;
+	/** Extract a snippet centred on the first stem match (plain text — pass to highlight for markup). */
+	static snippet(text: string, terms: string[] | Set<string>, maxLen?: number): string;
+}
+export class Embedder {
+	constructor(opts?: { model?: string; dtype?: string });
+	init(): Promise<void>;
+	embed(text: string): Promise<Float32Array>;
+	embedBatch(texts: string[], opts?: { batchSize?: number }): Promise<Float32Array[]>;
+	get dimensions(): number;
+	get dtype(): string;
+	get model(): string;
+}
+export class BM25 {
+	constructor(opts?: { k1?: number; b?: number });
+	fit(texts: string[]): void;
+	score(text: string): SparseVector;
+	scoreAll(texts: string[]): SparseVector[];
+	/** Map a query string to the in-vocabulary term ids it contains. */
+	scoreQuery(queryText: string): { indices: number[]; vocabSize: number };
+	querySparse(queryText: string): SparseVector;
+	/** Map vocabulary term ids back to their original term strings (unknown ids omitted). */
+	termsForIndices(indices: Uint32Array | number[]): string[];
+	toJSON(): Record<string, any>;
+	static fromJSON(data: Record<string, any>): BM25;
+	get vocabSize(): number;
+}
+export class VecStore {
+	constructor(opts: { dimensions: number });
+	init(): Promise<void>;
+	insert(vector: Float32Array | number[], metadata?: Record<string, any>): number;
+	initSparse(): void;
+	insertSparse(sparse: SparseVector): void;
+	search(query: Float32Array, k?: number): Promise<SearchResult[]>;
+	/** Post-filters HNSW candidates with a JS predicate over metadata objects. */
+	searchWithFilter(query: Float32Array, filter: (meta: Record<string, any>) => boolean, k?: number): Promise<SearchResult[]>;
+	hybridSearch(
+		denseQuery: Float32Array,
+		sparse: SparseVector,
+		k?: number,
+		opts?: { dense_k?: number; sparse_k?: number; fusion?: any }
+	): SearchResult[];
+	save(filePath: string): Promise<void>;
+	/** Alias for {@link VecStore.save}. */
+	exportToFile(filePath: string): Promise<void>;
+	exportBytes(): Uint8Array;
+	static load(filePath: string): Promise<VecStore>;
+	static loadFromBytes(bytes: Uint8Array): Promise<VecStore>;
+	static loadFromUrl(url: string): Promise<VecStore>;
+	get count(): number;
+	/** Dense vector width of the index. */
+	get dimensions(): number;
+}
+export interface AddOptions {
+	/** Extract searchable text from an item (default: flatten string values). */
+	text?: (item: any) => string;
+	/** Extract metadata returned with hits (default: the object itself). */
+	metadata?: (item: any) => Record<string, any>;
+}
+export interface VecitoSearchOptions {
+	mode?: 'hybrid' | 'dense' | 'sparse';
+	top?: number;
+	/** JS predicate over metadata — post-filters results in any mode, over-fetching to preserve the requested count. */
+	filter?: (meta: Record<string, any>) => boolean;
+	/** When true, each result includes `matchedTerms` for use with `Highlighter.highlight`. */
+	matchedTerms?: boolean;
+}
+export class Vecito {
+	constructor(opts?: { model?: string; dtype?: string; embedder?: Embedder; mode?: 'hybrid' | 'dense'; k1?: number; b?: number });
+	addDocuments(items: any | any[], opts?: AddOptions): Promise<this>;
+	search(query: string, opts?: VecitoSearchOptions): Promise<SearchResult[]>;
+	exportBytes(): Uint8Array;
+	save(path: string): Promise<void>;
+	static load(path: string): Promise<Vecito>;
+	static loadFromBytes(bytes: Uint8Array | ArrayBuffer): Promise<Vecito>;
+	static loadFromUrl(url: string): Promise<Vecito>;
+	get count(): number;
+	get model(): string;
+	/** Weight precision the embedder loads (e.g. 'q8', 'fp32'). */
+	get dtype(): string;
+	/** Dense vector width of the index, or null before anything is indexed. */
+	get dimensions(): number | null;
+	/** Index mode this instance was built with ('hybrid' or 'dense'). */
+	get indexMode(): 'hybrid' | 'dense';
+}
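
Note on `filter`: the declaration for `VecitoSearchOptions.filter` says results are post-filtered while over-fetching to preserve the requested count. One common way to do that is to widen the candidate pool until enough hits survive the predicate. The sketch below assumes that strategy; `searchWithPostFilter`, `fetchTopK`, and the doubling schedule are hypothetical, and it is synchronous for brevity where a real store call would likely return a promise.

```js
// Post-filter with over-fetching: ask for k candidates, keep those passing the
// predicate, and double k until `top` hits survive or the index is exhausted.
// All names here are hypothetical; this is not vecito's actual code path.
function searchWithPostFilter(fetchTopK, filter, top, total) {
  let k = top;
  while (true) {
    const candidates = fetchTopK(k); // ranked, best first
    const kept = candidates.filter(c => filter(c.metadata));
    if (kept.length >= top || k >= total) return kept.slice(0, top);
    k = Math.min(k * 2, total); // widen the pool and retry
  }
}

// Stand-in for a ranked index of 10 documents.
const corpus = Array.from({ length: 10 }, (_, i) => ({
  score: 1 - i / 10,
  metadata: { id: i, even: i % 2 === 0 },
}));
const fetchTopK = k => corpus.slice(0, k);

const hits = searchWithPostFilter(fetchTopK, m => m.even, 3, corpus.length);
console.log(hits.map(h => h.metadata.id)); // [ 0, 2, 4 ]
```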

package/index.js ADDED

@@ -0,0 +1,5 @@
+export { Embedder } from './lib/embedder.js';
+export { BM25 } from './lib/bm25.js';
+export { VecStore } from './lib/vec-store.js';
+export { Vecito } from './lib/vecito.js';
+export { Highlighter } from './lib/highlight.js';
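
Note on the `BM25` export: its constructor accepts `k1` and `b`, but `lib/bm25.js` is not included in this excerpt. For readers unfamiliar with the two parameters, the sketch below implements the textbook BM25 formula. The tokenizer, the IDF variant, and the defaults `k1 = 1.2` and `b = 0.75` are common conventions assumed here, not values confirmed for vecito; the function names are hypothetical.

```js
// Textbook BM25: `k1` controls term-frequency saturation and `b` controls
// document-length normalization. Not vecito's implementation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Scores(docs, query, { k1 = 1.2, b = 0.75 } = {}) {
  const tokens = docs.map(tokenize);
  const avgLen = tokens.reduce((n, t) => n + t.length, 0) / (tokens.length || 1);
  const queryTerms = [...new Set(tokenize(query))];
  return tokens.map(doc => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter(t => t === term).length; // term frequency in this doc
      if (tf === 0) continue;
      const df = tokens.filter(t => t.includes(term)).length; // docs containing the term
      const idf = Math.log(1 + (tokens.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}

const scores = bm25Scores(
  [
    'The quick brown fox jumps over the lazy dog.',
    'BM25 ranks documents by term frequency and rarity.',
    'Photosynthesis converts sunlight into chemical energy.',
  ],
  'term frequency'
);
console.log(scores); // only the second document scores above zero
```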