@neuralsea/workspace-indexer (npm) 0.6.0 → 0.6.1

Files changed (2)

package/README.md +128 -324
package/package.json +1 -1

package/README.md CHANGED

@@ -2,32 +2,77 @@
 A **local-first**, **multi-repo** workspace indexer for AI agents (e.g. your custom agent “Damocles”).
+This package provides high-fidelity indexing, retrieval, and context expansion across entire workspaces, while remaining safe to run locally (including VS Code extension hosts).
-## Default backends
+---
+## Default index backends
+- **Catalogue / Indexing DB**: SQLite via **sql.js (WASM)**
+  Runs everywhere (Node, VS Code extension host, webview environments). No native binaries required.
+- **Vector backend**: `bruteforce` (default)
+  Zero‑config, in‑memory exact search.
+- **Graph backend**: disabled by default
+> For enterprise‑scale persistence and performance, configure a remote vector backend such as **Qdrant**, and optionally a graph backend such as **Neo4j**.
+---
+## What this package provides
+- **Whole‑workspace indexing**
+  Multiple Git repositories under a single workspace root.
+- **Meaningful chunking**
+  TypeScript/JavaScript AST‑aware chunking with robust fallbacks for other languages.
+- **Semantic embeddings**
+  Pluggable providers:
+  - Ollama (local)
+  - OpenAI
+  - Deterministic offline hash embeddings
+- **Hybrid retrieval**
+  Vector similarity + lexical search (SQLite FTS5) with configurable weights.
+- **Pluggable vector backends**
+  `bruteforce`, `hnswlib`, `qdrant`, `faiss`, or a custom provider.
+- **Enterprise‑safe invalidation**
+  Repo indices are keyed by:
+  `(repo_id, head_commit, embedder_id, index_fingerprint)`
+  Any change forces a clean rebuild to avoid stale context.
+- **Incremental updates**
+  File watching + `.git/HEAD` detection.
+- **Security controls**
+  Git‑native ignore rules, additional ignore files, and redaction hooks.
+This allows the same index to support multiple agent domains:
+- Search
+- Refactor
+- Review
+- Architecture understanding
+- RCA (root cause analysis)
+…by selecting different **retrieval profiles**.
+---
+## Index backends (vector & graph)
-- **Indexing DB**: SQLite via **sql.js (WASM)** (runs in VS Code extension hosts; no native binaries).
-- **Vector backend**: `bruteforce` by default (zero-config). For enterprise persistence and scalability, configure `qdrant`.
+Workspace‑Indexer separates **index infrastructure** from agent logic.
-It provides:
+Index backends define *where and how* indexed knowledge is stored and queried:
-- **Whole-workspace indexing** (multiple Git repos under a workspace root)
-- **Meaningful chunking** (TypeScript/JavaScript AST-aware chunking + robust fallback for other files)
-- **Semantic embeddings** (pluggable: **Ollama local**, **OpenAI**, or deterministic offline **hash**)
-- **Hybrid retrieval**: vector similarity **plus** lexical search (SQLite FTS5) with configurable weights
-- **Pluggable vector backends**: `bruteforce`, `hnswlib` (HNSW), `qdrant` (local/remote), `faiss`, or a custom provider
-- **Head-synchronised indexing (enterprise-safe invalidation)**: each repo index is keyed by `(repo_id, head_commit, embedder_id, index_fingerprint)`. Any change invalidates and forces a clean rebuild to avoid stale or mixed-context results.
-- **Fast incremental updates**: file watching + `.git/HEAD` switch detection
-- **Security controls**: respects `.gitignore` via `git ls-files`, plus `.petriignore/.augmentignore`, plus redaction hooks
+- **Catalogue DB** (files, chunks, metadata, FTS)
+- **Vector backend** (similarity search)
+- **Graph backend** (optional dependency / symbol / architecture graph)
-This package is designed so Damocles can use the same index in different problem domains:
+Backends are configured via **profiles**, allowing:
-- **Search**
-- **Refactor**
-- **Review**
-- **Architecture understanding**
-- **RCA (root cause analysis)**
+- Local or remote providers
+- Safe backend switching (automatic rebuilds)
+- Environment‑specific defaults
-…by selecting different **retrieval profiles** (k/weights/context-expansion/scope).
+> **Important:**
+> Index Backends are **not** MCP *Knowledge Servers*.
+> Knowledge Servers are reserved exclusively for MCP.
 ---
@@ -41,22 +86,28 @@ Node 18+ required.
 Docs: `docs/README.md`
-### Browser / VS Code webview
+---
+## Browser / VS Code webview
-This package publishes a browser-safe entrypoint for use in browsers and VS Code webviews:
+This package publishes a browser‑safe entrypoint:
 ```ts
 import { chunkSource, OpenAIEmbeddingsProvider } from "@neuralsea/workspace-indexer/browser";
 ```
-The full indexer (`WorkspaceIndexer`, file watching, git scanning, sqlite-on-disk, etc.) is Node-only and should run in the VS Code extension host (send data to the webview via `postMessage`).
+The full indexer (`WorkspaceIndexer`, file watching, git scanning, persistence) is **Node‑only** and should run in the VS Code extension host, communicating with webviews via `postMessage`.
 ---
 ## Quick start (library)
 ```ts
-import { WorkspaceIndexer, OllamaEmbeddingsProvider, IndexerProgressObservable } from "@neuralsea/workspace-indexer";
+import {
+  WorkspaceIndexer,
+  OllamaEmbeddingsProvider,
+  IndexerProgressObservable
+} from "@neuralsea/workspace-indexer";
 const embedder = new OllamaEmbeddingsProvider({ model: "nomic-embed-text" });
@@ -67,355 +118,108 @@ const ix = new WorkspaceIndexer("/path/to/workspace", embedder, { progress });
 await ix.indexAll();
-// Domain: search
-const search = await ix.retrieve("Where is authentication enforced?", { profile: "search" });
-// Domain: refactor (more context)
-const refactor = await ix.retrieve("Refactor the caching layer to support TTL per key", { profile: "refactor" });
-// Domain: review (changed files only)
-const review = await ix.retrieve("Explain the risk of this change", {
-  profile: "review",
-  scope: { changedOnly: true, baseRef: "origin/main" }
+const search = await ix.retrieve("Where is authentication enforced?", {
+  profile: "search"
 });
 console.log(search.hits.map(h => h.chunk.path));
-await ix.closeAsync();
-```
----
-## VS Code: high-fidelity symbol graphs (optional)
-In a VS Code extension, you can pass a `symbolGraphProvider` that uses VS Code (LSP-backed) providers to extract symbols.
-```ts
-import { WorkspaceIndexer, createVSCodeSymbolGraphProvider } from "@neuralsea/workspace-indexer";
-const symbolGraphProvider = await createVSCodeSymbolGraphProvider({
-  languages: ["typescript", "javascript", "python", "go"]
-});
-const ix = new WorkspaceIndexer(workspaceRoot, embedder, {
-  symbolGraphProvider: symbolGraphProvider ?? undefined
-});
-```
-To enable the optional Neo4j graph store, install `neo4j-driver` in your extension/app and set `workspace.graph` in config.
----
-## CLI
-### Index a workspace
-```bash
-npx petri-index index /path/to/workspace --provider ollama --model nomic-embed-text
-```
-### Watch (keeps index current)
-```bash
-npx petri-index watch /path/to/workspace --provider ollama --model nomic-embed-text
-```
-### Query (profile: search)
-```bash
-npx petri-index query "rate limiting middleware" /path/to/workspace --k 8
-```
-### Retrieve (full context bundle as JSON)
-```bash
-npx petri-index retrieve "Why are requests timing out?" /path/to/workspace \
-  --profile rca \
-  --changedOnly true \
-  --baseRef origin/main
+await ix.closeAsync();
 ```
 ---
-## Retrieval profiles (how Petri adapts per domain)
-The same index can be used differently depending on the task. The package provides defaults:
-- `search`
-  Tight top-k; favours precise matches; minimal context expansion.
+## Retrieval profiles
-- `refactor`
-  Wider k; includes adjacent chunks and follows relative imports to pull in dependent modules.
+The same index can be queried differently depending on the task.
-- `review`
-  Biases to changed files (when scoped) and includes file synopsis for reviewer context.
+Built‑in profiles:
-- `architecture`
-  Larger candidate pools; prioritises file synopses and follows imports more aggressively.
+- **search** — tight top‑k, precise matches
+- **refactor** — wider k, follows imports and adjacency
+- **review** — biases to changed files, includes file synopsis
+- **architecture** — aggressive expansion across imports
+- **rca** — review + recency bias
-- `rca`
-  Like review + recency bias (recently modified files rank higher).
+Profiles control:
-Each profile controls:
+- k (primary hits)
+- weights (vector / lexical / recency)
+- expansion rules
+- candidate pool sizes
-- **k** (how many primary hits)
-- **weights** (vector/lexical/recency)
-- **expand** (adjacent chunks, follow imports, include file synopsis)
-- **candidate pool sizes** (vectorK/lexicalK)
-You can override any of these at runtime:
-```ts
-const bundle = await ix.retrieve("Explain auth flow", {
-  profile: "architecture",
-  profileOverrides: {
-    k: 30,
-    weights: { vector: 0.6, lexical: 0.3, recency: 0.1 },
-    expand: { followImports: 5 }
-  }
-});
-```
+Profiles can be overridden at runtime.
 ---
-## Config file
+## Index backend configuration (profiles)
-The CLI supports `--config` pointing to a JSON file.
-Example: `petri-index.config.json`
+Index backends are configured using named profiles.
 ```json
 {
-  "workspace": {
-    "discovery": {
-      "exclude": ["**/vendor/**", "**/node_modules/**"],
-      "maxDepth": 8,
-      "includeSubmodules": true
+  "indexBackends": {
+    "vectorProfiles": {
+      "local-default": {
+        "kind": "local",
+        "provider": "bruteforce",
+        "metric": "cosine"
+      },
+      "qdrant-dev": {
+        "kind": "qdrant",
+        "url": "http://localhost:6333",
+        "collectionPrefix": "petri"
+      }
     },
-    "graph": {
-      "provider": "neo4j",
-      "neo4j": {
+    "graphProfiles": {
+      "none": { "kind": "none" },
+      "neo4j-local": {
+        "kind": "neo4j",
         "uri": "neo4j://localhost:7687",
         "user": "neo4j",
-        "password": "password",
+        "passwordRef": "NEO4J_PASSWORD",
         "database": "neo4j",
         "labelPrefix": "Petri"
       }
     },
-    "repoOverrides": [
-      {
-        "match": "apps/**",
-        "config": { "storage": { "ftsMode": "tokens" } }
-      }
-    ]
-  },
-  "storage": {
-    "storeText": true,
-    "ftsMode": "full"
-  },
-  "vector": {
-    "provider": "hnswlib",
-    "metric": "cosine",
-    "hnswlib": {
-      "persist": true,
-      "persistDebounceMs": 2000,
-      "efSearch": 64
-    }
-  },
-  "chunk": {
-    "maxLines": 260,
-    "overlapLines": 50
-  },
-  "profiles": {
-    "architecture": {
-      "k": 30,
-      "expand": { "followImports": 4 }
-    },
-    "rca": {
-      "weights": { "recency": 0.35 }
+    "defaults": {
+      "vectorProfile": "local-default",
+      "graphProfile": "none"
     }
   }
 }
 ```
-Run:
-```bash
-npx petri-index retrieve "How does login work?" /path/to/workspace --config petri-index.config.json --profile architecture
-```
-### Lexical modes (`storage.ftsMode`)
-- `"full"` (default): best retrieval; stores (redacted) chunk text in the FTS table.
-- `"tokens"`: stores only extracted identifiers/tokens for lexical search (less sensitive; still useful for code search).
-- `"off"`: disables lexical indexing entirely (vector-only retrieval).
----
-## Vector backends
+The selected profiles are resolved internally into runtime configuration.
-Configure the ANN backend via `vector.provider`:
+### Neo4j migration note
-- `"bruteforce"` (default): in-memory exact search, no extra dependencies
-- `"hnswlib"`: fast local ANN using HNSW via `hnswlib-node`
-- `"qdrant"`: Qdrant (local or remote) via `@qdrant/js-client-rest`
-- `"faiss"`: FAISS via `faiss-node` (rebuild-on-write; good for experimentation)
-- `"auto"`: picks the best available backend (prefers Qdrant if configured)
-- `"custom"`: load a custom provider module that implements the `VectorIndex` interface
+Earlier versions accepted Neo4j configuration under `workspace.graph`.
-### HNSW (local)
-Install:
-```bash
-npm i hnswlib-node
-```
-Config:
-```json
-{
-  "vector": {
-    "provider": "hnswlib",
-    "metric": "cosine",
-    "hnswlib": {
-      "persist": true,
-      "persistDebounceMs": 2000,
-      "m": 16,
-      "efConstruction": 200,
-      "efSearch": 64
-    }
-  }
-}
-```
-### Qdrant (local)
-Start a local Qdrant:
-```bash
-docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
-```
-Install client:
-```bash
-npm i @qdrant/js-client-rest
-```
-Config:
-```json
-{
-  "vector": {
-    "provider": "qdrant",
-    "metric": "cosine",
-    "qdrant": {
-      "url": "http://127.0.0.1:6333",
-      "collectionPrefix": "petri",
-      "mode": "commit",
-      "recreateOnRebuild": true
-    }
-  }
-}
-```
-### FAISS
-Install:
-```bash
-npm i faiss-node
-```
-Config:
-```json
-{
-  "vector": {
-    "provider": "faiss",
-    "metric": "cosine",
-    "faiss": {
-      "descriptor": "HNSW,Flat",
-      "persist": true,
-      "persistDebounceMs": 2000,
-      "rebuildStrategy": "lazy"
-    }
-  }
-}
-```
-### Custom provider
-Point `vector.custom` to an ES module that exports either:
-- a class implementing `VectorIndex`, or
-- a factory function returning a `VectorIndex`
-```json
-{
-  "vector": {
-    "provider": "custom",
-    "custom": {
-      "module": "./my-vector-provider.mjs",
-      "export": "default",
-      "options": { "foo": "bar" }
-    }
-  }
-}
-```
-## Security model
-Local indexing means **your source stays on your machine**.
-Controls:
-1. **Git-native ignore**: files are selected via:
-   - `git ls-files --cached --others --exclude-standard`
-   which honours `.gitignore` exactly.
-2. **Extra ignores**: `.petriignore` and `.augmentignore`
-3. **Redaction hooks** (on by default):
-   - skip obvious secret files by path substring
-   - redact patterns (e.g. private keys) before embedding + storage
-> For higher assurance, set `storage.ftsMode = "tokens"` and review `redact.patterns`.
+This version automatically migrates those settings into a graph profile on first run. After migration, legacy settings are ignored.
 ---
-## Output format for agents
+## Persistence semantics
-`WorkspaceIndexer.retrieve()` returns a `ContextBundle`:
+Disabling the graph backend **does not** disable index persistence.
-- `hits[]` — ranked primary chunks with scores and previews
-- `context[]` — expanded context blocks with reasons (adjacency/imports/synopsis)
-- `stats` — diagnostics useful for your agent logs
-This is a good structure for:
-- Search answers (just `hits`)
-- Multi-file refactoring (use `context` as grounded evidence)
-- Review/RCA (scope to changed files, include synopsis, bias by recency)
+Persistence of catalogue data, embeddings, and vector indices is controlled independently via storage settings.
 ---
-## Performance notes
-- Default vector backend is **bruteforce** (exact search in memory). For large repos, use:
-  - `vector.provider = "hnswlib"` for fast local ANN (HNSW)
-  - `vector.provider = "qdrant"` for durable, scalable vector search
-  - `vector.provider = "faiss"` if you already run FAISS locally
-- SQLite remains the source-of-truth for file/chunk metadata, so you can rebuild the vector index at any time.
----
-## Files ignored by default (recommended)
+## Security model
-Create a `.petriignore` in each repo to exclude heavy or noisy artefacts:
+- Git‑native ignore (`git ls-files`)
+- Additional `.petriignore` / `.augmentignore`
+- Redaction hooks before embedding and storage
-```txt
-dist/
-build/
-coverage/
-**/*.min.js
-**/*.map
-```
+For higher assurance:
+- set `storage.ftsMode = "tokens"`
+- review redaction patterns
 ---
 ## Licence
-MIT (add your own licence file if desired).
+MIT
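
The README in this diff states that repo indices are keyed by `(repo_id, head_commit, embedder_id, index_fingerprint)` and that any change forces a clean rebuild. A minimal sketch of that check might look like the following. The `IndexKey` type and `needsRebuild` helper are hypothetical names used for illustration and are not taken from the package's API.

```typescript
// Hypothetical shape of the index key described in the README (not the package's real type).
interface IndexKey {
  repoId: string;
  headCommit: string;
  embedderId: string;
  indexFingerprint: string;
}

// Returns true when no index exists yet or when any key component differs,
// which mirrors the "any change forces a clean rebuild" rule.
function needsRebuild(stored: IndexKey | undefined, current: IndexKey): boolean {
  if (stored === undefined) return true;
  return (
    stored.repoId !== current.repoId ||
    stored.headCommit !== current.headCommit ||
    stored.embedderId !== current.embedderId ||
    stored.indexFingerprint !== current.indexFingerprint
  );
}

// Illustrative values only.
const storedKey: IndexKey = {
  repoId: "repo-a",
  headCommit: "abc123",
  embedderId: "ollama:nomic-embed-text",
  indexFingerprint: "fp-1",
};

// Same key: the existing index can be reused.
const sameKey: IndexKey = { ...storedKey };

// HEAD moved to another commit: the index is considered stale.
const movedHead: IndexKey = { ...storedKey, headCommit: "def456" };

console.log(needsRebuild(storedKey, sameKey));
console.log(needsRebuild(storedKey, movedHead));
console.log(needsRebuild(undefined, sameKey));
```

Comparing the whole tuple rather than only the commit means that switching embedders or changing index settings also invalidates the index, which is the behaviour the README describes.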

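The README in this diff describes hybrid retrieval as vector similarity plus lexical search with configurable weights, and the removed 0.6.0 text showed an override of `weights: { vector: 0.6, lexical: 0.3, recency: 0.1 }`. As a rough sketch of how such weights could blend per-chunk scores, assuming all scores are already normalised to the range 0 to 1, consider the following. The `Candidate` type and the `blend` and `rank` functions are hypothetical and are not the package's actual scoring code.

```typescript
// Hypothetical types for illustration; not part of @neuralsea/workspace-indexer.
interface Weights {
  vector: number;
  lexical: number;
  recency: number;
}

interface Candidate {
  path: string;
  vector: number;  // vector similarity score, assumed in [0, 1]
  lexical: number; // lexical (e.g. FTS) score, assumed in [0, 1]
  recency: number; // recency score, assumed in [0, 1]
}

// Weighted average of the three scores; weights are normalised by their sum.
function blend(c: Candidate, w: Weights): number {
  const total = w.vector + w.lexical + w.recency;
  if (total <= 0) throw new Error("at least one weight must be positive");
  return (w.vector * c.vector + w.lexical * c.lexical + w.recency * c.recency) / total;
}

// Sorts a copy of the candidates by blended score (descending) and keeps the top k.
function rank(cands: Candidate[], w: Weights, k: number): Candidate[] {
  return [...cands].sort((a, b) => blend(b, w) - blend(a, w)).slice(0, k);
}

const weights: Weights = { vector: 0.6, lexical: 0.3, recency: 0.1 };

// Made-up candidates and scores.
const candidates: Candidate[] = [
  { path: "src/auth/middleware.ts", vector: 0.9, lexical: 0.4, recency: 0.1 },
  { path: "src/cache/ttl.ts", vector: 0.3, lexical: 0.9, recency: 0.9 },
  { path: "docs/README.md", vector: 0.2, lexical: 0.1, recency: 0.5 },
];

const top = rank(candidates, weights, 2);
console.log(top.map(c => c.path));
```

Under this sketch, raising the `recency` weight (as the `rca` profile is described as doing) would move recently modified files up the ranking without touching the index itself.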
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@neuralsea/workspace-indexer",
-  "version": "0.6.0",
+  "version": "0.6.1",
   "description": "Local-first multi-repo workspace indexer (semantic embeddings + git-aware incremental updates + hybrid retrieval profiles) for AI agents.",
   "repository": {
     "type": "git",