@neuralsea/workspace-indexer 0.6.0 → 0.6.1

Files changed (2)
  1. package/README.md +128 -324
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -2,32 +2,77 @@
2
2
 
3
3
  A **local-first**, **multi-repo** workspace indexer for AI agents (e.g. your custom agent “Damocles”).
4
4
 
5
+ This package provides high-fidelity indexing, retrieval, and context expansion across entire workspaces, while remaining safe to run locally (including VS Code extension hosts).
5
6
 
6
- ## Default backends
7
+ ---
8
+
9
+ ## Default index backends
10
+
11
+ - **Catalogue / Indexing DB**: SQLite via **sql.js (WASM)**
12
+ Runs everywhere (Node, VS Code extension host, webview environments). No native binaries required.
13
+ - **Vector backend**: `bruteforce` (default)
14
+ Zero‑config, in‑memory exact search.
15
+ - **Graph backend**: disabled by default
16
+
17
+ > For enterprise‑scale persistence and performance, configure a remote vector backend such as **Qdrant**, and optionally a graph backend such as **Neo4j**.
18
+
19
+ ---
20
+
21
+ ## What this package provides
22
+
23
+ - **Whole‑workspace indexing**
24
+ Multiple Git repositories under a single workspace root.
25
+ - **Meaningful chunking**
26
+ TypeScript/JavaScript AST‑aware chunking with robust fallbacks for other languages.
27
+ - **Semantic embeddings**
28
+ Pluggable providers:
29
+ - Ollama (local)
30
+ - OpenAI
31
+ - Deterministic offline hash embeddings
32
+ - **Hybrid retrieval**
33
+ Vector similarity + lexical search (SQLite FTS5) with configurable weights.
34
+ - **Pluggable vector backends**
35
+ `bruteforce`, `hnswlib`, `qdrant`, `faiss`, or a custom provider.
36
+ - **Enterprise‑safe invalidation**
37
+ Repo indices are keyed by:
38
+ `(repo_id, head_commit, embedder_id, index_fingerprint)`
39
+ Any change forces a clean rebuild to avoid stale context.
40
+ - **Incremental updates**
41
+ File watching + `.git/HEAD` detection.
42
+ - **Security controls**
43
+ Git‑native ignore rules, additional ignore files, and redaction hooks.
44
+
45
+ This allows the same index to support multiple agent domains:
46
+
47
+ - Search
48
+ - Refactor
49
+ - Review
50
+ - Architecture understanding
51
+ - RCA (root cause analysis)
52
+
53
+ …by selecting different **retrieval profiles**.
54
+
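The invalidation rule listed above (a repo index keyed by `(repo_id, head_commit, embedder_id, index_fingerprint)`, with any change forcing a clean rebuild) can be sketched roughly as follows. The `IndexKey` type, the `needsRebuild` helper, and the sample values are hypothetical names chosen for illustration; they are not taken from the package's API.

```ts
// Hypothetical shape of the invalidation key described in the README.
interface IndexKey {
  repoId: string;
  headCommit: string;
  embedderId: string;
  indexFingerprint: string;
}

// Serialise the four components so keys can be compared with one string check.
function serialiseKey(key: IndexKey): string {
  return JSON.stringify([key.repoId, key.headCommit, key.embedderId, key.indexFingerprint]);
}

// A stored index is reusable only when every component matches the current state.
function needsRebuild(stored: IndexKey | undefined, current: IndexKey): boolean {
  return stored === undefined || serialiseKey(stored) !== serialiseKey(current);
}

// Sample values are made up for the example.
const stored: IndexKey = {
  repoId: "app",
  headCommit: "abc123",
  embedderId: "ollama:nomic-embed-text",
  indexFingerprint: "v1",
};

console.log(needsRebuild(stored, { ...stored }));                       // false
console.log(needsRebuild(stored, { ...stored, headCommit: "def456" })); // true
```

Comparing the whole key rather than individual fields means a new component can be added to the key without touching the comparison logic.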
55
+ ---
56
+
57
+ ## Index backends (vector & graph)
7
58
 
8
- - **Indexing DB**: SQLite via **sql.js (WASM)** (runs in VS Code extension hosts; no native binaries).
9
- - **Vector backend**: `bruteforce` by default (zero-config). For enterprise persistence and scalability, configure `qdrant`.
59
+ Workspace‑Indexer separates **index infrastructure** from agent logic.
10
60
 
11
- It provides:
61
+ Index backends define *where and how* indexed knowledge is stored and queried:
12
62
 
13
- - **Whole-workspace indexing** (multiple Git repos under a workspace root)
14
- - **Meaningful chunking** (TypeScript/JavaScript AST-aware chunking + robust fallback for other files)
15
- - **Semantic embeddings** (pluggable: **Ollama local**, **OpenAI**, or deterministic offline **hash**)
16
- - **Hybrid retrieval**: vector similarity **plus** lexical search (SQLite FTS5) with configurable weights
17
- - **Pluggable vector backends**: `bruteforce`, `hnswlib` (HNSW), `qdrant` (local/remote), `faiss`, or a custom provider
18
- - **Head-synchronised indexing (enterprise-safe invalidation)**: each repo index is keyed by `(repo_id, head_commit, embedder_id, index_fingerprint)`. Any change invalidates and forces a clean rebuild to avoid stale or mixed-context results.
19
- - **Fast incremental updates**: file watching + `.git/HEAD` switch detection
20
- - **Security controls**: respects `.gitignore` via `git ls-files`, plus `.petriignore/.augmentignore`, plus redaction hooks
63
+ - **Catalogue DB** (files, chunks, metadata, FTS)
64
+ - **Vector backend** (similarity search)
65
+ - **Graph backend** (optional dependency / symbol / architecture graph)
21
66
 
22
- This package is designed so Damocles can use the same index in different problem domains:
67
+ Backends are configured via **profiles**, allowing:
23
68
 
24
- - **Search**
25
- - **Refactor**
26
- - **Review**
27
- - **Architecture understanding**
28
- - **RCA (root cause analysis)**
69
+ - Local or remote providers
70
+ - Safe backend switching (automatic rebuilds)
71
+ - Environment‑specific defaults
29
72
 
30
- …by selecting different **retrieval profiles** (k/weights/context-expansion/scope).
73
+ > **Important:**
74
+ > Index Backends are **not** MCP *Knowledge Servers*.
75
+ > Knowledge Servers are reserved exclusively for MCP.
31
76
 
32
77
  ---
33
78
 
@@ -41,22 +86,28 @@ Node 18+ required.
41
86
 
42
87
  Docs: `docs/README.md`
43
88
 
44
- ### Browser / VS Code webview
89
+ ---
90
+
91
+ ## Browser / VS Code webview
45
92
 
46
- This package publishes a browser-safe entrypoint for use in browsers and VS Code webviews:
93
+ This package publishes a browser‑safe entrypoint:
47
94
 
48
95
  ```ts
49
96
  import { chunkSource, OpenAIEmbeddingsProvider } from "@neuralsea/workspace-indexer/browser";
50
97
  ```
51
98
 
52
- The full indexer (`WorkspaceIndexer`, file watching, git scanning, sqlite-on-disk, etc.) is Node-only and should run in the VS Code extension host (send data to the webview via `postMessage`).
99
+ The full indexer (`WorkspaceIndexer`, file watching, git scanning, persistence) is **Node‑only** and should run in the VS Code extension host, communicating with webviews via `postMessage`.
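One way to structure the extension-host-to-webview traffic mentioned above is a small typed message protocol. This is only a sketch: the `HostToWebview` union, the `MessageSink` interface, and `sendHits` are hypothetical names, and `MessageSink` is merely intended to be structurally compatible with an object exposing a `postMessage(message)` method, such as a VS Code webview.

```ts
// Hypothetical messages the extension host might send to a webview.
type HostToWebview =
  | { type: "hits"; query: string; paths: string[] }
  | { type: "error"; message: string };

// Minimal structural type for anything that can receive a posted message.
interface MessageSink {
  postMessage(message: unknown): unknown;
}

// Wrap retrieval results in a typed message before posting them.
function sendHits(sink: MessageSink, query: string, paths: string[]): void {
  const msg: HostToWebview = { type: "hits", query, paths };
  sink.postMessage(msg);
}

// Stand-in sink that records messages, used here instead of a real webview.
const sent: unknown[] = [];
const fakeSink: MessageSink = { postMessage: (m) => sent.push(m) };

sendHits(fakeSink, "Where is authentication enforced?", ["src/auth.ts"]);
console.log(sent.length); // 1
```

Only plain, serialisable data crosses the boundary, so the indexer itself stays in the Node-side host.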
53
100
 
54
101
  ---
55
102
 
56
103
  ## Quick start (library)
57
104
 
58
105
  ```ts
59
- import { WorkspaceIndexer, OllamaEmbeddingsProvider, IndexerProgressObservable } from "@neuralsea/workspace-indexer";
106
+ import {
107
+   WorkspaceIndexer,
108
+   OllamaEmbeddingsProvider,
109
+   IndexerProgressObservable
110
+ } from "@neuralsea/workspace-indexer";
60
111
 
61
112
  const embedder = new OllamaEmbeddingsProvider({ model: "nomic-embed-text" });
62
113
 
@@ -67,355 +118,108 @@ const ix = new WorkspaceIndexer("/path/to/workspace", embedder, { progress });
67
118
 
68
119
  await ix.indexAll();
69
120
 
70
- // Domain: search
71
- const search = await ix.retrieve("Where is authentication enforced?", { profile: "search" });
72
-
73
- // Domain: refactor (more context)
74
- const refactor = await ix.retrieve("Refactor the caching layer to support TTL per key", { profile: "refactor" });
75
-
76
- // Domain: review (changed files only)
77
- const review = await ix.retrieve("Explain the risk of this change", {
78
- profile: "review",
79
- scope: { changedOnly: true, baseRef: "origin/main" }
121
+ const search = await ix.retrieve("Where is authentication enforced?", {
122
+   profile: "search"
80
123
  });
81
124
 
82
125
  console.log(search.hits.map(h => h.chunk.path));
83
- await ix.closeAsync();
84
- ```
85
-
86
- ---
87
-
88
- ## VS Code: high-fidelity symbol graphs (optional)
89
-
90
- In a VS Code extension, you can pass a `symbolGraphProvider` that uses VS Code (LSP-backed) providers to extract symbols.
91
-
92
- ```ts
93
- import { WorkspaceIndexer, createVSCodeSymbolGraphProvider } from "@neuralsea/workspace-indexer";
94
-
95
- const symbolGraphProvider = await createVSCodeSymbolGraphProvider({
96
- languages: ["typescript", "javascript", "python", "go"]
97
- });
98
-
99
- const ix = new WorkspaceIndexer(workspaceRoot, embedder, {
100
- symbolGraphProvider: symbolGraphProvider ?? undefined
101
- });
102
- ```
103
-
104
- To enable the optional Neo4j graph store, install `neo4j-driver` in your extension/app and set `workspace.graph` in config.
105
-
106
- ---
107
-
108
- ## CLI
109
-
110
- ### Index a workspace
111
- ```bash
112
- npx petri-index index /path/to/workspace --provider ollama --model nomic-embed-text
113
- ```
114
-
115
- ### Watch (keeps index current)
116
- ```bash
117
- npx petri-index watch /path/to/workspace --provider ollama --model nomic-embed-text
118
- ```
119
-
120
- ### Query (profile: search)
121
- ```bash
122
- npx petri-index query "rate limiting middleware" /path/to/workspace --k 8
123
- ```
124
126
 
125
- ### Retrieve (full context bundle as JSON)
126
- ```bash
127
- npx petri-index retrieve "Why are requests timing out?" /path/to/workspace \
128
- --profile rca \
129
- --changedOnly true \
130
- --baseRef origin/main
127
+ await ix.closeAsync();
131
128
  ```
132
129
 
133
130
  ---
134
131
 
135
- ## Retrieval profiles (how Petri adapts per domain)
136
-
137
- The same index can be used differently depending on the task. The package provides defaults:
138
-
139
- - `search`
140
- Tight top-k; favours precise matches; minimal context expansion.
132
+ ## Retrieval profiles
141
133
 
142
- - `refactor`
143
- Wider k; includes adjacent chunks and follows relative imports to pull in dependent modules.
134
+ The same index can be queried differently depending on the task.
144
135
 
145
- - `review`
146
- Biases to changed files (when scoped) and includes file synopsis for reviewer context.
136
+ Built‑in profiles:
147
137
 
148
- - `architecture`
149
- Larger candidate pools; prioritises file synopses and follows imports more aggressively.
138
+ - **search** — tight top‑k, precise matches
139
+ - **refactor** — wider k, follows imports and adjacency
140
+ - **review** — biases to changed files, includes file synopsis
141
+ - **architecture** — aggressive expansion across imports
142
+ - **rca** — review + recency bias
150
143
 
151
- - `rca`
152
- Like review + recency bias (recently modified files rank higher).
144
+ Profiles control:
153
145
 
154
- Each profile controls:
146
+ - k (primary hits)
147
+ - weights (vector / lexical / recency)
148
+ - expansion rules
149
+ - candidate pool sizes
155
150
 
156
- - **k** (how many primary hits)
157
- - **weights** (vector/lexical/recency)
158
- - **expand** (adjacent chunks, follow imports, include file synopsis)
159
- - **candidate pool sizes** (vectorK/lexicalK)
160
-
161
- You can override any of these at runtime:
162
-
163
- ```ts
164
- const bundle = await ix.retrieve("Explain auth flow", {
165
- profile: "architecture",
166
- profileOverrides: {
167
- k: 30,
168
- weights: { vector: 0.6, lexical: 0.3, recency: 0.1 },
169
- expand: { followImports: 5 }
170
- }
171
- });
172
- ```
151
+ Profiles can be overridden at runtime.
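To make the idea of runtime overrides and weighted retrieval concrete, here is a minimal sketch. It assumes a profile holds `k` and `weights` (vector / lexical / recency), as listed above, and that scores are combined as a plain weighted sum; that combination rule, the `applyOverrides` and `hybridScore` helpers, and the sample numbers are all assumptions, not the package's documented behaviour.

```ts
interface Weights { vector: number; lexical: number; recency: number }
interface RetrievalProfile { k: number; weights: Weights }
type ProfileOverrides = { k?: number; weights?: Partial<Weights> };

// Merge overrides onto a base profile; unspecified fields keep their defaults.
function applyOverrides(base: RetrievalProfile, overrides: ProfileOverrides = {}): RetrievalProfile {
  return {
    k: overrides.k ?? base.k,
    weights: { ...base.weights, ...overrides.weights },
  };
}

// Assumed scoring rule: a weighted sum of the per-signal scores.
function hybridScore(weights: Weights, scores: Weights): number {
  return (
    weights.vector * scores.vector +
    weights.lexical * scores.lexical +
    weights.recency * scores.recency
  );
}

// Hypothetical defaults for a "search"-style profile.
const baseProfile: RetrievalProfile = { k: 8, weights: { vector: 0.5, lexical: 0.5, recency: 0 } };
const tuned = applyOverrides(baseProfile, { k: 30, weights: { recency: 0.25 } });

console.log(tuned.k); // 30
console.log(hybridScore(baseProfile.weights, { vector: 1, lexical: 0, recency: 0 })); // 0.5
```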
173
152
 
174
153
  ---
175
154
 
176
- ## Config file
155
+ ## Index backend configuration (profiles)
177
156
 
178
- The CLI supports `--config` pointing to a JSON file.
179
-
180
- Example: `petri-index.config.json`
157
+ Index backends are configured using named profiles.
181
158
 
182
159
  ```json
183
160
  {
184
- "workspace": {
185
- "discovery": {
186
- "exclude": ["**/vendor/**", "**/node_modules/**"],
187
- "maxDepth": 8,
188
- "includeSubmodules": true
161
+ "indexBackends": {
162
+ "vectorProfiles": {
163
+ "local-default": {
164
+ "kind": "local",
165
+ "provider": "bruteforce",
166
+ "metric": "cosine"
167
+ },
168
+ "qdrant-dev": {
169
+ "kind": "qdrant",
170
+ "url": "http://localhost:6333",
171
+ "collectionPrefix": "petri"
172
+ }
189
173
  },
190
- "graph": {
191
- "provider": "neo4j",
192
- "neo4j": {
174
+ "graphProfiles": {
175
+ "none": { "kind": "none" },
176
+ "neo4j-local": {
177
+ "kind": "neo4j",
193
178
  "uri": "neo4j://localhost:7687",
194
179
  "user": "neo4j",
195
- "password": "password",
180
+ "passwordRef": "NEO4J_PASSWORD",
196
181
  "database": "neo4j",
197
182
  "labelPrefix": "Petri"
198
183
  }
199
184
  },
200
- "repoOverrides": [
201
- {
202
- "match": "apps/**",
203
- "config": { "storage": { "ftsMode": "tokens" } }
204
- }
205
- ]
206
- },
207
- "storage": {
208
- "storeText": true,
209
- "ftsMode": "full"
210
- },
211
- "vector": {
212
- "provider": "hnswlib",
213
- "metric": "cosine",
214
- "hnswlib": {
215
- "persist": true,
216
- "persistDebounceMs": 2000,
217
- "efSearch": 64
218
- }
219
- },
220
- "chunk": {
221
- "maxLines": 260,
222
- "overlapLines": 50
223
- },
224
- "profiles": {
225
- "architecture": {
226
- "k": 30,
227
- "expand": { "followImports": 4 }
228
- },
229
- "rca": {
230
- "weights": { "recency": 0.35 }
185
+ "defaults": {
186
+ "vectorProfile": "local-default",
187
+ "graphProfile": "none"
231
188
  }
232
189
  }
233
190
  }
234
191
  ```
235
192
 
236
- Run:
237
-
238
- ```bash
239
- npx petri-index retrieve "How does login work?" /path/to/workspace --config petri-index.config.json --profile architecture
240
- ```
241
-
242
- ### Lexical modes (`storage.ftsMode`)
243
- - `"full"` (default): best retrieval; stores (redacted) chunk text in the FTS table.
244
- - `"tokens"`: stores only extracted identifiers/tokens for lexical search (less sensitive; still useful for code search).
245
- - `"off"`: disables lexical indexing entirely (vector-only retrieval).
246
-
247
- ---
248
-
249
- ## Vector backends
193
+ The selected profiles are resolved internally into runtime configuration.
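The resolution step is not documented in this diff, but conceptually it amounts to looking up the names under `defaults` in the profile maps. The sketch below mirrors the JSON shape shown in the example config; the `resolveBackends` function and its error messages are hypothetical.

```ts
type BackendProfile = { kind: string; [key: string]: unknown };

interface IndexBackendsConfig {
  vectorProfiles: Record<string, BackendProfile>;
  graphProfiles: Record<string, BackendProfile>;
  defaults: { vectorProfile: string; graphProfile: string };
}

// Look up the default profile names and fail loudly if either is undefined.
function resolveBackends(cfg: IndexBackendsConfig): { vector: BackendProfile; graph: BackendProfile } {
  const vector = cfg.vectorProfiles[cfg.defaults.vectorProfile];
  const graph = cfg.graphProfiles[cfg.defaults.graphProfile];
  if (!vector) throw new Error(`Unknown vector profile: ${cfg.defaults.vectorProfile}`);
  if (!graph) throw new Error(`Unknown graph profile: ${cfg.defaults.graphProfile}`);
  return { vector, graph };
}

// Same values as the "indexBackends" object in the README example.
const indexBackends: IndexBackendsConfig = {
  vectorProfiles: {
    "local-default": { kind: "local", provider: "bruteforce", metric: "cosine" },
    "qdrant-dev": { kind: "qdrant", url: "http://localhost:6333", collectionPrefix: "petri" },
  },
  graphProfiles: { none: { kind: "none" } },
  defaults: { vectorProfile: "local-default", graphProfile: "none" },
};

const resolved = resolveBackends(indexBackends);
console.log(resolved.vector.kind, resolved.graph.kind); // local none
```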
250
194
 
251
- Configure the ANN backend via `vector.provider`:
195
+ ### Neo4j migration note
252
196
 
253
- - `"bruteforce"` (default): in-memory exact search, no extra dependencies
254
- - `"hnswlib"`: fast local ANN using HNSW via `hnswlib-node`
255
- - `"qdrant"`: Qdrant (local or remote) via `@qdrant/js-client-rest`
256
- - `"faiss"`: FAISS via `faiss-node` (rebuild-on-write; good for experimentation)
257
- - `"auto"`: picks the best available backend (prefers Qdrant if configured)
258
- - `"custom"`: load a custom provider module that implements the `VectorIndex` interface
197
+ Earlier versions accepted Neo4j configuration under `workspace.graph`.
259
198
 
260
- ### HNSW (local)
261
-
262
- Install:
263
-
264
- ```bash
265
- npm i hnswlib-node
266
- ```
267
-
268
- Config:
269
-
270
- ```json
271
- {
272
- "vector": {
273
- "provider": "hnswlib",
274
- "metric": "cosine",
275
- "hnswlib": {
276
- "persist": true,
277
- "persistDebounceMs": 2000,
278
- "m": 16,
279
- "efConstruction": 200,
280
- "efSearch": 64
281
- }
282
- }
283
- }
284
- ```
285
-
286
- ### Qdrant (local)
287
-
288
- Start a local Qdrant:
289
-
290
- ```bash
291
- docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
292
- ```
293
-
294
- Install client:
295
-
296
- ```bash
297
- npm i @qdrant/js-client-rest
298
- ```
299
-
300
- Config:
301
-
302
- ```json
303
- {
304
- "vector": {
305
- "provider": "qdrant",
306
- "metric": "cosine",
307
- "qdrant": {
308
- "url": "http://127.0.0.1:6333",
309
- "collectionPrefix": "petri",
310
- "mode": "commit",
311
- "recreateOnRebuild": true
312
- }
313
- }
314
- }
315
- ```
316
-
317
- ### FAISS
318
-
319
- Install:
320
-
321
- ```bash
322
- npm i faiss-node
323
- ```
324
-
325
- Config:
326
-
327
- ```json
328
- {
329
- "vector": {
330
- "provider": "faiss",
331
- "metric": "cosine",
332
- "faiss": {
333
- "descriptor": "HNSW,Flat",
334
- "persist": true,
335
- "persistDebounceMs": 2000,
336
- "rebuildStrategy": "lazy"
337
- }
338
- }
339
- }
340
- ```
341
-
342
- ### Custom provider
343
-
344
- Point `vector.custom` to an ES module that exports either:
345
-
346
- - a class implementing `VectorIndex`, or
347
- - a factory function returning a `VectorIndex`
348
-
349
- ```json
350
- {
351
- "vector": {
352
- "provider": "custom",
353
- "custom": {
354
- "module": "./my-vector-provider.mjs",
355
- "export": "default",
356
- "options": { "foo": "bar" }
357
- }
358
- }
359
- }
360
- ```
361
-
362
- ## Security model
363
-
364
- Local indexing means **your source stays on your machine**.
365
-
366
- Controls:
367
-
368
- 1. **Git-native ignore**: files are selected via:
369
- - `git ls-files --cached --others --exclude-standard`
370
- which honours `.gitignore` exactly.
371
- 2. **Extra ignores**: `.petriignore` and `.augmentignore`
372
- 3. **Redaction hooks** (on by default):
373
- - skip obvious secret files by path substring
374
- - redact patterns (e.g. private keys) before embedding + storage
375
-
376
- > For higher assurance, set `storage.ftsMode = "tokens"` and review `redact.patterns`.
199
+ This version automatically migrates those settings into a graph profile on first run. After migration, legacy settings are ignored.
377
200
 
378
201
  ---
379
202
 
380
- ## Output format for agents
203
+ ## Persistence semantics
381
204
 
382
- `WorkspaceIndexer.retrieve()` returns a `ContextBundle`:
205
+ Disabling the graph backend **does not** disable index persistence.
383
206
 
384
- - `hits[]` — ranked primary chunks with scores and previews
385
- - `context[]` — expanded context blocks with reasons (adjacency/imports/synopsis)
386
- - `stats` — diagnostics useful for your agent logs
387
-
388
- This is a good structure for:
389
- - Search answers (just `hits`)
390
- - Multi-file refactoring (use `context` as grounded evidence)
391
- - Review/RCA (scope to changed files, include synopsis, bias by recency)
207
+ Persistence of catalogue data, embeddings, and vector indices is controlled independently via storage settings.
392
208
 
393
209
  ---
394
210
 
395
- ## Performance notes
396
-
397
- - Default vector backend is **bruteforce** (exact search in memory). For large repos, use:
398
- - `vector.provider = "hnswlib"` for fast local ANN (HNSW)
399
- - `vector.provider = "qdrant"` for durable, scalable vector search
400
- - `vector.provider = "faiss"` if you already run FAISS locally
401
- - SQLite remains the source-of-truth for file/chunk metadata, so you can rebuild the vector index at any time.
402
-
403
- ---
404
-
405
- ## Files ignored by default (recommended)
211
+ ## Security model
406
212
 
407
- Create a `.petriignore` in each repo to exclude heavy or noisy artefacts:
213
+ - Git‑native ignore (`git ls-files`)
214
+ - Additional `.petriignore` / `.augmentignore`
215
+ - Redaction hooks before embedding and storage
408
216
 
409
- ```txt
410
- dist/
411
- build/
412
- coverage/
413
- **/*.min.js
414
- **/*.map
415
- ```
217
+ For higher assurance:
218
+ - set `storage.ftsMode = "tokens"`
219
+ - review redaction patterns
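As an illustration of what a redaction hook might do before text is embedded or stored, the sketch below replaces PEM-style private key blocks with a placeholder. The pattern list, the `redact` function, and the `[REDACTED]` placeholder are hypothetical; the package's own redaction patterns may differ and should be reviewed directly.

```ts
// Hypothetical redaction patterns; here only PEM-style private key blocks.
const redactionPatterns: RegExp[] = [
  /-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g,
];

// Apply every pattern in turn, replacing each match with the placeholder.
function redact(text: string, placeholder = "[REDACTED]"): string {
  return redactionPatterns.reduce((out, re) => out.replace(re, placeholder), text);
}

const sample = [
  "const a = 1;",
  "-----BEGIN RSA PRIVATE KEY-----",
  "not-a-real-key",
  "-----END RSA PRIVATE KEY-----",
  "const b = 2;",
].join("\n");

console.log(redact(sample));
```

Running redaction before both embedding and storage means neither the vector index nor the FTS table ever sees the matched text.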
416
220
 
417
221
  ---
418
222
 
419
223
  ## Licence
420
224
 
421
- MIT (add your own licence file if desired).
225
+ MIT
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@neuralsea/workspace-indexer",
3
- "version": "0.6.0",
3
+ "version": "0.6.1",
4
4
  "description": "Local-first multi-repo workspace indexer (semantic embeddings + git-aware incremental updates + hybrid retrieval profiles) for AI agents.",
5
5
  "repository": {
6
6
  "type": "git",