@neuralsea/workspace-indexer 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,356 @@
1
+ # @neuralsea/workspace-indexer
2
+
3
+ A **local-first**, **multi-repo** workspace indexer for AI agents (e.g. your custom agent “Damocles”).
4
+
5
+ It provides:
6
+
7
+ - **Whole-workspace indexing** (multiple Git repos under a workspace root)
8
+ - **Meaningful chunking** (TypeScript/JavaScript AST-aware chunking + robust fallback for other files)
9
+ - **Semantic embeddings** (pluggable: **Ollama local**, **OpenAI**, or deterministic offline **hash**)
10
+ - **Hybrid retrieval**: vector similarity **plus** lexical search (SQLite FTS5) with configurable weights
11
+ - **Pluggable vector backends**: `bruteforce`, `hnswlib` (HNSW), `qdrant` (local/remote), `faiss`, or a custom provider
12
+ - **Branch/commit isolation**: a separate index per repo per Git **HEAD** commit (reduces stale-context errors)
13
+ - **Fast incremental updates**: file watching + `.git/HEAD` switch detection
14
+ - **Security controls**: respects `.gitignore` via `git ls-files`, honours `.petriignore` and `.augmentignore`, and applies redaction hooks
15
+
16
+ This package is designed so Damocles can use the same index in different problem domains:
17
+
18
+ - **Search**
19
+ - **Refactor**
20
+ - **Review**
21
+ - **Architecture understanding**
22
+ - **RCA (root cause analysis)**
23
+
24
+ …by selecting different **retrieval profiles** (k/weights/context-expansion/scope).
25
+
26
+ ---
27
+
28
+ ## Install
29
+
30
+ ```bash
31
+ npm i @neuralsea/workspace-indexer
32
+ ```
33
+
34
+ Node 18+ recommended.
35
+
36
+ ---
37
+
38
+ ## Quick start (library)
39
+
40
+ ```ts
41
+ import { WorkspaceIndexer, OllamaEmbeddingsProvider } from "@neuralsea/workspace-indexer";
42
+
43
+ const embedder = new OllamaEmbeddingsProvider({ model: "nomic-embed-text" });
44
+ const ix = new WorkspaceIndexer("/path/to/workspace", embedder);
45
+
46
+ await ix.indexAll();
47
+
48
+ // Domain: search
49
+ const search = await ix.retrieve("Where is authentication enforced?", { profile: "search" });
50
+
51
+ // Domain: refactor (more context)
52
+ const refactor = await ix.retrieve("Refactor the caching layer to support TTL per key", { profile: "refactor" });
53
+
54
+ // Domain: review (changed files only)
55
+ const review = await ix.retrieve("Explain the risk of this change", {
56
+   profile: "review",
57
+   scope: { changedOnly: true, baseRef: "origin/main" }
58
+ });
59
+
60
+ console.log(search.hits.map(h => h.chunk.path));
61
+ await ix.closeAsync();
62
+ ```
63
+
64
+ ---
65
+
66
+ ## CLI
67
+
68
+ ### Index a workspace
69
+ ```bash
70
+ npx petri-index index /path/to/workspace --provider ollama --model nomic-embed-text
71
+ ```
72
+
73
+ ### Watch (keeps index current)
74
+ ```bash
75
+ npx petri-index watch /path/to/workspace --provider ollama --model nomic-embed-text
76
+ ```
77
+
78
+ ### Query (profile: search)
79
+ ```bash
80
+ npx petri-index query "rate limiting middleware" /path/to/workspace --k 8
81
+ ```
82
+
83
+ ### Retrieve (full context bundle as JSON)
84
+ ```bash
85
+ npx petri-index retrieve "Why are requests timing out?" /path/to/workspace \
86
+   --profile rca \
87
+   --changedOnly true \
88
+   --baseRef origin/main
89
+ ```
90
+
91
+ ---
92
+
93
+ ## Retrieval profiles (how Petri adapts per domain)
94
+
95
+ The same index can be used differently depending on the task. The package provides defaults:
96
+
97
+ - `search`
98
+ Tight top-k; favours precise matches; minimal context expansion.
99
+
100
+ - `refactor`
101
+ Wider k; includes adjacent chunks and follows relative imports to pull in dependent modules.
102
+
103
+ - `review`
104
+ Biases to changed files (when scoped) and includes file synopsis for reviewer context.
105
+
106
+ - `architecture`
107
+ Larger candidate pools; prioritises file synopses and follows imports more aggressively.
108
+
109
+ - `rca`
110
+ Like `review`, plus a recency bias (recently modified files rank higher).
111
+
112
+ Each profile controls:
113
+
114
+ - **k** (how many primary hits)
115
+ - **weights** (vector/lexical/recency)
116
+ - **expand** (adjacent chunks, follow imports, include file synopsis)
117
+ - **candidate pool sizes** (vectorK/lexicalK)
118
+
119
+ You can override any of these at runtime:
120
+
121
+ ```ts
122
+ const bundle = await ix.retrieve("Explain auth flow", {
123
+   profile: "architecture",
124
+   profileOverrides: {
125
+     k: 30,
126
+     weights: { vector: 0.6, lexical: 0.3, recency: 0.1 },
127
+     expand: { followImports: 5 }
128
+   }
129
+ });
130
+ ```
131
+
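To make the `weights` setting concrete, here is a minimal sketch of one way vector, lexical, and recency signals could be blended into a single ranking score. It is an illustration, not the package's scoring code: `hybridScore`, `mergeWeights`, the `[0, 1]` normalisation of each signal, and the base weights are all assumptions.

```ts
// Illustrative only: the package's real scoring function is not shown in the README.
interface Weights {
  vector: number;
  lexical: number;
  recency: number;
}

// ASSUMPTION: each per-candidate signal has already been normalised to [0, 1].
interface CandidateSignals {
  vector: number;
  lexical: number;
  recency: number;
}

// Weighted average of the three signals; dividing by the weight total keeps
// the result in [0, 1] even when the weights do not sum to exactly 1.
function hybridScore(signals: CandidateSignals, weights: Weights): number {
  const total = weights.vector + weights.lexical + weights.recency;
  if (total <= 0) return 0;
  const blended =
    weights.vector * signals.vector +
    weights.lexical * signals.lexical +
    weights.recency * signals.recency;
  return blended / total;
}

// Shallow merge, mirroring how a partial override such as
// `{ recency: 0.35 }` could be layered on top of a profile's defaults.
function mergeWeights(base: Weights, overrides: Partial<Weights>): Weights {
  return { ...base, ...overrides };
}

// Hypothetical defaults for demonstration; the real profile defaults are not documented here.
const baseWeights: Weights = { vector: 0.7, lexical: 0.3, recency: 0 };
const rcaWeights = mergeWeights(baseWeights, { recency: 0.35 });
```

Normalising by the weight total means a partial override does not have to re-balance the other weights by hand.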
132
+ ---
133
+
134
+ ## Config file
135
+
136
+ The CLI supports `--config` pointing to a JSON file.
137
+
138
+ Example: `petri-index.config.json`
139
+
140
+ ```json
141
+ {
142
+   "storage": {
143
+     "storeText": true,
144
+     "ftsMode": "full"
145
+   },
146
+   "vector": {
147
+     "provider": "hnswlib",
148
+     "metric": "cosine",
149
+     "hnswlib": {
150
+       "persist": true,
151
+       "persistDebounceMs": 2000,
152
+       "efSearch": 64
153
+     }
154
+   },
155
+   "chunk": {
156
+     "maxLines": 260,
157
+     "overlapLines": 50
158
+   },
159
+   "profiles": {
160
+     "architecture": {
161
+       "k": 30,
162
+       "expand": { "followImports": 4 }
163
+     },
164
+     "rca": {
165
+       "weights": { "recency": 0.35 }
166
+     }
167
+   }
168
+ }
169
+ ```
170
+
171
+ Run:
172
+
173
+ ```bash
174
+ npx petri-index retrieve "How does login work?" /path/to/workspace --config petri-index.config.json --profile architecture
175
+ ```
176
+
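The `chunk.maxLines` and `chunk.overlapLines` settings in the config above suggest a sliding line window for files that are not chunked by AST. The sketch below shows how such a window could work. `chunkByLines` and `LineChunk` are hypothetical names, and the package's actual fallback chunker may behave differently.

```ts
// Illustrative sliding-window chunker; not the package's implementation.
interface LineChunk {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;
}

function chunkByLines(source: string, maxLines = 260, overlapLines = 50): LineChunk[] {
  if (maxLines <= 0 || overlapLines < 0 || overlapLines >= maxLines) {
    throw new RangeError("require maxLines > 0 and 0 <= overlapLines < maxLines");
  }
  const lines = source.split(/\r?\n/);
  // Each new window starts `step` lines after the previous one, so that
  // consecutive windows share `overlapLines` lines.
  const step = maxLines - overlapLines;
  const chunks: LineChunk[] = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + maxLines, lines.length);
    chunks.push({
      startLine: start + 1,
      endLine: end,
      text: lines.slice(start, end).join("\n"),
    });
    if (end === lines.length) break; // the last window reached the end of the file
  }
  return chunks;
}
```

With the values from the example config, a 600-line file would be covered by three windows: lines 1 to 260, 211 to 470, and 421 to 600.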
177
+ ### Lexical modes (`storage.ftsMode`)
178
+ - `"full"` (default): best retrieval; stores (redacted) chunk text in the FTS table.
179
+ - `"tokens"`: stores only extracted identifiers/tokens for lexical search (less sensitive; still useful for code search).
180
+ - `"off"`: disables lexical indexing entirely (vector-only retrieval).
181
+
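As a rough picture of what `"tokens"` mode could keep, the sketch below reduces a code snippet to a sorted set of lower-cased identifiers and their camelCase/snake_case parts. `extractTokens` is a hypothetical helper, and the package's real tokeniser is not documented here. Note that this simple regex also picks up words inside string literals and comments.

```ts
// Illustrative identifier extraction; not the package's tokeniser.
function extractTokens(code: string): string[] {
  // Match identifier-like runs: a letter, `_` or `$`, then letters, digits, `_` or `$`.
  const identifiers = code.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) ?? [];
  const tokens = new Set<string>();
  for (const id of identifiers) {
    // Always keep the whole identifier, lower-cased.
    tokens.add(id.toLowerCase());
    // Also keep its parts, split on `_` and on lower-to-upper case boundaries.
    // Single-character parts are dropped as noise.
    for (const part of id.split(/_|(?<=[a-z0-9])(?=[A-Z])/)) {
      if (part.length > 1) tokens.add(part.toLowerCase());
    }
  }
  return [...tokens].sort();
}
```

Storing only such tokens keeps lexical search useful for symbol lookups while leaving literal values and surrounding prose out of the FTS table.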
182
+ ---
183
+
184
+ ## Vector backends
185
+
186
+ Configure the ANN backend via `vector.provider`:
187
+
188
+ - `"bruteforce"` (default): in-memory exact search, no extra dependencies
189
+ - `"hnswlib"`: fast local ANN using HNSW via `hnswlib-node`
190
+ - `"qdrant"`: Qdrant (local or remote) via `@qdrant/js-client-rest`
191
+ - `"faiss"`: FAISS via `faiss-node` (rebuild-on-write; good for experimentation)
192
+ - `"auto"`: picks the best available backend (prefers Qdrant if configured)
193
+ - `"custom"`: load a custom provider module that implements the `VectorIndex` interface
194
+
195
+ ### HNSW (local)
196
+
197
+ Install:
198
+
199
+ ```bash
200
+ npm i hnswlib-node
201
+ ```
202
+
203
+ Config:
204
+
205
+ ```json
206
+ {
207
+   "vector": {
208
+     "provider": "hnswlib",
209
+     "metric": "cosine",
210
+     "hnswlib": {
211
+       "persist": true,
212
+       "persistDebounceMs": 2000,
213
+       "m": 16,
214
+       "efConstruction": 200,
215
+       "efSearch": 64
216
+     }
217
+   }
218
+ }
219
+ ```
220
+
221
+ ### Qdrant (local)
222
+
223
+ Start a local Qdrant:
224
+
225
+ ```bash
226
+ docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
227
+ ```
228
+
229
+ Install client:
230
+
231
+ ```bash
232
+ npm i @qdrant/js-client-rest
233
+ ```
234
+
235
+ Config:
236
+
237
+ ```json
238
+ {
239
+   "vector": {
240
+     "provider": "qdrant",
241
+     "metric": "cosine",
242
+     "qdrant": {
243
+       "url": "http://127.0.0.1:6333",
244
+       "collectionPrefix": "petri",
245
+       "mode": "commit",
246
+       "recreateOnRebuild": true
247
+     }
248
+   }
249
+ }
250
+ ```
251
+
252
+ ### FAISS
253
+
254
+ Install:
255
+
256
+ ```bash
257
+ npm i faiss-node
258
+ ```
259
+
260
+ Config:
261
+
262
+ ```json
263
+ {
264
+   "vector": {
265
+     "provider": "faiss",
266
+     "metric": "cosine",
267
+     "faiss": {
268
+       "descriptor": "HNSW,Flat",
269
+       "persist": true,
270
+       "persistDebounceMs": 2000,
271
+       "rebuildStrategy": "lazy"
272
+     }
273
+   }
274
+ }
275
+ ```
276
+
277
+ ### Custom provider
278
+
279
+ Point `vector.custom` to an ES module that exports either:
280
+
281
+ - a class implementing `VectorIndex`, or
282
+ - a factory function returning a `VectorIndex`
283
+
284
+ ```json
285
+ {
286
+   "vector": {
287
+     "provider": "custom",
288
+     "custom": {
289
+       "module": "./my-vector-provider.mjs",
290
+       "export": "default",
291
+       "options": { "foo": "bar" }
292
+     }
293
+   }
294
+ }
295
+ ```
296
+
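The README does not show the `VectorIndex` interface, so the following is only a sketch of what a custom provider might look like. The `HypotheticalVectorIndex` interface and its method names (`upsert`, `remove`, `search`) are assumptions made for illustration. Check the package's exported types before writing a real provider.

```ts
// ASSUMPTION: this interface is invented for illustration. The real
// `VectorIndex` contract is defined by the package and may differ.
interface HypotheticalVectorIndex {
  upsert(id: string, vector: number[]): void;
  remove(id: string): void;
  search(query: number[], k: number): { id: string; score: number }[];
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new RangeError("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Exact (brute-force) in-memory index: every query scans all stored vectors.
class InMemoryCosineIndex implements HypotheticalVectorIndex {
  private readonly vectors = new Map<string, number[]>();

  // `options` stands in for the `vector.custom.options` object from the config.
  constructor(readonly options: Record<string, unknown> = {}) {}

  upsert(id: string, vector: number[]): void {
    this.vectors.set(id, vector);
  }

  remove(id: string): void {
    this.vectors.delete(id);
  }

  search(query: number[], k: number): { id: string; score: number }[] {
    return [...this.vectors.entries()]
      .map(([id, v]) => ({ id, score: cosine(query, v) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}
```

In a real provider module, a class like this would be exposed under the export name given in `vector.custom.export` (for example as the default export).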
297
+ ## Security model
298
+
299
+ Local indexing means **your source stays on your machine**.
300
+
301
+ Controls:
302
+
303
+ 1. **Git-native ignore**: files are selected via:
304
+    - `git ls-files --cached --others --exclude-standard`
305
+    which honours `.gitignore` exactly.
306
+ 2. **Extra ignores**: `.petriignore` and `.augmentignore`
307
+ 3. **Redaction hooks** (on by default):
308
+    - skip obvious secret files by path substring
309
+    - redact patterns (e.g. private keys) before embedding + storage
310
+
311
+ > For higher assurance, set `storage.ftsMode = "tokens"` and review `redact.patterns`.
312
+
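The two redaction controls listed above can be pictured as a path filter plus a pattern scrubber. The sketch below is an assumption-laden illustration: the path hints, the two patterns, and the `shouldSkipPath` / `redact` helpers are hypothetical and are not the package's defaults.

```ts
// Illustrative only: hypothetical hints and patterns, not the package's defaults.
const SECRET_PATH_HINTS = [".env", "id_rsa", ".pem", "secrets"];

const REDACT_PATTERNS: RegExp[] = [
  // PEM-encoded private key blocks (RSA, EC, OPENSSH, ...).
  /-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g,
  // Strings shaped like AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics.
  /\bAKIA[0-9A-Z]{16}\b/g,
];

// Skip a file entirely when its path contains an obvious secret marker.
function shouldSkipPath(path: string): boolean {
  const lower = path.toLowerCase();
  return SECRET_PATH_HINTS.some((hint) => lower.includes(hint));
}

// Replace every match of every pattern before the text is embedded or stored.
function redact(text: string): string {
  return REDACT_PATTERNS.reduce((acc, pattern) => acc.replace(pattern, "[REDACTED]"), text);
}
```

Substring matching on paths is deliberately coarse: it may skip harmless files, which is usually the safer failure mode for an index that feeds an agent.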
313
+ ---
314
+
315
+ ## Output format for agents
316
+
317
+ `WorkspaceIndexer.retrieve()` returns a `ContextBundle`:
318
+
319
+ - `hits[]` — ranked primary chunks with scores and previews
320
+ - `context[]` — expanded context blocks with reasons (adjacency/imports/synopsis)
321
+ - `stats` — diagnostics useful for your agent logs
322
+
323
+ This is a good structure for:
324
+ - Search answers (just `hits`)
325
+ - Multi-file refactoring (use `context` as grounded evidence)
326
+ - Review/RCA (scope to changed files, include synopsis, bias by recency)
327
+
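One way an agent might consume a bundle is to flatten it into a prompt section. In the sketch below, only `hits[].chunk.path` is confirmed by the quick-start example. The other field names (`score`, `preview`, `reason`, `path`, `text`) and the `toPrompt` helper are assumptions about the shape of `ContextBundle`.

```ts
// ASSUMPTION: field names other than `hits[].chunk.path` are guesses at the
// real `ContextBundle` shape, used here only for illustration.
interface HitSketch {
  score: number;
  preview: string;
  chunk: { path: string };
}

interface ContextBlockSketch {
  reason: string; // e.g. "adjacency", "imports", "synopsis"
  path: string;
  text: string;
}

interface ContextBundleSketch {
  hits: HitSketch[];
  context: ContextBlockSketch[];
  stats: Record<string, unknown>;
}

// Render ranked hits first, then the expanded context, and cap the total
// size so the result fits a prompt budget.
function toPrompt(bundle: ContextBundleSketch, maxChars = 8000): string {
  const parts: string[] = ["## Primary hits"];
  for (const hit of bundle.hits) {
    parts.push(`- ${hit.chunk.path} (score ${hit.score.toFixed(3)})`);
  }
  parts.push("## Supporting context");
  for (const block of bundle.context) {
    parts.push(`### ${block.path} [${block.reason}]\n${block.text}`);
  }
  return parts.join("\n").slice(0, maxChars);
}
```

Keeping the reason next to each context block lets the agent (or a human reading the logs) see why a file was pulled in.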
328
+ ---
329
+
330
+ ## Performance notes
331
+
332
+ - Default vector backend is **bruteforce** (exact search in memory). For large repos, use:
333
+ - `vector.provider = "hnswlib"` for fast local ANN (HNSW)
334
+ - `vector.provider = "qdrant"` for durable, scalable vector search
335
+ - `vector.provider = "faiss"` if you already run FAISS locally
336
+ - SQLite remains the source-of-truth for file/chunk metadata, so you can rebuild the vector index at any time.
337
+
338
+ ---
339
+
340
+ ## Recommended `.petriignore` entries
341
+
342
+ Create a `.petriignore` in each repo to exclude heavy or noisy artefacts:
343
+
344
+ ```txt
345
+ dist/
346
+ build/
347
+ coverage/
348
+ **/*.min.js
349
+ **/*.map
350
+ ```
351
+
352
+ ---
353
+
354
+ ## Licence
355
+
356
+ MIT. See the `LICENSE` file included in the package.