xindex 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. package/.ai/research/.gitkeep +0 -0
  2. package/.ai/task/.gitkeep +0 -0
  3. package/README.md +54 -89
  4. package/apps/run.search.ts +0 -3
  5. package/componets/index/formatSearchResults.ts +2 -2
  6. package/media/MEDIUM.md +139 -0
  7. package/media/SOCIAL.md +102 -0
  8. package/package.json +1 -1
  9. package/.ai/research/2026-04-10-file-watching.md +0 -79
  10. package/.ai/research/2026-04-10-mcp-output-format.md +0 -129
  11. package/.ai/task/INDEX.md +0 -12
  12. package/.ai/task/done/INDEX.md +0 -3
  13. package/.ai/task/done/task.2026-04-09-local-ai-research-protos.log.md +0 -98
  14. package/.ai/task/done/task.2026-04-09-local-ai-research-protos.md +0 -102
  15. package/.ai/task/task.2026-04-10-cluster-config.log.md +0 -19
  16. package/.ai/task/task.2026-04-10-cluster-config.md +0 -118
  17. package/.ai/task/task.2026-04-10-dir-indexing.log.md +0 -8
  18. package/.ai/task/task.2026-04-10-dir-indexing.md +0 -92
  19. package/.ai/task/task.2026-04-10-line-clustering.log.md +0 -50
  20. package/.ai/task/task.2026-04-10-line-clustering.md +0 -176
  21. package/.ai/task/task.2026-04-10-object-store.log.md +0 -7
  22. package/.ai/task/task.2026-04-10-object-store.md +0 -81
  23. package/.ai/task/task.2026-04-10-search-config.log.md +0 -46
  24. package/.ai/task/task.2026-04-10-search-config.md +0 -274
  25. package/.ai/task/task.2026-04-10-watch-indexing.log.md +0 -32
  26. package/.ai/task/task.2026-04-10-watch-indexing.md +0 -101
  27. package/.ai/task/task.2026-04-10-xindex-mcp.log.md +0 -5
  28. package/.ai/task/task.2026-04-10-xindex-mcp.md +0 -92
  29. package/.ai/task/task.2026-04-10-xindex-mcp.report.md +0 -113
@@ -1,176 +0,0 @@
- # Task: Line-level clustering for block-granular search
-
- ## Context
-
- **Current state**: xindex indexes one vector per file. The `id` is the file path, keywords are extracted from the entire file content, and search returns file-level matches. This is too coarse — a 500-line file with mixed concerns returns as a single hit with no indication of *where* in the file the match is.
-
- **User's idea**: split files into semantically coherent blocks (clusters of lines), then index each block separately so search returns `file:fromLine-toLine` references.
-
- **Approach — extend existing pipeline with recursive bisection**:
- 1. Keep existing file-level indexing intact — `indexContent(filePath, keywords, meta)` runs first, unchanged
- 2. After file-level index: split file content into lines
- 3. Bisect into 2 halves → extract keywords for each → embed → compute cosine similarity (dot product of normalized vectors)
- 4. If similarity is high (≥ 0.6) → cohesive, no clustering needed. If low → 2 separate clusters.
- 5. Recurse: split each cluster into 2 again, test overlap, stop when clusters are cohesive or hit limits (max depth 4 → up to 16 clusters, min 5 lines per cluster)
- 6. If only 1 cluster (whole file is cohesive) → skip clustering, file-level entry is enough
- 7. If 2+ clusters → index each in persistent Vectra as `<file>:<fromLine>-<toLine>` alongside the file-level entry
- 8. Write a manifest at `<file>::manifest` tracking cluster IDs for cleanup on re-index
- 9. Both file-level and cluster-level entries coexist — search may return both
-
- **Key files (change targets)**:
- - `componets/index/indexFileContent.ts` — **main change site**: currently calls `indexContent(id, keywords, meta)` once per file. Will call `clusterLines` then loop over clusters.
- - `componets/index/handleFileEvent.ts` — calls `removeContent(path)` on file change. Must delete all clusters for a file, not just one ID.
- - `componets/index/indexContent.ts` — low-level: embeds + upserts one item. No change needed — called per cluster.
- - `componets/index/removeContent.ts` — low-level: deletes one item. No change needed — called per cluster ID.
- - `componets/index/searchContentIndex.ts` — returns `IIndexRecord{score, id, meta}`. No change needed — `id` becomes `file:1-27` naturally.
- - `componets/index/indexMeta.ts` — `IIndexMeta{keywords, id}`. Add `type` tag, add `IClusterMeta`, `IFileManifest` using `IType<>` tagged union.
- - `componets/index/objectStore.ts` — stores `IIndexMeta` as JSON, keyed by MD5(id). Needs a manifest entry per file to track cluster IDs.
- - `componets/index/contentIndexDriver.ts` — wires components together. Must construct `ClusterLines`, `IndexFileContent`, `RemoveFileContent` inside. Currently `IndexFileContent` is constructed by callers.
- - `componets/buildComponents.ts` — top-level builder. Must return `indexFileContent` + `removeFileContent` from driver.
- - `apps/indexApp.ts` — bulk indexer. Calls `indexFileContent` directly via stream (no `HandleFileEvent`). Needs `removeFileContent` for cleanup.
- - `apps/run.index.ts`, `apps/run.watch.ts`, `apps/run.mcp.ts` — entry points. Currently construct `IndexFileContent` manually. Will use driver-provided version.
- - `componets/index/vectraIndex.ts` — creates `LocalIndex(path)`. No change.
- - `componets/llm/embed.ts` — MiniLM-L6 embeddings, returns `number[]`. No change.
- - `test-vectra-memory.ts` — proved VirtualFileStorage works for in-memory cosine queries.
-
- **Raw notes**: recursive split → 2 → 4 → 8 → 16 hard stop. Overlap by keywords via embedding cosine similarity. Final clusters get indexed in persistent store with line references. MCP query returns lines.
-
- ## Goal
-
- Extend the existing indexing pipeline with a `ClusterLines` component (HOF pattern) that takes file content, splits it into semantically coherent line clusters using recursive bisection with embedding cosine similarity, and returns cluster descriptors `{fromLine, toLine, content, keywords}[]`. The existing file-level index stays intact — clustering adds block-level entries alongside it.
-
- ## Diagram
-
- ```
- handleFileEvent (file change/add)
-
- ├── removeFileContent(path) ◄── clean ALL old data first
- │   ├── removeContent(path)            delete file-level vectra + meta
- │   └── read manifest(path::manifest)  if exists:
- │       ├── removeContent(path:1-10)   delete each cluster
- │       ├── removeContent(path:11-25)
- │       └── objectStore.remove(manifest)
-
- └── indexFileContent(path, text) ◄── create ALL new data
-
-     ├── EXISTING: file-level index (unchanged)
-     │     extractKeywords + cleanUpKeywords(text)
-     │     indexContent(path, keywords, {keywords, id: path})
-     │     ├── embed(keywords) → vector
-     │     ├── vectra.upsertItem({id: path, vector})
-     │     └── objectStore.write(path, meta)
-
-     ├── NEW: cluster-level index (extension)
-     │     clusterLines(lines, path)
-     │     │
-     │     ▼
-     │     ┌─────────────────┐
-     │     │ Split in half   │
-     │     │ lines[0..n/2]   │
-     │     │ lines[n/2..n]   │
-     │     └────────┬────────┘
-     │              │
-     │              ▼
-     │     ┌──────────────────────────┐
-     │     │ Extract keywords each    │
-     │     │ Embed keywords → vec     │
-     │     │ cosine(vecA, vecB)       │
-     │     └────────┬─────────────────┘
-     │              │
-     │     sim ≥ 0.6? ──yes──► 1 cluster (leaf)
-     │              │
-     │              no → recurse each half ◄── depth ≤ 4, min 5 lines
-     │              │
-     │              ▼
-     │     clusters[] = {fromLine, toLine, content, keywords}[]
-     │
-     │     clusters.length ≤ 1? → SKIP (file entry is enough)
-     │
-     │     clusters.length > 1? → for each cluster:
-     │         indexContent(id="path:12-45", cluster.keywords, clusterMeta)
-
-     └── objectStore.write(path::manifest, {clusterIds})
-
- Three key types in store:
-   path           → file-level entry (vectra + objectStore)
-   path:1-10      → cluster entry (vectra + objectStore)
-   path::manifest → {type:"manifest", clusterIds} (objectStore only)
- ```
-
- ## Steps
-
- ### 1. ClusterLines component — NEW `componets/index/clusterLines.ts`
- 1. **Cosine helper** — `cosine(a: number[], b: number[]): number` — dot product of two normalized vectors. Pure function, no deps.
- 2. **HOF factory** — `ClusterLines({embed, extractKeywords, cleanUpKeywords, threshold, minLines, maxDepth})` returns `IClusterLines(lines: string[], file: string) → Promise<ILineCluster[]>`. Defaults: threshold=0.6, minLines=5, maxDepth=4.
- 3. **ILineCluster type** — `{fromLine: number, toLine: number, content: string, keywords: string}`. `fromLine`/`toLine` are 1-based line numbers.
- 4. **Recursive bisection** — split lines at midpoint → join each half → `extractKeywords` + `cleanUpKeywords` → `embed` each → `cosine(vecA, vecB)`. If sim ≥ threshold → leaf cluster. If sim < threshold → recurse on each half.
- 5. **Guards** — `lines.length ≤ minLines` or `depth ≥ maxDepth` → leaf. Empty lines → return `[]`. Either half has no keywords → leaf.
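Step 1 can be sketched end-to-end as below. This is a minimal illustration under stated assumptions, not the project's actual module: the `embed`/`extractKeywords` dependency shapes are guessed from the notes, and `cleanUpKeywords` is folded into `extractKeywords` for brevity.

```typescript
// Sketch of ClusterLines: recursive bisection with embedding cosine similarity.
// Dependency shapes (embed, extractKeywords) are assumptions, not the real API.
type ILineCluster = { fromLine: number; toLine: number; content: string; keywords: string };

// Dot product of two already-normalized vectors.
export function cosine(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

type IClusterLinesDeps = {
  embed: (text: string) => Promise<number[]>;
  extractKeywords: (text: string) => string;
  threshold?: number; // default 0.6
  minLines?: number;  // default 5
  maxDepth?: number;  // default 4 → at most 16 leaf clusters
};

export function ClusterLines({ embed, extractKeywords, threshold = 0.6, minLines = 5, maxDepth = 4 }: IClusterLinesDeps) {
  return async function clusterLines(lines: string[]): Promise<ILineCluster[]> {
    if (lines.length === 0) return [];
    const recurse = async (from: number, to: number, depth: number): Promise<ILineCluster[]> => {
      const leaf = (): ILineCluster[] => {
        const content = lines.slice(from - 1, to).join("\n"); // 1-based inclusive range
        return [{ fromLine: from, toLine: to, content, keywords: extractKeywords(content) }];
      };
      // Guards: too small or too deep → stop splitting.
      if (to - from + 1 <= minLines || depth >= maxDepth) return leaf();
      const mid = from + Math.floor((to - from) / 2);
      const kwA = extractKeywords(lines.slice(from - 1, mid).join("\n"));
      const kwB = extractKeywords(lines.slice(mid, to).join("\n"));
      if (!kwA || !kwB) return leaf(); // a half with no keywords → keep whole
      const [vecA, vecB] = await Promise.all([embed(kwA), embed(kwB)]);
      if (cosine(vecA, vecB) >= threshold) return leaf(); // cohesive → one cluster
      return [...(await recurse(from, mid, depth + 1)), ...(await recurse(mid + 1, to, depth + 1))];
    };
    return recurse(1, lines.length, 0);
  };
}
```

With a fake `embed` that separates two topics, a 12-line file with two distinct halves comes back as two clusters with 1-based line ranges.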
-
- ### 2. Extend metadata — MODIFY `componets/index/indexMeta.ts` + `objectStore.ts`
- 1. **Tag IIndexMeta** — add `type: "meta"` field using `IType<>` pattern: `IType<{type: "meta", keywords: string, id: string}>`. Breaking change — all constructors must add `type: "meta"`.
- 2. **Add IClusterMeta type** — `IType<{type: "cluster", keywords: string, id: string, fromLine: number, toLine: number}>`. Cluster-level entries with line ranges.
- 3. **Add IFileManifest type** — `IType<{type: "manifest", id: string, clusterIds: string[]}>`. Stored at key `filePath::manifest` in object store.
- 4. **IStoreEntry union** — `IIndexMeta | IClusterMeta | IFileManifest`. Discriminated by `type` field.
- 5. **Widen objectStore types** — `IObjectStore.write`/`read` accept/return `IStoreEntry`.
- 6. **Update indexContent.ts** — widen `meta` param from `IIndexMeta` to `IIndexMeta | IClusterMeta`.
- 7. **Update searchContentIndex.ts** — narrow `IStoreEntry` by `type` when reading results from object store.
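The tagged union in step 2 can be sketched as follows. `IType<>` is rendered here as a plain identity alias — a stand-in, since the project's real helper isn't shown in the notes:

```typescript
// Sketch of the tagged store-entry union. IType<> is a hypothetical identity
// alias here; the project's actual helper may differ.
type IType<T extends { type: string }> = T;

export type IIndexMeta = IType<{ type: "meta"; keywords: string; id: string }>;
export type IClusterMeta = IType<{ type: "cluster"; keywords: string; id: string; fromLine: number; toLine: number }>;
export type IFileManifest = IType<{ type: "manifest"; id: string; clusterIds: string[] }>;

export type IStoreEntry = IIndexMeta | IClusterMeta | IFileManifest;

// Narrowing by the `type` discriminant, as searchContentIndex would do when
// reading results back from the object store:
export function describe(entry: IStoreEntry): string {
  switch (entry.type) {
    case "meta":
      return `file ${entry.id}`;
    case "cluster":
      return `${entry.id} (lines ${entry.fromLine}-${entry.toLine})`;
    case "manifest":
      return `manifest for ${entry.id} (${entry.clusterIds.length} clusters)`;
  }
}
```

The `switch` is exhaustive over the discriminant, so TypeScript narrows each branch to the right member without casts.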
-
- ### 3. RemoveFileContent — NEW `componets/index/removeFileContent.ts`
- 1. **HOF factory** — `RemoveFileContent({removeContent, objectStore})` returns `IRemoveFileContent(filePath: string) => Promise<void>`.
- 2. **Deletes all layers** — (a) `removeContent(filePath)` to delete file-level vectra item + meta. (b) Read manifest at `filePath::manifest` → if exists, `removeContent(clusterId)` for each → `objectStore.remove(manifestKey)`. (c) All deletes wrapped in try/catch — missing entries are fine (first-time index, no clusters).
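A minimal sketch of step 3, assuming the `removeContent`/`objectStore` dependency shapes inferred from the notes (the real signatures may differ):

```typescript
// Sketch of RemoveFileContent: deletes file-level entry, manifest-listed
// clusters, then the manifest itself. Dep shapes are assumptions.
type IRemoveDeps = {
  removeContent: (id: string) => Promise<void>;
  objectStore: {
    read: (key: string) => Promise<{ clusterIds: string[] } | undefined>;
    remove: (key: string) => Promise<void>;
  };
};

export function RemoveFileContent({ removeContent, objectStore }: IRemoveDeps) {
  return async function removeFileContent(filePath: string): Promise<void> {
    // (a) file-level entry — may not exist on first index, so swallow errors
    await removeContent(filePath).catch(() => undefined);
    // (b) cluster entries listed in the manifest, if any
    const manifestKey = `${filePath}::manifest`;
    const manifest = await objectStore.read(manifestKey).catch(() => undefined);
    if (!manifest) return; // legacy/first-time index: nothing more to clean
    for (const clusterId of manifest.clusterIds) {
      await removeContent(clusterId).catch(() => undefined);
    }
    // (c) the manifest itself
    await objectStore.remove(manifestKey).catch(() => undefined);
  };
}
```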
-
- ### 4. Update indexFileContent — MODIFY `componets/index/indexFileContent.ts`
- 1. **Add deps** — `{extractKeywords, cleanUpKeywords, indexContent, clusterLines, objectStore}`. Existing deps stay — file-level index needs `extractKeywords`/`cleanUpKeywords`.
- 2. **File-level index (EXISTING, now tagged)** — `extractKeywords(content)` → `cleanUpKeywords` → `indexContent(id, keywords, {type: "meta", keywords, id})`. Runs first, always.
- 3. **Cluster-level index (NEW, extension)** — `content.split("\n")` → `clusterLines(lines, id)` → if `clusters.length ≤ 1` → skip (file is cohesive). If `clusters.length > 1` → for each cluster: `indexContent(\`${id}:${fromLine}-${toLine}\`, cluster.keywords, {type: "cluster", ...})`.
- 4. **Write manifest** — after all clusters indexed, `objectStore.write(id + "::manifest", {type: "manifest", id, clusterIds})`.
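The extended flow in step 4 can be sketched with assumed dependency shapes (`cleanUpKeywords` folded into `extractKeywords` for brevity; the real module wires it separately):

```typescript
// Sketch of the extended indexFileContent: file-level entry, optional
// cluster entries, and a manifest for later cleanup. Dep shapes assumed.
type Cluster = { fromLine: number; toLine: number; keywords: string };
type IIndexDeps = {
  extractKeywords: (text: string) => string;
  clusterLines: (lines: string[], file: string) => Promise<Cluster[]>;
  indexContent: (id: string, keywords: string, meta: object) => Promise<void>;
  objectStore: { write: (key: string, value: object) => Promise<void> };
};

export function IndexFileContent({ extractKeywords, clusterLines, indexContent, objectStore }: IIndexDeps) {
  return async function indexFileContent(id: string, content: string): Promise<void> {
    // 1. File-level entry — existing behavior, now tagged with type: "meta"
    const keywords = extractKeywords(content);
    await indexContent(id, keywords, { type: "meta", keywords, id });
    // 2. Cluster-level entries — skipped when the whole file is cohesive
    const clusters = await clusterLines(content.split("\n"), id);
    if (clusters.length <= 1) return;
    const clusterIds: string[] = [];
    for (const c of clusters) {
      const clusterId = `${id}:${c.fromLine}-${c.toLine}`;
      clusterIds.push(clusterId);
      await indexContent(clusterId, c.keywords, {
        type: "cluster", keywords: c.keywords, id: clusterId, fromLine: c.fromLine, toLine: c.toLine,
      });
    }
    // 3. Manifest so removeFileContent can delete clusters on re-index
    await objectStore.write(`${id}::manifest`, { type: "manifest", id, clusterIds });
  };
}
```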
-
- ### 5. Wire through driver + builder — MODIFY `contentIndexDriver.ts` + `buildComponents.ts`
- 1. **contentIndexDriver.ts** — instantiate `ClusterLines({embed, extractKeywords, cleanUpKeywords})`. Construct `IndexFileContent({extractKeywords, cleanUpKeywords, indexContent, clusterLines, objectStore})` inside driver (currently constructed by callers). Construct `RemoveFileContent({removeContent, objectStore})`. Add `indexFileContent` + `removeFileContent` to `IContentIndexDriver`.
- 2. **buildComponents.ts** — destructure `indexFileContent` + `removeFileContent` from `ContentIndexDriver`. Return them. Callers no longer construct `IndexFileContent` themselves.
-
- ### 6. Update callers — MODIFY `run.*.ts` + `handleFileEvent.ts` + `indexApp.ts`
- 1. **run.index.ts, run.watch.ts, run.mcp.ts** — remove `IndexFileContent(...)` construction. Get `indexFileContent` + `removeFileContent` from `BuildComponents()`.
- 2. **handleFileEvent.ts** — replace `removeContent` dep with `removeFileContent`. On `FileEventType.index`: `removeFileContent(path)` first (clean old data), then `indexFileContent(path, text)` (creates file entry + cluster entries). On `FileEventType.remove`: `removeFileContent(path)`.
- 3. **indexApp.ts** — currently calls `indexFileContent(id, text)` directly via stream pipeline (no `HandleFileEvent`). Add `removeFileContent` dep. Call `removeFileContent(id)` before `indexFileContent(id, text)` in the `map` callback — otherwise old clusters linger when cluster boundaries change on re-index. Update `IndexApp({walkFiles, indexFileContent, removeFileContent, log})`.
- 4. **Import paths** — `run.index.ts` imports `IndexFileContent` from `componets/index/indexFileContent.js`. After moving construction inside driver, remove this import. Same for `run.watch.ts` and `run.mcp.ts`.
-
- ### 7. Test end-to-end
- 1. **Unit test clusterLines** — feed a file with 2 distinct sections (imports+types vs. implementation), verify ≥2 clusters with correct 1-based line ranges.
- 2. **Integration test** — index a multi-concern file, query for a specific concept, verify search returns both `file.ts` (file-level) and `file.ts:12-45` (cluster-level).
- 3. **Re-index test** — modify file, re-index, verify old clusters deleted + new ones created.
- 4. **Cohesive file test** — index a small/uniform file, verify only file-level entry exists (no clusters, no manifest).
-
- ## Decisions
-
- - **Extend, don't replace** — existing file-level indexing stays intact. Clustering is an additional step that runs after. Both levels coexist in the index.
- - **1 cluster = skip** — if the file is cohesive (clustering returns 1 cluster = whole file), no cluster entries are created. File-level entry is enough.
- - **Embedding cosine similarity** for bisection (not Jaccard). Jaccard only matches exact keyword strings — `fetchUser` and `getUser` would score 0% overlap despite being the same concern. Embeddings capture meaning. Cost is acceptable: MiniLM-L6 is local, ~30 embed calls per file at max depth, ~50-100ms total.
- - **Cosine computation**: Option A — direct dot product (3-line helper, vectors already normalized). Fallback to Option B (in-memory Vectra via `VirtualFileStorage`) if direct cosine proves insufficient.
- - **Similarity threshold**: start at 0.55–0.70, tune empirically. Try 0.6 as default.
- - **Min cluster size**: 3–5 lines. Use 5 as default, configurable.
- - **Three tagged types in store** (using `IType<>` pattern): `IIndexMeta{type:"meta"}` at `filePath`, `IClusterMeta{type:"cluster"}` at `filePath:fromLine-toLine`, `IFileManifest{type:"manifest"}` at `filePath::manifest`. All separate keys, discriminated by `type`.
- - **Cleanup on re-index**: `removeFileContent` deletes file-level entry, then reads manifest to delete all cluster entries, then deletes manifest itself. Graceful on missing data.
- - **Move IndexFileContent inside driver** — currently constructed by callers in `run.*.ts`. Moving inside `contentIndexDriver.ts` consolidates wiring since the driver already has all deps.
-
- ## Research: existing NPM packages
-
- - **semantic-chunking** (jparkerweb, v2.4.4) — splits text into sentences, embeds each with ONNX model, groups by cosine similarity. Sentence-level, not line-level. Uses its own ONNX pipeline, not BYOE.
- - **semantic-chunker** (johnhenry) — BYOE approach, bring your own embedding function. More flexible. Could plug in our MiniLM-L6 embed.
- - **LangChain RecursiveCharacterTextSplitter** — recursive splitting by character/token boundaries, not semantic. 2026 benchmarks show 512-token recursive splitting at 69% accuracy — good baseline but not meaning-aware.
- - **NAACL 2025 finding**: fixed 200-word chunks match or beat semantic chunking for general RAG. But for *code* with mixed concerns in one file, semantic splitting should outperform fixed-size.
- - **Verdict**: existing packages target prose (sentence-level). Our use case is code (line-level, preserve line boundaries for references). Custom recursive bisection with our existing embed pipeline is the right call — simpler than adapting a prose chunker to respect line boundaries.
-
- ## Edge Cases
-
- - **Small files (≤ 5 lines)** — return as single cluster, no splitting attempted.
- - **Empty files** — file-level entry is still indexed (existing pipeline runs first). Clustering returns `[]`, no cluster entries created.
- - **Files with uniform content** (e.g., all imports) — cosine similarity stays high at every split, returns 1 cluster. Expected behavior.
- - **Binary/non-text files** — already filtered upstream by the file walker. Not a concern here.
- - **Legacy index data** — files indexed before this change won't have manifests. On re-index, no old clusters to delete — just index fresh.
-
- ## Open Questions
-
- - **Keyword extraction quality**: current keywords come from compromise NLP + keyword-extractor. May need tuning for code (variable names, imports, function signatures).
- - **Threshold tuning**: need to test 0.55 vs 0.60 vs 0.70 on real project files to find the sweet spot.
- - ~~**Object store dual use**~~ — resolved: three tagged types (`IIndexMeta`, `IClusterMeta`, `IFileManifest`) discriminated by `type` field, stored at separate keys. Union `IStoreEntry = IIndexMeta | IClusterMeta | IFileManifest`.
@@ -1,7 +0,0 @@
- ### 2026-04-10 — Task created
-
- - Scouted: vectra currently stores vector + IIndexMeta (keywords, file) together
- - User wants to separate: vectra for vectors only, .xindex/objects/ for meta JSON
- - Hash-based path: md5(id) → xx/yy/xxyyzz.json
- - Need to update indexContent, searchContentIndex, resetIndex, contentIndexDriver
- - New components: objectStore (read/write/clear), indexStructure (manage .xindex/ dirs)
@@ -1,81 +0,0 @@
- # Task: Object Store — Separate Meta Storage from Vectra
-
- ## Context
-
- Currently vectra stores both vectors AND metadata (`{keywords, file}`) in the same index. Vectra is good for semantic search, not for storage. Goal: split storage into two layers:
-
- - **`.xindex/semantic/`** — vectra stores only vectors + id (for search)
- - **`.xindex/objects/`** — file-based JSON store for meta objects (for storage/retrieval)
-
- **Current state:**
- - `indexContent.ts` — embeds content, upserts `{id, vector, metadata: IIndexMeta}` into vectra
- - `searchContentIndex.ts` — queries vectra, reads `r.item.metadata as IIndexMeta`
- - `resetIndex.ts` — `deleteIndex()` + `createIndex()` on vectra only
- - `IIndexMeta = {keywords: string, file: string}`
- - Index path: `.xindex` (single vectra folder)
-
- **New structure:**
- ```
- .xindex/
- ├── semantic/   ← vectra (vectors + id only, minimal meta)
- └── objects/    ← JSON files keyed by hash of id
-     └── xx/
-         └── yy/
-             └── xxyyzz.json  ← {keywords, file, ...}
- ```
-
- ## Goal
-
- Introduce an object store layer that writes `IIndexMeta` as JSON files in `.xindex/objects/`, remove metadata from vectra (keep only vector + id), and decorate `indexContent` and `searchContentIndex` to read/write both layers.
-
- ## Diagram
-
- ```
- INDEX PIPELINE:
- file → extractKeywords → cleanUp → keywords
-
- ├── [1] embed(keywords) → vector
- │   └── vectra.upsert({id, vector}) → .xindex/semantic/
-
- └── [2] objectStore.write(id, meta) → .xindex/objects/xx/yy/xxyyzz.json
-         {keywords, file}
-
- SEARCH PIPELINE:
- query → extractKeywords → cleanUp → embed → vector
-
- ├── [1] vectra.query(vector, limit) → [{score, id}]
-
- └── [2] objectStore.read(id) → IIndexMeta
-
- [{score, id, meta}]
-
- RESET:
- [1] vectra.deleteIndex + createIndex → .xindex/semantic/ wiped
- [2] rm -rf .xindex/objects/ → objects wiped
- ```
-
- ## Steps
-
- ### 1. Object Store HOF
- - Create `componets/index/objectStore.ts` — `ObjectStore({basePath}): IObjectStore`
- - `write(id, meta)` — hash id (md5 → hex), split into `xx/yy/xxyyzz`, `mkdir -p`, write JSON
- - `read(id)` — hash id, read JSON, parse as `IIndexMeta`
- - `clear()` — rm -rf basePath, recreate empty dir
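The md5 fan-out key scheme described above can be sketched with Node built-ins (the `objectPath` helper name is illustrative, not the project's):

```typescript
// Sketch of the md5 → xx/yy/<hash>.json key scheme using node:crypto.
import { createHash } from "node:crypto";
import { join } from "node:path";

export function objectPath(basePath: string, id: string): string {
  const hash = createHash("md5").update(id).digest("hex"); // 32 hex chars
  // First two byte-pairs become fan-out directories to keep dirs small.
  return join(basePath, hash.slice(0, 2), hash.slice(2, 4), `${hash}.json`);
}
```

`write` would `mkdir -p` the parent of this path and write the JSON; `read` recomputes the same path from the id.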
-
- ### 2. Update Index Structure
- - Create `componets/index/indexStructure.ts` — `IndexStructure({basePath}): IIndexStructure`
- - Manages `.xindex/` top-level: ensures `semantic/` and `objects/` dirs exist
- - Returns paths: `{semanticPath, objectsPath}`
- - Used by `contentIndexDriver` at init
-
- ### 3. Decorate Index/Search
- - Update `IndexContent` — upsert vector+id to vectra (no meta), write meta to objectStore
- - Update `SearchContentIndex` — query vectra for `{score, id}[]`, then `objectStore.read(id)` for each result to attach meta
- - Update `ResetIndex` — call both `vectra.deleteIndex/createIndex` and `objectStore.clear()`
- - Update `ContentIndexDriver` — pass `semanticPath` to `VectraIndex`, create `ObjectStore({basePath: objectsPath})`
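The decorated search in step 3 can be sketched as below; `queryVectra` is a hypothetical wrapper around the vectra query call, and the meta reads are batched with `Promise.all` (one of the open questions — chosen here only for simplicity):

```typescript
// Sketch of the two-layer search: vectra returns {score, id}, meta comes
// from the object store. Dependency shapes are assumptions.
type Hit = { score: number; id: string };
type ISearchDeps = {
  embed: (text: string) => Promise<number[]>;
  queryVectra: (vector: number[], limit: number) => Promise<Hit[]>;
  objectStore: { read: (id: string) => Promise<object> };
};

export function SearchContentIndex({ embed, queryVectra, objectStore }: ISearchDeps) {
  return async function search(query: string, limit: number) {
    const vector = await embed(query);
    const hits = await queryVectra(vector, limit);
    // Attach meta to each hit from the object store layer.
    return Promise.all(hits.map(async (h) => ({ ...h, meta: await objectStore.read(h.id) })));
  };
}
```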
-
- ## Open Questions
-
- - Hash function: `crypto.createHash('md5')` from Node built-in — fast enough, no deps. Or use simpler hash?
- - Should objectStore support partial updates (upsert) or always overwrite?
- - Should search batch-read objects or read one by one per result?
@@ -1,46 +0,0 @@
- ### 2026-04-10
-
- - Task created from user notes
- - Scouted codebase via xindex search (indexed 167 files)
- - `.xindex.json` exists but empty `{}`, no config loading anywhere
- - `CleanUpKeywords` at `componets/keywords/cleanUpKeywords.ts:8` — HOF takes `{maxNgrams, minLength}`. Add `ignoreKeywords` here.
- - `SearchContentIndex` at `componets/index/searchContentIndex.ts:12` — search pipeline, uses `cleanUpKeywords` on query. Ignore list propagates automatically.
- - `IClusterMeta` at `componets/index/indexMeta.ts:11` — has `fromLine`/`toLine` for reading snippet lines
- - MCP tool at `apps/mcpApp.ts:34` — `xindex_search` schema has `{query, limit}` only. Add snippet params.
- - CLI at `apps/run.search.ts:23-31` — formats results with score + keywords, no snippets
- - `BuildComponents` at `componets/buildComponents.ts:6` — wires everything, no config loading. Config loads here.
- - `ContentIndexDriver` at `componets/index/contentIndexDriver.ts:27` — passes `cleanUpKeywords` to `ClusterLines` and `SearchContentIndex`
- - Entry points: `apps/run.mcp.ts:19` (MCP), `apps/run.search.ts:8` (CLI) — both call `BuildComponents()`
- - User wants explicit config names: `ignoreKeywords`, `snippetLines`, `snippetResults`
-
- **Clarification round — decisions:**
- - Defaults confirmed: `snippetResults: 3`, `snippetLines: 7`
- - `ignoreKeywords`: exact strings, case-insensitive. No globs/patterns.
- - Ignore at **index time** — re-index + MCP restart after config change is acceptable. One-time setup, review in 3mo.
- - File-level results (whole file, no cluster) also get snippets if file total lines ≤ `snippetLines`
- - Task finalized
-
- **Round 2 — user feedback during detail expansion:**
- - Renamed `snippetLines` → `maxSnippetLines`, `snippetResults` → `maxSnippetResults` (user preference for explicit names)
- - Added `ignoreFiles` feature: gitignore-style glob patterns in `.xindex.json` to exclude files from indexing. Reuses existing `ignore` package already in `walkFiles.ts:3` and `watchFiles.ts`
- - Expanded task from 3x3 to 4x3 to accommodate file ignore list as separate step
- - Traced all WalkFiles/WatchFiles consumers: `run.mcp.ts`, `run.index.ts`, `run.watch.ts` — all need `ignoreFiles` plumbed
- - Task ready for implementation
-
- **Round 3 — consistency check (7 findings, all fixed):**
- - [Missing] `.xindex.json` is optional — added to Decisions + diagram label
- - [Drift] Diagram only showed WalkFiles — added WatchFiles
- - [Mismatch] Step 2.3 duplicated validation from 1.2 — removed 2.3, kept in 1.2 only
- - [Mismatch] `console.warn` in LoadConfig violates project `ILogger` pattern — added `log: ILogger` dep to LoadConfig and BuildComponents
- - [Drift] Files Changed table had tentative "(if it creates its own)" for run.index.ts — made definitive
- - [Inconsistency] Step 4.2 parsed fromLine/toLine from ID string — uses `meta.fromLine`/`meta.toLine` directly now
- - [Missing] Step 1.3 vague "WalkFiles consumers" — listed all 5 specific construction sites (run.mcp.ts:18,30, run.index.ts:10, run.watch.ts:13,14)
-
- **Round 4 — implementation:**
- - Implemented all 12 files (3 new, 9 modified) + run.reset.ts (missed in plan, also calls BuildComponents)
- - Phase 1: config type + loadConfig HOF
- - Phase 2: cleanUpKeywords ignoreSet, walkFiles + watchFiles ignoreFiles
- - Phase 3: readSnippet HOF
- - Phase 4: buildComponents wiring ({log} param, config loading, return config)
- - Phase 5: all entry points updated (run.mcp, run.search, run.index, run.watch, run.reset)
- - Verified: keyword ignore filters noisy words, file ignore excludes rnd/**, snippets show for small results (top 3, ≤7 lines)
@@ -1,274 +0,0 @@
- # Task: Search Result Config — Keyword Ignore, File Ignore & Inline Code Snippets
-
- ## Context
-
- Three improvements to xindex, all configurable via `.xindex.json`:
-
- 1. **Keyword ignore list** — exclude noisy keywords at index time (case-insensitive exact match). Improves grouping relevance. Requires re-index after config change — acceptable as one-time setup.
- 2. **File ignore list** — gitignore-style glob patterns to exclude files from indexing. Same semantics as `.gitignore` but defined in `.xindex.json`. Applied in `WalkFiles` and `WatchFiles` alongside existing `.gitignore` rules.
- 3. **Inline code snippets** — when a search result is small (≤ N lines), include actual source code in the output. Configurable via `.xindex.json` defaults + MCP tool parameter overrides.
-
- **Current state:**
- - `.xindex.json` exists but is empty `{}` — file is optional, may not exist at all
- - Keywords: `compromise` NLP → `keyword-extractor` cleanup in `componets/keywords/cleanUpKeywords.ts`
- - File walking: `componets/walkFiles.ts` uses `ignore` package for `.gitignore` rules; `componets/watchFiles.ts` has its own `loadGitignore` + `ignore()` at line 22-31
- - Search results: 1-line summaries only (`1. path:from-to (score) — keywords`)
- - Cluster metadata stores `fromLine`/`toLine` in `componets/index/indexMeta.ts:11` (`IClusterMeta`)
- - MCP `xindex_search` accepts `query` and `limit` only
- - No config loading exists
- - Entry points that create `WalkFiles`/`WatchFiles`: `run.mcp.ts:18,30`, `run.index.ts:10`, `run.watch.ts:13-14`
-
- **Decisions:**
- - `.xindex.json` is **optional** — missing file → all defaults, no error
- - `ignoreKeywords`: exact strings, case-insensitive. No globs/patterns.
- - `ignoreFiles`: gitignore-style glob patterns (reuses existing `ignore` package)
- - `maxSnippetResults: 3`, `maxSnippetLines: 7` — confirmed defaults
- - Ignore list applied at **index time** — re-index + MCP restart required after config change. One-time setup, review in 3mo.
- - File-level results (no cluster) **also get snippets** if total file lines ≤ `maxSnippetLines`
- - Config field names are explicit: `ignoreKeywords`, `ignoreFiles`, `maxSnippetLines`, `maxSnippetResults`
-
- ## Diagram
-
- ```
- .xindex.json (optional)                  MCP xindex_search
- ┌──────────────────────────┐             ┌──────────────────────────────┐
- │ ignoreKeywords: [...]    │             │ query, limit                 │
- │ ignoreFiles: [...]       │             │ maxSnippetResults: 3         │ ← override
- │ maxSnippetLines: 7       │             │ maxSnippetLines: 7           │ ← override
- │ maxSnippetResults: 3     │             └──────┬───────────────────────┘
- └──────┬───────────────────┘                    │
-        │                                        │
-        ├─ ignoreFiles ────┐                     │
-        │                  ▼                     │
-        │       WalkFiles + WatchFiles           │
-        │       (skip matching paths)            │
-        │                                        │
-        ├─ ignoreKeywords ─┐                     │
-        │                  ▼                     │
-        │         CleanUpKeywords                │
-        │         (index time)                   │
-        │                                        │
-        └─ snippet config ───────────────────────┤
-
- Format results
- ├─ cluster ≤ maxSnippetLines? → readSnippet
- └─ file ≤ maxSnippetLines? → readSnippet
-
- Data flow (indexing):
- walkFiles(inputs)  ← ignoreFiles applied here (NEW)
- → readFile
- → ExtractKeywords (compromise NLP)
- → CleanUpKeywords (keyword-extractor + ignoreKeywords filter) ← NEW
- → embed → vectra upsert + objectStore write
-
- Data flow (search):
- query
- → extractKeywords → cleanUpKeywords → embed → vectra query
- → filter by scoreThreshold → objectStore.read for each hit
- → format results + readSnippet for top N small results ← NEW
- ```
-
- ## Steps
-
- ### 1. Config schema & loading
-
- - **1.1 Define config type** — create `componets/config/xindexConfig.ts`
-   ```ts
-   export type IXindexConfig = {
-     ignoreKeywords: string[];
-     ignoreFiles: string[];
-     maxSnippetLines: number;
-     maxSnippetResults: number;
-   };
-   ```
-   All fields optional in the JSON file; defaults applied at load time.
-
- - **1.2 Load config** — create `componets/config/loadConfig.ts` as HOF
-   ```ts
-   export type ILoadConfig = () => Promise<IXindexConfig>;
-   export function LoadConfig({configPath, log}: {configPath: string, log: ILogger}): ILoadConfig
-   ```
-   - Read `configPath` (`.xindex.json` in cwd)
-   - `JSON.parse`, apply defaults: `{ignoreKeywords: [], ignoreFiles: [], maxSnippetLines: 7, maxSnippetResults: 3}`
-   - If file missing or empty → return all defaults (no error)
-   - If JSON parse fails → throw with clear message including path
-   - Validate: use `log` to warn if any `ignoreKeywords` entry has length ≤ 1 (no `console.*` — project uses `ILogger`)
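The 1.2 bullets can be sketched as a full `LoadConfig` body. This is one possible implementation under the stated rules, not the shipped code (the `ILogger` shape is reduced to a `warn` method for the sketch):

```typescript
// Sketch of LoadConfig: optional file, defaults, clear parse errors,
// ILogger-based validation warnings. ILogger shape is an assumption.
import { readFile } from "node:fs/promises";

type IXindexConfig = { ignoreKeywords: string[]; ignoreFiles: string[]; maxSnippetLines: number; maxSnippetResults: number };
type ILogger = { warn: (msg: string) => void };

const DEFAULTS: IXindexConfig = { ignoreKeywords: [], ignoreFiles: [], maxSnippetLines: 7, maxSnippetResults: 3 };

export function LoadConfig({ configPath, log }: { configPath: string; log: ILogger }) {
  return async function loadConfig(): Promise<IXindexConfig> {
    let raw: string;
    try {
      raw = await readFile(configPath, "utf8");
    } catch {
      return { ...DEFAULTS }; // missing file → all defaults, no error
    }
    if (!raw.trim()) return { ...DEFAULTS }; // empty file → defaults
    let parsed: Partial<IXindexConfig>;
    try {
      parsed = JSON.parse(raw);
    } catch (e) {
      throw new Error(`Invalid JSON in ${configPath}: ${(e as Error).message}`);
    }
    const config = { ...DEFAULTS, ...parsed };
    for (const kw of config.ignoreKeywords) {
      if (kw.length <= 1) log.warn(`ignoreKeywords entry "${kw}" is too short to be useful`);
    }
    return config;
  };
}
```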
96
-
97
- - **1.3 Wire into BuildComponents** — modify `componets/buildComponents.ts:6-22`
98
- - Current: creates `embed`, `extractKeywords`, `cleanUpKeywords({maxNgrams: 2, minLength: 2})` then `ContentIndexDriver`
99
- - `BuildComponents` currently takes no args. Add `{log}: {log: ILogger}` so `LoadConfig` can use it for warnings.
100
- - Add: `const loadConfig = LoadConfig({configPath: ".xindex.json", log})` → `const config = await loadConfig()`
101
- - Pass `config.ignoreKeywords` to `CleanUpKeywords`: `CleanUpKeywords({maxNgrams: 2, minLength: 2, ignoreKeywords: config.ignoreKeywords})`
102
- - Return `config` in the output so callers can access snippet + file ignore settings
103
- - `BuildComponents` return type gains `config: IXindexConfig`
104
- - All callers need updating to pass `log` and destructure `config`:
105
- - `apps/run.mcp.ts:19` — needs `config` for `McpApp` + `ignoreFiles` for `WalkFiles`/`WatchFiles`
106
- - `apps/run.index.ts:11` — needs `config.ignoreFiles` for `WalkFiles`
107
- - `apps/run.watch.ts:15` — needs `config.ignoreFiles` for `WalkFiles`/`WatchFiles`
108
- - `apps/run.search.ts:8` — needs `config` for snippet settings
109
-
- ### 2. Keyword ignore list
-
- - **2.1 Extend CleanUpKeywords** — modify `componets/keywords/cleanUpKeywords.ts:8`
- - Current signature: `CleanUpKeywords({maxNgrams, minLength}: {maxNgrams: number, minLength: number})`
- - New signature: `CleanUpKeywords({maxNgrams, minLength, ignoreKeywords = []}: {maxNgrams: number, minLength: number, ignoreKeywords?: string[]})`
- - Build `ignoreSet = new Set(ignoreKeywords.map(k => k.toLowerCase()))` at factory time (once, not per call)
- - Add `if (ignoreSet.has(lower)) return false;` into existing filter chain at line 21-27, before the `seen` dedup check
-
- Exact change at `cleanUpKeywords.ts`:
- ```ts
- export function CleanUpKeywords({maxNgrams, minLength, ignoreKeywords = []}: {
-   maxNgrams: number, minLength: number, ignoreKeywords?: string[]
- }): ICleanUpKeywords {
-   const ignoreSet = new Set(ignoreKeywords.map(k => k.toLowerCase()));
-   return function cleanUpKeywords(keywords) {
-     // ... existing extraction ...
-     const seen = new Set<string>();
-     return extracted.filter((kw: string) => {
-       if (kw.length <= minLength || !/[a-z]/i.test(kw)) return false;
-       const lower = kw.toLowerCase();
-       if (ignoreSet.has(lower)) return false; // ← NEW
-       if (seen.has(lower)) return false;
-       seen.add(lower);
-       return true;
-     });
-   }
- }
- ```
-
- - **2.2 Propagation** — single change at `buildComponents.ts:9` propagates to all consumers:
- - `ContentIndexDriver` (`contentIndexDriver.ts:28`) passes `cleanUpKeywords` to:
- - `ClusterLines` (`clusterLines.ts:20`) — uses at `:34-35` (top/bot split keywords) and `:56` (leaf keywords)
- - `IndexFileContent` (`indexFileContent.ts:10`) — uses at `:19` (file-level keywords)
- - `SearchContentIndex` (`searchContentIndex.ts:12`) — uses at `:22` (query keywords)
- - All paths share the same `cleanUpKeywords` instance. No additional wiring needed.
-
- ### 3. File ignore list
-
- - **3.1 Extend WalkFiles** — modify `componets/walkFiles.ts:8`
- - Current signature: `WalkFiles({cwd, log}: {cwd: string, log: ILogger})`
- - New signature: `WalkFiles({cwd, log, ignoreFiles = []}: {cwd: string, log: ILogger, ignoreFiles?: string[]})`
- - In `walk()` at line 18-22: the `ignore` instance `ig` is already constructed per-directory with accumulated `.gitignore` rules. Add `ignoreFiles` rules after existing rules:
- ```ts
- const ig = ignore();
- for (const rule of rules) ig.add(rule);
- for (const pattern of ignoreFiles) ig.add(pattern); // ← NEW
- ```
- - This makes `ignoreFiles` patterns behave identically to `.gitignore` entries — same glob syntax, same matching semantics (relative paths, directory trailing `/`, negation with `!`)
- - The `ignore` package is already a dependency (`package.json:28`)
-
- - **3.2 Extend WatchFiles** — modify `componets/watchFiles.ts:20`
- - Current signature: `WatchFiles({cwd, log}: {cwd: string, log: ILogger})`
- - New signature: `WatchFiles({cwd, log, ignoreFiles = []}: {cwd: string, log: ILogger, ignoreFiles?: string[]})`
- - In `loadGitignore()` at line 22-31: creates its own `ignore()` instance per watched directory. Add `ignoreFiles` rules after `.gitignore` rules:
- ```ts
- async function loadGitignore(dir: string) {
-   const ig = ignore();
-   ig.add(".*");
-   try {
-     const content = await readFile(join(dir, ".gitignore"), "utf8");
-     ig.add(content);
-   } catch {}
-   for (const pattern of ignoreFiles) ig.add(pattern); // ← NEW
-   return ig;
- }
- ```
-
- - **3.3 Wire ignoreFiles to all entry points** — pass `config.ignoreFiles` at each `WalkFiles`/`WatchFiles` construction:
- - `apps/run.mcp.ts:18` — `WalkFiles({cwd, log})` → `WalkFiles({cwd, log, ignoreFiles: config.ignoreFiles})`
- - `apps/run.mcp.ts:30` — `WatchFiles({cwd, log})` → `WatchFiles({cwd, log, ignoreFiles: config.ignoreFiles})`
- - `apps/run.index.ts:10` — `WalkFiles({cwd, log})` → `WalkFiles({cwd, log, ignoreFiles: config.ignoreFiles})`
- - `apps/run.watch.ts:13` — `WalkFiles({cwd, log})` → `WalkFiles({cwd, log, ignoreFiles: config.ignoreFiles})`
- - `apps/run.watch.ts:14` — `WatchFiles({cwd, log})` → `WatchFiles({cwd, log, ignoreFiles: config.ignoreFiles})`
-
- ### 4. Inline code snippets
-
- - **4.1 Add snippet params to MCP** — modify `apps/mcpApp.ts:34-58`
- - `McpApp` factory gains `config: IXindexConfig` dependency (add to `mcpApp.ts:23` params)
- - Extend `xindex_search` schema at line 37:
- ```ts
- inputSchema: z.object({
-   query: z.string().describe("Natural language search query"),
-   limit: z.number().int().min(1).max(100).default(10)
-     .describe("Max results to return, 10 by default, 100 max"),
-   maxSnippetResults: z.number().int().min(0).max(20).optional()
-     .describe("How many top results include inline code (default from .xindex.json, 3)"),
-   maxSnippetLines: z.number().int().min(0).max(50).optional()
-     .describe("Max lines in a result to qualify for inline code (default from .xindex.json, 7)"),
- }),
- ```
- - In handler: resolve with config fallback:
- ```ts
- const sr = maxSnippetResults ?? config.maxSnippetResults;
- const sl = maxSnippetLines ?? config.maxSnippetLines;
- ```
- - Update `apps/run.mcp.ts:48` — pass `config` to `McpApp`
-
- - **4.2 Read source lines** — create `componets/index/readSnippet.ts`
- ```ts
- export type IReadSnippet = (record: IIndexRecord, maxLines: number) => Promise<string | null>;
- export function ReadSnippet(): IReadSnippet
- ```
- Logic:
- - **Cluster result** (`meta.type === StoreEntryType.cluster`): use `meta.fromLine`/`meta.toLine` directly from `IClusterMeta` (no need to parse from ID). Compute `lineCount = meta.toLine - meta.fromLine + 1`. If `lineCount > maxLines` → return `null`. Extract file path from `record.id` by splitting on last `:` (format `"path/to/file.ts:14-27"`). `readFile(filePath, "utf8")`, split lines, slice `[meta.fromLine-1, meta.toLine]`, return joined with `\n`.
- - **File result** (`meta.type === StoreEntryType.meta`): `readFile(record.id, "utf8")`, split lines, count. If `lineCount > maxLines` → return `null`. Otherwise return full content.
- - **Error handling**: on `readFile` failure (file deleted, moved, permission error) → return `null` silently. Search results still display; snippet is just omitted.
-
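The 4.2 logic can be sketched roughly as below. The record/meta shapes are simplified stand-ins, and the `StoreEntryType` enum values are replaced with string literals for self-containment; the project's real types may differ:

```ts
import {readFile} from "node:fs/promises";

// Simplified stand-ins for the project's record types (assumption).
type IIndexRecord = {
  id: string;  // cluster ids look like "path/to/file.ts:14-27", file ids are plain paths
  meta: {type: "cluster" | "meta", fromLine?: number, toLine?: number},
};
export type IReadSnippet = (record: IIndexRecord, maxLines: number) => Promise<string | null>;

export function ReadSnippet(): IReadSnippet {
  return async function readSnippet(record, maxLines) {
    try {
      if (record.meta.type === "cluster") {
        const {fromLine = 1, toLine = 1} = record.meta;
        if (toLine - fromLine + 1 > maxLines) return null;           // too long to inline
        const filePath = record.id.slice(0, record.id.lastIndexOf(":"));
        const lines = (await readFile(filePath, "utf8")).split("\n");
        return lines.slice(fromLine - 1, toLine).join("\n");
      }
      // File-level result: inline only if the whole file is short enough.
      const content = await readFile(record.id, "utf8");
      if (content.split("\n").length > maxLines) return null;
      return content;
    } catch {
      return null;  // deleted/moved/unreadable file → omit the snippet silently
    }
  };
}
```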
- - **4.3 Format with code** — update result formatting in both entry points:
-
- **MCP** (`apps/mcpApp.ts:47-51`): currently `results.map(...)` builds 1-line summaries. Replace with loop:
- ```ts
- const readSnippet = ReadSnippet();
- const lines: string[] = [];
- for (let i = 0; i < results.length; i++) {
-   const r = results[i];
-   const kw = r.meta.keywords ? ` — ${r.meta.keywords}` : "";
-   lines.push(`${i + 1}. ${r.id} (${r.score.toFixed(2)})${kw}`);
-   if (i < sr) {
-     const snippet = await readSnippet(r, sl);
-     if (snippet) lines.push("```\n" + snippet + "\n```");
-   }
- }
- ```
-
- **CLI** (`apps/run.search.ts:8-31`): destructure config from `BuildComponents()`:
- ```ts
- const {searchContentIndex, config} = await BuildComponents({log});
- ```
- Then same snippet pattern using `config.maxSnippetResults` and `config.maxSnippetLines`:
- ```ts
- const readSnippet = ReadSnippet();
- // ... in the result loop after existing log lines:
- if (i < config.maxSnippetResults) {
-   const snippet = await readSnippet(results[i], config.maxSnippetLines);
-   if (snippet) log("```\n" + snippet + "\n```");
- }
- ```
-
- ## Files Changed
-
- | File | Change |
- |------|--------|
- | `componets/config/xindexConfig.ts` | **NEW** — `IXindexConfig` type with 4 fields |
- | `componets/config/loadConfig.ts` | **NEW** — `LoadConfig` HOF, reads `.xindex.json` (optional), applies defaults, warns via `ILogger` |
- | `componets/keywords/cleanUpKeywords.ts` | Add `ignoreKeywords` param + `ignoreSet` filter |
- | `componets/walkFiles.ts` | Add `ignoreFiles` param, feed into `ignore()` instances |
- | `componets/watchFiles.ts` | Add `ignoreFiles` param, feed into `loadGitignore()` `ignore()` instance |
- | `componets/buildComponents.ts` | Add `{log}` param, load config, pass `ignoreKeywords` to `CleanUpKeywords`, return `config` |
- | `componets/index/readSnippet.ts` | **NEW** — `ReadSnippet` HOF, reads file lines for a search result |
- | `apps/mcpApp.ts` | Add `config` dep, `maxSnippetResults`/`maxSnippetLines` schema params, snippet formatting |
- | `apps/run.mcp.ts` | Pass `config` to `McpApp`, `ignoreFiles` to `WalkFiles`/`WatchFiles`, `log` to `BuildComponents` |
- | `apps/run.search.ts` | Pass `log` to `BuildComponents`, use `config` for snippet formatting |
- | `apps/run.index.ts` | Pass `log` to `BuildComponents`, `ignoreFiles` to `WalkFiles` |
- | `apps/run.watch.ts` | Pass `log` to `BuildComponents`, `ignoreFiles` to `WalkFiles`/`WatchFiles` |
-
- ## Example `.xindex.json`
-
- ```json
- {
-   "ignoreKeywords": ["import", "export", "const", "function", "return", "async", "await"],
-   "ignoreFiles": ["*.test.ts", "*.spec.ts", "rnd/**", "dist/**"],
-   "maxSnippetLines": 7,
-   "maxSnippetResults": 3
- }
- ```
@@ -1,32 +0,0 @@
- # Log: xindex-watch — Continuous Indexing with File Watcher
-
- ### 2026-04-10
-
- - Task created from user notes: "file watcher → apply to indexer → xindex-index runs → indexes provided or cwd → watches for changes → created/updated/moved/deleted → queued to stream → index content"
- - Scouted codebase: IndexApp already uses streamx pipeline (`from → tap → map → run`), WalkFiles is async generator, Writer supports push-based streaming, merge combines streams
- - Key decision: use `node:fs/promises` `watch()` (recursive, async iterable) — no external dep needed
- - Key decision: tagged union `{type:"index"|"remove", path}` as common event shape for walk + watch streams
- - Key decision: merge initial walk stream + watch stream into single pipeline
- - Debounce needed: editors fire multiple events per save (write temp → rename → delete old)
- - RemoveContent needed: Vectra has `deleteItem()` but no HOF wrapper exists yet
- - **Clarification round resolved:**
- - Watch is always on (no optional flag) — index all, then watch. Default behavior.
- - Default to cwd when no args
- - Event→Vectra: created→add, updated→delete+add, deleted→delete, moved→delete(old)+add(new)
- - Binary filtering: deferred (TODO), keep simple for now
- - Graceful shutdown: SIGINT → stop processing → ignore queued → exit
- - Watching individual files: works fine with fs.watch, no issue
- - **Consistency check:** fixed 6 issues — removed optional watch flag, added graceful shutdown to diagram/steps, clarified update=delete+add semantics, marked binary filtering as TODO
- - **User clarification:** separate entry points — `xindex-watch` (new, continuous) vs `xindex-index` (existing, one-time). Both default to cwd.
- - **Design decision:** Vectra `upsertItem` handles both add and update — no need for delete+add on updates, just upsert
- - **Design decision:** WatchApp is a new HOF in `apps/watchApp.ts`; IndexApp stays unchanged; no modifications to MCP/search paths
- - **Design decision:** WatchFiles uses `Writer<FileEvent>` to push events into streamx-compatible stream; `stop()` closes watchers + finishes writer
- - **Implementation pivot:** streamx `Writer`/`merge`/`batchTimed` depend on `@handy/fun` (not installed in xindex). Rewrote to use plain async generators instead — simpler, no new deps. Two-phase approach: walk+index first, then watch+process.
- - **Debounce approach:** collect events in Map (keyed by path, last event wins), flush after 150ms quiet period. Replaces batchTimed.
- - **Watcher uses AbortController** for clean shutdown — `fs.watch` accepts `signal` option natively.
- - **All steps implemented and verified:**
- - Step 1: RemoveContent HOF + wired into ContentIndexDriver + BuildComponents ✓
- - Step 2: WatchFiles component with debounced async generator ✓
- - Step 3: WatchApp with two-phase (walk then watch) ✓
- - Step 4: run.watch.ts + bin/xindex-watch + package.json + run.index.ts default ✓
- - **Tested:** initial index, file create detection, file delete detection, SIGINT graceful shutdown — all pass
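The debounce decision logged above (collect events in a Map keyed by path, last event wins, flush after a quiet period) can be sketched roughly as follows; the names and the batch-consumption API here are illustrative, not the project's actual code:

```ts
// Event shape matching the tagged union noted in the log.
type FileEvent = {type: "index" | "remove", path: string};

export function DebounceEvents({quietMs}: {quietMs: number}) {
  const pending = new Map<string, FileEvent>();
  let timer: ReturnType<typeof setTimeout> | null = null;
  let resolveBatch: ((events: FileEvent[]) => void) | null = null;

  function push(event: FileEvent) {
    pending.set(event.path, event);       // last event for a path wins
    if (timer) clearTimeout(timer);       // any new event restarts the quiet period
    timer = setTimeout(() => {
      const batch = [...pending.values()];
      pending.clear();
      resolveBatch?.(batch);              // dropped if nobody is waiting (sketch only)
      resolveBatch = null;
    }, quietMs);
  }

  // Resolves with the next flushed batch once the quiet period elapses.
  function nextBatch(): Promise<FileEvent[]> {
    return new Promise(resolve => { resolveBatch = resolve; });
  }

  return {push, nextBatch};
}
```

In a real watcher loop, `push` would be fed from `fs.watch` events and the indexer would `await nextBatch()` repeatedly, so an editor's burst of temp-write/rename/delete events per save collapses into one event per path.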