xindex 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. package/.ai/research/.gitkeep +0 -0
  2. package/.ai/task/.gitkeep +0 -0
  3. package/README.md +54 -89
  4. package/apps/run.search.ts +0 -3
  5. package/componets/index/formatSearchResults.ts +2 -2
  6. package/media/MEDIUM.md +139 -0
  7. package/media/SOCIAL.md +102 -0
  8. package/package.json +1 -1
  9. package/.ai/research/2026-04-10-file-watching.md +0 -79
  10. package/.ai/research/2026-04-10-mcp-output-format.md +0 -129
  11. package/.ai/task/INDEX.md +0 -12
  12. package/.ai/task/done/INDEX.md +0 -3
  13. package/.ai/task/done/task.2026-04-09-local-ai-research-protos.log.md +0 -98
  14. package/.ai/task/done/task.2026-04-09-local-ai-research-protos.md +0 -102
  15. package/.ai/task/task.2026-04-10-cluster-config.log.md +0 -19
  16. package/.ai/task/task.2026-04-10-cluster-config.md +0 -118
  17. package/.ai/task/task.2026-04-10-dir-indexing.log.md +0 -8
  18. package/.ai/task/task.2026-04-10-dir-indexing.md +0 -92
  19. package/.ai/task/task.2026-04-10-line-clustering.log.md +0 -50
  20. package/.ai/task/task.2026-04-10-line-clustering.md +0 -176
  21. package/.ai/task/task.2026-04-10-object-store.log.md +0 -7
  22. package/.ai/task/task.2026-04-10-object-store.md +0 -81
  23. package/.ai/task/task.2026-04-10-search-config.log.md +0 -46
  24. package/.ai/task/task.2026-04-10-search-config.md +0 -274
  25. package/.ai/task/task.2026-04-10-watch-indexing.log.md +0 -32
  26. package/.ai/task/task.2026-04-10-watch-indexing.md +0 -101
  27. package/.ai/task/task.2026-04-10-xindex-mcp.log.md +0 -5
  28. package/.ai/task/task.2026-04-10-xindex-mcp.md +0 -92
  29. package/.ai/task/task.2026-04-10-xindex-mcp.report.md +0 -113
@@ -1,129 +0,0 @@
- # MCP Tool Output Format for LLM Consumption
-
- **Question**: What output format should our xindex_search MCP tool use to return search results to an LLM?
-
- **Current state**: `JSON.stringify(results, null, 2)` — pretty-printed JSON with score, id, meta.keywords, meta.file (id and meta.file are redundant).
-
- ---
-
- ## Findings
-
- ### 1. Token efficiency benchmarks (ImprovingAgents, Oct 2025)
-
- **Nested data** — 1,000 questions, 3 models, 4 formats:
-
- | Format | Tokens | GPT-5 Nano | Gemini 2.5 Flash Lite |
- |----------|---------|------------|----------------------|
- | Markdown | 38,357 | 54.3% | 48.2% |
- | YAML | 42,477 | 62.1% | 51.9% |
- | JSON | 57,933 | 50.3% | 43.1% |
- | XML | 68,804 | 44.4% | 33.8% |
-
- Markdown uses **34% fewer tokens** than JSON. YAML has better accuracy but uses more tokens.
-
- **Flat/tabular data** — 11 formats, 1,000 queries, GPT-4.1-nano:
-
- | Format | Accuracy | Tokens | Efficiency |
- |----------------|----------|---------|------------|
- | Markdown-KV | 60.7% | 52,104 | Best accuracy |
- | Markdown Table | 51.9% | 25,140 | Best ratio |
- | JSON | 52.3% | 66,396 | Mediocre |
- | CSV | 44.3% | 19,524 | Cheapest but worst |
-
- For flat data (which our search results are), **Markdown-KV** gives the best LLM comprehension. A numbered list with `key: value` pairs is effectively Markdown-KV.
-
- Sources: [Nested formats](https://www.improvingagents.com/blog/best-nested-data-format/), [Table formats](https://www.improvingagents.com/blog/best-input-data-format-for-llms/)
-
- ### 2. MCP spec guidance (June 2025)
-
- `content` (TextContent) = what the LLM reads
- `structuredContent` = machine-to-machine, optional
- Spec's own example uses **plain text**: `"Current weather in New York:\nTemperature: 72°F\nConditions: Partly cloudy"`
- If `outputSchema` is defined, the server SHOULD return both `structuredContent` AND the serialized JSON in TextContent for backwards compatibility
-
- The spec explicitly shows plain text as the standard tool result format for LLM consumption.
-
- Source: [MCP Tools Spec](https://modelcontextprotocol.io/specification/2025-06-18/server/tools)
-
- ### 3. What popular MCP servers do
-
- | Server | Output format |
- |-------------|--------------|
- | Perplexity | AI-synthesized text + citation URLs |
- | Context7 | Plain text documentation snippets |
- | markdownify | Markdown (entire category exists for this) |
- | Elasticsearch | JSON (machine-oriented) |
-
- LLM-facing servers use text/markdown. Only machine-oriented servers use JSON.
-
- ### 4. JSON specifically degrades LLM reasoning
-
- Aider benchmarks: JSON wrapping reduces code reasoning quality by 10-15% ([source](https://aider.chat/2024/08/14/code-in-json.html))
- arXiv paper: frontier models top out at ~77% accuracy on JSON processing tasks ([source](https://arxiv.org/html/2510.15955v1))
- OpenAI community: Markdown is 15% more token-efficient than JSON ([source](https://community.openai.com/t/markdown-is-15-more-token-efficient-than-json/841742))
-
- ### 5. TOON format (Nov 2025) — not recommended
-
- New token-optimized format. Mixed results: 73.9% on flat retrieval but **last place** (43.1%) on nested data. Immature ecosystem, no MCP support. Not applicable here.
-
- Source: [TOON benchmarks](https://www.improvingagents.com/blog/toon-benchmarks/)
-
- ### 6. Workato design guidelines
-
- Return only necessary fields — avoid sending 200+ fields when 3 suffice
- Preprocess/summarize large content before returning to LLM
- Consider token efficiency — "excessive data can overwhelm the AI agent"
-
- Source: [Workato MCP Tool Design](https://docs.workato.com/en/mcp/mcp-server-tool-design.html)
-
- ---
-
- ## Analysis
-
- Our search results are **flat data** with 3 fields per result (score, file path, keywords). This is the simplest case:
-
- | Approach | Tokens/result | LLM quality | Fit |
- |----------|--------------|-------------|-----|
- | Pretty JSON (current) | ~55 | Worst — syntax overhead | Bad |
- | Compact JSON | ~22 | OK but cryptic keys | Meh |
- | Markdown numbered list | ~12 | Best — Markdown-KV pattern | Best |
- | TSV | ~15 | OK but less natural | OK |
-
- The markdown numbered list matches the **Markdown-KV** pattern that scored highest (60.7%) in the flat data benchmarks. It also uses **77% fewer tokens** than the current JSON output.
-
- Additional advantages:
- File path is visually prominent (it's what the LLM acts on next)
- Score at 2 decimals is a sufficient ranking signal
- Keywords give semantic context without opening the file
- Zero structural noise (no braces, brackets, quotes, commas)
- Matches how Perplexity/Context7 format their responses
-
- No significant trade-offs: we don't need machine-parseability (the consumer is always an LLM), and there's no nested data to worry about.
-
- ---
-
- ## Recommendation
-
- **Switch to a markdown numbered list.**
-
- ```
- Search: "authentication flow" — 3 result(s)
-
- 1. src/components/auth.ts (0.87) — authentication, login, session, token
- 2. src/middleware/jwt.ts (0.81) — jwt, token, verify, middleware
- 3. src/routes/login.ts (0.74) — login, form, credentials, redirect
- ```
-
- Implementation in `mcpApp.ts`:
- ```ts
- const header = `Search: "${query}" — ${results.length} result(s)\n\n`;
- const lines = results.map((r, i) =>
- `${i + 1}. ${r.id} (${r.score.toFixed(2)}) — ${r.meta.keywords ?? ""}`
- );
- const text = header + lines.join("\n");
- return {content: [{type: "text" as const, text}]};
- ```
-
- Empty case: `No results for "${query}"` — avoids confusing the model with an empty list.
-
- **Future consideration**: Add `outputSchema` + `structuredContent` when clients start using it, but keep TextContent as the primary format for LLM consumption.
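The recommendation and its empty case can be combined into one formatter. A minimal sketch, assuming the result shape described above (`id`, `score`, `meta.keywords`); the function name `formatSearchResults` is illustrative:

```typescript
// Shape of a single search result as described in the document (illustrative).
interface SearchResult {
  id: string; // relative file path
  score: number; // cosine similarity
  meta: { keywords?: string };
}

// Format results as a Markdown numbered list; plain sentence for the empty case.
function formatSearchResults(query: string, results: SearchResult[]): string {
  if (results.length === 0) return `No results for "${query}"`;
  const header = `Search: "${query}" — ${results.length} result(s)\n\n`;
  const lines = results.map(
    (r, i) => `${i + 1}. ${r.id} (${r.score.toFixed(2)}) — ${r.meta.keywords ?? ""}`
  );
  return header + lines.join("\n");
}
```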
package/.ai/task/INDEX.md DELETED
@@ -1,12 +0,0 @@
- # Tasks
-
- [xindex-mcp — MCP Server for Semantic Code Search](task.2026-04-10-xindex-mcp.md) — wrap xindex as MCP server so Claude Code can search codebase
- [Directory-based Indexing with Async Streams](task.2026-04-10-dir-indexing.md) — accept files/dirs, recursive walk with .gitignore, index via streamx pipeline
- [xindex-watch — Continuous Indexing](task.2026-04-10-watch-indexing.md) — new entry point: index all + watch for changes continuously via merged stream
- [Object Store — Separate Meta from Vectra](task.2026-04-10-object-store.md) — store meta as JSON files in .xindex/objects/, vectra keeps only vectors
- [Line-level Clustering](task.2026-04-10-line-clustering.md) — recursive bisection to split files into semantic blocks, index as file:fromLine-toLine
-
- [Search Config — Keyword Ignore & Inline Snippets](task.2026-04-10-search-config.md) — `.xindex.json` config for ignoring noisy keywords + inlining small code clusters in results
- [Cluster Config — Move ClusterLines defaults to .xindex.json](task.2026-04-10-cluster-config.md) — repo-level clustering params (`threshold`, `minLines`, `maxDepth`) instead of hardcoded defaults
-
- See [done/INDEX.md](done/INDEX.md) for completed tasks.
@@ -1,3 +0,0 @@
- # Done Tasks
-
- [xindex — Local Semantic Code Search](task.2026-04-09-local-ai-research-protos.md) — R&D prototyping → HOF refactoring → working semantic search tool (completed 2026-04-10)
@@ -1,98 +0,0 @@
- ### 2026-04-09 — Session log
-
- #### 1. hf.ts — Local text generation
-
- Installed `@huggingface/transformers` (v4.0.1)
- First run failed: top-level `await` not supported in CJS → fixed by adding `"type": "module"` to package.json
- Model: `HuggingFaceTB/SmolLM2-135M-Instruct` — downloads ONNX weights on first run, cached after
- API: `pipeline("text-generation", model)` → pass chat messages array → `output[0].generated_text.at(-1).content`
- Output quality: basic — responded "Hello, my name is [Your Name]." to "Write a one-line hello"
-
- #### 2. vectra.ts — Local vector search
-
- Installed `vectra` (v0.14.0)
- **API gotchas** — online examples use outdated API:
- `VectraIndex` → actually `LocalIndex`
- Constructor: `new LocalIndex(folderPath)` — no options object, no `dimension` param
- Must call `createIndex()` before first use, check with `isIndexCreated()`
- `queryItems(vector, query, topK, filter)` — 4 positional args, not an options object
- Filter format: `{ category: { $eq: "fruit" } }` — nested operator syntax
- Embeddings: `pipeline("feature-extraction", "sentence-transformers/all-MiniLM-L6-v2")` → 384-dim vectors
- Embedding helper: `embedder(text, { pooling: "mean", normalize: true })` → `Array.from(result.data)`
- Tested: 3 items indexed, query "red fruit" with category filter → correctly returned 2 fruit items, filtered out "Cars are vehicles"
- Scores: Apples=0.7830, Bananas=0.5188
-
- #### 3. keywords.ts — Keyword extraction from code files
-
- Installed `keyword-extractor` (v0.0.28) — CJS module, needed `createRequire(import.meta.url)` for ESM import
- **Iteration 1**: `return_chained_words: true` → way too aggressive, merged entire code lines into single "keywords"
- **Iteration 2**: `return_max_ngrams: 3` instead → still noisy, code syntax tokens (`{`, `}`, `const`, `await`) dominated results
- **Iteration 3**: Added code-aware preprocessing before extraction:
- Strip `//` from comments but keep comment text
- Remove code punctuation: `{}()[];=<>|&!+*/$@` etc.
- Remove JS keywords: `const`, `let`, `var`, `import`, `export`, `from`, `await`, `async`, `function`, `return`, `for`, `of`, `if`, `new`, `typeof`, `as`
- Collapse whitespace
- Post-filter: skip keywords < 3 chars or non-alphabetic, word-boundary regex for frequency count
- Final output on vectra.ts: clean results — `metadata(8x)`, `index(7x)`, `text(7x)`, `fruit(5x)`, `pipeline(3x)`, `huggingface(1x)`
- Tested without filters: confirmed `const(10x)`, `await(8x)`, `}(19x)` dominate — filters are necessary for code
-
- #### 4. keywords-compromise.ts — Compromise NLP extraction
-
- Installed `compromise` (v14.15.0) — ESM native, no createRequire needed
- Tested all extractors: `.topics()`, `.nouns()`, `.verbs()`, `.people()`, `.organizations()`
- **Result on code files**: poor. Nouns are code fragments (`'const index = new LocalIndex("./vectra-index");'`), zero topics, zero people
- On hf.ts: caught "Microsoft" as topic + organization from prompt string — works on embedded natural language
- **Conclusion**: compromise is designed for prose (articles, emails, chat), not source code
-
- #### 5. keywords-pipeline.ts — Full extraction pipeline
-
- Combined: read file → compromise (nouns/verbs/topics) → regex `\W+` → space → keyword-extractor → show
- Added LLM step (SmolLM2-135M): asked to extract/refine keywords
- **LLM result**: echoed input then looped on `await transformer.get(index)` — too small to understand the instruction
- LLM fallback logic: if output has <3 unique terms or contains a repetition pattern, fall back to raw keywords
- Pipeline works end-to-end but the LLM step is effectively a passthrough
-
- #### 6. vectra-keywords.ts — Combined indexing + synonym search
-
- Merged keyword extraction + vectra indexing into one script
- Indexed 5 files, tested synonym search:
- "fruit" → hf.ts (0.18), vectra.ts (0.17)
- "automobile vehicle transportation" → vectra.ts (0.18) — synonym for "cars/vehicles"
- "embedding model neural network" → vectra.ts (0.27) — semantic match
- Scores low because code noise (`const`, `await`) dilutes the keyword embeddings
-
- #### 7. xindex.ts — Final combined solution
-
- Created unified CLI: `xindex.ts index <files>` and `xindex.ts search <query>`
- Full index pipeline: compromise → regex → keyword-extractor → LLM refine → MiniLM embed → vectra store
- Query pipeline: input → keyword-extractor → embed → vectra search
- Added full payload logging at each step: [1] keywords, [2] LLM refined, [3] vector preview, [4] metadata
- Test results:
- "natural language processing" → keywords-pipeline.ts (0.56) — strongest match
- "automobile transportation" → vectra.ts (0.21) — synonym works
- "neural network deep learning" → vectra.ts (0.28)
-
- #### 8. Project setup — ~/project/xindex
-
- Created standalone TypeScript project at `/Users/slava/project/xindex`
- Moved all R&D files: xindex.ts, hf.ts, vectra.ts, keywords.ts, keywords-compromise.ts, keywords-pipeline.ts, vectra-keywords.ts
- package.json, tsconfig.json, bin/xindex entry point
- Git initialized
-
- #### Alternative keyword extraction libs (from user research)
-
- `textlens` — TF-IDF, 1-line API, fastest, 10k+ weekly downloads
- `node-keyword-extractor` — RAKE-like, 1-line API, very fast
- `compromise` — full NLP, 3 lines, fast (tested — poor on code)
- `natural` — TF-IDF tokenizer, 5 lines, fast
-
- #### Decisions & findings
-
- `"type": "module"` in package.json is required for all prototypes (top-level await)
- HuggingFace transformers JS works well for embeddings; text generation quality is limited by model size
- Vectra API docs/examples online are outdated — always check the actual `.d.ts` types
- Code keyword extraction needs domain-specific preprocessing regardless of library choice
- Compromise NLP is not suitable for code — only for natural language text
- SmolLM2-135M is too small for keyword refinement — needs 360M+ or an external API
- MiniLM-L6-v2 embeddings understand synonyms well enough for semantic code search
- The concept works: index codebase → query with natural language → get relevant files
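The code-aware preprocessing from iteration 3 above can be sketched as follows. This is illustrative, not the repo's actual keywords.ts: the punctuation regex and the exact keyword list are assumptions based on the log's description.

```typescript
// JS keywords to drop, per the log's iteration-3 list (illustrative).
const JS_KEYWORDS = new Set([
  "const", "let", "var", "import", "export", "from", "await", "async",
  "function", "return", "for", "of", "if", "new", "typeof", "as",
]);

// Code-aware preprocessing: strip // markers (keep comment text), remove code
// punctuation, collapse whitespace, then post-filter short/non-alphabetic
// tokens and JS keywords.
function preprocessCode(source: string): string {
  return source
    .replace(/\/\//g, " ")
    .replace(/[{}()[\];=<>|&!+*\/$@'"`,.:?-]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length >= 3)
    .filter((w) => /^[a-zA-Z]+$/.test(w))
    .filter((w) => !JS_KEYWORDS.has(w.toLowerCase()))
    .join(" ");
}
```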
@@ -1,102 +0,0 @@
- # Task: xindex — Local Semantic Code Search [COMPLETED 2026-04-10]
-
- ## Context
-
- Built a local semantic code search tool — index a codebase, query by keyword or meaning, get relevant files back. No cloud APIs, everything runs on-device.
-
- **Project**: `~/project/xindex`
-
- **Dependencies:** `@huggingface/transformers`, `vectra`, `compromise`, `keyword-extractor`, `tsx`
-
- ## Goal
-
- Index a codebase so that minimally meaningful text queries return relevant files/info about the project. Local-first, no cloud APIs.
-
- ## Final Architecture
-
- ```
- componets/
- ├── llm/
- │ ├── embed.ts — Embed({pooling, normalize}) → MiniLM-L6 384-dim
- │ └── queryLLM.ts — QueryLLM({maxTokens}) → SmolLM2-135M (kept aside, unused)
- ├── keywords/
- │ ├── extractKeywords.ts — ExtractKeywords() → compromise NLP (nouns/verbs/topics)
- │ ├── cleanUpKeywords.ts — CleanUpKeywords({maxNgrams, minLength}) → keyword-extractor + dedup
- │ └── refineKeywords.ts — RefineKeywords({queryLLM, cleanUpKeywords, prompt}) (kept aside, unused)
- ├── index/
- │ ├── vectraIndex.ts — VectraIndex(path) → LocalIndex init
- │ ├── indexContent.ts — IndexContent({embed, index}) → embed + upsert
- │ ├── getIndexStats.ts — GetIndexStats({index}) → {indexedAmount}
- │ ├── searchContentIndex.ts — SearchContentIndex({extractKeywords, cleanUpKeywords, embed, index})
- │ └── contentIndexDriver.ts — ContentIndexDriver({path, embed, extractKeywords, cleanUpKeywords})
- └── buildComponents.ts — wires everything, returns ready-to-use functions
-
- apps/
- ├── indexApp.ts — IndexApp({extractKeywords, cleanUpKeywords, indexContent})
- ├── searchApp.ts — SearchApp({searchContentIndex})
- ├── run.index.ts — CLI entry point for indexing
- └── run.search.ts — CLI entry point for search
-
- bin/
- ├── xindex-index — #!/usr/bin/env tsx → apps/run.index.ts
- └── xindex-search — #!/usr/bin/env tsx → apps/run.search.ts
- ```
-
- ## Final Pipeline
-
- ```
- INDEX (per file):
- file + filename
-
- ├─[1] extractKeywords ──→ compromise NLP (nouns, verbs, topics)
-
- ├─[2] cleanUpKeywords ──→ keyword-extractor + dedup + filter
-
- ├─[3] MiniLM-L6 ────────→ 384-dim embedding
-
- └─[4] vectra ────────────→ upsert { id, vector, metadata: { keywords, file } }
-
- SEARCH:
- user input
-
- ├─[1] extractKeywords ──→ compromise NLP
-
- ├─[2] cleanUpKeywords ──→ keyword-extractor + dedup + filter
-
- ├─[3] MiniLM-L6 ────────→ 384-dim embedding
-
- └─[4] vectra ────────────→ queryItems → ranked by cosine similarity
- ```
-
- ## Key Decisions
-
- **LLM refine step removed** — SmolLM2-135M was too small, acted as passthrough or generated garbage. Without it: 10x faster indexing, better accuracy (11/14 → 16/20 correct #1)
- **HOF component pattern** — all components are factory functions: `DoThing({deps}): IDoThing`. Export factory + type only, no default instances
- **Dependencies as destructured objects** — `DoThing({embed, index}: {embed: IEmbed, index: LocalIndex})`
- **Separate entry points** — `bin/xindex-index` and `bin/xindex-search` instead of one CLI with subcommands
- **ContentIndexDriver** — bundles index layer (indexContent, getIndexStats, searchContentIndex) behind one factory
-
- ## Final Test Results (41 files, 20 queries)
-
- | Metric | Value |
- |--------|-------|
- | Index time (41 files) | 1.07s (26ms/file) |
- | Search time | 0.71s (constant) |
- | Correct #1 | 16/20 (80%) |
- | Correct in top 3 | 19/20 (95%) |
- | Cross-domain isolation | Perfect |
-
- ## Resolved Review Items
-
- [x] Search loads unused LLM → removed LLM from pipeline entirely
- [x] `// --- Init ---` comment artifact → removed
- [ ] `BuildComponents` hardcodes index path → still hardcoded
- [ ] Stale `main` in package.json → still points to deleted xindex.ts
- [ ] `componets/` typo → deferred
-
- ## Open Questions (carried forward)
-
- What's the practical index size limit for vectra LocalIndex before it slows down?
- Is hybrid search (vector + BM25) in vectra good enough to skip a separate keyword index?
- Directory walking instead of explicit file list
- Chunking for large files — one embedding per file loses detail
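The "HOF component pattern" decision above can be sketched as follows. The names (`IndexContent`, `IEmbed`, the `store` dep) are illustrative stand-ins, not the repo's real components:

```typescript
// HOF component pattern: a factory function that takes its dependencies as a
// destructured object and returns the ready-to-use function.
type IEmbed = (text: string) => Promise<number[]>;
type IIndexContent = (id: string, text: string) => Promise<void>;

// Export the factory and its type only — no default instances.
function IndexContent({ embed, store }: { embed: IEmbed; store: Map<string, number[]> }): IIndexContent {
  return async (id, text) => {
    store.set(id, await embed(text)); // embed + upsert
  };
}
```

Wiring then happens in one place (the sketch's analogue of `buildComponents.ts`): `const indexContent = IndexContent({ embed, store })`.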
@@ -1,19 +0,0 @@
- ### 2026-04-10
-
- Created task from user note: move line clustering defaults (`threshold`, `minLines`, `maxDepth`) into `.xindex.json`.
- Ran scout via code search and xindex MCP search:
- confirmed hardcoded defaults in `componets/index/clusterLines.ts`
- confirmed config infra exists in `componets/config/xindexConfig.ts` + `componets/config/loadConfig.ts`
- confirmed `.xindex.json` already used for other settings
- Drafted initial plan with diagram + 3x3 steps.
- Updated from user clarifications:
- default threshold when missing is `0.7`
- threshold validation is strict `[0,1]` with fallback to default
- apply through shared wiring in this repo folder
- do not update `.xindex.json` file content now
- config shape finalized as flat keys: `clusterThreshold`, `clusterMinLines`, `clusterMaxDepth`
- Ran consistency pass and fixed task mismatches:
- removed old `0.75` default references
- replaced clamp/range language with strict validation wording
- removed docs/example update requirement that conflicted with user instruction
- Expanded task with Detailed Change Map + Acceptance Criteria for implementation readiness.
@@ -1,118 +0,0 @@
- # Task: Move ClusterLines defaults into .xindex.json
-
- ## Context
-
- User goal: move line-clustering configuration from hardcoded defaults to repo-level config in `.xindex.json`:
- `threshold = 0.70` (new default when missing)
- `minLines = 5`
- `maxDepth = 5`
- config key shape is flat:
- `clusterThreshold`
- `clusterMinLines`
- `clusterMaxDepth`
-
- Why: different repositories need different clustering behavior, so these values should be configurable per repo.
-
- Scout findings (`@xi` + code scan):
- `componets/index/clusterLines.ts` currently hardcodes defaults in the HOF signature:
- `threshold = 0.75`
- `minLines = 5`
- `maxDepth = 5`
- `.xindex.json` already exists and currently contains search/index config (`ignoreKeywords`, `ignoreFiles`, `maxSnippetLines`, `maxSnippetResults`).
- `componets/config/xindexConfig.ts` and `componets/config/loadConfig.ts` already provide optional config loading with defaults.
- `componets/buildComponents.ts` already loads `.xindex.json` and wires config into keyword cleanup, but does not yet pass clustering params to driver/components.
-
- Related active tasks:
- `task.2026-04-10-line-clustering.md`
- `task.2026-04-10-search-config.md`
-
- ## Goal
-
- Extend `.xindex.json` + config loading so line-clustering params are configurable per repository, and wire them into `ClusterLines` construction with strict threshold validation and the new default threshold `0.7` when config keys are absent.
-
- ## Diagram
-
- ```
- .xindex.json (optional)
- ┌──────────────────────────────────────────────┐
- │ existing: ignoreKeywords, ignoreFiles, ... │
- │ new: │
- │ clusterThreshold: number (default 0.7) │
- │ clusterMinLines: number (default 5) │
- │ clusterMaxDepth: number (default 5) │
- └───────────────────┬──────────────────────────┘
-
-
- LoadConfig -> IXindexConfig (defaults applied)
-
-
- BuildComponents
-
-
- ContentIndexDriver / ClusterLines factory
-
-
- clusterLines() uses repo-specific values
- ```
-
- ## Steps
-
- ### 1. Extend config schema (3x3)
- **1.1 Add fields to `IXindexConfig`** — add three clustering fields with explicit names and numeric types.
- **1.2 Parse + default in `LoadConfig`** — map new JSON keys to validated numbers with defaults `{clusterThreshold: 0.7, clusterMinLines: 5, clusterMaxDepth: 5}`.
- **1.3 Validate bounds** — `clusterThreshold` must be in `[0,1]` (otherwise fall back to the default), and line/depth values must be finite integers with safe lower bounds.
-
- ### 2. Wire config into clustering (3x3)
- **2.1 Thread config through builder/driver** — ensure the clustering factory gets config values from the `BuildComponents` path.
- **2.2 Update `ClusterLines` construction** — pass config values from driver wiring instead of relying on hardcoded constructor defaults.
- **2.3 Preserve backward compatibility** — missing `.xindex.json` or missing keys should still produce stable clustering behavior via loader defaults.
-
- ### 3. Validate behavior and docs (3x3)
- **3.1 Runtime sanity checks** — run the index/search flow to confirm no regressions and that loaded config values are honored.
- **3.2 Surface scope in repo entry points** — ensure clustering config is available from the common build path used by apps in this folder.
- **3.3 Add/update tests** — cover the default path (no config keys), the invalid threshold path (fall back to 0.7), and the override path (custom values).
-
- ## Detailed Change Map
-
- `componets/config/xindexConfig.ts`
- Add:
- `clusterThreshold: number`
- `clusterMinLines: number`
- `clusterMaxDepth: number`
- `componets/config/loadConfig.ts`
- Extend `DEFAULTS` with clustering keys (`0.7`, `5`, `5`).
- Parse new keys from `.xindex.json`.
- Apply strict validation:
- threshold valid only if finite number and `0 <= v <= 1`, otherwise default
- min/depth valid only if finite number, integerized, and `>= 1`, otherwise default
- `componets/index/contentIndexDriver.ts`
- Accept `config` in factory deps.
- Construct `ClusterLines({... , threshold: config.clusterThreshold, minLines: config.clusterMinLines, maxDepth: config.clusterMaxDepth})`.
- `componets/buildComponents.ts`
- Pass loaded `config` into `ContentIndexDriver`.
- Keep returning `config` so consumers in this folder can use consistent runtime config.
- `apps/run.*.ts` and `apps/mcpApp.ts` (as needed)
- No direct clustering logic, but rely on `BuildComponents` so new config applies everywhere in this folder through shared wiring.
-
- ## Acceptance Criteria
-
- `.xindex.json` may define clustering keys and they influence `ClusterLines` without code changes.
- Missing clustering keys use defaults: `clusterThreshold=0.7`, `clusterMinLines=5`, `clusterMaxDepth=5`.
- Invalid threshold values (e.g., `-0.1`, `1.2`, `"0.7"`, `null`) fall back to `0.7`.
- The indexing pipeline compiles and runs with unchanged public entry points.
- No update to `.xindex.json` file contents is required in this task.
-
- ## Decisions
-
- Default threshold changed from the old runtime value to `0.7`.
- Threshold validation is strict `[0,1]` with fallback to the default (no clamping).
- Scope is this repo folder via shared component wiring, not only one direct caller.
- Do not modify the current `.xindex.json` as part of this task.
- Config shape is flat keys in `.xindex.json`:
- `clusterThreshold`
- `clusterMinLines`
- `clusterMaxDepth`
-
- ## Open Questions
-
- Compatibility strategy: accept legacy names (if any) or only new canonical names?
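The strict validation described in step 1.3 and the Detailed Change Map can be sketched as follows. A minimal sketch, assuming the defaults from the task; the helper names `validThreshold` and `validCount` are illustrative, not the repo's actual `loadConfig.ts` code:

```typescript
// Defaults per the task: clusterThreshold=0.7, clusterMinLines=5, clusterMaxDepth=5.
const DEFAULTS = { clusterThreshold: 0.7, clusterMinLines: 5, clusterMaxDepth: 5 };

// Threshold is valid only if it is a finite number in [0, 1]; otherwise fall
// back to the default (strict validation, no clamping).
function validThreshold(v: unknown): number {
  return typeof v === "number" && Number.isFinite(v) && v >= 0 && v <= 1
    ? v
    : DEFAULTS.clusterThreshold;
}

// Line/depth values are valid only if finite, integerized, and >= 1.
function validCount(v: unknown, fallback: number): number {
  return typeof v === "number" && Number.isFinite(v) && Math.trunc(v) >= 1
    ? Math.trunc(v)
    : fallback;
}
```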
@@ -1,8 +0,0 @@
- ### 2026-04-10 — Task created
-
- Scouted: streamx has `from()`, `map`, `flat`, `pipe`, `run` — full async stream toolkit
- Current IndexApp is a simple for-loop over explicit file paths
- Node `fs.readdir({recursive: true})` exists but no gitignore support
- `.gitignore` already has sensible rules for the project
- User wants: files or dirs as input → recursive walk → stream → index
- User preference: use streamx from packages/ for the stream pipeline
@@ -1,92 +0,0 @@
- # Task: Directory-based Indexing with Async Streams
-
- ## Context
-
- Current `IndexApp` takes an explicit file list — no directory scanning. User wants to pass files **or** dirs, recursively scan dirs, and index everything as a stream.
-
- **Current pipeline** (`apps/indexApp.ts`):
- ```
- files[] → for each → readFile → extractKeywords → cleanUp → indexContent
- ```
-
- **streamx available** (`packages/streamx/`):
- `from(iterable)` — wraps async/sync iterable into StreamX
- `of()` → `.pipe()` for chaining
- Operators: `map`, `filter`, `flat`, `flatMap`, `batch`, `buffer`, `merge`, `scale`, `reduce`, `tap`
- `run()` — consumes stream, returns last value
-
- **Gitignore: `ignore` npm package** (used by ESLint, Prettier):
- ```ts
- import ignore from "ignore";
- const ig = ignore();
- ig.add(await readFile(".gitignore", "utf8")); // load rules
- ig.ignores("node_modules/foo.js"); // true
- ig.filter(["src/index.ts", "dist/out.js"]); // ["src/index.ts"]
- ```
- Paths must be **relative** to the .gitignore location
- `.add()` is stackable — call it per nested .gitignore
- Handles negation (`!`), globs, `**/`, comments
-
- **Decisions:**
- Paths are **relative to working directory** (children of cwd)
- Sequential indexing now; `scale()` for parallelism later
- Use the `ignore` npm package for .gitignore parsing
- Default: if no .gitignore in a folder, skip `.*` dirs (`.git`, `.idea`, etc.)
-
- ## Goal
-
- Accept files or directories as input, recursively walk directories (respecting .gitignore), and index all discovered files as an async stream using streamx.
-
- ## Diagram
-
- ```
- INPUT: ["file.ts", "src/", "lib/"]
-
- ├── file? ──→ yield relative path
-
- └── dir? ──→ walk recursively
-
- ├── load .gitignore (if exists)
- │ (else: default ignore .* dirs)
-
- ├── skip ignored paths
-
- └── yield each file (relative)
-
-
- from(walkFiles) ──→ streamx pipeline
-
- ├── map: readFile
- ├── map: extractKeywords + cleanUp
- └── tap: indexContent
-
-
- {indexed count}
- ```
-
- ## Steps
-
- ### 1. Directory Walker
- Create `componets/walkFiles.ts` — HOF `WalkFiles()` returning an async generator that yields relative file paths
- Detect file vs dir via `fs.stat`, yield files directly, recurse into dirs
- Use `node:fs/promises` `opendir` for streaming directory reads
-
- ### 2. Gitignore Filtering
- Install the `ignore` package — `npm install ignore`
- Load `.gitignore` per directory during the walk; stack rules with the parent via `ig.add()`
- Default rule when no `.gitignore`: skip `.*` dirs (`.git`, `.idea`, `.DS_Store`, etc.)
- Check `ig.ignores(relativePath)` before yielding or descending into subdirs
-
- ### 3. Stream Pipeline
- Wire the walker into streamx: `from(walkFiles(inputs))` → `pipe(map(indexFile))` → `run()`
- Update the `IndexApp` HOF to accept `string[]` (mix of files and dirs)
- `run.index.ts` passes argv as-is — no change needed
-
- ## Dependencies
-
- `ignore` — gitignore pattern matching (new dep)
- `packages/streamx` — async stream operators (already in repo)
-
- Sources:
- [ignore npm package](https://www.npmjs.com/package/ignore)
- [node-ignore GitHub](https://github.com/kaelzhang/node-ignore)
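The walker from step 1 could look roughly like this, using only node builtins. A sketch, not the repo's `walkFiles.ts`: the real task stacks `ignore` rules per directory, while this version applies only the default dot-dir rule described in the Decisions:

```typescript
import { opendir, stat } from "node:fs/promises";
import { join, relative } from "node:path";

// WalkFiles HOF: returns an async generator yielding file paths relative to cwd.
// Default rule when no .gitignore is loaded: skip dot-directories (.git, .idea, ...).
function WalkFiles() {
  async function* walkDir(dir: string): AsyncGenerator<string> {
    for await (const entry of await opendir(dir)) {
      const full = join(dir, entry.name);
      if (entry.isDirectory()) {
        if (!entry.name.startsWith(".")) yield* walkDir(full); // recurse, skip dot-dirs
      } else if (entry.isFile()) {
        yield relative(process.cwd(), full);
      }
    }
  }
  return async function* (inputs: string[]): AsyncGenerator<string> {
    for (const input of inputs) {
      const s = await stat(input); // detect file vs dir
      if (s.isDirectory()) yield* walkDir(input);
      else yield relative(process.cwd(), input);
    }
  };
}
```

The generator plugs straight into the step-3 pipeline as `from(walkFiles(inputs))`.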
@@ -1,50 +0,0 @@
1
- # Log: Line-level clustering
2
-
3
- ### 2026-04-10
4
-
5
- - Task created from user notes about recursive bisection clustering
6
- - Scouted codebase: current granularity is 1 vector per file, `id=filePath`, metadata in separate object store (MD5-keyed JSON)
- - Confirmed in-memory Vectra works via `VirtualFileStorage` (test-vectra-memory.ts) — no disk I/O, cosine similarity queries work with small dims
- - Key insight: Vectra `metadata: {}` is always empty in current code — all real metadata lives in object store. This pattern can extend to line-level clusters
- - Identified integration points: indexContent.ts (upsert), searchContentIndex.ts (query), indexMeta.ts (type), removeContent.ts (delete)
- - Open: similarity threshold calibration, keyword quality for code, cluster deletion strategy on re-index
- - Clarification round: user confirmed embedding cosine (not Jaccard) — meaning matters more than keyword overlap
- - Threshold: user says 0.55–0.70 range, will start at 0.6
- - Min cluster: 3–5 lines, default 5
- - ID format: `file.ts:12-45` works as-is for both Vectra and object store
- - Cleanup strategy: object store entry per file tracks all cluster IDs, delete all on re-index
- Researched NPM packages: semantic-chunking (jparkerweb), semantic-chunker (johnhenry, BYOE), LangChain RecursiveCharacterTextSplitter. All target prose/sentences, not code lines. Custom bisection with our embed pipeline is a better fit.
- - NAACL 2025: fixed-size chunks match semantic chunking for prose RAG, but code has mixed concerns per file where semantic splitting should win
- - Consistency check: expanded 3x3 → 6x3 steps with concrete file paths and implementation details
- - Fixed: diagram now shows full flow from handleFileEvent through clusterLines to persistent store + manifest
- - Fixed: IIndexMeta type is `{keywords, id}` not `{keywords, file}` — corrected in Context
- - Found: main change site is `indexFileContent.ts` (not `indexContent.ts`), and `handleFileEvent.ts` for cleanup
- - Found: cosine similarity doesn't need in-memory Vectra — embed returns normalized vectors, direct dot product suffices
- - Found: object store dual-use issue — need to store both cluster metas and file manifests with different shapes. Added to Open Questions.
- - Added Edge Cases section: small files, empty files, uniform content, legacy data
- - Consistency check #2: fixed Goal (removed "in-memory Vectra" — cosine is direct dot product), removed stale whole-file keyword step from diagram, added tagged union pattern for manifest, clarified RemoveFileContent as separate HOF, added buildComponents wiring step
- - Decision: cosine via direct dot product (Option A). 3-line helper, no Vectra needed for bisection — comparing exactly 2 vectors, not searching N. Fallback to in-memory Vectra (Option B) if needed.
- - **Key architecture correction**: existing file-level indexing must stay intact. Clustering is an EXTENSION, not a replacement. Both file-level and cluster-level entries coexist.
- - If file is cohesive (1 cluster = whole file) → skip clustering entirely, file-level entry is enough
- - Resolved object store dual-use: three separate keys — `filePath` (file meta), `filePath:1-10` (cluster meta), `filePath::manifest` (cluster ID list). Widen `IObjectStore` types to `IStoreEntry = IIndexMeta | IFileManifest`.
- - Traced full dependency wiring: `IndexFileContent` is constructed in run.*.ts (NOT inside contentIndexDriver). Plan: move construction inside driver since it has all deps. Simplifies callers.
- - `handleFileEvent` flow on re-index: `removeFileContent(path)` first (cleans file entry + clusters + manifest), then `indexFileContent(path, text)` (creates file entry + clusters + manifest)
- - Expanded steps from 6x3 → 7x(2-5) with concrete file paths, signatures, and implementation notes
- - Updated diagram to show full bidirectional flow: removal path + indexing path + all three store key types
- - Consistency check #3:
- - CRITICAL: fixed file paths — `indexFileContent.ts` and `handleFileEvent.ts` are in `componets/index/`, not `componets/`. Same for `removeFileContent.ts` (new file).
- - CRITICAL: found `indexApp.ts` gap — bulk indexer calls `indexFileContent` directly via stream (no `HandleFileEvent`). Old clusters would linger on re-index. Added `removeFileContent` dep to `IndexApp`, call cleanup before indexFileContent in stream callback.
- - Fixed edge case: empty files still get file-level entry (existing pipeline), clustering just returns `[]`.
- - Added `indexApp.ts`, `buildComponents.ts`, `run.*.ts` to Context key files.
- - Added step 6.3 for indexApp.ts and step 6.4 for import path cleanup.
- - Synced plan file with same fixes.
- - **Implementation complete** — all 7 steps done:
- - Step 1: created `componets/index/clusterLines.ts` — cosine helper, ILineCluster, ClusterLines HOF
- - Step 2: updated `indexMeta.ts` (IType tagged union: IIndexMeta, IClusterMeta, IFileManifest, IStoreEntry), `objectStore.ts` (widened types)
- - Step 3: created `componets/index/removeFileContent.ts` — manifest-aware cleanup
- - Step 4: updated `indexFileContent.ts` — kept file-level index, added clustering extension
- - Step 5: updated `contentIndexDriver.ts` (wires ClusterLines, IndexFileContent, RemoveFileContent inside), `buildComponents.ts` (returns new components)
- - Step 6: simplified `run.index.ts`, `run.watch.ts`, `run.mcp.ts` (removed manual IndexFileContent construction), updated `handleFileEvent.ts` (removeFileContent), updated `indexApp.ts` (added removeFileContent dep)
- - Also updated `indexContent.ts` (widened meta param) and `searchContentIndex.ts` (narrow by type)
- - Reset + re-index: 113 files → 211 indexed items (98 cluster entries created)
- - Search verified: cluster hits showing as `file.ts:fromLine-toLine` (e.g. `rnd/test-vectra-memory.ts:9-12`)
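The bisection the log above converges on — direct dot-product cosine (Option A), threshold 0.6, minimum cluster size 5, IDs rendered as `file.ts:fromLine-toLine` — could be sketched like this. It is a hypothetical shape; the actual `componets/index/clusterLines.ts` may differ, and the injected `embed` function stands in for the real embedding pipeline.

```typescript
// Illustrative recursive-bisection clustering: split a block of lines in
// half; if the halves' embeddings agree (cosine >= threshold), keep the
// block whole, otherwise recurse into each half.
export interface ILineCluster {
  fromLine: number; // 1-based, inclusive — renders as "file.ts:from-to"
  toLine: number;
  text: string;
}

// Embedding function is injected; vectors are assumed normalized.
type Embed = (text: string) => Promise<number[]>;

// For normalized vectors, cosine similarity is a plain dot product (Option A).
function cosine(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

export function ClusterLines(embed: Embed, threshold = 0.6, minLines = 5) {
  async function bisect(lines: string[], offset: number): Promise<ILineCluster[]> {
    // Too small to split without producing sub-minimum halves: one cluster.
    if (lines.length <= minLines * 2) {
      return [{ fromLine: offset, toLine: offset + lines.length - 1, text: lines.join("\n") }];
    }
    const mid = Math.floor(lines.length / 2);
    const top = lines.slice(0, mid);
    const bottom = lines.slice(mid);
    const [a, b] = await Promise.all([embed(top.join("\n")), embed(bottom.join("\n"))]);
    if (cosine(a, b) >= threshold) {
      // Halves are semantically close: the block is one cohesive cluster.
      return [{ fromLine: offset, toLine: offset + lines.length - 1, text: lines.join("\n") }];
    }
    return [...(await bisect(top, offset)), ...(await bisect(bottom, offset + mid))];
  }
  return async (text: string): Promise<ILineCluster[]> => {
    const lines = text.split("\n");
    if (lines.length < minLines) return []; // small/empty files: no clusters
    return bisect(lines, 1);
  };
}
```

Note that a cohesive file comes back as a single cluster spanning every line, which matches the log's rule: when one cluster equals the whole file, skip clustering and keep only the file-level entry.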