xindex 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. package/.ai/research/.gitkeep +0 -0
  2. package/.ai/task/.gitkeep +0 -0
  3. package/README.md +54 -89
  4. package/apps/run.search.ts +0 -3
  5. package/componets/index/formatSearchResults.ts +2 -2
  6. package/media/MEDIUM.md +139 -0
  7. package/media/SOCIAL.md +102 -0
  8. package/package.json +1 -1
  9. package/.ai/research/2026-04-10-file-watching.md +0 -79
  10. package/.ai/research/2026-04-10-mcp-output-format.md +0 -129
  11. package/.ai/task/INDEX.md +0 -12
  12. package/.ai/task/done/INDEX.md +0 -3
  13. package/.ai/task/done/task.2026-04-09-local-ai-research-protos.log.md +0 -98
  14. package/.ai/task/done/task.2026-04-09-local-ai-research-protos.md +0 -102
  15. package/.ai/task/task.2026-04-10-cluster-config.log.md +0 -19
  16. package/.ai/task/task.2026-04-10-cluster-config.md +0 -118
  17. package/.ai/task/task.2026-04-10-dir-indexing.log.md +0 -8
  18. package/.ai/task/task.2026-04-10-dir-indexing.md +0 -92
  19. package/.ai/task/task.2026-04-10-line-clustering.log.md +0 -50
  20. package/.ai/task/task.2026-04-10-line-clustering.md +0 -176
  21. package/.ai/task/task.2026-04-10-object-store.log.md +0 -7
  22. package/.ai/task/task.2026-04-10-object-store.md +0 -81
  23. package/.ai/task/task.2026-04-10-search-config.log.md +0 -46
  24. package/.ai/task/task.2026-04-10-search-config.md +0 -274
  25. package/.ai/task/task.2026-04-10-watch-indexing.log.md +0 -32
  26. package/.ai/task/task.2026-04-10-watch-indexing.md +0 -101
  27. package/.ai/task/task.2026-04-10-xindex-mcp.log.md +0 -5
  28. package/.ai/task/task.2026-04-10-xindex-mcp.md +0 -92
  29. package/.ai/task/task.2026-04-10-xindex-mcp.report.md +0 -113
@@ -1,129 +0,0 @@
- # MCP Tool Output Format for LLM Consumption
-
- **Question**: What output format should our xindex_search MCP tool use to return search results to an LLM?
-
- **Current state**: `JSON.stringify(results, null, 2)` — pretty-printed JSON with score, id, meta.keywords, meta.file (id and meta.file are redundant).
-
- ---
-
- ## Findings
-
- ### 1. Token efficiency benchmarks (ImprovingAgents, Oct 2025)
-
- **Nested data** — 1,000 questions, 3 models, 4 formats:
-
- | Format | Tokens | GPT-5 Nano | Gemini 2.5 Flash Lite |
- |----------|---------|------------|----------------------|
- | Markdown | 38,357 | 54.3% | 48.2% |
- | YAML | 42,477 | 62.1% | 51.9% |
- | JSON | 57,933 | 50.3% | 43.1% |
- | XML | 68,804 | 44.4% | 33.8% |
-
- Markdown uses **34% fewer tokens** than JSON. YAML has better accuracy but uses more tokens.
-
- **Flat/tabular data** — 11 formats, 1,000 queries, GPT-4.1-nano:
-
- | Format | Accuracy | Tokens | Efficiency |
- |----------------|----------|---------|------------|
- | Markdown-KV | 60.7% | 52,104 | Best accuracy |
- | Markdown Table | 51.9% | 25,140 | Best ratio |
- | JSON | 52.3% | 66,396 | Mediocre |
- | CSV | 44.3% | 19,524 | Cheapest but worst |
-
- For flat data (which our search results are), **Markdown-KV** gives the best LLM comprehension. A numbered list with `key: value` pairs is effectively Markdown-KV.
-
- Sources: [Nested formats](https://www.improvingagents.com/blog/best-nested-data-format/), [Table formats](https://www.improvingagents.com/blog/best-input-data-format-for-llms/)
-
- ### 2. MCP spec guidance (June 2025)
-
- `content` (TextContent) = what the LLM reads
- `structuredContent` = machine-to-machine, optional
- Spec's own example uses **plain text**: `"Current weather in New York:\nTemperature: 72°F\nConditions: Partly cloudy"`
- If `outputSchema` is defined, the server SHOULD return both `structuredContent` AND the serialized JSON in TextContent for backwards compatibility
-
- The spec explicitly shows plain text as the standard tool result format for LLM consumption.
-
- Source: [MCP Tools Spec](https://modelcontextprotocol.io/specification/2025-06-18/server/tools)
-
- ### 3. What popular MCP servers do
-
- | Server | Output format |
- |-------------|--------------|
- | Perplexity | AI-synthesized text + citation URLs |
- | Context7 | Plain text documentation snippets |
- | markdownify | Markdown (entire category exists for this) |
- | Elasticsearch | JSON (machine-oriented) |
-
- LLM-facing servers use text/markdown. Only machine-oriented servers use JSON.
-
- ### 4. JSON specifically degrades LLM reasoning
-
- Aider benchmarks: JSON wrapping reduces code reasoning quality by 10-15% ([source](https://aider.chat/2024/08/14/code-in-json.html))
- arXiv paper: frontier models top out at ~77% accuracy on JSON processing tasks ([source](https://arxiv.org/html/2510.15955v1))
- OpenAI community: Markdown is 15% more token-efficient than JSON ([source](https://community.openai.com/t/markdown-is-15-more-token-efficient-than-json/841742))
-
- ### 5. TOON format (Nov 2025) — not recommended
-
- New token-optimized format. Mixed results: 73.9% on flat retrieval but **last place** (43.1%) on nested data. Immature ecosystem, no MCP support. Not applicable here.
-
- Source: [TOON benchmarks](https://www.improvingagents.com/blog/toon-benchmarks/)
-
- ### 6. Workato design guidelines
-
- Return only necessary fields — avoid sending 200+ fields when 3 suffice
- Preprocess/summarize large content before returning to LLM
- Consider token efficiency — "excessive data can overwhelm the AI agent"
-
- Source: [Workato MCP Tool Design](https://docs.workato.com/en/mcp/mcp-server-tool-design.html)
-
- ---
-
- ## Analysis
-
- Our search results are **flat data** with 3 fields per result (score, file path, keywords). This is the simplest case:
-
- | Approach | Tokens/result | LLM quality | Fit |
- |----------|--------------|-------------|-----|
- | Pretty JSON (current) | ~55 | Worst — syntax overhead | Bad |
- | Compact JSON | ~22 | OK but cryptic keys | Meh |
- | Markdown numbered list | ~12 | Best — Markdown-KV pattern | Best |
- | TSV | ~15 | OK but less natural | OK |
-
- The markdown numbered list matches the **Markdown-KV** pattern that scored highest (60.7%) in the flat data benchmarks. It also uses **77% fewer tokens** than the current JSON output.
-
- Additional advantages:
- File path is visually prominent (it's what the LLM acts on next)
- Score at 2 decimals is a sufficient ranking signal
- Keywords give semantic context without opening the file
- Zero structural noise (no braces, brackets, quotes, commas)
- Matches how Perplexity/Context7 format their responses
-
- No significant trade-offs: we don't need machine-parseability (the consumer is always an LLM), and there's no nested data to worry about.
-
- ---
-
- ## Recommendation
-
- **Switch to a markdown numbered list.**
-
- ```
- Search: "authentication flow" — 3 result(s)
-
- 1. src/components/auth.ts (0.87) — authentication, login, session, token
- 2. src/middleware/jwt.ts (0.81) — jwt, token, verify, middleware
- 3. src/routes/login.ts (0.74) — login, form, credentials, redirect
- ```
-
- Implementation in `mcpApp.ts`:
- ```ts
- const header = `Search: "${query}" — ${results.length} result(s)\n\n`;
- const lines = results.map((r, i) =>
- `${i + 1}. ${r.id} (${r.score.toFixed(2)}) — ${r.meta.keywords ?? ""}`
- );
- const text = header + lines.join("\n");
- return {content: [{type: "text" as const, text}]};
- ```
-
- Empty case: `No results for "${query}"` — avoids confusing the model with an empty list.
-
- **Future consideration**: Add `outputSchema` + `structuredContent` when clients start using it, but keep TextContent as the primary format for LLM consumption.
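The recommendation and its empty case can be combined into one formatter. A minimal sketch, assuming the result shape described above (`id`, `score`, `meta.keywords`); the function name `formatSearchResults` is illustrative:

```typescript
// Shape of a single search result as described in the document (illustrative).
interface SearchResult {
  id: string; // relative file path
  score: number; // cosine similarity
  meta: { keywords?: string };
}

// Format results as a Markdown numbered list; plain sentence for the empty case.
function formatSearchResults(query: string, results: SearchResult[]): string {
  if (results.length === 0) return `No results for "${query}"`;
  const header = `Search: "${query}" — ${results.length} result(s)\n\n`;
  const lines = results.map(
    (r, i) => `${i + 1}. ${r.id} (${r.score.toFixed(2)}) — ${r.meta.keywords ?? ""}`
  );
  return header + lines.join("\n");
}
```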
package/.ai/task/INDEX.md DELETED
@@ -1,12 +0,0 @@
- # Tasks
-
- [xindex-mcp — MCP Server for Semantic Code Search](task.2026-04-10-xindex-mcp.md) — wrap xindex as MCP server so Claude Code can search codebase
- [Directory-based Indexing with Async Streams](task.2026-04-10-dir-indexing.md) — accept files/dirs, recursive walk with .gitignore, index via streamx pipeline
- [xindex-watch — Continuous Indexing](task.2026-04-10-watch-indexing.md) — new entry point: index all + watch for changes continuously via merged stream
- [Object Store — Separate Meta from Vectra](task.2026-04-10-object-store.md) — store meta as JSON files in .xindex/objects/, vectra keeps only vectors
- [Line-level Clustering](task.2026-04-10-line-clustering.md) — recursive bisection to split files into semantic blocks, index as file:fromLine-toLine
-
- [Search Config — Keyword Ignore & Inline Snippets](task.2026-04-10-search-config.md) — `.xindex.json` config for ignoring noisy keywords + inlining small code clusters in results
- [Cluster Config — Move ClusterLines defaults to .xindex.json](task.2026-04-10-cluster-config.md) — repo-level clustering params (`threshold`, `minLines`, `maxDepth`) instead of hardcoded defaults
-
- See [done/INDEX.md](done/INDEX.md) for completed tasks.
@@ -1,3 +0,0 @@
- # Done Tasks
-
- [xindex — Local Semantic Code Search](task.2026-04-09-local-ai-research-protos.md) — R&D prototyping → HOF refactoring → working semantic search tool (completed 2026-04-10)
@@ -1,98 +0,0 @@
- ### 2026-04-09 — Session log
-
- #### 1. hf.ts — Local text generation
-
- Installed `@huggingface/transformers` (v4.0.1)
- First run failed: top-level `await` not supported in CJS → fixed by adding `"type": "module"` to package.json
- Model: `HuggingFaceTB/SmolLM2-135M-Instruct` — downloads ONNX weights on first run, cached after
- API: `pipeline("text-generation", model)` → pass chat messages array → `output[0].generated_text.at(-1).content`
- Output quality: basic — responded "Hello, my name is [Your Name]." to "Write a one-line hello"
-
- #### 2. vectra.ts — Local vector search
-
- Installed `vectra` (v0.14.0)
- **API gotchas** — online examples use outdated API:
- `VectraIndex` → actually `LocalIndex`
- Constructor: `new LocalIndex(folderPath)` — no options object, no `dimension` param
- Must call `createIndex()` before first use, check with `isIndexCreated()`
- `queryItems(vector, query, topK, filter)` — 4 positional args, not an options object
- Filter format: `{ category: { $eq: "fruit" } }` — nested operator syntax
- Embeddings: `pipeline("feature-extraction", "sentence-transformers/all-MiniLM-L6-v2")` → 384-dim vectors
- Embedding helper: `embedder(text, { pooling: "mean", normalize: true })` → `Array.from(result.data)`
- Tested: 3 items indexed, query "red fruit" with category filter → correctly returned 2 fruit items, filtered out "Cars are vehicles"
- Scores: Apples=0.7830, Bananas=0.5188
-
- #### 3. keywords.ts — Keyword extraction from code files
-
- Installed `keyword-extractor` (v0.0.28) — CJS module, needed `createRequire(import.meta.url)` for ESM import
- **Iteration 1**: `return_chained_words: true` → way too aggressive, merged entire code lines into single "keywords"
- **Iteration 2**: `return_max_ngrams: 3` instead → still noisy, code syntax tokens (`{`, `}`, `const`, `await`) dominated results
- **Iteration 3**: Added code-aware preprocessing before extraction:
- Strip `//` from comments but keep comment text
- Remove code punctuation: `{}()[];=<>|&!+*/$@` etc.
- Remove JS keywords: `const`, `let`, `var`, `import`, `export`, `from`, `await`, `async`, `function`, `return`, `for`, `of`, `if`, `new`, `typeof`, `as`
- Collapse whitespace
- Post-filter: skip keywords < 3 chars or non-alphabetic, word-boundary regex for frequency count
- Final output on vectra.ts: clean results — `metadata(8x)`, `index(7x)`, `text(7x)`, `fruit(5x)`, `pipeline(3x)`, `huggingface(1x)`
- Tested without filters: confirmed `const(10x)`, `await(8x)`, `}(19x)` dominate — filters are necessary for code
-
- #### 4. keywords-compromise.ts — Compromise NLP extraction
-
- Installed `compromise` (v14.15.0) — ESM native, no createRequire needed
- Tested all extractors: `.topics()`, `.nouns()`, `.verbs()`, `.people()`, `.organizations()`
- **Result on code files**: poor. Nouns are code fragments (`'const index = new LocalIndex("./vectra-index");'`), zero topics, zero people
- On hf.ts: caught "Microsoft" as topic + organization from prompt string — works on embedded natural language
- **Conclusion**: compromise is designed for prose (articles, emails, chat), not source code
-
- #### 5. keywords-pipeline.ts — Full extraction pipeline
-
- Combined: read file → compromise (nouns/verbs/topics) → regex `\W+` → space → keyword-extractor → show
- Added LLM step (SmolLM2-135M): asked to extract/refine keywords
- **LLM result**: echoed input then looped on `await transformer.get(index)` — too small to understand the instruction
- LLM fallback logic: if output has <3 unique terms or contains a repetition pattern, fall back to raw keywords
- Pipeline works end-to-end but the LLM step is effectively a passthrough
-
- #### 6. vectra-keywords.ts — Combined indexing + synonym search
-
- Merged keyword extraction + vectra indexing into one script
- Indexed 5 files, tested synonym search:
- "fruit" → hf.ts (0.18), vectra.ts (0.17)
- "automobile vehicle transportation" → vectra.ts (0.18) — synonym for "cars/vehicles"
- "embedding model neural network" → vectra.ts (0.27) — semantic match
- Scores low because code noise (`const`, `await`) dilutes the keyword embeddings
-
- #### 7. xindex.ts — Final combined solution
-
- Created unified CLI: `xindex.ts index <files>` and `xindex.ts search <query>`
- Full index pipeline: compromise → regex → keyword-extractor → LLM refine → MiniLM embed → vectra store
- Query pipeline: input → keyword-extractor → embed → vectra search
- Added full payload logging at each step: [1] keywords, [2] LLM refined, [3] vector preview, [4] metadata
- Test results:
- "natural language processing" → keywords-pipeline.ts (0.56) — strongest match
- "automobile transportation" → vectra.ts (0.21) — synonym works
- "neural network deep learning" → vectra.ts (0.28)
-
- #### 8. Project setup — ~/project/xindex
-
- Created standalone TypeScript project at `/Users/slava/project/xindex`
- Moved all R&D files: xindex.ts, hf.ts, vectra.ts, keywords.ts, keywords-compromise.ts, keywords-pipeline.ts, vectra-keywords.ts
- package.json, tsconfig.json, bin/xindex entry point
- Git initialized
-
- #### Alternative keyword extraction libs (from user research)
-
- `textlens` — TF-IDF, 1-line API, fastest, 10k+ weekly downloads
- `node-keyword-extractor` — RAKE-like, 1-line API, very fast
- `compromise` — full NLP, 3 lines, fast (tested — poor on code)
- `natural` — TF-IDF tokenizer, 5 lines, fast
-
- #### Decisions & findings
-
- `"type": "module"` in package.json is required for all prototypes (top-level await)
- HuggingFace transformers JS works well for embeddings; text generation quality is limited by model size
- Vectra API docs/examples online are outdated — always check the actual `.d.ts` types
- Code keyword extraction needs domain-specific preprocessing regardless of library choice
- Compromise NLP is not suitable for code — only for natural language text
- SmolLM2-135M is too small for keyword refinement — needs 360M+ or an external API
- MiniLM-L6-v2 embeddings understand synonyms well enough for semantic code search
- The concept works: index codebase → query with natural language → get relevant files
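The code-aware preprocessing from iteration 3 above can be sketched as follows. This is illustrative, not the repo's actual keywords.ts: the punctuation regex and the exact keyword list are assumptions based on the log's description.

```typescript
// JS keywords to drop, per the log's iteration-3 list (illustrative).
const JS_KEYWORDS = new Set([
  "const", "let", "var", "import", "export", "from", "await", "async",
  "function", "return", "for", "of", "if", "new", "typeof", "as",
]);

// Code-aware preprocessing: strip // markers (keep comment text), remove code
// punctuation, collapse whitespace, then post-filter short/non-alphabetic
// tokens and JS keywords.
function preprocessCode(source: string): string {
  return source
    .replace(/\/\//g, " ")
    .replace(/[{}()[\];=<>|&!+*\/$@'"`,.:?-]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length >= 3)
    .filter((w) => /^[a-zA-Z]+$/.test(w))
    .filter((w) => !JS_KEYWORDS.has(w.toLowerCase()))
    .join(" ");
}
```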
@@ -1,102 +0,0 @@
- # Task: xindex — Local Semantic Code Search [COMPLETED 2026-04-10]
-
- ## Context
-
- Built a local semantic code search tool — index a codebase, query by keyword or meaning, get relevant files back. No cloud APIs, everything runs on-device.
-
- **Project**: `~/project/xindex`
-
- **Dependencies:** `@huggingface/transformers`, `vectra`, `compromise`, `keyword-extractor`, `tsx`
-
- ## Goal
-
- Index a codebase so that minimally meaningful text queries return relevant files/info about the project. Local-first, no cloud APIs.
-
- ## Final Architecture
-
- ```
- componets/
- ├── llm/
- │ ├── embed.ts — Embed({pooling, normalize}) → MiniLM-L6 384-dim
- │ └── queryLLM.ts — QueryLLM({maxTokens}) → SmolLM2-135M (kept aside, unused)
- ├── keywords/
- │ ├── extractKeywords.ts — ExtractKeywords() → compromise NLP (nouns/verbs/topics)
- │ ├── cleanUpKeywords.ts — CleanUpKeywords({maxNgrams, minLength}) → keyword-extractor + dedup
- │ └── refineKeywords.ts — RefineKeywords({queryLLM, cleanUpKeywords, prompt}) (kept aside, unused)
- ├── index/
- │ ├── vectraIndex.ts — VectraIndex(path) → LocalIndex init
- │ ├── indexContent.ts — IndexContent({embed, index}) → embed + upsert
- │ ├── getIndexStats.ts — GetIndexStats({index}) → {indexedAmount}
- │ ├── searchContentIndex.ts — SearchContentIndex({extractKeywords, cleanUpKeywords, embed, index})
- │ └── contentIndexDriver.ts — ContentIndexDriver({path, embed, extractKeywords, cleanUpKeywords})
- └── buildComponents.ts — wires everything, returns ready-to-use functions
-
- apps/
- ├── indexApp.ts — IndexApp({extractKeywords, cleanUpKeywords, indexContent})
- ├── searchApp.ts — SearchApp({searchContentIndex})
- ├── run.index.ts — CLI entry point for indexing
- └── run.search.ts — CLI entry point for search
-
- bin/
- ├── xindex-index — #!/usr/bin/env tsx → apps/run.index.ts
- └── xindex-search — #!/usr/bin/env tsx → apps/run.search.ts
- ```
-
- ## Final Pipeline
-
- ```
- INDEX (per file):
- file + filename
-
- ├─[1] extractKeywords ──→ compromise NLP (nouns, verbs, topics)
-
- ├─[2] cleanUpKeywords ──→ keyword-extractor + dedup + filter
-
- ├─[3] MiniLM-L6 ────────→ 384-dim embedding
-
- └─[4] vectra ────────────→ upsert { id, vector, metadata: { keywords, file } }
-
- SEARCH:
- user input
-
- ├─[1] extractKeywords ──→ compromise NLP
-
- ├─[2] cleanUpKeywords ──→ keyword-extractor + dedup + filter
-
- ├─[3] MiniLM-L6 ────────→ 384-dim embedding
-
- └─[4] vectra ────────────→ queryItems → ranked by cosine similarity
- ```
-
- ## Key Decisions
-
- **LLM refine step removed** — SmolLM2-135M was too small, acted as passthrough or generated garbage. Without it: 10x faster indexing, better accuracy (11/14 → 16/20 correct #1)
- **HOF component pattern** — all components are factory functions: `DoThing({deps}): IDoThing`. Export factory + type only, no default instances
- **Dependencies as destructured objects** — `DoThing({embed, index}: {embed: IEmbed, index: LocalIndex})`
- **Separate entry points** — `bin/xindex-index` and `bin/xindex-search` instead of one CLI with subcommands
- **ContentIndexDriver** — bundles index layer (indexContent, getIndexStats, searchContentIndex) behind one factory
-
- ## Final Test Results (41 files, 20 queries)
-
- | Metric | Value |
- |--------|-------|
- | Index time (41 files) | 1.07s (26ms/file) |
- | Search time | 0.71s (constant) |
- | Correct #1 | 16/20 (80%) |
- | Correct in top 3 | 19/20 (95%) |
- | Cross-domain isolation | Perfect |
-
- ## Resolved Review Items
-
- [x] Search loads unused LLM → removed LLM from pipeline entirely
- [x] `// --- Init ---` comment artifact → removed
- [ ] `BuildComponents` hardcodes index path → still hardcoded
- [ ] Stale `main` in package.json → still points to deleted xindex.ts
- [ ] `componets/` typo → deferred
-
- ## Open Questions (carried forward)
-
- What's the practical index size limit for vectra LocalIndex before it slows down?
- Is hybrid search (vector + BM25) in vectra good enough to skip a separate keyword index?
- Directory walking instead of explicit file list
- Chunking for large files — one embedding per file loses detail
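The "HOF component pattern" decision above can be sketched as follows. The names (`IndexContent`, `IEmbed`, the `store` dep) are illustrative stand-ins, not the repo's real components:

```typescript
// HOF component pattern: a factory function that takes its dependencies as a
// destructured object and returns the ready-to-use function.
type IEmbed = (text: string) => Promise<number[]>;
type IIndexContent = (id: string, text: string) => Promise<void>;

// Export the factory and its type only — no default instances.
function IndexContent({ embed, store }: { embed: IEmbed; store: Map<string, number[]> }): IIndexContent {
  return async (id, text) => {
    store.set(id, await embed(text)); // embed + upsert
  };
}
```

Wiring then happens in one place (the sketch's analogue of `buildComponents.ts`): `const indexContent = IndexContent({ embed, store })`.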
@@ -1,19 +0,0 @@
- ### 2026-04-10
-
- Created task from user note: move line clustering defaults (`threshold`, `minLines`, `maxDepth`) into `.xindex.json`.
- Ran scout via code search and xindex MCP search:
- confirmed hardcoded defaults in `componets/index/clusterLines.ts`
- confirmed config infra exists in `componets/config/xindexConfig.ts` + `componets/config/loadConfig.ts`
- confirmed `.xindex.json` already used for other settings
- Drafted initial plan with diagram + 3x3 steps.
- Updated from user clarifications:
- default threshold when missing is `0.7`
- threshold validation is strict `[0,1]` with fallback to default
- apply through shared wiring in this repo folder
- do not update `.xindex.json` file content now
- config shape finalized as flat keys: `clusterThreshold`, `clusterMinLines`, `clusterMaxDepth`
- Ran consistency pass and fixed task mismatches:
- removed old `0.75` default references
- replaced clamp/range language with strict validation wording
- removed docs/example update requirement that conflicted with user instruction
- Expanded task with Detailed Change Map + Acceptance Criteria for implementation readiness.
@@ -1,118 +0,0 @@
- # Task: Move ClusterLines defaults into .xindex.json
-
- ## Context
-
- User goal: move line-clustering configuration from hardcoded defaults to repo-level config in `.xindex.json`:
- `threshold = 0.70` (new default when missing)
- `minLines = 5`
- `maxDepth = 5`
- config key shape is flat:
- `clusterThreshold`
- `clusterMinLines`
- `clusterMaxDepth`
-
- Why: different repositories need different clustering behavior, so these values should be configurable per repo.
-
- Scout findings (`@xi` + code scan):
- `componets/index/clusterLines.ts` currently hardcodes defaults in the HOF signature:
- `threshold = 0.75`
- `minLines = 5`
- `maxDepth = 5`
- `.xindex.json` already exists and currently contains search/index config (`ignoreKeywords`, `ignoreFiles`, `maxSnippetLines`, `maxSnippetResults`).
- `componets/config/xindexConfig.ts` and `componets/config/loadConfig.ts` already provide optional config loading with defaults.
- `componets/buildComponents.ts` already loads `.xindex.json` and wires config into keyword cleanup, but does not yet pass clustering params to driver/components.
-
- Related active tasks:
- `task.2026-04-10-line-clustering.md`
- `task.2026-04-10-search-config.md`
-
- ## Goal
-
- Extend `.xindex.json` + config loading so line-clustering params are configurable per repository, and wire them into `ClusterLines` construction with strict threshold validation and the new default threshold `0.7` when config keys are absent.
-
- ## Diagram
-
- ```
- .xindex.json (optional)
- ┌──────────────────────────────────────────────┐
- │ existing: ignoreKeywords, ignoreFiles, ... │
- │ new: │
- │ clusterThreshold: number (default 0.7) │
- │ clusterMinLines: number (default 5) │
- │ clusterMaxDepth: number (default 5) │
- └───────────────────┬──────────────────────────┘
-
-
- LoadConfig -> IXindexConfig (defaults applied)
-
-
- BuildComponents
-
-
- ContentIndexDriver / ClusterLines factory
-
-
- clusterLines() uses repo-specific values
- ```
-
- ## Steps
-
- ### 1. Extend config schema (3x3)
- **1.1 Add fields to `IXindexConfig`** — add three clustering fields with explicit names and numeric types.
- **1.2 Parse + default in `LoadConfig`** — map new JSON keys to validated numbers with defaults `{clusterThreshold: 0.7, clusterMinLines: 5, clusterMaxDepth: 5}`.
- **1.3 Validate bounds** — `clusterThreshold` must be in `[0,1]` (otherwise fall back to the default), and line/depth values must be finite integers with safe lower bounds.
-
- ### 2. Wire config into clustering (3x3)
- **2.1 Thread config through builder/driver** — ensure the clustering factory gets config values from the `BuildComponents` path.
- **2.2 Update `ClusterLines` construction** — pass config values from driver wiring instead of relying on hardcoded constructor defaults.
- **2.3 Preserve backward compatibility** — missing `.xindex.json` or missing keys should still produce stable clustering behavior via loader defaults.
-
- ### 3. Validate behavior and docs (3x3)
- **3.1 Runtime sanity checks** — run the index/search flow to confirm no regressions and that loaded config values are honored.
- **3.2 Surface scope in repo entry points** — ensure clustering config is available from the common build path used by apps in this folder.
- **3.3 Add/update tests** — cover the default path (no config keys), the invalid threshold path (fall back to 0.7), and the override path (custom values).
-
- ## Detailed Change Map
-
- `componets/config/xindexConfig.ts`
- Add:
- `clusterThreshold: number`
- `clusterMinLines: number`
- `clusterMaxDepth: number`
- `componets/config/loadConfig.ts`
- Extend `DEFAULTS` with clustering keys (`0.7`, `5`, `5`).
- Parse new keys from `.xindex.json`.
- Apply strict validation:
- threshold valid only if finite number and `0 <= v <= 1`, otherwise default
- min/depth valid only if finite number, integerized, and `>= 1`, otherwise default
- `componets/index/contentIndexDriver.ts`
- Accept `config` in factory deps.
- Construct `ClusterLines({... , threshold: config.clusterThreshold, minLines: config.clusterMinLines, maxDepth: config.clusterMaxDepth})`.
- `componets/buildComponents.ts`
- Pass loaded `config` into `ContentIndexDriver`.
- Keep returning `config` so consumers in this folder can use consistent runtime config.
- `apps/run.*.ts` and `apps/mcpApp.ts` (as needed)
- No direct clustering logic, but rely on `BuildComponents` so new config applies everywhere in this folder through shared wiring.
-
- ## Acceptance Criteria
-
- `.xindex.json` may define clustering keys and they influence `ClusterLines` without code changes.
- Missing clustering keys use defaults: `clusterThreshold=0.7`, `clusterMinLines=5`, `clusterMaxDepth=5`.
- Invalid threshold values (e.g., `-0.1`, `1.2`, `"0.7"`, `null`) fall back to `0.7`.
- The indexing pipeline compiles and runs with unchanged public entry points.
- No update to `.xindex.json` file contents is required in this task.
-
- ## Decisions
-
- Default threshold changed from the old runtime value to `0.7`.
- Threshold validation is strict `[0,1]` with fallback to the default (no clamping).
- Scope is this repo folder via shared component wiring, not only one direct caller.
- Do not modify the current `.xindex.json` as part of this task.
- Config shape is flat keys in `.xindex.json`:
- `clusterThreshold`
- `clusterMinLines`
- `clusterMaxDepth`
-
- ## Open Questions
-
- Compatibility strategy: accept legacy names (if any) or only new canonical names?
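The strict validation described in step 1.3 and the Detailed Change Map can be sketched as follows. A minimal sketch, assuming the defaults from the task; the helper names `validThreshold` and `validCount` are illustrative, not the repo's actual `loadConfig.ts` code:

```typescript
// Defaults per the task: clusterThreshold=0.7, clusterMinLines=5, clusterMaxDepth=5.
const DEFAULTS = { clusterThreshold: 0.7, clusterMinLines: 5, clusterMaxDepth: 5 };

// Threshold is valid only if it is a finite number in [0, 1]; otherwise fall
// back to the default (strict validation, no clamping).
function validThreshold(v: unknown): number {
  return typeof v === "number" && Number.isFinite(v) && v >= 0 && v <= 1
    ? v
    : DEFAULTS.clusterThreshold;
}

// Line/depth values are valid only if finite, integerized, and >= 1.
function validCount(v: unknown, fallback: number): number {
  return typeof v === "number" && Number.isFinite(v) && Math.trunc(v) >= 1
    ? Math.trunc(v)
    : fallback;
}
```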
@@ -1,8 +0,0 @@
- ### 2026-04-10 — Task created
-
- Scouted: streamx has `from()`, `map`, `flat`, `pipe`, `run` — full async stream toolkit
- Current IndexApp is a simple for-loop over explicit file paths
- Node `fs.readdir({recursive: true})` exists but no gitignore support
- `.gitignore` already has sensible rules for the project
- User wants: files or dirs as input → recursive walk → stream → index
- User preference: use streamx from packages/ for the stream pipeline
@@ -1,92 +0,0 @@
- # Task: Directory-based Indexing with Async Streams
-
- ## Context
-
- Current `IndexApp` takes an explicit file list — no directory scanning. User wants to pass files **or** dirs, recursively scan dirs, and index everything as a stream.
-
- **Current pipeline** (`apps/indexApp.ts`):
- ```
- files[] → for each → readFile → extractKeywords → cleanUp → indexContent
- ```
-
- **streamx available** (`packages/streamx/`):
- `from(iterable)` — wraps async/sync iterable into StreamX
- `of()` → `.pipe()` for chaining
- Operators: `map`, `filter`, `flat`, `flatMap`, `batch`, `buffer`, `merge`, `scale`, `reduce`, `tap`
- `run()` — consumes stream, returns last value
-
- **Gitignore: `ignore` npm package** (used by ESLint, Prettier):
- ```ts
- import ignore from "ignore";
- const ig = ignore();
- ig.add(await readFile(".gitignore", "utf8")); // load rules
- ig.ignores("node_modules/foo.js"); // true
- ig.filter(["src/index.ts", "dist/out.js"]); // ["src/index.ts"]
- ```
- Paths must be **relative** to the .gitignore location
- `.add()` is stackable — call it per nested .gitignore
- Handles negation (`!`), globs, `**/`, comments
-
- **Decisions:**
- Paths are **relative to working directory** (children of cwd)
- Sequential indexing now; `scale()` for parallelism later
- Use the `ignore` npm package for .gitignore parsing
- Default: if no .gitignore in a folder, skip `.*` dirs (`.git`, `.idea`, etc.)
-
- ## Goal
-
- Accept files or directories as input, recursively walk directories (respecting .gitignore), and index all discovered files as an async stream using streamx.
-
- ## Diagram
-
- ```
- INPUT: ["file.ts", "src/", "lib/"]
-
- ├── file? ──→ yield relative path
-
- └── dir? ──→ walk recursively
-
- ├── load .gitignore (if exists)
- │ (else: default ignore .* dirs)
-
- ├── skip ignored paths
-
- └── yield each file (relative)
-
-
- from(walkFiles) ──→ streamx pipeline
-
- ├── map: readFile
- ├── map: extractKeywords + cleanUp
- └── tap: indexContent
-
-
- {indexed count}
- ```
-
- ## Steps
-
- ### 1. Directory Walker
- Create `componets/walkFiles.ts` — HOF `WalkFiles()` returning an async generator that yields relative file paths
- Detect file vs dir via `fs.stat`, yield files directly, recurse into dirs
- Use `node:fs/promises` `opendir` for streaming directory reads
-
- ### 2. Gitignore Filtering
- Install the `ignore` package — `npm install ignore`
- Load `.gitignore` per directory during the walk; stack rules with the parent via `ig.add()`
- Default rule when no `.gitignore`: skip `.*` dirs (`.git`, `.idea`, `.DS_Store`, etc.)
- Check `ig.ignores(relativePath)` before yielding or descending into subdirs
-
- ### 3. Stream Pipeline
- Wire the walker into streamx: `from(walkFiles(inputs))` → `pipe(map(indexFile))` → `run()`
- Update the `IndexApp` HOF to accept `string[]` (mix of files and dirs)
- `run.index.ts` passes argv as-is — no change needed
-
- ## Dependencies
-
- `ignore` — gitignore pattern matching (new dep)
- `packages/streamx` — async stream operators (already in repo)
-
- Sources:
- [ignore npm package](https://www.npmjs.com/package/ignore)
- [node-ignore GitHub](https://github.com/kaelzhang/node-ignore)
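The walker from step 1 could look roughly like this, using only node builtins. A sketch, not the repo's `walkFiles.ts`: the real task stacks `ignore` rules per directory, while this version applies only the default dot-dir rule described in the Decisions:

```typescript
import { opendir, stat } from "node:fs/promises";
import { join, relative } from "node:path";

// WalkFiles HOF: returns an async generator yielding file paths relative to cwd.
// Default rule when no .gitignore is loaded: skip dot-directories (.git, .idea, ...).
function WalkFiles() {
  async function* walkDir(dir: string): AsyncGenerator<string> {
    for await (const entry of await opendir(dir)) {
      const full = join(dir, entry.name);
      if (entry.isDirectory()) {
        if (!entry.name.startsWith(".")) yield* walkDir(full); // recurse, skip dot-dirs
      } else if (entry.isFile()) {
        yield relative(process.cwd(), full);
      }
    }
  }
  return async function* (inputs: string[]): AsyncGenerator<string> {
    for (const input of inputs) {
      const s = await stat(input); // detect file vs dir
      if (s.isDirectory()) yield* walkDir(input);
      else yield relative(process.cwd(), input);
    }
  };
}
```

The generator plugs straight into the step-3 pipeline as `from(walkFiles(inputs))`.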
@@ -1,50 +0,0 @@
1
- # Log: Line-level clustering
2
-
3
- ### 2026-04-10
4
-
5
- - Task created from user notes about recursive bisection clustering
6
- - Scouted codebase: current granularity is 1 vector per file, `id=filePath`, metadata in separate object store (MD5-keyed JSON)
- - Confirmed in-memory Vectra works via `VirtualFileStorage` (test-vectra-memory.ts) — no disk I/O, cosine similarity queries work with small dims
- - Key insight: Vectra `metadata: {}` is always empty in current code — all real metadata lives in object store. This pattern can extend to line-level clusters
- - Identified integration points: indexContent.ts (upsert), searchContentIndex.ts (query), indexMeta.ts (type), removeContent.ts (delete)
- - Open: similarity threshold calibration, keyword quality for code, cluster deletion strategy on re-index
- - Clarification round: user confirmed embedding cosine (not Jaccard) — meaning matters more than keyword overlap
- - Threshold: user says 0.55–0.70 range, will start at 0.6
- - Min cluster: 3–5 lines, default 5
- - ID format: `file.ts:12-45` works as-is for both Vectra and object store
- - Cleanup strategy: object store entry per file tracks all cluster IDs, delete all on re-index
- Researched NPM packages: semantic-chunking (jparkerweb), semantic-chunker (johnhenry, BYOE), LangChain RecursiveCharacterTextSplitter. All target prose/sentences, not code lines. Custom bisection with our embed pipeline is a better fit.
- - NAACL 2025: fixed-size chunks match semantic chunking for prose RAG, but code has mixed concerns per file where semantic splitting should win
- - Consistency check: expanded 3x3 → 6x3 steps with concrete file paths and implementation details
- - Fixed: diagram now shows full flow from handleFileEvent through clusterLines to persistent store + manifest
- - Fixed: IIndexMeta type is `{keywords, id}` not `{keywords, file}` — corrected in Context
- - Found: main change site is `indexFileContent.ts` (not `indexContent.ts`), and `handleFileEvent.ts` for cleanup
- - Found: cosine similarity doesn't need in-memory Vectra — embed returns normalized vectors, direct dot product suffices
- - Found: object store dual-use issue — need to store both cluster metas and file manifests with different shapes. Added to Open Questions.
- - Added Edge Cases section: small files, empty files, uniform content, legacy data
- - Consistency check #2: fixed Goal (removed "in-memory Vectra" — cosine is direct dot product), removed stale whole-file keyword step from diagram, added tagged union pattern for manifest, clarified RemoveFileContent as separate HOF, added buildComponents wiring step
- - Decision: cosine via direct dot product (Option A). 3-line helper, no Vectra needed for bisection — comparing exactly 2 vectors, not searching N. Fallback to in-memory Vectra (Option B) if needed.
- - **Key architecture correction**: existing file-level indexing must stay intact. Clustering is an EXTENSION, not a replacement. Both file-level and cluster-level entries coexist.
- - If file is cohesive (1 cluster = whole file) → skip clustering entirely, file-level entry is enough
- - Resolved object store dual-use: three separate keys — `filePath` (file meta), `filePath:1-10` (cluster meta), `filePath::manifest` (cluster ID list). Widen `IObjectStore` types to `IStoreEntry = IIndexMeta | IFileManifest`.
- - Traced full dependency wiring: `IndexFileContent` is constructed in run.*.ts (NOT inside contentIndexDriver). Plan: move construction inside driver since it has all deps. Simplifies callers.
- - `handleFileEvent` flow on re-index: `removeFileContent(path)` first (cleans file entry + clusters + manifest), then `indexFileContent(path, text)` (creates file entry + clusters + manifest)
- - Expanded steps from 6x3 → 7x(2-5) with concrete file paths, signatures, and implementation notes
- - Updated diagram to show full bidirectional flow: removal path + indexing path + all three store key types
- - Consistency check #3:
- - CRITICAL: fixed file paths — `indexFileContent.ts` and `handleFileEvent.ts` are in `componets/index/`, not `componets/`. Same for `removeFileContent.ts` (new file).
- - CRITICAL: found `indexApp.ts` gap — bulk indexer calls `indexFileContent` directly via stream (no `HandleFileEvent`). Old clusters would linger on re-index. Added `removeFileContent` dep to `IndexApp`, call cleanup before indexFileContent in stream callback.
- - Fixed edge case: empty files still get file-level entry (existing pipeline), clustering just returns `[]`.
- - Added `indexApp.ts`, `buildComponents.ts`, `run.*.ts` to Context key files.
- - Added step 6.3 for indexApp.ts and step 6.4 for import path cleanup.
- - Synced plan file with same fixes.
- - **Implementation complete** — all 7 steps done:
- - Step 1: created `componets/index/clusterLines.ts` — cosine helper, ILineCluster, ClusterLines HOF
- - Step 2: updated `indexMeta.ts` (IType tagged union: IIndexMeta, IClusterMeta, IFileManifest, IStoreEntry), `objectStore.ts` (widened types)
- - Step 3: created `componets/index/removeFileContent.ts` — manifest-aware cleanup
- - Step 4: updated `indexFileContent.ts` — kept file-level index, added clustering extension
- - Step 5: updated `contentIndexDriver.ts` (wires ClusterLines, IndexFileContent, RemoveFileContent inside), `buildComponents.ts` (returns new components)
- - Step 6: simplified `run.index.ts`, `run.watch.ts`, `run.mcp.ts` (removed manual IndexFileContent construction), updated `handleFileEvent.ts` (removeFileContent), updated `indexApp.ts` (added removeFileContent dep)
- - Also updated `indexContent.ts` (widened meta param) and `searchContentIndex.ts` (narrow by type)
- - Reset + re-index: 113 files → 211 indexed items (98 cluster entries created)
- - Search verified: cluster hits showing as `file.ts:fromLine-toLine` (e.g. `rnd/test-vectra-memory.ts:9-12`)
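The bisection the log above converges on — direct dot-product cosine (Option A), threshold 0.6, minimum cluster size 5, IDs rendered as `file.ts:fromLine-toLine` — could be sketched like this. It is a hypothetical shape; the actual `componets/index/clusterLines.ts` may differ, and the injected `embed` function stands in for the real embedding pipeline.

```typescript
// Illustrative recursive-bisection clustering: split a block of lines in
// half; if the halves' embeddings agree (cosine >= threshold), keep the
// block whole, otherwise recurse into each half.
export interface ILineCluster {
  fromLine: number; // 1-based, inclusive — renders as "file.ts:from-to"
  toLine: number;
  text: string;
}

// Embedding function is injected; vectors are assumed normalized.
type Embed = (text: string) => Promise<number[]>;

// For normalized vectors, cosine similarity is a plain dot product (Option A).
function cosine(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

export function ClusterLines(embed: Embed, threshold = 0.6, minLines = 5) {
  async function bisect(lines: string[], offset: number): Promise<ILineCluster[]> {
    // Too small to split without producing sub-minimum halves: one cluster.
    if (lines.length <= minLines * 2) {
      return [{ fromLine: offset, toLine: offset + lines.length - 1, text: lines.join("\n") }];
    }
    const mid = Math.floor(lines.length / 2);
    const top = lines.slice(0, mid);
    const bottom = lines.slice(mid);
    const [a, b] = await Promise.all([embed(top.join("\n")), embed(bottom.join("\n"))]);
    if (cosine(a, b) >= threshold) {
      // Halves are semantically close: the block is one cohesive cluster.
      return [{ fromLine: offset, toLine: offset + lines.length - 1, text: lines.join("\n") }];
    }
    return [...(await bisect(top, offset)), ...(await bisect(bottom, offset + mid))];
  }
  return async (text: string): Promise<ILineCluster[]> => {
    const lines = text.split("\n");
    if (lines.length < minLines) return []; // small/empty files: no clusters
    return bisect(lines, 1);
  };
}
```

Note that a cohesive file comes back as a single cluster spanning every line, which matches the log's rule: when one cluster equals the whole file, skip clustering and keep only the file-level entry.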