PyPI - corpus-analyzer - Versions diffs - 0.1.0__tar.gz - Mend

corpus-analyzer 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (628) hide show

corpus_analyzer-0.1.0/.env.example ADDED Viewed

@@ -0,0 +1,12 @@
+# Application
+CORPUS_DATABASE_PATH=corpus.sqlite
+CORPUS_REPORTS_DIR=reports
+CORPUS_TEMPLATES_DIR=templates
+# Ollama settings
+CORPUS_OLLAMA_HOST=http://localhost:11434
+CORPUS_OLLAMA_MODEL=llama3.2
+# Analysis
+CORPUS_CHUNK_SIZE=1000
+CORPUS_CHUNK_OVERLAP=100

corpus_analyzer-0.1.0/.gitignore ADDED Viewed

@@ -0,0 +1,45 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# Virtual environments
+.venv/
+venv/
+ENV/
+# Distribution / packaging
+build/
+dist/
+*.egg-info/
+*.egg
+# Testing
+.pytest_cache/
+htmlcov/
+.coverage
+coverage.xml
+# Linting / Type checking
+.ruff_cache/
+.mypy_cache/
+# IDEs
+.idea/
+.vscode/
+*.swp
+*.swo
+# OS files
+.DS_Store
+Thumbs.db
+# Project specific
+*.sqlite
+corpus.sqlite
+reports/
+templates/
+output/
+# Environment
+.env

corpus_analyzer-0.1.0/.planning/MILESTONES.md ADDED Viewed

@@ -0,0 +1,170 @@
+# Milestones
+## v1.0 MVP (Shipped: 2026-02-23)
+**Phases completed:** 4 phases, 15 plans
+**Timeline:** 2026-01-13 → 2026-02-23 (41 days)
+**Python LOC:** ~6,100
+**Delivered:** A semantic search engine for AI agent libraries — index local skills, workflows, and code; query via CLI, MCP, or Python API with sub-second hybrid BM25+vector retrieval.
+**Key accomplishments:**
+- LanceDB embedding pipeline: ChunkRecord schema, deterministic sha256 chunk IDs, OllamaEmbedder integration
+- Full ingestion CLI (`corpus add`, `corpus index`) with incremental re-indexing and XDG Base Directory compliance
+- Hybrid BM25+vector search engine with RRF fusion, filterable by source, file type, and construct type
+- Agent construct classifier (rule-based + LLM fallback) and AI summarizer per indexed file
+- FastMCP server with pre-warmed embeddings and `from corpus import search` Python API
+- Safety hardening: CLI KeyError fix, indexer warning logs, MCP `content_error` signaling
+**Archive:** `.planning/milestones/v1.0-ROADMAP.md`
+---
+## v1.1 Search Quality (Shipped: 2026-02-23)
+**Phases completed:** 2 phases, 4 plans
+**Timeline:** 2026-02-23 (1 day)
+**Python LOC:** ~6,248
+**Delivered:** Eliminated index noise via per-source extension allowlists and lifted `--construct` filtering accuracy from heuristic to 0.95-confidence via YAML frontmatter signals.
+**Key accomplishments:**
+- Per-source extension allowlist (`SourceConfig.extensions`) with sensible defaults — excludes `.sh`, `.html`, `.json`, `.lock`, binaries automatically
+- Extension filtering wired end-to-end: corpus.toml → walk_source → indexer → CLI warning on file removal
+- `ClassificationResult` dataclass with confidence + source tracking (frontmatter/rule_based/llm)
+- Frontmatter-first classifier: `type:` / `component_type:` → 0.95 confidence; `tags:` → 0.70 confidence
+- LanceDB schema v3 with `classification_source` + `classification_confidence` fields; idempotent in-place migration
+- `--construct` filter now leverages high-confidence frontmatter classifications persisted in LanceDB
+**Archive:** `.planning/milestones/v1.1-ROADMAP.md`
+---
+## v1.2 Graph Linker (Shipped: 2026-02-24)
+**Phases completed:** 2 phases (7–8), 1 GSD plan
+**Timeline:** 2026-02-24 (1 day)
+**Python LOC:** ~7,300
+**Delivered:** Added a directed relationship graph to the indexing pipeline — `corpus index` now auto-extracts `## Related Skills` / `## Related Files` links from Markdown and persists them as queryable graph edges, while dead API surface was removed.
+**Key accomplishments:**
+- `corpus index` auto-extracts `## Related Skills` / `## Related Files` edges into `graph.sqlite` with no extra commands
+- `corpus graph <slug>` exposes upstream (←) and downstream (→) neighbours for any indexed file
+- Closest-prefix ambiguity resolution picks the nearest-path candidate when multiple files share a slug
+- `corpus index` warns on duplicate slug collisions in yellow at index time
+- `corpus graph --show-duplicates` lists all slug collisions and the paths involved
+- Removed dead `use_llm_classification` parameter from `index_source()` / `SourceConfig`, shrinking the API surface
+**Archive:** `.planning/milestones/v1.2-ROADMAP.md`
+---
+## v1.3 Code Quality (Shipped: 2026-02-24)
+**Phases completed:** 4 phases, 11 plans
+**Timeline:** 2026-02-24 (1 day)
+**Delivered:** Achieved a zero-violation linting baseline across all 53 source files — `ruff check .` and `mypy src/` both exit 0 — via surgical config, auto-fix sweeps, and targeted manual fixes.
+**Key accomplishments:**
+- `pyproject.toml` surgical ruff/mypy config: `extend-exclude` for `.planning/`, per-file-ignores for E501 and B006
+- `ruff --fix` sweep eliminated ~370 auto-fixable violations across 37 files
+- Manual E741/E402/B017/B023/B904/E501 fixes across leaf modules, database hub, and cli.py
+- `cast(Table, ...)` pattern at all 8 sqlite-utils call sites — grep-able and refactor-safe
+- `DEFAULT_SYSTEM_PROMPT` trailing comma bug fixed (was `tuple[str]` at runtime, now `str`)
+- `Atom` dataclass promoted to module level; nested closures fully annotated in chunked_processor.py
+**Archive:** `.planning/milestones/v1.3-ROADMAP.md`
+---
+## v1.4 Search Precision (Shipped: 2026-02-24)
+**Phases completed:** 2 phases, 3 plans
+**Timeline:** 2026-02-24 (1 day)
+**Files modified:** 23 (+2,566 / -48 lines)
+**Delivered:** Gave users full control over search output quality — minimum-score filtering and sort-order control are now available across CLI, Python API, and MCP with RRF score guidance and a contextual hint when filtering eliminates all results.
+**Key accomplishments:**
+- `min_score: float = 0.0` parameter added to `hybrid_search()` with post-retrieval RRF score filtering (FILT-01)
+- `--min-score` CLI option with RRF range help text (0.009–0.033) so users can calibrate thresholds (FILT-02)
+- FILT-03 contextual hint: "No results above X.xxx. Run without --min-score to see available scores." on filtered-all
+- `--sort-by score|date|title` CLI option with `_CLI_SORT_BY_MAP` translation to engine vocabulary
+- Python `search()` API gains `sort_by` + `min_score` with `ValueError` on invalid sort values (PARITY-01/02)
+- MCP `corpus_search()` gains `min_score: Optional[float]` with `None`→`0.0` + `filtered_by_min_score` signal (PARITY-03)
+**Archive:** `.planning/milestones/v1.4-ROADMAP.md`
+---
+## v1.5 TypeScript AST Chunking (Shipped: 2026-02-24)
+**Phases completed:** 2 phases (15–16), 5 plans
+**Timeline:** 2026-02-24 (1 day)
+**Python LOC:** ~7,811
+**Delivered:** Replaced line-based chunking for TypeScript and JavaScript with tree-sitter AST-aware chunking — `.ts`, `.tsx`, `.js`, `.jsx` files now index at construct-level precision matching the existing Python AST chunker, with production safeguards for minified files and optional-dependency environments.
+**Key accomplishments:**
+- `chunk_typescript()` extracts 8 top-level node types (function, generator, class, abstract class, interface, type alias, lexical declaration, enum) with correct 1-indexed line boundaries and export-unwrapping
+- Grammar dispatched by extension: TypeScript for `.ts`, TSX for `.tsx`/`.jsx`, JavaScript for `.js`; `@lru_cache` grammar loader prevents re-parsing on 500+ file corpora
+- TDD RED→GREEN: 21-method `TestChunkTypeScript` class at full parity with `TestChunkPython` (all node types, non-ASCII identifiers, JSX parse, partial-error tree, catastrophic failure fallback)
+- IDX-08 size guard: files >50,000 chars bypass AST parse entirely — safe for minified and generated files
+- IDX-09 `ImportError` fallback: `chunk_file()` catches missing tree-sitter at call site; import of `ts_chunker` no longer raises in optional-dependency environments
+- Zero-violation quality gate: ruff 0 violations, mypy 0 errors, 320 tests passing across 54 source files
+**Archive:** `.planning/milestones/v1.5-ROADMAP.md`
+---
+## v2.0 Chunk Foundation (Shipped: 2026-02-24)
+**Phases completed:** 2 phases (17–18), 4 plans
+**Timeline:** 2026-02-24 (1 day)
+**Python LOC:** ~7,926
+**Files modified:** 26 (+3,540 / -986 lines)
+**Delivered:** Established the chunk data foundation — every indexed chunk now carries exact line boundaries and full text in LanceDB; `corpus search` output switched to grep/IDE-clickable format with chunk text preview.
+**Key accomplishments:**
+- LanceDB schema v4: `ChunkRecord` gains `chunk_name`, `chunk_text`, `start_line`, `end_line` with idempotent `ensure_schema_v4()` migration (CHUNK-01)
+- All three chunkers (Markdown, Python, TypeScript) emit v4 fields; `_enforce_char_limit` carries fields through all sub-chunk split paths
+- Zero-hallucination line-range contract verified via parametrised round-trip test across `.md`/`.py`/`.ts` fixtures
+- `format_result(result, cwd)` implemented with grep-style `path:start-end [type] score:X.XXX` output and 200-char indented chunk text preview (CHUNK-02)
+- `search_command` render loop replaced — CLI output is now IDE-clickable in VSCode/IntelliJ
+- Rich markup escaping on path, construct_type, and preview — no MarkupError on special chars; 340 tests passing
+**Archive:** `.planning/milestones/v2.0-ROADMAP.md`
+---
+## v2.1 Result Quality (Shipped: 2026-02-24)
+**Phases completed:** 7 phases (19–25), 11 plans
+**Timeline:** 2026-02-24 (1 day)
+**Python LOC:** ~8,000
+**Delivered:** Completed the chunk-level search experience across CLI, MCP, and Python API surfaces. Search results are now self-contained, finely chunked, reliably scored, and strictly formatted.
+**Key accomplishments:**
+- MCP `corpus_search` response now includes `start_line`, `end_line`, and `text` per result, making it a self-contained unit of knowledge (CHUNK-03)
+- Python and TypeScript AST chunkers sub-chunk classes at method level with `ClassName.method_name` naming (SUB-01–03)
+- `--name` filtering enabled across CLI and MCP via substring matching on chunk name (NAME-01–03)
+- Search scores normalised to 0–1 range internally; MCP `sort_by` achieves parity with CLI (SORT-01, NORM-01)
+- `corpus search --output json` emits a clean JSON array to stdout for shell automation (JSON-01)
+- `corpus_graph` MCP tool added for LLM-driven graph traversal using the existing `GraphStore` (GRAPH-01)
+- Quality gates verified: 85%+ branch coverage on chunking modules and zero-hallucination line-range contract enforced by parametrised integration tests (QUAL-01–02)
+**Archive:** `.planning/ROADMAP.md`
+---

corpus_analyzer-0.1.0/.planning/PROJECT.md ADDED Viewed

@@ -0,0 +1,215 @@
+# Corpus
+## What This Is
+Corpus is a local semantic search engine for AI agent libraries. It indexes agent skills, workflows, prompts, and code across configured directories and local repos, then makes them instantly queryable via CLI, MCP server, and Python API. Built for developers who maintain collections of agent skills and need to surface relevant files without grepping.
+v1.0 ships as a CLI tool (`corpus add`, `corpus index`, `corpus search`, `corpus status`) backed by LanceDB with hybrid BM25+vector retrieval. An MCP server (`corpus mcp serve`) exposes search to Claude Code and other agents. A Python API (`from corpus import search`) enables programmatic access.
+v1.1 adds configurable extension allowlists per source (no more config files polluting results) and frontmatter-aware construct classification — YAML `type:` and `component_type:` fields are classified at 0.95 confidence, making `--construct` filtering reliably accurate.
+v1.2 adds a relationship graph layer: `corpus index` now extracts `## Related Skills` / `## Related Files` links from indexed Markdown and persists them as a directed graph, queryable via `corpus graph <slug>`.
+v1.3 achieves a clean linting baseline across the entire codebase: zero mypy errors (`uv run mypy src/` exits 0 across 53 files) and zero ruff violations (`uv run ruff check .` exits 0). Per-file line-length override (120 chars) suppresses E501 in `llm/*.py`; B006 suppressed for Typer list defaults in `cli.py`.
+v1.4 gives users full control over search output: `--min-score <float>` filters results below the RRF threshold (0.009–0.033 range documented in help text); `--sort-by score|date|title` orders results; a contextual FILT-03 hint fires when filtering eliminates all results. Python API and MCP are at parity — `search()` accepts `sort_by` and `min_score`; `corpus_search()` MCP tool accepts `min_score: Optional[float]`.
+v1.5 replaces line-based chunking for TypeScript and JavaScript with tree-sitter AST-aware chunking. `.ts`, `.tsx`, `.js`, `.jsx` files now index at construct-level precision — functions, classes, interfaces, type aliases, enums, and generators extracted as separate chunks with correct line boundaries. Production-safe: 50K+ char files fall back to line chunking; missing tree-sitter is caught at call site with graceful fallback.
+v2 exposes chunk-level search results across all surfaces. CLI output switches to grep/compiler-error format (`path/to/file.md:42-67 [skill] score:0.021`) — IDE-clickable with exact line ranges. MCP `corpus_search` response becomes self-contained: each result carries `start_line`, `end_line`, and full chunk `text`. Python and TypeScript class bodies are sub-chunked at method level (`ClassName.method_name`). A new `--name` CLI flag and MCP `name` parameter filter by chunk name. MCP gains `sort_by` support and 0–1 normalised scores. A `corpus search --output json` flag enables shell piping. A new `corpus_graph` MCP tool allows LLM clients to walk codebase dependency graphs.
+## Core Value
+Surface relevant agent files instantly — query an entire local agent library and get ranked, relevant results in under a second.
+## Current Milestone: v2.1 Result Quality (Shipped)
+**Goal:** Complete the chunk-level search experience — MCP self-contained chunks, method sub-chunking, name filtering, normalised scores, JSON output, and graph MCP.
+**Status:** ✅ Complete (2026-02-24)
+**Delivered features:**
+- ✅ MCP `corpus_search` response includes `start_line`, `end_line`, `text` per result (CHUNK-03)
+- ✅ Python + TypeScript method sub-chunking: `ClassName.method_name` naming (SUB-01–03)
+- ✅ `corpus search --name <fragment>` and MCP `name` parameter (NAME-01–03)
+- ✅ MCP `sort_by` + 0–1 normalised scores (SORT-01, NORM-01)
+- ✅ `corpus search --output json` for shell piping (JSON-01)
+- ✅ `corpus_graph` MCP tool for LLM graph traversal (GRAPH-01)
+- ✅ 85%+ branch coverage on chunking modules; zero-hallucination line-range contract (QUAL-01–02)
+## Requirements
+### Validated
+- ✓ LanceDB embedding pipeline with deterministic chunk IDs — v1.0
+- ✓ Source config via `corpus.toml` (CONF-01–CONF-04) — v1.0
+- ✓ `corpus add <dir>` CLI command (CONF-05) — v1.0
+- ✓ `corpus index` with incremental re-indexing and ghost document removal (INGEST-01–INGEST-07) — v1.0
+- ✓ Structure-aware chunking: heading-based (`.md`), AST-based (`.py`), line-based fallback — v1.0
+- ✓ `corpus search` with hybrid BM25+vector+RRF ranking (SEARCH-01, SEARCH-02) — v1.0
+- ✓ Search filters: `--source`, `--type`, `--construct`, `--limit` (SEARCH-03, SEARCH-04, SEARCH-05) — v1.0
+- ✓ Agent construct classifier: skill/prompt/workflow/agent_config/code/documentation (CLASS-01–CLASS-03) — v1.0
+- ✓ AI summarizer per indexed file with Ollama, config-gated (SUMM-01–SUMM-03) — v1.0
+- ✓ `corpus status` with file count, chunk count, last indexed, model (CLI-04) — v1.0
+- ✓ FastMCP server with pre-warmed embeddings, `corpus_search` tool (MCP-01–MCP-06) — v1.0
+- ✓ Python API: `search()`, `index()`, `SearchResult` dataclass (API-01–API-03) — v1.0
+- ✓ Safety: no CLI KeyErrors, no silent exception swallowing, MCP `content_error` signaling — v1.0
+- ✓ Per-source extension allowlist in corpus.toml; default covers doc/code types, excludes junk (CONF-06, CONF-07, CONF-08) — v1.1
+- ✓ Frontmatter-aware construct classifier: `type:`, `component_type:`, `tags` at 0.95/0.95/0.70 confidence (CLASS-04, CLASS-05) — v1.1
+- ✓ `classification_source` + `classification_confidence` persisted in LanceDB; schema v3 in-place migration — v1.1
+- ✓ Relationship graph: `corpus index` extracts + persists `## Related Skills` / `## Related Files` edges; `corpus graph <slug>` queries upstream/downstream neighbours; closest-prefix ambiguity resolution; `--show-duplicates` (GRAPH-01–GRAPH-05) — v1.2
+- ✓ Dead `use_llm_classification` parameter removed from `index_source()` and `SourceConfig` (CLEAN-01) — v1.2
+- ✓ `pyproject.toml` surgical ruff/mypy config: `extend-exclude`, `per-file-ignores`, `[[tool.mypy.overrides]]` (CONF-01–CONF-04, RUFF-01, RUFF-02) — v1.3
+- ✓ `ruff --fix` sweep eliminated ~370 auto-fixable violations across 37 files (RUFF-01, RUFF-02) — v1.3
+- ✓ Manual E741/E402/B017/B023/B904/E501 fixes across leaf modules, database hub, and cli.py (RUFF-03–RUFF-07) — v1.3
+- ✓ `cast(Table, ...)` pattern at all sqlite-utils call sites; parameterised generics; float() guards (MYPY-01) — v1.3
+- ✓ `Atom` promoted to module level; nested functions fully annotated in `chunked_processor.py` (MYPY-02) — v1.3
+- ✓ `DEFAULT_SYSTEM_PROMPT` trailing comma bug fixed (was `tuple[str]`, now `str`) (MYPY-05) — v1.3
+- ✓ `uv run ruff check .` → 0 violations; `uv run mypy src/` → 0 errors; 281 tests green (VALID-01–VALID-03) — v1.3
+- ✓ `--min-score <float>` CLI flag filters results below RRF threshold; help text documents 0.009–0.033 range (FILT-01, FILT-02) — v1.4
+- ✓ FILT-03 contextual hint when `--min-score` filters all results — v1.4
+- ✓ `--sort-by score|date|title` CLI flag with engine vocabulary translation — v1.4
+- ✓ Python `search()` API accepts `sort_by` and `min_score` with `ValueError` validation (PARITY-01, PARITY-02) — v1.4
+- ✓ MCP `corpus_search()` accepts `min_score: Optional[float]` with `None`→`0.0` + `filtered_by_min_score` signal (PARITY-03) — v1.4
+- ✓ `pyproject.toml` updated with `tree-sitter>=0.25.0` and `tree-sitter-language-pack>=0.13.0`; `uv sync` succeeds with pre-compiled wheels (DEP-01) — v1.5
+- ✓ `chunk_file()` dispatch routes `.ts`, `.tsx`, `.js`, `.jsx` to `chunk_typescript()` (IDX-01) — v1.5
+- ✓ `chunk_typescript()` extracts 8 top-level node types with correct 1-indexed line boundaries (IDX-02) — v1.5
+- ✓ `export_statement` unwrapping: `export function foo()` and `export class Bar` produce correctly-named chunks (IDX-03) — v1.5
+- ✓ Grammar dispatched by extension: TypeScript/TSX/JavaScript; `@lru_cache` grammar loader (IDX-04, IDX-07) — v1.5
+- ✓ Silent fallback to `chunk_lines()` on exception or zero constructs; no fallback on `has_error` alone (IDX-05) — v1.5
+- ✓ `chunk_name` field in returned chunk dict (IDX-06) — v1.5
+- ✓ Size guard: files >50,000 chars fall back to `chunk_lines()` before AST parse (IDX-08) — v1.5
+- ✓ `ImportError` guard in `chunk_file()` — missing tree-sitter falls back to line chunking (IDX-09) — v1.5
+- ✓ `TestChunkTypeScript` at full parity with `TestChunkPython` — 21 test methods (TEST-01) — v1.5
+- ✓ `TestChunkFile` dispatch assertions for all four TS/JS extensions (TEST-02) — v1.5
+- ✓ `ruff check .` exits 0, `mypy src/` exits 0, 320 tests passing (QUAL-01) — v1.5
+- ✓ LanceDB schema v4: `start_line`, `end_line`, `chunk_name`, `chunk_text` persisted per chunk; `ensure_schema_v4()` idempotent migration; all chunkers emit v4 fields (CHUNK-01) — v2.0
+- ✓ CLI `corpus search` output: grep/IDE-clickable format `path:start-end [construct] score:X.XXX` with 200-char indented chunk text preview; Rich markup escaped (CHUNK-02) — v2.0
+### Completed (v2.1)
+- [x] MCP `corpus_search` response includes `start_line`, `end_line`, `text` per result — self-contained unit of knowledge (CHUNK-03)
+- [x] Python AST chunker: class header chunk (`ClassName`) + per-method chunks (`ClassName.method_name`) (SUB-01, SUB-02)
+- [x] TypeScript AST chunker: per-method chunks for class bodies using `ClassName.method_name` naming (SUB-03)
+- [x] `corpus search --name <fragment>` CLI flag; MCP `name` parameter — case-insensitive substring filter on `chunk_name` (NAME-01, NAME-02, NAME-03)
+- [x] MCP `corpus_search` accepts `sort_by` parameter (same vocabulary as CLI); scores normalised to 0–1 per query (SORT-01, NORM-01)
+- [x] `corpus search --output json` — JSON array to stdout for shell piping (JSON-01)
+- [x] MCP `corpus_graph` tool: accepts `slug`, returns upstream/downstream neighbour lists (GRAPH-01)
+- [x] 85%+ branch coverage on chunking modules; zero-hallucination parametrised integration test for line ranges (QUAL-01, QUAL-02)
+### Planned (v3)
+- [ ] MMR-style result diversity: near-duplicate/same-file chunks penalized; architectural spread ensured (RANK-01)
+- [ ] Contiguous sub-chunk merging: adjacent method chunks from same file merged into single display entry (RANK-02)
+- [ ] Optional `--rerank` flag: two-stage cross-encoder/LLM re-ranking of top-20 results (RANK-03)
+- [ ] `corpus search --expand-graph`: results include immediate graph neighbors labeled `[Depends On]` / `[Imported By]` (GEXP-01)
+- [ ] `corpus graph --depth N`: recursive N-hop walk (default 2); hub nodes (high indegree) labeled with count (GWALK-01, GWALK-02)
+- [ ] Centrality scoring: `corpus index` computes indegree centrality; high-centrality files receive score multiplier in search (GCENT-01)
+- [ ] `corpus search --within-graph <slug>`: +20% score boost for results in same graph component (soft filter) (GSCOPE-01)
+- [ ] Multiple `--query` flags: AND-intersection of semantic spaces (MULTI-01)
+- [ ] `--exclude-path <glob>`: structural negative filter on file path (MULTI-02)
+- [ ] Source labeling: CLI results display source name prefix; MCP `corpus_search` includes `source` field (XSRC-01, XSRC-02)
+### Out of Scope
+- Hosted/cloud index — local-only (privacy, simplicity)
+- Real-time file watching — manual index refresh only; daemon complexity not justified yet
+- Web UI — CLI + MCP sufficient for target users
+- LLM rewriting (corpus-analyzer original feature) — retained but not the focus
+- Chunk-level results (CHUNK-01–CHUNK-03) — addressed in v2
+## Context
+**v2.1 shipped 2026-02-24. Chunk-level result quality is complete across CLI, MCP, and Python API. Next work moves to deferred v3 ranking, graph expansion, and multi-query capabilities.**
+- ~8k lines Python source across 54 files
+- Tech stack: LanceDB, FastMCP, Pydantic, Typer, Rich, OllamaEmbedder (nomic-embed-text), tree-sitter + tree-sitter-language-pack; graph layer uses SQLite (`graph.sqlite`)
+- 433 tests passing (pytest)
+- XDG Base Directory compliant: config in `~/.config/corpus/`, data in `~/.local/share/corpus/`
+- Single-user local tool; no daemon, no cloud dependency
+- Zero mypy errors, zero ruff violations — clean linting baseline maintained through v2.1
+Shipped in v2.0 (Phases 17–18):
+- CLI output is now grep/IDE-clickable: `path:start-end [construct] score:X.XXX` with chunk text preview
+- LanceDB stores exact line boundaries and full chunk text per chunk (schema v4)
+v2.1-delivered improvements:
+- MCP `corpus_search` now returns chunk-level self-contained results (`start_line`, `end_line`, `text`)
+- Python and TypeScript class method sub-chunking is complete (`ClassName.method_name`)
+- Name filter is available on CLI and MCP (`--name` / `name`)
+- Scores are normalised to 0–1; MCP `sort_by` is parity-complete
+- JSON output mode is available on CLI search (`--output json`)
+- Graph traversal is exposed via MCP (`corpus_graph`)
+Remaining known limitations (deferred):
+- Cold-start on first index after KEEP_ALIVE expiry still possible (pre-warm only covers MCP startup)
+- `.d.ts` ambient declarations excluded at scanner level — niche use case, deferred indefinitely
+## Constraints
+- **Runtime**: Python 3.12, managed with `uv`
+- **Package manager**: `uv` only (no pip/poetry)
+- **Embeddings**: Must support offline (Ollama) — cloud providers optional
+- **Storage**: LanceDB for vector+metadata; FTS via tantivy (built into LanceDB)
+- **Existing tests**: Test suite must stay green
+## Key Decisions
+| Decision | Rationale | Outcome |
+|----------|-----------|---------|
+| Pivot corpus-analyzer rather than start fresh | Extraction, models, DB, CLI already built | ✓ Good — saved ~2 phases of setup work |
+| LanceDB over sqlite-vec | Embedded, ships hybrid search + BM25 + RRF natively | ✓ Good — no external service, fast queries |
+| Hybrid search (vector + BM25 via RRF) | Better recall than pure vector; handles exact name matches | ✓ Good — validated in UAT |
+| Protocol-based embedding abstraction | Offline-first (Ollama) with optional cloud providers | ✓ Good — clean swap if needed |
+| Store embedding model name in schema | Cannot be retrofitted after first embed | ✓ Good — enforced at query time |
+| AST-aware chunking for `.py`, heading-based for `.md` | Better chunk coherence vs line-based | ✓ Good — search quality improved |
+| Rule-based classifier first, LLM fallback | Cost-conscious; LLM gate via `use_llm_classification` | ✓ Good — zero LLM cost by default |
+| Default `use_llm_classification = False` | Avoid unexpected Ollama API costs | ✓ Good — explicit opt-in |
+| `needs_reindex()` removed entirely | Hash-based change detection makes it redundant | ✓ Good — dead code eliminated |
+| FTS rebuild only in `index_source()` | Redundant rebuild in `CorpusSearch.__init__` was wasteful | ✓ Good — ~20% faster search init |
+| MCP `content_error` field vs exception | Avoids breaking existing clients; explicit error signal | ✓ Good — graceful degradation |
+| Path constants in `config/schema.py` | Avoid circular imports from dual definitions | ✓ Good — single source of truth |
+| `ClassificationResult` dataclass (not raw string) | Structured data enables confidence + source tracking | ✓ Good — clean extension point for LLM/rule/frontmatter sources |
+| Frontmatter priority > rule-based | Explicit declaration is higher signal than heuristics | ✓ Good — 0.95 vs 0.60 confidence gap makes priority clear |
+| `type:` confidence 0.95, `tags:` confidence 0.70 | Explicit type is direct match; tags are softer signal | ✓ Good — appropriate weighting |
+| `ensure_schema_v3()` in-place migration | Existing users don't rebuild index on upgrade | ✓ Good — idempotent, zero data loss |
+| Extension allowlist default includes doc/code types | Sensible out-of-box behavior without requiring config | ✓ Good — excludes `.sh`, `.html`, `.json`, `.lock`, binaries automatically |
+| SQLite for graph store (not LanceDB) | Graph edges are relational; SQLite is simpler and avoids LanceDB schema overhead | ✓ Good — separate `graph.sqlite` keeps concerns cleanly separated |
+| Closest-prefix ambiguity resolution for slugs | Context-aware: picks the candidate closest in filesystem path to the referencing file | ✓ Good — handles multi-repo agent libraries without false matches |
+| Hardcode `use_llm=False` at call site (not remove arg) | `classify_file` defaults to `use_llm=True`; removing kwarg would silently switch to LLM classification | ✓ Good — preserved rule-based classification behaviour |
+| `extend-exclude` over `exclude` in ruff config | Preserves ruff defaults (`.venv`, `.git`) while adding `.windsurf/`, `.planning/` | ✓ Good — no accidental exclusion of user dirs |
+| `cast(Table, ...)` per call site (not `# type: ignore`) | Explicit and refactor-safe; grep-able pattern | ✓ Good — all 8 sqlite-utils call sites annotated cleanly |
+| `DEFAULT_SYSTEM_PROMPT` trailing comma removed (not cast away) | Was a real `tuple[str]` runtime bug — `+=` on a tuple would raise `TypeError` at runtime | ✓ Good — genuine bug fixed, not suppressed |
+| `Atom` promoted to module level (not type: ignore on nested) | Nested dataclasses prevent typed annotations on helper closures | ✓ Good — cleaner module structure, zero mypy errors |
+| `OllamaClient.db: CorpusDatabase \| None = None` added | `advanced_rewriter.py` uses `client.db`; TYPE_CHECKING guard avoids circular import | ✓ Good — optional field, no runtime impact if unused |
+| Post-retrieval Python filter for `min_score` (not LanceDB `.where()`) | RRF scores aren't stored as a native LanceDB column; filtering in-process is simpler and correct | ✓ Good — 4 tests confirm behaviour; no schema change needed |
+| `_CLI_SORT_BY_MAP` + `_API_SORT_MAP` translation dicts | CLI/API vocabulary (`score/date/title`) decoupled from engine vocabulary (`relevance/date/path`) | ✓ Good — user-facing names are stable even if engine internals change |
+| MCP `min_score: Optional[float]` with `None`→`0.0` guard | MCP schema requires Optional for backward-compat; explicit guard makes the no-op intent clear | ✓ Good — tested with both `None` and `0.02` cases |
+| FILT-03 branch checks `min_score > 0.0` before generic no-results | Contextual hint only fires when the user deliberately filtered; doesn't appear on regular empty results | ✓ Good — tested by asserting absence of generic message alongside presence of hint |
+| `get_parser(dialect)` from `tree_sitter_language_pack` with `@lru_cache` | Simpler than module-level globals; one Parser per dialect per process | ✓ Good — prevents 30s overhead on 500+ TS file corpora |
+| `.tsx` and `.jsx` both use TSX grammar | TypeScript grammar produces ERROR nodes on JSX elements | ✓ Good — all JSX cases handled cleanly |
+| Fall back only on exception or zero constructs, NOT on `has_error` | tree-sitter is error-tolerant; partial trees still yield good constructs | ✓ Good — no unnecessary fallback on syntax errors |
+| `export default function(){}` → detect 'default' child, emit full export_statement | `declaration` field is `None` for anonymous default exports; special-case required | ✓ Good — `test_export_default_function` passes |
+| Lazy import of `ts_chunker` inside `chunk_file` elif branch | Avoids circular import (`ts_chunker` imports from `chunker`) | ✓ Good — clean module boundary, no import side-effects |
+| Separate `except ImportError:` before `except Exception:` (not merged) | Explicit over broad; ImportError is a distinct, expected failure mode | ✓ Good — IDX-09 contract clear and testable |
+| Size guard before parser call (not after) | Minified files exit early with zero AST overhead | ✓ Good — 50K char threshold validated by test |
+| Smoke test via Python API (`chunk_file`) not CLI (`corpus index`) | `corpus index` requires configured LanceDB sources; API call confirms dispatch wiring without external config | ✓ Good — clean, hermetic validation |
+| CLI format: `file:start-end [construct] score:X.XXX` (grep/compiler-error pattern) | IDE-clickable natively in VSCode/IntelliJ; parseable by shell scripts; matches developer muscle memory | ✓ Good — shipped v2.0 |
+| MCP returns full `text` (untruncated); CLI truncates to 200 chars | MCP callers are LLMs — self-contained chunk avoids follow-up file-read; CLI is human-readable terminal output | ✓ Good — shipped in v2.1 |
+| `start_line`/`end_line` are 1-indexed, inclusive | Matches existing chunker convention (tree-sitter 0-indexed → +1); matches editor conventions | ✓ Good — shipped v2.0 |
+| `ensure_schema_v4()` adds columns with defaults, does not rebuild | Same pattern as v3 migration (v1.1); users re-index to backfill; idempotent | ✓ Good — shipped v2.0 |
+| `format_result` returns `(primary, preview \| None)` tuple — pure function | Console.print() owns rendering; testable without output side effects | ✓ Good — shipped v2.0 |
+| `ClassName.method_name` dot notation for method sub-chunks | Unambiguous; matches Python attribute access syntax; makes `--name` filtering readable | ✓ Good — shipped in v2.1 |
+| Class header chunk = docstring + `__init__` up to first non-assignment | Makes class purpose searchable independently from methods; bounded and deterministic | ✓ Good — shipped in v2.1 |
+| `--name foo` flag (not `--construct name:foo`) | `--construct` already has a fixed meaning (agent construct type); separate flag is cleaner UX | ✓ Good — shipped in v2.1 |
+| Case-insensitive substring match for `--name` | Most useful for exploratory search; exact match would require knowing the precise chunk name | ✓ Good — shipped in v2.1 |
+| Per-query min-max normalisation (not global baseline) | Cross-query normalisation is misleading for hybrid scores; within-query is correct and simple | ✓ Good — shipped in v2.1 |
+| Normalisation inside search engine (not display layer) | API, CLI, and MCP all receive same normalised value; existing score-range tests must be updated | ✓ Good — shipped in v2.1 |
+| `--output json` flag (not separate subcommand) | Composable with all existing filter flags; keeps CLI surface small | ✓ Good — shipped in v2.1 |
+| `corpus_graph` MCP returns upstream/downstream lists (not graph object) | LLMs iterate with follow-up calls; flat lists are simpler to consume than nested structures | ✓ Good — shipped in v2.1 |
+| `corpus_graph` MCP reuses `GraphStore` (no new storage) | Graph layer complete since v1.2; MCP tool is a thin wrapper; zero schema changes required | ✓ Good — shipped in v2.1 |
+---
+*Last updated: 2026-02-24 after v2.1 milestone completion sync*

corpus_analyzer-0.1.0/.planning/REQUIREMENTS.md ADDED Viewed

@@ -0,0 +1,100 @@
+# Requirements: Corpus v2.1 Result Quality
+**Defined:** 2026-02-24
+**Core Value:** Surface relevant agent files instantly — query your entire local agent library and get ranked results in under a second
+## v2.1 Requirements
+### MCP Completeness
+- [x] **CHUNK-03**: MCP `corpus_search` response includes `start_line`, `end_line`, and full untruncated `text` per result — self-contained unit of knowledge for LLM callers
+- [x] **SORT-01**: MCP `corpus_search` accepts `sort_by` parameter (same vocabulary as CLI: `relevance | construct | confidence | date | path`); translated to engine vocabulary via existing `_API_SORT_MAP`
+- [x] **NORM-01**: Scores normalised to 0–1 per query via min-max normalisation inside the search engine; Python API `SearchResult.score`, CLI display, and MCP response all see the same normalised value
+### Method Sub-Chunking
+- [x] **SUB-01**: Python AST chunker produces a class header chunk (`ClassName`) containing docstring + `__init__` body up to first non-assignment statement; replaces monolithic class chunk
+- [x] **SUB-02**: Python AST chunker produces per-method chunks (`ClassName.method_name`) for each method in a class; `__init__`, `__str__`, etc. follow the same dot-notation
+- [x] **SUB-03**: TypeScript AST chunker produces per-method chunks (`ClassName.method_name`) for `method_definition` nodes in class bodies; constructor becomes `ClassName.constructor`
+### Name Filtering
+- [x] **NAME-01**: `corpus search --name <fragment>` CLI flag; case-insensitive substring match on `chunk_name`; composable with all existing filter flags
+- [x] **NAME-02**: MCP `corpus_search` accepts `name: Optional[str]` parameter; same case-insensitive substring match semantics
+- [x] **NAME-03**: `corpus search --name foo` (no positional query) is valid; lists all chunks matching the name filter ordered by `--sort-by` (default: relevance)
+### JSON Output
+- [x] **JSON-01**: `corpus search --output json` emits a valid JSON array to stdout; suppresses Rich formatting entirely; composable with all filter flags including `--name`
+### Graph MCP
+- [x] **GRAPH-01**: MCP `corpus_graph` tool accepts `slug: str`; returns `upstream: list[str]` and `downstream: list[str]` neighbour paths; reuses `GraphStore` without new storage
+### Quality
+- [x] **QUAL-01**: `pytest --cov --cov-branch` reports 85%+ branch coverage on ingest chunking modules (`ingest/chunker.py`, `ingest/ts_chunker.py`)
+- [x] **QUAL-02**: Parametrised zero-hallucination integration test: every stored `start_line`/`end_line` matches actual file content across `.md`, `.py`, and `.ts` fixtures
+## v3 Requirements (Deferred)
+### Ranking
+- **RANK-01**: MMR-style result diversity — near-duplicate/same-file chunks penalised; architectural spread ensured
+- **RANK-02**: Contiguous sub-chunk merging — adjacent method chunks from same file merged into single display entry
+- **RANK-03**: Optional `--rerank` flag — two-stage cross-encoder/LLM re-ranking of top-20 results
+### Graph Traversal
+- **GEXP-01**: `corpus search --expand-graph` — results include immediate graph neighbours labelled `[Depends On]` / `[Imported By]`
+- **GWALK-01**: `corpus graph --depth N` — recursive N-hop walk (default 2); hub nodes labelled with indegree count
+- **GWALK-02**: Cycle detection (`visited: set[str]`) before any recursive graph walk
+- **GCENT-01**: Centrality scoring — `corpus index` computes indegree centrality; high-centrality files receive score multiplier in hybrid search
+- **GSCOPE-01**: `corpus search --within-graph <slug>` — +20% score boost for results in same graph component
+### Multi-Query
+- **MULTI-01**: Multiple `--query` flags — AND-intersection of semantic spaces via overfetch+score-sum
+- **MULTI-02**: `--exclude-path <glob>` — structural negative filter on file path
+### Cross-Source
+- **XSRC-01**: CLI results display source name prefix
+- **XSRC-02**: MCP `corpus_search` includes `source` field per result
+## Out of Scope
+| Feature | Reason |
+|---------|--------|
+| Chunk overlap / sliding window | AST-based chunking is non-overlapping by design; overlap requires different strategy |
+| Streaming MCP responses | Chunk text bounded by 50K size guard; streaming complexity not justified yet |
+| Persistent cross-query normalisation | Cross-query normalisation is misleading for hybrid scores; per-query is correct |
+| Cloud/hosted index | Local-only (privacy, simplicity constraint) |
+| Web UI | CLI + MCP sufficient for target users |
+## Traceability
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| CHUNK-03 | Phase 19 | Complete |
+| SUB-01 | Phase 20 | Complete |
+| SUB-02 | Phase 20 | Complete |
+| SUB-03 | Phase 21 | Complete |
+| NAME-01 | Phase 22 | Complete |
+| NAME-02 | Phase 22 | Complete |
+| NAME-03 | Phase 22 | Complete |
+| NORM-01 | Phase 23 | Complete |
+| SORT-01 | Phase 23 | Complete |
+| JSON-01 | Phase 24 | Complete |
+| QUAL-01 | Phase 25 | Complete |
+| QUAL-02 | Phase 25 | Complete |
+| GRAPH-01 | Phase 25 | Complete |
+**Coverage:**
+- v2.1 requirements: 13 total
+- Mapped to phases: 13
+- Unmapped: 0 ✓
+---
+*Requirements defined: 2026-02-24*
+*Last updated: 2026-02-24 after roadmap creation (Phases 19–25)*