npm - nano-brain - Versions diffs - 2026.1.0 - Mend

nano-brain 2026.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (79)

package/AGENTS_SNIPPET.md +36 -0
package/CHANGELOG.md +68 -0
package/README.md +281 -0
package/SKILL.md +153 -0
package/bin/cli.js +18 -0
package/index.html +929 -0
package/nano-brain +4 -0
package/opencode-mcp.json +9 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/.openspec.yaml +2 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/design.md +68 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/proposal.md +27 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/specs/mcp-integration-testing/spec.md +50 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/specs/mcp-server/spec.md +40 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/specs/search-pipeline/spec.md +29 -0
package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/tasks.md +37 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/.openspec.yaml +2 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/design.md +111 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/proposal.md +30 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/specs/mcp-server/spec.md +33 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/specs/storage-limits/spec.md +90 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/specs/workspace-scoping/spec.md +66 -0
package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/tasks.md +199 -0
package/openspec/changes/codebase-indexing/.openspec.yaml +2 -0
package/openspec/changes/codebase-indexing/design.md +169 -0
package/openspec/changes/codebase-indexing/proposal.md +30 -0
package/openspec/changes/codebase-indexing/specs/codebase-collection/spec.md +187 -0
package/openspec/changes/codebase-indexing/specs/mcp-server/spec.md +36 -0
package/openspec/changes/codebase-indexing/tasks.md +56 -0
package/openspec/specs/mcp-integration-testing/spec.md +50 -0
package/openspec/specs/mcp-server/spec.md +75 -0
package/openspec/specs/search-pipeline/spec.md +29 -0
package/openspec/specs/storage-limits/spec.md +94 -0
package/openspec/specs/workspace-scoping/spec.md +70 -0
package/package.json +34 -0
package/site/build.js +66 -0
package/site/partials/_api.html +83 -0
package/site/partials/_compare.html +100 -0
package/site/partials/_config.html +23 -0
package/site/partials/_features.html +43 -0
package/site/partials/_footer.html +6 -0
package/site/partials/_hero.html +9 -0
package/site/partials/_how-it-works.html +26 -0
package/site/partials/_models.html +18 -0
package/site/partials/_quick-start.html +15 -0
package/site/partials/_stats.html +1 -0
package/site/partials/_tech-stack.html +13 -0
package/site/script.js +12 -0
package/site/shell.html +44 -0
package/site/styles.css +548 -0
package/src/chunker.ts +427 -0
package/src/codebase.ts +331 -0
package/src/collections.ts +192 -0
package/src/embeddings.ts +293 -0
package/src/expansion.ts +79 -0
package/src/harvester.ts +306 -0
package/src/index.ts +503 -0
package/src/reranker.ts +103 -0
package/src/search.ts +294 -0
package/src/server.ts +664 -0
package/src/storage.ts +221 -0
package/src/store.ts +623 -0
package/src/types.ts +202 -0
package/src/watcher.ts +384 -0
package/test/chunker.test.ts +479 -0
package/test/cli.test.ts +309 -0
package/test/codebase-chunker.test.ts +446 -0
package/test/codebase.test.ts +678 -0
package/test/collections.test.ts +571 -0
package/test/harvester.test.ts +636 -0
package/test/integration.test.ts +150 -0
package/test/llm.test.ts +322 -0
package/test/search.test.ts +572 -0
package/test/server.test.ts +541 -0
package/test/storage.test.ts +302 -0
package/test/store.test.ts +465 -0
package/test/watcher.test.ts +656 -0
package/test/workspace.test.ts +239 -0
package/tsconfig.json +19 -0
package/vitest.config.ts +16 -0

package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/tasks.md ADDED

@@ -0,0 +1,199 @@
+## Tasks
+- [x] **Task 1**: Add project_hash column and migration to store.ts
+**Spec**: workspace-scoping/spec.md — Database migration for existing documents, Document-level project tagging
+**Files**: `src/store.ts`, `src/types.ts`
+**Steps**:
+1. Add `projectHash?: string` field to the `Document` interface in `types.ts`
+2. In `createStore()`, after the initial `db.exec()` schema creation, add migration logic:
+   - Check if `project_hash` column exists: `PRAGMA table_info(documents)`
+   - If missing: `ALTER TABLE documents ADD COLUMN project_hash TEXT DEFAULT 'global'`
+   - Backfill existing rows: `UPDATE documents SET project_hash = ... WHERE path LIKE 'sessions/%'` — extract hash from path using `substr(path, instr(path, 'sessions/') + 9, 12)` pattern
+3. Add `CREATE INDEX IF NOT EXISTS idx_documents_project_hash ON documents(project_hash, active)` for efficient filtering
+4. Update `insertDocumentStmt` to include `project_hash` column
+5. Update `insertDocument()` method to accept and store `projectHash`
+6. Add `projectHash` to `findDocument()` return mapping
+**Acceptance**: Migration runs on first startup, backfills correctly, subsequent startups skip migration. New documents get project_hash set.
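The backfill expression in step 2 takes the 12 characters that follow `sessions/` in the path. The same extraction can be sketched in TypeScript, for example to cross-check the SQL in a test. `extractProjectHash` is a hypothetical helper name, and the fallback to `'global'` for short or missing segments is an assumption, not something the task list specifies.

```typescript
// Hypothetical helper mirroring the SQL backfill:
//   substr(path, instr(path, 'sessions/') + 9, 12)
const SESSIONS_PREFIX = "sessions/"; // 9 characters, hence the "+ 9" in the SQL
const HASH_LENGTH = 12;

function extractProjectHash(path: string): string {
  const idx = path.indexOf(SESSIONS_PREFIX);
  // Non-session documents keep the column default.
  if (idx === -1) return "global";
  const start = idx + SESSIONS_PREFIX.length;
  const hash = path.slice(start, start + HASH_LENGTH);
  // Assumption: a truncated segment is treated as untagged rather than stored as-is.
  return hash.length === HASH_LENGTH ? hash : "global";
}
```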
+---
+- [x] **Task 2**: Add workspace-filtered search to store.ts
+**Spec**: workspace-scoping/spec.md — Default search scoping, Cross-workspace search opt-in
+**Files**: `src/store.ts`, `src/types.ts`
+**Steps**:
+1. Add `projectHash?: string` parameter to `searchFTS()` and `searchVec()` in the `Store` interface
+2. Create new prepared statements for workspace-filtered FTS search:
+   - `searchFTSWithWorkspaceStmt`: filters `d.project_hash IN (?, 'global')` in addition to existing conditions
+   - `searchFTSWithWorkspaceAndCollectionStmt`: combines both workspace and collection filters
+3. Update `searchFTS()` implementation: when `projectHash` is provided and not `'all'`, use workspace-filtered statement
+4. Update `searchVec()` implementation: add `AND d.project_hash IN (?, 'global')` to the dynamic SQL when `projectHash` is provided and not `'all'`
+5. Update `Store` interface signatures in `types.ts`
+**Acceptance**: `searchFTS('query', 10, undefined, 'abc123')` returns only docs with `project_hash = 'abc123'` or `'global'`. `searchFTS('query', 10, undefined, 'all')` returns all docs. `searchFTS('query', 10)` (no projectHash) returns all docs (backward compatible).
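The three cases in the acceptance criteria (a concrete hash, `'all'`, and no hash) reduce to one small decision. A minimal sketch, assuming a hypothetical `workspaceFilter` helper that returns the extra SQL fragment and its bind parameters; the real statements in `store.ts` may be structured differently.

```typescript
// Hypothetical helper: extra WHERE fragment plus bind parameters
// for a workspace-scoped query.
interface SqlFragment {
  clause: string;
  params: string[];
}

function workspaceFilter(projectHash?: string): SqlFragment {
  // undefined keeps the backward-compatible behaviour; 'all' is the explicit opt-out.
  if (projectHash === undefined || projectHash === "all") {
    return { clause: "", params: [] };
  }
  // Documents tagged 'global' are visible from every workspace.
  return { clause: " AND d.project_hash IN (?, 'global')", params: [projectHash] };
}
```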
+---
+- [x] **Task 3**: Compute currentProjectHash in server.ts and wire to search tools
+**Spec**: workspace-scoping/spec.md — Workspace detection from PWD; mcp-server/spec.md (delta) — Search tools support workspace filtering
+**Files**: `src/server.ts`
+**Steps**:
+1. In `startServer()`, compute `currentProjectHash = crypto.createHash('sha256').update(process.cwd()).digest('hex').substring(0, 12)`
+2. Add `currentProjectHash` to `ServerDeps` interface
+3. In `createMcpServer()`, add `workspace` parameter (optional string) to `memory_search`, `memory_vsearch`, `memory_query` tool schemas
+4. In each search tool handler:
+   - Resolve effective workspace: `workspace === 'all' ? 'all' : (workspace || deps.currentProjectHash)`
+   - Pass resolved workspace to `store.searchFTS()` / `store.searchVec()` / `hybridSearch()`
+5. Update `hybridSearch()` in `search.ts` to accept and pass through `projectHash` parameter
+**Acceptance**: Search tools default to current workspace. `workspace: "all"` searches everything. Tool schemas show `workspace` parameter.
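Steps 1 and 4 can be sketched together as below. The two expressions are taken from the task text; the function names `computeProjectHash` and `resolveWorkspace` are illustrative only.

```typescript
import { createHash } from "node:crypto";

// First 12 hex characters of sha256(cwd), as described in step 1.
function computeProjectHash(cwd: string): string {
  return createHash("sha256").update(cwd).digest("hex").substring(0, 12);
}

// Step 4: 'all' passes through, an explicit workspace wins,
// and a missing or empty value falls back to the current workspace.
function resolveWorkspace(workspace: string | undefined, currentProjectHash: string): string {
  return workspace === "all" ? "all" : (workspace || currentProjectHash);
}
```

In a server this would typically be called once with `process.cwd()` at startup, with the result reused by every tool handler.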
+---
+- [x] **Task 4**: Update memory_status to report workspace and storage info
+**Spec**: mcp-server/spec.md (delta) — memory_status reports storage usage
+**Files**: `src/server.ts`, `src/store.ts`, `src/types.ts`
+**Steps**:
+1. Add `getWorkspaceStats()` method to store: `SELECT project_hash, COUNT(*) as count FROM documents WHERE active = 1 GROUP BY project_hash`
+2. Add workspace stats and storage config to `IndexHealth` interface
+3. Update `formatStatus()` to include per-workspace document counts and storage limit info
+4. In `memory_status` handler, pass storage config to format function
+**Acceptance**: `memory_status` output shows per-workspace breakdown and storage limits.
+---
+- [x] **Task 5**: Add storage config parsing to types.ts and collections.ts
+**Spec**: storage-limits/spec.md — Storage configuration with safe defaults, Human-readable size and duration parsing
+**Files**: `src/types.ts`, `src/collections.ts`
+**Steps**:
+1. Add `StorageConfig` interface to `types.ts`: `{ maxSize: number; retention: number; minFreeDisk: number }`
+2. Add `storage?` field to `CollectionConfig` interface: `{ maxSize?: string; retention?: string; minFreeDisk?: string }`
+3. Create `parseSize(value: string): number` function — parses `500MB`, `2GB`, `1TB` to bytes
+4. Create `parseDuration(value: string): number` function — parses `30d`, `90d`, `1y` to milliseconds
+5. Create `parseStorageConfig(raw?: { maxSize?: string; retention?: string; minFreeDisk?: string }): StorageConfig` — applies defaults and validates
+6. Export from `collections.ts` (or create new `src/storage.ts` if cleaner)
+**Acceptance**: `parseSize('2GB')` returns `2147483648`. `parseDuration('90d')` returns `7776000000`. Invalid values log warning and return defaults.
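A minimal sketch of the two parsers follows, consistent with the acceptance values above (binary units, 24-hour days). The optional `fallback` parameter, the `NaN` default, and treating `1y` as 365 days are assumptions made for illustration; the task only says invalid values log a warning and return defaults.

```typescript
// Sketch only: unit tables and the fallback handling are assumptions.
const SIZE_UNITS: Record<string, number> = {
  B: 1,
  KB: 1024,
  MB: 1024 ** 2,
  GB: 1024 ** 3,
  TB: 1024 ** 4,
};
const DAY_MS = 24 * 60 * 60 * 1000;
const DURATION_UNITS: Record<string, number> = { d: DAY_MS, y: 365 * DAY_MS };

function parseWithUnits(
  value: string,
  pattern: RegExp,
  units: Record<string, number>,
  fallback: number,
): number {
  const match = pattern.exec(value.trim());
  if (!match) {
    console.warn(`Invalid value "${value}", using fallback`);
    return fallback;
  }
  return Math.round(Number(match[1]) * units[match[2]]);
}

// "500MB", "2GB", "1TB" -> bytes
function parseSize(value: string, fallback: number = NaN): number {
  return parseWithUnits(value.toUpperCase(), /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB|TB)$/, SIZE_UNITS, fallback);
}

// "30d", "90d", "1y" -> milliseconds
function parseDuration(value: string, fallback: number = NaN): number {
  return parseWithUnits(value.toLowerCase(), /^(\d+(?:\.\d+)?)\s*(d|y)$/, DURATION_UNITS, fallback);
}
```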
+---
+- [x] **Task 6**: Implement disk safety guard
+**Spec**: storage-limits/spec.md — Disk safety guard
+**Files**: `src/storage.ts` (new) or `src/watcher.ts`
+**Steps**:
+1. Create `checkDiskSpace(minFreeDisk: number): { ok: boolean; freeBytes: number }` function
+2. Use `fs.statfsSync()` (Node 18.15+) on the output directory to get available space
+3. Wrap in try/catch — if `statfsSync` unavailable, log warning and return `{ ok: true, freeBytes: -1 }`
+4. Integrate into watcher's `triggerReindex()`: check disk before harvest/reindex/embed operations
+5. If disk check fails, skip all write operations and log warning
+**Acceptance**: When disk is below `minFreeDisk`, writes are skipped with warning. When `statfsSync` unavailable, operations proceed with warning.
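The guard can be sketched as follows. Note one deviation from the signature in step 1: this version takes the directory as an explicit `dir` argument (an assumption) so that it is self-contained. `fs.statfsSync` reports free blocks available to unprivileged users as `bavail` and the block size as `bsize`.

```typescript
import * as fs from "node:fs";

interface DiskCheck {
  ok: boolean;
  freeBytes: number;
}

// Sketch: `dir` is an added parameter; the task's version takes only minFreeDisk.
function checkDiskSpace(dir: string, minFreeDisk: number): DiskCheck {
  try {
    const stats = fs.statfsSync(dir);
    const freeBytes = stats.bavail * stats.bsize;
    return { ok: freeBytes >= minFreeDisk, freeBytes };
  } catch (err) {
    // Covers both an unavailable statfsSync (older Node) and filesystem errors:
    // fail open so indexing is not blocked by the guard itself.
    console.warn(`Disk space check unavailable for ${dir}: ${String(err)}`);
    return { ok: true, freeBytes: -1 };
  }
}
```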
+---
+- [x] **Task 7**: Implement retention and size-based eviction
+**Spec**: storage-limits/spec.md — Retention-based eviction, Size-based eviction, Original session JSON is never deleted
+**Files**: `src/storage.ts` (new), `src/watcher.ts`, `src/store.ts`
+**Steps**:
+1. Create `evictExpiredSessions(sessionsDir: string, retention: number, store: Store): number` function:
+   - Scan all `sessions/{hash}/*.md` files
+   - Check mtime against `Date.now() - retention`
+   - Delete expired files and remove corresponding documents from store
+   - Return count of evicted files
+2. Create `evictBySize(sessionsDir: string, dbPath: string, maxSize: number, store: Store): number` function:
+   - Calculate total size: `statSync(dbPath).size` + recursive dir size of `sessionsDir`
+   - If over `maxSize`, collect all session files sorted by mtime (oldest first)
+   - Delete oldest files one by one until under limit
+   - Remove corresponding documents from store
+   - Return count of evicted files
+3. Add `deleteDocumentsByPath(pathPattern: string): number` method to store for removing evicted documents
+4. Integrate eviction into watcher's harvest cycle: run after harvest, before reindex
+5. Never touch files outside `sessionsDir` (original JSON in `~/.local/share/opencode/` is safe)
+**Acceptance**: Sessions older than retention are evicted. If still over maxSize, oldest are evicted. Original JSON untouched. Eviction count logged.
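The ordering of the two eviction passes (retention first, then oldest-first until under `maxSize`) can be illustrated with a pure planning function. This is a sketch under assumptions: `planEviction` and `SessionFile` are hypothetical names, and the actual task splits the work into `evictExpiredSessions()` and `evictBySize()`, which also delete files and store rows.

```typescript
interface SessionFile {
  path: string;
  size: number;    // bytes
  mtimeMs: number; // last modification time, epoch milliseconds
}

// Returns the paths that would be evicted, in eviction order.
function planEviction(
  files: SessionFile[],
  dbSize: number,
  retention: number,
  maxSize: number,
  now: number,
): string[] {
  const cutoff = now - retention;
  const evicted: string[] = [];
  const kept: SessionFile[] = [];

  // Pass 1: retention. Anything older than the cutoff goes.
  for (const file of files) {
    if (file.mtimeMs < cutoff) evicted.push(file.path);
    else kept.push(file);
  }

  // Pass 2: size. Evict oldest survivors until the total fits.
  let total = dbSize + kept.reduce((sum, file) => sum + file.size, 0);
  kept.sort((a, b) => a.mtimeMs - b.mtimeMs);
  for (const file of kept) {
    if (total <= maxSize) break;
    evicted.push(file.path);
    total -= file.size;
  }
  return evicted;
}
```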
+---
+- [x] **Task 8**: Implement orphan embedding cleanup
+**Spec**: storage-limits/spec.md — Orphan embedding cleanup
+**Files**: `src/store.ts`, `src/watcher.ts`
+**Steps**:
+1. Add `cleanOrphanedEmbeddings(): number` method to store:
+   - `DELETE FROM content_vectors WHERE hash NOT IN (SELECT hash FROM documents WHERE active = 1)`
+   - If vec table exists: `DELETE FROM vectors_vec WHERE substr(hash_seq, 1, instr(hash_seq, ':') - 1) NOT IN (SELECT hash FROM documents WHERE active = 1)`
+   - Return count of deleted rows
+2. Add cycle counter to watcher — every 10 harvest cycles, call `cleanOrphanedEmbeddings()`
+3. Log cleanup results
+**Acceptance**: Orphaned embeddings are cleaned every 10 cycles. No active document embeddings are deleted.
+---
+- [x] **Task 9**: Wire storage config through server startup
+**Spec**: storage-limits/spec.md — Storage configuration with safe defaults
+**Files**: `src/server.ts`, `src/watcher.ts`
+**Steps**:
+1. In `startServer()`, parse storage config from loaded collection config
+2. Pass `StorageConfig` to watcher via `WatcherOptions`
+3. Watcher uses storage config for disk check, eviction thresholds
+4. Pass storage config to `memory_status` for display
+**Acceptance**: Storage config from `config.yml` is loaded and used. Missing config uses defaults. Status tool shows config values.
+---
+- [x] **Task 10**: Add tests for workspace scoping
+**Spec**: workspace-scoping/spec.md — all requirements
+**Files**: `test/store.test.ts` or new `test/workspace.test.ts`
+**Steps**:
+1. Test migration: create store, verify `project_hash` column exists
+2. Test document tagging: index documents with session paths, verify `project_hash` extracted correctly
+3. Test non-session documents get `project_hash = 'global'`
+4. Test `searchFTS` with workspace filter returns only matching + global docs
+5. Test `searchFTS` with `'all'` returns everything
+6. Test `searchVec` with workspace filter (if vec available)
+7. Test `currentProjectHash` computation matches harvester convention
+**Acceptance**: All new tests pass. Existing 265 tests still pass.
+---
+- [x] **Task 11**: Add tests for storage limits
+**Spec**: storage-limits/spec.md — all requirements
+**Files**: `test/storage.test.ts` (new)
+**Steps**:
+1. Test `parseSize()`: valid sizes, invalid input, edge cases
+2. Test `parseDuration()`: valid durations, invalid input
+3. Test `parseStorageConfig()`: full config, partial config, empty config
+4. Test retention eviction: create temp files with old mtimes, verify eviction
+5. Test size eviction: create files exceeding maxSize, verify oldest evicted first
+6. Test disk safety guard: mock `statfsSync` to test both paths
+7. Test orphan cleanup: create orphaned embeddings, verify cleanup
+**Acceptance**: All new tests pass. Existing tests still pass.
+---
+- [x] **Task 12**: Integration test for workspace-scoped search via MCP tools
+**Spec**: mcp-server/spec.md (delta) — Search tool parameter schema includes workspace
+**Files**: `test/server.test.ts`
+**Steps**:
+1. Add test: index documents with different `project_hash` values
+2. Test `memory_search` without workspace param returns only current workspace + global
+3. Test `memory_search` with `workspace: "all"` returns all
+4. Test `memory_status` includes workspace breakdown
+**Acceptance**: Integration tests verify end-to-end workspace scoping through MCP tool handlers.

package/openspec/changes/codebase-indexing/.openspec.yaml ADDED

@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-02-23

package/openspec/changes/codebase-indexing/design.md ADDED

@@ -0,0 +1,169 @@
+## Context
+nano-brain indexes markdown documents (sessions, MEMORY.md, daily logs) into SQLite with FTS5 + sqlite-vec for hybrid search. The existing pipeline is:
+1. **Collections** define directories + glob patterns to scan (`config.yml`)
+2. **Watcher** monitors collections via chokidar, triggers reindex on changes
+3. **Chunker** splits markdown by headings/paragraphs (target 900 tokens, 15% overlap)
+4. **Store** indexes chunks into FTS5 + content-addressed embeddings
+5. **Search** queries FTS5 (BM25) and/or sqlite-vec (cosine), fuses with RRF
+Current limitations:
+- The chunker only understands markdown structure (headings, code fences, lists)
+- Collections use `pattern: "**/*.md"` — no exclude support
+- The watcher watches collection output dirs, not arbitrary workspace directories
+- No concept of "codebase" as a distinct collection type
+The MCP server runs per-workspace with `PWD` set to the workspace root. `currentProjectHash = sha256(cwd).substring(0, 12)` is already computed at startup.
+## Goals / Non-Goals
+**Goals:**
+- Index source code files from the current workspace into the existing search pipeline
+- Support configurable exclude patterns to skip `node_modules`, `.git`, `dist`, etc.
+- Chunk source code files intelligently — respecting function/class boundaries where feasible, falling back to line-based splitting
+- Reuse the existing watcher infrastructure for incremental updates
+- Tag all codebase documents with `currentProjectHash` for workspace-scoped search
+- Provide sensible defaults so it works with zero config for common project types
+- Keep it opt-in — codebase indexing only happens when configured
+**Non-Goals:**
+- AST-level parsing of every language (too complex, too many dependencies)
+- Indexing binary files, images, or compiled output
+- Real-time indexing on every keystroke (debounced file watcher is sufficient)
+- Replacing LSP or grep — this complements them with semantic search
+- Cross-workspace codebase search (each workspace indexes its own code)
+- Indexing external dependencies (node_modules, vendor, etc.)
+## Decisions
+### D1: Codebase as a special auto-configured collection
+**Decision**: When `codebase: { enabled: true }` is set in config.yml (or auto-detected), create a virtual collection named `codebase` pointing at `process.cwd()` with source code glob patterns and exclude rules. This collection is NOT stored in the `collections:` section — it's a separate top-level config.
+**Why**: Codebase indexing is fundamentally different from document collections — it targets the workspace root (dynamic per-server instance), needs exclude patterns, and uses different chunking. Keeping it separate avoids polluting the general collection config.
+**Config format**:
+```yaml
+codebase:
+  enabled: true
+  exclude:
+    - node_modules
+    - .git
+    - dist
+    - build
+    - .next
+    - __pycache__
+    - "*.min.js"
+    - "*.map"
+  extensions:
+    - .ts
+    - .tsx
+    - .js
+    - .jsx
+    - .py
+    - .go
+    - .rs
+    - .java
+    - .rb
+    - .md
+  maxFileSize: 5MB     # Skip files larger than this
+```
+**Alternative considered**: Add codebase as a regular collection entry. Rejected because the path is dynamic (PWD), exclude patterns don't exist on regular collections, and it would confuse the existing collection management CLI.
+### D2: Exclude patterns via .gitignore + config
+**Decision**: Merge exclude patterns from three sources (in priority order):
+1. **Config `codebase.exclude`** — explicit user overrides
+2. **`.gitignore`** — project-specific ignores (already maintained by developers)
+3. **Built-in defaults** — `node_modules`, `.git`, `dist`, `build`, `__pycache__`, `vendor`, `.next`, `.nuxt`, `target`, `*.min.js`, `*.map`, `*.lock`, `*.sum`
+**Why**: `.gitignore` already captures 90% of what should be excluded. Adding it as a source means zero-config works for most projects. The built-in defaults catch common cases even without a `.gitignore`.
+**Implementation**: Use fast-glob's `ignore` option which already supports gitignore-style patterns. Parse `.gitignore` at startup and merge with config excludes.
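The merge described in D2 amounts to an ordered, de-duplicated union. A minimal sketch, assuming hypothetical helpers `parseGitignore` and `mergeExcludes`; it skips comments, blank lines, and negation (`!`) patterns, which is a simplification and not full gitignore semantics.

```typescript
const BUILTIN_EXCLUDES: string[] = [
  "node_modules", ".git", "dist", "build", "__pycache__", "vendor",
  ".next", ".nuxt", "target", "*.min.js", "*.map", "*.lock", "*.sum",
];

// Simplified .gitignore reader: drops blanks, comments, and negations.
function parseGitignore(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && line.charAt(0) !== "#" && line.charAt(0) !== "!");
}

// Union of config excludes, .gitignore entries, and built-in defaults,
// kept in priority order with duplicates removed.
function mergeExcludes(configExcludes: string[], gitignoreText: string | null): string[] {
  const sources: string[][] = [
    configExcludes,
    gitignoreText === null ? [] : parseGitignore(gitignoreText),
    BUILTIN_EXCLUDES,
  ];
  const merged: string[] = [];
  for (const source of sources) {
    for (const pattern of source) {
      if (merged.indexOf(pattern) === -1) merged.push(pattern);
    }
  }
  return merged;
}
```

Passing `null` for the `.gitignore` text models a workspace without that file, in which case only config excludes and built-in defaults remain.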
+### D3: Source code chunking — line-based with structural hints
+**Decision**: Create a `chunkSourceCode()` function that splits by structural boundaries:
+1. **Primary split**: Blank line sequences (2+ consecutive blank lines = strong break)
+2. **Secondary split**: Single blank lines between top-level constructs
+3. **Structural hints**: Recognize common patterns across languages:
+   - `function`, `def`, `fn`, `func` — function definitions
+   - `class`, `struct`, `interface`, `enum`, `type` — type definitions
+   - `import`, `from`, `require`, `use` — import blocks
+   - `export` — export statements
+4. **Fallback**: If no structural breaks found within target chunk size, split at line boundaries
+5. **Same target size**: 900 tokens (~3600 chars), 15% overlap — matching markdown chunker
+**Why**: Full AST parsing requires language-specific parsers (tree-sitter, etc.) which adds heavy dependencies. Line-based splitting with structural hints gives 80% of the benefit at 10% of the complexity. The patterns above work across TypeScript, Python, Go, Rust, Java, Ruby, and most C-family languages.
+**Alternative considered**: tree-sitter for AST-aware chunking. Rejected for v1 — adds ~50MB of native dependencies, complex build, and only marginally better than structural hints for search purposes. Can be added later as an enhancement.
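To make D3 concrete, here is a heavily simplified sketch of line-based chunking with structural hints. It is not the package's `chunkSourceCode()`: it treats any blank line or top-level keyword line as a candidate boundary (no distinction between primary and secondary splits), measures size in characters (3600 as a stand-in for 900 tokens), and only accepts a boundary once the chunk is at least half full, which is an assumption added to avoid tiny chunks.

```typescript
const STRUCTURAL_LINE =
  /^(export\s+)?(function|def|fn|func|class|struct|interface|enum|type|import|from|require|use)\b/;

interface CodeChunk {
  text: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
}

function chunkSourceCode(source: string, targetChars = 3600, overlapRatio = 0.15): CodeChunk[] {
  const lines = source.split("\n");
  const chunks: CodeChunk[] = [];
  let start = 0;
  while (start < lines.length) {
    let size = 0;
    let end = start;    // exclusive end index of the current chunk
    let lastBreak = -1; // latest acceptable boundary seen in this window
    while (end < lines.length && size + lines[end].length + 1 <= targetChars) {
      const isBoundary = lines[end].trim() === "" || STRUCTURAL_LINE.test(lines[end]);
      if (isBoundary && size >= targetChars / 2) lastBreak = end;
      size += lines[end].length + 1;
      end++;
    }
    if (end === start) {
      end = start + 1; // a single over-long line is emitted on its own
    } else if (end < lines.length && lastBreak !== -1) {
      end = lastBreak; // prefer cutting just before a structural boundary
    }
    chunks.push({ text: lines.slice(start, end).join("\n"), startLine: start + 1, endLine: end });
    if (end >= lines.length) break;

    // Step back so the next chunk repeats roughly overlapRatio of the target size.
    const budget = targetChars * overlapRatio;
    let overlap = 0;
    let next = end;
    while (next - 1 > start && overlap + lines[next - 1].length + 1 <= budget) {
      overlap += lines[next - 1].length + 1;
      next--;
    }
    start = next;
  }
  return chunks;
}
```

Because the next chunk always starts strictly after the previous start, the loop terminates even for inputs with no blank lines or keywords, which is the fallback case in step 4.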
+### D4: File-level metadata in document title
+**Decision**: Store the relative file path as the document `title` and include a metadata header in the indexed content:
+```
+File: src/auth/login.ts
+Language: typescript
+Lines: 1-45
+[actual file content]
+```
+**Why**: The metadata header helps the embedding model understand what it's looking at. The relative path as title makes search results immediately actionable (agent knows which file to open).
+### D5: Incremental indexing via content hash
+**Decision**: Reuse the existing content-addressed storage. On each scan:
+1. Compute `sha256(fileContent)` for each source file
+2. Compare with stored hash in `documents` table
+3. Skip unchanged files (hash match)
+4. Re-index changed files (hash mismatch)
+5. Deactivate deleted files (in DB but not on disk)
+**Why**: The store already does content-addressed dedup via `computeHash()`. This is the same pattern used for markdown documents. No new infrastructure needed.
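The five scan steps in D5 can be expressed as a pure planning function over two maps. This is an illustrative sketch: `planIncrementalIndex` and `IndexPlan` are hypothetical names, and the real store compares against rows in the `documents` table rather than an in-memory object.

```typescript
import { createHash } from "node:crypto";

interface IndexPlan {
  unchanged: string[];    // hash match: skip
  toIndex: string[];      // new or changed: (re-)index
  toDeactivate: string[]; // in the DB but no longer on disk
}

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// onDisk: path -> current file content; storedHashes: path -> hash from the last run.
function planIncrementalIndex(
  onDisk: Record<string, string>,
  storedHashes: Record<string, string>,
): IndexPlan {
  const plan: IndexPlan = { unchanged: [], toIndex: [], toDeactivate: [] };
  for (const path of Object.keys(onDisk)) {
    if (storedHashes[path] === sha256(onDisk[path])) plan.unchanged.push(path);
    else plan.toIndex.push(path);
  }
  for (const path of Object.keys(storedHashes)) {
    if (!(path in onDisk)) plan.toDeactivate.push(path);
  }
  return plan;
}
```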
+### D6: Watcher integration — reuse existing chokidar setup
+**Decision**: Add the workspace directory as an additional watch target in the existing watcher, with the exclude patterns applied as chokidar `ignored` options. Source file changes trigger the same dirty-flag → debounce → reindex cycle.
+**Why**: The watcher already handles debouncing, dirty flags, and reindex scheduling. Adding another watch target is simpler than creating a separate watcher.
+### D7: Auto-detection of project type for default extensions
+**Decision**: At startup, detect project type by checking for marker files:
+- `package.json` → Node.js: `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.css`, `.html`
+- `pyproject.toml` / `setup.py` / `requirements.txt` → Python: `.py`, `.pyi`
+- `go.mod` → Go: `.go`
+- `Cargo.toml` → Rust: `.rs`
+- `pom.xml` / `build.gradle` → Java/Kotlin: `.java`, `.kt`
+- `Gemfile` → Ruby: `.rb`
+- Fallback: all common extensions
+Always include `.md` files from the workspace (README, docs).
+**Why**: Avoids indexing irrelevant file types. A Python project doesn't need `.ts` files indexed. Auto-detection means zero-config for common setups.
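The marker-file table in D7 maps directly to a lookup. A minimal sketch, assuming a hypothetical `detectExtensions` that receives the file names found in the workspace root. Two behaviours here are assumptions the design does not spell out: multiple matching project types are unioned, and "all common extensions" is taken to be the union of every rule.

```typescript
interface ProjectRule {
  markers: string[];
  extensions: string[];
}

const PROJECT_RULES: ProjectRule[] = [
  { markers: ["package.json"], extensions: [".ts", ".tsx", ".js", ".jsx", ".json", ".css", ".html"] },
  { markers: ["pyproject.toml", "setup.py", "requirements.txt"], extensions: [".py", ".pyi"] },
  { markers: ["go.mod"], extensions: [".go"] },
  { markers: ["Cargo.toml"], extensions: [".rs"] },
  { markers: ["pom.xml", "build.gradle"], extensions: [".java", ".kt"] },
  { markers: ["Gemfile"], extensions: [".rb"] },
];

function addUnique(target: string[], items: string[]): void {
  for (const item of items) {
    if (target.indexOf(item) === -1) target.push(item);
  }
}

function detectExtensions(rootFiles: string[]): string[] {
  const result: string[] = [];
  for (const rule of PROJECT_RULES) {
    const matched = rule.markers.some((marker) => rootFiles.indexOf(marker) !== -1);
    if (matched) addUnique(result, rule.extensions);
  }
  // Fallback when no marker file is present.
  if (result.length === 0) {
    for (const rule of PROJECT_RULES) addUnique(result, rule.extensions);
  }
  // Markdown is always indexed, regardless of project type.
  addUnique(result, [".md"]);
  return result;
}
```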
+### D8: Max file size guard
+**Decision**: Skip files larger than `maxFileSize` (default 5MB). Log a warning for skipped files.
+**Why**: Large generated files (bundles, minified code, data files) waste storage and produce poor search results. 5MB covers virtually all hand-written source files while still guarding against huge generated artifacts.
+## Risks / Trade-offs
+**[Risk] Large codebases produce many chunks** → Storage limits from v0.2.0 (maxSize, retention) apply. Codebase chunks count toward the total. For very large monorepos, users may need to increase maxSize or narrow extensions.
+**[Risk] Stale index after branch switch** → `git checkout` changes many files at once. The watcher will detect changes and reindex, but there's a window where the index is stale. Acceptable — the reindex debounce (2s) is fast enough.
+**[Risk] Embedding all source code is expensive** → Embedding 1000 files × 5 chunks each = 5000 embeddings. At ~50ms per embedding, that's ~4 minutes for initial index. Subsequent updates are incremental (only changed files). This is acceptable for a background process.
+**[Risk] Structural hints miss language-specific constructs** → The line-based chunker won't perfectly split every language. Some chunks may cut mid-function. This is acceptable for search — the overlap ensures context is preserved, and semantic search is tolerant of imperfect boundaries.
+**[Risk] .gitignore parsing edge cases** → Complex .gitignore patterns (negation, nested) may not parse perfectly with fast-glob. Mitigation: the built-in defaults catch the most important exclusions regardless.
+## Open Questions
+- Should codebase indexing be enabled by default (auto-detect) or require explicit `codebase: { enabled: true }`? Leaning toward explicit opt-in for v1 to avoid surprising users with increased storage usage.
+- Should there be a `memory_index_codebase` MCP tool for on-demand full reindex, or is the watcher sufficient? Leaning toward adding the tool for the initial index trigger.

package/openspec/changes/codebase-indexing/proposal.md ADDED

@@ -0,0 +1,30 @@
+## Why
+nano-brain currently indexes only session transcripts, MEMORY.md, and daily logs — it has zero knowledge of the actual source code. When an agent queries `memory_query query="how does authentication work"`, it finds past *conversations* about auth but not the actual implementation. This forces agents to rely on grep (exact keywords only) and LSP (requires knowing symbol names upfront) for code discovery, which fails for semantic/conceptual queries across large codebases.
+Indexing source code enables semantic search over the codebase — finding related code by meaning, not just keywords. This is the single highest-impact improvement for agent productivity in unfamiliar or large projects.
+## What Changes
+- Add support for a `codebase` collection type that indexes source code files from the current workspace
+- Add `exclude` patterns to collection config (e.g., `node_modules`, `.git`, `dist`, `build`) to skip irrelevant directories
+- Add language-aware chunking for source code files (not just markdown) — respecting function/class boundaries where possible
+- Integrate with the existing file watcher for incremental re-indexing on file changes
+- Tag codebase documents with the current workspace's `projectHash` for workspace-scoped search
+- Auto-detect common exclude patterns based on project type (Node.js, Python, Go, etc.)
+## Capabilities
+### New Capabilities
+- `codebase-collection`: Collection type for indexing source code with exclude patterns, language-aware chunking, and workspace tagging
+### Modified Capabilities
+- `mcp-server`: Add `memory_index_codebase` tool for on-demand codebase indexing, update `memory_status` to show codebase collection stats
+## Impact
+- **Config format**: New optional `exclude` field on collection config, new optional `codebase` section in config.yml
+- **Chunker**: Needs to handle non-markdown files (`.ts`, `.py`, `.go`, `.rs`, `.java`, etc.) with language-aware splitting
+- **Storage**: Codebase indexing will significantly increase document/chunk count — storage limits from v0.2.0 apply
+- **Watcher**: Must watch workspace directory with exclude patterns, not just the output directory
+- **Dependencies**: No new dependencies expected — fast-glob already supports ignore patterns

package/openspec/changes/codebase-indexing/specs/codebase-collection/spec.md ADDED

@@ -0,0 +1,187 @@
+## ADDED Requirements
+### Requirement: Codebase configuration format
+The system SHALL support a top-level `codebase` section in `config.yml` with the following fields: `enabled` (boolean), `exclude` (string array), `extensions` (string array), `maxFileSize` (string, human-readable size), and `maxSize` (string, human-readable size for storage budget). All fields except `enabled` SHALL be optional with sensible defaults. When `codebase.enabled` is `false` or the section is absent, no codebase indexing SHALL occur.
+#### Scenario: Full codebase config provided
+- **WHEN** config.yml contains `codebase: { enabled: true, exclude: ["node_modules", ".git"], extensions: [".ts", ".py"], maxFileSize: "50KB" }`
+- **THEN** the system indexes only `.ts` and `.py` files, skipping `node_modules` and `.git` directories, and skipping files larger than 50KB
+#### Scenario: Minimal codebase config (enabled only)
+- **WHEN** config.yml contains `codebase: { enabled: true }` with no other fields
+- **THEN** the system uses auto-detected extensions (from project type), built-in exclude defaults, and 5MB maxFileSize
+#### Scenario: Codebase section absent
+- **WHEN** config.yml has no `codebase` section
+- **THEN** no codebase indexing occurs
+- **THEN** no codebase-related file watching occurs
+#### Scenario: Codebase explicitly disabled
+- **WHEN** config.yml contains `codebase: { enabled: false }`
+- **THEN** no codebase indexing occurs even if other codebase fields are present
+### Requirement: Exclude pattern merging from three sources
+The system SHALL merge exclude patterns from three sources in priority order: (1) `codebase.exclude` from config, (2) `.gitignore` file in the workspace root, and (3) built-in defaults. The built-in defaults SHALL include at minimum: `node_modules`, `.git`, `dist`, `build`, `__pycache__`, `vendor`, `.next`, `.nuxt`, `target`, `*.min.js`, `*.map`, `*.lock`, `*.sum`. All three sources SHALL be combined (union) into a single exclude list passed to the file scanner.
+#### Scenario: All three sources present
+- **WHEN** config has `exclude: ["custom-dir"]`, `.gitignore` contains `coverage/`, and built-in defaults include `node_modules`
+- **THEN** the effective exclude list includes `custom-dir`, `coverage/`, `node_modules`, and all other built-in defaults
+#### Scenario: No .gitignore file
+- **WHEN** the workspace root has no `.gitignore` file
+- **THEN** only config excludes and built-in defaults are used
+- **THEN** no error is thrown
+#### Scenario: No config excludes
+- **WHEN** `codebase.exclude` is not set in config
+- **THEN** `.gitignore` patterns and built-in defaults are still applied
+### Requirement: Source code chunking with structural hints
+The system SHALL chunk source code files using line-based splitting with structural boundary detection. Primary split points SHALL be blank line sequences (2+ consecutive blank lines). Secondary split points SHALL be single blank lines between top-level constructs. Structural hints SHALL recognize common cross-language patterns: function definitions (`function`, `def`, `fn`, `func`), type definitions (`class`, `struct`, `interface`, `enum`, `type`), import blocks (`import`, `from`, `require`, `use`), and export statements (`export`). The target chunk size SHALL be 900 tokens (~3600 characters) with 15% overlap, matching the existing markdown chunker.
+#### Scenario: TypeScript file with functions
+- **WHEN** a TypeScript file contains three functions separated by blank lines, each ~300 tokens
+- **THEN** the chunker produces one chunk containing all three functions (under 900 token target)
+#### Scenario: Large Python file exceeding chunk target
+- **WHEN** a Python file contains a class with 2000 tokens
+- **THEN** the chunker splits at structural boundaries (method definitions) within the class
+- **THEN** each resulting chunk is approximately 900 tokens with 15% overlap
+#### Scenario: File with no structural boundaries
+- **WHEN** a file contains continuous text with no blank lines or structural keywords
+- **THEN** the chunker falls back to splitting at line boundaries near the 900 token target
+#### Scenario: Overlap between chunks
+- **WHEN** a source file is split into multiple chunks
+- **THEN** adjacent chunks share approximately 15% overlapping content at their boundaries
+### Requirement: File metadata header in indexed content
+Each indexed chunk SHALL include a metadata header prepended to the content: `File: <relative-path>`, `Language: <language>`, `Lines: <start>-<end>`. The relative path SHALL be computed from the workspace root. The language SHALL be inferred from the file extension.
+#### Scenario: TypeScript file chunk
+- **WHEN** a chunk is created from lines 10-45 of `src/auth/login.ts`
+- **THEN** the indexed content starts with `File: src/auth/login.ts\nLanguage: typescript\nLines: 10-45\n\n`
+#### Scenario: Python file chunk
+- **WHEN** a chunk is created from lines 1-30 of `utils/helpers.py`
+- **THEN** the indexed content starts with `File: utils/helpers.py\nLanguage: python\nLines: 1-30\n\n`
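
Reviewer sketch of the header format described above. The helper name `buildChunkHeader` and the extension-to-language table are assumptions for illustration; only the `.ts` and `.py` mappings come from the scenarios in the spec.

```typescript
// Hypothetical extension-to-language table; only these two entries are confirmed by the spec.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ".ts": "typescript",
  ".py": "python",
};

// Builds the metadata header that is prepended to a chunk's content.
// relativePath is assumed to be already relative to the workspace root.
function buildChunkHeader(relativePath: string, startLine: number, endLine: number): string {
  const dot = relativePath.lastIndexOf(".");
  const ext = dot >= 0 ? relativePath.slice(dot) : "";
  const language = LANGUAGE_BY_EXTENSION[ext] ?? "unknown"; // fallback label is an assumption
  return `File: ${relativePath}\nLanguage: ${language}\nLines: ${startLine}-${endLine}\n\n`;
}
```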
+### Requirement: Incremental indexing via content hash
+The system SHALL use content-addressed hashing to avoid re-indexing unchanged files. On each scan, the system SHALL compute `sha256(fileContent)` for each source file, compare with the stored hash in the `documents` table, skip files with matching hashes, re-index files with mismatched hashes, and deactivate documents for files that no longer exist on disk.
+#### Scenario: Unchanged file
+- **WHEN** a source file has the same content as the last index run
+- **THEN** the file is skipped (no re-chunking, no re-embedding)
+#### Scenario: Modified file
+- **WHEN** a source file has different content than the last index run
+- **THEN** the old document is replaced with newly chunked and embedded content
+#### Scenario: Deleted file
+- **WHEN** a previously indexed source file no longer exists on disk
+- **THEN** the corresponding document and its chunks/embeddings are deactivated or removed
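
Reviewer sketch of the skip / re-index / deactivate decision described above. It assumes the `sha256(fileContent)` digests have already been computed and loaded into plain path-to-hash maps; `planIncrementalIndex` and `IndexPlan` are hypothetical names, not identifiers from the package.

```typescript
interface IndexPlan {
  skip: string[]; // unchanged files: no re-chunking, no re-embedding
  reindex: string[]; // new or modified files
  deactivate: string[]; // files that were indexed before but are gone from disk
}

// storedHashes: path -> hash from the documents table (assumed shape)
// currentHashes: path -> hash computed during this scan
function planIncrementalIndex(
  storedHashes: Record<string, string>,
  currentHashes: Record<string, string>,
): IndexPlan {
  const has = (obj: Record<string, string>, key: string) =>
    Object.prototype.hasOwnProperty.call(obj, key);
  const plan: IndexPlan = { skip: [], reindex: [], deactivate: [] };
  for (const path of Object.keys(currentHashes)) {
    if (has(storedHashes, path) && storedHashes[path] === currentHashes[path]) {
      plan.skip.push(path);
    } else {
      plan.reindex.push(path);
    }
  }
  for (const path of Object.keys(storedHashes)) {
    if (!has(currentHashes, path)) plan.deactivate.push(path);
  }
  return plan;
}
```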
+### Requirement: Project type auto-detection for default extensions
+When `codebase.extensions` is not configured, the system SHALL auto-detect the project type by checking for marker files in the workspace root and select appropriate default extensions. Detection rules SHALL include: `package.json` maps to `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.css`, `.html`; `pyproject.toml`, `setup.py`, or `requirements.txt` maps to `.py`, `.pyi`; `go.mod` maps to `.go`; `Cargo.toml` maps to `.rs`; `pom.xml` or `build.gradle` maps to `.java`, `.kt`; `Gemfile` maps to `.rb`. If no marker files are found, all common extensions SHALL be used as fallback. `.md` files SHALL always be included regardless of project type.
+#### Scenario: Node.js project detected
+- **WHEN** the workspace root contains `package.json` and `codebase.extensions` is not configured
+- **THEN** the system indexes files with extensions: `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.css`, `.html`, `.md`
+#### Scenario: Python project detected
+- **WHEN** the workspace root contains `pyproject.toml` and `codebase.extensions` is not configured
+- **THEN** the system indexes files with extensions: `.py`, `.pyi`, `.md`
+#### Scenario: Multiple marker files present
+- **WHEN** the workspace root contains both `package.json` and `pyproject.toml`
+- **THEN** the system merges extensions from both project types
+#### Scenario: No marker files found
+- **WHEN** the workspace root contains no recognized marker files
+- **THEN** the system uses all common extensions as fallback (`.ts`, `.tsx`, `.js`, `.jsx`, `.py`, `.go`, `.rs`, `.java`, `.kt`, `.rb`, `.md`)
+#### Scenario: Explicit extensions override auto-detection
+- **WHEN** `codebase.extensions` is set to `[".go", ".proto"]`
+- **THEN** only `.go` and `.proto` files are indexed, regardless of detected project type
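
Reviewer sketch of the detection rules above. The marker table and fallback list are copied from the requirement text; the function name `resolveExtensions` and its signature are assumptions. Following the explicit-override scenario, a configured list is returned as-is without adding `.md`.

```typescript
const MARKER_EXTENSIONS: Record<string, string[]> = {
  "package.json": [".ts", ".tsx", ".js", ".jsx", ".json", ".css", ".html"],
  "pyproject.toml": [".py", ".pyi"],
  "setup.py": [".py", ".pyi"],
  "requirements.txt": [".py", ".pyi"],
  "go.mod": [".go"],
  "Cargo.toml": [".rs"],
  "pom.xml": [".java", ".kt"],
  "build.gradle": [".java", ".kt"],
  "Gemfile": [".rb"],
};

const FALLBACK_EXTENSIONS = [
  ".ts", ".tsx", ".js", ".jsx", ".py", ".go", ".rs", ".java", ".kt", ".rb", ".md",
];

// rootFiles: file names found directly in the workspace root.
// configured: the value of codebase.extensions, if set.
function resolveExtensions(rootFiles: string[], configured?: string[]): string[] {
  if (configured && configured.length > 0) return configured.slice();
  const result: string[] = [];
  const add = (ext: string) => {
    if (result.indexOf(ext) === -1) result.push(ext);
  };
  for (const file of rootFiles) {
    if (Object.prototype.hasOwnProperty.call(MARKER_EXTENSIONS, file)) {
      MARKER_EXTENSIONS[file].forEach((ext) => add(ext));
    }
  }
  if (result.length === 0) return FALLBACK_EXTENSIONS.slice();
  add(".md"); // markdown is always included for detected project types
  return result;
}
```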
+### Requirement: Codebase storage budget enforcement
+The system SHALL support a `codebase.maxSize` field (default 2GB) that limits the total storage used by codebase-indexed content. During indexing, the system SHALL track cumulative storage and skip remaining files when the budget would be exceeded. The `memory_status` tool SHALL report current codebase storage usage versus the configured limit. The codebase storage budget SHALL be independent from the session storage budget (`storage.maxSize`).
+#### Scenario: Indexing within budget
+- **WHEN** codebase storage is at 500MB and `codebase.maxSize` is 2GB
+- **THEN** files continue to be indexed normally
+#### Scenario: Indexing exceeds budget
+- **WHEN** codebase storage is at 1.9GB and the next file would push it over 2GB
+- **THEN** the file is skipped
+- **THEN** remaining files are skipped
+- **THEN** the index result reports the number of files skipped due to budget
+#### Scenario: Default budget
+- **WHEN** `codebase.maxSize` is not configured
+- **THEN** the default budget of 2GB is used
+#### Scenario: Custom budget
+- **WHEN** `codebase.maxSize` is set to `"500MB"`
+- **THEN** indexing stops when codebase storage reaches 500MB
+#### Scenario: Budget independent from session storage
+- **WHEN** session `storage.maxSize` is 2GB and `codebase.maxSize` is 2GB
+- **THEN** the system can use up to 4GB total (2GB sessions + 2GB codebase)
+- **THEN** codebase budget enforcement does not trigger session eviction
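
Reviewer sketch of the budget rule above. Two assumptions are made that the spec does not state: size strings use binary units (1MB = 1024 * 1024 bytes), and each file's on-disk size stands in for the storage it would consume once indexed. `parseSize` and `applyBudget` are hypothetical names.

```typescript
// Parses strings such as "500MB" or "2GB" into bytes (binary units assumed).
function parseSize(value: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)$/i.exec(value.trim());
  if (!match) throw new Error(`Unrecognized size: ${value}`);
  const multipliers: Record<string, number> = {
    B: 1,
    KB: 1024,
    MB: 1024 * 1024,
    GB: 1024 * 1024 * 1024,
  };
  return Math.round(parseFloat(match[1]) * multipliers[match[2].toUpperCase()]);
}

interface FileEntry {
  path: string;
  size: number; // bytes
}

// Indexes files in order until the next one would exceed the budget,
// then skips that file and everything after it.
function applyBudget(files: FileEntry[], usedBytes: number, maxSize: string = "2GB") {
  const limit = parseSize(maxSize);
  const indexed: string[] = [];
  let used = usedBytes;
  let skippedForBudget = 0;
  for (let i = 0; i < files.length; i++) {
    if (used + files[i].size > limit) {
      skippedForBudget = files.length - i;
      break;
    }
    used += files[i].size;
    indexed.push(files[i].path);
  }
  return { indexed, skippedForBudget, usedBytes: used };
}
```

Because the function only looks at the codebase counter it is given, it cannot affect session storage, which matches the independence scenario.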
+### Requirement: Max file size guard
+The system SHALL skip source files larger than `codebase.maxFileSize` (default 5MB). A debug-level log message SHALL be emitted for each skipped file indicating the file path and its size.
+#### Scenario: File under size limit
+- **WHEN** a source file is 3MB and `maxFileSize` is 5MB
+- **THEN** the file is indexed normally
+#### Scenario: File exceeding size limit
+- **WHEN** a source file is 8MB and `maxFileSize` is 5MB
+- **THEN** the file is skipped
+- **THEN** a debug log is emitted: file path and size
+#### Scenario: Custom maxFileSize
+- **WHEN** `codebase.maxFileSize` is set to `"50KB"`
+- **THEN** files larger than 50KB are skipped
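
Reviewer sketch of the file size guard above. It assumes the limit has already been converted to bytes and that 5MB means 5 * 1024 * 1024; the name `partitionBySize` and the log message wording are illustrative, not taken from the package.

```typescript
const DEFAULT_MAX_FILE_SIZE = 5 * 1024 * 1024; // 5MB, binary units assumed

interface SourceFile {
  path: string;
  size: number; // bytes
}

// Splits files into those to index and those skipped as too large,
// emitting one debug message per skipped file.
function partitionBySize(
  files: SourceFile[],
  maxFileSize: number = DEFAULT_MAX_FILE_SIZE,
  debug: (message: string) => void = (message) => console.debug(message),
) {
  const accepted: SourceFile[] = [];
  const tooLarge: SourceFile[] = [];
  for (const file of files) {
    if (file.size > maxFileSize) {
      tooLarge.push(file);
      debug(`Skipping ${file.path}: ${file.size} bytes exceeds limit of ${maxFileSize} bytes`);
    } else {
      accepted.push(file);
    }
  }
  return { accepted, tooLarge };
}
```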
+### Requirement: Watcher integration for codebase files
+When codebase indexing is enabled, the file watcher SHALL add the workspace directory as an additional watch target with the configured exclude patterns applied as ignored paths. Source file changes SHALL trigger the same dirty-flag and debounced reindex cycle used for collection files. The watcher SHALL only watch files matching the configured or auto-detected extensions.
+#### Scenario: Source file modified
+- **WHEN** a watched source file is saved with new content
+- **THEN** the watcher detects the change
+- **THEN** the file is re-indexed after the debounce period
+#### Scenario: New source file created
+- **WHEN** a new `.ts` file is created in the workspace (with codebase enabled for a Node.js project)
+- **THEN** the watcher detects the new file
+- **THEN** the file is indexed after the debounce period
+#### Scenario: Excluded directory not watched
+- **WHEN** a file changes inside `node_modules/`
+- **THEN** the watcher does not detect the change
+- **THEN** no reindex is triggered
+### Requirement: Codebase documents tagged with workspace project hash
+All documents indexed from codebase files SHALL be tagged with the current workspace's `projectHash` in the `project_hash` column. This ensures codebase search results are scoped to the current workspace by default.
+#### Scenario: Codebase document indexed
+- **WHEN** a source file is indexed with `currentProjectHash = "abc123def456"`
+- **THEN** the resulting document has `project_hash = "abc123def456"`
+#### Scenario: Codebase search scoped to workspace
+- **WHEN** `memory_search` is called with default workspace scoping
+- **THEN** codebase documents from the current workspace are included in results
+- **THEN** codebase documents from other workspaces are excluded
+### Requirement: Codebase collection identified in search results
+Documents from the codebase collection SHALL be identifiable in search results via their collection name `"codebase"`. This allows agents to distinguish between session-based memory and source code results.
+#### Scenario: Search returns codebase and session results
+- **WHEN** a search query matches both a session document and a codebase document
+- **THEN** the codebase result has `collection: "codebase"` in its metadata
+- **THEN** the session result has a different collection identifier

package/openspec/changes/codebase-indexing/specs/mcp-server/spec.md ADDED

@@ -0,0 +1,36 @@
+## ADDED Requirements
+### Requirement: memory_index_codebase tool for on-demand indexing
+The MCP server SHALL register a `memory_index_codebase` tool that triggers a full codebase scan and index. The tool SHALL accept no required parameters. It SHALL return a summary including: number of files scanned, number of files indexed (new or updated), number of files skipped (unchanged), number of files skipped (too large), and total chunks created. If codebase indexing is not enabled in config, the tool SHALL return an error message indicating that codebase indexing is disabled.
+#### Scenario: Successful codebase index
+- **WHEN** `memory_index_codebase` is called with codebase indexing enabled and source files exist
+- **THEN** the response includes counts for files scanned, indexed, skipped (unchanged), skipped (too large), and chunks created
+- **THEN** all matching source files are indexed into the store with the current workspace's projectHash
+#### Scenario: Codebase indexing disabled
+- **WHEN** `memory_index_codebase` is called but `codebase.enabled` is not set or is `false`
+- **THEN** the response contains an error message: "Codebase indexing is not enabled. Set codebase.enabled: true in config.yml"
+#### Scenario: No matching files found
+- **WHEN** `memory_index_codebase` is called but no files match the configured extensions and exclude patterns
+- **THEN** the response indicates 0 files scanned and 0 files indexed
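
Reviewer sketch of the tool behaviour described above, written as a plain function rather than against any MCP SDK. The field names in `IndexSummary`, the `ToolResult` shape, and `handleIndexCodebase` are assumptions; only the error message text is taken from the spec.

```typescript
interface IndexSummary {
  filesScanned: number;
  filesIndexed: number; // new or updated
  filesSkippedUnchanged: number;
  filesSkippedTooLarge: number;
  chunksCreated: number;
}

type ToolResult =
  | { ok: true; summary: IndexSummary }
  | { ok: false; error: string };

interface Config {
  codebase?: { enabled?: boolean };
}

// runIndex stands in for whatever performs the actual scan; it is injected
// here so the enabled/disabled branching can be shown in isolation.
function handleIndexCodebase(config: Config, runIndex: () => IndexSummary): ToolResult {
  if (!config.codebase || config.codebase.enabled !== true) {
    return {
      ok: false,
      error: "Codebase indexing is not enabled. Set codebase.enabled: true in config.yml",
    };
  }
  return { ok: true, summary: runIndex() };
}
```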
+### Requirement: memory_status includes codebase statistics
+The `memory_status` tool response SHALL include a `codebase` section when codebase indexing is enabled. This section SHALL report: whether codebase indexing is enabled, number of indexed codebase documents, number of codebase chunks, configured extensions (resolved after auto-detection), and configured exclude pattern count.
+#### Scenario: memory_status with codebase enabled
+- **WHEN** `memory_status` is called and codebase indexing is enabled with indexed files
+- **THEN** the response includes a `codebase` section with `enabled: true`, document count, chunk count, resolved extensions list, and exclude pattern count
+#### Scenario: memory_status with codebase disabled
+- **WHEN** `memory_status` is called and codebase indexing is not enabled
+- **THEN** the response includes `codebase: { enabled: false }` or omits the codebase section entirely
+### Requirement: memory_index_codebase tool schema registration
+The MCP tool registration for `memory_index_codebase` SHALL include a description explaining its purpose: triggering a full scan and index of source code files in the current workspace. The input schema SHALL have no required parameters.
+#### Scenario: Tool schema advertised to MCP client
+- **WHEN** an MCP client lists available tools
+- **THEN** `memory_index_codebase` appears with a description mentioning codebase/source code indexing
+- **THEN** the input schema has no required parameters