nano-brain 2026.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (79)
  1. package/AGENTS_SNIPPET.md +36 -0
  2. package/CHANGELOG.md +68 -0
  3. package/README.md +281 -0
  4. package/SKILL.md +153 -0
  5. package/bin/cli.js +18 -0
  6. package/index.html +929 -0
  7. package/nano-brain +4 -0
  8. package/opencode-mcp.json +9 -0
  9. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/.openspec.yaml +2 -0
  10. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/design.md +68 -0
  11. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/proposal.md +27 -0
  12. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/specs/mcp-integration-testing/spec.md +50 -0
  13. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/specs/mcp-server/spec.md +40 -0
  14. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/specs/search-pipeline/spec.md +29 -0
  15. package/openspec/changes/archive/2026-02-16-fix-mcp-server-bugs/tasks.md +37 -0
  16. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/.openspec.yaml +2 -0
  17. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/design.md +111 -0
  18. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/proposal.md +30 -0
  19. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/specs/mcp-server/spec.md +33 -0
  20. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/specs/storage-limits/spec.md +90 -0
  21. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/specs/workspace-scoping/spec.md +66 -0
  22. package/openspec/changes/archive/2026-02-23-workspace-scoped-memory-and-storage-limits/tasks.md +199 -0
  23. package/openspec/changes/codebase-indexing/.openspec.yaml +2 -0
  24. package/openspec/changes/codebase-indexing/design.md +169 -0
  25. package/openspec/changes/codebase-indexing/proposal.md +30 -0
  26. package/openspec/changes/codebase-indexing/specs/codebase-collection/spec.md +187 -0
  27. package/openspec/changes/codebase-indexing/specs/mcp-server/spec.md +36 -0
  28. package/openspec/changes/codebase-indexing/tasks.md +56 -0
  29. package/openspec/specs/mcp-integration-testing/spec.md +50 -0
  30. package/openspec/specs/mcp-server/spec.md +75 -0
  31. package/openspec/specs/search-pipeline/spec.md +29 -0
  32. package/openspec/specs/storage-limits/spec.md +94 -0
  33. package/openspec/specs/workspace-scoping/spec.md +70 -0
  34. package/package.json +34 -0
  35. package/site/build.js +66 -0
  36. package/site/partials/_api.html +83 -0
  37. package/site/partials/_compare.html +100 -0
  38. package/site/partials/_config.html +23 -0
  39. package/site/partials/_features.html +43 -0
  40. package/site/partials/_footer.html +6 -0
  41. package/site/partials/_hero.html +9 -0
  42. package/site/partials/_how-it-works.html +26 -0
  43. package/site/partials/_models.html +18 -0
  44. package/site/partials/_quick-start.html +15 -0
  45. package/site/partials/_stats.html +1 -0
  46. package/site/partials/_tech-stack.html +13 -0
  47. package/site/script.js +12 -0
  48. package/site/shell.html +44 -0
  49. package/site/styles.css +548 -0
  50. package/src/chunker.ts +427 -0
  51. package/src/codebase.ts +331 -0
  52. package/src/collections.ts +192 -0
  53. package/src/embeddings.ts +293 -0
  54. package/src/expansion.ts +79 -0
  55. package/src/harvester.ts +306 -0
  56. package/src/index.ts +503 -0
  57. package/src/reranker.ts +103 -0
  58. package/src/search.ts +294 -0
  59. package/src/server.ts +664 -0
  60. package/src/storage.ts +221 -0
  61. package/src/store.ts +623 -0
  62. package/src/types.ts +202 -0
  63. package/src/watcher.ts +384 -0
  64. package/test/chunker.test.ts +479 -0
  65. package/test/cli.test.ts +309 -0
  66. package/test/codebase-chunker.test.ts +446 -0
  67. package/test/codebase.test.ts +678 -0
  68. package/test/collections.test.ts +571 -0
  69. package/test/harvester.test.ts +636 -0
  70. package/test/integration.test.ts +150 -0
  71. package/test/llm.test.ts +322 -0
  72. package/test/search.test.ts +572 -0
  73. package/test/server.test.ts +541 -0
  74. package/test/storage.test.ts +302 -0
  75. package/test/store.test.ts +465 -0
  76. package/test/watcher.test.ts +656 -0
  77. package/test/workspace.test.ts +239 -0
  78. package/tsconfig.json +19 -0
  79. package/vitest.config.ts +16 -0
@@ -0,0 +1,199 @@
1
+ ## Tasks
2
+
3
+ - [x] **Task 1**: Add project_hash column and migration to store.ts
4
+ **Spec**: workspace-scoping/spec.md — Database migration for existing documents, Document-level project tagging
5
+ **Files**: `src/store.ts`, `src/types.ts`
6
+
7
+ **Steps**:
8
+ 1. Add `projectHash?: string` field to the `Document` interface in `types.ts`
9
+ 2. In `createStore()`, after the initial `db.exec()` schema creation, add migration logic:
10
+ - Check if `project_hash` column exists: `PRAGMA table_info(documents)`
11
+ - If missing: `ALTER TABLE documents ADD COLUMN project_hash TEXT DEFAULT 'global'`
12
+ - Backfill existing rows: `UPDATE documents SET project_hash = ... WHERE path LIKE 'sessions/%'` — extract hash from path using `substr(path, instr(path, 'sessions/') + 9, 12)` pattern
13
+ 3. Add `CREATE INDEX IF NOT EXISTS idx_documents_project_hash ON documents(project_hash, active)` for efficient filtering
14
+ 4. Update `insertDocumentStmt` to include `project_hash` column
15
+ 5. Update `insertDocument()` method to accept and store `projectHash`
16
+ 6. Add `projectHash` to `findDocument()` return mapping
17
+
18
+ **Acceptance**: Migration runs on first startup, backfills correctly, subsequent startups skip migration. New documents get project_hash set.
19
+
20
+ ---
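The backfill expression in Task 1 can be mirrored in plain TypeScript to check what it extracts. This is a minimal sketch, not the package's code: `projectHashFromPath` is a hypothetical helper, and the fallback to `'global'` for paths too short to hold a 12-character hash is an assumption.

```typescript
// Hypothetical helper mirroring the SQL backfill expression
//   substr(path, instr(path, 'sessions/') + 9, 12)
const SESSIONS_PREFIX = "sessions/";
const HASH_LENGTH = 12;

function projectHashFromPath(path: string): string {
  const start = path.indexOf(SESSIONS_PREFIX);
  // Non-session documents keep the column default.
  if (start === -1) return "global";
  // 'sessions/' is 9 characters long, so the hash starts right after it.
  const hashStart = start + SESSIONS_PREFIX.length;
  const hash = path.slice(hashStart, hashStart + HASH_LENGTH);
  // Assumption: a path too short to hold a full hash falls back to 'global'.
  return hash.length === HASH_LENGTH ? hash : "global";
}
```

For a path such as `sessions/abcdef123456/2026-02-23.md` the helper yields `abcdef123456`; a path like `MEMORY.md` yields `global`.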
21
+
22
+ - [x] **Task 2**: Add workspace-filtered search to store.ts
23
+ **Spec**: workspace-scoping/spec.md — Default search scoping, Cross-workspace search opt-in
24
+ **Files**: `src/store.ts`, `src/types.ts`
25
+
26
+ **Steps**:
27
+ 1. Add `projectHash?: string` parameter to `searchFTS()` and `searchVec()` in the `Store` interface
28
+ 2. Create new prepared statements for workspace-filtered FTS search:
29
+ - `searchFTSWithWorkspaceStmt`: filters `d.project_hash IN (?, 'global')` in addition to existing conditions
30
+ - `searchFTSWithWorkspaceAndCollectionStmt`: combines both workspace and collection filters
31
+ 3. Update `searchFTS()` implementation: when `projectHash` is provided and not `'all'`, use workspace-filtered statement
32
+ 4. Update `searchVec()` implementation: add `AND d.project_hash IN (?, 'global')` to the dynamic SQL when `projectHash` is provided and not `'all'`
33
+ 5. Update `Store` interface signatures in `types.ts`
34
+
35
+ **Acceptance**: `searchFTS('query', 10, undefined, 'abc123')` returns only docs with `project_hash = 'abc123'` or `'global'`. `searchFTS('query', 10, undefined, 'all')` returns all docs. `searchFTS('query', 10)` (no projectHash) returns all docs (backward compatible).
36
+
37
+ ---
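The filtering rule in Task 2 can be stated as an in-memory predicate. The sketch below is only an illustration of the semantics of `d.project_hash IN (?, 'global')`; the `Doc` shape and the `isVisible` name are hypothetical, and the real implementation filters in SQL.

```typescript
// Hypothetical document shape used only for this illustration.
interface Doc {
  path: string;
  projectHash: string;
}

// Returns true when a document is visible to a search scoped by `projectHash`.
function isVisible(doc: Doc, projectHash?: string): boolean {
  // No hash, or the special value 'all': no workspace filter (backward compatible).
  if (projectHash === undefined || projectHash === "all") return true;
  // Equivalent of: d.project_hash IN (?, 'global')
  return doc.projectHash === projectHash || doc.projectHash === "global";
}
```

Applied with `Array.prototype.filter`, a scope of `'abc123'` keeps documents tagged `'abc123'` or `'global'`, while `'all'` or an omitted scope keeps everything.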
38
+
39
+ - [x] **Task 3**: Compute currentProjectHash in server.ts and wire to search tools
40
+ **Spec**: workspace-scoping/spec.md — Workspace detection from PWD; mcp-server/spec.md (delta) — Search tools support workspace filtering
41
+ **Files**: `src/server.ts`
42
+
43
+ **Steps**:
44
+ 1. In `startServer()`, compute `currentProjectHash = crypto.createHash('sha256').update(process.cwd()).digest('hex').substring(0, 12)`
45
+ 2. Add `currentProjectHash` to `ServerDeps` interface
46
+ 3. In `createMcpServer()`, add `workspace` parameter (optional string) to `memory_search`, `memory_vsearch`, `memory_query` tool schemas
47
+ 4. In each search tool handler:
48
+ - Resolve effective workspace: `workspace === 'all' ? 'all' : (workspace || deps.currentProjectHash)`
49
+ - Pass resolved workspace to `store.searchFTS()` / `store.searchVec()` / `hybridSearch()`
50
+ 5. Update `hybridSearch()` in `search.ts` to accept and pass through `projectHash` parameter
51
+
52
+ **Acceptance**: Search tools default to current workspace. `workspace: "all"` searches everything. Tool schemas show `workspace` parameter.
53
+
54
+ ---
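The two expressions quoted in Task 3 fit into small functions. This sketch uses the formulas exactly as written in the steps; the function names `computeProjectHash` and `resolveWorkspace` are hypothetical.

```typescript
import { createHash } from "node:crypto";

// First 12 hex characters of sha256(cwd), as described in step 1.
function computeProjectHash(cwd: string): string {
  return createHash("sha256").update(cwd).digest("hex").substring(0, 12);
}

// Resolution rule from step 4: 'all' passes through, an explicit workspace
// wins, and anything else falls back to the current project hash.
function resolveWorkspace(workspace: string | undefined, currentProjectHash: string): string {
  return workspace === "all" ? "all" : (workspace || currentProjectHash);
}
```

Because `||` is used, an empty string for `workspace` also falls back to the current project hash.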
55
+
56
+ - [x] **Task 4**: Update memory_status to report workspace and storage info
57
+ **Spec**: mcp-server/spec.md (delta) — memory_status reports storage usage
58
+ **Files**: `src/server.ts`, `src/store.ts`, `src/types.ts`
59
+
60
+ **Steps**:
61
+ 1. Add `getWorkspaceStats()` method to store: `SELECT project_hash, COUNT(*) as count FROM documents WHERE active = 1 GROUP BY project_hash`
62
+ 2. Add workspace stats and storage config to `IndexHealth` interface
63
+ 3. Update `formatStatus()` to include per-workspace document counts and storage limit info
64
+ 4. In `memory_status` handler, pass storage config to format function
65
+
66
+ **Acceptance**: `memory_status` output shows per-workspace breakdown and storage limits.
67
+
68
+ ---
69
+
70
+ - [x] **Task 5**: Add storage config parsing to types.ts and collections.ts
71
+ **Spec**: storage-limits/spec.md — Storage configuration with safe defaults, Human-readable size and duration parsing
72
+ **Files**: `src/types.ts`, `src/collections.ts`
73
+
74
+ **Steps**:
75
+ 1. Add `StorageConfig` interface to `types.ts`: `{ maxSize: number; retention: number; minFreeDisk: number }`
76
+ 2. Add `storage?` field to `CollectionConfig` interface: `{ maxSize?: string; retention?: string; minFreeDisk?: string }`
77
+ 3. Create `parseSize(value: string): number` function — parses `500MB`, `2GB`, `1TB` to bytes
78
+ 4. Create `parseDuration(value: string): number` function — parses `30d`, `90d`, `1y` to milliseconds
79
+ 5. Create `parseStorageConfig(raw?: { maxSize?: string; retention?: string; minFreeDisk?: string }): StorageConfig` — applies defaults and validates
80
+ 6. Export from `collections.ts` (or create new `src/storage.ts` if cleaner)
81
+
82
+ **Acceptance**: `parseSize('2GB')` returns `2147483648`. `parseDuration('90d')` returns `7776000000`. Invalid values log warning and return defaults.
83
+
84
+ ---
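A possible shape for the Task 5 parsers is sketched below. It is an assumption-laden illustration: the explicit `fallback` parameters, the binary (1024-based) units, and treating `1y` as 365 days are choices made here, not confirmed by the source. The binary units do agree with the acceptance value `parseSize('2GB') = 2147483648`.

```typescript
const SIZE_UNITS: Record<string, number> = {
  B: 1,
  KB: 1024,
  MB: 1024 ** 2,
  GB: 1024 ** 3,
  TB: 1024 ** 4,
};

const DAY_MS = 24 * 60 * 60 * 1000;
// Assumption: a year is counted as 365 days.
const DURATION_UNITS: Record<string, number> = { d: DAY_MS, y: 365 * DAY_MS };

// Parses values such as "500MB" or "2GB" into bytes.
function parseSize(value: string, fallbackBytes: number): number {
  const match = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB|TB)$/i.exec(value.trim());
  const amount = match?.[1];
  const unit = match?.[2]?.toUpperCase();
  const factor = unit === undefined ? undefined : SIZE_UNITS[unit];
  if (amount === undefined || factor === undefined) {
    console.warn(`Invalid size "${value}", using default`);
    return fallbackBytes;
  }
  return Math.round(Number(amount) * factor);
}

// Parses values such as "30d" or "1y" into milliseconds.
function parseDuration(value: string, fallbackMs: number): number {
  const match = /^(\d+)\s*([dy])$/i.exec(value.trim());
  const amount = match?.[1];
  const unit = match?.[2]?.toLowerCase();
  const factor = unit === undefined ? undefined : DURATION_UNITS[unit];
  if (amount === undefined || factor === undefined) {
    console.warn(`Invalid duration "${value}", using default`);
    return fallbackMs;
  }
  return Number(amount) * factor;
}
```

Passing the default explicitly keeps both parsers free of shared state; a `parseStorageConfig()` wrapper could supply the defaults for each field.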
85
+
86
+ - [x] **Task 6**: Implement disk safety guard
87
+ **Spec**: storage-limits/spec.md — Disk safety guard
88
+ **Files**: `src/storage.ts` (new) or `src/watcher.ts`
89
+
90
+ **Steps**:
91
+ 1. Create `checkDiskSpace(minFreeDisk: number): { ok: boolean; freeBytes: number }` function
92
+ 2. Use `fs.statfsSync()` (Node 18.15+) on the output directory to get available space
93
+ 3. Wrap in try/catch — if `statfsSync` unavailable, log warning and return `{ ok: true, freeBytes: -1 }`
94
+ 4. Integrate into watcher's `triggerReindex()`: check disk before harvest/reindex/embed operations
95
+ 5. If disk check fails, skip all write operations and log warning
96
+
97
+ **Acceptance**: When disk is below `minFreeDisk`, writes are skipped with warning. When `statfsSync` unavailable, operations proceed with warning.
98
+
99
+ ---
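The disk guard in Task 6 might look roughly like the following. This is a sketch under assumptions: the task's signature takes only `minFreeDisk`, whereas this version also takes the directory explicitly, and free space is computed as `bavail * bsize` (blocks available to unprivileged users times block size).

```typescript
import * as fs from "node:fs";

interface DiskCheck {
  ok: boolean;
  freeBytes: number;
}

// Assumption: the directory to inspect is passed in explicitly.
function checkDiskSpace(dir: string, minFreeDisk: number): DiskCheck {
  try {
    // fs.statfsSync exists from Node 18.15; older runtimes take the catch path.
    if (typeof fs.statfsSync !== "function") {
      throw new Error("fs.statfsSync is not available in this Node version");
    }
    const stats = fs.statfsSync(dir);
    const freeBytes = stats.bavail * stats.bsize;
    return { ok: freeBytes >= minFreeDisk, freeBytes };
  } catch (error) {
    // Fail open, as step 3 describes: warn and let operations proceed.
    console.warn(`Disk check skipped: ${String(error)}`);
    return { ok: true, freeBytes: -1 };
  }
}
```

A caller such as `triggerReindex()` would skip harvest, reindex, and embed when `ok` is `false`.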
100
+
101
+ - [x] **Task 7**: Implement retention and size-based eviction
102
+ **Spec**: storage-limits/spec.md — Retention-based eviction, Size-based eviction, Original session JSON is never deleted
103
+ **Files**: `src/storage.ts` (new), `src/watcher.ts`, `src/store.ts`
104
+
105
+ **Steps**:
106
+ 1. Create `evictExpiredSessions(sessionsDir: string, retention: number, store: Store): number` function:
107
+ - Scan all `sessions/{hash}/*.md` files
108
+ - Check mtime against `Date.now() - retention`
109
+ - Delete expired files and remove corresponding documents from store
110
+ - Return count of evicted files
111
+ 2. Create `evictBySize(sessionsDir: string, dbPath: string, maxSize: number, store: Store): number` function:
112
+ - Calculate total size: `statSync(dbPath).size` + recursive dir size of `sessionsDir`
113
+ - If over `maxSize`, collect all session files sorted by mtime (oldest first)
114
+ - Delete oldest files one by one until under limit
115
+ - Remove corresponding documents from store
116
+ - Return count of evicted files
117
+ 3. Add `deleteDocumentsByPath(pathPattern: string): number` method to store for removing evicted documents
118
+ 4. Integrate eviction into watcher's harvest cycle: run after harvest, before reindex
119
+ 5. Never touch files outside `sessionsDir` (original JSON in `~/.local/share/opencode/` is safe)
120
+
121
+ **Acceptance**: Sessions older than retention are evicted. If still over maxSize, oldest are evicted. Original JSON untouched. Eviction count logged.
122
+
123
+ ---
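The selection logic behind the two eviction passes in Task 7 can be separated from file I/O. The sketch below only decides which files to evict from an in-memory listing; the `SessionFile` shape and the `plan...` function names are hypothetical, and deleting files or store rows is left to the caller.

```typescript
// Hypothetical in-memory description of a harvested session file.
interface SessionFile {
  path: string;
  size: number;    // bytes
  mtimeMs: number; // last modification time
}

// Retention pass: everything older than `now - retention` is expired.
function planRetentionEviction(files: SessionFile[], retention: number, now: number): string[] {
  const cutoff = now - retention;
  return files.filter((file) => file.mtimeMs < cutoff).map((file) => file.path);
}

// Size pass: evict oldest files first until the total fits within maxSize.
function planSizeEviction(files: SessionFile[], dbSize: number, maxSize: number): string[] {
  let total = dbSize + files.reduce((sum, file) => sum + file.size, 0);
  const evicted: string[] = [];
  const oldestFirst = [...files].sort((a, b) => a.mtimeMs - b.mtimeMs);
  for (const file of oldestFirst) {
    if (total <= maxSize) break;
    total -= file.size;
    evicted.push(file.path);
  }
  return evicted;
}
```

Keeping the planning pure makes the "oldest evicted first" rule easy to test without temp files or mocked mtimes.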
124
+
125
+ - [x] **Task 8**: Implement orphan embedding cleanup
126
+ **Spec**: storage-limits/spec.md — Orphan embedding cleanup
127
+ **Files**: `src/store.ts`, `src/watcher.ts`
128
+
129
+ **Steps**:
130
+ 1. Add `cleanOrphanedEmbeddings(): number` method to store:
131
+ - `DELETE FROM content_vectors WHERE hash NOT IN (SELECT hash FROM documents WHERE active = 1)`
132
+ - If vec table exists: `DELETE FROM vectors_vec WHERE substr(hash_seq, 1, instr(hash_seq, ':') - 1) NOT IN (SELECT hash FROM documents WHERE active = 1)`
133
+ - Return count of deleted rows
134
+ 2. Add cycle counter to watcher — every 10 harvest cycles, call `cleanOrphanedEmbeddings()`
135
+ 3. Log cleanup results
136
+
137
+ **Acceptance**: Orphaned embeddings are cleaned every 10 cycles. No active document embeddings are deleted.
138
+
139
+ ---
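The orphan rule and the "every 10 cycles" gate in Task 8 can be illustrated without a database. In this sketch, `hashFromHashSeq` mirrors `substr(hash_seq, 1, instr(hash_seq, ':') - 1)`; all names here are hypothetical, and the real cleanup runs as SQL `DELETE` statements.

```typescript
// Mirrors substr(hash_seq, 1, instr(hash_seq, ':') - 1): the text before the first colon.
function hashFromHashSeq(hashSeq: string): string {
  const colon = hashSeq.indexOf(":");
  return colon === -1 ? "" : hashSeq.slice(0, colon);
}

// Vector keys whose document hash no longer belongs to an active document.
function findOrphanedKeys(vectorKeys: string[], activeHashes: Set<string>): string[] {
  return vectorKeys.filter((key) => !activeHashes.has(hashFromHashSeq(key)));
}

// Returns a function that reports true on every `every`-th call.
function makeCycleGate(every: number): () => boolean {
  let cycles = 0;
  return () => {
    cycles += 1;
    return cycles % every === 0;
  };
}
```

A watcher could call the gate once per harvest cycle and run `cleanOrphanedEmbeddings()` only when it returns `true`.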
140
+
141
+ - [x] **Task 9**: Wire storage config through server startup
142
+ **Spec**: storage-limits/spec.md — Storage configuration with safe defaults
143
+ **Files**: `src/server.ts`, `src/watcher.ts`
144
+
145
+ **Steps**:
146
+ 1. In `startServer()`, parse storage config from loaded collection config
147
+ 2. Pass `StorageConfig` to watcher via `WatcherOptions`
148
+ 3. Watcher uses storage config for disk check, eviction thresholds
149
+ 4. Pass storage config to `memory_status` for display
150
+
151
+ **Acceptance**: Storage config from `config.yml` is loaded and used. Missing config uses defaults. Status tool shows config values.
152
+
153
+ ---
154
+
155
+ - [x] **Task 10**: Add tests for workspace scoping
156
+ **Spec**: workspace-scoping/spec.md — all requirements
157
+ **Files**: `test/store.test.ts` or new `test/workspace.test.ts`
158
+
159
+ **Steps**:
160
+ 1. Test migration: create store, verify `project_hash` column exists
161
+ 2. Test document tagging: index documents with session paths, verify `project_hash` extracted correctly
162
+ 3. Test non-session documents get `project_hash = 'global'`
163
+ 4. Test `searchFTS` with workspace filter returns only matching + global docs
164
+ 5. Test `searchFTS` with `'all'` returns everything
165
+ 6. Test `searchVec` with workspace filter (if vec available)
166
+ 7. Test `currentProjectHash` computation matches harvester convention
167
+
168
+ **Acceptance**: All new tests pass. Existing 265 tests still pass.
169
+
170
+ ---
171
+
172
+ - [x] **Task 11**: Add tests for storage limits
173
+ **Spec**: storage-limits/spec.md — all requirements
174
+ **Files**: `test/storage.test.ts` (new)
175
+
176
+ **Steps**:
177
+ 1. Test `parseSize()`: valid sizes, invalid input, edge cases
178
+ 2. Test `parseDuration()`: valid durations, invalid input
179
+ 3. Test `parseStorageConfig()`: full config, partial config, empty config
180
+ 4. Test retention eviction: create temp files with old mtimes, verify eviction
181
+ 5. Test size eviction: create files exceeding maxSize, verify oldest evicted first
182
+ 6. Test disk safety guard: mock `statfsSync` to test both paths
183
+ 7. Test orphan cleanup: create orphaned embeddings, verify cleanup
184
+
185
+ **Acceptance**: All new tests pass. Existing tests still pass.
186
+
187
+ ---
188
+
189
+ - [x] **Task 12**: Integration test for workspace-scoped search via MCP tools
190
+ **Spec**: mcp-server/spec.md (delta) — Search tool parameter schema includes workspace
191
+ **Files**: `test/server.test.ts`
192
+
193
+ **Steps**:
194
+ 1. Add test: index documents with different `project_hash` values
195
+ 2. Test `memory_search` without workspace param returns only current workspace + global
196
+ 3. Test `memory_search` with `workspace: "all"` returns all
197
+ 4. Test `memory_status` includes workspace breakdown
198
+
199
+ **Acceptance**: Integration tests verify end-to-end workspace scoping through MCP tool handlers.
@@ -0,0 +1,2 @@
1
+ schema: spec-driven
2
+ created: 2026-02-23
@@ -0,0 +1,169 @@
1
+ ## Context
2
+
3
+ nano-brain indexes markdown documents (sessions, MEMORY.md, daily logs) into SQLite with FTS5 + sqlite-vec for hybrid search. The existing pipeline is:
4
+
5
+ 1. **Collections** define directories + glob patterns to scan (`config.yml`)
6
+ 2. **Watcher** monitors collections via chokidar, triggers reindex on changes
7
+ 3. **Chunker** splits markdown by headings/paragraphs (target 900 tokens, 15% overlap)
8
+ 4. **Store** indexes chunks into FTS5 + content-addressed embeddings
9
+ 5. **Search** queries FTS5 (BM25) and/or sqlite-vec (cosine), fuses with RRF
10
+
11
+ Current limitations:
12
+ - The chunker only understands markdown structure (headings, code fences, lists)
13
+ - Collections use `pattern: "**/*.md"` — no exclude support
14
+ - The watcher watches collection output dirs, not arbitrary workspace directories
15
+ - No concept of "codebase" as a distinct collection type
16
+
17
+ The MCP server runs per-workspace with `PWD` set to the workspace root. `currentProjectHash = sha256(cwd).substring(0, 12)` is already computed at startup.
18
+
19
+ ## Goals / Non-Goals
20
+
21
+ **Goals:**
22
+ - Index source code files from the current workspace into the existing search pipeline
23
+ - Support configurable exclude patterns to skip `node_modules`, `.git`, `dist`, etc.
24
+ - Chunk source code files intelligently — respecting function/class boundaries where feasible, falling back to line-based splitting
25
+ - Reuse the existing watcher infrastructure for incremental updates
26
+ - Tag all codebase documents with `currentProjectHash` for workspace-scoped search
27
+ - Provide sensible defaults so it works with zero config for common project types
28
+ - Keep it opt-in — codebase indexing only happens when configured
29
+
30
+ **Non-Goals:**
31
+ - AST-level parsing of every language (too complex, too many dependencies)
32
+ - Indexing binary files, images, or compiled output
33
+ - Real-time indexing on every keystroke (debounced file watcher is sufficient)
34
+ - Replacing LSP or grep — this complements them with semantic search
35
+ - Cross-workspace codebase search (each workspace indexes its own code)
36
+ - Indexing external dependencies (node_modules, vendor, etc.)
37
+
38
+ ## Decisions
39
+
40
+ ### D1: Codebase as a special auto-configured collection
41
+
42
+ **Decision**: When `codebase: { enabled: true }` is set in config.yml (or auto-detected), create a virtual collection named `codebase` pointing at `process.cwd()` with source code glob patterns and exclude rules. This collection is NOT stored in the `collections:` section — it's a separate top-level config.
43
+
44
+ **Why**: Codebase indexing is fundamentally different from document collections — it targets the workspace root (dynamic per-server instance), needs exclude patterns, and uses different chunking. Keeping it separate avoids polluting the general collection config.
45
+
46
+ **Config format**:
47
+ ```yaml
48
+ codebase:
49
+   enabled: true
50
+   exclude:
51
+     - node_modules
52
+     - .git
53
+     - dist
54
+     - build
55
+     - .next
56
+     - __pycache__
57
+     - "*.min.js"
58
+     - "*.map"
59
+   extensions:
60
+     - .ts
61
+     - .tsx
62
+     - .js
63
+     - .jsx
64
+     - .py
65
+     - .go
66
+     - .rs
67
+     - .java
68
+     - .rb
69
+     - .md
70
+   maxFileSize: 5MB # Skip files larger than this
71
+ ```
72
+
73
+ **Alternative considered**: Add codebase as a regular collection entry. Rejected because the path is dynamic (PWD), exclude patterns don't exist on regular collections, and it would confuse the existing collection management CLI.
74
+
75
+ ### D2: Exclude patterns via .gitignore + config
76
+
77
+ **Decision**: Merge exclude patterns from three sources (in priority order):
78
+ 1. **Config `codebase.exclude`** — explicit user overrides
79
+ 2. **`.gitignore`** — project-specific ignores (already maintained by developers)
80
+ 3. **Built-in defaults** — `node_modules`, `.git`, `dist`, `build`, `__pycache__`, `vendor`, `.next`, `.nuxt`, `target`, `*.min.js`, `*.map`, `*.lock`, `*.sum`
81
+
82
+ **Why**: `.gitignore` already captures 90% of what should be excluded. Adding it as a source means zero-config works for most projects. The built-in defaults catch common cases even without a `.gitignore`.
83
+
84
+ **Implementation**: Use fast-glob's `ignore` option, which takes glob patterns and covers most gitignore-style entries. Parse `.gitignore` at startup and merge with config excludes.
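The three-source merge in D2 could be sketched as follows. This is an illustration only: `parseGitignore` and `mergeExcludes` are hypothetical names, and dropping negated (`!`) entries rather than interpreting them is an assumption made to keep the sketch simple.

```typescript
// Built-in defaults as listed in D2.
const BUILT_IN_EXCLUDES: string[] = [
  "node_modules", ".git", "dist", "build", "__pycache__", "vendor",
  ".next", ".nuxt", "target", "*.min.js", "*.map", "*.lock", "*.sum",
];

// Keeps plain patterns; skips blank lines, comments, and negations.
function parseGitignore(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#") && !line.startsWith("!"));
}

// Union of config excludes, .gitignore entries, and built-in defaults,
// de-duplicated while preserving the priority order.
function mergeExcludes(configExcludes: string[] = [], gitignoreText = ""): string[] {
  const merged = [...configExcludes, ...parseGitignore(gitignoreText), ...BUILT_IN_EXCLUDES];
  return Array.from(new Set(merged));
}
```

A missing `.gitignore` maps to an empty string, so the function returns only config excludes plus defaults and never throws.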
85
+
86
+ ### D3: Source code chunking — line-based with structural hints
87
+
88
+ **Decision**: Create a `chunkSourceCode()` function that splits by structural boundaries:
89
+ 1. **Primary split**: Blank line sequences (2+ consecutive blank lines = strong break)
90
+ 2. **Secondary split**: Single blank lines between top-level constructs
91
+ 3. **Structural hints**: Recognize common patterns across languages:
92
+ - `function`, `def`, `fn`, `func` — function definitions
93
+ - `class`, `struct`, `interface`, `enum`, `type` — type definitions
94
+ - `import`, `from`, `require`, `use` — import blocks
95
+ - `export` — export statements
96
+ 4. **Fallback**: If no structural breaks found within target chunk size, split at line boundaries
97
+ 5. **Same target size**: 900 tokens (~3600 chars), 15% overlap — matching markdown chunker
98
+
99
+ **Why**: Full AST parsing requires language-specific parsers (tree-sitter, etc.) which adds heavy dependencies. Line-based splitting with structural hints gives 80% of the benefit at 10% of the complexity. The patterns above work across TypeScript, Python, Go, Rust, Java, Ruby, and most C-family languages.
100
+
101
+ **Alternative considered**: tree-sitter for AST-aware chunking. Rejected for v1 — adds ~50MB of native dependencies, complex build, and only marginally better than structural hints for search purposes. Can be added later as an enhancement.
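A simplified version of the D3 splitting strategy is sketched below. It is not the package's `chunkSourceCode()`: it measures size in characters instead of tokens, treats a line after a blank line or a top-level structural keyword as a preferred split point, falls back to a plain line boundary when no preferred point exists, and omits the 15% overlap for brevity.

```typescript
// Top-level structural keywords from D3 (no leading indentation allowed).
const STRUCTURAL_HINT =
  /^(?:function|def|fn|func|class|struct|interface|enum|type|import|from|require|use|export)\b/;

function chunkSourceCode(source: string, maxChars = 3600): string[] {
  const lines = source.split("\n");
  const chunks: string[] = [];
  let start = 0;      // first line of the chunk being built
  let size = 0;       // characters accumulated since `start`
  let lastBreak = -1; // most recent preferred split point after `start`

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    const previous = i > 0 ? (lines[i - 1] ?? "") : "";
    if (i > start && (previous.trim() === "" || STRUCTURAL_HINT.test(line))) {
      lastBreak = i;
    }
    if (i > start && size + line.length + 1 > maxChars) {
      // Prefer a structural break; otherwise cut at the current line.
      const cut = lastBreak > start ? lastBreak : i;
      chunks.push(lines.slice(start, cut).join("\n"));
      start = cut;
      lastBreak = -1;
      // Recount the lines carried over into the new chunk.
      size = 0;
      for (let j = cut; j < i; j++) size += (lines[j] ?? "").length + 1;
    }
    size += line.length + 1;
  }
  if (start < lines.length) chunks.push(lines.slice(start).join("\n"));
  return chunks;
}
```

Because chunks are contiguous line ranges, joining them with `"\n"` reproduces the original file, which makes the splitter easy to verify.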
102
+
103
+ ### D4: File-level metadata in document title
104
+
105
+ **Decision**: Store the relative file path as the document `title` and include a metadata header in the indexed content:
106
+ ```
107
+ File: src/auth/login.ts
108
+ Language: typescript
109
+ Lines: 1-45
110
+
111
+ [actual file content]
112
+ ```
113
+
114
+ **Why**: The metadata header helps the embedding model understand what it's looking at. The relative path as title makes search results immediately actionable (agent knows which file to open).
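Building the D4 header is a small string operation. In the sketch below, `withMetadataHeader` and the extension-to-language table are hypothetical; the source does not say how the language name is derived, so the mapping is an assumption.

```typescript
// Assumption: language names are looked up from the file extension.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ".ts": "typescript",
  ".tsx": "typescript",
  ".js": "javascript",
  ".jsx": "javascript",
  ".py": "python",
  ".go": "go",
  ".rs": "rust",
  ".java": "java",
  ".rb": "ruby",
  ".md": "markdown",
};

function withMetadataHeader(
  relativePath: string,
  startLine: number,
  endLine: number,
  content: string,
): string {
  const dot = relativePath.lastIndexOf(".");
  const extension = dot === -1 ? "" : relativePath.slice(dot);
  const language = LANGUAGE_BY_EXTENSION[extension] ?? "unknown";
  return [
    `File: ${relativePath}`,
    `Language: ${language}`,
    `Lines: ${startLine}-${endLine}`,
    "",
    content,
  ].join("\n");
}
```

The header is followed by one blank line and then the chunk content, matching the layout shown in D4.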
115
+
116
+ ### D5: Incremental indexing via content hash
117
+
118
+ **Decision**: Reuse the existing content-addressed storage. On each scan:
119
+ 1. Compute `sha256(fileContent)` for each source file
120
+ 2. Compare with stored hash in `documents` table
121
+ 3. Skip unchanged files (hash match)
122
+ 4. Re-index changed files (hash mismatch)
123
+ 5. Deactivate deleted files (in DB but not on disk)
124
+
125
+ **Why**: The store already does content-addressed dedup via `computeHash()`. This is the same pattern used for markdown documents. No new infrastructure needed.
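The scan decision in D5 reduces to comparing two maps. This sketch assumes the stored state is available as a path-to-hash map and the files on disk as a path-to-content map; `planIncrementalScan` is a hypothetical name, and new files are grouped with changed files because both need indexing.

```typescript
import { createHash } from "node:crypto";

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

interface ScanPlan {
  unchanged: string[]; // hash match: skip
  changed: string[];   // hash mismatch or new file: (re)index
  deleted: string[];   // in the DB but not on disk: deactivate
}

function planIncrementalScan(
  storedHashes: Map<string, string>,
  filesOnDisk: Map<string, string>,
): ScanPlan {
  const plan: ScanPlan = { unchanged: [], changed: [], deleted: [] };
  filesOnDisk.forEach((content, path) => {
    if (storedHashes.get(path) === sha256(content)) {
      plan.unchanged.push(path);
    } else {
      plan.changed.push(path);
    }
  });
  storedHashes.forEach((_hash, path) => {
    if (!filesOnDisk.has(path)) plan.deleted.push(path);
  });
  return plan;
}
```

Only the `changed` list needs chunking and embedding, which is what keeps updates after the initial index cheap.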
126
+
127
+ ### D6: Watcher integration — reuse existing chokidar setup
128
+
129
+ **Decision**: Add the workspace directory as an additional watch target in the existing watcher, with the exclude patterns applied as chokidar `ignored` options. Source file changes trigger the same dirty-flag → debounce → reindex cycle.
130
+
131
+ **Why**: The watcher already handles debouncing, dirty flags, and reindex scheduling. Adding another watch target is simpler than creating a separate watcher.
132
+
133
+ ### D7: Auto-detection of project type for default extensions
134
+
135
+ **Decision**: At startup, detect project type by checking for marker files:
136
+ - `package.json` → Node.js: `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.css`, `.html`
137
+ - `pyproject.toml` / `setup.py` / `requirements.txt` → Python: `.py`, `.pyi`
138
+ - `go.mod` → Go: `.go`
139
+ - `Cargo.toml` → Rust: `.rs`
140
+ - `pom.xml` / `build.gradle` → Java/Kotlin: `.java`, `.kt`
141
+ - `Gemfile` → Ruby: `.rb`
142
+ - Fallback: all common extensions
143
+
144
+ Always include `.md` files from the workspace (README, docs).
145
+
146
+ **Why**: Avoids indexing irrelevant file types. A Python project doesn't need `.ts` files indexed. Auto-detection means zero-config for common setups.
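The marker-file lookup in D7 can be expressed as a table. The sketch below takes the list of file names found in the workspace root and returns the extensions to index; `detectExtensions` is a hypothetical name, and defining the fallback as the union of every listed extension is an assumption, since D7 only says "all common extensions".

```typescript
interface ProjectMarker {
  files: string[];
  extensions: string[];
}

// Marker files and extensions as listed in D7.
const PROJECT_MARKERS: ProjectMarker[] = [
  { files: ["package.json"], extensions: [".ts", ".tsx", ".js", ".jsx", ".json", ".css", ".html"] },
  { files: ["pyproject.toml", "setup.py", "requirements.txt"], extensions: [".py", ".pyi"] },
  { files: ["go.mod"], extensions: [".go"] },
  { files: ["Cargo.toml"], extensions: [".rs"] },
  { files: ["pom.xml", "build.gradle"], extensions: [".java", ".kt"] },
  { files: ["Gemfile"], extensions: [".rb"] },
];

// Assumption: the fallback is the union of every extension above.
const FALLBACK_EXTENSIONS: string[] = PROJECT_MARKERS.reduce<string[]>(
  (all, marker) => all.concat(marker.extensions),
  [],
);

function detectExtensions(rootEntries: string[]): string[] {
  const present = new Set(rootEntries);
  const detected: string[] = [];
  for (const marker of PROJECT_MARKERS) {
    if (marker.files.some((file) => present.has(file))) {
      detected.push(...marker.extensions);
    }
  }
  const base = detected.length > 0 ? detected : FALLBACK_EXTENSIONS;
  // Markdown is always included (README, docs).
  return Array.from(new Set([...base, ".md"]));
}
```

A workspace with several markers, for example `package.json` and `pyproject.toml`, gets the extensions of both project types.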
147
+
148
+ ### D8: Max file size guard
149
+
150
+ **Decision**: Skip files larger than `maxFileSize` (default 5MB). Log a warning for skipped files.
151
+
152
+ **Why**: Large generated files (bundles, minified code, data files) waste storage and produce poor search results. 5MB covers virtually all hand-written source files while still guarding against huge generated artifacts.
153
+
154
+ ## Risks / Trade-offs
155
+
156
+ **[Risk] Large codebases produce many chunks** → Storage limits from v0.2.0 (maxSize, retention) apply. Codebase chunks count toward the total. For very large monorepos, users may need to increase maxSize or narrow extensions.
157
+
158
+ **[Risk] Stale index after branch switch** → `git checkout` changes many files at once. The watcher will detect changes and reindex, but there's a window where the index is stale. Acceptable — the reindex debounce (2s) is fast enough.
159
+
160
+ **[Risk] Embedding all source code is expensive** → Embedding 1000 files × 5 chunks each = 5000 embeddings. At ~50ms per embedding, that's ~4 minutes for initial index. Subsequent updates are incremental (only changed files). This is acceptable for a background process.
161
+
162
+ **[Risk] Structural hints miss language-specific constructs** → The line-based chunker won't perfectly split every language. Some chunks may cut mid-function. This is acceptable for search — the overlap ensures context is preserved, and semantic search is tolerant of imperfect boundaries.
163
+
164
+ **[Risk] .gitignore parsing edge cases** → Complex .gitignore patterns (negation, nested) may not parse perfectly with fast-glob. Mitigation: the built-in defaults catch the most important exclusions regardless.
165
+
166
+ ## Open Questions
167
+
168
+ - Should codebase indexing be enabled by default (auto-detect) or require explicit `codebase: { enabled: true }`? Leaning toward explicit opt-in for v1 to avoid surprising users with increased storage usage.
169
+ - Should there be a `memory_index_codebase` MCP tool for on-demand full reindex, or is the watcher sufficient? Leaning toward adding the tool for the initial index trigger.
@@ -0,0 +1,30 @@
1
+ ## Why
2
+
3
+ nano-brain currently indexes only session transcripts, MEMORY.md, and daily logs — it has zero knowledge of the actual source code. When an agent queries `memory_query query="how does authentication work"`, it finds past *conversations* about auth but not the actual implementation. This forces agents to rely on grep (exact keywords only) and LSP (requires knowing symbol names upfront) for code discovery, which fails for semantic/conceptual queries across large codebases.
4
+
5
+ Indexing source code enables semantic search over the codebase — finding related code by meaning, not just keywords. This is the single highest-impact improvement for agent productivity in unfamiliar or large projects.
6
+
7
+ ## What Changes
8
+
9
+ - Add support for a `codebase` collection type that indexes source code files from the current workspace
10
+ - Add `exclude` patterns to collection config (e.g., `node_modules`, `.git`, `dist`, `build`) to skip irrelevant directories
11
+ - Add language-aware chunking for source code files (not just markdown) — respecting function/class boundaries where possible
12
+ - Integrate with the existing file watcher for incremental re-indexing on file changes
13
+ - Tag codebase documents with the current workspace's `projectHash` for workspace-scoped search
14
+ - Auto-detect common exclude patterns based on project type (Node.js, Python, Go, etc.)
15
+
16
+ ## Capabilities
17
+
18
+ ### New Capabilities
19
+ - `codebase-collection`: Collection type for indexing source code with exclude patterns, language-aware chunking, and workspace tagging
20
+
21
+ ### Modified Capabilities
22
+ - `mcp-server`: Add `memory_index_codebase` tool for on-demand codebase indexing, update `memory_status` to show codebase collection stats
23
+
24
+ ## Impact
25
+
26
+ - **Config format**: New optional `exclude` field on collection config, new optional `codebase` section in config.yml
27
+ - **Chunker**: Needs to handle non-markdown files (`.ts`, `.py`, `.go`, `.rs`, `.java`, etc.) with language-aware splitting
28
+ - **Storage**: Codebase indexing will significantly increase document/chunk count — storage limits from v0.2.0 apply
29
+ - **Watcher**: Must watch workspace directory with exclude patterns, not just the output directory
30
+ - **Dependencies**: No new dependencies expected — fast-glob already supports ignore patterns
@@ -0,0 +1,187 @@
1
+ ## ADDED Requirements
2
+
3
+ ### Requirement: Codebase configuration format
4
+ The system SHALL support a top-level `codebase` section in `config.yml` with the following fields: `enabled` (boolean), `exclude` (string array), `extensions` (string array), `maxFileSize` (string, human-readable size), and `maxSize` (string, human-readable size for storage budget). All fields except `enabled` SHALL be optional with sensible defaults. When `codebase.enabled` is `false` or the section is absent, no codebase indexing SHALL occur.
5
+
6
+ #### Scenario: Full codebase config provided
7
+ - **WHEN** config.yml contains `codebase: { enabled: true, exclude: ["node_modules", ".git"], extensions: [".ts", ".py"], maxFileSize: "50KB" }`
8
+ - **THEN** the system indexes only `.ts` and `.py` files, skipping `node_modules` and `.git` directories, and skipping files larger than 50KB
9
+
10
+ #### Scenario: Minimal codebase config (enabled only)
11
+ - **WHEN** config.yml contains `codebase: { enabled: true }` with no other fields
12
+ - **THEN** the system uses auto-detected extensions (from project type), built-in exclude defaults, and 5MB maxFileSize
13
+
14
+ #### Scenario: Codebase section absent
15
+ - **WHEN** config.yml has no `codebase` section
16
+ - **THEN** no codebase indexing occurs
17
+ - **THEN** no codebase-related file watching occurs
18
+
19
+ #### Scenario: Codebase explicitly disabled
20
+ - **WHEN** config.yml contains `codebase: { enabled: false }`
21
+ - **THEN** no codebase indexing occurs even if other codebase fields are present
22
+
23
+ ### Requirement: Exclude pattern merging from three sources
24
+ The system SHALL merge exclude patterns from three sources in priority order: (1) `codebase.exclude` from config, (2) `.gitignore` file in the workspace root, and (3) built-in defaults. The built-in defaults SHALL include at minimum: `node_modules`, `.git`, `dist`, `build`, `__pycache__`, `vendor`, `.next`, `.nuxt`, `target`, `*.min.js`, `*.map`, `*.lock`, `*.sum`. All three sources SHALL be combined (union) into a single exclude list passed to the file scanner.
25
+
26
+ #### Scenario: All three sources present
27
+ - **WHEN** config has `exclude: ["custom-dir"]`, `.gitignore` contains `coverage/`, and built-in defaults include `node_modules`
28
+ - **THEN** the effective exclude list includes `custom-dir`, `coverage/`, `node_modules`, and all other built-in defaults
29
+
30
+ #### Scenario: No .gitignore file
31
+ - **WHEN** the workspace root has no `.gitignore` file
32
+ - **THEN** only config excludes and built-in defaults are used
33
+ - **THEN** no error is thrown
34
+
35
+ #### Scenario: No config excludes
36
+ - **WHEN** `codebase.exclude` is not set in config
37
+ - **THEN** `.gitignore` patterns and built-in defaults are still applied
38
+
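A minimal sketch of the union described above, assuming the `.gitignore` content has already been read (or is `null` when the file is missing). The function names are hypothetical, and `.gitignore` negation patterns (`!pattern`) are passed through untouched rather than interpreted.

```typescript
// Built-in defaults listed in the requirement.
const BUILT_IN_EXCLUDES: string[] = [
  "node_modules", ".git", "dist", "build", "__pycache__", "vendor",
  ".next", ".nuxt", "target", "*.min.js", "*.map", "*.lock", "*.sum",
];

// Keep non-empty, non-comment lines. A missing file yields no patterns and no error.
function parseGitignore(content: string | null): string[] {
  if (content === null) return [];
  return content
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
}

// Union of config excludes, .gitignore patterns, and built-in defaults,
// de-duplicated while preserving the priority order of the sources.
function mergeExcludes(configExcludes: string[] | undefined, gitignoreContent: string | null): string[] {
  const merged = new Set<string>();
  const sources = [configExcludes ?? [], parseGitignore(gitignoreContent), BUILT_IN_EXCLUDES];
  for (const source of sources) {
    for (const pattern of source) merged.add(pattern);
  }
  return Array.from(merged);
}
```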
39
+ ### Requirement: Source code chunking with structural hints
40
+ The system SHALL chunk source code files using line-based splitting with structural boundary detection. Primary split points SHALL be blank line sequences (2+ consecutive blank lines). Secondary split points SHALL be single blank lines between top-level constructs. Structural hints SHALL recognize common cross-language patterns: function definitions (`function`, `def`, `fn`, `func`), type definitions (`class`, `struct`, `interface`, `enum`, `type`), import blocks (`import`, `from`, `require`, `use`), and export statements (`export`). The target chunk size SHALL be 900 tokens (~3600 characters) with 15% overlap, matching the existing markdown chunker.
41
+
42
+ #### Scenario: TypeScript file with functions
43
+ - **WHEN** a TypeScript file contains three functions separated by blank lines, each ~300 tokens
44
+ - **THEN** the chunker produces one chunk containing all three functions (under 900 token target)
45
+
46
+ #### Scenario: Large Python file exceeding chunk target
47
+ - **WHEN** a Python file contains a class with 2000 tokens
48
+ - **THEN** the chunker splits at structural boundaries (method definitions) within the class
49
+ - **THEN** each resulting chunk is approximately 900 tokens with 15% overlap
50
+
51
+ #### Scenario: File with no structural boundaries
52
+ - **WHEN** a file contains continuous text with no blank lines or structural keywords
53
+ - **THEN** the chunker falls back to splitting at line boundaries near the 900 token target
54
+
55
+ #### Scenario: Overlap between chunks
56
+ - **WHEN** a source file is split into multiple chunks
57
+ - **THEN** adjacent chunks share approximately 15% overlapping content at their boundaries
58
+
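The chunking rules can be approximated as follows. This is a simplified sketch, not the package's chunker: it estimates tokens as characters divided by 4 (so 900 tokens becomes 3600 characters), searches only the second half of each window for a boundary, and uses hypothetical names (`Chunk`, `boundaryScore`, `chunkSource`).

```typescript
interface Chunk {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;
}

// Keywords from the requirement, optionally preceded by `export` or `async`.
const STRUCTURAL =
  /^(export\s+)?(async\s+)?(function|def|fn|func|class|struct|interface|enum|type|import|from|require|use)\b/;

// Score for starting a new chunk at line index i (0-based). Higher is better.
function boundaryScore(lines: string[], i: number): number {
  const prevBlank = i > 0 && lines[i - 1].trim() === "";
  const prev2Blank = i > 1 && lines[i - 2].trim() === "";
  if (prevBlank && prev2Blank) return 3;                  // 2+ consecutive blank lines
  if (prevBlank && STRUCTURAL.test(lines[i])) return 2;   // blank line before a top-level construct
  if (STRUCTURAL.test(lines[i].trim())) return 1;         // nested construct, e.g. a method
  return 0;                                               // plain line boundary
}

function chunkSource(source: string, targetChars = 3600, overlapRatio = 0.15): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  let start = 0;   // first line index of the current chunk
  let prevEnd = 0; // exclusive end index of the previous chunk
  while (start < lines.length) {
    let end = start;
    let size = 0;
    // Grow the window up to the target, always taking at least one new line.
    while (end < lines.length && (end <= prevEnd || size + lines[end].length + 1 <= targetChars)) {
      size += lines[end].length + 1;
      end++;
    }
    if (end < lines.length) {
      // Prefer the strongest (and latest) boundary in the second half of the window.
      const floor = Math.max(start + Math.ceil((end - start) / 2), prevEnd + 1);
      let best = end;
      let bestScore = 0;
      for (let i = end; i >= floor; i--) {
        const score = boundaryScore(lines, i);
        if (score > bestScore) {
          best = i;
          bestScore = score;
        }
      }
      end = best;
    }
    chunks.push({ startLine: start + 1, endLine: end, text: lines.slice(start, end).join("\n") });
    if (end >= lines.length) break;
    // Step back so the next chunk repeats roughly overlapRatio of the target size.
    let next = end;
    let overlap = 0;
    while (next > start + 1 && overlap + lines[next - 1].length + 1 <= targetChars * overlapRatio) {
      overlap += lines[next - 1].length + 1;
      next--;
    }
    prevEnd = end;
    start = next;
  }
  return chunks;
}
```

A file that fits under the target comes back as a single chunk, and a file with no blank lines or keywords falls through to score 0 everywhere, so it is split at the last line that fits.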
59
+ ### Requirement: File metadata header in indexed content
60
+ Each indexed chunk SHALL include a metadata header prepended to the content: `File: <relative-path>`, `Language: <language>`, `Lines: <start>-<end>`. The relative path SHALL be computed from the workspace root. The language SHALL be inferred from the file extension.
61
+
62
+ #### Scenario: TypeScript file chunk
63
+ - **WHEN** a chunk is created from lines 10-45 of `src/auth/login.ts`
64
+ - **THEN** the indexed content starts with `File: src/auth/login.ts\nLanguage: typescript\nLines: 10-45\n\n`
65
+
66
+ #### Scenario: Python file chunk
67
+ - **WHEN** a chunk is created from lines 1-30 of `utils/helpers.py`
68
+ - **THEN** the indexed content starts with `File: utils/helpers.py\nLanguage: python\nLines: 1-30\n\n`
69
+
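The header format in both scenarios can be produced with a helper along these lines. The extension-to-language table and the `"unknown"` fallback are assumptions; the spec only states that the language is inferred from the extension.

```typescript
// Assumed mapping; the spec only confirms ".ts" -> typescript and ".py" -> python.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ".ts": "typescript",
  ".tsx": "typescript",
  ".js": "javascript",
  ".jsx": "javascript",
  ".py": "python",
  ".go": "go",
  ".rs": "rust",
  ".java": "java",
  ".kt": "kotlin",
  ".rb": "ruby",
  ".md": "markdown",
};

function inferLanguage(relativePath: string): string {
  const dot = relativePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : relativePath.slice(dot).toLowerCase();
  return LANGUAGE_BY_EXTENSION[ext] ?? "unknown"; // fallback value is an assumption
}

// Prepend the three header lines and a blank line to the chunk content.
function withMetadataHeader(relativePath: string, startLine: number, endLine: number, content: string): string {
  return `File: ${relativePath}\nLanguage: ${inferLanguage(relativePath)}\nLines: ${startLine}-${endLine}\n\n${content}`;
}
```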
70
+ ### Requirement: Incremental indexing via content hash
71
+ The system SHALL use content-addressed hashing to avoid re-indexing unchanged files. On each scan, the system SHALL compute `sha256(fileContent)` for each source file, compare with the stored hash in the `documents` table, skip files with matching hashes, re-index files with mismatched hashes, and deactivate documents for files that no longer exist on disk.
72
+
73
+ #### Scenario: Unchanged file
74
+ - **WHEN** a source file has the same content as the last index run
75
+ - **THEN** the file is skipped (no re-chunking, no re-embedding)
76
+
77
+ #### Scenario: Modified file
78
+ - **WHEN** a source file has different content than the last index run
79
+ - **THEN** the old document is replaced with newly chunked and embedded content
80
+
81
+ #### Scenario: Deleted file
82
+ - **WHEN** a previously indexed source file no longer exists on disk
83
+ - **THEN** the corresponding document and its chunks/embeddings are deactivated or removed
84
+
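The three scenarios above amount to a comparison between hashes on disk and hashes in the store. A minimal sketch, assuming file contents and stored hashes are supplied as in-memory maps keyed by path (the real `documents` table access is not shown, and `planScan` is a hypothetical name):

```typescript
import { createHash } from "node:crypto";

// sha256(fileContent) as a lowercase hex string.
function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface ScanPlan {
  reindex: string[];    // new files or files whose hash changed
  skip: string[];       // unchanged files: no re-chunking, no re-embedding
  deactivate: string[]; // previously indexed files that no longer exist
}

// filesOnDisk maps path -> content; storedHashes maps path -> last indexed hash.
function planScan(filesOnDisk: Map<string, string>, storedHashes: Map<string, string>): ScanPlan {
  const plan: ScanPlan = { reindex: [], skip: [], deactivate: [] };
  filesOnDisk.forEach((content, path) => {
    if (storedHashes.get(path) === sha256(content)) {
      plan.skip.push(path);
    } else {
      plan.reindex.push(path);
    }
  });
  storedHashes.forEach((_hash, path) => {
    if (!filesOnDisk.has(path)) plan.deactivate.push(path);
  });
  return plan;
}
```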
85
+ ### Requirement: Project type auto-detection for default extensions
86
+ When `codebase.extensions` is not configured, the system SHALL auto-detect the project type by checking for marker files in the workspace root and select appropriate default extensions. Detection rules SHALL include: `package.json` maps to `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.css`, `.html`; `pyproject.toml`, `setup.py`, or `requirements.txt` maps to `.py`, `.pyi`; `go.mod` maps to `.go`; `Cargo.toml` maps to `.rs`; `pom.xml` or `build.gradle` maps to `.java`, `.kt`; `Gemfile` maps to `.rb`. If no marker files are found, all common extensions SHALL be used as fallback. `.md` files SHALL always be included regardless of project type.
87
+
88
+ #### Scenario: Node.js project detected
89
+ - **WHEN** the workspace root contains `package.json` and `codebase.extensions` is not configured
90
+ - **THEN** the system indexes files with extensions: `.ts`, `.tsx`, `.js`, `.jsx`, `.json`, `.css`, `.html`, `.md`
91
+
92
+ #### Scenario: Python project detected
93
+ - **WHEN** the workspace root contains `pyproject.toml` and `codebase.extensions` is not configured
94
+ - **THEN** the system indexes files with extensions: `.py`, `.pyi`, `.md`
95
+
96
+ #### Scenario: Multiple marker files present
97
+ - **WHEN** the workspace root contains both `package.json` and `pyproject.toml`
98
+ - **THEN** the system merges extensions from both project types
99
+
100
+ #### Scenario: No marker files found
101
+ - **WHEN** the workspace root contains no recognized marker files
102
+ - **THEN** the system uses all common extensions as fallback (`.ts`, `.tsx`, `.js`, `.jsx`, `.py`, `.go`, `.rs`, `.java`, `.kt`, `.rb`, `.md`)
103
+
104
+ #### Scenario: Explicit extensions override auto-detection
105
+ - **WHEN** `codebase.extensions` is set to `[".go", ".proto"]`
106
+ - **THEN** only `.go` and `.proto` files are indexed, regardless of detected project type
107
+
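The detection rules can be sketched as a lookup over the file names found in the workspace root. `resolveExtensions` and `MARKER_RULES` are hypothetical names, and the root listing is passed in rather than read from disk to keep the example self-contained.

```typescript
// Marker files and their default extensions, as listed in the requirement.
const MARKER_RULES: Array<{ markers: string[]; extensions: string[] }> = [
  { markers: ["package.json"], extensions: [".ts", ".tsx", ".js", ".jsx", ".json", ".css", ".html"] },
  { markers: ["pyproject.toml", "setup.py", "requirements.txt"], extensions: [".py", ".pyi"] },
  { markers: ["go.mod"], extensions: [".go"] },
  { markers: ["Cargo.toml"], extensions: [".rs"] },
  { markers: ["pom.xml", "build.gradle"], extensions: [".java", ".kt"] },
  { markers: ["Gemfile"], extensions: [".rb"] },
];

// Fallback list from the "No marker files found" scenario (".md" is appended below).
const FALLBACK_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx", ".py", ".go", ".rs", ".java", ".kt", ".rb"];

function resolveExtensions(rootEntries: string[], configured?: string[]): string[] {
  // Explicit configuration overrides auto-detection entirely.
  if (configured) return configured;
  const present = new Set(rootEntries);
  const detected: string[] = [];
  for (const rule of MARKER_RULES) {
    if (rule.markers.some((marker) => present.has(marker))) detected.push(...rule.extensions);
  }
  const base = detected.length > 0 ? detected : FALLBACK_EXTENSIONS;
  // ".md" is always included for auto-detected projects; the Set removes duplicates.
  return Array.from(new Set([...base, ".md"]));
}
```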
108
+ ### Requirement: Codebase storage budget enforcement
109
+ The system SHALL support a `codebase.maxSize` field (default 2GB) that limits the total storage used by codebase-indexed content. During indexing, the system SHALL track cumulative storage and skip remaining files when the budget would be exceeded. The `memory_status` tool SHALL report current codebase storage usage versus the configured limit. The codebase storage budget SHALL be independent from the session storage budget (`storage.maxSize`).
110
+
111
+ #### Scenario: Indexing within budget
112
+ - **WHEN** codebase storage is at 500MB and `codebase.maxSize` is 2GB
113
+ - **THEN** files continue to be indexed normally
114
+
115
+ #### Scenario: Indexing exceeds budget
116
+ - **WHEN** codebase storage is at 1.9GB and the next file would push it over 2GB
117
+ - **THEN** the file is skipped
118
+ - **THEN** remaining files are skipped
119
+ - **THEN** the index result reports the number of files skipped due to budget
120
+
121
+ #### Scenario: Default budget
122
+ - **WHEN** `codebase.maxSize` is not configured
123
+ - **THEN** the default budget of 2GB is used
124
+
125
+ #### Scenario: Custom budget
126
+ - **WHEN** `codebase.maxSize` is set to `"500MB"`
127
+ - **THEN** indexing stops when codebase storage reaches 500MB
128
+
129
+ #### Scenario: Budget independent from session storage
130
+ - **WHEN** session `storage.maxSize` is 2GB and `codebase.maxSize` is 2GB
131
+ - **THEN** the system can use up to 4GB total (2GB sessions + 2GB codebase)
132
+ - **THEN** codebase budget enforcement does not trigger session eviction
133
+
134
+
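The budget check can be sketched as a running total that stops at the first file that would exceed the limit. This sketch uses file size as a stand-in for stored size, which is an assumption; the actual storage accounting (chunks, embeddings) is not described here. All names are hypothetical.

```typescript
interface BudgetResult {
  indexed: string[];          // paths indexed before the budget was hit
  skippedDueToBudget: number; // the first file that did not fit plus all remaining files
  usedBytes: number;          // codebase storage after this run
}

const DEFAULT_MAX_BYTES = 2 * 1024 * 1024 * 1024; // 2GB default, binary units assumed

function indexWithinBudget(
  files: Array<{ path: string; sizeBytes: number }>,
  usedBytes: number,
  maxBytes: number = DEFAULT_MAX_BYTES,
): BudgetResult {
  const indexed: string[] = [];
  let used = usedBytes;
  for (let i = 0; i < files.length; i++) {
    if (used + files[i].sizeBytes > maxBytes) {
      // Budget would be exceeded: skip this file and everything after it.
      return { indexed, skippedDueToBudget: files.length - i, usedBytes: used };
    }
    used += files[i].sizeBytes;
    indexed.push(files[i].path);
  }
  return { indexed, skippedDueToBudget: 0, usedBytes: used };
}
```

Because the function only sees codebase usage and the codebase limit, it cannot trigger session eviction, which matches the independence scenario.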
135
+ ### Requirement: Max file size guard
136
+ The system SHALL skip source files larger than `codebase.maxFileSize` (default 5MB). A debug-level log message SHALL be emitted for each skipped file indicating the file path and its size.
137
+
138
+ #### Scenario: File under size limit
139
+ - **WHEN** a source file is 3MB and `maxFileSize` is 5MB
140
+ - **THEN** the file is indexed normally
141
+
142
+ #### Scenario: File exceeding size limit
143
+ - **WHEN** a source file is 8MB and `maxFileSize` is 5MB
144
+ - **THEN** the file is skipped
145
+ - **THEN** a debug log is emitted: file path and size
146
+
147
+ #### Scenario: Custom maxFileSize
148
+ - **WHEN** `codebase.maxFileSize` is set to `"50KB"`
149
+ - **THEN** files larger than 50KB are skipped
150
+
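A minimal sketch of the size guard, assuming file sizes are already known and that the logger is any function accepting a string. The `filterBySize` name and the log message wording are hypothetical.

```typescript
interface SourceFile {
  path: string;
  sizeBytes: number;
}

const DEFAULT_MAX_FILE_SIZE = 5 * 1024 * 1024; // 5MB default, binary units assumed

// Keep files at or under the limit; emit one debug message per skipped file.
function filterBySize(
  files: SourceFile[],
  maxFileSizeBytes: number = DEFAULT_MAX_FILE_SIZE,
  debug: (message: string) => void = (message) => console.debug(message),
): SourceFile[] {
  return files.filter((file) => {
    if (file.sizeBytes > maxFileSizeBytes) {
      debug(`Skipping ${file.path}: ${file.sizeBytes} bytes exceeds limit of ${maxFileSizeBytes} bytes`);
      return false;
    }
    return true;
  });
}
```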
151
+ ### Requirement: Watcher integration for codebase files
152
+ When codebase indexing is enabled, the file watcher SHALL add the workspace directory as an additional watch target with the configured exclude patterns applied as ignored paths. Source file changes SHALL trigger the same dirty-flag and debounced reindex cycle used for collection files. The watcher SHALL only watch files matching the configured or auto-detected extensions.
153
+
154
+ #### Scenario: Source file modified
155
+ - **WHEN** a watched source file is saved with new content
156
+ - **THEN** the watcher detects the change
157
+ - **THEN** the file is re-indexed after the debounce period
158
+
159
+ #### Scenario: New source file created
160
+ - **WHEN** a new `.ts` file is created in the workspace (with codebase enabled for a Node.js project)
161
+ - **THEN** the watcher detects the new file
162
+ - **THEN** the file is indexed after the debounce period
163
+
164
+ #### Scenario: Excluded directory not watched
165
+ - **WHEN** a file changes inside `node_modules/`
166
+ - **THEN** the watcher does not detect the change
167
+ - **THEN** no reindex is triggered
168
+
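The dirty-flag and debounce cycle can be illustrated independently of any watcher library. In this sketch the watcher itself is out of scope: something else is assumed to call `markDirty` for each change event that passes the extension and exclude filters. The class name and the 2000 ms default delay are assumptions.

```typescript
class DebouncedReindexer {
  private dirty = new Set<string>();
  private timer: ReturnType<typeof setTimeout> | null = null;
  private reindex: (paths: string[]) => void;
  private delayMs: number;

  constructor(reindex: (paths: string[]) => void, delayMs = 2000) {
    this.reindex = reindex;
    this.delayMs = delayMs;
  }

  // Record a changed path and restart the debounce timer.
  markDirty(path: string): void {
    this.dirty.add(path);
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.delayMs);
  }

  // Reindex everything marked dirty since the last flush, once.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.dirty.size === 0) return;
    const paths = Array.from(this.dirty);
    this.dirty.clear();
    this.reindex(paths);
  }
}
```

Several saves of the same file within the debounce window collapse into a single reindex of that path.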
169
+ ### Requirement: Codebase documents tagged with workspace project hash
170
+ All documents indexed from codebase files SHALL be tagged with the current workspace's `projectHash` in the `project_hash` column. This ensures codebase search results are scoped to the current workspace by default.
171
+
172
+ #### Scenario: Codebase document indexed
173
+ - **WHEN** a source file is indexed with `currentProjectHash = "abc123def456"`
174
+ - **THEN** the resulting document has `project_hash = "abc123def456"`
175
+
176
+ #### Scenario: Codebase search scoped to workspace
177
+ - **WHEN** `memory_search` is called with default workspace scoping
178
+ - **THEN** codebase documents from the current workspace are included in results
179
+ - **THEN** codebase documents from other workspaces are excluded
180
+
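Workspace scoping then reduces to an equality filter on the `project_hash` column. A minimal sketch over in-memory rows, with a hypothetical `StoredDocument` shape (only `project_hash` and the collection name come from the spec):

```typescript
interface StoredDocument {
  id: string;
  collection: string;   // "codebase" for documents indexed from source files
  project_hash: string; // workspace that produced the document
}

// Default scoping: keep only codebase documents tagged with the current workspace's hash.
function scopeCodebaseToWorkspace(documents: StoredDocument[], currentProjectHash: string): StoredDocument[] {
  return documents.filter(
    (doc) => doc.collection === "codebase" && doc.project_hash === currentProjectHash,
  );
}
```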
181
+ ### Requirement: Codebase collection identified in search results
182
+ Documents from the codebase collection SHALL be identifiable in search results via their collection name `"codebase"`. This allows agents to distinguish between session-based memory and source code results.
183
+
184
+ #### Scenario: Search returns codebase and session results
185
+ - **WHEN** a search query matches both a session document and a codebase document
186
+ - **THEN** the codebase result has `collection: "codebase"` in its metadata
187
+ - **THEN** the session result has a different collection identifier
@@ -0,0 +1,36 @@
1
+ ## ADDED Requirements
2
+
3
+ ### Requirement: memory_index_codebase tool for on-demand indexing
4
+ The MCP server SHALL register a `memory_index_codebase` tool that triggers a full codebase scan and index. The tool SHALL accept no required parameters. It SHALL return a summary including: number of files scanned, number of files indexed (new or updated), number of files skipped (unchanged), number of files skipped (too large), and total chunks created. If codebase indexing is not enabled in config, the tool SHALL return an error message indicating that codebase indexing is disabled.
5
+
6
+ #### Scenario: Successful codebase index
7
+ - **WHEN** `memory_index_codebase` is called with codebase indexing enabled and source files exist
8
+ - **THEN** the response includes counts for files scanned, indexed, skipped (unchanged), skipped (too large), and chunks created
9
+ - **THEN** all matching source files are indexed into the store with the current workspace's projectHash
10
+
11
+ #### Scenario: Codebase indexing disabled
12
+ - **WHEN** `memory_index_codebase` is called but `codebase.enabled` is not set or is `false`
13
+ - **THEN** the response contains an error message: "Codebase indexing is not enabled. Set codebase.enabled: true in config.yml"
14
+
15
+ #### Scenario: No matching files found
16
+ - **WHEN** `memory_index_codebase` is called but no files match the configured extensions and exclude patterns
17
+ - **THEN** the response indicates 0 files scanned and 0 files indexed
18
+
19
+ ### Requirement: memory_status includes codebase statistics
20
+ The `memory_status` tool response SHALL include a `codebase` section when codebase indexing is enabled. This section SHALL report: whether codebase indexing is enabled, number of indexed codebase documents, number of codebase chunks, configured extensions (resolved after auto-detection), and configured exclude pattern count.
21
+
22
+ #### Scenario: memory_status with codebase enabled
23
+ - **WHEN** `memory_status` is called and codebase indexing is enabled with indexed files
24
+ - **THEN** the response includes a `codebase` section with `enabled: true`, document count, chunk count, resolved extensions list, and exclude pattern count
25
+
26
+ #### Scenario: memory_status with codebase disabled
27
+ - **WHEN** `memory_status` is called and codebase indexing is not enabled
28
+ - **THEN** the response includes `codebase: { enabled: false }` or omits the codebase section entirely
29
+
30
+ ### Requirement: memory_index_codebase tool schema registration
31
+ The MCP tool registration for `memory_index_codebase` SHALL include a description explaining its purpose: triggering a full scan and index of source code files in the current workspace. The input schema SHALL have no required parameters.
32
+
33
+ #### Scenario: Tool schema advertised to MCP client
34
+ - **WHEN** an MCP client lists available tools
35
+ - **THEN** `memory_index_codebase` appears with a description mentioning codebase/source code indexing
36
+ - **THEN** the input schema has no required parameters
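
A tool definition satisfying this requirement might look like the object below. The `name`, `description`, and `inputSchema` fields follow the general shape of MCP tool listings; the description wording is illustrative, and how the object is registered with the server is not shown.

```typescript
// Illustrative tool definition: an object-typed input schema with no properties
// and no required parameters.
const memoryIndexCodebaseTool = {
  name: "memory_index_codebase",
  description:
    "Trigger a full scan and index of source code files in the current workspace (codebase indexing).",
  inputSchema: {
    type: "object",
    properties: {} as Record<string, unknown>,
    required: [] as string[],
  },
};
```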