@khivi/opencode-codebase-index 0.5.2 → 0.5.3

package/README.md CHANGED
@@ -1,658 +1,197 @@
1
- # opencode-codebase-index
1
+ # @khivi/opencode-codebase-index
2
2
 
3
- [![npm version](https://img.shields.io/npm/v/opencode-codebase-index.svg)](https://www.npmjs.com/package/opencode-codebase-index)
3
+ [![npm version](https://img.shields.io/npm/v/@khivi/opencode-codebase-index.svg)](https://www.npmjs.com/package/@khivi/opencode-codebase-index)
4
4
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
- [![Downloads](https://img.shields.io/npm/dm/opencode-codebase-index.svg)](https://www.npmjs.com/package/opencode-codebase-index)
6
- [![Build Status](https://img.shields.io/github/actions/workflow/status/Helweg/opencode-codebase-index/ci.yml?branch=main)](https://github.com/Helweg/opencode-codebase-index/actions)
7
5
  [![Node.js](https://img.shields.io/badge/node-%3E%3D18-brightgreen.svg)](https://nodejs.org/)
8
6
 
9
- > **Stop grepping for concepts. Start searching for meaning.**
10
-
11
- **opencode-codebase-index** brings semantic understanding to your [OpenCode](https://opencode.ai) workflow — and now to any MCP-compatible client like Cursor, Claude Code, and Windsurf. Instead of guessing function names or grepping for keywords, ask your codebase questions in plain English.
12
-
13
- ## 🚀 Why Use This?
14
-
15
- - 🧠 **Semantic Search**: Finds "user authentication" logic even if the function is named `check_creds`.
16
- - ⚡ **Blazing Fast Indexing**: Powered by a Rust native module using `tree-sitter` and `usearch`. Incremental updates take milliseconds.
17
- - 🌿 **Branch-Aware**: Seamlessly handles git branch switches — reuses embeddings, filters stale results.
18
- - 🔒 **Privacy Focused**: Your vector index is stored locally in your project.
19
- - 🔌 **Model Agnostic**: Works out-of-the-box with GitHub Copilot, OpenAI, Gemini, or local Ollama models.
20
- - 🌐 **MCP Server**: Use with Cursor, Claude Code, Windsurf, or any MCP-compatible client — index once, search from anywhere.
21
-
22
- ## ⚡ Quick Start
23
-
24
- 1. **Install the plugin**
25
- ```bash
26
- npm install opencode-codebase-index
27
- ```
28
-
29
- 2. **Add to `opencode.json`**
30
- ```json
31
- {
32
- "plugin": ["opencode-codebase-index"]
33
- }
34
- ```
35
-
36
- 3. **Index your codebase**
37
- Run `/index` or ask the agent to index your codebase. This only needs to be done once — subsequent updates are incremental.
38
-
39
- 4. **Start Searching**
40
- Ask:
41
- > "Find the function that handles credit card validation errors"
42
-
43
- ## 🌐 MCP Server (Cursor, Claude Code, Windsurf, etc.)
44
-
45
- Use the same semantic search from any MCP-compatible client. Index once, search from anywhere.
46
-
47
- 1. **Install dependencies**
48
- ```bash
49
- npm install opencode-codebase-index @modelcontextprotocol/sdk zod
50
- ```
51
-
52
- 2. **Configure your MCP client**
53
-
54
- **Cursor** (`.cursor/mcp.json`):
55
- ```json
56
- {
57
- "mcpServers": {
58
- "codebase-index": {
59
- "command": "npx",
60
- "args": ["opencode-codebase-index-mcp", "--project", "/path/to/your/project"]
61
- }
62
- }
63
- }
64
- ```
65
-
66
- **Claude Code** (`claude_desktop_config.json`):
67
- ```json
68
- {
69
- "mcpServers": {
70
- "codebase-index": {
71
- "command": "npx",
72
- "args": ["opencode-codebase-index-mcp", "--project", "/path/to/your/project"]
73
- }
74
- }
75
- }
76
- ```
77
-
78
- 3. **CLI options**
79
- ```bash
80
- npx opencode-codebase-index-mcp --project /path/to/repo # specify project root
81
- npx opencode-codebase-index-mcp --config /path/to/config # custom config file
82
- npx opencode-codebase-index-mcp # uses current directory
83
- ```
84
-
85
- The MCP server exposes all 9 tools (`codebase_search`, `codebase_peek`, `find_similar`, `call_graph`, `index_codebase`, `index_status`, `index_health_check`, `index_metrics`, `index_logs`) and 4 prompts (`search`, `find`, `index`, `status`).
86
-
87
- The MCP dependencies (`@modelcontextprotocol/sdk`, `zod`) are optional peer dependencies — they're only needed if you use the MCP server.
88
-
89
- ## 🔍 See It In Action
90
-
91
- **Scenario**: You're new to a codebase and need to fix a bug in the payment flow.
92
-
93
- **Without Plugin (grep)**:
94
- - `grep "payment" .` → 500 results (too many)
95
- - `grep "card" .` → 200 results (mostly UI)
96
- - `grep "stripe" .` → 50 results (maybe?)
97
-
98
- **With `opencode-codebase-index`**:
99
- You ask: *"Where is the payment validation logic?"*
100
-
101
- Plugin returns:
102
- ```text
103
- src/services/billing.ts:45 (Class PaymentValidator)
104
- src/utils/stripe.ts:12 (Function validateCardToken)
105
- src/api/checkout.ts:89 (Route handler for /pay)
106
- ```
107
-
108
- ## 🎯 When to Use What
109
-
110
- | Scenario | Tool | Why |
111
- |----------|------|-----|
112
- | Don't know the function name | `codebase_search` | Semantic search finds by meaning |
113
- | Exploring unfamiliar codebase | `codebase_search` | Discovers related code across files |
114
- | Just need to find locations | `codebase_peek` | Returns metadata only, saves ~90% tokens |
115
- | Understand code flow | `call_graph` | Find callers/callees of any function |
116
- | Know exact identifier | `grep` | Faster, finds all occurrences |
117
- | Need ALL matches | `grep` | Semantic returns top N only |
118
- | Mixed discovery + precision | `/find` (hybrid) | Best of both worlds |
119
-
120
- **Rule of thumb**: `codebase_peek` to find locations → `Read` to examine → `grep` for precision.
121
-
122
- ## 📊 Token Usage
123
-
124
- In our testing across open-source codebases (axios, express), we observed **up to 90% reduction in token usage** for conceptual queries like *"find the error handling middleware"*.
125
-
126
- ### Why It Saves Tokens
127
-
128
- - **Without plugin**: Agent explores files, reads code, backtracks, explores more
129
- - **With plugin**: Semantic search returns relevant code immediately → less exploration
130
-
131
- ### Key Takeaways
132
-
133
- 1. **Significant savings possible**: Up to 90% reduction in the best cases
134
- 2. **Results vary**: Savings depend on query type, codebase structure, and agent behavior
135
- 3. **Best for discovery**: Conceptual queries benefit most; exact identifier lookups should use grep
136
- 4. **Complements existing tools**: Provides a faster initial signal, doesn't replace grep/explore
137
-
138
- ### When the Plugin Helps Most
139
-
140
- - **Conceptual queries**: "Where is the authentication logic?" (no keywords to grep for)
141
- - **Unfamiliar codebases**: You don't know what to search for yet
142
- - **Large codebases**: Semantic search scales better than exhaustive exploration
143
-
144
- ## 🛠️ How It Works
145
-
146
- ```mermaid
147
- graph TD
148
- subgraph Indexing
149
- A[Source Code] -->|Tree-sitter| B[Semantic Chunks]
150
- B -->|Embedding Model| C[Vectors]
151
- C -->|uSearch| D[(Vector Store)]
152
- C -->|SQLite| G[(Embeddings DB)]
153
- B -->|BM25| E[(Inverted Index)]
154
- B -->|Branch Catalog| G
155
- end
156
-
157
- subgraph Searching
158
- Q[User Query] -->|Embedding Model| V[Query Vector]
159
- V -->|Cosine Similarity| D
160
- Q -->|BM25| E
161
- D --> F[Hybrid Fusion RRF/Weighted]
162
- E --> F
163
- F --> X[Deterministic Rerank]
164
- G -->|Branch + Metadata Filters| X
165
- X --> R[Ranked Results]
166
- end
167
- ```
168
-
169
- 1. **Parsing**: We use `tree-sitter` to intelligently parse your code into meaningful blocks (functions, classes, interfaces). JSDoc comments and docstrings are automatically included with their associated code.
170
-
171
- **Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML
172
- 2. **Chunking**: Large blocks are split with overlapping windows to preserve context across chunk boundaries.
173
- 3. **Embedding**: These blocks are converted into vector representations using your configured AI provider.
174
- 4. **Storage**: Embeddings are stored in SQLite (deduplicated by content hash) and vectors in `usearch` with F16 quantization for 50% memory savings. A branch catalog tracks which chunks exist on each branch.
175
- 5. **Hybrid Search**: Combines semantic similarity (vectors) with BM25 keyword matching, fuses (`rrf` default, `weighted` fallback), applies deterministic rerank, then filters by current branch/metadata.
176
-
177
- **Performance characteristics:**
178
- - **Incremental indexing**: ~50ms check time — only re-embeds changed files
179
- - **Smart chunking**: Understands code structure to keep functions whole, with overlap for context
180
- - **Native speed**: Core logic written in Rust for maximum performance
181
- - **Memory efficient**: F16 vector quantization reduces index size by 50%
182
- - **Branch-aware**: Automatically tracks which chunks exist on each git branch
183
- - **Provider validation**: Detects embedding provider/model changes and requires rebuild to prevent garbage results
184
-
185
- ## 🌿 Branch-Aware Indexing
7
+ > Fork of [opencode-codebase-index](https://github.com/Helweg/opencode-codebase-index) by Kenneth Helweg
186
8
 
187
- The plugin automatically detects git branches and optimizes indexing across branch switches.
9
+ Semantic codebase indexing and search — with a standalone **git-native CLI** for incremental indexing via git hooks and worktree-aware querying.
188
10
 
189
- ### How It Works
11
+ ## What this fork adds
190
12
 
191
- When you switch branches, code changes but embeddings for unchanged content remain the same. The plugin:
13
+ - **Standalone CLI** (`codebase-index`) runs without an MCP host
14
+ - **Git hook integration** — automatic background reindexing on commit, merge, checkout, and rewrite
15
+ - **Blob SHA incremental indexing** — diffs via `git hash-object` instead of reading files, skips unchanged files instantly
16
+ - **Worktree-aware scoping** — `git ls-files` scopes queries to the current worktree checkout
192
17
 
193
- 1. **Stores embeddings by content hash**: Embeddings are deduplicated across branches
194
- 2. **Tracks branch membership**: A lightweight catalog tracks which chunks exist on each branch
195
- 3. **Filters search results**: Queries only return results relevant to the current branch
18
+ The upstream MCP server, OpenCode plugin, and all original features are preserved.
196
19
 
197
- ### Benefits
20
+ ## Quick Start
198
21
 
199
- | Scenario | Without Branch Awareness | With Branch Awareness |
200
- |----------|-------------------------|----------------------|
201
- | Switch to feature branch | Re-index everything | Instant — reuse existing embeddings |
202
- | Return to main | Re-index everything | Instant — catalog already exists |
203
- | Search on branch | May return stale results | Only returns current branch's code |
22
+ ### Install
204
23
 
205
- ### Automatic Behavior
206
-
207
- - **Branch detection**: Automatically reads from `.git/HEAD`
208
- - **Re-indexing on switch**: Triggers when you switch branches (via file watcher)
209
- - **Legacy migration**: Automatically migrates old indexes on first run
210
- - **Garbage collection**: Health check removes orphaned embeddings and chunks
211
-
212
- ### Storage Structure
213
-
214
- ```
215
- .opencode/index/
216
- ├── codebase.db # SQLite: embeddings, chunks, branch catalog, symbols, call edges
217
- ├── vectors.usearch # Vector index (uSearch)
218
- ├── inverted-index.json # BM25 keyword index
219
- └── file-hashes.json # File change detection
24
+ ```bash
25
+ npm install @khivi/opencode-codebase-index
220
26
  ```
221
27
 
222
- ## 🧰 Tools Available
223
-
224
- The plugin exposes these tools to the OpenCode agent:
225
-
226
- ### `codebase_search`
227
- **The primary tool.** Searches code by describing behavior.
228
- - **Use for**: Discovery, understanding flows, finding logic when you don't know the names.
229
- - **Example**: `"find the middleware that sanitizes input"`
230
- - **Ranking path**: hybrid retrieval → fusion (`search.fusionStrategy`) → deterministic rerank (`search.rerankTopN`) → filters
231
-
232
- **Writing good queries:**
233
-
234
- | ✅ Good queries (describe behavior) | ❌ Bad queries (too vague) |
235
- |-------------------------------------|---------------------------|
236
- | "function that validates email format" | "email" |
237
- | "error handling for failed API calls" | "error" |
238
- | "middleware that checks authentication" | "auth middleware" |
239
- | "code that calculates shipping costs" | "shipping" |
240
- | "where user permissions are checked" | "permissions" |
241
-
242
- ### `codebase_peek`
243
- **Token-efficient discovery.** Returns only metadata (file, line, name, type) without code content.
244
- - **Use for**: Finding WHERE code is before deciding what to read. Saves ~90% tokens vs `codebase_search`.
245
- - **Ranking path**: same hybrid ranking path as `codebase_search` (metadata-only output)
246
- - **Example output**:
247
- ```
248
- [1] function "validatePayment" at src/billing.ts:45-67 (score: 0.92)
249
- [2] class "PaymentProcessor" at src/processor.ts:12-89 (score: 0.87)
250
-
251
- Use Read tool to examine specific files.
252
- ```
253
- - **Workflow**: `codebase_peek` → find locations → `Read` specific files
254
-
255
- ### `find_similar`
256
- Find code similar to a provided snippet.
257
- - **Use for**: Duplicate detection, refactor prep, pattern mining.
258
- - **Ranking path**: semantic retrieval only + deterministic rerank (no BM25, no RRF).
259
-
260
- ### `index_codebase`
261
- Manually trigger indexing.
262
- - **Use for**: Forcing a re-index or checking stats.
263
- - **Parameters**: `force` (rebuild all), `estimateOnly` (check costs), `verbose` (show skipped files and parse failures).
264
-
265
- ### `index_status`
266
- Checks if the index is ready and healthy.
267
-
268
- ### `index_health_check`
269
- Maintenance tool to remove stale entries from deleted files and orphaned embeddings/chunks from the database.
270
-
271
- ### `index_metrics`
272
- Returns collected metrics about indexing and search performance. Requires `debug.enabled` and `debug.metrics` to be `true`.
273
- - **Metrics include**: Files indexed, chunks created, cache hit rate, search timing breakdown, GC stats, embedding API call stats.
274
-
275
- ### `index_logs`
276
- Returns recent debug logs with optional filtering.
277
- - **Parameters**: `category` (optional: `search`, `embedding`, `cache`, `gc`, `branch`), `level` (optional: `error`, `warn`, `info`, `debug`), `limit` (default: 50).
278
-
279
- ### `call_graph`
280
- Query the call graph to find callers or callees of a function/method. Automatically built during indexing for TypeScript, JavaScript, Python, Go, and Rust.
281
- - **Use for**: Understanding code flow, tracing dependencies, impact analysis.
282
- - **Parameters**: `name` (function name), `direction` (`callers` or `callees`), `symbolId` (required for `callees`, returned by previous queries).
283
- - **Example**: Find who calls `validateToken` → `call_graph(name="validateToken", direction="callers")`
284
-
285
- ## 🎮 Slash Commands
286
-
287
- The plugin automatically registers these slash commands:
28
+ ### Set up an embedding provider
288
29
 
289
- | Command | Description |
290
- | ------- | ----------- |
291
- | `/search <query>` | **Pure Semantic Search**. Best for "How does X work?" |
292
- | `/find <query>` | **Hybrid Search**. Combines semantic search + grep. Best for "Find usage of X". |
293
- | `/index` | **Update Index**. Forces a refresh of the codebase index. |
294
- | `/status` | **Check Status**. Shows if indexed, chunk count, and provider info. |
30
+ You need one embedding provider. The easiest local option:
295
31
 
296
- ## ⚙️ Configuration
297
-
298
- Zero-config by default (uses `auto` mode). Customize in `.opencode/codebase-index.json`:
299
-
300
- ```json
301
- {
302
- "embeddingProvider": "auto",
303
- "scope": "project",
304
- "indexing": {
305
- "autoIndex": false,
306
- "watchFiles": true,
307
- "maxFileSize": 1048576,
308
- "maxChunksPerFile": 100,
309
- "semanticOnly": false,
310
- "autoGc": true,
311
- "gcIntervalDays": 7,
312
- "gcOrphanThreshold": 100,
313
- "requireProjectMarker": true
314
- },
315
- "search": {
316
- "maxResults": 20,
317
- "minScore": 0.1,
318
- "hybridWeight": 0.5,
319
- "fusionStrategy": "rrf",
320
- "rrfK": 60,
321
- "rerankTopN": 20,
322
- "contextLines": 0
323
- },
324
- "debug": {
325
- "enabled": false,
326
- "logLevel": "info",
327
- "metrics": false
328
- }
329
- }
32
+ ```bash
33
+ brew install ollama # install Ollama
34
+ ollama pull nomic-embed-text # pull the embedding model (~274MB)
330
35
  ```
331
36
 
332
- ### Options Reference
333
-
334
- | Option | Default | Description |
335
- |--------|---------|-------------|
336
- | `embeddingProvider` | `"auto"` | Which AI to use: `auto`, `github-copilot`, `openai`, `google`, `ollama`, `custom` |
337
- | `scope` | `"project"` | `project` = index per repo, `global` = shared index across repos |
338
- | **indexing** | | |
339
- | `autoIndex` | `false` | Automatically index on plugin load |
340
- | `watchFiles` | `true` | Re-index when files change |
341
- | `maxFileSize` | `1048576` | Skip files larger than this (bytes). Default: 1MB |
342
- | `maxChunksPerFile` | `100` | Maximum chunks to index per file (controls token costs for large files) |
343
- | `semanticOnly` | `false` | When `true`, only index semantic nodes (functions, classes) and skip generic blocks |
344
- | `retries` | `3` | Number of retry attempts for failed embedding API calls |
345
- | `retryDelayMs` | `1000` | Delay between retries in milliseconds |
346
- | `autoGc` | `true` | Automatically run garbage collection to remove orphaned embeddings/chunks |
347
- | `gcIntervalDays` | `7` | Run GC on initialization if last GC was more than N days ago |
348
- | `gcOrphanThreshold` | `100` | Run GC after indexing if orphan count exceeds this threshold |
349
- | `requireProjectMarker` | `true` | Require a project marker (`.git`, `package.json`, etc.) to enable file watching and auto-indexing. Prevents accidentally indexing large directories like home. Set to `false` to index any directory. |
350
- | **search** | | |
351
- | `maxResults` | `20` | Maximum results to return |
352
- | `minScore` | `0.1` | Minimum similarity score (0-1). Lower = more results |
353
- | `hybridWeight` | `0.5` | Balance between keyword (1.0) and semantic (0.0) search |
354
- | `fusionStrategy` | `"rrf"` | Hybrid fusion mode: `"rrf"` (rank-based reciprocal rank fusion) or `"weighted"` (legacy score blending fallback) |
355
- | `rrfK` | `60` | RRF smoothing constant. Higher values flatten rank impact, lower values prioritize top-ranked candidates more strongly |
356
- | `rerankTopN` | `20` | Deterministic rerank depth cap. Applies lightweight name/path/chunk-type rerank to top-N only |
357
- | `contextLines` | `0` | Extra lines to include before/after each match |
358
- | **debug** | | |
359
- | `enabled` | `false` | Enable debug logging and metrics collection |
360
- | `logLevel` | `"info"` | Log level: `error`, `warn`, `info`, `debug` |
361
- | `logSearch` | `true` | Log search operations with timing breakdown |
362
- | `logEmbedding` | `true` | Log embedding API calls (success, error, rate-limit) |
363
- | `logCache` | `true` | Log cache hits and misses |
364
- | `logGc` | `true` | Log garbage collection operations |
365
- | `logBranch` | `true` | Log branch detection and switches |
366
- | `metrics` | `false` | Enable metrics collection (indexing stats, search timing, cache performance) |
367
-
368
- ### Retrieval ranking behavior (Phase 1)
369
-
370
- - `codebase_search` and `codebase_peek` use the hybrid path: semantic + keyword retrieval → fusion (`fusionStrategy`) → deterministic rerank (`rerankTopN`) → filtering.
371
- - `find_similar` stays semantic-only: semantic retrieval + deterministic rerank only (no keyword retrieval, no RRF).
372
- - For compatibility rollbacks, set `search.fusionStrategy` to `"weighted"` to use the legacy weighted fusion path.
373
- - Retrieval benchmark artifacts are separated by role:
374
- - baseline (versioned): `benchmarks/baselines/retrieval-baseline.json`
375
- - latest candidate run (generated): `benchmark-results/retrieval-candidate.json`
376
-
377
- ### Embedding Providers
378
- The plugin automatically detects available credentials in this order:
379
- 1. **GitHub Copilot** (Free if you have it)
380
- 2. **OpenAI** (Standard Embeddings)
381
- 3. **Google** (Gemini Embeddings)
382
- 4. **Ollama** (Local/Private - requires `nomic-embed-text`)
383
-
384
- You can also use **Custom** to connect any OpenAI-compatible embedding endpoint (llama.cpp, vLLM, text-embeddings-inference, LiteLLM, etc.).
385
-
386
- ### Rate Limits by Provider
387
-
388
- Each provider has different rate limits. The plugin automatically adjusts concurrency and delays:
389
-
390
- | Provider | Concurrency | Delay | Best For |
391
- |----------|-------------|-------|----------|
392
- | **GitHub Copilot** | 1 | 4s | Small codebases (<1k files) |
393
- | **OpenAI** | 3 | 500ms | Medium codebases |
394
- | **Google** | 5 | 200ms | Medium-large codebases |
395
- | **Ollama** | 5 | None | Large codebases (10k+ files) |
396
- | **Custom** | 3 | 1s | Any OpenAI-compatible endpoint |
397
-
398
- **For large codebases**, use Ollama locally to avoid rate limits:
37
+ Or use a cloud provider:
399
38
 
400
39
  ```bash
401
- # Install the embedding model
402
- ollama pull nomic-embed-text
40
+ export OPENAI_API_KEY=sk-... # OpenAI
41
+ # or
42
+ export GOOGLE_API_KEY=... # Google
43
+ # or have an active GitHub Copilot subscription
403
44
  ```
404
45
 
405
- ```json
406
- // .opencode/codebase-index.json
407
- {
408
- "embeddingProvider": "ollama"
409
- }
410
- ```
411
-
412
- ## 📈 Performance
46
+ The package auto-detects whichever provider is available.
413
47
 
414
- The plugin is built for speed with a Rust native module. Here are typical performance numbers (Apple M1):
48
+ ### Set up git hooks (one-time, per repo)
415
49
 
416
- ### Parsing (tree-sitter)
417
-
418
- | Files | Chunks | Time |
419
- |-------|--------|------|
420
- | 100 | 1,200 | ~7ms |
421
- | 500 | 6,000 | ~32ms |
50
+ ```bash
51
+ codebase-index install
52
+ ```
422
53
 
423
- ### Vector Search (usearch)
54
+ This installs hooks in `.git/hooks/` (worktree-aware via `--git-common-dir`):
55
+ - `post-commit` — reindex after commits
56
+ - `post-merge` — reindex after merges (symlink to post-commit)
57
+ - `post-rewrite` — reindex after rebases (symlink to post-commit)
58
+ - `post-checkout` — reindex on branch switch (with `[ "$3" = "1" ]` guard)
424
59
 
425
- | Index Size | Search Time | Throughput |
426
- |------------|-------------|------------|
427
- | 1,000 vectors | 0.7ms | 1,400 ops/sec |
428
- | 5,000 vectors | 1.2ms | 850 ops/sec |
429
- | 10,000 vectors | 1.3ms | 780 ops/sec |
60
+ All hooks run `codebase-index incremental` in the background (`&`).
430
61
 
431
- ### Database Operations (SQLite with batch)
62
+ ### Full index
432
63
 
433
- | Operation | 1,000 items | 10,000 items |
434
- |-----------|-------------|--------------|
435
- | Insert chunks | 4ms | 44ms |
436
- | Add to branch | 2ms | 22ms |
437
- | Check embedding exists | <0.01ms | <0.01ms |
64
+ ```bash
65
+ codebase-index index
66
+ ```
438
67
 
439
- ### Batch vs Sequential Performance
68
+ ### Incremental update (what the hooks run)
440
69
 
441
- Batch operations provide significant speedups:
70
+ ```bash
71
+ codebase-index incremental
72
+ ```
442
73
 
443
- | Operation | Sequential | Batch | Speedup |
444
- |-----------|------------|-------|---------|
445
- | Insert 1,000 chunks | 38ms | 4ms | **~10x** |
446
- | Add 1,000 to branch | 29ms | 2ms | **~14x** |
447
- | Insert 1,000 embeddings | 59ms | 40ms | **~1.5x** |
74
+ This compares the blob SHAs of tracked files against the stored hashes and triggers the indexer only when files have actually changed.
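
As an illustration of the blob SHA comparison described above, here is a minimal sketch. It assumes a repository using Git's default SHA-1 object format and a hypothetical path-to-SHA map as the stored state; `gitBlobSha`, `diffHashes`, and `HashMap` are illustrative names, not the package's API, and the real layout of `file-hashes.json` may differ.

```typescript
import { createHash } from "node:crypto";

// Git's blob object ID (SHA-1 repositories): SHA-1 over the header
// "blob <byte length>\0" followed by the raw file content.
function gitBlobSha(content: Buffer): string {
  const header = Buffer.from(`blob ${content.length}\0`);
  return createHash("sha1").update(header).update(content).digest("hex");
}

// Hypothetical stored state: file path -> blob SHA.
type HashMap = Record<string, string>;

interface HashDiff {
  changed: string[]; // new or modified paths that need re-indexing
  removed: string[]; // paths that no longer exist in the current tree
}

// Compare stored hashes with the current ones; an empty result means
// the indexer does not need to run at all.
function diffHashes(stored: HashMap, current: HashMap): HashDiff {
  const changed = Object.keys(current)
    .filter((path) => stored[path] !== current[path])
    .sort();
  const removed = Object.keys(stored)
    .filter((path) => !(path in current))
    .sort();
  return { changed, removed };
}
```

The hash function should agree with `git hash-object` for the same bytes, which is what lets a stored SHA stand in for the file content when deciding whether to re-embed.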
448
75
 
449
- Run benchmarks yourself: `npx tsx benchmarks/run.ts`
76
+ ### Query
450
77
 
451
- ## 🎯 Choosing a Provider
78
+ ```bash
79
+ codebase-index query "authentication middleware" --limit 5
80
+ ```
452
81
 
453
- Use this decision tree to pick the right embedding provider:
82
+ Results are scoped to the current worktree's `git ls-files`:
454
83
 
455
84
  ```
456
- ┌─────────────────────────┐
457
- │ Do you have Copilot? │
458
- └───────────┬─────────────┘
459
- ┌─────┴─────┐
460
- YES NO
461
- │ │
462
- ┌───────────▼───────┐ │
463
- │ Codebase < 1k │ │
464
- │ files? │ │
465
- └─────────┬─────────┘ │
466
- ┌─────┴─────┐ │
467
- YES NO │
468
- │ │ │
469
- ▼ │ │
470
- ┌──────────┐ │ │
471
- │ Copilot │ │ │
472
- │ (free) │ │ │
473
- └──────────┘ │ │
474
- ▼ ▼
475
- ┌─────────────────────────┐
476
- │ Need fastest indexing? │
477
- └───────────┬─────────────┘
478
- ┌─────┴─────┐
479
- YES NO
480
- │ │
481
- ▼ ▼
482
- ┌──────────┐ ┌──────────────┐
483
- │ Ollama │ │ OpenAI or │
484
- │ (local) │ │ Google │
485
- └──────────┘ └──────────────┘
85
+ src/auth/validator.ts:45 validateToken 0.923
86
+ src/middleware/auth.ts:12 authMiddleware 0.891
87
+ src/api/login.ts:89 handleLogin 0.847
486
88
  ```
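
One way to picture the worktree scoping above is as a filter over search hits. The sketch below is an assumption about the approach, not the package's code: `SearchHit`, `parseLsFiles`, `listTrackedFiles`, and `scopeToWorktree` are hypothetical names. It relies only on `git ls-files -z`, which prints tracked paths separated by NUL bytes.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical result shape, mirroring the columns in the sample output.
interface SearchHit {
  path: string;
  line: number;
  symbol: string;
  score: number;
}

// Split NUL-separated `git ls-files -z` output into a set of paths.
function parseLsFiles(output: string): Set<string> {
  return new Set(output.split("\0").filter((entry) => entry.length > 0));
}

// Ask git which files are tracked in the worktree rooted at `cwd`.
function listTrackedFiles(cwd: string): Set<string> {
  const output = execFileSync("git", ["ls-files", "-z"], { cwd, encoding: "utf8" });
  return parseLsFiles(output);
}

// Keep only hits whose file is tracked in the current worktree.
function scopeToWorktree(hits: SearchHit[], tracked: Set<string>): SearchHit[] {
  return hits.filter((hit) => tracked.has(hit.path));
}
```

A caller would pass `listTrackedFiles(process.cwd())` as the `tracked` set, so hits from files that exist only in another worktree or branch are dropped.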
487
89
 
488
- ### Provider Comparison
90
+ ### Status
489
91
 
490
- | Provider | Speed | Cost | Privacy | Best For |
491
- |----------|-------|------|---------|----------|
492
- | **Ollama** | Fastest | Free | Full | Large codebases, privacy-sensitive |
493
- | **GitHub Copilot** | Slow (rate limited) | Free* | Cloud | Small codebases, existing subscribers |
494
- | **OpenAI** | Medium | ~$0.0001/1K tokens | Cloud | General use |
495
- | **Google** | Fast | Free tier available | Cloud | Medium-large codebases |
496
- | **Custom** | Varies | Varies | Varies | Self-hosted or third-party endpoints |
92
+ ```bash
93
+ codebase-index status
94
+ ```
497
95
 
498
- *Requires active Copilot subscription
96
+ ## CLI Commands
499
97
 
500
- ### Setup by Provider
98
+ | Command | Description |
99
+ |---------|-------------|
100
+ | `codebase-index install` | Install git hooks for automatic incremental indexing |
101
+ | `codebase-index index` | Full reindex of the codebase |
102
+ | `codebase-index incremental` | Incremental update — only changed files (blob SHA diff) |
103
+ | `codebase-index query <text>` | Worktree-scoped semantic search |
104
+ | `codebase-index status` | Show index status |
501
105
 
502
- **Ollama (Recommended for large codebases)**
503
- ```bash
504
- ollama pull nomic-embed-text
505
- ```
506
- ```json
507
- { "embeddingProvider": "ollama" }
508
- ```
106
+ ## MCP Server (Cursor, Claude Code, Windsurf)
509
107
 
510
- **OpenAI**
511
- ```bash
512
- export OPENAI_API_KEY=sk-...
513
- ```
514
- ```json
515
- { "embeddingProvider": "openai" }
516
- ```
108
+ The MCP server from upstream is fully preserved:
517
109
 
518
- **Google**
519
110
  ```bash
520
- export GOOGLE_API_KEY=...
521
- ```
522
- ```json
523
- { "embeddingProvider": "google" }
111
+ npm install @khivi/opencode-codebase-index @modelcontextprotocol/sdk zod
524
112
  ```
525
113
 
526
- **GitHub Copilot**
527
- No setup needed if you have an active Copilot subscription.
528
- ```json
529
- { "embeddingProvider": "github-copilot" }
530
- ```
531
-
532
- **Custom (OpenAI-compatible)**
533
- Works with any server that implements the OpenAI `/v1/embeddings` API format (llama.cpp, vLLM, text-embeddings-inference, LiteLLM, etc.).
114
+ **Cursor** (`.cursor/mcp.json`):
534
115
  ```json
535
116
  {
536
- "embeddingProvider": "custom",
537
- "customProvider": {
538
- "baseUrl": "http://localhost:11434/v1",
539
- "model": "nomic-embed-text",
540
- "dimensions": 768,
541
- "apiKey": "optional-api-key",
542
- "maxTokens": 8192,
543
- "timeoutMs": 30000
117
+ "mcpServers": {
118
+ "codebase-index": {
119
+ "command": "npx",
120
+ "args": ["opencode-codebase-index-mcp", "--project", "/path/to/your/project"]
121
+ }
544
122
  }
545
123
  }
546
124
  ```
547
- Required fields: `baseUrl`, `model`, `dimensions` (positive integer). Optional: `apiKey`, `maxTokens`, `timeoutMs` (default: 30000).
548
-
549
- ## ⚠️ Tradeoffs
550
-
551
- Be aware of these characteristics:
552
-
553
- | Aspect | Reality |
554
- |--------|---------|
555
- | **Search latency** | ~800-1000ms per query (embedding API call) |
556
- | **First index** | Takes time depending on codebase size (e.g., ~30s for 500 chunks) |
557
- | **Requires API** | Needs an embedding provider (Copilot, OpenAI, Google, or local Ollama) |
558
- | **Token costs** | Uses embedding tokens (free with Copilot, minimal with others) |
559
- | **Best for** | Discovery and exploration, not exhaustive matching |
560
-
561
- ## 💻 Local Development
562
125
 
563
- 1. **Build**:
564
- ```bash
565
- npm run build
566
- ```
126
+ Exposes 9 tools: `codebase_search`, `codebase_peek`, `find_similar`, `call_graph`, `index_codebase`, `index_status`, `index_health_check`, `index_metrics`, `index_logs`.
567
127
 
568
- 2. **Register in Test Project** (use `file://` URL in `opencode.json`):
569
- ```json
570
- {
571
- "plugin": [
572
- "file:///path/to/opencode-codebase-index"
573
- ]
574
- }
575
- ```
576
-
577
- This loads directly from your source directory, so changes take effect after rebuilding.
128
+ ## Configuration
578
129
 
579
- ## 🤝 Contributing
130
+ Zero-config by default. Customize in `.opencode/codebase-index.json`:
580
131
 
581
- 1. Fork the repository
582
- 2. Create a feature branch: `git checkout -b feature/my-feature`
583
- 3. Make your changes and add tests
584
- 4. Run checks: `npm run build && npm run test:run && npm run lint`
585
- 5. Commit: `git commit -m "feat: add my feature"`
586
- 6. Push and open a pull request
132
+ ```json
133
+ {
134
+ "embeddingProvider": "auto",
135
+ "scope": "project",
136
+ "indexing": {
137
+ "autoIndex": false,
138
+ "watchFiles": true,
139
+ "maxFileSize": 1048576,
140
+ "maxChunksPerFile": 100,
141
+ "semanticOnly": false
142
+ },
143
+ "search": {
144
+ "maxResults": 20,
145
+ "minScore": 0.1,
146
+ "hybridWeight": 0.5,
147
+ "fusionStrategy": "rrf"
148
+ }
149
+ }
150
+ ```
587
151
 
588
- CI will automatically run tests and type checking on your PR.
152
+ ### Embedding Providers
589
153
 
590
- ### Release process (structured + complete notes)
154
+ Providers are auto-detected in this order: GitHub Copilot, OpenAI, Google, Ollama. To choose one explicitly, set `embeddingProvider`:
591
155
 
592
- To ensure release notes reflect all merged work, this repo uses a draft-release workflow.
156
+ | Provider | Setup | Best For |
157
+ |----------|-------|----------|
158
+ | **GitHub Copilot** | Active Copilot subscription | Small codebases (<1k files) |
159
+ | **OpenAI** | `export OPENAI_API_KEY=sk-...` | General use |
160
+ | **Google** | `export GOOGLE_API_KEY=...` | Medium-large codebases |
161
+ | **Ollama** | `ollama pull nomic-embed-text` | Large codebases, privacy |
162
+ | **Custom** | OpenAI-compatible endpoint | Self-hosted |
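
The detection order can be read as a simple priority chain. The following is a sketch under stated assumptions: the `Probe` fields `hasCopilot` and `ollamaReachable` are hypothetical stand-ins for checks the README does not describe, and `detectProvider` is an illustrative name rather than the package's API. Only the environment variable names and the ordering come from the text above.

```typescript
type Provider = "github-copilot" | "openai" | "google" | "ollama";

// Hypothetical inputs to detection; how Copilot and Ollama availability
// are actually probed is not specified in the README.
interface Probe {
  env: Record<string, string | undefined>;
  hasCopilot: boolean;
  ollamaReachable: boolean;
}

// Return the first available provider in the documented order,
// or null when nothing is configured.
function detectProvider(probe: Probe): Provider | null {
  if (probe.hasCopilot) return "github-copilot";
  if (probe.env.OPENAI_API_KEY) return "openai";
  if (probe.env.GOOGLE_API_KEY) return "google";
  if (probe.ollamaReachable) return "ollama";
  return null;
}
```

Under this reading, a machine with both an OpenAI key and a local Ollama would resolve to OpenAI unless `embeddingProvider` is set explicitly.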
593
163
 
594
- 1. **Label every PR** with at least one semantic label:
595
- - `feature`, `bug`, `performance`, `documentation`, `dependencies`, `refactor`, `test`, `chore`
596
- - and (when relevant) `semver:major`, `semver:minor`, or `semver:patch`
597
- - PRs are validated by CI (`Release Label Check`) and fail if no release category label is present
598
- 2. **Let Release Drafter build the draft notes** automatically from merged PRs on `main`.
599
- 3. **Before publishing**:
600
- - copy/finalize relevant highlights into `CHANGELOG.md`
601
- - bump `package.json` version
602
- - run: `npm run build && npm run typecheck && npm run lint && npm run test:run`
603
- 4. **Publish release** from the draft (or via `gh release create` after reviewing draft content).
164
+ ## How It Works
604
165
 
605
- PRs labeled `skip-changelog` are intentionally excluded from release notes.
166
+ 1. **Parsing** — tree-sitter (Rust native) splits code into semantic chunks (functions, classes, interfaces)
167
+ 2. **Embedding** — chunks are vectorized via your configured AI provider
168
+ 3. **Storage** — vectors in usearch (F16 quantization), metadata in SQLite, keywords in BM25 inverted index
169
+ 4. **Search** — hybrid semantic + keyword retrieval, RRF fusion, deterministic reranking
170
+ 5. **Incremental** — blob SHAs track changes, so only what changed is re-embedded
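The RRF fusion named in step 4 can be sketched as follows. This is a minimal illustration of reciprocal rank fusion in general, not the plugin's code: `rrfFuse`, the chunk IDs, and the constant `k = 60` (a commonly used value) are all assumptions.

```typescript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) per item.
// Hypothetical helper for illustration only.
function rrfFuse(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first; break ties by ID so output is deterministic.
  return [...scores.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
}

// Made-up chunk IDs, ordered best-first by each retriever.
const semanticHits = ["auth.ts#login", "session.ts#create", "user.ts#find"];
const keywordHits = ["session.ts#create", "auth.ts#login", "token.ts#sign"];
const fused = rrfFuse([semanticHits, keywordHits]);
```

Because RRF uses only ranks, the semantic similarity scores and BM25 scores never need to be normalized against each other. Chunks that appear in both lists accumulate two contributions and rise to the top.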
606
171
 
607
- ### Project Structure
172
+ ### Storage Structure
608
173
 
609
174
  ```
610
- ├── src/
611
- ├── index.ts # Plugin entry point
612
- ├── mcp-server.ts # MCP server (Cursor, Claude Code, Windsurf)
613
- ├── cli.ts # CLI entry for MCP stdio transport
614
- │ ├── config/ # Configuration schema
615
- │ ├── embeddings/ # Provider detection and API calls
616
- │ ├── indexer/ # Core indexing logic + inverted index
617
- │ ├── git/ # Git utilities (branch detection)
618
- │ ├── tools/ # OpenCode tool definitions
619
- │ ├── utils/ # File collection, cost estimation
620
- │ ├── native/ # Rust native module wrapper
621
- │ └── watcher/ # File/git change watcher
622
- ├── native/
623
- │ └── src/ # Rust: tree-sitter, usearch, xxhash, SQLite
624
- ├── tests/ # Unit tests (vitest)
625
- ├── commands/ # Slash command definitions
626
- ├── skill/ # Agent skill guidance
627
- └── .github/workflows/ # CI/CD (test, build, publish)
175
+ .opencode/index/
176
+ ├── codebase.db # SQLite: embeddings, chunks, branch catalog, symbols, call edges
177
+ ├── vectors.usearch # Vector index (usearch)
178
+ ├── inverted-index.json # BM25 keyword index
179
+ └── file-hashes.json # Blob SHA change detection
628
180
  ```
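One way to picture the blob SHA change detection behind `file-hashes.json` is the sketch below. It assumes the hashes are Git-style blob SHA-1 values and that the file maps paths to hashes; neither detail is confirmed here, and `gitBlobSha` and `changedPaths` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Git hashes a blob as SHA-1 over the header "blob <byte length>\0" plus the content.
function gitBlobSha(content: Buffer): string {
  const header = Buffer.from(`blob ${content.length}\0`, "utf8");
  return createHash("sha1").update(header).update(content).digest("hex");
}

// Return the paths whose current hash differs from the stored one (or is new).
// `previous` stands in for the parsed file-hashes.json (assumed shape: path -> SHA).
function changedPaths(
  previous: Record<string, string>,
  current: Map<string, Buffer>,
): string[] {
  const changed: string[] = [];
  for (const [path, content] of current) {
    if (previous[path] !== gitBlobSha(content)) {
      changed.push(path);
    }
  }
  return changed;
}

// Usage: only b.ts would need re-embedding.
const stored = { "a.ts": gitBlobSha(Buffer.from("hello\n")) };
const toReembed = changedPaths(
  stored,
  new Map([
    ["a.ts", Buffer.from("hello\n")],
    ["b.ts", Buffer.from("new file\n")],
  ]),
);
```

Hashing content rather than comparing modification times means a file that is touched but unchanged is not re-embedded.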
629
181
 
630
- ### Native Module
631
-
632
- The Rust native module handles performance-critical operations:
633
- - **tree-sitter**: Language-aware code parsing with JSDoc/docstring extraction
634
- - **usearch**: High-performance vector similarity search with F16 quantization
635
- - **SQLite**: Persistent storage for embeddings, chunks, branch catalog, symbols, and call edges
636
- - **BM25 inverted index**: Fast keyword search for hybrid retrieval
637
- - **Call graph extraction**: Tree-sitter query-based extraction of function calls, method calls, constructors, and imports (TypeScript/JavaScript, Python, Go, Rust)
638
- - **xxhash**: Fast content hashing for change detection
639
-
640
- Rebuild with: `npm run build:native` (requires Rust toolchain)
641
-
642
- ### Platform Support
182
+ ## Upstream Features
643
183
 
644
- Pre-built native binaries are published for:
184
+ All upstream capabilities are preserved:
645
185
 
646
- | Platform | Architecture | SIMD Acceleration |
647
- |----------|-------------|--------------------|
648
- | macOS | x86_64 | simsimd |
649
- | macOS | ARM64 (Apple Silicon) | simsimd |
650
- | Linux | x86_64 (GNU) | simsimd |
651
- | Linux | ARM64 (GNU) | simsimd |
652
- | Windows | x86_64 (MSVC) | ❌ scalar fallback |
186
+ - Branch-aware indexing with embedding reuse across branches
187
+ - Call graph extraction (TypeScript, JavaScript, Python, Go, Rust)
188
+ - Hybrid search (semantic + BM25 keyword)
189
+ - OpenCode plugin with slash commands (`/search`, `/find`, `/index`, `/status`)
190
+ - Native Rust module (tree-sitter, usearch, SQLite, xxhash)
191
+ - Platform support: macOS (x86_64, ARM64), Linux (x86_64, ARM64), Windows (x86_64)
653
192
 
654
- Windows builds use scalar distance functions instead of SIMD — functionally identical, marginally slower for very large indexes. This is due to MSVC lacking support for certain AVX-512 intrinsics used by simsimd.
193
+ For full upstream documentation, see the [original repository](https://github.com/Helweg/opencode-codebase-index).
655
194
 
656
195
  ## License
657
196
 
658
- MIT
197
+ MIT — original work by [Kenneth Helweg](https://github.com/Helweg)