@khivi/opencode-codebase-index 0.5.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +658 -0
- package/commands/call-graph.md +24 -0
- package/commands/find.md +25 -0
- package/commands/index.md +21 -0
- package/commands/search.md +24 -0
- package/commands/status.md +15 -0
- package/dist/cli.cjs +6208 -0
- package/dist/cli.cjs.map +1 -0
- package/dist/cli.js +6213 -0
- package/dist/cli.js.map +1 -0
- package/dist/git-cli.cjs +6006 -0
- package/dist/git-cli.cjs.map +1 -0
- package/dist/git-cli.js +6011 -0
- package/dist/git-cli.js.map +1 -0
- package/dist/index.cjs +8224 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.js +8221 -0
- package/dist/index.js.map +1 -0
- package/native/codebase-index-native.darwin-arm64.node +0 -0
- package/package.json +103 -0
- package/skill/SKILL.md +78 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Kenneth Helweg

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,658 @@
# opencode-codebase-index

[](https://www.npmjs.com/package/opencode-codebase-index)
[](https://opensource.org/licenses/MIT)
[](https://www.npmjs.com/package/opencode-codebase-index)
[](https://github.com/Helweg/opencode-codebase-index/actions)
[](https://nodejs.org/)

> **Stop grepping for concepts. Start searching for meaning.**

**opencode-codebase-index** brings semantic understanding to your [OpenCode](https://opencode.ai) workflow, and now to any MCP-compatible client like Cursor, Claude Code, and Windsurf. Instead of guessing function names or grepping for keywords, ask your codebase questions in plain English.

## 🚀 Why Use This?

- 🧠 **Semantic Search**: Finds "user authentication" logic even if the function is named `check_creds`.
- ⚡ **Blazing Fast Indexing**: Powered by a Rust native module using `tree-sitter` and `usearch`. Incremental updates take milliseconds.
- 🌿 **Branch-Aware**: Seamlessly handles git branch switches by reusing embeddings and filtering stale results.
- 🔒 **Privacy Focused**: Your vector index is stored locally in your project.
- 🔌 **Model Agnostic**: Works out-of-the-box with GitHub Copilot, OpenAI, Gemini, or local Ollama models.
- 🌐 **MCP Server**: Use with Cursor, Claude Code, Windsurf, or any MCP-compatible client. Index once, search from anywhere.

## ⚡ Quick Start

1. **Install the plugin**
   ```bash
   npm install opencode-codebase-index
   ```

2. **Add to `opencode.json`**
   ```json
   {
     "plugin": ["opencode-codebase-index"]
   }
   ```

3. **Index your codebase**
   Run `/index` or ask the agent to index your codebase. This only needs to be done once; subsequent updates are incremental.

4. **Start Searching**
   Ask:
   > "Find the function that handles credit card validation errors"

## 🌐 MCP Server (Cursor, Claude Code, Windsurf, etc.)

Use the same semantic search from any MCP-compatible client. Index once, search from anywhere.

1. **Install dependencies**
   ```bash
   npm install opencode-codebase-index @modelcontextprotocol/sdk zod
   ```

2. **Configure your MCP client**

   **Cursor** (`.cursor/mcp.json`):
   ```json
   {
     "mcpServers": {
       "codebase-index": {
         "command": "npx",
         "args": ["opencode-codebase-index-mcp", "--project", "/path/to/your/project"]
       }
     }
   }
   ```

   **Claude Code** (`.mcp.json`) or **Claude Desktop** (`claude_desktop_config.json`):
   ```json
   {
     "mcpServers": {
       "codebase-index": {
         "command": "npx",
         "args": ["opencode-codebase-index-mcp", "--project", "/path/to/your/project"]
       }
     }
   }
   ```

3. **CLI options**
   ```bash
   npx opencode-codebase-index-mcp --project /path/to/repo    # specify project root
   npx opencode-codebase-index-mcp --config /path/to/config   # custom config file
   npx opencode-codebase-index-mcp                            # uses current directory
   ```

The MCP server exposes all 9 tools (`codebase_search`, `codebase_peek`, `find_similar`, `call_graph`, `index_codebase`, `index_status`, `index_health_check`, `index_metrics`, `index_logs`) and 4 prompts (`search`, `find`, `index`, `status`).

The MCP dependencies (`@modelcontextprotocol/sdk`, `zod`) are optional peer dependencies; they're only needed if you use the MCP server.

## 🔍 See It In Action

**Scenario**: You're new to a codebase and need to fix a bug in the payment flow.

**Without Plugin (grep)**:
- `grep -r "payment" .` → 500 results (too many)
- `grep -r "card" .` → 200 results (mostly UI)
- `grep -r "stripe" .` → 50 results (maybe?)

**With `opencode-codebase-index`**:
You ask: *"Where is the payment validation logic?"*

Plugin returns:
```text
src/services/billing.ts:45  (Class PaymentValidator)
src/utils/stripe.ts:12      (Function validateCardToken)
src/api/checkout.ts:89      (Route handler for /pay)
```

## 🎯 When to Use What

| Scenario | Tool | Why |
|----------|------|-----|
| Don't know the function name | `codebase_search` | Semantic search finds by meaning |
| Exploring unfamiliar codebase | `codebase_search` | Discovers related code across files |
| Just need to find locations | `codebase_peek` | Returns metadata only, saves ~90% tokens |
| Understand code flow | `call_graph` | Find callers/callees of any function |
| Know exact identifier | `grep` | Faster, finds all occurrences |
| Need ALL matches | `grep` | Semantic returns top N only |
| Mixed discovery + precision | `/find` (hybrid) | Best of both worlds |

**Rule of thumb**: `codebase_peek` to find locations → `Read` to examine → `grep` for precision.

## 📊 Token Usage

In our testing across open-source codebases (axios, express), we observed **up to 90% reduction in token usage** for conceptual queries like *"find the error handling middleware"*.

### Why It Saves Tokens

- **Without plugin**: Agent explores files, reads code, backtracks, explores more
- **With plugin**: Semantic search returns relevant code immediately → less exploration

### Key Takeaways

1. **Significant savings possible**: Up to 90% reduction in the best cases
2. **Results vary**: Savings depend on query type, codebase structure, and agent behavior
3. **Best for discovery**: Conceptual queries benefit most; exact identifier lookups should use grep
4. **Complements existing tools**: Provides a faster initial signal, doesn't replace grep/explore

### When the Plugin Helps Most

- **Conceptual queries**: "Where is the authentication logic?" (no keywords to grep for)
- **Unfamiliar codebases**: You don't know what to search for yet
- **Large codebases**: Semantic search scales better than exhaustive exploration

## 🛠️ How It Works

```mermaid
graph TD
    subgraph Indexing
    A[Source Code] -->|Tree-sitter| B[Semantic Chunks]
    B -->|Embedding Model| C[Vectors]
    C -->|uSearch| D[(Vector Store)]
    C -->|SQLite| G[(Embeddings DB)]
    B -->|BM25| E[(Inverted Index)]
    B -->|Branch Catalog| G
    end

    subgraph Searching
    Q[User Query] -->|Embedding Model| V[Query Vector]
    V -->|Cosine Similarity| D
    Q -->|BM25| E
    D --> F[Hybrid Fusion RRF/Weighted]
    E --> F
    F --> X[Deterministic Rerank]
    G -->|Branch + Metadata Filters| X
    X --> R[Ranked Results]
    end
```

1. **Parsing**: We use `tree-sitter` to parse your code into meaningful blocks (functions, classes, interfaces). JSDoc comments and docstrings are automatically included with their associated code.

   **Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML
2. **Chunking**: Large blocks are split with overlapping windows to preserve context across chunk boundaries.
3. **Embedding**: These blocks are converted into vector representations using your configured AI provider.
4. **Storage**: Embeddings are stored in SQLite (deduplicated by content hash) and vectors in `usearch` with F16 quantization, which roughly halves memory use compared with F32 vectors. A branch catalog tracks which chunks exist on each branch.
5. **Hybrid Search**: Combines semantic similarity (vectors) with BM25 keyword matching, fuses the two result lists (`rrf` by default, `weighted` as a fallback), applies a deterministic rerank, then filters by current branch/metadata.
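
The fusion step in the list above can be illustrated with a minimal reciprocal rank fusion (RRF) sketch. It is not the plugin's implementation: the `rrfFuse` function, the result shape, and the sample ids are hypothetical, and only the default `k = 60` is taken from the `rrfK` option documented in this README.

```typescript
// Reciprocal rank fusion: every list adds 1 / (k + rank) to each id it ranks (rank starts at 1).
type RankedIds = string[]; // chunk ids, best match first (hypothetical shape)

interface FusedResult {
  id: string;
  score: number;
}

function rrfFuse(lists: RankedIds[], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: FusedResult[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest score first; ties fall back to id order so the output is deterministic.
  return fused.sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Made-up ids standing in for the vector and BM25 result lists.
const semanticHits: RankedIds = ["billing.ts#validate", "stripe.ts#token", "checkout.ts#pay"];
const keywordHits: RankedIds = ["stripe.ts#token", "cart.ts#total", "billing.ts#validate"];
const fusedHits = rrfFuse([semanticHits, keywordHits]);
// stripe.ts#token ranks first: it is near the top of both lists.
```

Because RRF only looks at ranks, it does not need the vector and BM25 scores to be on the same scale, which is one plausible reason for it being the default over score blending.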

**Performance characteristics:**
- **Incremental indexing**: ~50ms check time; only changed files are re-embedded
- **Smart chunking**: Understands code structure to keep functions whole, with overlap for context
- **Native speed**: Core logic written in Rust for maximum performance
- **Memory efficient**: F16 vector quantization roughly halves index size compared with F32
- **Branch-aware**: Automatically tracks which chunks exist on each git branch
- **Provider validation**: Detects embedding provider/model changes and requires rebuild to prevent garbage results
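
To make the overlapping-window idea concrete, here is a small sketch that splits an oversized block by lines. The function name, the window size, and the overlap are assumptions chosen for illustration; the README does not state the plugin's actual chunk sizes or whether it measures windows in lines, tokens, or bytes.

```typescript
// Split an oversized block into windows of `size` lines where consecutive
// windows share `overlap` lines. Both numbers are illustrative only.
function chunkWithOverlap(lines: string[], size: number, overlap: number): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[][] = [];
  const step = size - overlap;
  for (let start = 0; start < lines.length; start += step) {
    chunks.push(lines.slice(start, start + size));
    if (start + size >= lines.length) break; // this window already reached the end
  }
  return chunks;
}

const bigBlock = ["l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8", "l9", "l10"];
const windows = chunkWithOverlap(bigBlock, 4, 1);
// Three windows: l1-l4, l4-l7, l7-l10. Each boundary line appears in two chunks.
```

The shared lines mean a statement that straddles a boundary is still seen with some of its surrounding context in at least one chunk.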

## 🌿 Branch-Aware Indexing

The plugin automatically detects git branches and optimizes indexing across branch switches.

### How It Works

When you switch branches, code changes but embeddings for unchanged content remain the same. The plugin:

1. **Stores embeddings by content hash**: Embeddings are deduplicated across branches
2. **Tracks branch membership**: A lightweight catalog tracks which chunks exist on each branch
3. **Filters search results**: Queries only return results relevant to the current branch
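
The three steps above can be sketched with two in-memory maps. This is a simplified model under stated assumptions: `BranchAwareStore`, `contentHash`, and `fakeEmbed` are hypothetical names, FNV-1a stands in for the xxhash function the README mentions, and the real plugin keeps this data in SQLite rather than in memory.

```typescript
// FNV-1a (32-bit) is a stand-in hash for this sketch, not the plugin's hash function.
function contentHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

class BranchAwareStore {
  private embeddings = new Map<string, number[]>(); // content hash -> vector, shared by all branches
  private catalog = new Map<string, Set<string>>(); // branch -> hashes present on that branch
  embedCalls = 0;

  index(branch: string, chunks: string[], embed: (text: string) => number[]): void {
    const members = new Set<string>();
    for (const chunk of chunks) {
      const hash = contentHash(chunk);
      if (!this.embeddings.has(hash)) {
        this.embeddings.set(hash, embed(chunk)); // only unseen content is embedded
        this.embedCalls++;
      }
      members.add(hash);
    }
    this.catalog.set(branch, members);
  }

  isOnBranch(branch: string, chunk: string): boolean {
    return this.catalog.get(branch)?.has(contentHash(chunk)) ?? false;
  }
}

const fakeEmbed = (text: string): number[] => [text.length]; // placeholder, not a real model
const store = new BranchAwareStore();
store.index("main", ["function a() {}", "function b() {}"], fakeEmbed);
store.index("feature", ["function a() {}", "function c() {}"], fakeEmbed);
// Three embed calls in total: the chunk shared by both branches is embedded once.
```

Filtering a search result then reduces to a catalog lookup such as `store.isOnBranch("feature", chunk)`.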

### Benefits

| Scenario | Without Branch Awareness | With Branch Awareness |
|----------|-------------------------|----------------------|
| Switch to feature branch | Re-index everything | Instant: reuses existing embeddings |
| Return to main | Re-index everything | Instant: catalog already exists |
| Search on branch | May return stale results | Only returns current branch's code |

### Automatic Behavior

- **Branch detection**: Automatically reads from `.git/HEAD`
- **Re-indexing on switch**: Triggers when you switch branches (via file watcher)
- **Legacy migration**: Automatically migrates old indexes on first run
- **Garbage collection**: Health check removes orphaned embeddings and chunks

### Storage Structure

```
.opencode/index/
├── codebase.db           # SQLite: embeddings, chunks, branch catalog, symbols, call edges
├── vectors.usearch       # Vector index (uSearch)
├── inverted-index.json   # BM25 keyword index
└── file-hashes.json      # File change detection
```
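
As an illustration of how a file-hash map like `file-hashes.json` can drive incremental indexing, the sketch below compares the hashes from the previous run with the current ones. The path-to-hash shape and the `diffFileHashes` helper are assumptions; the README does not document the file's actual schema.

```typescript
interface ChangeSet {
  added: string[];
  modified: string[];
  removed: string[];
}

// Compare path -> content-hash maps from two runs (the map shape is assumed).
function diffFileHashes(previous: Map<string, string>, current: Map<string, string>): ChangeSet {
  const changes: ChangeSet = { added: [], modified: [], removed: [] };
  current.forEach((hash, path) => {
    const old = previous.get(path);
    if (old === undefined) changes.added.push(path);
    else if (old !== hash) changes.modified.push(path);
  });
  previous.forEach((_hash, path) => {
    if (!current.has(path)) changes.removed.push(path);
  });
  return changes;
}

const lastRun = new Map<string, string>([["src/a.ts", "h1"], ["src/b.ts", "h2"], ["src/c.ts", "h3"]]);
const thisRun = new Map<string, string>([["src/a.ts", "h1"], ["src/b.ts", "h2-edited"], ["src/d.ts", "h4"]]);
const fileChanges = diffFileHashes(lastRun, thisRun);
// Only src/b.ts and src/d.ts need re-parsing; src/c.ts can be dropped from the index.
```

Unchanged files never reach the parser or the embedding provider, which is what keeps the incremental check cheap.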

## 🧰 Tools Available

The plugin exposes these tools to the OpenCode agent:

### `codebase_search`
**The primary tool.** Searches code by describing behavior.
- **Use for**: Discovery, understanding flows, finding logic when you don't know the names.
- **Example**: `"find the middleware that sanitizes input"`
- **Ranking path**: hybrid retrieval → fusion (`search.fusionStrategy`) → deterministic rerank (`search.rerankTopN`) → filters

**Writing good queries:**

| ✅ Good queries (describe behavior) | ❌ Bad queries (too vague) |
|-------------------------------------|---------------------------|
| "function that validates email format" | "email" |
| "error handling for failed API calls" | "error" |
| "middleware that checks authentication" | "auth middleware" |
| "code that calculates shipping costs" | "shipping" |
| "where user permissions are checked" | "permissions" |

### `codebase_peek`
**Token-efficient discovery.** Returns only metadata (file, line, name, type) without code content.
- **Use for**: Finding WHERE code is before deciding what to read. Saves ~90% tokens vs `codebase_search`.
- **Ranking path**: same hybrid ranking path as `codebase_search` (metadata-only output)
- **Example output**:
  ```
  [1] function "validatePayment" at src/billing.ts:45-67 (score: 0.92)
  [2] class "PaymentProcessor" at src/processor.ts:12-89 (score: 0.87)

  Use Read tool to examine specific files.
  ```
- **Workflow**: `codebase_peek` → find locations → `Read` specific files
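
The token saving comes from returning locations instead of chunk bodies. A minimal sketch of that idea follows; the `SearchHit` fields and `formatPeek` are hypothetical, and the output format simply mirrors the example output shown above.

```typescript
interface SearchHit {
  kind: string; // e.g. "function" or "class"
  name: string;
  file: string;
  startLine: number;
  endLine: number;
  score: number;
  content: string; // full chunk text, which a peek-style view leaves out
}

// Render one line of metadata per hit and never touch `content`.
function formatPeek(hits: SearchHit[]): string {
  return hits
    .map((h, i) =>
      `[${i + 1}] ${h.kind} "${h.name}" at ${h.file}:${h.startLine}-${h.endLine} (score: ${h.score.toFixed(2)})`)
    .join("\n");
}

const peekHits: SearchHit[] = [
  { kind: "function", name: "validatePayment", file: "src/billing.ts", startLine: 45, endLine: 67, score: 0.92, content: "/* chunk body */" },
  { kind: "class", name: "PaymentProcessor", file: "src/processor.ts", startLine: 12, endLine: 89, score: 0.87, content: "/* chunk body */" },
];
const peekText = formatPeek(peekHits);
```

The agent can then decide which of the listed ranges is worth a full `Read`.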

### `find_similar`
Find code similar to a provided snippet.
- **Use for**: Duplicate detection, refactor prep, pattern mining.
- **Ranking path**: semantic retrieval only + deterministic rerank (no BM25, no RRF).

### `index_codebase`
Manually trigger indexing.
- **Use for**: Forcing a re-index or checking stats.
- **Parameters**: `force` (rebuild all), `estimateOnly` (check costs), `verbose` (show skipped files and parse failures).

### `index_status`
Checks if the index is ready and healthy.

### `index_health_check`
Maintenance tool to remove stale entries from deleted files and orphaned embeddings/chunks from the database.

### `index_metrics`
Returns collected metrics about indexing and search performance. Requires `debug.enabled` and `debug.metrics` to be `true`.
- **Metrics include**: Files indexed, chunks created, cache hit rate, search timing breakdown, GC stats, embedding API call stats.

### `index_logs`
Returns recent debug logs with optional filtering.
- **Parameters**: `category` (optional: `search`, `embedding`, `cache`, `gc`, `branch`), `level` (optional: `error`, `warn`, `info`, `debug`), `limit` (default: 50).

### `call_graph`
Query the call graph to find callers or callees of a function/method. Automatically built during indexing for TypeScript, JavaScript, Python, Go, and Rust.
- **Use for**: Understanding code flow, tracing dependencies, impact analysis.
- **Parameters**: `name` (function name), `direction` (`callers` or `callees`), `symbolId` (required for `callees`, returned by previous queries).
- **Example**: Find who calls `validateToken` → `call_graph(name="validateToken", direction="callers")`
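
A caller/callee lookup of this kind can be modelled as a filter over stored call edges. The sketch below is a simplified model, not the plugin's schema: `SymbolRow`, `CallEdge`, and both helper functions are hypothetical names, and the sample data is made up.

```typescript
interface SymbolRow {
  id: string;
  name: string;
  file: string;
}

interface CallEdge {
  fromSymbolId: string; // the symbol whose body contains the call
  targetName: string;   // the name being called
}

// direction = "callers": which symbols contain a call to `name`?
function findCallers(name: string, symbols: SymbolRow[], edges: CallEdge[]): SymbolRow[] {
  const callerIds = new Set(edges.filter((e) => e.targetName === name).map((e) => e.fromSymbolId));
  return symbols.filter((s) => callerIds.has(s.id));
}

// direction = "callees": what does one specific symbol call?
function findCallees(symbolId: string, edges: CallEdge[]): string[] {
  return edges.filter((e) => e.fromSymbolId === symbolId).map((e) => e.targetName);
}

const symbolRows: SymbolRow[] = [
  { id: "s1", name: "login", file: "src/auth.ts" },
  { id: "s2", name: "refresh", file: "src/auth.ts" },
  { id: "s3", name: "validateToken", file: "src/token.ts" },
];
const callEdges: CallEdge[] = [
  { fromSymbolId: "s1", targetName: "validateToken" },
  { fromSymbolId: "s2", targetName: "validateToken" },
  { fromSymbolId: "s3", targetName: "decodeJwt" },
];
const tokenCallers = findCallers("validateToken", symbolRows, callEdges);
const tokenCallees = findCallees("s3", callEdges);
```

In this model a callee query starts from one concrete definition, which is a plausible reason for the tool requiring a `symbolId` there while a plain name is enough for callers.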

## 🎮 Slash Commands

The plugin automatically registers these slash commands:

| Command | Description |
| ------- | ----------- |
| `/search <query>` | **Pure Semantic Search**. Best for "How does X work?" |
| `/find <query>` | **Hybrid Search**. Combines semantic search + grep. Best for "Find usage of X". |
| `/index` | **Update Index**. Forces a refresh of the codebase index. |
| `/status` | **Check Status**. Shows if indexed, chunk count, and provider info. |

## ⚙️ Configuration

Zero-config by default (uses `auto` mode). Customize in `.opencode/codebase-index.json`:

```json
{
  "embeddingProvider": "auto",
  "scope": "project",
  "indexing": {
    "autoIndex": false,
    "watchFiles": true,
    "maxFileSize": 1048576,
    "maxChunksPerFile": 100,
    "semanticOnly": false,
    "autoGc": true,
    "gcIntervalDays": 7,
    "gcOrphanThreshold": 100,
    "requireProjectMarker": true
  },
  "search": {
    "maxResults": 20,
    "minScore": 0.1,
    "hybridWeight": 0.5,
    "fusionStrategy": "rrf",
    "rrfK": 60,
    "rerankTopN": 20,
    "contextLines": 0
  },
  "debug": {
    "enabled": false,
    "logLevel": "info",
    "metrics": false
  }
}
```

### Options Reference

| Option | Default | Description |
|--------|---------|-------------|
| `embeddingProvider` | `"auto"` | Which AI to use: `auto`, `github-copilot`, `openai`, `google`, `ollama`, `custom` |
| `scope` | `"project"` | `project` = index per repo, `global` = shared index across repos |
| **indexing** | | |
| `autoIndex` | `false` | Automatically index on plugin load |
| `watchFiles` | `true` | Re-index when files change |
| `maxFileSize` | `1048576` | Skip files larger than this (bytes). Default: 1MB |
| `maxChunksPerFile` | `100` | Maximum chunks to index per file (controls token costs for large files) |
| `semanticOnly` | `false` | When `true`, only index semantic nodes (functions, classes) and skip generic blocks |
| `retries` | `3` | Number of retry attempts for failed embedding API calls |
| `retryDelayMs` | `1000` | Delay between retries in milliseconds |
| `autoGc` | `true` | Automatically run garbage collection to remove orphaned embeddings/chunks |
| `gcIntervalDays` | `7` | Run GC on initialization if last GC was more than N days ago |
| `gcOrphanThreshold` | `100` | Run GC after indexing if orphan count exceeds this threshold |
| `requireProjectMarker` | `true` | Require a project marker (`.git`, `package.json`, etc.) to enable file watching and auto-indexing. Prevents accidentally indexing large directories like home. Set to `false` to index any directory. |
| **search** | | |
| `maxResults` | `20` | Maximum results to return |
| `minScore` | `0.1` | Minimum similarity score (0-1). Lower = more results |
| `hybridWeight` | `0.5` | Balance between keyword (1.0) and semantic (0.0) search |
| `fusionStrategy` | `"rrf"` | Hybrid fusion mode: `"rrf"` (rank-based reciprocal rank fusion) or `"weighted"` (legacy score blending fallback) |
| `rrfK` | `60` | RRF smoothing constant. Higher values flatten rank impact, lower values prioritize top-ranked candidates more strongly |
| `rerankTopN` | `20` | Deterministic rerank depth cap. Applies lightweight name/path/chunk-type rerank to top-N only |
| `contextLines` | `0` | Extra lines to include before/after each match |
| **debug** | | |
| `enabled` | `false` | Enable debug logging and metrics collection |
| `logLevel` | `"info"` | Log level: `error`, `warn`, `info`, `debug` |
| `logSearch` | `true` | Log search operations with timing breakdown |
| `logEmbedding` | `true` | Log embedding API calls (success, error, rate-limit) |
| `logCache` | `true` | Log cache hits and misses |
| `logGc` | `true` | Log garbage collection operations |
| `metrics` | `false` | Enable metrics collection (indexing stats, search timing, cache performance) |
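
To show how `hybridWeight`, `minScore`, and `maxResults` could interact under the `"weighted"` strategy, here is a hedged sketch of linear score blending. It assumes both scores are already normalized to the 0-1 range and that the blend is a simple weighted sum; the README only states that 1.0 means keyword and 0.0 means semantic, so the formula and the `weightedFuse` name are assumptions.

```typescript
interface ScoredCandidate {
  id: string;
  semantic: number; // assumed normalized to 0..1
  keyword: number;  // assumed normalized to 0..1
}

// hybridWeight = 1.0 means keyword only, 0.0 means semantic only (as in the options table).
function weightedFuse(
  candidates: ScoredCandidate[],
  hybridWeight = 0.5,
  minScore = 0.1,
  maxResults = 20,
): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({ id: c.id, score: hybridWeight * c.keyword + (1 - hybridWeight) * c.semantic }))
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id))
    .slice(0, maxResults);
}

const blendInput: ScoredCandidate[] = [
  { id: "a", semantic: 0.9, keyword: 0.1 },
  { id: "b", semantic: 0.4, keyword: 0.8 },
  { id: "c", semantic: 0.05, keyword: 0.05 },
];
const balanced = weightedFuse(blendInput);           // "b" then "a"; "c" falls below minScore
const semanticOnly = weightedFuse(blendInput, 0.0);  // "a" then "b"
```

Moving `hybridWeight` toward 0.0 flips the order in this example, which is the kind of effect to expect when tuning it for conceptual versus identifier-heavy queries.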

### Retrieval ranking behavior (Phase 1)

- `codebase_search` and `codebase_peek` use the hybrid path: semantic + keyword retrieval → fusion (`fusionStrategy`) → deterministic rerank (`rerankTopN`) → filtering.
- `find_similar` stays semantic-only: semantic retrieval + deterministic rerank only (no keyword retrieval, no RRF).
- For compatibility rollbacks, set `search.fusionStrategy` to `"weighted"` to use the legacy weighted fusion path.
- Retrieval benchmark artifacts are separated by role:
  - baseline (versioned): `benchmarks/baselines/retrieval-baseline.json`
  - latest candidate run (generated): `benchmark-results/retrieval-candidate.json`
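
The deterministic rerank step can be pictured as a small, rule-based score adjustment applied only to the first `rerankTopN` candidates. The sketch below is an assumption-heavy illustration: the bonus values, the `Candidate` shape, and `rerankTopN` as a function name are all invented here, since the README only says the rerank uses name, path, and chunk type.

```typescript
interface Candidate {
  id: string;
  name: string;
  path: string;
  chunkType: string;
  score: number; // fused score; input is assumed to be sorted best first
}

function rerankTopN(results: Candidate[], query: string, topN = 20): Candidate[] {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const bonus = (c: Candidate): number => {
    let b = 0;
    for (const t of terms) {
      if (c.name.toLowerCase().includes(t)) b += 0.05; // name match (made-up weight)
      if (c.path.toLowerCase().includes(t)) b += 0.02; // path match (made-up weight)
    }
    if (c.chunkType === "function" || c.chunkType === "class") b += 0.01; // chunk-type nudge
    return b;
  };
  const head = results.slice(0, topN).map((c) => ({ c, adjusted: c.score + bonus(c) }));
  // No randomness and an id tie-break, so the same input always yields the same order.
  head.sort((x, y) => y.adjusted - x.adjusted || x.c.id.localeCompare(y.c.id));
  return [...head.map((h) => h.c), ...results.slice(topN)];
}

const fusedCandidates: Candidate[] = [
  { id: "1", name: "formatCurrency", path: "src/utils/format.ts", chunkType: "function", score: 0.8 },
  { id: "2", name: "validatePayment", path: "src/billing/payment.ts", chunkType: "function", score: 0.78 },
  { id: "3", name: "paymentNotes", path: "docs/payment.md", chunkType: "block", score: 0.7 },
];
const reranked = rerankTopN(fusedCandidates, "validate payment", 2);
// Candidate 2 moves ahead of candidate 1; candidate 3 sits outside the top 2 and keeps its place.
```

Capping the rerank depth keeps the extra work constant regardless of how many candidates retrieval returns.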

### Embedding Providers
The plugin automatically detects available credentials in this order:
1. **GitHub Copilot** (Free if you have it)
2. **OpenAI** (Standard Embeddings)
3. **Google** (Gemini Embeddings)
4. **Ollama** (Local/Private - requires `nomic-embed-text`)

You can also use **Custom** to connect any OpenAI-compatible embedding endpoint (llama.cpp, vLLM, text-embeddings-inference, LiteLLM, etc.).
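
The detection order above amounts to "take the first provider that is available". A minimal sketch of that rule follows; the `Availability` fields and `detectProvider` are hypothetical, and how the plugin actually discovers each credential is not described in this README.

```typescript
type Provider = "github-copilot" | "openai" | "google" | "ollama";

// Hypothetical summary of what was found on the machine.
interface Availability {
  copilotToken?: string;
  openaiKey?: string;
  googleKey?: string;
  ollamaRunning?: boolean;
}

// Same order as the numbered list above.
const DETECTION_ORDER: Provider[] = ["github-copilot", "openai", "google", "ollama"];

function detectProvider(env: Availability): Provider | undefined {
  const available: Record<Provider, boolean> = {
    "github-copilot": Boolean(env.copilotToken),
    openai: Boolean(env.openaiKey),
    google: Boolean(env.googleKey),
    ollama: env.ollamaRunning === true,
  };
  return DETECTION_ORDER.find((p) => available[p]);
}

const picked = detectProvider({ openaiKey: "sk-example", ollamaRunning: true }); // OpenAI wins over Ollama
const nothingFound = detectProvider({});                                         // undefined
```

Setting `embeddingProvider` explicitly bypasses this ordering, which matters when a higher-priority credential exists but a local model is preferred.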

### Rate Limits by Provider

Each provider has different rate limits. The plugin automatically adjusts concurrency and delays:

| Provider | Concurrency | Delay | Best For |
|----------|-------------|-------|----------|
| **GitHub Copilot** | 1 | 4s | Small codebases (<1k files) |
| **OpenAI** | 3 | 500ms | Medium codebases |
| **Google** | 5 | 200ms | Medium-large codebases |
| **Ollama** | 5 | None | Large codebases (10k+ files) |
| **Custom** | 3 | 1s | Any OpenAI-compatible endpoint |

**For large codebases**, use Ollama locally to avoid rate limits. Pull the embedding model, then set the provider in `.opencode/codebase-index.json`:

```bash
# Install the embedding model
ollama pull nomic-embed-text
```

```json
{
  "embeddingProvider": "ollama"
}
```

## 📈 Performance

The plugin is built for speed with a Rust native module. Here are typical performance numbers (Apple M1):

### Parsing (tree-sitter)

| Files | Chunks | Time |
|-------|--------|------|
| 100 | 1,200 | ~7ms |
| 500 | 6,000 | ~32ms |

### Vector Search (usearch)

| Index Size | Search Time | Throughput |
|------------|-------------|------------|
| 1,000 vectors | 0.7ms | 1,400 ops/sec |
| 5,000 vectors | 1.2ms | 850 ops/sec |
| 10,000 vectors | 1.3ms | 780 ops/sec |

### Database Operations (SQLite with batch)

| Operation | 1,000 items | 10,000 items |
|-----------|-------------|--------------|
| Insert chunks | 4ms | 44ms |
| Add to branch | 2ms | 22ms |
| Check embedding exists | <0.01ms | <0.01ms |

### Batch vs Sequential Performance

Batch operations provide significant speedups:

| Operation | Sequential | Batch | Speedup |
|-----------|------------|-------|---------|
| Insert 1,000 chunks | 38ms | 4ms | **~10x** |
| Add 1,000 to branch | 29ms | 2ms | **~14x** |
| Insert 1,000 embeddings | 59ms | 40ms | **~1.5x** |

Run benchmarks yourself: `npx tsx benchmarks/run.ts`

## 🎯 Choosing a Provider

Use this decision tree to pick the right embedding provider:

```
        ┌─────────────────────────┐
        │ Do you have Copilot?    │
        └───────────┬─────────────┘
              ┌─────┴─────┐
             YES         NO
              │           │
  ┌───────────▼───────┐   │
  │ Codebase < 1k     │   │
  │ files?            │   │
  └─────────┬─────────┘   │
      ┌─────┴─────┐       │
     YES         NO       │
      │           │       │
      ▼           │       │
 ┌──────────┐     │       │
 │ Copilot  │     │       │
 │ (free)   │     │       │
 └──────────┘     │       │
                  ▼       ▼
          ┌─────────────────────────┐
          │ Need fastest indexing?  │
          └───────────┬─────────────┘
                ┌─────┴─────┐
               YES         NO
                │           │
                ▼           ▼
           ┌──────────┐ ┌──────────────┐
           │ Ollama   │ │ OpenAI or    │
           │ (local)  │ │ Google       │
           └──────────┘ └──────────────┘
```

### Provider Comparison

| Provider | Speed | Cost | Privacy | Best For |
|----------|-------|------|---------|----------|
| **Ollama** | Fastest | Free | Full | Large codebases, privacy-sensitive |
| **GitHub Copilot** | Slow (rate limited) | Free* | Cloud | Small codebases, existing subscribers |
| **OpenAI** | Medium | ~$0.0001/1K tokens | Cloud | General use |
| **Google** | Fast | Free tier available | Cloud | Medium-large codebases |
| **Custom** | Varies | Varies | Varies | Self-hosted or third-party endpoints |

*Requires active Copilot subscription

### Setup by Provider

**Ollama (Recommended for large codebases)**
```bash
ollama pull nomic-embed-text
```
```json
{ "embeddingProvider": "ollama" }
```

**OpenAI**
```bash
export OPENAI_API_KEY=sk-...
```
```json
{ "embeddingProvider": "openai" }
```

**Google**
```bash
export GOOGLE_API_KEY=...
```
```json
{ "embeddingProvider": "google" }
```

**GitHub Copilot**
No setup needed if you have an active Copilot subscription.
```json
{ "embeddingProvider": "github-copilot" }
```

**Custom (OpenAI-compatible)**
Works with any server that implements the OpenAI `/v1/embeddings` API format (llama.cpp, vLLM, text-embeddings-inference, LiteLLM, etc.).
```json
{
  "embeddingProvider": "custom",
  "customProvider": {
    "baseUrl": "http://localhost:11434/v1",
    "model": "nomic-embed-text",
    "dimensions": 768,
    "apiKey": "optional-api-key",
    "maxTokens": 8192,
    "timeoutMs": 30000
  }
}
```
Required fields: `baseUrl`, `model`, `dimensions` (positive integer). Optional: `apiKey`, `maxTokens`, `timeoutMs` (default: 30000).
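
The required/optional split above can be expressed as a small validation function. This is a sketch of the stated rules only: `CustomProviderConfig` mirrors the documented field names, while `validateCustomProvider` and its error messages are hypothetical and not the plugin's own schema code.

```typescript
interface CustomProviderConfig {
  baseUrl: string;
  model: string;
  dimensions: number;
  apiKey?: string;
  maxTokens?: number;
  timeoutMs?: number;
}

// Enforce the documented rules: three required fields, dimensions a positive integer,
// and timeoutMs defaulting to 30000 when omitted.
function validateCustomProvider(input: Partial<CustomProviderConfig>): CustomProviderConfig {
  const { baseUrl, model, dimensions } = input;
  if (typeof baseUrl !== "string" || baseUrl.length === 0) {
    throw new Error("customProvider.baseUrl is required");
  }
  if (typeof model !== "string" || model.length === 0) {
    throw new Error("customProvider.model is required");
  }
  if (typeof dimensions !== "number" || !Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error("customProvider.dimensions must be a positive integer");
  }
  return { ...input, baseUrl, model, dimensions, timeoutMs: input.timeoutMs ?? 30000 };
}

const localProvider = validateCustomProvider({
  baseUrl: "http://localhost:11434/v1",
  model: "nomic-embed-text",
  dimensions: 768,
});
// localProvider.timeoutMs is 30000 because the field was omitted.
```

Getting `dimensions` right matters because it has to match the vector length the endpoint actually returns for the chosen model.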

## ⚠️ Tradeoffs

Be aware of these characteristics:

| Aspect | Reality |
|--------|---------|
| **Search latency** | ~800-1000ms per query (embedding API call) |
| **First index** | Takes time depending on codebase size (e.g., ~30s for 500 chunks) |
| **Requires API** | Needs an embedding provider (Copilot, OpenAI, Google, or local Ollama) |
| **Token costs** | Uses embedding tokens (free with Copilot, minimal with others) |
| **Best for** | Discovery and exploration, not exhaustive matching |

## 💻 Local Development

1. **Build**:
   ```bash
   npm run build
   ```

2. **Register in Test Project** (use `file://` URL in `opencode.json`):
   ```json
   {
     "plugin": [
       "file:///path/to/opencode-codebase-index"
     ]
   }
   ```

   This loads directly from your source directory, so changes take effect after rebuilding.

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Make your changes and add tests
4. Run checks: `npm run build && npm run test:run && npm run lint`
5. Commit: `git commit -m "feat: add my feature"`
6. Push and open a pull request

CI will automatically run tests and type checking on your PR.

### Release process (structured + complete notes)

To ensure release notes reflect all merged work, this repo uses a draft-release workflow.

1. **Label every PR** with at least one semantic label:
   - `feature`, `bug`, `performance`, `documentation`, `dependencies`, `refactor`, `test`, `chore`
   - and (when relevant) `semver:major`, `semver:minor`, or `semver:patch`
   - PRs are validated by CI (`Release Label Check`) and fail if no release category label is present
2. **Let Release Drafter build the draft notes** automatically from merged PRs on `main`.
3. **Before publishing**:
   - copy/finalize relevant highlights into `CHANGELOG.md`
   - bump `package.json` version
   - run: `npm run build && npm run typecheck && npm run lint && npm run test:run`
4. **Publish release** from the draft (or via `gh release create` after reviewing draft content).

PRs labeled `skip-changelog` are intentionally excluded from release notes.

### Project Structure

```
├── src/
│   ├── index.ts              # Plugin entry point
│   ├── mcp-server.ts         # MCP server (Cursor, Claude Code, Windsurf)
│   ├── cli.ts                # CLI entry for MCP stdio transport
│   ├── config/               # Configuration schema
│   ├── embeddings/           # Provider detection and API calls
│   ├── indexer/              # Core indexing logic + inverted index
│   ├── git/                  # Git utilities (branch detection)
│   ├── tools/                # OpenCode tool definitions
│   ├── utils/                # File collection, cost estimation
│   ├── native/               # Rust native module wrapper
│   └── watcher/              # File/git change watcher
├── native/
│   └── src/                  # Rust: tree-sitter, usearch, xxhash, SQLite
├── tests/                    # Unit tests (vitest)
├── commands/                 # Slash command definitions
├── skill/                    # Agent skill guidance
└── .github/workflows/        # CI/CD (test, build, publish)
```

### Native Module

The Rust native module handles performance-critical operations:
- **tree-sitter**: Language-aware code parsing with JSDoc/docstring extraction
- **usearch**: High-performance vector similarity search with F16 quantization
- **SQLite**: Persistent storage for embeddings, chunks, branch catalog, symbols, and call edges
- **BM25 inverted index**: Fast keyword search for hybrid retrieval
- **Call graph extraction**: Tree-sitter query-based extraction of function calls, method calls, constructors, and imports (TypeScript/JavaScript, Python, Go, Rust)
- **xxhash**: Fast content hashing for change detection
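
For readers unfamiliar with the keyword side of hybrid retrieval, here is a compact BM25 sketch over an in-memory inverted index. It is written in TypeScript for readability, whereas the plugin's version lives in the Rust module; the tokenizer, the `k1 = 1.2` and `b = 0.75` defaults, and the function names are common textbook choices assumed here, not values taken from the plugin.

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Okapi BM25 scores for every document, using a term -> (doc index -> term frequency) index.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);
  const index = new Map<string, Map<number, number>>();
  tokenized.forEach((tokens, docId) => {
    for (const t of tokens) {
      const postings = index.get(t) ?? new Map<number, number>();
      postings.set(docId, (postings.get(docId) ?? 0) + 1);
      index.set(t, postings);
    }
  });
  const scores = docs.map(() => 0);
  Array.from(new Set(tokenize(query))).forEach((term) => {
    const postings = index.get(term);
    if (!postings) return; // term never appears, so it contributes nothing
    const idf = Math.log(1 + (docs.length - postings.size + 0.5) / (postings.size + 0.5));
    postings.forEach((tf, docId) => {
      const len = tokenized[docId]?.length ?? 0;
      const norm = tf + k1 * (1 - b + (b * len) / avgLen);
      scores[docId] = (scores[docId] ?? 0) + idf * ((tf * (k1 + 1)) / norm);
    });
  });
  return scores;
}

const keywordDocs = ["validate credit card number", "render the card component", "send email notification"];
const keywordScores = bm25Scores("credit card", keywordDocs);
// The first document matches both terms, the second only "card", the third none.
```

BM25 rewards exact term overlap, so it complements the vector side: a query for "validation" would miss "validate" here, which is the gap semantic retrieval is meant to cover.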

Rebuild with: `npm run build:native` (requires Rust toolchain)

### Platform Support

Pre-built native binaries are published for:

| Platform | Architecture | SIMD Acceleration |
|----------|-------------|--------------------|
| macOS | x86_64 | ✅ simsimd |
| macOS | ARM64 (Apple Silicon) | ✅ simsimd |
| Linux | x86_64 (GNU) | ✅ simsimd |
| Linux | ARM64 (GNU) | ✅ simsimd |
| Windows | x86_64 (MSVC) | ❌ scalar fallback |

Windows builds use scalar distance functions instead of SIMD. They are functionally identical and only marginally slower for very large indexes. This is due to MSVC lacking support for certain AVX-512 intrinsics used by simsimd.

## License

MIT