opencode-codebase-index 0.2.5 → 0.3.2

package/README.md CHANGED
@@ -117,6 +117,8 @@ graph TD
117
117
  ```
118
118
 
119
119
  1. **Parsing**: We use `tree-sitter` to intelligently parse your code into meaningful blocks (functions, classes, interfaces). JSDoc comments and docstrings are automatically included with their associated code.
120
+
121
+ **Supported Languages**: TypeScript, JavaScript, Python, Rust, Go, Java, C#, Ruby, Bash, C, C++, JSON, TOML, YAML
120
122
  2. **Chunking**: Large blocks are split with overlapping windows to preserve context across chunk boundaries.
121
123
  3. **Embedding**: These blocks are converted into vector representations using your configured AI provider.
122
124
  4. **Storage**: Embeddings are stored in SQLite (deduplicated by content hash) and vectors in `usearch` with F16 quantization for 50% memory savings. A branch catalog tracks which chunks exist on each branch.
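Review note: the "50% memory savings" figure in the Storage step follows from component width alone, assuming the baseline is 32-bit floats (the README does not state the baseline). A minimal arithmetic sketch; the dimension `768`, the vector count, and the helper name `vectorBytes` are illustrative assumptions, and index graph overhead is ignored.

```typescript
// Raw payload size of `count` vectors with `dims` components each.
function vectorBytes(count: number, dims: number, bytesPerComponent: number): number {
  return count * dims * bytesPerComponent;
}

const dims = 768;     // assumed embedding dimension, for illustration only
const count = 10000;  // assumed number of stored vectors

const f32Bytes = vectorBytes(count, dims, 4); // 32-bit floats: 4 bytes per component
const f16Bytes = vectorBytes(count, dims, 2); // F16 quantization: 2 bytes per component

// Fraction of raw vector memory saved by storing F16 instead of F32.
const savings = 1 - f16Bytes / f32Bytes;
console.log(f32Bytes, f16Bytes, savings);
```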
@@ -196,6 +198,14 @@ Checks if the index is ready and healthy.
196
198
  ### `index_health_check`
197
199
  Maintenance tool to remove stale entries from deleted files and orphaned embeddings/chunks from the database.
198
200
 
201
+ ### `index_metrics`
202
+ Returns collected metrics about indexing and search performance. Requires `debug.enabled` and `debug.metrics` to be `true`.
203
+ - **Metrics include**: Files indexed, chunks created, cache hit rate, search timing breakdown, GC stats, embedding API call stats.
204
+
205
+ ### `index_logs`
206
+ Returns recent debug logs with optional filtering.
207
+ - **Parameters**: `category` (optional: `search`, `embedding`, `cache`, `gc`, `branch`), `level` (optional: `error`, `warn`, `info`, `debug`), `limit` (default: 50).
208
+
199
209
  ## 🎮 Slash Commands
200
210
 
201
211
  The plugin automatically registers these slash commands:
@@ -205,6 +215,7 @@ The plugin automatically registers these slash commands:
205
215
  | `/search <query>` | **Pure Semantic Search**. Best for "How does X work?" |
206
216
  | `/find <query>` | **Hybrid Search**. Combines semantic search + grep. Best for "Find usage of X". |
207
217
  | `/index` | **Update Index**. Forces a refresh of the codebase index. |
218
+ | `/status` | **Check Status**. Shows if indexed, chunk count, and provider info. |
208
219
 
209
220
  ## ⚙️ Configuration
210
221
 
@@ -219,13 +230,21 @@ Zero-config by default (uses `auto` mode). Customize in `.opencode/codebase-inde
219
230
  "watchFiles": true,
220
231
  "maxFileSize": 1048576,
221
232
  "maxChunksPerFile": 100,
222
- "semanticOnly": false
233
+ "semanticOnly": false,
234
+ "autoGc": true,
235
+ "gcIntervalDays": 7,
236
+ "gcOrphanThreshold": 100
223
237
  },
224
238
  "search": {
225
239
  "maxResults": 20,
226
240
  "minScore": 0.1,
227
241
  "hybridWeight": 0.5,
228
242
  "contextLines": 0
243
+ },
244
+ "debug": {
245
+ "enabled": false,
246
+ "logLevel": "info",
247
+ "metrics": false
229
248
  }
230
249
  }
231
250
  ```
@@ -244,11 +263,23 @@ Zero-config by default (uses `auto` mode). Customize in `.opencode/codebase-inde
244
263
  | `semanticOnly` | `false` | When `true`, only index semantic nodes (functions, classes) and skip generic blocks |
245
264
  | `retries` | `3` | Number of retry attempts for failed embedding API calls |
246
265
  | `retryDelayMs` | `1000` | Delay between retries in milliseconds |
266
+ | `autoGc` | `true` | Automatically run garbage collection to remove orphaned embeddings/chunks |
267
+ | `gcIntervalDays` | `7` | Run GC on initialization if last GC was more than N days ago |
268
+ | `gcOrphanThreshold` | `100` | Run GC after indexing if orphan count exceeds this threshold |
247
269
  | **search** | | |
248
270
  | `maxResults` | `20` | Maximum results to return |
249
271
  | `minScore` | `0.1` | Minimum similarity score (0-1). Lower = more results |
250
272
  | `hybridWeight` | `0.5` | Balance between keyword (1.0) and semantic (0.0) search |
251
273
  | `contextLines` | `0` | Extra lines to include before/after each match |
274
+ | **debug** | | |
275
+ | `enabled` | `false` | Enable debug logging and metrics collection |
276
+ | `logLevel` | `"info"` | Log level: `error`, `warn`, `info`, `debug` |
277
+ | `logSearch` | `true` | Log search operations with timing breakdown |
278
+ | `logEmbedding` | `true` | Log embedding API calls (success, error, rate-limit) |
279
+ | `logCache` | `true` | Log cache hits and misses |
280
+ | `logGc` | `true` | Log garbage collection operations |
281
+ | `logBranch` | `true` | Log branch detection and switches |
282
+ | `metrics` | `false` | Enable metrics collection (indexing stats, search timing, cache performance) |
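Review note: as documented in the table above, the three GC options describe two independent triggers, both gated by `autoGc`. A minimal sketch of that decision, assuming hypothetical names (`GcConfig`, `shouldRunGcOnInit`, `shouldRunGcAfterIndexing`) and a last-GC time tracked as a millisecond timestamp; the plugin's real implementation may differ.

```typescript
// Hypothetical shape mirroring the documented GC options.
interface GcConfig {
  autoGc: boolean;
  gcIntervalDays: number;
  gcOrphanThreshold: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Trigger 1: on initialization, run GC if the last run was more than N days ago.
// Assumption: a missing timestamp (GC has never run) counts as "due".
function shouldRunGcOnInit(cfg: GcConfig, lastGcMs: number | null, nowMs: number): boolean {
  if (!cfg.autoGc) return false;
  if (lastGcMs === null) return true;
  return nowMs - lastGcMs > cfg.gcIntervalDays * MS_PER_DAY;
}

// Trigger 2: after indexing, run GC if the orphan count exceeds the threshold.
function shouldRunGcAfterIndexing(cfg: GcConfig, orphanCount: number): boolean {
  return cfg.autoGc && orphanCount > cfg.gcOrphanThreshold;
}

// Defaults as listed in the README table.
const gcDefaults: GcConfig = { autoGc: true, gcIntervalDays: 7, gcOrphanThreshold: 100 };
```

With the defaults, an index last collected eight days ago is collected on startup, and an indexing run that leaves 101 orphans triggers a collection while one that leaves exactly 100 does not.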
252
283
 
253
284
  ### Embedding Providers
254
285
  The plugin automatically detects available credentials in this order:
@@ -257,6 +288,150 @@ The plugin automatically detects available credentials in this order:
257
288
  3. **Google** (Gemini Embeddings)
258
289
  4. **Ollama** (Local/Private - requires `nomic-embed-text`)
259
290
 
291
+ ### Rate Limits by Provider
292
+
293
+ Each provider has different rate limits. The plugin automatically adjusts concurrency and delays:
294
+
295
+ | Provider | Concurrency | Delay | Best For |
296
+ |----------|-------------|-------|----------|
297
+ | **GitHub Copilot** | 1 | 4s | Small codebases (<1k files) |
298
+ | **OpenAI** | 3 | 500ms | Medium codebases |
299
+ | **Google** | 5 | 200ms | Medium-large codebases |
300
+ | **Ollama** | 5 | None | Large codebases (10k+ files) |
301
+
302
+ **For large codebases**, use Ollama locally to avoid rate limits:
303
+
304
+ ```bash
305
+ # Install the embedding model
306
+ ollama pull nomic-embed-text
307
+ ```
308
+
309
+ ```json
310
+ // .opencode/codebase-index.json
311
+ {
312
+ "embeddingProvider": "ollama"
313
+ }
314
+ ```
315
+
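Review note: the rate-limit table explains why Copilot is recommended only for small codebases. A rough sketch of the delay overhead per provider, under a loudly stated assumption: requests are sent in waves of `concurrency`, with `delayMs` between consecutive waves. The README does not specify how the delay is applied, so `delayOverheadMs` is a hypothetical estimate and excludes the API latency itself.

```typescript
type Provider = "github-copilot" | "openai" | "google" | "ollama";

// Values copied from the rate-limit table above.
const RATE_LIMITS: Record<Provider, { concurrency: number; delayMs: number }> = {
  "github-copilot": { concurrency: 1, delayMs: 4000 },
  openai: { concurrency: 3, delayMs: 500 },
  google: { concurrency: 5, delayMs: 200 },
  ollama: { concurrency: 5, delayMs: 0 },
};

// Assumed model: one delay between each pair of consecutive waves.
function delayOverheadMs(provider: Provider, requests: number): number {
  const { concurrency, delayMs } = RATE_LIMITS[provider];
  const waves = Math.ceil(requests / concurrency);
  return Math.max(0, waves - 1) * delayMs;
}

console.log(delayOverheadMs("github-copilot", 1000)); // 3996000 ms, about 67 minutes
console.log(delayOverheadMs("ollama", 1000));         // 0 ms
```

Under this model, 1,000 embedding requests spend over an hour waiting on Copilot and no time waiting on Ollama, which matches the "Best For" column.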
316
+ ## 📈 Performance
317
+
318
+ The plugin is built for speed with a Rust native module. Here are typical performance numbers (Apple M1):
319
+
320
+ ### Parsing (tree-sitter)
321
+
322
+ | Files | Chunks | Time |
323
+ |-------|--------|------|
324
+ | 100 | 1,200 | ~7ms |
325
+ | 500 | 6,000 | ~32ms |
326
+
327
+ ### Vector Search (usearch)
328
+
329
+ | Index Size | Search Time | Throughput |
330
+ |------------|-------------|------------|
331
+ | 1,000 vectors | 0.7ms | 1,400 ops/sec |
332
+ | 5,000 vectors | 1.2ms | 850 ops/sec |
333
+ | 10,000 vectors | 1.3ms | 780 ops/sec |
334
+
335
+ ### Database Operations (SQLite with batch)
336
+
337
+ | Operation | 1,000 items | 10,000 items |
338
+ |-----------|-------------|--------------|
339
+ | Insert chunks | 4ms | 44ms |
340
+ | Add to branch | 2ms | 22ms |
341
+ | Check embedding exists | <0.01ms | <0.01ms |
342
+
343
+ ### Batch vs Sequential Performance
344
+
345
+ Batch operations provide significant speedups:
346
+
347
+ | Operation | Sequential | Batch | Speedup |
348
+ |-----------|------------|-------|---------|
349
+ | Insert 1,000 chunks | 38ms | 4ms | **~10x** |
350
+ | Add 1,000 to branch | 29ms | 2ms | **~14x** |
351
+ | Insert 1,000 embeddings | 59ms | 40ms | **~1.5x** |
352
+
353
+ Run benchmarks yourself: `npx tsx benchmarks/run.ts`
354
+
355
+ ## 🎯 Choosing a Provider
356
+
357
+ Use this decision tree to pick the right embedding provider:
358
+
359
+ ```
360
+       ┌─────────────────────────┐
361
+       │ Do you have Copilot?    │
362
+       └───────────┬─────────────┘
363
+             ┌─────┴─────┐
364
+            YES         NO
365
+             │           │
366
+ ┌───────────▼───────┐   │
367
+ │ Codebase < 1k     │   │
368
+ │ files?            │   │
369
+ └─────────┬─────────┘   │
370
+     ┌─────┴─────┐       │
371
+    YES         NO       │
372
+     │           │       │
373
+     ▼           │       │
374
+ ┌──────────┐    │       │
375
+ │ Copilot  │    │       │
376
+ │ (free)   │    │       │
377
+ └──────────┘    │       │
378
+                 ▼       ▼
379
+       ┌─────────────────────────┐
380
+       │ Need fastest indexing?  │
381
+       └───────────┬─────────────┘
382
+             ┌─────┴─────┐
383
+            YES         NO
384
+             │           │
385
+             ▼           ▼
386
+        ┌──────────┐ ┌──────────────┐
387
+        │ Ollama   │ │ OpenAI or    │
388
+        │ (local)  │ │ Google       │
389
+        └──────────┘ └──────────────┘
390
+ ```
391
+
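Review note: the decision tree above reduces to two checks. A minimal sketch that encodes it directly; the names `Situation`, `Recommendation`, and `chooseProvider` are hypothetical and not part of the plugin.

```typescript
type Recommendation = "github-copilot" | "ollama" | "openai-or-google";

// Inputs corresponding to the three questions in the tree.
interface Situation {
  hasCopilot: boolean;
  fileCount: number;
  needFastestIndexing: boolean;
}

function chooseProvider(s: Situation): Recommendation {
  // Copilot is only recommended for subscribers with fewer than 1k files.
  if (s.hasCopilot && s.fileCount < 1000) return "github-copilot";
  // Everyone else falls through to the "Need fastest indexing?" question.
  return s.needFastestIndexing ? "ollama" : "openai-or-google";
}

console.log(chooseProvider({ hasCopilot: true, fileCount: 300, needFastestIndexing: false }));
```

Note that a Copilot subscriber with a large codebase lands on the same second question as a non-subscriber, exactly as the two downward arrows in the diagram indicate.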
392
+ ### Provider Comparison
393
+
394
+ | Provider | Speed | Cost | Privacy | Best For |
395
+ |----------|-------|------|---------|----------|
396
+ | **Ollama** | Fastest | Free | Full | Large codebases, privacy-sensitive |
397
+ | **GitHub Copilot** | Slow (rate limited) | Free* | Cloud | Small codebases, existing subscribers |
398
+ | **OpenAI** | Medium | ~$0.0001/1K tokens | Cloud | General use |
399
+ | **Google** | Fast | Free tier available | Cloud | Medium-large codebases |
400
+
401
+ *Requires active Copilot subscription
402
+
403
+ ### Setup by Provider
404
+
405
+ **Ollama (Recommended for large codebases)**
406
+ ```bash
407
+ ollama pull nomic-embed-text
408
+ ```
409
+ ```json
410
+ { "embeddingProvider": "ollama" }
411
+ ```
412
+
413
+ **OpenAI**
414
+ ```bash
415
+ export OPENAI_API_KEY=sk-...
416
+ ```
417
+ ```json
418
+ { "embeddingProvider": "openai" }
419
+ ```
420
+
421
+ **Google**
422
+ ```bash
423
+ export GOOGLE_API_KEY=...
424
+ ```
425
+ ```json
426
+ { "embeddingProvider": "google" }
427
+ ```
428
+
429
+ **GitHub Copilot**
430
+ No setup needed if you have an active Copilot subscription.
431
+ ```json
432
+ { "embeddingProvider": "github-copilot" }
433
+ ```
434
+
260
435
  ## ⚠️ Tradeoffs
261
436
 
262
437
  Be aware of these characteristics:
package/commands/find.md CHANGED
@@ -2,12 +2,24 @@
2
2
  description: Find code using hybrid approach (semantic + grep)
3
3
  ---
4
4
 
5
- Find code related to: $ARGUMENTS
5
+ Find code using both semantic search and grep.
6
+
7
+ User input: $ARGUMENTS
6
8
 
7
9
  Strategy:
8
- 1. First use `codebase_search` to find semantically related code
9
- 2. From the results, identify specific function/class names
10
+ 1. Use `codebase_search` to find semantically related code
11
+ 2. Identify specific function/class/variable names from results
10
12
  3. Use grep to find all occurrences of those identifiers
11
- 4. Combine findings into a comprehensive answer
13
+ 4. Combine into a comprehensive answer
14
+
15
+ Parse optional parameters from input:
16
+ - `limit=N` → limit semantic results
17
+ - `type=X` or "functions"/"classes" → filter chunk type
18
+ - `dir=X` → filter directory
19
+
20
+ Examples:
21
+ - `/find error handling middleware`
22
+ - `/find payment validation type=function`
23
+ - `/find user auth in src/services`
12
24
 
13
- If the semantic index doesn't exist, run `index_codebase` first.
25
+ If no index exists, run `index_codebase` first.
package/commands/index.md CHANGED
@@ -2,10 +2,20 @@
2
2
  description: Index the codebase for semantic search
3
3
  ---
4
4
 
5
- Run the `index_codebase` tool to create or update the semantic search index.
5
+ Run the `index_codebase` tool with these settings:
6
6
 
7
- Show progress and final statistics including:
8
- - Number of files processed
9
- - Number of chunks indexed
10
- - Tokens used
11
- - Duration
7
+ User input: $ARGUMENTS
8
+
9
+ Parse the input and set tool arguments:
10
+ - force=true if input contains "force"
11
+ - estimateOnly=true if input contains "estimate"
12
+ - verbose=true (always, for detailed output)
13
+
14
+ Examples:
15
+ - `/index` → force=false, estimateOnly=false, verbose=true
16
+ - `/index force` → force=true, estimateOnly=false, verbose=true
17
+ - `/index estimate` → force=false, estimateOnly=true, verbose=true
18
+
19
+ IMPORTANT: You MUST pass the parsed arguments to `index_codebase`. Do not ignore them.
20
+
21
+ Show final statistics including files processed, chunks indexed, tokens used, and duration.
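Review note: the new `/index` prompt asks the model to derive tool arguments from free text. A deterministic sketch of the same mapping, useful as a reference when checking the model's behavior; `IndexArgs` and `parseIndexArgs` are hypothetical names, and case-insensitive matching is an assumption the prompt does not state.

```typescript
// Argument names taken from the prompt: force, estimateOnly, verbose.
interface IndexArgs {
  force: boolean;
  estimateOnly: boolean;
  verbose: boolean;
}

function parseIndexArgs(input: string): IndexArgs {
  return {
    force: /force/i.test(input),           // true if input contains "force"
    estimateOnly: /estimate/i.test(input), // true if input contains "estimate"
    verbose: true,                         // always on, per the prompt
  };
}

console.log(parseIndexArgs("force"));
```

Because the rule is plain substring containment, an input such as `/index force estimate` sets both flags; the prompt does not say which one should win in that case.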
@@ -2,8 +2,23 @@
2
2
  description: Search codebase by meaning using semantic search
3
3
  ---
4
4
 
5
- Use the `codebase_search` tool to find code related to: $ARGUMENTS
5
+ Search the codebase using semantic search.
6
6
 
7
- If the index doesn't exist yet, run `index_codebase` first.
7
+ User input: $ARGUMENTS
8
8
 
9
- Return the most relevant results with file paths and line numbers.
9
+ The first part is the search query. Look for optional parameters:
10
+ - `limit=N` or "top N" or "first N" → set limit
11
+ - `type=X` or mentions "functions"/"classes"/"methods" → set chunkType
12
+ - `dir=X` or "in folder X" → set directory filter
13
+ - File extensions like ".ts", "typescript", ".py" → set fileType
14
+
15
+ Call `codebase_search` with the parsed arguments.
16
+
17
+ Examples:
18
+ - `/search authentication logic` → query="authentication logic"
19
+ - `/search error handling limit=5` → query="error handling", limit=5
20
+ - `/search validation functions` → query="validation", chunkType="function"
21
+
22
+ If the index doesn't exist, run `index_codebase` first.
23
+
24
+ Return results with file paths and line numbers.
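Review note: the explicit parts of the new `/search` syntax (`limit=N`, "top N", "first N", `type=X`, `dir=X`) can be parsed mechanically. A minimal sketch under stated assumptions: `SearchArgs` and `parseSearchArgs` are hypothetical names, the property name `directory` is a guess at how the directory filter is passed, and natural-language hints such as "functions", "in folder X", or file extensions are deliberately left to the model.

```typescript
interface SearchArgs {
  query: string;
  limit?: number;
  chunkType?: string;
  directory?: string; // assumed property name for the directory filter
}

function parseSearchArgs(input: string): SearchArgs {
  const args: SearchArgs = { query: "" };
  let rest = input;

  // limit=N, "top N", or "first N"
  rest = rest.replace(/\b(?:limit=|top\s+|first\s+)(\d+)\b/i, (_m, n: string) => {
    args.limit = parseInt(n, 10);
    return " ";
  });
  // type=X
  rest = rest.replace(/\btype=(\S+)/i, (_m, t: string) => {
    args.chunkType = t;
    return " ";
  });
  // dir=X
  rest = rest.replace(/\bdir=(\S+)/i, (_m, d: string) => {
    args.directory = d;
    return " ";
  });

  // Whatever remains is the search query.
  args.query = rest.trim().replace(/\s+/g, " ");
  return args;
}

console.log(parseSearchArgs("error handling limit=5"));
```

For the prompt's third example, `/search validation functions`, this sketch returns the query unchanged; mapping "functions" to `chunkType="function"` relies on the model's interpretation.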
@@ -0,0 +1,15 @@
1
+ ---
2
+ description: Check if the codebase is indexed and ready for semantic search
3
+ ---
4
+
5
+ Run the `index_status` tool to check if the codebase index is ready.
6
+
7
+ This shows:
8
+ - Whether the codebase is indexed
9
+ - Number of indexed chunks
10
+ - Embedding provider and model being used
11
+ - Current git branch
12
+
13
+ No arguments needed - just run `index_status`.
14
+
15
+ If not indexed, suggest running `/index` to create the index.