@comfanion/usethis_search 3.0.0-dev.9 → 3.0.1

package/README.md CHANGED
@@ -1,28 +1,27 @@
1
- # 🔍 @comfanion/usethis_search
1
+ # @comfanion/usethis_search
2
2
 
3
- **Semantic code search with automatic indexing**
3
+ **Semantic code search with graph-based context for OpenCode**
4
4
 
5
- Forget about `grep` and `find` — search code by meaning, not by text!
5
+ Search code by meaning, not by text. Get related context automatically via the code graph.
6
6
 
7
7
  ---
8
8
 
9
- ## What is this?
9
+ ## What is this?
10
10
 
11
11
  An OpenCode plugin that adds **smart search** to your project:
12
12
 
13
- - 🧠 **Semantic search** — finds code by meaning, even when words don't match
14
- - 🔀 **Hybrid search (v2)** — combines vector similarity + BM25 keyword matching
15
- - 🧩 **Semantic chunking (v2)** — structure-aware splitting for Markdown (headings) and code (functions/classes)
16
- - 🏷️ **Rich metadata (v2)** — filter by file type, language, date, tags
17
- - **Automatic indexing** — files are indexed on change (zero effort)
18
- - 📦 **Local vectorization** — works offline, no API keys needed
19
- - 🎯 **Three indexes** — separate for code, docs, and configs
20
- - 📊 **Quality metrics (v2)** — track search relevance and usage
21
- - 🌍 **Multilingual** — supports Ukrainian, Russian, and English
13
+ - **Semantic search** — finds code by meaning, even when words don't match
14
+ - **Hybrid search** — combines vector similarity + BM25 keyword matching
15
+ - **Graph-based context** — automatically attaches related code (imports, calls, type references) to search results
16
+ - **Two-phase indexing** — BM25 + graph search available immediately (Phase 1), vector search after embedding (Phase 2)
17
+ - **Simplified API** — 5 parameters, smart filter parsing, config-driven defaults
18
+ - **Automatic indexing** — files are indexed on change, zero effort
19
+ - **Local vectorization** — works offline, no API keys needed
20
+ - **Three indexes** — separate for code, docs, and configs
22
21
 
23
22
  ---
24
23
 
25
- ## 🚀 Quick Start
24
+ ## Quick Start
26
25
 
27
26
  ### Installation
28
27
 
@@ -44,76 +43,104 @@ Add to `opencode.json`:
44
43
 
45
44
  On OpenCode startup, the plugin automatically:
46
45
  1. Creates indexes for code and documentation
47
- 2. Indexes all project files
48
- 3. Shows progress via toast notifications
46
+ 2. Phase 1: chunks files, builds code graph (fast, parallel) — **BM25 search available immediately**
47
+ 3. Phase 2: embeds chunks into vectors — **hybrid search available after completion**
49
48
 
50
- **First indexing may take time:**
51
- - < 20 files — Quick coffee? ☕
52
- - < 100 files — ~1min. Stretch break? 🧘
53
- - < 500 files — ~3min. Make coffee ☕ and relax 🛋️
54
- - 500+ files — ~10min. Go touch grass 🌿 or take a nap 😴
49
+ **Indexing time estimates:**
50
+ - < 100 files — ~1 min
51
+ - < 500 files — ~3 min
52
+ - 500+ files — ~10 min
55
53
 
56
54
  ---
57
55
 
58
- ## 🎯 How to Use
56
+ ## Search API
59
57
 
60
- ### Search
58
+ The search tool has 5 parameters:
59
+
60
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `query` | string | required | What you're looking for (semantic) |
+ | `index` | string | `"code"` | Which index: `code`, `docs`, `config` |
+ | `limit` | number | 10 | Number of results |
+ | `searchAll` | boolean | false | Search across all indexes |
+ | `filter` | string | — | Filter by path or language |
67
+
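As a rough illustration of how a single `filter` string could map to a path or language constraint, the sketch below classifies inputs using the rules listed under "Filter syntax". The function name `parseFilter`, the language lists, and the returned object shape are assumptions made for this example, not the plugin's actual implementation.

```javascript
// Hypothetical sketch: classify a `filter` string by the rules in the "Filter syntax" table.
// KNOWN_LANGUAGES and EXT_TO_LANGUAGE are illustrative assumptions, not the plugin's real lists.
const KNOWN_LANGUAGES = new Set(["go", "python", "typescript", "javascript", "rust"]);
const EXT_TO_LANGUAGE = { go: "go", py: "python", ts: "typescript", js: "javascript", rs: "rust" };

function parseFilter(filter) {
  const value = filter.trim();

  // "internal/**/*.go" -> path prefix + language
  const glob = value.match(/^(.+?\/)\*\*\/\*\.(\w+)$/);
  if (glob) {
    return { pathPrefix: glob[1], language: EXT_TO_LANGUAGE[glob[2]] ?? glob[2] };
  }

  // "*.go" or ".go" -> language
  const ext = value.match(/^\*?\.(\w+)$/);
  if (ext) {
    return { language: EXT_TO_LANGUAGE[ext[1]] ?? ext[1] };
  }

  // "go" or "python" -> language
  if (KNOWN_LANGUAGES.has(value.toLowerCase())) {
    return { language: value.toLowerCase() };
  }

  // "internal/domain/" -> path prefix
  if (value.endsWith("/")) {
    return { pathPrefix: value };
  }

  // Anything else, e.g. "service" -> substring match on the file path
  return { pathContains: value };
}
```

The order of the checks matters in this sketch: the combined glob is tested first so that `internal/**/*.go` is not mistaken for a plain substring.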
68
+ ### Search examples
61
69
 
62
70
  ```javascript
63
- // Search for authentication logic
64
- search({
65
- query: "authentication logic",
66
- index: "code"
67
- })
68
-
69
- // Search for deployment instructions
70
- search({
71
- query: "how to deploy",
72
- index: "docs"
73
- })
74
-
75
- // Search for API keys in configs
76
- search({
77
- query: "API keys",
78
- index: "config"
79
- })
80
-
81
- // Search across all indexes
82
- search({
83
- query: "database connection",
84
- searchAll: true
85
- })
86
-
87
- // v2: Hybrid search (vector + keyword matching)
88
- search({
89
- query: "getUserById",
90
- hybrid: true
91
- })
92
-
93
- // v2: Filter by file type and language
94
- search({
95
- query: "authentication logic",
96
- fileType: "code",
97
- language: "typescript"
98
- })
99
-
100
- // v2: Filter by date
101
- search({
102
- query: "recent changes",
103
- modifiedAfter: "2024-06-01"
104
- })
105
-
106
- // v2: Filter by frontmatter tags
107
- search({
108
- query: "security",
109
- tags: "auth,security"
110
- })
71
+ // Basic semantic search
72
+ search({ query: "authentication logic" })
73
+
74
+ // Search documentation
75
+ search({ query: "how to deploy", index: "docs" })
76
+
77
+ // Search all indexes
78
+ search({ query: "database connection", searchAll: true })
79
+
80
+ // Filter by directory
81
+ search({ query: "tenant management", filter: "internal/domain/" })
82
+
83
+ // Filter by language
84
+ search({ query: "event handling", filter: "*.go" })
85
+ search({ query: "middleware", filter: "go" })
86
+
87
+ // Combined: directory + language
88
+ search({ query: "API routes", filter: "internal/**/*.go" })
89
+
90
+ // Substring match on file path
91
+ search({ query: "metrics", filter: "service" })
92
+
93
+ // More results
94
+ search({ query: "error handling", limit: 20 })
95
+ ```
96
+
97
+ ### Filter syntax
98
+
99
+ The `filter` parameter auto-detects the kind of filter from its value:
100
+
101
+ | Input | Parsed as |
+ |-------|-----------|
+ | `"internal/domain/"` | Path prefix |
+ | `"*.go"` or `".go"` | Language filter (go) |
+ | `"go"` or `"python"` | Language filter |
+ | `"internal/**/*.go"` | Path prefix + language |
+ | `"service"` | Substring match on file path |
108
+
109
+ ### Search output
110
+
111
+ Each result includes:
112
+ - **Score breakdown**: `Score: 0.619 (vec: 0.47, bm25: +0.04, kw: +0.11 | matched: "event", "correlation")`
113
+ - **Rich metadata**: language, function name, class name, heading context
114
+ - **File grouping**: best chunk per file + "N matching sections" count
115
+ - **Related context**: graph-expanded neighbors (imports, calls, type references)
116
+ - **Confidence signal**: warning when top score < 0.45
117
+
118
+ When vectors are not yet available (Phase 2 in progress), search automatically falls back to **BM25-only mode** with a banner notification.
119
+
120
+ ---
121
+
122
+ ## Index Management
123
+
124
+ ### CLI
125
+
126
+ ```bash
127
+ # Reindex everything
128
+ bunx usethis_search reindex
129
+
130
+ # Check status
131
+ bunx usethis_search status
132
+
133
+ # List indexes
134
+ bunx usethis_search list
135
+
136
+ # Clear index
137
+ bunx usethis_search clear
111
138
  ```
112
139
 
113
- ### Index Management
140
+ ### Tool API
114
141
 
115
142
  ```javascript
116
- // List all indexes
143
+ // List all indexes with stats
117
144
  codeindex({ action: "list" })
118
145
 
119
146
  // Check specific index status
@@ -121,68 +148,120 @@ codeindex({ action: "status", index: "code" })
121
148
 
122
149
  // Reindex
123
150
  codeindex({ action: "reindex", index: "code" })
151
+ ```
124
152
 
125
- // Index specific directory
126
- codeindex({
127
- action: "reindex",
128
- index: "docs",
129
- dir: "docs/"
130
- })
153
+ ---
154
+
155
+ ## Architecture
156
+
157
+ ### Two-Phase Indexing Pipeline
131
158
 
132
- // v2: Run quality tests against gold dataset
133
- codeindex({ action: "test", index: "code" })
134
159
  ```
160
+ Phase 1 (fast, parallel, 5 workers):
161
+ file -> read -> chunk -> regex analyze -> graph edges -> ChunkStore (SQLite)
162
+ Result: BM25 + graph search available immediately
135
163
 
136
- ---
164
+ Phase 2 (batch, sequential):
165
+ ChunkStore chunks -> batch embed (32/batch) -> LanceDB
166
+ Result: vector/hybrid search becomes available
167
+ ```
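To make the two phases concrete, here is a minimal sketch of the scheduling pattern only: a pool of 5 concurrent workers for Phase 1 and sequential batches of 32 for Phase 2. The helper names (`mapWithConcurrency`, `toBatches`, `runTwoPhase`) and the `processFile` / `embedBatch` callbacks are hypothetical stand-ins; the plugin's real pipeline is not shown in this README.

```javascript
// Hypothetical sketch of the scheduling pattern; not the plugin's actual code.

// Run `worker` over `items` with at most `limit` tasks in flight (Phase 1 style).
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++; // safe: no await between the check and the increment
      results[i] = await worker(items[i]);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}

// Split chunks into fixed-size batches (Phase 2 style).
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// processFile: file -> array of chunks (assumed); embedBatch: chunks -> vectors (assumed).
async function runTwoPhase(files, processFile, embedBatch) {
  // Phase 1: chunk files in parallel with 5 workers; keyword search could start after this.
  const chunks = (await mapWithConcurrency(files, 5, processFile)).flat();

  // Phase 2: embed sequentially, 32 chunks per batch.
  const vectors = [];
  for (const batch of toBatches(chunks, 32)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return { chunks, vectors };
}
```

Keeping Phase 2 sequential in the sketch mirrors the "batch, sequential" note above; whether the real embedder could run batches concurrently is not stated in the README.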
137
168
 
138
- ## 🧠 How It Works
169
+ ### Search Strategy (auto-detect)
139
170
 
140
- ### Semantic Search
171
+ ```
172
+ Has vectors? -> hybrid search (vector + BM25 + graph + keyword rerank)
173
+ No vectors? -> BM25-only search (from ChunkStore + graph + keyword rerank)
174
+ ```
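The auto-detect rule above amounts to a single branch on whether embeddings exist yet. A minimal sketch, assuming a hypothetical `pickStrategy` helper and an `indexState.vectorCount` field (both invented here for illustration):

```javascript
// Hypothetical sketch: choose the search mode from index state.
// `vectorCount` is an assumed field name; the plugin may track this differently.
function pickStrategy(indexState) {
  const hasVectors = indexState.vectorCount > 0;
  return hasVectors
    ? { mode: "hybrid", stages: ["vector", "bm25", "graph", "keyword-rerank"] }
    : { mode: "bm25-only", stages: ["bm25", "graph", "keyword-rerank"] };
}
```

Both branches keep the graph expansion and keyword rerank stages; only the vector stage depends on Phase 2 having finished.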
141
175
 
142
- Instead of searching for exact text matches, the plugin:
143
- 1. **Cleans** content (removes TOC, noise, auto-generated markers)
144
- 2. **Chunks** intelligently (Markdown by headings, code by functions/classes)
145
- 3. Converts chunks into **vectors** (numerical representations of meaning)
146
- 4. Compares vectors of your query with vectors of code
147
- 5. Optionally combines with **BM25 keyword search** (hybrid mode)
148
- 6. Returns the most **semantically similar** fragments with rich metadata
176
+ ### Storage Layout
149
177
 
150
- **Example:**
151
- ```javascript
152
- // You search for: "user authentication"
153
- // It will find code with:
154
- // - "login handler"
155
- // - "verify credentials"
156
- // - "session management"
157
- // Even if words "user" and "authentication" are absent!
178
+ ```
179
+ .opencode/
180
+ vectors/
181
+ code/
182
+ lancedb/ # Vector embeddings (LanceDB)
183
+ chunks.db # Chunk content + metadata (SQLite, ChunkStore)
184
+ hashes.json # File hashes for change detection
185
+ docs/
186
+ lancedb/
187
+ chunks.db
188
+ hashes.json
189
+ graph/
190
+ code_graph.db # Code relationships (SQLite, GraphDB)
191
+ doc_graph.db # Doc relationships (SQLite, GraphDB)
192
+ vectorizer.yaml # Configuration
193
+ indexer.log # Indexing log
158
194
  ```
159
195
 
160
- ### Automatic Indexing
196
+ ### Module Overview
161
197
 
162
- The plugin tracks file changes and automatically updates indexes:
198
+ | Module | Purpose |
+ |--------|---------|
+ | **Core** | |
+ | `vectorizer/index.ts` | CodebaseIndexer, two-phase pipeline, search, singleton pool |
+ | `vectorizer/chunk-store.ts` | SQLite chunk storage (BM25 without vectors) |
+ | `vectorizer/graph-db.ts` | SQLite triple store for code relationships |
+ | `vectorizer/graph-builder.ts` | Builds graph edges from code analysis |
+ | `vectorizer/bm25-index.ts` | Inverted index for keyword search |
+ | **Chunking** | |
+ | `vectorizer/chunkers/code-chunker.ts` | Function/class-aware splitting |
+ | `vectorizer/chunkers/markdown-chunker.ts` | Heading-aware splitting with hierarchy |
+ | `vectorizer/chunkers/chunker-factory.ts` | Routes to correct chunker by file type |
+ | **Analysis** | |
+ | `vectorizer/analyzers/regex-analyzer.ts` | Regex-based code analysis (imports, calls, types) |
+ | `vectorizer/analyzers/lsp-analyzer.ts` | LSP-based code analysis (definitions, references) |
+ | `vectorizer/analyzers/lsp-client.ts` | Language Server Protocol client |
+ | **Search** | |
+ | `vectorizer/hybrid-search.ts` | Merge vector + BM25 scores |
+ | `vectorizer/query-cache.ts` | LRU cache for query embeddings |
+ | `vectorizer/content-cleaner.ts` | Remove noise (TOC, breadcrumbs, markers) |
+ | `vectorizer/metadata-extractor.ts` | Extract file_type, language, tags, dates |
+ | **Tracking** | |
+ | `vectorizer/search-metrics.ts` | Search quality metrics |
+ | `vectorizer/usage-tracker.ts` | Usage provenance tracking |
+ | **Tools** | |
+ | `tools/search.ts` | Search tool (5 params, smart filter, score breakdown) |
+ | `tools/codeindex.ts` | Index management tool |
225
+
226
+ ### Graph-Based Context
227
+
228
+ The code graph tracks relationships between chunks:
229
+
230
+ - **imports** — file A imports module B
231
+ - **calls** — function A calls function B
232
+ - **references** — code references a type/interface
233
+ - **implements** — class implements an interface
234
+ - **extends** — class extends another class
235
+ - **belongs_to** — chunk belongs to file (structural)
236
+
237
+ When you search, results are automatically expanded with 1-hop graph neighbors. Related context is scored by `edge_weight * cosine_similarity` (or `edge_weight * 0.7` in BM25-only mode) and filtered by `min_relevance`.
238
+
239
+ ### Singleton Indexer Pool
240
+
241
+ Parallel searches share one `CodebaseIndexer` instance per (project, index) pair, which avoids SQLite lock conflicts. The pool is managed via `getIndexer()` / `releaseIndexer()` / `destroyIndexer()`.
242
+
243
+ ---
163
244
 
164
- 1. **On OpenCode startup** — checks all indexes, updates stale ones
165
- 2. **On file edit** — queues file for reindexing
166
- 3. **After 1 second** (debounce) — indexes changed files
245
+ ## Configuration
167
246
 
168
- **Configuration in `.opencode/vectorizer.yaml`:**
247
+ ### Full config example
169
248
 
170
249
  ```yaml
250
+ # .opencode/vectorizer.yaml
171
251
  vectorizer:
172
- enabled: true # Enable plugin
173
- auto_index: true # Automatic indexing
174
- debounce_ms: 1000 # Delay before indexing (ms)
175
-
176
- # v2: Content cleaning
252
+ enabled: true
253
+ auto_index: true
254
+ model: "Xenova/all-MiniLM-L6-v2"
255
+ debounce_ms: 1000
256
+
177
257
  cleaning:
178
258
  remove_toc: true
179
259
  remove_frontmatter_metadata: false
180
260
  remove_imports: false
181
261
  remove_comments: false
182
-
183
- # v2: Semantic chunking
262
+
184
263
  chunking:
185
- strategy: "semantic" # fixed | semantic
264
+ strategy: "semantic" # fixed | semantic
186
265
  markdown:
187
266
  split_by_headings: true
188
267
  min_chunk_size: 200
@@ -193,133 +272,73 @@ vectorizer:
193
272
  include_function_signature: true
194
273
  min_chunk_size: 300
195
274
  max_chunk_size: 1500
196
-
197
- # v2: Hybrid search
275
+ fixed:
276
+ max_chars: 1500
277
+
198
278
  search:
199
- hybrid: false # vector + BM25
279
+ hybrid: true
200
280
  bm25_weight: 0.3
201
-
202
- # v2: Quality monitoring
281
+ freshen: false # Don't re-index on every search
282
+ min_score: 0.35 # Minimum relevance cutoff
283
+ include_archived: false
284
+ default_limit: 10
285
+
286
+ graph:
287
+ enabled: true
288
+ max_related: 4 # Max related chunks per result
289
+ min_relevance: 0.5 # Min score for related context
290
+ semantic_edges: false # O(n^2) — enable only for small repos
291
+ semantic_edges_max_chunks: 500
292
+ lsp:
293
+ enabled: true
294
+ timeout_ms: 5000
295
+ read_intercept: true
296
+
203
297
  quality:
204
298
  enable_metrics: false
205
299
  enable_cache: true
206
-
300
+
207
301
  indexes:
208
302
  code:
209
303
  enabled: true
304
+ pattern: "**/*.{js,ts,jsx,tsx,mjs,cjs,py,go,rs,java,kt,swift,c,cpp,h,hpp,cs,rb,php,scala,clj}"
305
+ ignore:
306
+ - "**/node_modules/**"
307
+ - "**/.git/**"
308
+ - "**/dist/**"
309
+ - "**/build/**"
310
+ - "**/.opencode/**"
311
+ - "**/vendor/**"
312
+ hybrid: true
313
+ bm25_weight: 0.3
210
314
  docs:
211
315
  enabled: true
316
+ pattern: "docs/**/*.{md,mdx,txt,rst,adoc}"
317
+ hybrid: false
318
+ bm25_weight: 0.2
212
319
  config:
213
320
  enabled: false
214
-
321
+ pattern: "**/*.{yaml,yml,json,toml,ini,env,xml}"
322
+ hybrid: false
323
+ bm25_weight: 0.3
324
+
215
325
  exclude:
216
326
  - node_modules
217
327
  - vendor
218
328
  - dist
219
329
  - build
330
+ - out
220
331
  - __pycache__
221
332
  ```
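The `graph.max_related` and `graph.min_relevance` settings above govern the related context described under Graph-Based Context, where a neighbor is scored as `edge_weight * cosine_similarity` (or `edge_weight * 0.7` in BM25-only mode). The sketch below works through that formula; the function names, the neighbor object shape, and the sample edge weights are assumptions for illustration, not the plugin's code.

```javascript
// Hypothetical sketch of related-context scoring; names and data shapes are assumed.

// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Score 1-hop neighbors, drop those below minRelevance, keep the top maxRelated.
function rankRelated(queryVector, neighbors, { maxRelated = 4, minRelevance = 0.5 } = {}) {
  return neighbors
    .map((n) => {
      // Without vectors (BM25-only mode) the README describes a fixed 0.7 factor.
      const similarity =
        queryVector && n.vector ? cosineSimilarity(queryVector, n.vector) : 0.7;
      return { id: n.id, score: n.edgeWeight * similarity };
    })
    .filter((n) => n.score >= minRelevance)
    .sort((x, y) => y.score - x.score)
    .slice(0, maxRelated);
}
```

With the default `min_relevance` of 0.5, a neighbor in BM25-only mode would need an edge weight above roughly 0.71 to be kept in this sketch, since `0.71 * 0.7` is about 0.5.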
222
333
 
223
- ---
224
-
225
- ## 📦 Data Structure
226
-
227
- Indexes are stored locally in your project:
228
-
229
- ```
230
- .opencode/
231
- vectors/
232
- code/ # Code index
233
- data/ # LanceDB tables
234
- hashes.json # File hashes (for change detection)
235
- docs/ # Documentation index
236
- data/
237
- hashes.json
238
- vectorizer.yaml # Configuration
239
- indexer.log # Indexing log (if DEBUG=*)
240
- ```
241
-
242
- ---
243
-
244
- ## 🎨 Usage Examples
245
-
246
- ### 1. Find all API endpoints
247
-
248
- ```javascript
249
- search({
250
- query: "REST API endpoints routes",
251
- index: "code"
252
- })
253
- ```
254
-
255
- ### 2. Find testing documentation
256
-
257
- ```javascript
258
- search({
259
- query: "how to write tests",
260
- index: "docs"
261
- })
262
- ```
263
-
264
- ### 3. Find database configuration
265
-
266
- ```javascript
267
- search({
268
- query: "database connection settings",
269
- index: "config"
270
- })
271
- ```
272
-
273
- ### 4. Find error handling
274
-
275
- ```javascript
276
- search({
277
- query: "error handling try catch",
278
- index: "code",
279
- limit: 20 // More results
280
- })
281
- ```
282
-
283
- ### 5. Search across entire project
284
-
285
- ```javascript
286
- search({
287
- query: "authentication",
288
- searchAll: true // Searches in code, docs, config
289
- })
290
- ```
291
-
292
- ---
293
-
294
- ## 🛠️ Configuration
295
-
296
334
  ### Disable automatic indexing
297
335
 
298
- ```yaml
299
- # .opencode/vectorizer.yaml
300
- vectorizer:
301
- enabled: true
302
- auto_index: false # Manual indexing only
303
- ```
304
-
305
- ### Add custom index
306
-
307
- ```yaml
308
- vectorizer:
309
- indexes:
310
- tests:
311
- enabled: true
312
- extensions: [.test.js, .spec.ts]
313
- ```
314
-
315
- ### Change indexing delay
316
-
317
336
  ```yaml
318
337
  vectorizer:
319
- debounce_ms: 3000 # 3 seconds instead of 1
338
+ auto_index: false
320
339
  ```
321
340
 
322
- ### Temporarily disable plugin
341
+ ### Skip auto-index via env
323
342
 
324
343
  ```bash
325
344
  export OPENCODE_SKIP_AUTO_INDEX=1
@@ -327,112 +346,37 @@ export OPENCODE_SKIP_AUTO_INDEX=1
327
346
 
328
347
  ---
329
348
 
330
- ## 🐛 Debugging
349
+ ## Debugging
331
350
 
332
351
  ### Enable logs
333
352
 
334
353
  ```bash
335
- export DEBUG=file-indexer
336
- # or
354
+ export DEBUG=vectorizer
355
+ # or all logs
337
356
  export DEBUG=*
338
357
  ```
339
358
 
340
- Logs will be in `.opencode/indexer.log`
341
-
342
- ### Reindex everything
343
-
344
- ```javascript
345
- codeindex({ action: "reindex", index: "code" })
346
- codeindex({ action: "reindex", index: "docs" })
347
- ```
348
-
349
- ### Check index status
350
-
351
- ```javascript
352
- codeindex({ action: "list" })
353
- ```
354
-
355
- ---
356
-
357
- ## 🌟 Advantages
358
-
359
- ### Compared to `grep`/`find`
360
-
361
- | Feature | grep/find | usethis_search |
362
- |---------|-----------|----------------|
363
- | Text search | ✅ | ✅ |
364
- | Semantic search | ❌ | ✅ |
365
- | Finds synonyms | ❌ | ✅ |
366
- | Understands context | ❌ | ✅ |
367
- | Works offline | ✅ | ✅ |
368
- | Auto-updates | ❌ | ✅ |
369
-
370
- ### Compared to online search (GitHub Copilot, ChatGPT)
371
-
372
- | Feature | Online | usethis_search |
373
- |---------|--------|----------------|
374
- | Works offline | ❌ | ✅ |
375
- | Privacy | ❌ | ✅ |
376
- | Free | ❌ | ✅ |
377
- | Speed | 🐌 | ⚡ |
378
- | Knows your code | ❌ | ✅ |
359
+ Indexing activity is logged to `.opencode/indexer.log`.
379
360
 
380
361
  ---
381
362
 
382
- ## 📊 Technical Details
363
+ ## Technical Details
383
364
 
384
365
  - **Vectorization:** [@xenova/transformers](https://github.com/xenova/transformers.js) (ONNX Runtime)
385
366
  - **Vector DB:** [LanceDB](https://lancedb.com/) (local, serverless)
386
- - **Model:** `Xenova/all-MiniLM-L6-v2` (multilingual, 384 dimensions)
387
- - **Model size:** ~23 MB (downloaded once)
388
- - **Speed:** ~0.5 sec/file (after model loading)
389
-
390
- ### v2 Architecture
391
-
392
- ```
393
- File → Content Cleaner → Chunker Factory → Embedder → LanceDB
394
- ├── Markdown Chunker (heading-aware)
395
- ├── Code Chunker (function/class-aware)
396
- └── Fixed Chunker (fallback)
397
-
398
- Query → Query Cache → Embedder → Vector Search ─┐
399
- └──────────→ BM25 Search ────┤→ Hybrid Merge → Filter → Results
400
-
401
- Metadata Filter (type, lang, date, tags)
402
- ```
403
-
404
- ### New Modules (v2)
405
-
406
- | Module | Purpose |
407
- |--------|---------|
408
- | `content-cleaner.ts` | Remove noise (TOC, breadcrumbs, markers) |
409
- | `metadata-extractor.ts` | Extract file_type, language, tags, dates |
410
- | `markdown-chunker.ts` | Heading-aware splitting with hierarchy |
411
- | `code-chunker.ts` | Function/class-aware splitting |
412
- | `chunker-factory.ts` | Route to correct chunker by file type |
413
- | `bm25-index.ts` | Inverted index for keyword search |
414
- | `hybrid-search.ts` | Merge vector + BM25 scores |
415
- | `query-cache.ts` | LRU cache for query embeddings |
416
- | `search-metrics.ts` | Track search quality metrics |
367
+ - **Chunk Store:** bun:sqlite (WAL mode, concurrent reads)
368
+ - **Graph DB:** bun:sqlite (WAL mode, triple store)
369
+ - **Model:** `Xenova/all-MiniLM-L6-v2` (multilingual, 384 dimensions, ~23 MB)
370
+ - **Embedding speed:** ~0.5 sec/file
371
+ - **Phase 1 speed:** ~0.05 sec/file (no embedding)
372
+ - **Supported languages:** JavaScript, TypeScript, Python, Go, Rust, Java, Kotlin, Swift, C/C++, C#, Ruby, PHP, Scala, Clojure
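As a back-of-the-envelope check on the per-file speeds listed above, the sketch below multiplies them out for a given file count. The `estimateIndexingSeconds` helper is hypothetical, and real timings will vary with file size and hardware.

```javascript
// Hypothetical helper: rough indexing time from the per-file speeds quoted above.
const PHASE1_SEC_PER_FILE = 0.05; // chunking + graph, no embedding
const PHASE2_SEC_PER_FILE = 0.5; // embedding

function estimateIndexingSeconds(fileCount) {
  const phase1 = fileCount * PHASE1_SEC_PER_FILE;
  const phase2 = fileCount * PHASE2_SEC_PER_FILE;
  return { phase1, phase2, total: phase1 + phase2 };
}
```

For 100 files this gives about 5 seconds for Phase 1 and 50 seconds for Phase 2, which is in line with the "~1 min" estimate for small projects.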
417
373
 
418
374
  ---
419
375
 
420
- ## 🤝 Contributing
421
-
422
- Found a bug? Have an idea? Open an issue or PR!
423
-
424
- ---
425
-
426
- ## 📄 License
376
+ ## License
427
377
 
428
378
  MIT
429
379
 
430
380
  ---
431
381
 
432
- ## 🎉 Authors
433
-
434
- Made with ❤️ by the **Comfanion** team
435
-
436
- ---
437
-
438
- **Search smart, not hard!** 🚀
382
+ Made by the **Comfanion** team