npm - @comfanion/usethis_search - Versions diffs - 3.0.0-dev.9 → 3.0.1 - Mend

@comfanion/usethis_search 3.0.0-dev.9 → 3.0.1

Files changed (17)

package/README.md +246 -302
package/cli.ts +263 -0
package/file-indexer.ts +1 -1
package/index.ts +0 -2
package/package.json +12 -5
package/tools/codeindex.ts +2 -2
package/tools/search.ts +254 -66
package/vectorizer/analyzers/lsp-analyzer.ts +7 -7
package/vectorizer/analyzers/regex-analyzer.ts +358 -61
package/vectorizer/chunk-store.ts +207 -0
package/vectorizer/chunkers/code-chunker.ts +74 -24
package/vectorizer/chunkers/markdown-chunker.ts +69 -7
package/vectorizer/graph-builder.ts +207 -15
package/vectorizer/graph-db.ts +161 -164
package/vectorizer/hybrid-search.ts +1 -1
package/vectorizer/{index.js → index.ts} +796 -160
package/vectorizer.yaml +20 -2

package/README.md CHANGED

@@ -1,28 +1,27 @@
-# 🔍 @comfanion/usethis_search
+# @comfanion/usethis_search
-**Semantic code search with automatic indexing**
+**Semantic code search with graph-based context for OpenCode**
-Forget about `grep` and `find` — search code by meaning, not by text!
+Search code by meaning, not by text. Get related context automatically via code graph.
 ---
-## ✨ What is this?
+## What is this?
 An OpenCode plugin that adds **smart search** to your project:
-- 🧠 **Semantic search** — finds code by meaning, even when words don't match
-- 🔀 **Hybrid search (v2)** — combines vector similarity + BM25 keyword matching
-- 🧩 **Semantic chunking (v2)** — structure-aware splitting for Markdown (headings) and code (functions/classes)
-- 🏷️ **Rich metadata (v2)** — filter by file type, language, date, tags
-- ⚡ **Automatic indexing** — files are indexed on change (zero effort)
-- 📦 **Local vectorization** — works offline, no API keys needed
-- 🎯 **Three indexes** — separate for code, docs, and configs
-- 📊 **Quality metrics (v2)** — track search relevance and usage
-- 🌍 **Multilingual** — supports Ukrainian, Russian, and English
+- **Semantic search** — finds code by meaning, even when words don't match
+- **Hybrid search** — combines vector similarity + BM25 keyword matching
+- **Graph-based context** — automatically attaches related code (imports, calls, type references) to search results
+- **Two-phase indexing** — BM25 + graph search available immediately (Phase 1), vector search after embedding (Phase 2)
+- **Simplified API** — 5 parameters, smart filter parsing, config-driven defaults
+- **Automatic indexing** — files are indexed on change, zero effort
+- **Local vectorization** — works offline, no API keys needed
+- **Three indexes** — separate for code, docs, and configs
 ---
-## 🚀 Quick Start
+## Quick Start
 ### Installation
@@ -44,76 +43,104 @@ Add to `opencode.json`:
 On OpenCode startup, the plugin automatically:
 1. Creates indexes for code and documentation
-2. Indexes all project files
-3. Shows progress via toast notifications
+2. Phase 1: chunks files, builds code graph (fast, parallel) — **BM25 search available immediately**
+3. Phase 2: embeds chunks into vectors — **hybrid search available after completion**
-**First indexing may take time:**
-- < 20 files — Quick coffee? ☕
-- < 100 files — ~1min. Stretch break? 🧘
-- < 500 files — ~3min. Make coffee ☕ and relax 🛋️
-- 500+ files — ~10min. Go touch grass 🌿 or take a nap 😴
+**Indexing time estimates:**
+- < 100 files — ~1 min
+- < 500 files — ~3 min
+- 500+ files — ~10 min
 ---
-## 🎯 How to Use
+## Search API
-### Search
+The search tool has 5 parameters:
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `query` | string | required | What you're looking for (semantic) |
+| `index` | string | `"code"` | Which index: `code`, `docs`, `config` |
+| `limit` | number | 10 | Number of results |
+| `searchAll` | boolean | false | Search across all indexes |
+| `filter` | string | — | Filter by path or language |
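As a reading aid for the parameter table above, the five parameters can be written as a TypeScript type with the documented defaults applied. This is only a sketch: `SearchParams`, `ResolvedSearchParams`, and `withDefaults` are hypothetical names chosen for illustration, not identifiers taken from the package source, and the real tool may resolve defaults from `vectorizer.yaml` instead.

```typescript
// Hypothetical types mirroring the README's parameter table.
type IndexName = "code" | "docs" | "config";

interface SearchParams {
  query: string;        // required
  index?: IndexName;    // default "code"
  limit?: number;       // default 10
  searchAll?: boolean;  // default false
  filter?: string;      // no default; path or language filter
}

interface ResolvedSearchParams {
  query: string;
  index: IndexName;
  limit: number;
  searchAll: boolean;
  filter: string | undefined;
}

// Fill in the documented default for every omitted optional parameter.
function withDefaults(params: SearchParams): ResolvedSearchParams {
  return {
    query: params.query,
    index: params.index ?? "code",
    limit: params.limit ?? 10,
    searchAll: params.searchAll ?? false,
    filter: params.filter,
  };
}

// Only `query` is supplied, so every other field takes its default.
const resolved = withDefaults({ query: "authentication logic" });
```

Keeping the defaults in one function makes it easy to see that a call with only `query` searches the `code` index and returns up to 10 results.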
+### Search examples
 ```javascript
-// Search for authentication logic
-search({
-  query: "authentication logic",
-  index: "code"
-})
-// Search for deployment instructions
-search({
-  query: "how to deploy",
-  index: "docs"
-})
-// Search for API keys in configs
-search({
-  query: "API keys",
-  index: "config"
-})
-// Search across all indexes
-search({
-  query: "database connection",
-  searchAll: true
-})
-// v2: Hybrid search (vector + keyword matching)
-search({
-  query: "getUserById",
-  hybrid: true
-})
-// v2: Filter by file type and language
-search({
-  query: "authentication logic",
-  fileType: "code",
-  language: "typescript"
-})
-// v2: Filter by date
-search({
-  query: "recent changes",
-  modifiedAfter: "2024-06-01"
-})
-// v2: Filter by frontmatter tags
-search({
-  query: "security",
-  tags: "auth,security"
-})
+// Basic semantic search
+search({ query: "authentication logic" })
+// Search documentation
+search({ query: "how to deploy", index: "docs" })
+// Search all indexes
+search({ query: "database connection", searchAll: true })
+// Filter by directory
+search({ query: "tenant management", filter: "internal/domain/" })
+// Filter by language
+search({ query: "event handling", filter: "*.go" })
+search({ query: "middleware", filter: "go" })
+// Combined: directory + language
+search({ query: "API routes", filter: "internal/**/*.go" })
+// Substring match on file path
+search({ query: "metrics", filter: "service" })
+// More results
+search({ query: "error handling", limit: 20 })
+```
+### Filter syntax
+The `filter` parameter is smart — it auto-detects what you mean:
+| Input | Parsed as |
+|-------|-----------|
+| `"internal/domain/"` | Path prefix |
+| `"*.go"` or `".go"` | Language filter (go) |
+| `"go"` or `"python"` | Language filter |
+| `"internal/**/*.go"` | Path prefix + language |
+| `"service"` | Substring match on file path |
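The filter table above can be read as an ordered set of rules. The sketch below is one possible way to implement that ordering, not the parser in `tools/search.ts`: the `ParsedFilter` shape, the `parseFilter` and `languageOf` names, and the small language map are all assumptions made for illustration.

```typescript
// Hypothetical result shape; the real parser may return something different.
interface ParsedFilter {
  pathPrefix?: string;
  language?: string;
  pathContains?: string;
}

// Assumed extension/name to language mapping (a small illustrative subset).
const LANGUAGES = new Map<string, string>([
  ["go", "go"],
  ["py", "python"], ["python", "python"],
  ["ts", "typescript"], ["typescript", "typescript"],
  ["js", "javascript"], ["javascript", "javascript"],
  ["rs", "rust"], ["rust", "rust"],
]);

function languageOf(token: string | undefined): string | undefined {
  return token ? LANGUAGES.get(token.toLowerCase()) : undefined;
}

function parseFilter(input: string): ParsedFilter {
  const text = input.trim();

  // "internal/**/*.go": path prefix + language
  const glob = /^(.+?\/)\*\*\/\*\.(\w+)$/.exec(text);
  const globLanguage = languageOf(glob?.[2]);
  if (glob && globLanguage) {
    return { pathPrefix: glob[1], language: globLanguage };
  }

  // "*.go" or ".go": language by extension
  const ext = /^\*?\.(\w+)$/.exec(text);
  const extLanguage = languageOf(ext?.[1]);
  if (extLanguage) return { language: extLanguage };

  // "go" or "python": bare language name
  const named = languageOf(text);
  if (named) return { language: named };

  // "internal/domain/": anything else containing a slash is a path prefix
  if (text.includes("/")) return { pathPrefix: text };

  // "service": fall back to a substring match on the file path
  return { pathContains: text };
}
```

The most specific pattern is checked first so that `"internal/**/*.go"` is not mistaken for a plain path prefix; unknown extensions fall through to the later rules.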
+### Search output
+Each result includes:
+- **Score breakdown**: `Score: 0.619 (vec: 0.47, bm25: +0.04, kw: +0.11 | matched: "event", "correlation")`
+- **Rich metadata**: language, function name, class name, heading context
+- **File grouping**: best chunk per file + "N matching sections" count
+- **Related context**: graph-expanded neighbors (imports, calls, type references)
+- **Confidence signal**: warning when top score < 0.45
+When vectors are not yet available (Phase 2 in progress), search automatically falls back to **BM25-only mode** with a banner notification.
+---
+## Index Management
+### CLI
+```bash
+# Reindex everything
+bunx usethis_search reindex
+# Check status
+bunx usethis_search status
+# List indexes
+bunx usethis_search list
+# Clear index
+bunx usethis_search clear
 ```
-### Index Management
+### Tool API
 ```javascript
-// List all indexes
+// List all indexes with stats
 codeindex({ action: "list" })
 // Check specific index status
@@ -121,68 +148,120 @@ codeindex({ action: "status", index: "code" })
 // Reindex
 codeindex({ action: "reindex", index: "code" })
+```
-// Index specific directory
-codeindex({
-  action: "reindex",
-  index: "docs",
-  dir: "docs/"
-})
+---
+## Architecture
+### Two-Phase Indexing Pipeline
-// v2: Run quality tests against gold dataset
-codeindex({ action: "test", index: "code" })
 ```
+Phase 1 (fast, parallel, 5 workers):
+  file -> read -> chunk -> regex analyze -> graph edges -> ChunkStore (SQLite)
+  Result: BM25 + graph search available immediately
----
+Phase 2 (batch, sequential):
+  ChunkStore chunks -> batch embed (32/batch) -> LanceDB
+  Result: vector/hybrid search becomes available
+```
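To make the Phase 2 step in the pipeline above concrete, here is a minimal sketch of splitting stored chunks into batches of 32 before embedding. Only the batch size comes from the README; the `toBatches` helper, the `EMBED_BATCH_SIZE` constant, and the chunk IDs are hypothetical, and the actual embedding and LanceDB write are left out.

```typescript
// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

// Example: 70 pending chunk IDs with the README's batch size of 32.
const EMBED_BATCH_SIZE = 32;
const pendingChunkIds = Array.from({ length: 70 }, (_, i) => `chunk-${i}`);
const batchSizes = toBatches(pendingChunkIds, EMBED_BATCH_SIZE).map((b) => b.length);
// batchSizes is [32, 32, 6]
```

Processing the batches one after another matches the "batch, sequential" description: each batch would be embedded and written before the next one starts.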
-## 🧠 How It Works
+### Search Strategy (auto-detect)
-### Semantic Search
+```
+Has vectors? -> hybrid search (vector + BM25 + graph + keyword rerank)
+No vectors?  -> BM25-only search (from ChunkStore + graph + keyword rerank)
+```
-Instead of searching for exact text matches, the plugin:
-1. **Cleans** content (removes TOC, noise, auto-generated markers)
-2. **Chunks** intelligently (Markdown by headings, code by functions/classes)
-3. Converts chunks into **vectors** (numerical representations of meaning)
-4. Compares vectors of your query with vectors of code
-5. Optionally combines with **BM25 keyword search** (hybrid mode)
-6. Returns the most **semantically similar** fragments with rich metadata
+### Storage Layout
-**Example:**
-```javascript
-// You search for: "user authentication"
-// It will find code with:
-// - "login handler"
-// - "verify credentials"
-// - "session management"
-// Even if words "user" and "authentication" are absent!
+```
+.opencode/
+  vectors/
+    code/
+      lancedb/          # Vector embeddings (LanceDB)
+      chunks.db         # Chunk content + metadata (SQLite, ChunkStore)
+      hashes.json       # File hashes for change detection
+    docs/
+      lancedb/
+      chunks.db
+      hashes.json
+  graph/
+    code_graph.db       # Code relationships (SQLite, GraphDB)
+    doc_graph.db        # Doc relationships (SQLite, GraphDB)
+  vectorizer.yaml       # Configuration
+  indexer.log           # Indexing log
 ```
-### Automatic Indexing
+### Module Overview
-The plugin tracks file changes and automatically updates indexes:
+| Module | Purpose |
+|--------|---------|
+| **Core** | |
+| `vectorizer/index.ts` | CodebaseIndexer, two-phase pipeline, search, singleton pool |
+| `vectorizer/chunk-store.ts` | SQLite chunk storage (BM25 without vectors) |
+| `vectorizer/graph-db.ts` | SQLite triple store for code relationships |
+| `vectorizer/graph-builder.ts` | Builds graph edges from code analysis |
+| `vectorizer/bm25-index.ts` | Inverted index for keyword search |
+| **Chunking** | |
+| `vectorizer/chunkers/code-chunker.ts` | Function/class-aware splitting |
+| `vectorizer/chunkers/markdown-chunker.ts` | Heading-aware splitting with hierarchy |
+| `vectorizer/chunkers/chunker-factory.ts` | Routes to correct chunker by file type |
+| **Analysis** | |
+| `vectorizer/analyzers/regex-analyzer.ts` | Regex-based code analysis (imports, calls, types) |
+| `vectorizer/analyzers/lsp-analyzer.ts` | LSP-based code analysis (definitions, references) |
+| `vectorizer/analyzers/lsp-client.ts` | Language Server Protocol client |
+| **Search** | |
+| `vectorizer/hybrid-search.ts` | Merge vector + BM25 scores |
+| `vectorizer/query-cache.ts` | LRU cache for query embeddings |
+| `vectorizer/content-cleaner.ts` | Remove noise (TOC, breadcrumbs, markers) |
+| `vectorizer/metadata-extractor.ts` | Extract file_type, language, tags, dates |
+| **Tracking** | |
+| `vectorizer/search-metrics.ts` | Search quality metrics |
+| `vectorizer/usage-tracker.ts` | Usage provenance tracking |
+| **Tools** | |
+| `tools/search.ts` | Search tool (5 params, smart filter, score breakdown) |
+| `tools/codeindex.ts` | Index management tool |
+### Graph-Based Context
+The code graph tracks relationships between chunks:
+- **imports** — file A imports module B
+- **calls** — function A calls function B
+- **references** — code references a type/interface
+- **implements** — class implements an interface
+- **extends** — class extends another class
+- **belongs_to** — chunk belongs to file (structural)
+When you search, results are automatically expanded with 1-hop graph neighbors. Related context is scored by `edge_weight * cosine_similarity` (or `edge_weight * 0.7` in BM25-only mode) and filtered by `min_relevance`.
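The scoring rule described above can be sketched as follows. The formula (`edge_weight * cosine_similarity`, or `edge_weight * 0.7` without vectors) and the `min_relevance` and `max_related` settings come from the README; the `Neighbor` shape, the `rankRelated` name, the inclusive `>=` comparison, and the sort-then-truncate order are assumptions.

```typescript
// Hypothetical neighbor shape; field names are illustrative.
interface Neighbor {
  chunkId: string;
  edgeWeight: number;         // weight of the graph edge to the search hit
  cosineSimilarity?: number;  // absent in BM25-only mode (no vectors yet)
}

const BM25_ONLY_FACTOR = 0.7; // constant quoted in the README text

// Score 1-hop neighbors, drop weak ones, keep the best few.
function rankRelated(
  neighbors: Neighbor[],
  minRelevance = 0.5, // graph.min_relevance
  maxRelated = 4,     // graph.max_related
): { chunkId: string; score: number }[] {
  return neighbors
    .map((n) => ({
      chunkId: n.chunkId,
      score: n.edgeWeight * (n.cosineSimilarity ?? BM25_ONLY_FACTOR),
    }))
    .filter((n) => n.score >= minRelevance)
    .sort((x, y) => y.score - x.score)
    .slice(0, maxRelated);
}

const related = rankRelated([
  { chunkId: "a", edgeWeight: 1.0, cosineSimilarity: 0.8 }, // 0.80, kept
  { chunkId: "b", edgeWeight: 0.5, cosineSimilarity: 0.9 }, // 0.45, below 0.5
  { chunkId: "c", edgeWeight: 0.9 },                        // about 0.63 via the BM25-only factor
]);
```

In this example `b` is dropped despite its high similarity because the weak edge pulls its score under `min_relevance`.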
+### Singleton Indexer Pool
+Multiple parallel searches share one `CodebaseIndexer` instance per (project, index) pair. No SQLite lock conflicts. Managed via `getIndexer()` / `releaseIndexer()` / `destroyIndexer()`.
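One common way to get "one instance per (project, index) pair" is a keyed, reference-counted pool. The sketch below illustrates that pattern under stated assumptions: `IndexerPool`, `acquire`, `release`, and `destroy` are hypothetical names standing in for the roles of `getIndexer()`, `releaseIndexer()`, and `destroyIndexer()`, whose real signatures are not shown in this diff.

```typescript
// Hypothetical reference-counted pool; not the package's implementation.
class IndexerPool<T extends { close(): void }> {
  private entries = new Map<string, { instance: T; refs: number }>();
  private create: (project: string, index: string) => T;

  constructor(create: (project: string, index: string) => T) {
    this.create = create;
  }

  private key(project: string, index: string): string {
    return `${project}::${index}`;
  }

  // Return the shared instance for (project, index), creating it on first use.
  acquire(project: string, index: string): T {
    const key = this.key(project, index);
    let entry = this.entries.get(key);
    if (!entry) {
      entry = { instance: this.create(project, index), refs: 0 };
      this.entries.set(key, entry);
    }
    entry.refs += 1;
    return entry.instance;
  }

  // Drop one reference; the instance stays pooled for later reuse.
  release(project: string, index: string): void {
    const entry = this.entries.get(this.key(project, index));
    if (entry && entry.refs > 0) entry.refs -= 1;
  }

  // Close and remove the instance regardless of outstanding references.
  destroy(project: string, index: string): void {
    const key = this.key(project, index);
    this.entries.get(key)?.instance.close();
    this.entries.delete(key);
  }
}

// Usage with a stand-in indexer object.
let created = 0;
const pool = new IndexerPool((project, index) => {
  created += 1;
  return { project, index, close() {} };
});
const a = pool.acquire("/repo", "code");
const b = pool.acquire("/repo", "code"); // same instance as `a`
const c = pool.acquire("/repo", "docs"); // different index, new instance
```

Because both `code` lookups return the same object, they would share a single database handle, which is the property the README relies on to avoid SQLite lock conflicts.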
+---
-1. **On OpenCode startup** — checks all indexes, updates stale ones
-2. **On file edit** — queues file for reindexing
-3. **After 1 second** (debounce) — indexes changed files
+## Configuration
-**Configuration in `.opencode/vectorizer.yaml`:**
+### Full config example
 ```yaml
+# .opencode/vectorizer.yaml
 vectorizer:
-  enabled: true          # Enable plugin
-  auto_index: true       # Automatic indexing
-  debounce_ms: 1000      # Delay before indexing (ms)
-  # v2: Content cleaning
+  enabled: true
+  auto_index: true
+  model: "Xenova/all-MiniLM-L6-v2"
+  debounce_ms: 1000
   cleaning:
     remove_toc: true
     remove_frontmatter_metadata: false
     remove_imports: false
     remove_comments: false
-  # v2: Semantic chunking
   chunking:
-    strategy: "semantic"  # fixed | semantic
+    strategy: "semantic"    # fixed | semantic
     markdown:
       split_by_headings: true
       min_chunk_size: 200
@@ -193,133 +272,73 @@ vectorizer:
       include_function_signature: true
       min_chunk_size: 300
       max_chunk_size: 1500
-  # v2: Hybrid search
+    fixed:
+      max_chars: 1500
   search:
-    hybrid: false         # vector + BM25
+    hybrid: true
     bm25_weight: 0.3
-  # v2: Quality monitoring
+    freshen: false              # Don't re-index on every search
+    min_score: 0.35             # Minimum relevance cutoff
+    include_archived: false
+    default_limit: 10
+  graph:
+    enabled: true
+    max_related: 4              # Max related chunks per result
+    min_relevance: 0.5          # Min score for related context
+    semantic_edges: false       # O(n^2) — enable only for small repos
+    semantic_edges_max_chunks: 500
+    lsp:
+      enabled: true
+      timeout_ms: 5000
+    read_intercept: true
   quality:
     enable_metrics: false
     enable_cache: true
   indexes:
     code:
       enabled: true
+      pattern: "**/*.{js,ts,jsx,tsx,mjs,cjs,py,go,rs,java,kt,swift,c,cpp,h,hpp,cs,rb,php,scala,clj}"
+      ignore:
+        - "**/node_modules/**"
+        - "**/.git/**"
+        - "**/dist/**"
+        - "**/build/**"
+        - "**/.opencode/**"
+        - "**/vendor/**"
+      hybrid: true
+      bm25_weight: 0.3
     docs:
       enabled: true
+      pattern: "docs/**/*.{md,mdx,txt,rst,adoc}"
+      hybrid: false
+      bm25_weight: 0.2
     config:
       enabled: false
+      pattern: "**/*.{yaml,yml,json,toml,ini,env,xml}"
+      hybrid: false
+      bm25_weight: 0.3
   exclude:
     - node_modules
     - vendor
     - dist
     - build
+    - out
     - __pycache__
 ```
----
-## 📦 Data Structure
-Indexes are stored locally in your project:
-```
-.opencode/
-  vectors/
-    code/              # Code index
-      data/            # LanceDB tables
-      hashes.json      # File hashes (for change detection)
-    docs/              # Documentation index
-      data/
-      hashes.json
-  vectorizer.yaml      # Configuration
-  indexer.log          # Indexing log (if DEBUG=*)
-```
----
-## 🎨 Usage Examples
-### 1. Find all API endpoints
-```javascript
-search({
-  query: "REST API endpoints routes",
-  index: "code"
-})
-```
-### 2. Find testing documentation
-```javascript
-search({
-  query: "how to write tests",
-  index: "docs"
-})
-```
-### 3. Find database configuration
-```javascript
-search({
-  query: "database connection settings",
-  index: "config"
-})
-```
-### 4. Find error handling
-```javascript
-search({
-  query: "error handling try catch",
-  index: "code",
-  limit: 20  // More results
-})
-```
-### 5. Search across entire project
-```javascript
-search({
-  query: "authentication",
-  searchAll: true  // Searches in code, docs, config
-})
-```
----
-## 🛠️ Configuration
 ### Disable automatic indexing
-```yaml
-# .opencode/vectorizer.yaml
-vectorizer:
-  enabled: true
-  auto_index: false  # Manual indexing only
-```
-### Add custom index
-```yaml
-vectorizer:
-  indexes:
-    tests:
-      enabled: true
-      extensions: [.test.js, .spec.ts]
-```
-### Change indexing delay
 ```yaml
 vectorizer:
-  debounce_ms: 3000  # 3 seconds instead of 1
+  auto_index: false
 ```
-### Temporarily disable plugin
+### Skip auto-index via env
 ```bash
 export OPENCODE_SKIP_AUTO_INDEX=1
@@ -327,112 +346,37 @@ export OPENCODE_SKIP_AUTO_INDEX=1
 ---
-## 🐛 Debugging
+## Debugging
 ### Enable logs
 ```bash
-export DEBUG=file-indexer
-# or
+export DEBUG=vectorizer
+# or all logs
 export DEBUG=*
 ```
-Logs will be in `.opencode/indexer.log`
-### Reindex everything
-```javascript
-codeindex({ action: "reindex", index: "code" })
-codeindex({ action: "reindex", index: "docs" })
-```
-### Check index status
-```javascript
-codeindex({ action: "list" })
-```
----
-## 🌟 Advantages
-### Compared to `grep`/`find`
-| Feature | grep/find | usethis_search |
-|---------|-----------|----------------|
-| Text search | ✅ | ✅ |
-| Semantic search | ❌ | ✅ |
-| Finds synonyms | ❌ | ✅ |
-| Understands context | ❌ | ✅ |
-| Works offline | ✅ | ✅ |
-| Auto-updates | ❌ | ✅ |
-### Compared to online search (GitHub Copilot, ChatGPT)
-| Feature | Online | usethis_search |
-|---------|--------|----------------|
-| Works offline | ❌ | ✅ |
-| Privacy | ❌ | ✅ |
-| Free | ❌ | ✅ |
-| Speed | 🐌 | ⚡ |
-| Knows your code | ❌ | ✅ |
+Indexing activity is logged to `.opencode/indexer.log`.
 ---
-## 📊 Technical Details
+## Technical Details
 - **Vectorization:** [@xenova/transformers](https://github.com/xenova/transformers.js) (ONNX Runtime)
 - **Vector DB:** [LanceDB](https://lancedb.com/) (local, serverless)
-- **Model:** `Xenova/all-MiniLM-L6-v2` (multilingual, 384 dimensions)
-- **Model size:** ~23 MB (downloaded once)
-- **Speed:** ~0.5 sec/file (after model loading)
-### v2 Architecture
-```
-File → Content Cleaner → Chunker Factory → Embedder → LanceDB
-                           ├── Markdown Chunker (heading-aware)
-                           ├── Code Chunker (function/class-aware)
-                           └── Fixed Chunker (fallback)
-Query → Query Cache → Embedder → Vector Search ─┐
-                    └──────────→ BM25 Search ────┤→ Hybrid Merge → Filter → Results
-                                                 │
-                                        Metadata Filter (type, lang, date, tags)
-```
-### New Modules (v2)
-| Module | Purpose |
-|--------|---------|
-| `content-cleaner.ts` | Remove noise (TOC, breadcrumbs, markers) |
-| `metadata-extractor.ts` | Extract file_type, language, tags, dates |
-| `markdown-chunker.ts` | Heading-aware splitting with hierarchy |
-| `code-chunker.ts` | Function/class-aware splitting |
-| `chunker-factory.ts` | Route to correct chunker by file type |
-| `bm25-index.ts` | Inverted index for keyword search |
-| `hybrid-search.ts` | Merge vector + BM25 scores |
-| `query-cache.ts` | LRU cache for query embeddings |
-| `search-metrics.ts` | Track search quality metrics |
+- **Chunk Store:** bun:sqlite (WAL mode, concurrent reads)
+- **Graph DB:** bun:sqlite (WAL mode, triple store)
+- **Model:** `Xenova/all-MiniLM-L6-v2` (multilingual, 384 dimensions, ~23 MB)
+- **Embedding speed:** ~0.5 sec/file
+- **Phase 1 speed:** ~0.05 sec/file (no embedding)
+- **Supported languages:** JavaScript, TypeScript, Python, Go, Rust, Java, Kotlin, Swift, C/C++, C#, Ruby, PHP, Scala, Clojure
 ---
-## 🤝 Contributing
-Found a bug? Have an idea? Open an issue or PR!
----
-## 📄 License
+## License
 MIT
 ---
-## 🎉 Authors
-Made with ❤️ by the **Comfanion** team
----
-**Search smart, not hard!** 🚀
+Made by the **Comfanion** team