@chiway/contextweaver 1.5.0 → 1.5.1
- package/LICENSE +1 -1
- package/README.md +436 -260
- package/README.zh-CN.md +669 -0
- package/dist/{chunk-ISVCQFB4.js → chunk-2EVCLNYN.js} +1 -1
- package/dist/{chunk-VWBKZ6QL.js → chunk-H4MGLXXF.js} +1 -1
- package/dist/{chunk-Y6H7C3NA.js → chunk-MN6BQJDB.js} +1 -1
- package/dist/{chunk-V3K4YVAR.js → chunk-ORYIVY7D.js} +1 -1
- package/dist/{chunk-R6CNZXZ7.js → chunk-YMQWNIQI.js} +1 -1
- package/dist/{chunk-GDVB6PJ4.js → chunk-YSQI5IRI.js} +104 -2
- package/dist/{codebaseRetrieval-DIS5RH2C.js → codebaseRetrieval-4BFIM7PU.js} +2 -2
- package/dist/{findReferences-N7ML7TUP.js → findReferences-EBYR3VNL.js} +2 -2
- package/dist/{getSymbolDefinition-6KMY4H33.js → getSymbolDefinition-ZQK65FPN.js} +2 -2
- package/dist/index.js +6 -6
- package/dist/{listFiles-4VT2TPJD.js → listFiles-W7C5UYOP.js} +2 -2
- package/dist/{scanner-QDFZJLP7.js → scanner-OVMAMQSQ.js} +1 -1
- package/dist/{server-UAI3U7AB.js → server-ZIJIRVWH.js} +5 -5
- package/package.json +7 -1
package/README.md
CHANGED
|
@@ -1,152 +1,218 @@
|
|
|
1
1
|
# ContextWeaver
|
|
2
2
|
|
|
3
3
|
<p align="center">
|
|
4
|
-
<strong>🧵
|
|
4
|
+
<strong>🧵 A codebase context engine woven for AI agents</strong>
|
|
5
5
|
</p>
|
|
6
6
|
|
|
7
7
|
<p align="center">
|
|
8
8
|
<em>Semantic Code Retrieval for AI Agents — Hybrid Search • Graph Expansion • Token-Aware Packing</em>
|
|
9
9
|
</p>
|
|
10
10
|
|
|
11
|
-
---
|
|
12
|
-
|
|
13
|
-
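**ContextWeaver** is a semantic retrieval engine purpose-built for AI coding assistants. It combines hybrid search (vector + lexical), intelligent context expansion, and token-aware packing to deliver precise, relevant, and context-complete code snippets to LLMs.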
|
|
14
|
-
|
|
15
11
|
<p align="center">
|
|
16
|
-
<
|
|
12
|
+
<strong>English</strong> ·
|
|
13
|
+
<a href="README.zh-CN.md">简体中文</a>
|
|
17
14
|
</p>
|
|
18
15
|
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
### 🔍 Hybrid Retrieval Engine
|
|
22
|
-
- **Vector Retrieval**: deep understanding based on semantic similarity
|
|
23
|
-
- **Lexical Retrieval (FTS)**: exact matching for function names, class names, and other technical terms
|
|
24
|
-
- **RRF Fusion (Reciprocal Rank Fusion)**: intelligently merges multiple recall channels
|
|
25
|
-
|
|
26
|
-
### 🧠 AST Semantic Chunking
|
|
27
|
-
- **Tree-sitter parsing**: supports TypeScript, JavaScript, Python, Go, Java, Rust, C, C++, C#, and more
|
|
28
|
-
- **Dual-Text strategy**: `displayCode` for presentation, `vectorText` for embedding
|
|
29
|
-
- **Gap-Aware merging**: handles code gaps intelligently while preserving semantic integrity
|
|
30
|
-
- **Breadcrumb injection**: vector text carries hierarchical paths to boost recall
|
|
31
|
-
- **UTF-16 character-domain normalization**: offsets are unified via `SourceAdapter.toCharOffset` before writing metadata, preventing multi-byte character slicing errors (v1.4.0+)
|
|
32
|
-
|
|
33
|
-
### 📊 Three-Stage Context Expansion
|
|
34
|
-
- **E1 Neighbor expansion**: adjacent chunks within the same file, preserving block completeness
|
|
35
|
-
- **E2 Breadcrumb completion**: sibling methods under the same class/function for structural understanding
|
|
36
|
-
- **E3 Import resolution**: cross-file dependency tracking (configurable toggle)
|
|
37
|
-
|
|
38
|
-
### 🎯 Smart Cutoff Strategy (Smart TopK)
|
|
39
|
-
- **Anchor & Floor**: dynamic threshold plus an absolute floor as dual safeguards
|
|
40
|
-
- **Delta Guard**: prevents misjudgment in Top1-outlier scenarios
|
|
41
|
-
- **Safe Harbor**: the first N results only check the floor, guaranteeing baseline recall
|
|
42
|
-
|
|
43
|
-
### 🔌 Native MCP Support
|
|
44
|
-
- **MCP Server mode**: launch a Model Context Protocol server with one command
|
|
45
|
-
- **Intent/term separation**: an LLM-friendly API design
|
|
46
|
-
- **Auto-indexing**: the first query triggers indexing automatically; incremental updates are transparent
|
|
16
|
+
---
|
|
47
17
|
|
|
48
|
-
|
|
49
|
-
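- **Single source of truth for content**: LanceDB stores only vectors and locating metadata; content is read back from `files.content`, reducing index size by 30-50%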
|
|
50
|
-
- **Cross-store transactional compensation**: three-stage write LanceDB → FTS+outbox → SQLite mark, with automatic rollback or replay on any failure
|
|
51
|
-
- **Migration state machine**: the three states `pending/done/aborted` are persisted, with automatic rebuild on crash recovery
|
|
52
|
-
- **Cross-process mutual exclusion**: an advisory lock prevents the MCP server and CLI from triggering LanceDB migration concurrently
|
|
53
|
-
- **chunk_id de-duplication**: pre-delete before write to avoid duplicate rows on retry
|
|
18
|
+
**ContextWeaver** is a semantic retrieval engine purpose-built for AI coding assistants. It combines hybrid search (vector + lexical), intelligent context expansion, and token-aware packing to deliver precise, relevant, and context-complete code snippets to LLMs.
|
|
54
19
|
|
|
55
|
-
|
|
20
|
+
<p align="center">
|
|
21
|
+
<img src="docs/architecture.png" alt="ContextWeaver architecture overview" width="800" />
|
|
22
|
+
</p>
|
|
56
23
|
|
|
57
|
-
|
|
24
|
+
## ✨ Core Features
|
|
25
|
+
|
|
26
|
+
### 🔍 Hybrid Retrieval Engine
|
|
27
|
+
- **Vector Retrieval**: deep semantic understanding via similarity
|
|
28
|
+
- **Lexical Retrieval (FTS)**: exact matching for function names, class names, and other technical terms
|
|
29
|
+
- **RRF Fusion (Reciprocal Rank Fusion)**: intelligently merges multiple recall channels
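
As a rough illustration of the fusion step, the sketch below applies weighted Reciprocal Rank Fusion to one vector-ranked list and one lexical-ranked list. It is not taken from the package source: `fuseRrf` and `FusedResult` are hypothetical names, and the defaults simply reuse the documented values `wVec = 0.6`, `wLex = 0.4`, and `rrfK0 = 20`. The scoring formula `weight / (k0 + rank)` is the standard RRF form and is assumed here, not confirmed.

```typescript
// Hypothetical sketch: weighted Reciprocal Rank Fusion over two ranked lists.
// fuseRrf and FusedResult are illustrative names, not ContextWeaver's API.
interface FusedResult {
  id: string;
  score: number;
}

function fuseRrf(
  vectorRanked: string[], // chunk ids from vector recall, best first
  lexicalRanked: string[], // chunk ids from lexical recall, best first
  wVec = 0.6,
  wLex = 0.4,
  k0 = 20,
): FusedResult[] {
  const scores = new Map<string, number>();
  const accumulate = (ranked: string[], weight: number): void => {
    ranked.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k0 + rank));
    });
  };
  accumulate(vectorRanked, wVec);
  accumulate(lexicalRanked, wLex);
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const fused = fuseRrf(["a", "b", "c"], ["b", "d"]);
console.log(fused.map((r) => r.id).join(" > "));
```

In this example `b` ranks first because it appears in both lists, even though it is not the top hit of the vector channel.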
|
|
30
|
+
|
|
31
|
+
### 🧠 AST Semantic Chunking
|
|
32
|
+
- **Tree-sitter parsing**: supports TypeScript, JavaScript, Python, Go, Java, Rust, C, C++, C#, and more
|
|
33
|
+
- **Dual-Text strategy**: `displayCode` for presentation, `vectorText` for embedding
|
|
34
|
+
- **Gap-Aware merging**: handles code gaps intelligently while preserving semantic integrity
|
|
35
|
+
- **Breadcrumb injection**: vector text carries hierarchical paths to boost recall
|
|
36
|
+
- **UTF-16 character-domain normalization**: offsets are unified via `SourceAdapter.toCharOffset` before writing metadata, preventing multi-byte character slicing errors (v1.4.0+)
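
To show why offset normalization matters, here is a minimal sketch that maps a UTF-8 byte offset to a UTF-16 code-unit offset, the unit JavaScript uses for `String.prototype.slice`. It assumes the offsets being normalized are UTF-8 byte offsets; `byteToCharOffset` is a hypothetical stand-in and may not match how `SourceAdapter.toCharOffset` is actually implemented.

```typescript
// Hypothetical helper: map a UTF-8 byte offset to a UTF-16 code-unit offset.
// Assumes byteOffset falls on a character boundary.
function byteToCharOffset(source: string, byteOffset: number): number {
  const bytes = new TextEncoder().encode(source);
  // ignoreBOM keeps a leading BOM in the decoded prefix so the length stays aligned.
  const decoder = new TextDecoder("utf-8", { ignoreBOM: true });
  // The decoded prefix's string length is the UTF-16 offset.
  return decoder.decode(bytes.subarray(0, byteOffset)).length;
}

const source = "a😀b"; // "😀" is 4 bytes in UTF-8 but 2 code units in UTF-16
const charOffset = byteToCharOffset(source, 5); // 1 byte for "a" + 4 bytes for "😀"
console.log(charOffset, source.slice(charOffset));
```

Slicing the string with the raw byte offset (5) would run past the end of this 4-unit string, which is the kind of misalignment the normalization is meant to prevent.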
|
|
37
|
+
|
|
38
|
+
### 📊 Three-Stage Context Expansion
|
|
39
|
+
- **E1 Neighbor expansion**: adjacent chunks within the same file, preserving block completeness
|
|
40
|
+
- **E2 Breadcrumb completion**: sibling methods under the same class/function for structural understanding
|
|
41
|
+
- **E3 Import resolution**: cross-file dependency tracking (configurable toggle)
|
|
42
|
+
|
|
43
|
+
### 🎯 Smart TopK Cutoff
|
|
44
|
+
- **Anchor & Floor**: dynamic threshold plus an absolute floor as dual safeguards
|
|
45
|
+
- **Delta Guard**: prevents misjudgment in Top1-outlier scenarios
|
|
46
|
+
- **Safe Harbor**: the first N results only check the floor, guaranteeing baseline recall
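
The three rules above can be combined in more than one way, and the package's exact logic is not shown in this README. The sketch below is one possible reading, under these assumptions: the dynamic threshold is the stricter of `top * ratio` and `top - deltaAbs`, the first `minK` results are checked against the floor only, and the cut is final at the first failing result. `smartTopK` and `SmartTopKOptions` are hypothetical names; the defaults reuse the documented config values.

```typescript
// Hypothetical sketch of a Smart TopK cutoff; one interpretation, not the package's code.
interface SmartTopKOptions {
  ratio: number; // smartTopScoreRatio
  deltaAbs: number; // smartTopScoreDeltaAbs
  minScore: number; // smartMinScore (absolute floor)
  minK: number; // smartMinK (Safe Harbor count)
  maxK: number; // smartMaxK (hard upper bound)
}

const DEFAULTS: SmartTopKOptions = {
  ratio: 0.5,
  deltaAbs: 0.25,
  minScore: 0.25,
  minK: 2,
  maxK: 8,
};

function smartTopK(scores: number[], opts: SmartTopKOptions = DEFAULTS): number[] {
  const kept: number[] = [];
  if (scores.length === 0) return kept;
  const sorted = [...scores].sort((a, b) => b - a);
  const top = Math.max(...scores);
  // Assumed combination: a result must pass both the ratio anchor and the delta guard.
  const dynamicThreshold = Math.max(top * opts.ratio, top - opts.deltaAbs);
  for (const score of sorted) {
    if (kept.length >= opts.maxK) break; // hard upper bound
    if (score < opts.minScore) break; // absolute floor applies to every result
    // kept.length equals the current rank because the loop stops at the first failure.
    if (kept.length >= opts.minK && score < dynamicThreshold) break;
    kept.push(score);
  }
  return kept;
}

console.log(smartTopK([0.9, 0.5, 0.4, 0.3, 0.1]));
```

With these inputs the second result survives only because of Safe Harbor: 0.5 is below the dynamic threshold of 0.65 but above the 0.25 floor.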
|
|
47
|
+
|
|
48
|
+
### 🔌 Native MCP Support
|
|
49
|
+
- **MCP Server mode**: launch a Model Context Protocol server with one command
|
|
50
|
+
- **Multi-tool granularity** (v1.5.0+): beyond core semantic retrieval, adds dedicated tools for structure browsing, symbol references, symbol definitions, and statistics
|
|
51
|
+
- **Intent/term separation**: an LLM-friendly API design
|
|
52
|
+
- **Auto-indexing**: the first query triggers indexing automatically; incremental updates are transparent
|
|
53
|
+
|
|
54
|
+
### ⚡ Query Cache & File Watching (v1.5.0+)
|
|
55
|
+
- **Query cache (QueryCache)**: in-process per-project LRU cache (50 entries by default); a hit skips the entire vector recall / rerank / expansion pipeline
|
|
56
|
+
- **Automatic cache invalidation**: the cache key is composed of `normalized query + projectId + index version + search-config fingerprint`, so it invalidates automatically after an index update or config change — stale results are never returned
|
|
57
|
+
- **Watch mode**: `contextweaver watch` watches the filesystem and triggers incremental indexing automatically, with debouncing (500ms by default) and scan de-duplication (no concurrent scans)
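
The query-cache behavior described above can be sketched with a `Map`-based LRU and a composite key. This is an illustration only: `LruCache` and `cacheKey` are hypothetical names, and the query normalization (trim plus whitespace collapsing) is an assumption, since the README does not say how queries are normalized.

```typescript
// Hypothetical sketch of a per-project LRU query cache; not the package's QueryCache.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity = 50) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the entry as most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Any change to the index version or config fingerprint yields a different key,
// so older entries are simply never hit again and age out of the LRU.
function cacheKey(
  query: string,
  projectId: string,
  indexVersion: number,
  configFingerprint: string,
): string {
  const normalized = query.trim().replace(/\s+/g, " "); // assumed normalization
  return JSON.stringify([normalized, projectId, indexVersion, configFingerprint]);
}

const cache = new LruCache<string[]>(50);
cache.set(cacheKey("auth flow", "proj-1", 7, "cfg-a"), ["src/auth.ts"]);
console.log(cache.get(cacheKey("  auth   flow ", "proj-1", 7, "cfg-a")));
console.log(cache.get(cacheKey("auth flow", "proj-1", 8, "cfg-a")));
```

Binding the version into the key avoids explicit invalidation calls: a re-index bumps the version, and lookups miss naturally.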
|
|
58
|
+
|
|
59
|
+
### 📈 Statistics & Observability (v1.5.0+)
|
|
60
|
+
- **Three metric groups**: indexing process, search quality/behavior, health/consistency
|
|
61
|
+
- **Two access paths**: the `contextweaver stats` CLI (with `--json`) and the MCP `stats` tool
|
|
62
|
+
- **Consistency diagnostics**: automatically detects abnormal migration state, `pending_marks` backlog, missing vector rows, and more — with suggested fixes
|
|
63
|
+
|
|
64
|
+
### 🛡️ Crash-Safe Data Architecture (v1.4.0+)
|
|
65
|
+
- **Single source of truth for content**: LanceDB stores only vectors and locating metadata; content is read back from `files.content`, reducing index size by 30–50%
|
|
66
|
+
- **Cross-store transactional compensation**: three-stage write LanceDB → FTS+outbox → SQLite mark, with automatic rollback or replay on any failure
|
|
67
|
+
- **Migration state machine**: `pending/done/aborted` persisted, auto-rebuilt on crash recovery
|
|
68
|
+
- **Cross-process mutual exclusion**: an advisory lock prevents the MCP server and CLI from triggering LanceDB migration concurrently
|
|
69
|
+
- **chunk_id de-duplication**: pre-delete before write to avoid duplicate rows on retry
|
|
70
|
+
|
|
71
|
+
## 📦 Quick Start
|
|
72
|
+
|
|
73
|
+
### Requirements
|
|
58
74
|
|
|
59
75
|
- Node.js >= 20
|
|
60
|
-
- pnpm (
|
|
76
|
+
- pnpm (recommended) or npm
|
|
61
77
|
|
|
62
|
-
###
|
|
78
|
+
### Installation
|
|
63
79
|
|
|
64
80
|
```bash
|
|
65
|
-
#
|
|
81
|
+
# Global install
|
|
66
82
|
npm install -g @chiway/contextweaver
|
|
67
83
|
|
|
68
|
-
#
|
|
84
|
+
# Or with pnpm
|
|
69
85
|
pnpm add -g @chiway/contextweaver
|
|
70
86
|
```
|
|
71
87
|
|
|
72
|
-
###
|
|
88
|
+
### Initialize Configuration
|
|
73
89
|
|
|
74
90
|
```bash
|
|
75
|
-
#
|
|
91
|
+
# Create the config file (~/.contextweaver/.env)
|
|
76
92
|
contextweaver init
|
|
77
|
-
#
|
|
93
|
+
# Or the short alias
|
|
78
94
|
cw init
|
|
79
95
|
```
|
|
80
96
|
|
|
81
|
-
|
|
97
|
+
Edit `~/.contextweaver/.env` and fill in your API keys:
|
|
82
98
|
|
|
83
99
|
```bash
|
|
84
|
-
# Embedding API
|
|
100
|
+
# Embedding API config (required)
|
|
85
101
|
EMBEDDINGS_API_KEY=your-api-key-here
|
|
86
102
|
EMBEDDINGS_BASE_URL=https://api.siliconflow.cn/v1/embeddings
|
|
87
103
|
EMBEDDINGS_MODEL=BAAI/bge-m3
|
|
88
104
|
EMBEDDINGS_MAX_CONCURRENCY=10
|
|
89
105
|
EMBEDDINGS_DIMENSIONS=1024
|
|
90
106
|
|
|
91
|
-
# Reranker
|
|
107
|
+
# Reranker config (required)
|
|
92
108
|
RERANK_API_KEY=your-api-key-here
|
|
93
109
|
RERANK_BASE_URL=https://api.siliconflow.cn/v1/rerank
|
|
94
110
|
RERANK_MODEL=BAAI/bge-reranker-v2-m3
|
|
95
111
|
RERANK_TOP_N=20
|
|
96
112
|
|
|
97
|
-
#
|
|
113
|
+
# Search parameters (optional, override built-in defaults)
|
|
114
|
+
CW_SEARCH_WVEC=0.6
|
|
115
|
+
CW_SEARCH_WLEX=0.4
|
|
116
|
+
CW_SEARCH_RERANK_TOP_N=10
|
|
117
|
+
CW_SEARCH_MAX_TOTAL_CHARS=48000
|
|
118
|
+
CW_SEARCH_VECTOR_TOP_K=80
|
|
119
|
+
CW_SEARCH_SMART_MAX_K=8
|
|
120
|
+
CW_SEARCH_IMPORT_FILES_PER_SEED=3
|
|
121
|
+
|
|
122
|
+
# Ignore patterns (optional, comma-separated)
|
|
98
123
|
# IGNORE_PATTERNS=.venv,node_modules
|
|
99
124
|
```
|
|
100
125
|
|
|
101
|
-
###
|
|
126
|
+
### Index a Codebase
|
|
102
127
|
|
|
103
128
|
```bash
|
|
104
|
-
#
|
|
129
|
+
# Run from the codebase root
|
|
105
130
|
contextweaver index
|
|
106
131
|
|
|
107
|
-
#
|
|
132
|
+
# Specify a path
|
|
108
133
|
contextweaver index /path/to/your/project
|
|
109
134
|
|
|
110
|
-
#
|
|
135
|
+
# Force a full re-index
|
|
111
136
|
contextweaver index --force
|
|
112
137
|
```
|
|
113
138
|
|
|
114
|
-
###
|
|
139
|
+
### Watch Mode (v1.5.0+)
|
|
140
|
+
|
|
141
|
+
```bash
|
|
142
|
+
# Watch for file changes and auto-index incrementally (Ctrl+C to stop)
|
|
143
|
+
contextweaver watch
|
|
144
|
+
|
|
145
|
+
# Specify a path and debounce window (ms)
|
|
146
|
+
contextweaver watch /path/to/project --debounce 800
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
`watch` runs one full incremental scan on startup, then listens to filesystem events; changes trigger a de-duplicated scan within the debounce window, and paths excluded by ignore rules never trigger a scan.
|
|
150
|
+
|
|
151
|
+
### Local Search
|
|
152
|
+
|
|
153
|
+
```bash
|
|
154
|
+
# Semantic search
|
|
155
|
+
cw search --information-request "How is the user authentication flow implemented?"
|
|
156
|
+
|
|
157
|
+
# With exact terms
|
|
158
|
+
cw search --information-request "Database connection logic" --technical-terms "DatabasePool,Connection"
|
|
159
|
+
```
|
|
160
|
+
|
|
161
|
+
### Structure Browsing & Symbol Lookup (v1.5.0+)
|
|
162
|
+
|
|
163
|
+
The following commands are CLI mirrors of MCP tools, with zero Embedding API cost:
|
|
164
|
+
|
|
165
|
+
```bash
|
|
166
|
+
# List indexed files (supports glob / language / count filters)
|
|
167
|
+
contextweaver list-files --glob "src/**/*.ts" --language typescript --max-results 100
|
|
168
|
+
|
|
169
|
+
# Look up a symbol definition
|
|
170
|
+
contextweaver definition SearchService --hint-path src/search
|
|
171
|
+
|
|
172
|
+
# Look up symbol references
|
|
173
|
+
contextweaver references handleStats --exclude-definition
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
### Statistics (v1.5.0+)
|
|
115
177
|
|
|
116
178
|
```bash
|
|
117
|
-
#
|
|
118
|
-
|
|
179
|
+
# Human-readable stats report
|
|
180
|
+
contextweaver stats
|
|
181
|
+
|
|
182
|
+
# JSON output (for scripting)
|
|
183
|
+
contextweaver stats --json
|
|
119
184
|
|
|
120
|
-
#
|
|
121
|
-
|
|
185
|
+
# Specify a project path
|
|
186
|
+
contextweaver stats --path /path/to/project
|
|
122
187
|
```
|
|
123
188
|
|
|
124
|
-
###
|
|
189
|
+
### Start the MCP Server
|
|
125
190
|
|
|
126
191
|
```bash
|
|
127
|
-
#
|
|
192
|
+
# Launch the MCP server (for use by Claude and other AI assistants)
|
|
128
193
|
contextweaver mcp
|
|
129
194
|
```
|
|
130
195
|
|
|
131
|
-
###
|
|
196
|
+
### Index Management (v1.4.0+)
|
|
132
197
|
|
|
133
198
|
```bash
|
|
134
|
-
#
|
|
199
|
+
# Show LanceDB migration state
|
|
135
200
|
contextweaver migrate
|
|
136
201
|
|
|
137
|
-
#
|
|
138
|
-
#
|
|
202
|
+
# Clear the aborted state: wipe LanceDB and trigger a full rebuild
|
|
203
|
+
# Triggered when: the Indexer refuses to write after sampling validation fails;
|
|
204
|
+
# run this, then index again.
|
|
139
205
|
contextweaver migrate --reset
|
|
140
206
|
|
|
141
|
-
#
|
|
207
|
+
# Specify a project path
|
|
142
208
|
contextweaver migrate --path /path/to/project
|
|
143
209
|
```
|
|
144
210
|
|
|
145
|
-
## 🔧 MCP
|
|
211
|
+
## 🔧 MCP Integration
|
|
146
212
|
|
|
147
|
-
### Claude Desktop
|
|
213
|
+
### Claude Desktop Configuration
|
|
148
214
|
|
|
149
|
-
|
|
215
|
+
Add the following to your Claude Desktop config file:
|
|
150
216
|
|
|
151
217
|
```json
|
|
152
218
|
{
|
|
@@ -159,25 +225,62 @@ contextweaver migrate --path /path/to/project
|
|
|
159
225
|
}
|
|
160
226
|
```
|
|
161
227
|
|
|
162
|
-
### MCP
|
|
228
|
+
### MCP Tools Overview (v1.5.0+)
|
|
229
|
+
|
|
230
|
+
ContextWeaver exposes 5 MCP tools, following a layered design of "semantic retrieval first, structure browsing second":
|
|
231
|
+
|
|
232
|
+
| Tool | Purpose | Embedding cost |
|
|
233
|
+
|------|---------|----------------|
|
|
234
|
+
| `codebase-retrieval` | **Primary tool**: hybrid semantic + exact-match retrieval | Yes |
|
|
235
|
+
| `list-files` | List indexed file structure (path/language/size) | No |
|
|
236
|
+
| `find-references` | Find heuristic text references to a symbol | No |
|
|
237
|
+
| `get-symbol-definition` | Find likely definition blocks for a symbol | No |
|
|
238
|
+
| `stats` | Index/search/health statistics | No |
|
|
239
|
+
|
|
240
|
+
#### `codebase-retrieval` Parameters
|
|
241
|
+
|
|
242
|
+
| Parameter | Type | Required | Description |
|
|
243
|
+
|-----------|------|----------|-------------|
|
|
244
|
+
| `repo_path` | string | ✅ | Absolute path to the repository root |
|
|
245
|
+
| `information_request` | string | ✅ | The semantic intent in natural language |
|
|
246
|
+
| `technical_terms` | string[] | ❌ | Exact technical terms (class/function names, etc.) |
|
|
247
|
+
|
|
248
|
+
#### `list-files` Parameters
|
|
249
|
+
|
|
250
|
+
| Parameter | Type | Required | Description |
|
|
251
|
+
|-----------|------|----------|-------------|
|
|
252
|
+
| `repo_path` | string | ✅ | Absolute path to the repository root |
|
|
253
|
+
| `glob` | string | ❌ | Glob pattern to filter paths |
|
|
254
|
+
| `language` | string | ❌ | Language filter (matched against `files.language`) |
|
|
255
|
+
| `max_results` | number | ❌ | Max files to return (default 200) |
|
|
256
|
+
|
|
257
|
+
#### `find-references` Parameters
|
|
163
258
|
|
|
164
|
-
|
|
259
|
+
| Parameter | Type | Required | Description |
|
|
260
|
+
|-----------|------|----------|-------------|
|
|
261
|
+
| `repo_path` | string | ✅ | Absolute path to the repository root |
|
|
262
|
+
| `symbol` | string | ✅ | Exact symbol name |
|
|
263
|
+
| `exclude_definition` | boolean | ❌ | Exclude chunks whose breadcrumb tail matches the symbol name |
|
|
264
|
+
| `max_results` | number | ❌ | Max references to return (default 50) |
|
|
165
265
|
|
|
166
|
-
####
|
|
266
|
+
#### `get-symbol-definition` Parameters
|
|
167
267
|
|
|
168
|
-
|
|
|
169
|
-
|
|
170
|
-
| `repo_path` | string | ✅ |
|
|
171
|
-
| `
|
|
172
|
-
| `
|
|
268
|
+
| Parameter | Type | Required | Description |
|
|
269
|
+
|-----------|------|----------|-------------|
|
|
270
|
+
| `repo_path` | string | ✅ | Absolute path to the repository root |
|
|
271
|
+
| `symbol` | string | ✅ | Exact symbol name to resolve |
|
|
272
|
+
| `hint_path` | string | ❌ | Preferred path to disambiguate same-name definitions |
|
|
273
|
+
| `max_results` | number | ❌ | Max definitions to return (default 3) |
|
|
173
274
|
|
|
174
|
-
|
|
275
|
+
> **Note**: `find-references` and `get-symbol-definition` are heuristic text lookups over indexed chunks, not compiler-accurate navigation. For exhaustive raw text matching, use `grep` outside MCP.
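
For client authors, the parameter tables above translate directly into argument types. The interfaces below are a convenience sketch derived from those tables; the interface names are hypothetical and the package may not export anything like them. The example values are taken from the CLI examples elsewhere in this README.

```typescript
// Hypothetical argument types derived from the parameter tables; not exported by the package.
interface CodebaseRetrievalArgs {
  repo_path: string; // absolute path to the repository root
  information_request: string; // semantic intent in natural language
  technical_terms?: string[]; // exact class/function names
}

interface ListFilesArgs {
  repo_path: string;
  glob?: string;
  language?: string;
  max_results?: number; // documented default: 200
}

interface FindReferencesArgs {
  repo_path: string;
  symbol: string;
  exclude_definition?: boolean;
  max_results?: number; // documented default: 50
}

interface GetSymbolDefinitionArgs {
  repo_path: string;
  symbol: string;
  hint_path?: string;
  max_results?: number; // documented default: 3
}

const retrievalArgs: CodebaseRetrievalArgs = {
  repo_path: "/path/to/your/project",
  information_request: "How is the user authentication flow implemented?",
  technical_terms: ["DatabasePool", "Connection"],
};

const referencesArgs: FindReferencesArgs = {
  repo_path: "/path/to/your/project",
  symbol: "handleStats",
  exclude_definition: true,
};

const definitionArgs: GetSymbolDefinitionArgs = {
  repo_path: "/path/to/your/project",
  symbol: "SearchService",
  hint_path: "src/search",
};

console.log(JSON.stringify({ retrievalArgs, referencesArgs, definitionArgs }, null, 2));
```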
|
|
175
276
|
|
|
176
|
-
|
|
177
|
-
- **Same-file context first**: same-file context is provided by default; cross-file exploration is initiated by the agent
|
|
178
|
-
- **Lean on the agent's instincts**: the tool only locates relevant code; the agent triggers cross-file exploration on demand
|
|
277
|
+
#### Design Philosophy
|
|
179
278
|
|
|
180
|
-
|
|
279
|
+
- **Intent/term separation**: `information_request` describes "what to do", `technical_terms` filters "what it's called"
|
|
280
|
+
- **Same-file context first**: same-file context is provided by default; cross-file exploration is initiated by the agent
|
|
281
|
+
- **Lean on the agent's instincts**: the tool only locates relevant code; the agent triggers cross-file exploration on demand
|
|
282
|
+
|
|
283
|
+
## 🏗️ Architecture
|
|
181
284
|
|
|
182
285
|
```mermaid
|
|
183
286
|
flowchart TB
|
|
@@ -187,9 +290,11 @@ flowchart TB
|
|
|
187
290
|
end
|
|
188
291
|
|
|
189
292
|
subgraph Search["SearchService"]
|
|
293
|
+
QC[QueryCache<br/>LRU]
|
|
190
294
|
VR[Vector Retrieval]
|
|
191
295
|
LR[Lexical Retrieval]
|
|
192
296
|
RRF[RRF Fusion + Rerank]
|
|
297
|
+
QC -.cache hit.-> CP
|
|
193
298
|
VR --> RRF
|
|
194
299
|
LR --> RRF
|
|
195
300
|
end
|
|
@@ -216,77 +321,89 @@ flowchart TB
|
|
|
216
321
|
Index --> Storage
|
|
217
322
|
```
|
|
218
323
|
|
|
219
|
-
###
|
|
324
|
+
### Core Modules
|
|
220
325
|
|
|
221
|
-
|
|
|
222
|
-
|
|
223
|
-
| **SearchService** |
|
|
224
|
-
| **
|
|
225
|
-
| **
|
|
226
|
-
| **
|
|
227
|
-
| **
|
|
228
|
-
| **
|
|
229
|
-
| **
|
|
230
|
-
| **
|
|
326
|
+
| Module | Responsibility |
|
|
327
|
+
|--------|----------------|
|
|
328
|
+
| **SearchService** | Hybrid search core: coordinates vector/lexical recall, RRF fusion, rerank; integrates QueryCache |
|
|
329
|
+
| **QueryCache** | Per-project in-process LRU cache (v1.5.0+); a hit skips the entire retrieval pipeline |
|
|
330
|
+
| **GraphExpander** | Context expander: runs the E1/E2/E3 three-stage expansion strategy |
|
|
331
|
+
| **ContextPacker** | Context packer: segment merging and token budget control |
|
|
332
|
+
| **ChunkContentLoader** | Slices `files.content` by `(path, start_index, end_index)` (v1.4.0+) |
|
|
333
|
+
| **VectorStore** | LanceDB adapter; exposes pure vector operations only |
|
|
334
|
+
| **Database (SQLite)** | Metadata storage + FTS5 full-text index + statistics counters, schema_version=3 |
|
|
335
|
+
| **Bootstrap** | Cross-store init coordinator: pending_marks replay + LanceDB schema migration (v1.4.0+) |
|
|
336
|
+
| **SemanticSplitter** | AST semantic chunker (Tree-sitter); normalizes offsets to the UTF-16 character domain on write |
|
|
337
|
+
| **Watcher** | File-watch coordinator (v1.5.0+): debounce + scan de-duplication + ignore filtering |
|
|
338
|
+
| **Stats** | Statistics aggregation layer (v1.5.0+): combines index/search/health metrics |
|
|
231
339
|
|
|
232
|
-
###
|
|
340
|
+
### Data Architecture (v1.4.0+)
|
|
233
341
|
|
|
234
342
|
```
|
|
235
343
|
~/.contextweaver/<projectId>/
|
|
236
344
|
├── index.db # SQLite
|
|
237
|
-
│ ├── files #
|
|
238
|
-
│ ├── files_fts #
|
|
239
|
-
│ ├── chunks_fts #
|
|
345
|
+
│ ├── files # File metadata + full content (content column, the only source for text slicing)
|
|
346
|
+
│ ├── files_fts # External-content table, inverted index pointing to files
|
|
347
|
+
│ ├── chunks_fts # Chunk-level inverted index, per-file wholesale replacement
|
|
240
348
|
│ ├── metadata # schema_version / lancedb_migration_state / lock
|
|
241
|
-
│
|
|
242
|
-
└──
|
|
349
|
+
│ ├── stats # Cumulative index/search counters (v1.5.0+)
|
|
350
|
+
│ └── pending_marks # Outbox: replayed when a vector_index_hash mark failed
|
|
351
|
+
└── vectors.lance/ # LanceDB chunks table (vectors + locating metadata only, no content)
|
|
243
352
|
```
|
|
244
353
|
|
|
245
|
-
|
|
246
|
-
-
|
|
247
|
-
-
|
|
248
|
-
-
|
|
249
|
-
- LanceDB
|
|
354
|
+
**Key invariants**:
|
|
355
|
+
- The single source of truth for content is `files.content`; `ChunkContentLoader` slices via `start_index/end_index` (same source as `displayCode`)
|
|
356
|
+
- All LanceDB offset fields live in the UTF-16 character domain; multi-byte files are never sliced incorrectly
|
|
357
|
+
- Cross-store write order: LanceDB → (FTS + outbox single transaction) → SQLite mark + clear outbox
|
|
358
|
+
- LanceDB migration state `pending/done/aborted` is persisted, with cross-process mutual exclusion via an advisory lock
|
|
359
|
+
- The query cache key is bound to the index version and search-config fingerprint; it invalidates on any index or config change
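
The cross-store write order and the outbox replay can be modeled in memory to make the recovery path concrete. The sketch below is a simplified illustration under stated assumptions: the three stores are plain `Map`s, a "crash" is simulated with a flag, and replay only marks a file if its vectors are present with the same hash. `Stores`, `indexFile`, and `replayPendingMarks` are hypothetical names, not the package's Indexer or Bootstrap API.

```typescript
// In-memory stand-ins for the three stores. All names here are hypothetical.
class Stores {
  vectors = new Map<string, string>(); // LanceDB: path -> content hash
  fts = new Map<string, string>(); // FTS index: path -> content hash
  pendingMarks = new Map<string, string>(); // outbox: path -> content hash
  marks = new Map<string, string>(); // SQLite vector_index_hash: path -> content hash
}

function indexFile(
  stores: Stores,
  path: string,
  hash: string,
  crashBeforeMark = false,
): boolean {
  // Stage 1: write vectors first.
  stores.vectors.set(path, hash);
  // Stage 2: FTS rows and the outbox entry commit together (a single transaction in a real store).
  stores.fts.set(path, hash);
  stores.pendingMarks.set(path, hash);
  // Stage 3: record the mark, then clear the outbox entry.
  if (crashBeforeMark) return false; // simulate a crash between stage 2 and stage 3
  stores.marks.set(path, hash);
  stores.pendingMarks.delete(path);
  return true;
}

// Run at startup: finish any marks that were left in the outbox.
function replayPendingMarks(stores: Stores): number {
  let replayed = 0;
  for (const [path, hash] of Array.from(stores.pendingMarks)) {
    // Assumed check: only mark files whose vectors landed with the same hash.
    // Unmarked files are expected to be picked up again by the next index run.
    if (stores.vectors.get(path) === hash) {
      stores.marks.set(path, hash);
      replayed += 1;
    }
    stores.pendingMarks.delete(path);
  }
  return replayed;
}

const stores = new Stores();
indexFile(stores, "src/a.ts", "h1", true); // crash before the mark
console.log(stores.marks.has("src/a.ts"), stores.pendingMarks.size);
console.log(replayPendingMarks(stores), stores.marks.get("src/a.ts"));
```

Writing the outbox entry in the same transaction as the FTS rows is what makes the final mark recoverable: after a crash, the pending entry is the durable record that stage 3 still has to run.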
|
|
250
360
|
|
|
251
|
-
## 📁
|
|
361
|
+
## 📁 Project Structure
|
|
252
362
|
|
|
253
363
|
```
|
|
254
364
|
contextweaver/
|
|
255
365
|
├── src/
|
|
256
|
-
│ ├── index.ts # CLI
|
|
257
|
-
│ ├── config.ts #
|
|
258
|
-
│ ├──
|
|
366
|
+
│ ├── index.ts # CLI entry (init / index / watch / search / mcp / migrate / stats)
|
|
367
|
+
│ ├── config.ts # Config management (environment variables)
|
|
368
|
+
│ ├── defaultEnv.ts # Default .env template
|
|
369
|
+
│ ├── cli/
|
|
370
|
+
│ │ └── mirrorCommands.ts # CLI mirrors of MCP tools (list-files / definition / references)
|
|
371
|
+
│ ├── api/ # External API wrappers
|
|
259
372
|
│ │ ├── embedding.ts # Embedding API
|
|
260
373
|
│ │ └── reranker.ts # Reranker API
|
|
261
|
-
│ ├── chunking/ #
|
|
262
|
-
│ │ ├── SemanticSplitter.ts # AST
|
|
263
|
-
│ │ ├── SourceAdapter.ts #
|
|
264
|
-
│ │ ├── LanguageSpec.ts #
|
|
265
|
-
│ │ ├── ParserPool.ts # Tree-sitter
|
|
266
|
-
│ │ └── types.ts #
|
|
267
|
-
│ ├── scanner/ #
|
|
268
|
-
│ │ ├──
|
|
269
|
-
│ │ ├──
|
|
270
|
-
│ │ ├──
|
|
271
|
-
│ │ ├──
|
|
272
|
-
│ │
|
|
273
|
-
│ ├──
|
|
274
|
-
│ │ └──
|
|
275
|
-
│ ├──
|
|
276
|
-
│ │ └── index.ts # LanceDB
|
|
277
|
-
│ ├──
|
|
278
|
-
│ │
|
|
279
|
-
│
|
|
280
|
-
│ ├──
|
|
281
|
-
│ │
|
|
282
|
-
│
|
|
283
|
-
│ │ ├──
|
|
284
|
-
│ │ ├──
|
|
285
|
-
│ │ ├──
|
|
286
|
-
│ │ ├──
|
|
287
|
-
│ │ ├──
|
|
288
|
-
│ │ ├──
|
|
289
|
-
│ │
|
|
374
|
+
│ ├── chunking/ # Semantic chunking
|
|
375
|
+
│ │ ├── SemanticSplitter.ts # AST semantic chunker
|
|
376
|
+
│ │ ├── SourceAdapter.ts # Source adapter (UTF-16/UTF-8 domain normalization)
|
|
377
|
+
│ │ ├── LanguageSpec.ts # Language spec definitions
|
|
378
|
+
│ │ ├── ParserPool.ts # Tree-sitter parser pool
|
|
379
|
+
│ │ └── types.ts # Chunking type definitions
|
|
380
|
+
│ ├── scanner/ # File scanning
|
|
381
|
+
│ │ ├── index.ts # Scan orchestration
|
|
382
|
+
│ │ ├── crawler.ts # Filesystem traversal
|
|
383
|
+
│ │ ├── processor.ts # File processing
|
|
384
|
+
│ │ ├── watcher.ts # File-watch coordinator (v1.5.0+)
|
|
385
|
+
│ │ ├── filter.ts # Filter rules
|
|
386
|
+
│ │ ├── hash.ts # File hash
|
|
387
|
+
│ │ └── language.ts # Language detection
|
|
388
|
+
│ ├── indexer/ # Indexer
|
|
389
|
+
│ │ └── index.ts # Three-stage transaction (LanceDB → FTS+outbox → SQLite mark)
|
|
390
|
+
│ ├── vectorStore/ # Vector storage
|
|
391
|
+
│ │ └── index.ts # LanceDB adapter (pure vector operations)
|
|
392
|
+
│ ├── db/ # Database
|
|
393
|
+
│ │ ├── index.ts # SQLite + FTS5 + pending_marks + migration state machine + stats counters
|
|
394
|
+
│ │ └── bootstrap.ts # Cross-store init coordinator (v1.4.0+)
|
|
395
|
+
│ ├── search/ # Search service
|
|
396
|
+
│ │ ├── SearchService.ts # Core search service (cache-integrated)
|
|
397
|
+
│ │ ├── QueryCache.ts # Per-project LRU query cache (v1.5.0+)
|
|
398
|
+
│ │ ├── GraphExpander.ts # Context expander
|
|
399
|
+
│ │ ├── ContextPacker.ts # Context packer
|
|
400
|
+
│ │ ├── ChunkContentLoader.ts # Slices by (path, start_index, end_index) (v1.4.0+)
|
|
401
|
+
│ │ ├── fts.ts # Full-text search (per-file wholesale replacement)
|
|
402
|
+
│ │ ├── config.ts # Search default config + value bounds
|
|
403
|
+
│ │ ├── loadConfig.ts # Env-var overrides + config fingerprint (v1.5.0+)
|
|
404
|
+
│ │ ├── types.ts # Type definitions
|
|
405
|
+
│ │ ├── utils.ts # Token-overlap scoring
|
|
406
|
+
│ │ └── resolvers/ # Multi-language import resolvers
|
|
290
407
|
│ │ ├── JsTsResolver.ts
|
|
291
408
|
│ │ ├── PythonResolver.ts
|
|
292
409
|
│ │ ├── GoResolver.ts
|
|
@@ -294,85 +411,118 @@ contextweaver/
|
|
|
294
411
|
│ │ ├── RustResolver.ts
|
|
295
412
|
│ │ ├── CppResolver.ts
|
|
296
413
|
│ │ └── CSharpResolver.ts
|
|
297
|
-
│ ├──
|
|
298
|
-
│ │
|
|
299
|
-
│
|
|
414
|
+
│ ├── stats/ # Statistics aggregation layer (v1.5.0+)
|
|
415
|
+
│ │ └── index.ts # Aggregates and renders index/search/health metrics
|
|
416
|
+
│ ├── mcp/ # MCP server
|
|
417
|
+
│ │ ├── server.ts # MCP server implementation (registers 5 tools)
|
|
418
|
+
│ │ ├── main.ts # MCP entry
|
|
300
419
|
│ │ └── tools/
|
|
301
|
-
│ │
|
|
302
|
-
│
|
|
303
|
-
│ ├──
|
|
304
|
-
│ ├──
|
|
305
|
-
│
|
|
306
|
-
├──
|
|
307
|
-
│
|
|
308
|
-
│
|
|
309
|
-
│
|
|
310
|
-
│
|
|
311
|
-
│
|
|
312
|
-
|
|
420
|
+
│ │ ├── index.ts # Tool registry
|
|
421
|
+
│ │ ├── shared.ts # Shared tool logic
|
|
422
|
+
│ │ ├── codebaseRetrieval.ts # Code retrieval tool
|
|
423
|
+
│ │ ├── listFiles.ts # File structure browsing (v1.5.0+)
|
|
424
|
+
│ │ ├── findReferences.ts # Symbol reference lookup (v1.5.0+)
|
|
425
|
+
│ │ ├── getSymbolDefinition.ts # Symbol definition lookup (v1.5.0+)
|
|
426
|
+
│ │ └── stats.ts # Statistics tool (v1.5.0+)
|
|
427
|
+
│ └── utils/ # Utilities
|
|
428
|
+
│ ├── logger.ts # Logging system
|
|
429
|
+
│ ├── encoding.ts # Encoding detection
|
|
430
|
+
│ └── lock.ts # File lock
|
|
431
|
+
├── tests/ # Unit + integration tests (28 test files, 156 test cases)
|
|
432
|
+
│ ├── chunking/ # SourceAdapter / chunking
|
|
433
|
+
│ ├── cli/ # mirrorCommands
|
|
434
|
+
│ ├── db/ # migration, outbox, advisory lock, index-version
|
|
435
|
+
│ ├── indexer/ # transaction compensation, GC, aborted guard
|
|
436
|
+
│ ├── integration/ # real LanceDB end-to-end
|
|
437
|
+
│ ├── mcp/ # list-files / find-references / get-symbol-definition / shared / tool registry
|
|
438
|
+
│ ├── scanner/ # watcher / index-version
|
|
439
|
+
│ ├── search/ # FTS, ChunkContentLoader, Packer, cache, loadConfig
|
|
440
|
+
│ ├── stats/ # statistics aggregation
|
|
441
|
+
│ └── vectorStore/ # chunk_id de-duplication, sampling validation
|
|
313
442
|
├── package.json
|
|
314
443
|
└── tsconfig.json
|
|
315
444
|
```
|
|
316
445
|
|
|
317
|
-
## ⚙️
|
|
446
|
+
## ⚙️ Configuration Reference
|
|
447
|
+
|
|
448
|
+
### Environment Variables
|
|
449
|
+
|
|
450
|
+
| Variable | Required | Default | Description |
|
|
451
|
+
|----------|----------|---------|-------------|
|
|
452
|
+
| `EMBEDDINGS_API_KEY` | ✅ | - | Embedding API key |
|
|
453
|
+
| `EMBEDDINGS_BASE_URL` | ✅ | - | Embedding API URL |
|
|
454
|
+
| `EMBEDDINGS_MODEL` | ✅ | - | Embedding model name |
|
|
455
|
+
| `EMBEDDINGS_MAX_CONCURRENCY` | ❌ | 10 | Embedding concurrency |
|
|
456
|
+
| `EMBEDDINGS_DIMENSIONS` | ❌ | 1024 | Vector dimensions |
|
|
457
|
+
| `RERANK_API_KEY` | ✅ | - | Reranker API key |
|
|
458
|
+
| `RERANK_BASE_URL` | ✅ | - | Reranker API URL |
|
|
459
|
+
| `RERANK_MODEL` | ✅ | - | Reranker model name |
|
|
460
|
+
| `RERANK_TOP_N` | ❌ | 20 | Rerank return count |
|
|
461
|
+
| `IGNORE_PATTERNS` | ❌ | - | Extra ignore patterns |
|
|
318
462
|
|
|
319
|
-
###
|
|
463
|
+
### Search Parameter Env Overrides (v1.5.0+)
|
|
320
464
|
|
|
321
|
-
|
|
322
|
-
|--------|------|--------|------|
|
|
323
|
-
| `EMBEDDINGS_API_KEY` | ✅ | - | Embedding API key |
|
|
324
|
-
| `EMBEDDINGS_BASE_URL` | ✅ | - | Embedding API URL |
|
|
325
|
-
| `EMBEDDINGS_MODEL` | ✅ | - | Embedding model name |
|
|
326
|
-
| `EMBEDDINGS_MAX_CONCURRENCY` | ❌ | 10 | Embedding concurrency |
|
|
327
|
-
| `EMBEDDINGS_DIMENSIONS` | ❌ | 1024 | Vector dimensions |
|
|
328
|
-
| `RERANK_API_KEY` | ✅ | - | Reranker API key |
|
|
329
|
-
| `RERANK_BASE_URL` | ✅ | - | Reranker API URL |
|
|
330
|
-
| `RERANK_MODEL` | ✅ | - | Reranker model name |
|
|
331
|
-
| `RERANK_TOP_N` | ❌ | 20 | Rerank return count |
|
|
332
|
-
| `IGNORE_PATTERNS` | ❌ | - | Extra ignore patterns |
|
|
465
|
+
The following environment variables override built-in defaults; out-of-range values are automatically clamped to the valid interval. When only one of `wVec`/`wLex` is set, the other is automatically set to `1 - x`.
|
|
333
466
|
|
|
334
|
-
|
|
467
|
+
| Variable | Default | Bounds | Description |
|
|
468
|
+
|----------|---------|--------|-------------|
|
|
469
|
+
| `CW_SEARCH_WVEC` | 0.6 | 0–1 | Vector weight (fusion stage) |
|
|
470
|
+
| `CW_SEARCH_WLEX` | 0.4 | 0–1 | Lexical weight (complements `wVec`) |
|
|
471
|
+
| `CW_SEARCH_RERANK_TOP_N` | 10 | 5–20 | Results kept after rerank |
|
|
472
|
+
| `CW_SEARCH_MAX_TOTAL_CHARS` | 48000 | 20000–80000 | Token budget (in chars, ~12k tokens) |
|
|
473
|
+
| `CW_SEARCH_VECTOR_TOP_K` | 80 | 40–200 | Vector recall candidates |
|
|
474
|
+
| `CW_SEARCH_SMART_MAX_K` | 8 | 5–15 | Smart TopK hard upper bound |
|
|
475
|
+
| `CW_SEARCH_IMPORT_FILES_PER_SEED` | 3 | 0–5 | E3 import files resolved per seed (0 disables cross-file expansion) |
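
The clamping and weight-complement rules can be sketched as follows. This is an illustration of the documented behavior, not the package's `loadConfig.ts`: the helper names are hypothetical, and two details are assumptions, namely that unparseable values fall back to the default and that when both weights are set each is clamped independently without renormalization.

```typescript
// Hypothetical sketch of env-var overrides with clamping; helper names are illustrative.
type Env = Record<string, string | undefined>;

function parseNumber(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw);
  return Number.isFinite(value) ? value : undefined; // assumed: invalid input means "unset"
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

function readBounded(env: Env, name: string, fallback: number, min: number, max: number): number {
  return clamp(parseNumber(env[name]) ?? fallback, min, max);
}

function resolveWeights(env: Env): { wVec: number; wLex: number } {
  const vec = parseNumber(env["CW_SEARCH_WVEC"]);
  const lex = parseNumber(env["CW_SEARCH_WLEX"]);
  if (vec !== undefined && lex === undefined) {
    const wVec = clamp(vec, 0, 1);
    return { wVec, wLex: 1 - wVec }; // only wVec set: wLex becomes its complement
  }
  if (lex !== undefined && vec === undefined) {
    const wLex = clamp(lex, 0, 1);
    return { wVec: 1 - wLex, wLex }; // only wLex set: wVec becomes its complement
  }
  // Neither or both set: use defaults or the clamped values as given (assumed).
  return { wVec: clamp(vec ?? 0.6, 0, 1), wLex: clamp(lex ?? 0.4, 0, 1) };
}

const sampleEnv: Env = { CW_SEARCH_WVEC: "0.75", CW_SEARCH_VECTOR_TOP_K: "500" };
const weights = resolveWeights(sampleEnv);
const vectorTopK = readBounded(sampleEnv, "CW_SEARCH_VECTOR_TOP_K", 80, 40, 200);
console.log(weights, vectorTopK);
```

Here `CW_SEARCH_VECTOR_TOP_K=500` is clamped to the upper bound of 200, and setting only `CW_SEARCH_WVEC=0.75` yields a lexical weight of 0.25.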
|
|
476
|
+
|
|
477
|
+
### Search Config Parameters (built-in defaults)
|
|
335
478
|
|
|
336
479
|
```typescript
|
|
337
480
|
interface SearchConfig {
|
|
338
|
-
// ===
|
|
339
|
-
vectorTopK: number; //
|
|
340
|
-
vectorTopM: number; //
|
|
341
|
-
ftsTopKFiles: number; // FTS
|
|
342
|
-
lexChunksPerFile: number; //
|
|
343
|
-
lexTotalChunks: number; //
|
|
344
|
-
|
|
345
|
-
// ===
|
|
346
|
-
rrfK0: number; // RRF
|
|
347
|
-
wVec: number; //
|
|
348
|
-
wLex: number; //
|
|
349
|
-
fusedTopM: number; //
|
|
481
|
+
// === Recall ===
|
|
482
|
+
vectorTopK: number; // Vector recall candidates (default 80)
|
|
483
|
+
vectorTopM: number; // Vectors kept after dedup (default 60)
|
|
484
|
+
ftsTopKFiles: number; // FTS recall file count (default 20)
|
|
485
|
+
lexChunksPerFile: number; // Lexical chunks per file (default 2)
|
|
486
|
+
lexTotalChunks: number; // Total lexical chunks (default 40)
|
|
487
|
+
|
|
488
|
+
// === Fusion ===
|
|
489
|
+
rrfK0: number; // RRF smoothing constant (default 20)
|
|
490
|
+
wVec: number; // Vector weight (default 0.6)
|
|
491
|
+
wLex: number; // Lexical weight (default 0.4)
|
|
492
|
+
fusedTopM: number; // Candidates fed into rerank after fusion (default 60)
|
|
350
493
|
|
|
351
494
|
// === Rerank ===
|
|
352
|
-
rerankTopN: number; //
|
|
353
|
-
maxRerankChars: number; //
|
|
495
|
+
rerankTopN: number; // Results kept after rerank (default 10)
|
|
496
|
+
maxRerankChars: number; // Max chars per chunk sent to reranker (default 1000)
|
|
497
|
+
maxBreadcrumbChars: number;// Max chars for breadcrumb context (default 250)
|
|
498
|
+
headRatio: number; // Head/tail ratio when truncating (default 0.67)
|
|
499
|
+
|
|
500
|
+
// === Expansion ===
|
|
501
|
+
neighborHops: number; // E1 neighbor hops (default 2)
|
|
502
|
+
breadcrumbExpandLimit: number; // E2 breadcrumb completions (default 3)
|
|
503
|
+
importFilesPerSeed: number; // E3 import files per seed (default 3)
|
|
504
|
+
chunksPerImportFile: number; // E3 chunks per import file (default 3)
|
|
354
505
|
|
|
355
|
-
// ===
|
|
356
|
-
|
|
357
|
-
|
|
358
|
-
importFilesPerSeed: number; // E3 import files per seed (default 0)
|
|
359
|
-
chunksPerImportFile: number; // E3 chunks per import file (default 0)
|
|
506
|
+
// === ContextPacker ===
|
|
507
|
+
maxSegmentsPerFile: number; // Max non-contiguous segments per file (default 3)
|
|
508
|
+
maxTotalChars: number; // Token budget (chars, default 48000)
|
|
360
509
|
|
|
361
510
|
// === Smart TopK ===
|
|
362
|
-
enableSmartTopK: boolean;
|
|
363
|
-
smartTopScoreRatio: number; //
|
|
364
|
-
|
|
365
|
-
|
|
366
|
-
|
|
511
|
+
enableSmartTopK: boolean; // Enable smart cutoff (default true)
|
|
512
|
+
smartTopScoreRatio: number; // Dynamic threshold ratio (default 0.5)
|
|
513
|
+
smartTopScoreDeltaAbs: number; // Max absolute drop from Top1 (default 0.25)
|
|
514
|
+
smartMinScore: number; // Absolute floor (default 0.25)
|
|
515
|
+
smartMinK: number; // Safe Harbor count (default 2)
|
|
516
|
+
smartMaxK: number; // Hard upper bound (default 8)
|
|
367
517
|
}
|
|
368
518
|
```
|
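The fusion fields of `SearchConfig` (`rrfK0`, `wVec`, `wLex`, `fusedTopM`) suggest a weighted Reciprocal Rank Fusion step. The following is a minimal sketch under that assumption, using the textbook formula `weight / (k0 + rank)` with ranks starting at 1. The function name `fuseRrf` and the `FusionParams` type are hypothetical and are not taken from the ContextWeaver source.

```typescript
// Hypothetical sketch of weighted Reciprocal Rank Fusion (RRF).
// Assumption: each channel contributes weight / (rrfK0 + rank), rank starting at 1.
interface FusionParams {
  rrfK0: number;     // smoothing constant
  wVec: number;      // weight of the vector channel
  wLex: number;      // weight of the lexical channel
  fusedTopM: number; // number of fused candidates to keep
}

interface FusedResult {
  id: string;
  score: number;
}

function fuseRrf(vectorIds: string[], lexicalIds: string[], p: FusionParams): FusedResult[] {
  const scores: Record<string, number> = {};

  // Add one channel's ranked list into the shared score table.
  const addChannel = (ids: string[], weight: number): void => {
    ids.forEach((id, index) => {
      const contribution = weight / (p.rrfK0 + index + 1);
      scores[id] = (scores[id] ?? 0) + contribution;
    });
  };

  addChannel(vectorIds, p.wVec);
  addChannel(lexicalIds, p.wLex);

  // Highest fused score first, truncated to fusedTopM candidates.
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, p.fusedTopM);
}

// Chunk "b" appears in both channels, so it outranks "a" despite a lower vector rank.
const fused = fuseRrf(["a", "b", "c"], ["b", "d"], { rrfK0: 20, wVec: 0.6, wLex: 0.4, fusedTopM: 60 });
console.log(fused.map((r) => r.id));
```

A larger `rrfK0` flattens the difference between adjacent ranks, so agreement between the two channels matters more than a top position in a single channel.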
|
369
519
|
|
|
370
|
-
## 🌍
|
|
520
|
+
## 🌍 Multi-Language Support
|
|
371
521
|
|
|
372
|
-
ContextWeaver
|
|
522
|
+
ContextWeaver natively supports AST parsing for the following languages via Tree-sitter:
|
|
373
523
|
|
|
374
|
-
|
|
|
375
|
-
|
|
524
|
+
| Language | AST Parsing | Import Resolution | Extensions |
|
|
525
|
+
|----------|-------------|-------------------|------------|
|
|
376
526
|
| TypeScript | ✅ | ✅ | `.ts`, `.tsx` |
|
|
377
527
|
| JavaScript | ✅ | ✅ | `.js`, `.jsx`, `.mjs`, `.cjs` |
|
|
378
528
|
| Python | ✅ | ✅ | `.py` |
|
|
@@ -383,109 +533,135 @@ ContextWeaver natively supports AST parsing for the following languages via Tree-sitter:
|
|
|
383
533
|
| C++ | ✅ | ✅ | `.cpp`, `.cc`, `.cxx`, `.hpp` |
|
|
384
534
|
| C# | ✅ | ✅ | `.cs` |
|
|
385
535
|
|
|
386
|
-
|
|
536
|
+
Other languages fall back to line-based chunking and can still be indexed and searched normally.
|
|
387
537
|
|
|
388
|
-
## 🔄
|
|
538
|
+
## 🔄 Workflows
|
|
389
539
|
|
|
390
|
-
###
|
|
540
|
+
### Indexing Flow
|
|
391
541
|
|
|
392
542
|
```
|
|
393
|
-
0. Bootstrap → pending_marks
|
|
394
|
-
1. Crawler →
|
|
395
|
-
2. Processor →
|
|
396
|
-
3. Splitter → AST
|
|
397
|
-
4. Indexer →
|
|
398
|
-
5.
|
|
399
|
-
├─ LanceDB
|
|
400
|
-
├─ FTS + outbox
|
|
401
|
-
└─ SQLite mark +
|
|
402
|
-
6.
|
|
543
|
+
0. Bootstrap → pending_marks replay + LanceDB schema migration (first launch)
|
|
544
|
+
1. Crawler → traverse the filesystem, filter ignored items
|
|
545
|
+
2. Processor → read file content, compute hash
|
|
546
|
+
3. Splitter → AST parse, semantic chunking (offsets normalized to UTF-16 char domain)
|
|
547
|
+
4. Indexer → batch embedding
|
|
548
|
+
5. Stages 4-6 pseudo-transaction:
|
|
549
|
+
├─ LanceDB write (pre-delete (path, hash) to avoid duplicates → add → clear old versions)
|
|
550
|
+
├─ FTS + outbox single SQLite transaction (rolls back LanceDB on failure)
|
|
551
|
+
└─ SQLite mark + clear outbox single transaction (outbox kept on failure, replayed next launch)
|
|
552
|
+
6. Trailing GC → clean up LanceDB orphan chunks (time budget 5s)
|
|
403
553
|
```
|
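Step 3 of the indexing flow normalizes chunk offsets to the UTF-16 character domain. The sketch below illustrates why that matters: a UTF-8 byte offset and a JavaScript string index diverge as soon as the file contains non-ASCII text. The helper name `utf8ByteOffsetToUtf16Index` is hypothetical, and the code assumes the byte offset falls on a character boundary.

```typescript
// Hypothetical helper: map a UTF-8 byte offset to a UTF-16 code-unit index.
// Assumption: byteOffset falls on a character boundary of the text.
function utf8ByteOffsetToUtf16Index(text: string, byteOffset: number): number {
  let bytes = 0;
  let index = 0;
  while (index < text.length && bytes < byteOffset) {
    const unit = text.charCodeAt(index);
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff;
    if (isHighSurrogate && index + 1 < text.length) {
      // A surrogate pair is one code point: 4 bytes in UTF-8, 2 units in UTF-16.
      bytes += 4;
      index += 2;
    } else {
      // Code points in the Basic Multilingual Plane take 1 to 3 bytes in UTF-8.
      bytes += unit <= 0x7f ? 1 : unit <= 0x7ff ? 2 : 3;
      index += 1;
    }
  }
  return index;
}

// The comment line occupies 10 bytes in UTF-8 but only 6 UTF-16 code units.
const source = "// 注释\nfoo()";
const charIndex = utf8ByteOffsetToUtf16Index(source, 10);
console.log(source.slice(charIndex));
```

Slicing `source` with the raw byte offset instead would start past the intended position, which is the kind of mismatch a legacy byte-domain index would produce.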
|
404
554
|
|
|
405
|
-
###
|
|
555
|
+
### Search Flow
|
|
406
556
|
|
|
407
557
|
```
|
|
408
|
-
1. Query Parse →
|
|
409
|
-
2.
|
|
410
|
-
3.
|
|
411
|
-
4.
|
|
412
|
-
5.
|
|
413
|
-
6.
|
|
414
|
-
7.
|
|
415
|
-
8.
|
|
558
|
+
1. Query Parse → parse the query, separate semantics from terms
|
|
559
|
+
2. Cache Lookup → return immediately on hit (v1.5.0+, key includes index version + config fingerprint)
|
|
560
|
+
3. Hybrid Recall → dual-channel vector + lexical recall
|
|
561
|
+
4. RRF Fusion → Reciprocal Rank Fusion
|
|
562
|
+
5. Rerank → cross-encoder reranking
|
|
563
|
+
6. Smart Cutoff → intelligent score cutoff
|
|
564
|
+
7. Graph Expand → neighbor/breadcrumb/import expansion
|
|
565
|
+
8. Context Pack → segment merging, token budget
|
|
566
|
+
9. Cache Store → write to cache (v1.5.0+)
|
|
567
|
+
10. Format Output → format and return to the LLM
|
|
416
568
|
```
|
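Step 6 (Smart Cutoff) is driven by the Smart TopK fields of `SearchConfig`. The README does not spell out the exact rule, so the sketch below is one plausible reading only: always keep the first `smartMinK` results, never exceed `smartMaxK`, and in between stop at the first score that falls below the strictest of the three thresholds. The function name `smartCutoff` and the way the thresholds are combined are assumptions.

```typescript
// Hypothetical interpretation of the Smart TopK cutoff; not the actual implementation.
interface SmartTopKParams {
  smartTopScoreRatio: number;    // threshold relative to the top score
  smartTopScoreDeltaAbs: number; // max absolute drop from the top score
  smartMinScore: number;         // absolute floor
  smartMinK: number;             // results always kept ("safe harbor")
  smartMaxK: number;             // hard upper bound
}

// Returns how many of the (descending) reranked scores to keep.
function smartCutoff(sortedScores: number[], p: SmartTopKParams): number {
  if (sortedScores.length === 0) return 0;
  const top = sortedScores[0];
  // Assumption: a result must pass all three thresholds, so the strictest one wins.
  const threshold = Math.max(
    p.smartMinScore,
    top * p.smartTopScoreRatio,
    top - p.smartTopScoreDeltaAbs,
  );
  let keep = 0;
  for (const score of sortedScores) {
    if (keep >= p.smartMaxK) break;
    if (keep >= p.smartMinK && score < threshold) break;
    keep += 1;
  }
  return keep;
}

const defaults: SmartTopKParams = {
  smartTopScoreRatio: 0.5,
  smartTopScoreDeltaAbs: 0.25,
  smartMinScore: 0.25,
  smartMinK: 2,
  smartMaxK: 8,
};
console.log(smartCutoff([0.9, 0.8, 0.7, 0.6, 0.3], defaults));
```

With the documented defaults and a top score of 0.9, the absolute-drop threshold is the strictest of the three, so the fourth result (0.6) is where this sketch stops.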
|
417
569
|
|
|
418
|
-
## 📊
|
|
570
|
+
## 📊 Performance Characteristics
|
|
571
|
+
|
|
572
|
+
- **Query cache**: repeated queries hit the LRU cache, skipping the entire recall/rerank/expansion pipeline (v1.5.0+)
|
|
573
|
+
- **Incremental indexing**: only changed files are processed; re-indexing is 10x+ faster
|
|
574
|
+
- **Batch embedding**: adaptive batch size with concurrency control
|
|
575
|
+
- **Rate-limit recovery**: automatic backoff on 429 errors, gradual recovery
|
|
576
|
+
- **Parser pool reuse**: Tree-sitter parsers are pooled and reused
|
|
577
|
+
- **File index caching**: lazy-loaded file-path index in GraphExpander
|
|
578
|
+
- **Zero-cost metadata tools**: `list-files`/`find-references`/`get-symbol-definition` do not call the Embedding API (v1.5.0+)
|
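The "adaptive batch size" and "automatic backoff on 429 errors, gradual recovery" items above describe behavior rather than a mechanism. One common way to get that behavior is additive-increase/multiplicative-decrease, sketched here. The class name `AdaptiveBatchSize` and the specific halve/increment policy are assumptions, not ContextWeaver's documented algorithm.

```typescript
// Hypothetical AIMD-style controller for embedding batch size.
// Assumption: halve on HTTP 429, grow by one after each successful batch.
class AdaptiveBatchSize {
  private current: number;

  constructor(private readonly min: number, private readonly max: number) {
    this.current = max;
  }

  getSize(): number {
    return this.current;
  }

  // Back off sharply when the API signals rate limiting.
  onRateLimited(): void {
    this.current = Math.max(this.min, Math.floor(this.current / 2));
  }

  // Recover gradually while requests keep succeeding.
  onSuccess(): void {
    this.current = Math.min(this.max, this.current + 1);
  }
}

const batch = new AdaptiveBatchSize(1, 16);
batch.onRateLimited(); // 16 -> 8
batch.onRateLimited(); // 8 -> 4
batch.onSuccess();     // 4 -> 5
console.log(batch.getSize());
```

The asymmetry is the point of the design: a single 429 cuts load quickly, while recovery takes several successful batches, which avoids oscillating around the provider's limit.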
|
419
579
|
|
|
420
|
-
|
|
421
|
-
- **Batch embedding**: adaptive batch size with concurrency control
|
|
422
|
-
- **Rate-limit recovery**: automatic backoff on 429 errors, gradual recovery
|
|
423
|
-
- **Connection pool reuse**: pooled Tree-sitter parsers
|
|
424
|
-
- **File index caching**: lazy-loaded file-path index in GraphExpander
|
|
580
|
+
## 📈 Statistics & Observability (v1.5.0+)
|
|
425
581
|
|
|
426
|
-
|
|
582
|
+
`contextweaver stats` outputs three sections:
|
|
427
583
|
|
|
428
|
-
|
|
584
|
+
- **Indexing process**: cumulative index run count, last index time, last-run snapshot (added/modified/deleted/unchanged/skipped/errors + vector index details)
|
|
585
|
+
- **Search quality/behavior**: cumulative queries, cache hit rate, actual compute runs, plus average per-stage latency (retrieve / rerank / expand / pack) and average recalled seed count
|
|
586
|
+
- **Health/consistency**: file count and total content size, LanceDB vector row count, embedding dimensions, index version, migration state, `pending_marks`, language breakdown
|
|
429
587
|
|
|
430
|
-
|
|
588
|
+
When an abnormal migration state, `pending_marks` backlog, or missing vector rows are detected, the report appends **diagnostic warnings** with the corresponding fix commands. The `--json` output maps to `StatsReport` for scripts and monitoring systems.
|
|
589
|
+
|
|
590
|
+
## 🐛 Logging & Debugging
|
|
591
|
+
|
|
592
|
+
Log file location: `~/.contextweaver/logs/app.YYYY-MM-DD.log`
|
|
593
|
+
|
|
594
|
+
Set the log level:
|
|
431
595
|
|
|
432
596
|
```bash
|
|
433
|
-
#
|
|
597
|
+
# Enable debug logging
|
|
434
598
|
LOG_LEVEL=debug contextweaver search --information-request "..."
|
|
435
599
|
```
|
|
436
600
|
|
|
437
|
-
## 🚨
|
|
601
|
+
## 🚨 Troubleshooting (v1.4.0+)
|
|
438
602
|
|
|
439
|
-
### LanceDB
|
|
603
|
+
### LanceDB Migration Stuck (`aborted` state)
|
|
440
604
|
|
|
441
|
-
|
|
605
|
+
**Symptom**: `contextweaver index` errors with "LanceDB is in the aborted state, refusing to write to prevent schema pollution."
|
|
442
606
|
|
|
443
|
-
|
|
607
|
+
**Cause**: during the v1.4.0 upgrade, the old LanceDB index's `display_code` differs from the current `files.content` by >1% on sampling (typically on legacy indexes whose chunk offsets used the UTF-8 byte domain).
|
|
444
608
|
|
|
445
|
-
|
|
609
|
+
**Fix**:
|
|
446
610
|
```bash
|
|
447
|
-
contextweaver migrate --reset #
|
|
448
|
-
contextweaver index #
|
|
611
|
+
contextweaver migrate --reset # Clear the LanceDB chunks table + reset state to done
|
|
612
|
+
contextweaver index # Full rebuild (new schema)
|
|
449
613
|
```
|
|
450
614
|
|
|
451
|
-
|
|
615
|
+
You can also run `contextweaver stats` first to view diagnostic warnings and confirm the current migration state and `pending_marks` backlog.
|
|
452
616
|
|
|
453
|
-
|
|
617
|
+
### Cross-Process Migration Race
|
|
454
618
|
|
|
455
|
-
|
|
619
|
+
If the MCP server is long-running and another terminal runs `contextweaver index`, the two processes contend for migration. v1.4.0 introduces an advisory lock with a 10-minute zombie threshold, automatically letting one process skip migration while the other completes it.
|
|
620
|
+
|
|
621
|
+
If the lock gets stuck (after `kill -9`), clear it manually:
|
|
456
622
|
```bash
|
|
457
623
|
sqlite3 ~/.contextweaver/<projectId>/index.db \
|
|
458
624
|
"DELETE FROM metadata WHERE key = 'lancedb_migration_lock';"
|
|
459
625
|
```
|
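The 10-minute zombie threshold described in this section can be read as a simple staleness check on the lock record. The sketch below shows that decision in isolation. The `MigrationLock` shape, the `decideLock` name, and the three outcomes are hypothetical; the real lock is stored in the `metadata` table under `lancedb_migration_lock`, and its actual value format is not documented here.

```typescript
// Hypothetical decision logic for an advisory lock with a zombie threshold.
interface MigrationLock {
  holderPid: number;    // assumed field: process that took the lock
  acquiredAtMs: number; // assumed field: acquisition time in epoch milliseconds
}

type LockDecision = "acquire" | "skip" | "takeover";

const ZOMBIE_THRESHOLD_MS = 10 * 60 * 1000; // 10 minutes, per the text above

function decideLock(existing: MigrationLock | null, nowMs: number): LockDecision {
  // No lock row: this process may run the migration.
  if (existing === null) return "acquire";
  // A lock older than the threshold is treated as left behind by a dead process.
  if (nowMs - existing.acquiredAtMs > ZOMBIE_THRESHOLD_MS) return "takeover";
  // A fresh lock means another process is migrating; skip and let it finish.
  return "skip";
}

const lock: MigrationLock = { holderPid: 4242, acquiredAtMs: 0 };
console.log(decideLock(lock, 5 * 60 * 1000));  // lock is 5 minutes old
console.log(decideLock(lock, 11 * 60 * 1000)); // lock is 11 minutes old
```

Under this reading, the manual `DELETE` shown above is only needed when you do not want to wait for the threshold to expire.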
|
460
626
|
|
|
461
|
-
###
|
|
627
|
+
### Wasted Duplicate Embeddings
|
|
628
|
+
|
|
629
|
+
v1.4.0 solves this via the `pending_marks` outbox: when an FTS write succeeds but the `vector_index_hash` mark fails, the pending mark is replayed automatically on the next launch, so the affected files are not embedded a second time.
|
|
630
|
+
|
|
631
|
+
### Search Results Don't Reflect Recent Changes
|
|
462
632
|
|
|
463
|
-
|
|
633
|
+
Confirm that incremental indexing has run (or enable `contextweaver watch` to index changes automatically). The query cache key is bound to the index version, so old cache entries are invalidated automatically after an index update; no manual clearing is needed.
|
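Binding the cache key to the index version is what makes invalidation automatic: after a re-index, lookups use a new key and old entries simply age out. The following is a minimal sketch of that idea with a small LRU built on `Map` insertion order. `LruCache` and `cacheKey` are hypothetical names, and the key layout is an assumption rather than the format used by `QueryCache`.

```typescript
// Hypothetical LRU cache keyed by index version + config fingerprint + query.
class LruCache<V> {
  private readonly entries = new Map<string, V>();

  constructor(private readonly capacity: number) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

// Assumed key layout: any change to the index version or config yields a new key.
function cacheKey(indexVersion: number, configFingerprint: string, query: string): string {
  return JSON.stringify([indexVersion, configFingerprint, query]);
}

const queryCache = new LruCache<string>(2);
queryCache.set(cacheKey(1, "cfg-a", "where is auth handled?"), "packed context v1");

// After a re-index bumps the version, the same query misses and is recomputed.
console.log(queryCache.get(cacheKey(2, "cfg-a", "where is auth handled?")));
```

Because stale entries are never looked up again, they are evicted by normal LRU pressure instead of requiring an explicit purge.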
|
464
634
|
|
|
465
|
-
## 📜
|
|
635
|
+
## 📜 Version History
|
|
466
636
|
|
|
467
|
-
- **v1.
|
|
468
|
-
-
|
|
469
|
-
-
|
|
470
|
-
-
|
|
471
|
-
-
|
|
472
|
-
-
|
|
473
|
-
-
|
|
474
|
-
- **v1.
|
|
475
|
-
-
|
|
476
|
-
-
|
|
477
|
-
-
|
|
637
|
+
- **v1.5.0** (2026-06): query cache, file watching, statistics, and multi-granularity MCP tools
|
|
638
|
+
- Added `QueryCache` (per-project LRU); the cache key includes the index version + search-config fingerprint and invalidates automatically
|
|
639
|
+
- Added `contextweaver watch` for file watching + debounced incremental indexing
|
|
640
|
+
- Added the `contextweaver stats` CLI (`--json`) and MCP `stats` tool: three metric groups + consistency diagnostics
|
|
641
|
+
- Added 3 MCP tools: `list-files` / `find-references` / `get-symbol-definition`, plus their CLI mirror commands
|
|
642
|
+
- Added `CW_SEARCH_*` environment variables to override search parameters (with bounds clamping)
|
|
643
|
+
- 28 test files / 156 test cases
|
|
644
|
+
- **v1.4.0** (2026-05): data architecture and cross-store consistency overhaul
|
|
645
|
+
- LanceDB chunks table drops `display_code/vector_text`; content is read back from `files.content`
|
|
646
|
+
- SemanticSplitter offsets unified to the UTF-16 character domain
|
|
647
|
+
- schema_version 2 → 3; added the `pending_marks` outbox + tri-state migration state machine
|
|
648
|
+
- Added the `contextweaver migrate` CLI
|
|
649
|
+
- Cross-process advisory lock prevents migration races
|
|
650
|
+
- **v1.3.x**: cross-store write transactionality, trailing auto-GC after scan, files_fts external-content table
|
|
651
|
+
- **v1.2.x**: search pipeline optimization, indexing memory optimization
|
|
652
|
+
- **v1.1.x**: Smart TopK cutoff, Smart Cutoff
|
|
653
|
+
- **v1.0.x**: initial release
|
|
478
654
|
|
|
479
|
-
## 📄
|
|
655
|
+
## 📄 License
|
|
480
656
|
|
|
481
|
-
|
|
657
|
+
This project is licensed under the MIT License.
|
|
482
658
|
|
|
483
|
-
## 🙏
|
|
659
|
+
## 🙏 Acknowledgements
|
|
484
660
|
|
|
485
|
-
- [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) -
|
|
486
|
-
- [LanceDB](https://lancedb.com/) -
|
|
661
|
+
- [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) - high-performance syntax parsing
|
|
662
|
+
- [LanceDB](https://lancedb.com/) - embedded vector database
|
|
487
663
|
- [MCP](https://modelcontextprotocol.io/) - Model Context Protocol
|
|
488
|
-
- [SiliconFlow](https://siliconflow.cn/) -
|
|
664
|
+
- [SiliconFlow](https://siliconflow.cn/) - recommended Embedding/Reranker API service
|
|
489
665
|
|
|
490
666
|
---
|
|
491
667
|
|