@chiway/contextweaver 1.4.0 → 1.5.0
- package/README.md +138 -28
- package/dist/{SearchService-OS7CYHNJ.js → SearchService-WVD6THR3.js} +116 -74
- package/dist/{chunk-ZOMGPIU6.js → chunk-3BNHQV5W.js} +1 -5
- package/dist/chunk-BFCIZ52F.js +102 -0
- package/dist/{chunk-X7PAYQMT.js → chunk-GDVB6PJ4.js} +21 -3
- package/dist/{lock-FL54LIQL.js → chunk-HHYPQA3X.js} +1 -1
- package/dist/chunk-ISVCQFB4.js +223 -0
- package/dist/chunk-IZ6IUHNN.js +77 -0
- package/dist/chunk-LB42CZEB.js +18 -0
- package/dist/{chunk-RGJSXUFS.js → chunk-PPLFJGO3.js} +60 -0
- package/dist/chunk-R6CNZXZ7.js +143 -0
- package/dist/chunk-TPM6YP43.js +38 -0
- package/dist/{chunk-EMSMLPMK.js → chunk-V3K4YVAR.js} +10 -117
- package/dist/chunk-VWBKZ6QL.js +115 -0
- package/dist/chunk-XFIM2T6S.js +57 -0
- package/dist/{chunk-AB24E3Z7.js → chunk-XMZZZKG7.js} +23 -79
- package/dist/chunk-XTWNT7KP.js +156 -0
- package/dist/chunk-Y6H7C3NA.js +85 -0
- package/dist/{codebaseRetrieval-3Z4CRA7X.js → codebaseRetrieval-DIS5RH2C.js} +5 -2
- package/dist/{db-PMVM7557.js → db-GBCLP4GG.js} +15 -1
- package/dist/findReferences-N7ML7TUP.js +16 -0
- package/dist/getSymbolDefinition-6KMY4H33.js +17 -0
- package/dist/index.js +244 -41
- package/dist/listFiles-4VT2TPJD.js +14 -0
- package/dist/loadConfig-XTVT2OWW.js +9 -0
- package/dist/lock-HNKQ6X5B.js +8 -0
- package/dist/scanner-QDFZJLP7.js +13 -0
- package/dist/server-UAI3U7AB.js +347 -0
- package/dist/stats-AGKUCJQI.js +12 -0
- package/dist/{vectorStore-HPQZOVWF.js → vectorStore-4ODCERRO.js} +1 -1
- package/package.json +9 -23
- package/dist/scanner-2XGJWYHR.js +0 -11
- package/dist/server-XK6EINRV.js +0 -146
package/README.md CHANGED
@@ -24,10 +24,11 @@
 - **RRF fusion (Reciprocal Rank Fusion)**: intelligently merges results from multiple recall paths
 
 ### 🧠 AST semantic chunking
-- **Tree-sitter parsing**: supports TypeScript, JavaScript, Python, Go, Java, Rust
+- **Tree-sitter parsing**: supports TypeScript, JavaScript, Python, Go, Java, Rust, C, C++, C#, and other languages
 - **Dual-Text strategy**: `displayCode` is used for display, `vectorText` for embedding
 - **Gap-Aware merging**: handles gaps between code regions intelligently, preserving semantic integrity
 - **Breadcrumb injection**: the vector text includes the hierarchical path, improving retrieval recall
+- **UTF-16 character-domain normalization**: offsets are unified with `SourceAdapter.toCharOffset` before metadata is written, so slices of multi-byte text do not drift (v1.4.0+)
 
 ### 📊 Three-stage context expansion
 - **E1 neighbor expansion**: adjacent chunks before and after in the same file, keeping code blocks complete
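The UTF-16 normalization bullet in this hunk is about offset domains: a parser may report positions in UTF-8 bytes, while JavaScript strings are indexed in UTF-16 code units. The block below is a minimal sketch of that conversion, not the package's `SourceAdapter.toCharOffset`; the helper name `byteToCharOffset` and its signature are assumptions made for illustration.

```javascript
// Hypothetical helper (not the package's SourceAdapter.toCharOffset):
// converts a UTF-8 byte offset into a UTF-16 code-unit index for `text`.
function byteToCharOffset(text, byteOffset) {
  let bytes = 0; // UTF-8 bytes consumed so far
  let index = 0; // UTF-16 code units consumed so far
  while (index < text.length && bytes < byteOffset) {
    const codePoint = text.codePointAt(index);
    // UTF-8 width of this code point: 1 to 4 bytes.
    bytes += codePoint <= 0x7f ? 1 : codePoint <= 0x7ff ? 2 : codePoint <= 0xffff ? 3 : 4;
    // Code points above U+FFFF occupy two UTF-16 code units (a surrogate pair).
    index += codePoint > 0xffff ? 2 : 1;
  }
  return index;
}

const source = "const 名 = 1; // 😀 ok";
// Byte offset where "ok" starts, as a byte-oriented parser would report it.
const byteStart = Buffer.byteLength("const 名 = 1; // 😀 ", "utf8");
const charStart = byteToCharOffset(source, byteStart);
console.log(source.slice(charStart)); // prints "ok"
```

Using `byteStart` directly as a string index would overshoot here, because the CJK character and the emoji take more bytes than code units.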
@@ -44,6 +45,13 @@
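 - **Separation of intent and terms**: an LLM-friendly API design
 - **Automatic indexing**: the first query triggers indexing automatically; incremental updates are transparent
 
+### 🛡️ Crash-safe data architecture (v1.4.0+)
+- **Single source of text**: LanceDB stores only vectors and locator metadata; chunk text is read back from `files.content`, reducing index size by 30-50%
+- **Cross-store transaction compensation**: a three-phase write (LanceDB → FTS + outbox → SQLite mark); a failure in any phase is rolled back or replayed automatically
+- **Migration state machine**: the three states `pending/done/aborted` are persisted; crash recovery rebuilds automatically
+- **Cross-process mutual exclusion**: an advisory lock prevents the MCP server and the CLI from triggering a LanceDB migration concurrently
+- **chunk_id deduplication**: rows are pre-deleted before writing, so retries cannot produce duplicate rows
+
 ## 📦 Quick start
 
 ### Requirements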
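The three-phase write with compensation listed in this hunk can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the package's Indexer: the store shapes, the function names `createStores`, `indexFile` and `replayPendingMarks`, and the `failAt` hook are all hypothetical, and the "clear old versions" step mentioned in the README is omitted for brevity.

```javascript
// In-memory stand-ins (assumptions): the real stores are LanceDB and SQLite.
function createStores() {
  return {
    vectors: [],              // rows of { path, hash, chunkId }
    fts: new Map(),           // path -> chunk texts
    pendingMarks: new Map(),  // outbox: path -> hash awaiting the final mark
    fileMarks: new Map(),     // path -> vector_index_hash
  };
}

// `failAt` is a hypothetical hook that simulates a failure in one phase.
function indexFile(stores, file, failAt = null) {
  const sameVersion = (row) => row.path === file.path && row.hash === file.hash;

  // Phase 1: vector write. Pre-delete rows for (path, hash) so a retry cannot duplicate them.
  stores.vectors = stores.vectors.filter((row) => !sameVersion(row));
  for (const chunk of file.chunks) {
    stores.vectors.push({ path: file.path, hash: file.hash, chunkId: chunk.id });
  }

  // Phase 2: FTS + outbox as one unit. On failure, compensate by undoing phase 1.
  if (failAt === "fts") {
    stores.vectors = stores.vectors.filter((row) => !sameVersion(row));
    return "rolled-back";
  }
  stores.fts.set(file.path, file.chunks.map((chunk) => chunk.text));
  stores.pendingMarks.set(file.path, file.hash);

  // Phase 3: final mark + outbox cleanup. On failure the outbox entry survives for replay.
  if (failAt === "mark") return "pending";
  stores.fileMarks.set(file.path, file.hash);
  stores.pendingMarks.delete(file.path);
  return "done";
}

// Startup replay: apply marks that reached the outbox but were never finalized.
function replayPendingMarks(stores) {
  for (const [path, hash] of stores.pendingMarks) {
    stores.fileMarks.set(path, hash);
  }
  stores.pendingMarks.clear();
}

const stores = createStores();
const file = { path: "src/a.ts", hash: "h1", chunks: [{ id: "c1", text: "export const a = 1;" }] };
console.log(indexFile(stores, file, "mark"));   // prints "pending"
replayPendingMarks(stores);
console.log(stores.fileMarks.get("src/a.ts"));  // prints "h1"
```

The point of the outbox is visible in the last two lines: the expensive vector and FTS work is not repeated after a phase-3 failure, only the mark is replayed.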
@@ -55,10 +63,10 @@
 
 ```bash
 # Global install
-npm install -g @
+npm install -g @chiway/contextweaver
 
 # Or with pnpm
-pnpm add -g @
+pnpm add -g @chiway/contextweaver
 ```
 
 ### Initialize configuration
@@ -120,6 +128,20 @@ cw search --information-request "数据库连接逻辑" --technical-terms "Datab
 contextweaver mcp
 ```
 
+### Index management (v1.4.0+)
+
+```bash
+# Show the LanceDB migration state
+contextweaver migrate
+
+# Clear the aborted state: empty LanceDB and trigger a full rebuild
+# When to use: the Indexer refuses to write after a failed sample check; run this command, then run index again
+contextweaver migrate --reset
+
+# Specify the project path
+contextweaver migrate --path /path/to/project
+```
+
 ## 🔧 MCP integration
 
 ### Claude Desktop configuration
@@ -201,55 +223,93 @@ flowchart TB
 | **SearchService** | Hybrid search core; coordinates vector/lexical recall, RRF fusion, and reranking |
 | **GraphExpander** | Context expander; runs the three-stage E1/E2/E3 expansion strategy |
 | **ContextPacker** | Context packer; merges segments and enforces the token budget |
-| **
-| **
-| **
+| **ChunkContentLoader** | Batch-slices text from `files.content` by `(path, start_index, end_index)` (v1.4.0+) |
+| **VectorStore** | LanceDB adapter layer; exposes pure vector operations only |
+| **Database (SQLite)** | Metadata storage + FTS5 full-text index, schema_version=3 |
+| **Bootstrap** | Cross-store initialization coordinator: pending_marks replay + LanceDB schema migration (v1.4.0+) |
+| **SemanticSplitter** | AST semantic chunker based on Tree-sitter; offsets are normalized to the UTF-16 character domain on write |
+
+### Data architecture (v1.4.0+)
+
+```
+~/.contextweaver/<projectId>/
+├── index.db  # SQLite
+│   ├── files  # file metadata + full text (content column, the only source for text slices)
+│   ├── files_fts  # external-content table; inverted index pointing at files
+│   ├── chunks_fts  # chunk-level inverted index, replaced wholesale per file
+│   ├── metadata  # schema_version / lancedb_migration_state / lock
+│   └── pending_marks  # outbox: replayed at startup when the vector_index_hash mark failed
+└── vectors.lance/  # LanceDB chunks table (vectors + locator metadata only, no text)
+```
+
+**Key invariants**:
+- The single source of text is `files.content`; `ChunkContentLoader` slices with `start_index/end_index` (the same source as `displayCode`)
+- All LanceDB offset fields are in the UTF-16 character domain, so multi-byte files are not sliced incorrectly
+- Cross-store write order: LanceDB → (FTS + outbox in one transaction) → SQLite mark + outbox cleanup
+- The LanceDB migration state `pending/done/aborted` is persisted; processes are mutually excluded with an advisory lock
 
 ## 📁 Project structure
 
 ```
 contextweaver/
 ├── src/
-│   ├── index.ts  # CLI
+│   ├── index.ts  # CLI entry point (init / index / search / mcp / migrate)
 │   ├── config.ts  # configuration (environment variables)
 │   ├── api/  # external API wrappers
-│   │   ├──
-│   │   └──
+│   │   ├── embedding.ts  # Embedding API
+│   │   └── reranker.ts  # Reranker API
 │   ├── chunking/  # semantic chunking
 │   │   ├── SemanticSplitter.ts  # AST semantic chunker
-│   │   ├── SourceAdapter.ts  #
+│   │   ├── SourceAdapter.ts  # source adapter (UTF-16/UTF-8 domain normalization)
 │   │   ├── LanguageSpec.ts  # language spec definitions
-│   │
+│   │   ├── ParserPool.ts  # Tree-sitter parser pool
+│   │   └── types.ts  # chunking type definitions
 │   ├── scanner/  # file scanning
 │   │   ├── crawler.ts  # filesystem traversal
 │   │   ├── processor.ts  # file processing
-│   │
+│   │   ├── filter.ts  # filter rules
+│   │   ├── hash.ts  # file hashing
+│   │   └── language.ts  # language detection
 │   ├── indexer/  # indexer
-│   │   └── index.ts  #
+│   │   └── index.ts  # three-phase transaction (LanceDB → FTS+outbox → SQLite mark)
 │   ├── vectorStore/  # vector store
-│   │   └── index.ts  # LanceDB
+│   │   └── index.ts  # LanceDB adapter layer (pure vector operations)
 │   ├── db/  # database
-│   │
+│   │   ├── index.ts  # SQLite + FTS5 + pending_marks + migration state machine
+│   │   └── bootstrap.ts  # cross-store initialization coordinator (v1.4.0+)
 │   ├── search/  # search service
-│   │   ├── SearchService.ts
-│   │   ├── GraphExpander.ts
-│   │   ├── ContextPacker.ts
-│   │   ├──
-│   │   ├──
-│   │   ├──
-│   │
+│   │   ├── SearchService.ts  # core search service
+│   │   ├── GraphExpander.ts  # context expander
+│   │   ├── ContextPacker.ts  # context packer
+│   │   ├── ChunkContentLoader.ts  # slices by (path, start_index, end_index) (v1.4.0+)
+│   │   ├── fts.ts  # full-text search (replaced wholesale per file)
+│   │   ├── config.ts  # search configuration
+│   │   ├── types.ts  # type definitions
+│   │   ├── utils.ts  # token overlap scoring
+│   │   └── resolvers/  # multi-language import resolvers
 │   │       ├── JsTsResolver.ts
 │   │       ├── PythonResolver.ts
 │   │       ├── GoResolver.ts
 │   │       ├── JavaResolver.ts
-│   │
+│   │       ├── RustResolver.ts
+│   │       ├── CppResolver.ts
+│   │       └── CSharpResolver.ts
 │   ├── mcp/  # MCP server
 │   │   ├── server.ts  # MCP server implementation
 │   │   ├── main.ts  # MCP entry point
 │   │   └── tools/
 │   │       └── codebaseRetrieval.ts  # code retrieval tool
 │   └── utils/  # utilities
-│
+│       ├── logger.ts  # logging
+│       ├── encoding.ts  # encoding detection
+│       └── lock.ts  # file lock
+├── tests/  # unit + integration tests (109 test cases)
+│   ├── chunking/  # SourceAdapter / chunking
+│   ├── db/  # migration, outbox, advisory lock
+│   ├── indexer/  # transaction compensation, GC, aborted guard
+│   ├── integration/  # end-to-end against real LanceDB
+│   ├── search/  # FTS, ChunkContentLoader, Packer
+│   └── vectorStore/  # chunk_id deduplication, sample check
 ├── package.json
 └── tsconfig.json
 ```
@@ -314,11 +374,14 @@ ContextWeaver 通过 Tree-sitter 原生支持以下编程语言的 AST 解析:
 | Language | AST parsing | Import resolution | File extensions |
 |------|----------|-------------|-----------|
 | TypeScript | ✅ | ✅ | `.ts`, `.tsx` |
-| JavaScript | ✅ | ✅ | `.js`, `.jsx`, `.mjs` |
+| JavaScript | ✅ | ✅ | `.js`, `.jsx`, `.mjs`, `.cjs` |
 | Python | ✅ | ✅ | `.py` |
 | Go | ✅ | ✅ | `.go` |
 | Java | ✅ | ✅ | `.java` |
 | Rust | ✅ | ✅ | `.rs` |
+| C | ✅ | ✅ | `.c`, `.h` |
+| C++ | ✅ | ✅ | `.cpp`, `.cc`, `.cxx`, `.hpp` |
+| C# | ✅ | ✅ | `.cs` |
 
 Other languages fall back to a line-based chunking strategy and can still be indexed and searched normally.
 
@@ -327,11 +390,16 @@ ContextWeaver 通过 Tree-sitter 原生支持以下编程语言的 AST 解析:
 ### Indexing flow
 
 ```
+0. Bootstrap → pending_marks replay + LanceDB schema migration (first start)
 1. Crawler → walk the filesystem, filter ignored entries
 2. Processor → read file contents, compute hash
-3. Splitter → AST
-4. Indexer → batch embedding
-5.
+3. Splitter → AST parsing, semantic chunking (offsets normalized to the UTF-16 character domain)
+4. Indexer → batch embedding
+5. Stages 4-6 pseudo-transaction:
+   ├─ LanceDB write (pre-delete (path, hash) to prevent duplicates → add → clear old versions)
+   ├─ FTS + outbox in a single SQLite transaction (on failure, roll back LanceDB)
+   └─ SQLite mark + outbox cleanup in a single transaction (on failure the outbox is kept and replayed at the next start)
+6. Final GC → clean up orphaned LanceDB chunks (time budget 5s)
 ```
 
 ### Search flow
@@ -366,6 +434,48 @@ ContextWeaver 通过 Tree-sitter 原生支持以下编程语言的 AST 解析:
 LOG_LEVEL=debug contextweaver search --information-request "..."
 ```
 
+## 🚨 Troubleshooting (v1.4.0+)
+
+### LanceDB migration stuck (`aborted` state)
+
+**Symptom**: `contextweaver index` reports the error "LanceDB 处于 aborted 状态,拒绝写入以防止 schema 污染" ("LanceDB is in the aborted state; refusing to write to prevent schema pollution").
+
+**Cause**: during the v1.4.0 upgrade, a sample check found that `display_code` in the old LanceDB index differed from the current `files.content` by more than 1% (this usually happens with old indexes whose chunk offsets are in the UTF-8 byte domain).
+
+**Fix**:
+```bash
+contextweaver migrate --reset # empty the LanceDB chunks table + reset the state to done
+contextweaver index # full rebuild (new schema)
+```
+
+### Cross-process migration races
+
+If a long-running MCP server and `contextweaver index` in another terminal run at the same time, the two processes compete for the migration. v1.4.0 introduces an advisory lock with a 10-minute zombie threshold, so one process automatically skips the migration while the other completes it.
+
+If the lock is stuck (for example after the process was killed with `kill -9`), it can be cleared manually:
+```bash
+sqlite3 ~/.contextweaver/<projectId>/index.db \
+"DELETE FROM metadata WHERE key = 'lancedb_migration_lock';"
+```
+
+### Wasted duplicate embedding
+
+Resolved in v1.4.0 through the `pending_marks` outbox: when the FTS write succeeds but the vector_index_hash mark fails, the mark is replayed automatically at the next start, so no duplicate embedding is triggered.
+
+## 📜 Version history
+
+- **v1.4.0** (2026-05): major overhaul of data architecture and cross-store consistency
+  - Removed `display_code/vector_text` from the LanceDB chunks table; text is read back from `files.content`
+  - SemanticSplitter offsets unified to the UTF-16 character domain
+  - schema_version 2 → 3; added the `pending_marks` outbox + three-state migration state machine
+  - Added the `contextweaver migrate` CLI
+  - Cross-process advisory lock to prevent migration races
+  - 109 tests (including end-to-end integration against real LanceDB)
+- **v1.3.x**: transactional cross-store writes, automatic GC at the end of scan, files_fts external-content table
+- **v1.2.x**: search pipeline optimizations, indexing memory optimizations
+- **v1.1.x**: smart TopK truncation, Smart Cutoff
+- **v1.0.x**: initial release
+
 ## 📄 License
 
 This project is licensed under the MIT License.
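The advisory lock with a zombie threshold described in the troubleshooting section of this hunk can be illustrated as follows. This is a sketch under assumptions, not the package's lock code: a `Map` stands in for the SQLite `metadata` table, and the stored value shape `{ owner, acquiredAt }` and both function names are hypothetical. Only the key name and the 10-minute threshold come from the README.

```javascript
const ZOMBIE_THRESHOLD_MS = 10 * 60 * 1000; // the 10-minute threshold named in the README
const LOCK_KEY = "lancedb_migration_lock";   // metadata key shown in the cleanup command

// `metadata` is a Map standing in for the SQLite metadata table (assumption).
// The stored value shape { owner, acquiredAt } is also an assumption.
function tryAcquireMigrationLock(metadata, owner, now = Date.now()) {
  const raw = metadata.get(LOCK_KEY);
  if (raw !== undefined) {
    const held = JSON.parse(raw);
    const isZombie = now - held.acquiredAt > ZOMBIE_THRESHOLD_MS;
    // A live lock held by another process means: skip the migration.
    if (held.owner !== owner && !isZombie) return false;
  }
  metadata.set(LOCK_KEY, JSON.stringify({ owner, acquiredAt: now }));
  return true;
}

// Release only if the caller still owns the lock.
function releaseMigrationLock(metadata, owner) {
  const raw = metadata.get(LOCK_KEY);
  if (raw !== undefined && JSON.parse(raw).owner === owner) {
    metadata.delete(LOCK_KEY);
  }
}

const metadata = new Map();
console.log(tryAcquireMigrationLock(metadata, "mcp-server", 0));    // prints true
console.log(tryAcquireMigrationLock(metadata, "cli", 60_000));      // prints false
console.log(tryAcquireMigrationLock(metadata, "cli", 11 * 60_000)); // prints true
```

The third call succeeds because the first holder's lock is older than the threshold and is treated as abandoned. With a real database the read and the write would need to happen inside one transaction; the `Map` version is only atomic because it runs in a single process.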
@@ -1,21 +1,32 @@
 import {
-
+  createSearchConfigFingerprint
+} from "./chunk-IZ6IUHNN.js";
+import {
   bootstrap,
   getGraphExpander,
   getIndexer,
   scoreChunkTokenOverlap
-} from "./chunk-
+} from "./chunk-XMZZZKG7.js";
+import "./chunk-LB42CZEB.js";
+import {
+  ChunkContentLoader
+} from "./chunk-XFIM2T6S.js";
 import {
   getVectorStore
-} from "./chunk-
+} from "./chunk-3BNHQV5W.js";
 import {
+  DEFAULT_CONFIG
+} from "./chunk-BFCIZ52F.js";
+import {
+  getIndexVersion,
+  incrementStat,
   initDb,
   isChunksFtsInitialized,
   isFtsInitialized,
   searchChunksFts,
   searchFilesFts,
   segmentQuery
-} from "./chunk-
+} from "./chunk-PPLFJGO3.js";
 import {
   isDebugEnabled,
   logger
@@ -161,10 +172,8 @@ function sleep(ms) {
 
 // src/search/ContextPacker.ts
 var ContextPacker = class {
-  projectId;
   config;
-  constructor(
-    this.projectId = projectId;
+  constructor(_projectId, config) {
     this.config = config;
   }
   /**
@@ -270,73 +279,63 @@ var ContextPacker = class {
   }
 };
 
-// src/search/
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-  maxBreadcrumbChars: 250,
-  // Max chars for breadcrumb context in rerank input. Range: 100–500.
-  headRatio: 0.67,
-  // Ratio of head vs tail when truncating chunks. Range: 0.5–0.8.
-  // ── Expansion (context expansion: E1 neighbors / E2 breadcrumbs / E3 cross-file imports) ──
-  neighborHops: 2,
-  // E1: How many sibling chunks to expand in each direction. Range: 1–3.
-  breadcrumbExpandLimit: 3,
-  // E2: Max ancestor breadcrumbs (class/function scope). Range: 1–5.
-  importFilesPerSeed: 3,
-  // E3: Cross-file import files to resolve per seed chunk. Range: 0–5. Set to 3 to enable import-graph expansion for better cross-file context.
-  chunksPerImportFile: 3,
-  // E3: Chunks to pull from each resolved import file. Range: 1–5. Set to 3 for balanced coverage of imported symbols.
-  decayNeighbor: 0.8,
-  // Score decay per E1 hop. Range: 0.5–0.9. Higher = neighbors stay relevant longer.
-  decayBreadcrumb: 0.7,
-  // Score decay per E2 level. Range: 0.4–0.8.
-  decayImport: 0.6,
-  // Score decay for E3 import chunks. Range: 0.3–0.7. Lower than E1/E2 since cross-file is less certain.
-  decayDepth: 0.7,
-  // General depth decay multiplier. Range: 0.5–0.9.
-  // ── ContextPacker (context packing) ──
-  maxSegmentsPerFile: 3,
-  // Max non-contiguous segments per file in output. Range: 1–5. Prevents excessive fragmentation.
-  maxTotalChars: 48e3,
-  // Token budget expressed as chars (~12k tokens). Range: 20000–80000.
-  // ── Smart TopK (dynamic result count) ──
-  enableSmartTopK: true,
-  // Dynamically adjust result count based on score distribution.
-  smartTopScoreRatio: 0.5,
-  // Min score as ratio of top-1 score to remain included. Range: 0.3–0.7.
-  smartTopScoreDeltaAbs: 0.25,
-  // Max absolute score drop from top-1 before cutting off. Range: 0.1–0.4.
-  smartMinScore: 0.25,
-  // Hard floor: chunks below this score are always excluded. Range: 0.1–0.4.
-  smartMinK: 2,
-  // Minimum results to return regardless of scores. Range: 1–3.
-  smartMaxK: 8
-  // Maximum results when smart topK is active. Range: 5–15.
+// src/search/QueryCache.ts
+import crypto from "crypto";
+var MAX_CACHE_ENTRIES = 50;
+var LruCache = class {
+  constructor(maxSize) {
+    this.maxSize = maxSize;
+  }
+  entries = /* @__PURE__ */ new Map();
+  get(key) {
+    const value = this.entries.get(key);
+    if (value === void 0) return void 0;
+    this.entries.delete(key);
+    this.entries.set(key, value);
+    return value;
+  }
+  set(key, value) {
+    if (this.entries.has(key)) {
+      this.entries.delete(key);
+    }
+    this.entries.set(key, value);
+    if (this.entries.size > this.maxSize) {
+      const oldestKey = this.entries.keys().next().value;
+      if (oldestKey !== void 0) {
+        this.entries.delete(oldestKey);
+      }
+    }
+  }
 };
+var projectCaches = /* @__PURE__ */ new Map();
+function normalizeQuery(query) {
+  return query.trim().replace(/\s+/g, " ").toLowerCase();
+}
+function getProjectCache(projectId) {
+  let cache = projectCaches.get(projectId);
+  if (!cache) {
+    cache = new LruCache(MAX_CACHE_ENTRIES);
+    projectCaches.set(projectId, cache);
+  }
+  return cache;
+}
+function buildQueryCacheKey(input) {
+  const normalizedQuery = normalizeQuery(input.query);
+  return crypto.createHash("sha256").update(
+    JSON.stringify({
+      query: normalizedQuery,
+      projectId: input.projectId,
+      indexVersion: input.indexVersion,
+      configFingerprint: input.configFingerprint
+    })
+  ).digest("hex");
+}
+function getCachedContextPack(projectId, key) {
+  return getProjectCache(projectId).get(key);
+}
+function setCachedContextPack(projectId, key, pack) {
+  getProjectCache(projectId).set(key, pack);
+}
 
 // src/search/SearchService.ts
 var SearchService = class {
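The query cache added in this hunk relies on a `Map` iterating in insertion order, so re-inserting a key on every read keeps the least recently used key at the front. The block below is a compact, self-contained restatement of that idea with a short eviction trace; `TinyLru` is an illustrative name, not an identifier from the package.

```javascript
// Compact restatement of the Map-based LRU idea; names here are illustrative.
class TinyLru {
  constructor(maxSize) {
    this.maxSize = maxSize;
    // A Map iterates in insertion order, so the first key is the least recently used.
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert to move the key to the "most recently used" end.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      // Evict the least recently used key (the first one in iteration order).
      this.entries.delete(this.entries.keys().next().value);
    }
  }
}

const cache = new TinyLru(2);
cache.set("q1", "pack-1");
cache.set("q2", "pack-2");
cache.get("q1");           // touch q1, so q2 becomes the eviction candidate
cache.set("q3", "pack-3"); // exceeds capacity and evicts q2
console.log([...cache.entries.keys()]); // prints [ 'q1', 'q3' ]
```

Because the cache key in the diff also hashes `indexVersion` and `configFingerprint`, entries for an older index or configuration are never looked up again and simply age out through this eviction path.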
@@ -345,9 +344,11 @@ var SearchService = class {
   vectorStore = null;
   db = null;
   config;
+  configFingerprint;
   constructor(projectId, _projectPath, config) {
     this.projectId = projectId;
     this.config = { ...DEFAULT_CONFIG, ...config };
+    this.configFingerprint = createSearchConfigFingerprint(this.config);
   }
   async init() {
     const embeddingConfig = getEmbeddingConfig();
@@ -366,6 +367,18 @@ var SearchService = class {
    * Build the context pack (for Q&A / generation)
    */
   async buildContextPack(query) {
+    const db = this.db;
+    const cacheKey = buildQueryCacheKey({
+      query,
+      projectId: this.projectId,
+      indexVersion: getIndexVersion(db),
+      configFingerprint: this.configFingerprint
+    });
+    const cached = getCachedContextPack(this.projectId, cacheKey);
+    if (cached) {
+      this.recordSearchStats(db, { cacheHit: true });
+      return cached;
+    }
     const timingMs = {};
     let t0 = Date.now();
     const candidates = await this.hybridRetrieve(query);
@@ -385,7 +398,7 @@ var SearchService = class {
     const packer = new ContextPacker(this.projectId, this.config);
     const files = await packer.pack([...seeds, ...expanded], this.db);
     timingMs.pack = Date.now() - t0;
-
+    const pack = {
       query,
       seeds,
       expanded,
@@ -396,6 +409,35 @@ var SearchService = class {
         timingMs
       }
     };
+    setCachedContextPack(this.projectId, cacheKey, pack);
+    this.recordSearchStats(db, { cacheHit: false, timingMs, seedCount: seeds.length });
+    return pack;
+  }
+  /**
+   * Record search statistics (errors are swallowed silently so the main search flow is unaffected)
+   *
+   * The counters are wrapped in one transaction, so a query's counters are written all together or not at all.
+   */
+  recordSearchStats(db, args) {
+    try {
+      const tx = db.transaction(() => {
+        incrementStat(db, "stats.search.total_queries");
+        if (args.cacheHit) {
+          incrementStat(db, "stats.search.cache_hits");
+          return;
+        }
+        incrementStat(db, "stats.search.compute_runs");
+        const t = args.timingMs ?? {};
+        incrementStat(db, "stats.search.sum_retrieve_ms", Math.round(t.retrieve ?? 0));
+        incrementStat(db, "stats.search.sum_rerank_ms", Math.round(t.rerank ?? 0));
+        incrementStat(db, "stats.search.sum_expand_ms", Math.round(t.expand ?? 0));
+        incrementStat(db, "stats.search.sum_pack_ms", Math.round(t.pack ?? 0));
+        incrementStat(db, "stats.search.sum_seed_count", args.seedCount ?? 0);
+      });
+      tx();
+    } catch (err) {
+      logger.debug({ error: err.message }, "\u641C\u7D22\u7EDF\u8BA1\u57CB\u70B9\u5931\u8D25");
+    }
   }
   // Recall methods
   /**
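The counters written by `recordSearchStats` in this hunk are running sums, so a reader has to divide them to get rates and averages. The block below sketches one way to do that. The function name `summarizeSearchStats` and the plain-object input are assumptions; how the package reads the counters back is not shown in this diff. Only the counter names come from the code above.

```javascript
// `counters` is a plain object of counter values keyed by the names used in
// recordSearchStats (assumption: the read-back path is not part of this diff).
function summarizeSearchStats(counters) {
  const get = (name) => counters[`stats.search.${name}`] ?? 0;
  const total = get("total_queries");
  const runs = get("compute_runs");
  // Timing sums only grow on computed (non-cached) queries, so average per run.
  const perRun = (sum) => (runs === 0 ? 0 : sum / runs);
  return {
    cacheHitRate: total === 0 ? 0 : get("cache_hits") / total,
    avgRetrieveMs: perRun(get("sum_retrieve_ms")),
    avgRerankMs: perRun(get("sum_rerank_ms")),
    avgExpandMs: perRun(get("sum_expand_ms")),
    avgPackMs: perRun(get("sum_pack_ms")),
    avgSeedCount: perRun(get("sum_seed_count")),
  };
}

const summary = summarizeSearchStats({
  "stats.search.total_queries": 10,
  "stats.search.cache_hits": 4,
  "stats.search.compute_runs": 6,
  "stats.search.sum_retrieve_ms": 300,
});
console.log(summary.cacheHitRate, summary.avgRetrieveMs); // prints 0.4 50
```

Dividing by `compute_runs` rather than `total_queries` matters: the cache-hit branch returns before any timing counter is touched, so including hits in the denominator would understate latency.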
@@ -35,11 +35,9 @@ function sampleCheckDisplayCode(oldRows, getContent, options = {}) {
 var VectorStore = class {
   db = null;
   table = null;
-  projectId;
   dbPath;
   vectorDim;
   constructor(projectId, vectorDim = 1024, dbPathOverride) {
-    this.projectId = projectId;
     this.dbPath = dbPathOverride ?? path.join(BASE_DIR, projectId, "vectors.lance");
     this.vectorDim = vectorDim;
   }
@@ -185,9 +183,7 @@ var VectorStore = class {
         continue;
       }
       if (this.table && batch.length > 0) {
-        await this.deleteFilesByHash(
-          batch.map((f) => ({ path: f.path, hash: f.hash }))
-        );
+        await this.deleteFilesByHash(batch.map((f) => ({ path: f.path, hash: f.hash })));
       }
       if (!this.table) {
         await this.ensureTable(batchRecords);
@@ -0,0 +1,102 @@
+// src/search/config.ts
+var SEARCH_CONFIG_BOUNDS = {
+  vectorTopK: { min: 40, max: 200, integer: true },
+  vectorTopM: { min: 30, max: 100, integer: true },
+  ftsTopKFiles: { min: 10, max: 50, integer: true },
+  lexChunksPerFile: { min: 1, max: 5, integer: true },
+  lexTotalChunks: { min: 20, max: 80, integer: true },
+  rrfK0: { min: 10, max: 60, integer: true },
+  wVec: { min: 0, max: 1, integer: false },
+  wLex: { min: 0, max: 1, integer: false },
+  fusedTopM: { min: 30, max: 100, integer: true },
+  rerankTopN: { min: 5, max: 20, integer: true },
+  maxRerankChars: { min: 500, max: 2e3, integer: true },
+  maxBreadcrumbChars: { min: 100, max: 500, integer: true },
+  headRatio: { min: 0.5, max: 0.8, integer: false },
+  neighborHops: { min: 1, max: 3, integer: true },
+  breadcrumbExpandLimit: { min: 1, max: 5, integer: true },
+  importFilesPerSeed: { min: 0, max: 5, integer: true },
+  chunksPerImportFile: { min: 1, max: 5, integer: true },
+  decayNeighbor: { min: 0.5, max: 0.9, integer: false },
+  decayBreadcrumb: { min: 0.4, max: 0.8, integer: false },
+  decayImport: { min: 0.3, max: 0.7, integer: false },
+  decayDepth: { min: 0.5, max: 0.9, integer: false },
+  maxSegmentsPerFile: { min: 1, max: 5, integer: true },
+  maxTotalChars: { min: 2e4, max: 8e4, integer: true },
+  smartTopScoreRatio: { min: 0.3, max: 0.7, integer: false },
+  smartTopScoreDeltaAbs: { min: 0.1, max: 0.4, integer: false },
+  smartMinScore: { min: 0.1, max: 0.4, integer: false },
+  smartMinK: { min: 1, max: 3, integer: true },
+  smartMaxK: { min: 5, max: 15, integer: true }
+};
+var DEFAULT_CONFIG = {
+  // ── Recall (vector + lexical recall) ──
+  vectorTopK: 80,
+  // Vector ANN candidates before dedup. Range: 40–200. Higher = better recall, more compute.
+  vectorTopM: 60,
+  // Vectors kept after dedup. Range: 30–100.
+  ftsTopKFiles: 20,
+  // Max files returned by FTS5 full-text search. Range: 10–50.
+  lexChunksPerFile: 2,
+  // Chunks to pull per FTS-matched file. Range: 1–5. Low keeps diversity across files.
+  lexTotalChunks: 40,
+  // Hard cap on total lexical chunks. Range: 20–80.
+  // ── RRF Fusion (vector + lexical score fusion) ──
+  rrfK0: 20,
+  // RRF smoothing constant. Range: 10–60. Lower amplifies top ranks.
+  wVec: 0.6,
+  // Vector weight in fused score. Range: 0.3–0.8. Semantic relevance emphasis.
+  wLex: 0.4,
+  // Lexical weight in fused score. wVec + wLex should equal 1.0.
+  fusedTopM: 60,
+  // Candidates after fusion, fed into reranker. Range: 30–100.
+  // ── Rerank (fine-grained ranking) ──
+  rerankTopN: 10,
+  // Final top-N results after reranking. Range: 5–20.
+  maxRerankChars: 1e3,
+  // Max chars per chunk sent to reranker. Truncated beyond this. Range: 500–2000.
+  maxBreadcrumbChars: 250,
+  // Max chars for breadcrumb context in rerank input. Range: 100–500.
+  headRatio: 0.67,
+  // Ratio of head vs tail when truncating chunks. Range: 0.5–0.8.
+  // ── Expansion (context expansion: E1 neighbors / E2 breadcrumbs / E3 cross-file imports) ──
+  neighborHops: 2,
+  // E1: How many sibling chunks to expand in each direction. Range: 1–3.
+  breadcrumbExpandLimit: 3,
+  // E2: Max ancestor breadcrumbs (class/function scope). Range: 1–5.
+  importFilesPerSeed: 3,
+  // E3: Cross-file import files to resolve per seed chunk. Range: 0–5. Set to 3 to enable import-graph expansion for better cross-file context.
+  chunksPerImportFile: 3,
+  // E3: Chunks to pull from each resolved import file. Range: 1–5. Set to 3 for balanced coverage of imported symbols.
+  decayNeighbor: 0.8,
+  // Score decay per E1 hop. Range: 0.5–0.9. Higher = neighbors stay relevant longer.
+  decayBreadcrumb: 0.7,
+  // Score decay per E2 level. Range: 0.4–0.8.
+  decayImport: 0.6,
+  // Score decay for E3 import chunks. Range: 0.3–0.7. Lower than E1/E2 since cross-file is less certain.
+  decayDepth: 0.7,
+  // General depth decay multiplier. Range: 0.5–0.9.
+  // ── ContextPacker (context packing) ──
+  maxSegmentsPerFile: 3,
+  // Max non-contiguous segments per file in output. Range: 1–5. Prevents excessive fragmentation.
+  maxTotalChars: 48e3,
+  // Token budget expressed as chars (~12k tokens). Range: 20000–80000.
+  // ── Smart TopK (dynamic result count) ──
+  enableSmartTopK: true,
+  // Dynamically adjust result count based on score distribution.
+  smartTopScoreRatio: 0.5,
+  // Min score as ratio of top-1 score to remain included. Range: 0.3–0.7.
+  smartTopScoreDeltaAbs: 0.25,
+  // Max absolute score drop from top-1 before cutting off. Range: 0.1–0.4.
+  smartMinScore: 0.25,
+  // Hard floor: chunks below this score are always excluded. Range: 0.1–0.4.
+  smartMinK: 2,
+  // Minimum results to return regardless of scores. Range: 1–3.
+  smartMaxK: 8
+  // Maximum results when smart topK is active. Range: 5–15.
+};
+
+export {
+  SEARCH_CONFIG_BOUNDS,
+  DEFAULT_CONFIG
+};
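The `rrfK0`, `wVec` and `wLex` defaults in this file parameterize the RRF fusion step. The package's actual fusion code is not part of this diff, so the block below is only a sketch that assumes the conventional weighted reciprocal-rank form, where each list contributes `weight / (rrfK0 + rank)` per item; the function name `fuseRrf` and its signature are hypothetical.

```javascript
// Weighted Reciprocal Rank Fusion sketch. Assumes the conventional form
// score(id) = wVec / (rrfK0 + rankVec) + wLex / (rrfK0 + rankLex), with 1-based ranks.
// Defaults mirror DEFAULT_CONFIG; the function itself is hypothetical.
function fuseRrf(vectorRanked, lexicalRanked, { rrfK0 = 20, wVec = 0.6, wLex = 0.4 } = {}) {
  const scores = new Map();
  const accumulate = (ids, weight) => {
    ids.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + weight / (rrfK0 + rank));
    });
  };
  accumulate(vectorRanked, wVec);
  accumulate(lexicalRanked, wLex);
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// "b" appears in both lists, so it outranks "a" even though "a" tops the vector list.
console.log(fuseRrf(["a", "b", "c"], ["b", "d"])); // prints [ 'b', 'a', 'c', 'd' ]
```

A smaller `rrfK0` makes the gap between rank 1 and rank 2 larger, which matches the "Lower amplifies top ranks" comment in the defaults.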