@theclawlab/xdb 1.0.3

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 sw chen
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,90 @@
1
+ # xdb
2
+
3
+ Intent-driven data collection management CLI for AI agents. Transparently combines LanceDB (vector) and SQLite (relational/FTS) behind a unified interface.
4
+
5
+ ## Features
6
+
7
+ - Dual-engine architecture: LanceDB for vector search, SQLite for metadata and full-text search
8
+ - Policy-based collections — declare intent, not implementation
9
+ - Automatic embedding via `pai embed` (no manual vector management)
10
+ - JSONL input/output for machine-to-machine workflows
11
+ - Upsert semantics with auto-generated UUIDs
12
+ - Embedding dimension tracking and consistency validation
13
+
14
+ ## Install
15
+
16
+ ### From npm
17
+
18
+ ```bash
19
+ npm install -g @theclawlab/xdb
20
+ ```
21
+
22
+ ### From source
23
+
24
+ ```bash
25
+ npm install
26
+ npm run build
27
+ npm link
28
+ ```
29
+
30
+ Embedding support requires [pai] to be installed.
31
+
32
+ ## Quick Start
33
+
34
+ ```bash
35
+ # Create a collection with hybrid (vector + FTS) policy
36
+ xdb col init my-docs --policy hybrid/knowledge-base
37
+
38
+ # Write data (embedding happens automatically via pai)
39
+ echo '{"id":"doc1","content":"How to compress files with tar"}' | xdb put my-docs
40
+
41
+ # Semantic search
42
+ xdb find my-docs "compress files" --similar
43
+
44
+ # Full-text search
45
+ xdb find my-docs "tar" --match
46
+
47
+ # SQL filtering
48
+ xdb find my-docs --where "json_extract(data, '$.category') = 'archive'"
49
+
50
+ # Batch write (JSONL via stdin)
51
+ cat records.jsonl | xdb put my-docs --batch
52
+ ```
53
+
54
+ ## Commands
55
+
56
+ | Command | Description |
57
+ |---------|-------------|
58
+ | `xdb put <collection> [json]` | Write data (single JSON arg or JSONL via stdin) |
59
+ | `xdb find <collection> [query]` | Search with `--similar`, `--match`, or `--where` |
60
+ | `xdb embed [text]` | Generate text embeddings via configured pai provider |
61
+ | `xdb col init <name> --policy <p>` | Create a collection with a policy |
62
+ | `xdb col list` | List collections with stats |
63
+ | `xdb col info <name>` | Show collection details |
64
+ | `xdb col rm <name>` | Delete a collection |
65
+ | `xdb policy list` | List available policies |
66
+ | `xdb config` | Manage xdb configuration |
67
+
68
+ ## Policies
69
+
70
+ | Policy | Vector | FTS | Engine |
71
+ |--------|--------|-----|--------|
72
+ | `hybrid/knowledge-base` | `content` | yes | LanceDB + SQLite |
73
+ | `relational/structured-logs` | — | — | SQLite |
74
+ | `relational/simple-kv` | — | — | SQLite |
75
+ | `vector/feature-store` | `tensor` | — | LanceDB |
76
+
77
+ ## Storage
78
+
79
+ ```
80
+ ~/.local/share/xdb/
81
+ └── collections/
82
+     └── <name>/
83
+         ├── collection_meta.json
84
+         ├── vector.lance/
85
+         └── relational.db
86
+ ```
87
+
88
+ ## Documentation
89
+
90
+ - **[USAGE.md](USAGE.md)** — Full usage guide with all providers, options, and examples
package/USAGE.md ADDED
@@ -0,0 +1,347 @@
1
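+ # xdb Usage Guide
2
+
3
+ `xdb` is an intent-driven data hub CLI designed for AI agents and CLI toolchains. Internally it transparently combines LanceDB (vector) and SQLite (relational/full-text search); callers only need to declare their intent.
4
+
5
+ ## Installation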
6
+
7
+ ```bash
8
+ npm install
9
+ npm run build
10
+ npm link # install the xdb command globally
11
+ ```
12
+
13
+ Embedding depends on the [pai] command. Make sure `pai` is installed and configured with an embedding provider:
14
+
15
+ ```bash
16
+ pai model default --embed-provider openai --embed-model text-embedding-3-small
17
+ ```
18
+
19
+ ## Text Embedding
20
+
21
+ ### `xdb embed`
22
+
23
+ Calls the configured embedding provider directly to embed text and prints the resulting vector data.
24
+
25
+ ```bash
26
+ # Single text
27
+ xdb embed "how to compress files"
28
+
29
+ # From stdin
30
+ echo "database optimization" | xdb embed
31
+
32
+ # From a file
33
+ xdb embed --input-file document.txt
34
+
35
+ # Batch mode (input is a JSON array of strings)
36
+ xdb embed --batch '["hello","world","foo"]'
37
+
38
+ # JSON output (includes model and usage information)
39
+ xdb embed "hello" --json
40
+ ```
41
+
42
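+ **Options:**
43
+ - `--batch`: batch mode; the input is a JSON array of strings
44
+ - `--json`: JSON output (includes `model` and `usage` metadata)
45
+ - `--input-file <path>`: read input from a file
46
+
47
+ **Input sources** (mutually exclusive; choose one of three):
48
+ 1. Positional argument: `xdb embed "text"`
49
+ 2. stdin: `echo "text" | xdb embed`
50
+ 3. File: `xdb embed --input-file file.txt`
51
+
52
+ **Output formats:**
53
+
54
+ Plain text (default): one hex-encoded vector array per line: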
55
+ ```
56
+ ["3f800000","bf800000",...]
57
+ ```
58
+
59
+ JSON mode (`--json`), single input:
60
+ ```json
61
+ {"embedding":["3f800000",...],"model":"text-embedding-3-small","usage":{"prompt_tokens":2,"total_tokens":2}}
62
+ ```
63
+
64
+ JSON mode (`--json --batch`), batch input:
65
+ ```json
66
+ {"embeddings":[["3f800000",...],["3f000000",...]],"model":"text-embedding-3-small","usage":{"prompt_tokens":4,"total_tokens":4}}
67
+ ```
68
+
69
+ Vectors are encoded as float32 hex (one 8-character hexadecimal string per dimension), which is lossless and more compact than a JSON array of numbers.
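
To make the hex encoding concrete, here is a minimal Python sketch that decodes such a vector back into floats. It assumes each 8-character string is the big-endian IEEE 754 bit pattern of a float32, which is consistent with the sample value `3f800000` (1.0) but is not confirmed by the guide. The helper names `decode_hex_vector` and `encode_hex_vector` are hypothetical and not part of xdb.

```python
import struct


def decode_hex_vector(hex_values):
    """Decode 8-character hex strings into Python floats.

    Assumption: each string is the big-endian IEEE 754 bit pattern
    of one float32 value (e.g. "3f800000" is 1.0).
    """
    return [struct.unpack(">f", bytes.fromhex(h))[0] for h in hex_values]


def encode_hex_vector(values):
    """Inverse of decode_hex_vector: floats to 8-character hex strings."""
    return [struct.pack(">f", v).hex() for v in values]


vector = decode_hex_vector(["3f800000", "bf800000", "3f000000"])
print(vector)  # [1.0, -1.0, 0.5]
```

Every float32 value maps to exactly one 8-character string, so the round trip loses no precision, whereas a decimal rendering may need many more characters to be exact.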
70
+
71
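+ If the input text exceeds the model's token limit, it is truncated automatically and a warning is written to stderr.
72
+
73
+ The embedding provider and model are configured through `pai`: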
74
+
75
+ ```bash
76
+ pai model default --embed-provider openai --embed-model text-embedding-3-small
77
+ ```
78
+
79
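+ ## Collection Management
80
+
81
+ ### Creating a collection
82
+
83
+ Each collection needs a policy, which determines the underlying engine combination and how data is processed: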
84
+
85
+ ```bash
86
+ # Hybrid mode: vector + full-text search (most common)
87
+ xdb col init my-docs --policy hybrid/knowledge-base
88
+
89
+ # Relational-only mode: structured logs
90
+ xdb col init logs --policy relational/structured-logs
91
+
92
+ # Vector-only mode: feature store
93
+ xdb col init features --policy vector/feature-store
94
+
95
+ # Simple key-value pairs
96
+ xdb col init cache --policy relational/simple-kv
97
+ ```
98
+
99
+ A policy can be given as just the main type, in which case the default subtype is used:
100
+
101
+ ```bash
102
+ xdb col init my-docs --policy hybrid # same as hybrid/knowledge-base
103
+ xdb col init logs --policy relational # same as relational/structured-logs
104
+ ```
105
+
106
+ Custom policy parameters (override the default field configuration):
107
+
108
+ ```bash
109
+ xdb col init my-col --policy hybrid/knowledge-base \
110
+ --params '{"fields":{"title":{"findCaps":["match"]}}}'
111
+ ```
112
+
113
+ ### Listing collections
114
+
115
+ ```bash
116
+ xdb col list
117
+ ```
118
+
119
+ Outputs JSONL, one collection per line:
120
+
121
+ ```json
122
+ {"name":"my-docs","policy":"hybrid/knowledge-base","recordCount":42,"sizeBytes":102400,"embeddingDimension":1536}
123
+ ```
124
+
125
+ ### Collection details
126
+
127
+ ```bash
128
+ xdb col info my-docs
129
+ ```
130
+
131
+ Prints the collection's full information, including the policy snapshot, field configuration, and record count:
132
+
133
+ ```
134
+ name: my-docs
135
+ createdAt: 2025-01-15T10:30:00.000Z
136
+ policy: hybrid/knowledge-base
137
+ engines: hybrid
138
+ autoIndex: true
139
+ records: 42
140
+ size: 100.0 KB
141
+ embedDim: 1536
142
+ fields:
143
+ content findCaps=[similar, match]
144
+ ```
145
+
146
+ `--json` output is also supported:
147
+
148
+ ```bash
149
+ xdb col info my-docs --json
150
+ ```
151
+
152
+ ### Deleting a collection
153
+
154
+ ```bash
155
+ xdb col rm my-docs
156
+ ```
157
+
158
+ Physically deletes the collection directory and all of its index files.
159
+
160
+ ## Writing Data
161
+
162
+ ### Single write
163
+
164
+ ```bash
165
+ # Pass JSON as a positional argument
166
+ xdb put my-docs '{"id":"doc1","content":"How to use tar for compression"}'
167
+
168
+ # Pass JSON via stdin
169
+ echo '{"content":"Git branching strategies"}' | xdb put my-docs
170
+ ```
171
+
172
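+ - The `id` field is optional; a UUID is generated automatically when it is omitted
173
+ - Records with the same `id` are upserted (the existing record is updated)
174
+ - Under the `hybrid/knowledge-base` policy, the `content` field is automatically embedded and full-text indexed
175
+
176
+ ### Batch write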
177
+
178
+ ```bash
179
+ # JSONL format, one JSON object per line
180
+ cat data.jsonl | xdb put my-docs --batch
181
+ ```
182
+
183
+ `--batch` mode:
184
+ - opens a SQLite transaction
185
+ - calls `pai embed --batch` to embed records in bulk
186
+ - writes the write statistics to stdout
187
+
188
+ ```json
189
+ {"inserted":95,"updated":5,"errors":0}
190
+ ```
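
The upsert and transaction behaviour described above can be illustrated with a small, self-contained Python sketch using the standard `sqlite3` module. This is not xdb's implementation: the `records(id, data)` schema is an assumption inferred from the `json_extract(data, ...)` examples in this guide, the function name `put_batch` is hypothetical, and embedding is left out entirely.

```python
import json
import sqlite3
import uuid


def put_batch(conn, jsonl_lines):
    """Upsert JSONL records inside one transaction and return write stats."""
    stats = {"inserted": 0, "updated": 0, "errors": 0}
    with conn:  # commits on success, rolls back if an exception escapes
        for line in jsonl_lines:
            try:
                record = json.loads(line)
                if not isinstance(record, dict):
                    raise ValueError("record must be a JSON object")
            except ValueError:  # json.JSONDecodeError is a ValueError
                stats["errors"] += 1
                continue
            # A missing id gets a generated UUID, as the guide describes.
            record_id = str(record.get("id") or uuid.uuid4())
            exists = conn.execute(
                "SELECT 1 FROM records WHERE id = ?", (record_id,)
            ).fetchone()
            conn.execute(
                "INSERT INTO records (id, data) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET data = excluded.data",
                (record_id, json.dumps(record)),
            )
            stats["updated" if exists else "inserted"] += 1
    return stats


# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, data TEXT NOT NULL)")
lines = [
    '{"id":"doc1","content":"tar basics"}',
    '{"content":"record without an id"}',
    '{"id":"doc1","content":"tar, revised"}',
    "not json",
]
stats = put_batch(conn, lines)
print(json.dumps(stats))
```

Wrapping the whole batch in one transaction means either every valid record lands or none does, and it avoids paying a commit per row.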
191
+
192
+ ## Searching Data
193
+
194
+ ### Semantic search (--similar)
195
+
196
+ Retrieves by vector similarity; the collection's policy must include a field with the `similar` capability:
197
+
198
+ ```bash
199
+ xdb find my-docs "how to compress files" --similar
200
+ xdb find my-docs "network debugging tools" --similar --limit 5
201
+ ```
202
+
203
+ The query text can also be passed via stdin:
204
+
205
+ ```bash
206
+ echo "database optimization" | xdb find my-docs --similar
207
+ ```
208
+
209
+ ### Full-text search (--match)
210
+
211
+ Keyword matching based on SQLite FTS5:
212
+
213
+ ```bash
214
+ xdb find my-docs "tar compression" --match
215
+ ```
216
+
217
+ ### Conditional filtering (--where)
218
+
219
+ A SQL WHERE clause applied to the `records` table in SQLite:
220
+
221
+ ```bash
222
+ xdb find my-docs --where "json_extract(data, '$.category') = 'network'"
223
+ xdb find my-docs --where "json_extract(data, '$.priority') > 5" --limit 20
224
+ ```
225
+
226
+ ### Combined queries
227
+
228
+ `--match` and `--where` can be combined:
229
+
230
+ ```bash
231
+ xdb find my-docs "compression" --match --where "json_extract(data, '$.category') = 'archive'"
232
+ ```
233
+
234
+ ### Output format
235
+
236
+ All search results are output as JSONL; each line contains the original data plus system metadata:
237
+
238
+ ```json
239
+ {"id":"doc1","content":"How to use tar...","category":"archive","_score":0.95,"_engine":"lancedb"}
240
+ {"id":"doc2","content":"Gzip compression...","category":"archive","_score":0.87,"_engine":"lancedb"}
241
+ ```
242
+
243
+ - `_score`: relevance score (`1/(1+distance)` for semantic search, the FTS5 rank for full-text search)
244
+ - `_engine`: the engine that produced the result (`lancedb` or `sqlite`)
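
As a quick illustration of the semantic-search score, the following Python sketch applies the stated formula `1/(1+distance)`. The function name `similarity_score` is hypothetical, and which distance metric the vector engine reports is not specified in this guide.

```python
def similarity_score(distance):
    """Map a non-negative vector distance to a score in (0, 1]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


# A distance of 0 gives the maximum score; larger distances decay toward 0.
for d in (0.0, 1.0, 3.0):
    print(d, similarity_score(d))
```

Because the full-text score is an FTS5 rank rather than this transformed distance, the two kinds of `_score` are on different scales, so it is probably safest to compare scores only among results from the same engine.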
245
+
246
+ ## Policies in Detail
247
+
248
+ ### Listing available policies
249
+
250
+ ```bash
251
+ xdb policy list
252
+ ```
253
+
254
+ Prints details for all built-in policies:
255
+
256
+ ```
257
+ hybrid/knowledge-base
258
+ engines: LanceDB + SQLite
259
+ fields: content [similar, match]
260
+ autoIndex: yes
261
+ relational/structured-logs
262
+ engines: SQLite
263
+ fields: (none)
264
+ autoIndex: yes
265
+ relational/simple-kv
266
+ engines: SQLite
267
+ fields: (none)
268
+ autoIndex: no
269
+ vector/feature-store
270
+ engines: LanceDB
271
+ fields: tensor [similar]
272
+ autoIndex: no
273
+ ```
274
+
275
+ `--json` output is also supported:
276
+
277
+ ```bash
278
+ xdb policy list --json
279
+ ```
280
+
281
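+ ### Policy comparison
282
+
283
+ | Policy | Embedded field | Full-text field | Auto index | Use cases |
284
+ |--------|----------------|-----------------|------------|-----------|
285
+ | `hybrid/knowledge-base` | `content` | `content` | yes | Documents, knowledge bases, command indexes |
286
+ | `relational/structured-logs` | none | none | yes | Logs, events, structured data |
287
+ | `relational/simple-kv` | none | none | no | Caches, configuration, key-value pairs |
288
+ | `vector/feature-store` | `tensor` | none | no | ML feature vectors, embedding storage |
289
+
290
+ A policy determines:
291
+ - which fields are embedded (via `pai embed`)
292
+ - which fields get a full-text index (FTS5)
293
+ - which search intents the `find` command supports (`--similar`, `--match`)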
294
+
295
+ ## Storage Layout
296
+
297
+ ```
298
+ ~/.local/share/xdb/
299
+ └── collections/
300
+     └── my-docs/
301
+         ├── collection_meta.json # Policy snapshot + metadata
302
+         ├── vector.lance/ # LanceDB vector data
303
+         └── relational.db # SQLite relational data + FTS
304
+ ```
305
+
306
+ Each collection is fully self-contained, so it can be migrated by copying its directory.
307
+
308
+ ### Using xdb in scripts
309
+
310
+ ```bash
311
+ # Write, then query
312
+ echo '{"id":"note1","content":"Remember to update DNS records"}' | xdb put notes
313
+ xdb find notes "DNS" --match | jq '.content'
314
+
315
+ # Bulk-import a CSV (converted to JSONL)
316
+ cat data.csv | python3 -c "
317
+ import csv, json, sys
318
+ for row in csv.DictReader(sys.stdin):
319
+     print(json.dumps(row))
320
+ " | xdb put my-data --batch
321
+ ```
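
Since search results are JSONL, a script can also parse them line by line instead of piping through `jq`. The Python sketch below separates user data from system metadata. It assumes that system metadata keys are exactly those starting with an underscore, which matches the `_score` and `_engine` fields in the sample output but is not stated as a rule in this guide. The function name `parse_results` is hypothetical.

```python
import json


def parse_results(jsonl_text):
    """Parse JSONL search output into (record, meta) pairs.

    Assumption: keys starting with "_" are system metadata.
    """
    results = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        meta = {k: v for k, v in row.items() if k.startswith("_")}
        record = {k: v for k, v in row.items() if not k.startswith("_")}
        results.append((record, meta))
    return results


# Sample taken from the output format section of this guide.
sample = (
    '{"id":"doc1","content":"How to use tar...","category":"archive","_score":0.95,"_engine":"lancedb"}\n'
    '{"id":"doc2","content":"Gzip compression...","category":"archive","_score":0.87,"_engine":"lancedb"}\n'
)
for record, meta in parse_results(sample):
    print(record["id"], meta["_score"])
```

In a real script, `jsonl_text` would be the captured stdout of an `xdb find` call.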
322
+
323
+ ### Using xdb from an LLM agent
324
+
325
+ xdb's JSONL input and output are a natural fit for agent calls:
326
+
327
+ ```bash
328
+ # The agent stores knowledge
329
+ xdb put knowledge '{"id":"fact-1","content":"The speed of light is 299792458 m/s","category":"physics"}'
330
+
331
+ # The agent retrieves knowledge
332
+ xdb find knowledge "light speed" --similar --limit 1
333
+ ```
334
+
335
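+ ## Exit Codes
336
+
337
+ | Exit code | Meaning |
338
+ |-----------|---------|
339
+ | 0 | Success |
340
+ | 1 | Runtime error (engine failure, failed `pai` call, etc.) |
341
+ | 2 | Invalid arguments / collection does not exist / capability mismatch |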
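
A caller that wraps xdb might translate these exit codes into its own error handling. The Python sketch below only encodes the table above. The function names and the retry rule (retry runtime errors, never usage errors) are illustrative assumptions, not behaviour defined by xdb.

```python
# Meanings taken from the exit code table in this guide.
EXIT_MEANINGS = {
    0: "success",
    1: "runtime error (engine failure, failed pai call, etc.)",
    2: "invalid arguments, missing collection, or capability mismatch",
}


def describe_exit(code):
    """Return a human-readable meaning for an xdb exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def is_retryable(code):
    """Assumption: only runtime errors (1) may succeed on a retry;
    usage errors (2) need a corrected command instead."""
    return code == 1


print(describe_exit(2))
print(is_retryable(1))
```

The code passed in would typically be the `returncode` of a `subprocess.run` call that invoked `xdb`.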
342
+
343
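+ ## Notes
343
+
344
+ - Embedding depends on `pai embed`. The first write of data with a vector field records the embedding dimension, and all later writes must use a model with the same dimension
345
+ - Switching to a different embedding model requires deleting and recreating the collection
346
+ - The `--where` clause is applied directly to SQLite; use `json_extract(data, '$.field')` to access JSON fields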
package/dist/cli.d.ts ADDED
@@ -0,0 +1,2 @@
1
+
2
+ export { }