@theclawlab/xdb 1.0.3
- package/LICENSE +21 -0
- package/README.md +90 -0
- package/USAGE.md +347 -0
- package/dist/cli.d.ts +2 -0
- package/dist/cli.js +2197 -0
- package/dist/cli.js.map +1 -0
- package/package.json +41 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 sw chen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

package/README.md
ADDED
@@ -0,0 +1,90 @@
# xdb

Intent-driven data collection management CLI for AI agents. Transparently combines LanceDB (vector) and SQLite (relational/FTS) behind a unified interface.

## Features

- Dual-engine architecture: LanceDB for vector search, SQLite for metadata and full-text search
- Policy-based collections: declare intent, not implementation
- Automatic embedding via `pai embed` (no manual vector management)
- JSONL input/output for machine-to-machine workflows
- Upsert semantics with auto-generated UUIDs
- Embedding dimension tracking and consistency validation

## Install

### From npm

```bash
npm install -g @theclawlab/xdb
```

### From source

```bash
npm install
npm run build
npm link
```

Requires [pai] installed for embedding support.

## Quick Start

```bash
# Create a collection with hybrid (vector + FTS) policy
xdb col init my-docs --policy hybrid/knowledge-base

# Write data (embedding happens automatically via pai)
echo '{"id":"doc1","content":"How to compress files with tar"}' | xdb put my-docs

# Semantic search
xdb find my-docs "compress files" --similar

# Full-text search
xdb find my-docs "tar" --match

# SQL filtering
xdb find my-docs --where "json_extract(data, '$.category') = 'archive'"

# Batch write (JSONL via stdin)
cat records.jsonl | xdb put my-docs --batch
```

## Commands

| Command | Description |
|---------|-------------|
| `xdb put <collection> [json]` | Write data (single JSON arg or JSONL via stdin) |
| `xdb find <collection> [query]` | Search with `--similar`, `--match`, or `--where` |
| `xdb embed [text]` | Generate text embeddings via configured pai provider |
| `xdb col init <name> --policy <p>` | Create a collection with a policy |
| `xdb col list` | List collections with stats |
| `xdb col info <name>` | Show collection details |
| `xdb col rm <name>` | Delete a collection |
| `xdb policy list` | List available policies |
| `xdb config` | Manage xdb configuration |

## Policies

| Policy | Vector | FTS | Engine |
|--------|--------|-----|--------|
| `hybrid/knowledge-base` | `content` | yes | LanceDB + SQLite |
| `relational/structured-logs` | - | - | SQLite |
| `relational/simple-kv` | - | - | SQLite |
| `vector/feature-store` | `tensor` | - | LanceDB |

## Storage

```
~/.local/share/xdb/
└── collections/
    └── <name>/
        ├── collection_meta.json
        ├── vector.lance/
        └── relational.db
```

## Documentation

- **[USAGE.md](USAGE.md)**: Full usage guide with all providers, options, and examples

package/USAGE.md
ADDED
@@ -0,0 +1,347 @@
# xdb Usage Guide

`xdb` is an intent-driven data hub CLI designed for AI agents and CLI toolchains. Internally it transparently integrates LanceDB (vector) with SQLite (relational/full-text search); callers only need to declare their intent.

## Installation

```bash
npm install
npm run build
npm link   # install the xdb command globally
```

Embedding depends on the [pai] command. Make sure `pai` is installed and configured with an embedding provider:

```bash
pai model default --embed-provider openai --embed-model text-embedding-3-small
```

## Text Embedding

### `xdb embed`

Calls the configured embedding provider directly to embed text, and outputs the resulting vector data.

```bash
# Single text
xdb embed "how to compress files"

# From stdin
echo "database optimization" | xdb embed

# From a file
xdb embed --input-file document.txt

# Batch mode (input is a JSON array of strings)
xdb embed --batch '["hello","world","foo"]'

# JSON output (includes model and usage information)
xdb embed "hello" --json
```

**Options:**
- `--batch`: batch mode; the input is a JSON array of strings
- `--json`: JSON output (includes `model` and `usage` metadata)
- `--input-file <path>`: read the input from a file

**Input sources** (choose exactly one; they are mutually exclusive):
1. Positional argument: `xdb embed "text"`
2. stdin: `echo "text" | xdb embed`
3. File: `xdb embed --input-file file.txt`

**Output formats:**

Plain text (default), one hex-encoded vector array per line:
```
["3f800000","bf800000",...]
```

JSON mode (`--json`), single input:
```json
{"embedding":["3f800000",...],"model":"text-embedding-3-small","usage":{"prompt_tokens":2,"total_tokens":2}}
```

JSON mode (`--json --batch`), batch input:
```json
{"embeddings":[["3f800000",...],["3f000000",...]],"model":"text-embedding-3-small","usage":{"prompt_tokens":4,"total_tokens":4}}
```

Vectors are encoded as float32 hex (one 8-character hexadecimal string per dimension), which is lossless and more compact than a JSON array of numbers.
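
A consumer of this output has to turn the hex strings back into numbers. The following is a minimal Python sketch of that step, assuming each string is the big-endian IEEE 754 bit pattern of a float32 (which is consistent with `3f800000` standing for 1.0). The byte order and the helper name `decode_hex_vector` are assumptions for illustration, not taken from the xdb source.

```python
import json
import struct


def decode_hex_vector(hex_values):
    """Decode a list of 8-character hex strings into Python floats.

    Assumption: each string is a big-endian IEEE 754 float32 bit pattern.
    """
    return [struct.unpack(">f", bytes.fromhex(h))[0] for h in hex_values]


# One line of plain-text output, shortened to three dimensions for illustration.
line = '["3f800000","bf800000","3f000000"]'
vector = decode_hex_vector(json.loads(line))
print(vector)  # [1.0, -1.0, 0.5]
```

Because float32 values are exactly representable as Python floats, the round trip loses no precision.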

If the input text exceeds the model's token limit, it is truncated automatically and a warning is written to stderr.

The embedding provider and model are configured through `pai`:

```bash
pai model default --embed-provider openai --embed-model text-embedding-3-small
```

## Collection Management

### Creating a collection

Every collection must specify a policy, which determines the underlying engine combination and how data is processed:

```bash
# Hybrid mode: vector + full-text search (most common)
xdb col init my-docs --policy hybrid/knowledge-base

# Pure relational mode: structured logs
xdb col init logs --policy relational/structured-logs

# Pure vector mode: feature store
xdb col init features --policy vector/feature-store

# Simple key-value pairs
xdb col init cache --policy relational/simple-kv
```

A policy can be given as just its main type, in which case the default subtype is used:

```bash
xdb col init my-docs --policy hybrid       # same as hybrid/knowledge-base
xdb col init logs --policy relational      # same as relational/structured-logs
```
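
The shorthand rule can be modelled in a few lines. This Python sketch is only an illustration of the behaviour described above, not xdb's implementation; the function name `resolve_policy` is hypothetical, and only the two defaults shown in the examples are included because the guide does not state a default for other main types.

```python
# Defaults taken from the two shorthand examples above; other main types are
# not documented here, so they are deliberately left out.
DEFAULT_SUBTYPES = {
    "hybrid": "knowledge-base",
    "relational": "structured-logs",
}


def resolve_policy(policy):
    """Expand a bare main type such as "hybrid" into "main/sub"."""
    if "/" in policy:
        return policy  # already fully qualified
    if policy not in DEFAULT_SUBTYPES:
        raise ValueError(f"no documented default subtype for {policy!r}")
    return f"{policy}/{DEFAULT_SUBTYPES[policy]}"


print(resolve_policy("hybrid"))                # hybrid/knowledge-base
print(resolve_policy("relational/simple-kv"))  # relational/simple-kv
```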

Custom policy parameters (these override the default field configuration):

```bash
xdb col init my-col --policy hybrid/knowledge-base \
  --params '{"fields":{"title":{"findCaps":["match"]}}}'
```

### Listing collections

```bash
xdb col list
```

Outputs JSONL, one collection per line:

```json
{"name":"my-docs","policy":"hybrid/knowledge-base","recordCount":42,"sizeBytes":102400,"embeddingDimension":1536}
```
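
Since each line is an independent JSON object, the listing is easy to consume from a script. The sketch below parses the sample line shown above; the helper name `parse_jsonl` is hypothetical, and the field names are taken from that sample only.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Sample output copied from the example above.
sample = (
    '{"name":"my-docs","policy":"hybrid/knowledge-base",'
    '"recordCount":42,"sizeBytes":102400,"embeddingDimension":1536}\n'
)

for col in parse_jsonl(sample):
    size_kb = col["sizeBytes"] / 1024
    print(f'{col["name"]}: {col["recordCount"]} records, {size_kb:.1f} KB')
```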

### Collection details

```bash
xdb col info my-docs
```

Outputs the collection's full information, including the policy snapshot, field configuration, and record count:

```
name: my-docs
createdAt: 2025-01-15T10:30:00.000Z
policy: hybrid/knowledge-base
engines: hybrid
autoIndex: true
records: 42
size: 100.0 KB
embedDim: 1536
fields:
  content findCaps=[similar, match]
```

`--json` output is also supported:

```bash
xdb col info my-docs --json
```

### Deleting a collection

```bash
xdb col rm my-docs
```

Physically deletes the collection directory and all of its index files.

## Writing Data

### Single write

```bash
# Pass JSON as a positional argument
xdb put my-docs '{"id":"doc1","content":"How to use tar for compression"}'

# Pass JSON via stdin
echo '{"content":"Git branching strategies"}' | xdb put my-docs
```

- The `id` field is optional; a UUID is generated automatically when it is omitted
- A record whose `id` already exists is upserted (the existing record is updated)
- Under the `hybrid/knowledge-base` policy, the `content` field is automatically embedded and full-text indexed
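
The `id` rules above can be illustrated with a toy in-memory model. This is a sketch of the semantics only, not how xdb stores data: the `upsert` helper is hypothetical, and it assumes an update replaces the whole record, which the guide does not spell out.

```python
import uuid


def upsert(store, record):
    """Insert or update `record` in the dict `store`, keyed by id.

    Returns "inserted" or "updated". A toy in-memory model only.
    """
    record = dict(record)  # avoid mutating the caller's dict
    record.setdefault("id", str(uuid.uuid4()))  # auto-generate a missing id
    outcome = "updated" if record["id"] in store else "inserted"
    store[record["id"]] = record  # assumption: the new record replaces the old one
    return outcome


store = {}
print(upsert(store, {"id": "doc1", "content": "How to use tar for compression"}))  # inserted
print(upsert(store, {"id": "doc1", "content": "tar and gzip basics"}))             # updated
print(upsert(store, {"content": "Git branching strategies"}))                      # inserted
print(len(store))                                                                  # 2
```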

### Batch write

```bash
# JSONL format, one JSON object per line
cat data.jsonl | xdb put my-docs --batch
```

In `--batch` mode, xdb:
- Opens a SQLite transaction
- Calls `pai embed --batch` to embed the records in bulk
- Writes the write statistics to stdout

```json
{"inserted":95,"updated":5,"errors":0}
```
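
When the records originate in a program rather than a file, the JSONL payload can be built in memory and piped to the command. The following Python sketch assumes `xdb` is on `PATH` and that stdout carries a single JSON statistics object as in the example above; the helper names `to_jsonl` and `put_batch` are hypothetical.

```python
import json
import subprocess


def to_jsonl(records):
    """Serialize an iterable of dicts as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def put_batch(collection, records):
    """Pipe records to `xdb put <collection> --batch` and return the parsed stats.

    Assumes `xdb` is on PATH and that stdout holds one JSON stats object.
    """
    result = subprocess.run(
        ["xdb", "put", collection, "--batch"],
        input=to_jsonl(records),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


# Only the payload is built here; put_batch() is not called in this sketch.
payload = to_jsonl([{"id": "a", "content": "first"}, {"content": "second"}])
print(payload, end="")
```

A call such as `put_batch("my-docs", records)` would then return a dict with the `inserted`, `updated`, and `errors` counts.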

## Querying Data

### Semantic search (--similar)

Retrieves records by vector similarity. The collection's policy must include a field with the `similar` capability:

```bash
xdb find my-docs "how to compress files" --similar
xdb find my-docs "网络调试工具" --similar --limit 5
```

The query text can also be passed via stdin:

```bash
echo "database optimization" | xdb find my-docs --similar
```

### Full-text search (--match)

Keyword matching based on SQLite FTS5:

```bash
xdb find my-docs "tar compression" --match
```

### Conditional filtering (--where)

A SQL WHERE clause applied to the SQLite `records` table:

```bash
xdb find my-docs --where "json_extract(data, '$.category') = 'network'"
xdb find my-docs --where "json_extract(data, '$.priority') > 5" --limit 20
```
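
To see what such a clause does, it can be tried against a scratch SQLite database. This Python sketch uses a hypothetical schema: only the table name `records` and the JSON column `data` come from this guide, while the `id` column and the sample rows are assumptions. It also assumes the local SQLite build includes the JSON functions, which is the case for most current Python distributions.

```python
import json
import sqlite3

# Hypothetical schema: only the table name `records` and the JSON column
# `data` come from the guide; the `id` column is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, data TEXT)")
rows = [
    {"id": "doc1", "category": "network", "priority": 7},
    {"id": "doc2", "category": "archive", "priority": 3},
    {"id": "doc3", "category": "network", "priority": 2},
]
conn.executemany(
    "INSERT INTO records (id, data) VALUES (?, ?)",
    [(r["id"], json.dumps(r)) for r in rows],
)

# The same clause as in the first example above.
where = "json_extract(data, '$.category') = 'network'"
matches = [
    row[0]
    for row in conn.execute(f"SELECT id FROM records WHERE {where} ORDER BY id")
]
print(matches)  # ['doc1', 'doc3']
```

Because the clause is interpolated into SQL as written, it should come from a trusted source.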

### Combined queries

`--match` and `--where` can be combined:

```bash
xdb find my-docs "compression" --match --where "json_extract(data, '$.category') = 'archive'"
```

### Output format

All query results are output as JSONL. Each line contains the original data plus system metadata:

```json
{"id":"doc1","content":"How to use tar...","category":"archive","_score":0.95,"_engine":"lancedb"}
{"id":"doc2","content":"Gzip compression...","category":"archive","_score":0.87,"_engine":"lancedb"}
```

- `_score`: relevance score (`1/(1+distance)` for semantic search, the FTS5 rank for full-text search)
- `_engine`: the engine the result came from (`lancedb` or `sqlite`)
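
The semantic-search formula maps a distance of 0 to a score of 1 and lets the score fall towards 0 as the distance grows. A small Python sketch of the formula as stated (the function name `similarity_score` is hypothetical, and the distance metric itself is not specified in this guide):

```python
def similarity_score(distance):
    """Map a non-negative distance to a score in (0, 1] via 1 / (1 + distance)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


for d in (0.0, 1.0, 3.0):
    print(d, similarity_score(d))
# 0.0 1.0
# 1.0 0.5
# 3.0 0.25
```

Read in reverse, a `_score` of 0.95 from a semantic search corresponds to a distance of roughly 0.053.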

## Policy Details

### Listing available policies

```bash
xdb policy list
```

Outputs the details of all built-in policies:

```
hybrid/knowledge-base
  engines: LanceDB + SQLite
  fields: content [similar, match]
  autoIndex: yes
relational/structured-logs
  engines: SQLite
  fields: (none)
  autoIndex: yes
relational/simple-kv
  engines: SQLite
  fields: (none)
  autoIndex: no
vector/feature-store
  engines: LanceDB
  fields: tensor [similar]
  autoIndex: no
```

`--json` output is also supported:

```bash
xdb policy list --json
```

### Policy comparison

| Policy | Embedded field | Full-text search | Auto index | Use cases |
|--------|----------------|------------------|------------|-----------|
| `hybrid/knowledge-base` | `content` | `content` | yes | Documents, knowledge bases, command indexes |
| `relational/structured-logs` | - | - | yes | Logs, events, structured data |
| `relational/simple-kv` | - | - | no | Caches, configuration, key-value pairs |
| `vector/feature-store` | `tensor` | - | no | ML feature vectors, embedding storage |

A policy determines:
- Which fields are embedded (via `pai embed`)
- Which fields get a full-text index (FTS5)
- Which search intents the `find` command supports (`--similar`, `--match`)
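
A caller can check up front whether a search intent is available for a collection's policy. The sketch below transcribes the capabilities from the `xdb policy list` output above into a lookup table; the names `POLICY_FIND_CAPS` and `supports` are hypothetical, and the table would need updating for collections created with custom `--params`.

```python
# Capabilities per built-in policy, transcribed from the `xdb policy list`
# output above. Collections created with custom --params may differ.
POLICY_FIND_CAPS = {
    "hybrid/knowledge-base": {"similar", "match"},
    "relational/structured-logs": set(),
    "relational/simple-kv": set(),
    "vector/feature-store": {"similar"},
}


def supports(policy, intent):
    """Return True if some field of `policy` declares the given find capability."""
    return intent in POLICY_FIND_CAPS[policy]


print(supports("hybrid/knowledge-base", "match"))  # True
print(supports("vector/feature-store", "match"))   # False
```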

## Storage Layout

```
~/.local/share/xdb/
└── collections/
    └── my-docs/
        ├── collection_meta.json    # Policy snapshot + metadata
        ├── vector.lance/           # LanceDB vector data
        └── relational.db           # SQLite relational data + FTS
```

Each collection is fully self-contained, so it can be migrated by copying its directory.
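
A migration along these lines could be scripted as follows. This is a sketch under stated assumptions: the helper names `collection_dir` and `copy_collection` are hypothetical, no xdb process is assumed to be writing to the collection during the copy, and the demonstration runs on throwaway directories rather than real xdb data.

```python
import shutil
import tempfile
from pathlib import Path


def collection_dir(name, root=None):
    """Return the directory of collection `name` under the xdb data root."""
    root = Path(root) if root is not None else Path.home() / ".local" / "share" / "xdb"
    return root / "collections" / name


def copy_collection(name, src_root, dst_root):
    """Copy one collection directory from src_root to dst_root."""
    dst = collection_dir(name, dst_root)
    shutil.copytree(collection_dir(name, src_root), dst)
    return dst


# Demonstrate on throwaway directories rather than real xdb data.
with tempfile.TemporaryDirectory() as tmp:
    src_root, dst_root = Path(tmp) / "src", Path(tmp) / "dst"
    src = collection_dir("my-docs", src_root)
    (src / "vector.lance").mkdir(parents=True)
    (src / "collection_meta.json").write_text("{}")
    (src / "relational.db").write_bytes(b"")
    copied = copy_collection("my-docs", src_root, dst_root)
    print(sorted(p.name for p in copied.iterdir()))
    # ['collection_meta.json', 'relational.db', 'vector.lance']
```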

### Using xdb in scripts

```bash
# Write, then query
echo '{"id":"note1","content":"Remember to update DNS records"}' | xdb put notes
# find emits JSONL, so jq receives one object per line
xdb find notes "DNS" --match | jq '.content'

# Bulk-import CSV (converted to JSONL)
cat data.csv | python3 -c "
import csv, json, sys
for row in csv.DictReader(sys.stdin):
    print(json.dumps(row))
" | xdb put my-data --batch
```

### Using xdb from an LLM agent

xdb's JSONL input/output design is a natural fit for agent calls:

```bash
# The agent stores knowledge
xdb put knowledge '{"id":"fact-1","content":"The speed of light is 299792458 m/s","category":"physics"}'

# The agent retrieves knowledge
xdb find knowledge "light speed" --similar --limit 1
```
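
An agent written in Python might wrap these calls in a small helper. The sketch below builds the argument list from the flags shown in this guide (`--similar`, `--match`, `--where`, `--limit`) and parses the JSONL result; the function names `build_find_argv` and `find` are hypothetical, and `find` assumes `xdb` is on `PATH`.

```python
import json
import subprocess


def build_find_argv(collection, query=None, mode=None, where=None, limit=None):
    """Build the argv for `xdb find` from the flags shown in this guide."""
    if mode not in (None, "similar", "match"):
        raise ValueError("mode must be 'similar', 'match', or None")
    argv = ["xdb", "find", collection]
    if query is not None:
        argv.append(query)
    if mode is not None:
        argv.append(f"--{mode}")
    if where is not None:
        argv += ["--where", where]
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


def find(collection, **kwargs):
    """Run `xdb find` and parse its JSONL output (assumes `xdb` is on PATH)."""
    out = subprocess.run(
        build_find_argv(collection, **kwargs),
        capture_output=True, text=True, check=True,
    ).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]


# Only the argv is built here; find() is not called in this sketch.
print(build_find_argv("knowledge", "light speed", mode="similar", limit=1))
# ['xdb', 'find', 'knowledge', 'light speed', '--similar', '--limit', '1']
```

Passing the arguments as a list avoids shell quoting problems with queries that contain spaces or quotes.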

## Exit Codes

| Exit code | Meaning |
|-----------|---------|
| 0 | Success |
| 1 | Runtime error (engine failure, failed `pai` call, etc.) |
| 2 | Argument error / collection does not exist / capability mismatch |
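
A calling script can branch on these codes. The following Python sketch encodes the table and adds one possible retry policy; treating only code 1 as retryable is an assumption made for illustration (code 2 suggests the invocation itself needs fixing), and the names `EXIT_MEANINGS` and `classify_exit` are hypothetical.

```python
# Meanings transcribed from the exit code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "runtime error",
    2: "argument error, missing collection, or capability mismatch",
}


def classify_exit(code):
    """Return (meaning, retryable) for an xdb exit code.

    Assumption: only runtime errors (1) are worth retrying.
    """
    return EXIT_MEANINGS.get(code, "unknown"), code == 1


print(classify_exit(0))  # ('success', False)
print(classify_exit(1))  # ('runtime error', True)
```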
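
## Notes

- Embedding depends on `pai embed`. The first write of data containing a vector field records the embedding dimension, and all later writes must use a model with the same dimension
- Changing the embedding model requires deleting and recreating the collection
- The `--where` clause is applied directly to SQLite; use `json_extract(data, '$.field')` to access JSON fields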
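
The dimension rule in the first note amounts to "the first write fixes the dimension, later writes must match". A toy Python model of that check, not xdb's actual validation code (the class name `DimensionGuard` is hypothetical):

```python
class DimensionGuard:
    """Remember the first embedding dimension seen and reject any other."""

    def __init__(self):
        self.dimension = None

    def check(self, vector):
        if self.dimension is None:
            self.dimension = len(vector)  # the first write fixes the dimension
        elif len(vector) != self.dimension:
            raise ValueError(
                f"embedding dimension {len(vector)} != recorded {self.dimension}"
            )


guard = DimensionGuard()
guard.check([0.1] * 1536)      # first write: records 1536
try:
    guard.check([0.1] * 768)   # a different model: rejected
except ValueError as exc:
    print(exc)  # embedding dimension 768 != recorded 1536
```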
package/dist/cli.d.ts
ADDED