@fugood/buttress-server 2.25.0-beta.6 → 2.25.0-beta.63
- package/README.md +316 -19
- package/config/sample.toml +5 -2
- package/lib/chunk-C7Qqr4sF.mjs +2 -0
- package/lib/index.d.mts +145 -7
- package/lib/index.mjs +823 -55
- package/package.json +7 -4
- package/public/status.html +162 -1
- package/lib/chunk-C8PTHxhX.mjs +0 -2
package/README.md
CHANGED
|
@@ -20,50 +20,347 @@ npx bricks-buttress --config ./config.toml
|
|
|
20
20
|
npx bricks-buttress
|
|
21
21
|
```
|
|
22
22
|
|
|
23
|
+
## Workspace Binding (`bricks buttress`)
|
|
24
|
+
|
|
25
|
+
By default, a buttress-server runs in **public mode**: any client on the LAN can connect, no auth required. To restrict access to a single BRICKS workspace and enable workspace-scoped JWT auth, **bind** the server with the `bricks buttress` CLI commands. Once bound, the server only accepts WebSocket / file-transfer requests carrying a valid access token signed by that workspace's issuer.
|
|
26
|
+
|
|
27
|
+
The `bricks` CLI is the tool that performs the binding and writes the local state file. Install it first — see the [bricks-cli docs](https://docs.bricks.tools/cli) — then `bricks auth login` with the workspace owner's account before running the commands below.
|
|
28
|
+
|
|
29
|
+
### Bind a server to a workspace
|
|
30
|
+
|
|
31
|
+
```bash
|
|
32
|
+
# Pair the local machine's buttress-server with the workspace of the current bricks-cli profile
|
|
33
|
+
bricks buttress bind
|
|
34
|
+
|
|
35
|
+
# Override the auto-detected server id, give it a friendly name, or write to a custom state dir
|
|
36
|
+
bricks buttress bind --server-id buttress-mac-studio --name "Studio LLM" --state-dir /etc/buttress
|
|
37
|
+
|
|
38
|
+
# For headless/remote setups: emit state.json to stdout instead of writing to disk
|
|
39
|
+
bricks buttress bind --print > /etc/buttress/state.json
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
The state file (`~/.bricks-cli/buttress/state.json` by default, or `$BRICKS_BUTTRESS_STATE_DIR`) stores:
|
|
43
|
+
|
|
44
|
+
- `workspace.id` / `workspace.name` — which workspace this server belongs to
|
|
45
|
+
- `workspace.serverId` — the server's stable id (defaults to `buttress-<machineId>`)
|
|
46
|
+
- `workspace.issuerPublicKey` + `workspace.kid` — Ed25519 SPKI used to verify access tokens
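The state file layout listed above can be sketched as a small loader. This is only an illustration, not the server's actual code: the `ButtressState` interface, `resolveStatePath`, and `parseState` are hypothetical names, and it assumes that `BRICKS_BUTTRESS_STATE_DIR` names a directory containing `state.json` and that `workspace.name` may be absent.

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical shape, inferred only from the fields the README lists.
interface ButtressState {
  workspace: {
    id: string;
    name?: string;
    serverId: string;
    issuerPublicKey: string;
    kid: string;
  };
}

// Resolve where state.json is expected to live.
// Assumption: the env var points at a directory, not at the file itself.
function resolveStatePath(
  env: Record<string, string | undefined>,
  home: string = homedir(),
): string {
  const dir = env.BRICKS_BUTTRESS_STATE_DIR || join(home, ".bricks-cli", "buttress");
  return join(dir, "state.json");
}

// Parse the file contents and check that the listed fields are present.
function parseState(json: string): ButtressState {
  const state = JSON.parse(json) as Partial<ButtressState>;
  const ws = state.workspace;
  if (!ws || !ws.id || !ws.serverId || !ws.issuerPublicKey || !ws.kid) {
    throw new Error("state.json is missing required workspace fields");
  }
  return state as ButtressState;
}
```

A caller would read the file at `resolveStatePath(process.env)` and pass its contents to `parseState`; a missing file would then correspond to public mode.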
|
|
47
|
+
|
|
48
|
+
**Restart `bricks-buttress` after binding** for the change to take effect — the state file is read once at startup.
|
|
49
|
+
|
|
50
|
+
### Inspect bindings
|
|
51
|
+
|
|
52
|
+
```bash
|
|
53
|
+
# Show local state.json + the workspace-side bound list
|
|
54
|
+
bricks buttress status
|
|
55
|
+
|
|
56
|
+
# Same, JSON-formatted
|
|
57
|
+
bricks buttress status --json
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
### Discover servers on the LAN
|
|
61
|
+
|
|
62
|
+
```bash
|
|
63
|
+
# UDP scan + HTTP /buttress/info verification (3s timeout by default)
|
|
64
|
+
bricks buttress scan
|
|
65
|
+
|
|
66
|
+
# UDP only (skip the /buttress/info round-trip)
|
|
67
|
+
bricks buttress scan --udp-only
|
|
68
|
+
|
|
69
|
+
# Machine-readable
|
|
70
|
+
bricks buttress scan --json
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
`scan` lists every buttress-server visible on the LAN, including unbound (public) ones, with their version, auth state (`open` vs `JWT required` + kid), bound workspace, and per-generator hardware caps (`score`, GPU, usable memory). Servers whose workspace matches your current `bricks-cli` profile are highlighted; this is purely a discovery command and does not mint any tokens.
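To make the auth-state column concrete, here is a minimal sketch of how a client might summarize one discovered server. The field names follow the `ServerInfo` type in this package's `lib/index.d.mts`; that `/buttress/info` returns exactly this shape is an assumption, and `ServerInfoLite`, `describeAuth`, and `matchesWorkspace` are hypothetical helpers rather than part of the CLI.

```typescript
// Subset of the ServerInfo fields relevant to discovery output.
interface ServerInfoLite {
  id: string;
  name: string;
  version?: string;
  authentication: { required: boolean; type: string; kid?: string; bound?: boolean };
  workspace?: { id: string; name?: string };
}

// Render the auth state roughly the way the scan description words it.
function describeAuth(info: ServerInfoLite): string {
  if (!info.authentication.required) return "open";
  const kid = info.authentication.kid ? ` (kid ${info.authentication.kid})` : "";
  return `JWT required${kid}`;
}

// True when the server is bound to the given workspace id.
function matchesWorkspace(info: ServerInfoLite, workspaceId: string): boolean {
  return info.workspace?.id === workspaceId;
}
```

The input object would typically come from a GET request to the server's `/buttress/info` endpoint.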
|
|
74
|
+
|
|
75
|
+
### Unbind
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
# Remove the binding from the workspace and delete the local state.json
|
|
79
|
+
bricks buttress unbind
|
|
80
|
+
|
|
81
|
+
# Keep the local state file (useful if you only want to revoke server-side)
|
|
82
|
+
bricks buttress unbind --keep-local
|
|
83
|
+
```
|
|
84
|
+
|
|
85
|
+
After unbinding, restart the server to return it to public mode.
|
|
86
|
+
|
|
87
|
+
### Issue a long-lived access token
|
|
88
|
+
|
|
89
|
+
For headless callers (CI, ctor agents) that already hold a workspace token, mint a long-lived buttress access token instead of relying on a per-launcher session token:
|
|
90
|
+
|
|
91
|
+
```bash
|
|
92
|
+
# Default 30-day TTL
|
|
93
|
+
bricks buttress issue-token
|
|
94
|
+
|
|
95
|
+
# Custom TTL (seconds), JSON output for scripting
|
|
96
|
+
bricks buttress issue-token --ttl 3600 --json
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
The token carries the claims `{ k: 'ba', w_id, st: 'ws', sid, jti, exp }`, and any buttress-server bound to the same workspace will accept it.
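As a rough illustration of what accepting such a token could involve, the sketch below verifies an Ed25519-signed compact JWT and checks the claims listed above. It is not the server's implementation. It assumes the token is a three-part JWS signed with EdDSA, that `exp` is in seconds since the epoch, and that `w_id` must equal the bound workspace id; `verifyAccessToken` and `mintDemoToken` are hypothetical names, and `mintDemoToken` exists only to make the example self-contained.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

interface ButtressAccessClaims {
  k: string;
  w_id: string;
  st: string;
  sid: string;
  jti: string;
  exp: number;
}

function verifyAccessToken(
  token: string,
  issuerKey: KeyObject,
  workspaceId: string,
  nowSec: number = Math.floor(Date.now() / 1000),
): ButtressAccessClaims {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed token");
  const [header, payload, signature] = parts;
  const signed = Buffer.from(`${header}.${payload}`, "ascii");
  // Ed25519 has no separate digest step, hence the null algorithm.
  if (!verify(null, signed, issuerKey, Buffer.from(signature, "base64url"))) {
    throw new Error("bad signature");
  }
  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as ButtressAccessClaims;
  if (claims.k !== "ba" || claims.st !== "ws") throw new Error("unexpected token kind");
  if (claims.w_id !== workspaceId) throw new Error("wrong workspace"); // assumed check
  if (claims.exp <= nowSec) throw new Error("token expired");
  return claims;
}

// Demo-only helper: signs claims the way the verifier above expects.
function mintDemoToken(claims: ButtressAccessClaims, privateKey: KeyObject): string {
  const enc = (o: object) => Buffer.from(JSON.stringify(o)).toString("base64url");
  const body = `${enc({ alg: "EdDSA", typ: "JWT" })}.${enc(claims)}`;
  return `${body}.${sign(null, Buffer.from(body), privateKey).toString("base64url")}`;
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const demoToken = mintDemoToken(
  { k: "ba", w_id: "ws_1", st: "ws", sid: "sid_1", jti: "jti_1", exp: 2_000_000_000 },
  privateKey,
);
console.log(verifyAccessToken(demoToken, publicKey, "ws_1").jti);
```

In a real deployment the public key would come from `workspace.issuerPublicKey` in the state file and the token from `bricks buttress issue-token`.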
|
|
100
|
+
|
|
23
101
|
## Configuration
|
|
24
102
|
|
|
25
|
-
Configuration
|
|
26
|
-
- `--config` / `-c` flag with TOML file path
|
|
103
|
+
Configuration is loaded from a TOML file passed via `--config` / `-c`. Every top-level table is optional — missing sections fall back to defaults. See `config/sample.toml` for an end-to-end example.
|
|
27
104
|
|
|
28
|
-
###
|
|
105
|
+
### Top-level sections
|
|
106
|
+
|
|
107
|
+
| Section | Purpose |
|
|
108
|
+
| ------------------------- | -------------------------------------------------------------------------------------------------- |
|
|
109
|
+
| `[env]` | Environment variables exported into the process **only if not already set** |
|
|
110
|
+
| `[server]` | HTTP/RPC listener (port, log level, body limits) |
|
|
111
|
+
| `[runtime]` | Global defaults shared by every generator (most `[generators.model]` keys may live here too) |
|
|
112
|
+
| `[runtime.session_cache]` | KV-cache reuse store — see [Session State Cache](#session-state-cache) |
|
|
113
|
+
| `[autodiscover]` | LAN UDP / HTTP / mDNS discovery toggles |
|
|
114
|
+
| `[openai_compat]` | Enable `/oai-compat/v1/*` — see [Compatibility Endpoints](#compatibility-endpoints-experimental) |
|
|
115
|
+
| `[anthropic_messages]` | Enable `/anthropic-messages` — see [Compatibility Endpoints](#compatibility-endpoints-experimental)|
|
|
116
|
+
| `[[generators]]` | Array of generator instances — one entry per loaded model |
|
|
117
|
+
|
|
118
|
+
### `[env]`
|
|
29
119
|
|
|
30
120
|
```toml
|
|
31
|
-
# Environment variables (only set if not already defined in system)
|
|
32
121
|
[env]
|
|
33
|
-
|
|
122
|
+
HUGGINGFACE_TOKEN = "hf_xxx" # ggml backends read this; HF_TOKEN is not picked up automatically
|
|
34
123
|
CUDA_VISIBLE_DEVICES = "0"
|
|
124
|
+
```
|
|
35
125
|
|
|
36
|
-
[
|
|
37
|
-
|
|
38
|
-
|
|
126
|
+
Values here are exported only when the variable isn't already set in the process — see [Environment Variable Priority](#environment-variable-priority). For HuggingFace auth across all backends, `[runtime] huggingface_token = "hf_xxx"` works regardless of variable name.
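The "only if not already set" rule can be sketched as follows. `applyEnvDefaults` is a hypothetical helper, and treating only `undefined` (not an empty string) as "unset" is an assumption about the server's behavior.

```typescript
// Copy each [env] entry into the target environment unless a value already exists.
// Returns the names that were actually applied.
function applyEnvDefaults(
  defaults: Record<string, string>,
  target: Record<string, string | undefined> = process.env,
): string[] {
  const applied: string[] = [];
  for (const [key, value] of Object.entries(defaults)) {
    if (target[key] === undefined) {
      target[key] = value;
      applied.push(key);
    }
  }
  return applied;
}
```

With this rule, a `CUDA_VISIBLE_DEVICES` value exported in the shell before launch takes precedence over the one in the TOML file.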
|
|
127
|
+
|
|
128
|
+
### `[server]`
|
|
129
|
+
|
|
130
|
+
| Key | Type | Default |
|
|
131
|
+
| ----------------- | -------------- | ---------------------------------------------------------------------- |
|
|
132
|
+
| `id` | string | `buttress-<machineId>` — stable id used for autodiscover / binding |
|
|
133
|
+
| `name` | string | `Buttress Server (<short id>)` — display name |
|
|
134
|
+
| `port` | number | `2080` (overridden by `--port`) |
|
|
135
|
+
| `log_level` | `"debug"`/`"info"`/`"warn"`/`"error"` | unset |
|
|
136
|
+
| `max_body_size` | string\|number | `"50MB"` — e.g. `"100MB"`, `"1GB"`, or raw bytes |
|
|
137
|
+
| `session_timeout` | string\|number | `60000` ms — accepts ms numbers or duration strings (`"30s"`) |
|
|
138
|
+
| `temp_file_dir` | string | `$TMPDIR/.buttress` |
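For the `string|number` values of `max_body_size` and `session_timeout` in the table above, a normalizer might look like the sketch below. The accepted unit set and the 1024-based size multipliers are assumptions; the source only shows `"50MB"`, `"100MB"`, `"1GB"`, and `"30s"` as examples. `parseByteSize` and `parseDurationMs` are hypothetical names.

```typescript
const SIZE_UNITS: Record<string, number> = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 };
const DURATION_UNITS: Record<string, number> = { ms: 1, s: 1000, m: 60_000, h: 3_600_000 };

// Numbers are taken as raw bytes; strings need a unit suffix.
function parseByteSize(value: string | number): number {
  if (typeof value === "number") return value;
  const m = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)$/i.exec(value.trim());
  if (!m) throw new Error(`unrecognized size: ${value}`);
  return Math.round(Number(m[1]) * SIZE_UNITS[m[2].toUpperCase()]);
}

// Numbers are taken as milliseconds; strings need a unit suffix.
function parseDurationMs(value: string | number): number {
  if (typeof value === "number") return value;
  const m = /^(\d+(?:\.\d+)?)\s*(ms|s|m|h)$/.exec(value.trim());
  if (!m) throw new Error(`unrecognized duration: ${value}`);
  return Math.round(Number(m[1]) * DURATION_UNITS[m[2]]);
}
```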
|
|
139
|
+
|
|
140
|
+
### `[runtime]` — global generator defaults
|
|
141
|
+
|
|
142
|
+
Most ggml-llm `[generators.model]` keys can also live in `[runtime]` as defaults. Per-generator values win; otherwise the runtime default applies.
|
|
143
|
+
|
|
144
|
+
| Key | Type | Notes |
|
|
145
|
+
| ---------------------------- | ----------------------------- | ---------------------------------------------------------------------- |
|
|
146
|
+
| `cache_dir` | string | Model + metadata cache root (default `~/.buttress/models`) |
|
|
147
|
+
| `huggingface_token` | string | Falls back to `$HUGGINGFACE_TOKEN` |
|
|
148
|
+
| `http_headers` | table | Extra headers attached to HF / HTTP downloads |
|
|
149
|
+
| `context_release_delay_ms` | number | Idle time before unloading a context (default `10000`; `0` = immediate)|
|
|
150
|
+
| `prefer_variants` | string[] | Override variant probe order (ggml backends) |
|
|
151
|
+
| `n_threads` | number | CPU thread count |
|
|
152
|
+
| `n_ctx` | number | Context window (per-model value wins; auto-capped at training context) |
|
|
153
|
+
| `n_gpu_layers` | number\|`"auto"` | Layers offloaded to GPU (default `"auto"`) |
|
|
154
|
+
| `n_batch` / `n_ubatch` | number | Prompt batch / micro-batch size. **Note:** `n_batch` has a model-level default of `512` that shadows the runtime value unless `[generators.model] n_batch` is set explicitly. |
|
|
155
|
+
| `n_parallel` | number | Parallel sequences (default `4`) |
|
|
156
|
+
| `n_cpu_moe` | number | MoE expert layers offloaded to CPU |
|
|
157
|
+
| `flash_attn_type` | `"on"` / `"off"` / `"auto"` | When a GPU backend is selected, defaults to `"auto"`; on CPU, defaults to `"off"`. Explicit `"on"` / `"off"` / `"auto"` overrides. |
|
|
158
|
+
| `cache_type_k`, `cache_type_v` | string | KV-cache dtype (`f16`, `f32`, `q8_0`, `q4_0`, …) |
|
|
159
|
+
| `kv_unified` | boolean | Use a unified KV cache across sequences |
|
|
160
|
+
| `swa_full` | boolean | Materialize full attention even for sliding-window layers |
|
|
161
|
+
| `ctx_shift` | boolean | Allow llama.cpp's rolling context shift |
|
|
162
|
+
| `use_mmap`, `use_mlock` | boolean | Memory-mapping / locking |
|
|
163
|
+
| `no_extra_bufts` | boolean | Disable extra compute buffer types |
|
|
164
|
+
| `cpu_mask`, `cpu_strict` | string / boolean | CPU affinity (advanced) |
|
|
165
|
+
| `devices` | string[] | Restrict to specific GGML devices |
|
|
166
|
+
| Speculative keys | various | `speculative`, `spec_type`, `spec_draft_n_max/n_min/p_min/p_split` |
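The precedence between `[runtime]` and `[generators.model]`, including the `n_batch` caveat in the table above, can be modeled as a layered merge. This is a simplified sketch of the documented behavior, not the actual resolver; `resolveGgmlLlmOptions` and `MODEL_LEVEL_DEFAULTS` are hypothetical names.

```typescript
type Options = Record<string, unknown>;

// Per the table note, n_batch has a model-level default that sits above [runtime].
const MODEL_LEVEL_DEFAULTS: Options = { n_batch: 512 };

// Later spreads win: runtime defaults, then model-level defaults, then explicit model keys.
function resolveGgmlLlmOptions(runtime: Options, model: Options): Options {
  return { ...runtime, ...MODEL_LEVEL_DEFAULTS, ...model };
}

const resolved = resolveGgmlLlmOptions(
  { n_ctx: 8192, n_batch: 1024, n_parallel: 4 }, // [runtime]
  { n_ctx: 12800 },                              // [generators.model]
);
console.log(resolved);
```

In this model the runtime `n_batch = 1024` is shadowed by the model-level `512`, which is why the note recommends setting `n_batch` under `[generators.model]` when a different value is wanted.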
|
|
167
|
+
|
|
168
|
+
### `[autodiscover]`
|
|
169
|
+
|
|
170
|
+
Set `autodiscover = true` for defaults, `false` (or omit) to disable, or a table for fine control:
|
|
39
171
|
|
|
40
|
-
|
|
41
|
-
|
|
172
|
+
```toml
|
|
173
|
+
[autodiscover]
|
|
174
|
+
udp.port = 8089
|
|
175
|
+
udp.announcements = { enabled = true, interval = 5000 }
|
|
176
|
+
udp.requests = { enabled = true, responseDelay = 100 }
|
|
177
|
+
http.enabled = true
|
|
178
|
+
http.path = "/buttress/info"
|
|
179
|
+
http.cors = true
|
|
180
|
+
# mdns.enabled = false # Bonjour/Avahi advertisement (optional)
|
|
181
|
+
```
|
|
42
182
|
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
183
|
+
### `[[generators]]`
|
|
184
|
+
|
|
185
|
+
Every generator entry has a `type`, an optional `[generators.backend]` table, and a `[generators.model]` table:
|
|
186
|
+
|
|
187
|
+
```toml
|
|
188
|
+
[[generators]]
|
|
189
|
+
type = "ggml-llm" # or "ggml-stt" / "mlx-llm"
|
|
190
|
+
|
|
191
|
+
[generators.backend]
|
|
192
|
+
# (see per-type sections below)
|
|
193
|
+
|
|
194
|
+
[generators.model]
|
|
195
|
+
repo_id = "..."
|
|
196
|
+
# (see per-type sections below)
|
|
197
|
+
```
|
|
198
|
+
|
|
199
|
+
#### Common `[generators.model]` keys
|
|
200
|
+
|
|
201
|
+
Shared by **all** generator types:
|
|
202
|
+
|
|
203
|
+
| Key | Type | Notes |
|
|
204
|
+
| ------------------------- | --------- | -------------------------------------------------------------------------------- |
|
|
205
|
+
| `repo_id` *(required)* | string | HuggingFace repo (`org/repo`) |
|
|
206
|
+
| `revision` | string | Default `"main"` |
|
|
207
|
+
| `download` | boolean | Pre-download at server startup (default `false`) |
|
|
208
|
+
|
|
209
|
+
Additional keys honored by **ggml-llm** and **ggml-stt** (mlx-llm gets quantization from the repo itself and does not use these):
|
|
210
|
+
|
|
211
|
+
| Key | Type | Notes |
|
|
212
|
+
| ------------------------- | --------- | -------------------------------------------------------------------------------- |
|
|
213
|
+
| `filename` | string | Pin a specific artifact in the repo |
|
|
214
|
+
| `url` | string | Direct download URL (skips manifest lookup) |
|
|
215
|
+
| `quantization` | string | Preferred quant tag — e.g. `q4_0`, `q8_0`, `mxfp4` |
|
|
216
|
+
| `preferred_quantizations` | string[] | Ordered fallback list when `quantization` doesn't match (alias: `quantizations`) |
|
|
217
|
+
| `allow_local_file` | boolean | Required to use `local_path` / `mmproj_local_path` |
|
|
218
|
+
| `local_path` | string | Use a local file as the load path. Repo metadata is still resolved from HF, so `repo_id` is still required. |
|
|
219
|
+
| `api_base`, `base_url` | string | Override HF API / blob hosts (mirrors / proxies) |
|
|
220
|
+
|
|
221
|
+
---
|
|
222
|
+
|
|
223
|
+
### `ggml-llm` (llama.cpp via `@fugood/llama.node`)
|
|
224
|
+
|
|
225
|
+
Loads a GGUF LLM. Runtime keys above can be overridden per-generator under `[generators.model]`; `[generators.backend]` only controls backend selection and resource planning.
|
|
226
|
+
|
|
227
|
+
**`[generators.backend]`**
|
|
48
228
|
|
|
49
|
-
|
|
229
|
+
| Key | Type | Default | Notes |
|
|
230
|
+
| --------------------- | -------- | --------------------------------------------- | ---------------------------------------------------------------- |
|
|
231
|
+
| `variant` | string | auto | Force `cuda` / `vulkan` / `snapdragon` / `default` |
|
|
232
|
+
| `variant_preference` | string[] | `["cuda","vulkan","snapdragon","default"]` | Probe order when `variant` is unset |
|
|
233
|
+
| `gpu_memory_fraction` | number | `0.85` | Max GPU fraction the hardware guardrails may plan against |
|
|
234
|
+
| `cpu_memory_fraction` | number | `0.5` | Max RAM fraction for CPU-side buffers |
|
|
235
|
+
|
|
236
|
+
**`[generators.model]`** — in addition to the common keys above:
|
|
237
|
+
|
|
238
|
+
| Key | Type | Notes |
|
|
239
|
+
| ----------------------------------------------------------------------------- | ---------------- | -------------------------------------------------------------------- |
|
|
240
|
+
| `n_ctx` | number | Context window. Auto-capped at the model's training context. |
|
|
241
|
+
| `n_gpu_layers` | number\|`"auto"` | Layers offloaded to GPU (default `"auto"`) |
|
|
242
|
+
| `n_batch` | number | Prompt batch size (default `512`) |
|
|
243
|
+
| `n_ubatch`, `n_threads`, `n_parallel`, `n_cpu_moe` | number | Same semantics as the `[runtime]` defaults |
|
|
244
|
+
| `flash_attn_type`, `cache_type_k`, `cache_type_v`, `kv_unified`, `swa_full`, `ctx_shift`, `use_mmap`, `use_mlock`, `no_extra_bufts`, `cpu_mask`, `cpu_strict`, `devices` | various | Per-model overrides for the `[runtime]` defaults |
|
|
245
|
+
|
|
246
|
+
**Multimodal (mtmd)** — auto-downloads the matching `mmproj-*.gguf` from the same repo and calls `initMultimodal`:
|
|
247
|
+
|
|
248
|
+
| Key | Type | Notes |
|
|
249
|
+
| ------------------------- | ------- | ------------------------------------------------------------------ |
|
|
250
|
+
| `enable_mtmd` | boolean | Default `false` |
|
|
251
|
+
| `mmproj_filename` | string | Pin a specific projector file |
|
|
252
|
+
| `mmproj_url` | string | Direct URL override |
|
|
253
|
+
| `mmproj_local_path` | string | Local projector (requires `allow_local_file = true`) |
|
|
254
|
+
| `mmproj_use_gpu` | boolean | `null` = auto (true when `n_gpu_layers > 0`) |
|
|
255
|
+
| `mmproj_image_min_tokens` | number | Min visual tokens (dynamic-resolution models; `-1` = unset) |
|
|
256
|
+
| `mmproj_image_max_tokens` | number | Max visual tokens (`-1` = unset) |
|
|
257
|
+
|
|
258
|
+
**Speculative decoding**
|
|
259
|
+
|
|
260
|
+
| Key | Type | Notes |
|
|
261
|
+
| -------------------- | ------ | -------------------------------------------------- |
|
|
262
|
+
| `speculative` | string | Draft model identifier |
|
|
263
|
+
| `spec_type` | string | Strategy (backend-defined) |
|
|
264
|
+
| `spec_draft_n_max` | int | Max drafted tokens per step |
|
|
265
|
+
| `spec_draft_n_min` | int | Min drafted tokens |
|
|
266
|
+
| `spec_draft_p_min` | number | Min acceptance probability |
|
|
267
|
+
| `spec_draft_p_split` | number | Split threshold |
|
|
268
|
+
|
|
269
|
+
**Example**
|
|
270
|
+
|
|
271
|
+
```toml
|
|
50
272
|
[[generators]]
|
|
51
273
|
type = "ggml-llm"
|
|
52
274
|
[generators.backend]
|
|
53
275
|
variant_preference = ["cuda", "vulkan", "default"]
|
|
276
|
+
gpu_memory_fraction = 0.95
|
|
54
277
|
[generators.model]
|
|
55
278
|
repo_id = "ggml-org/gpt-oss-20b-GGUF"
|
|
56
279
|
quantization = "mxfp4"
|
|
57
280
|
n_ctx = 12800
|
|
281
|
+
download = true
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
---
|
|
285
|
+
|
|
286
|
+
### `ggml-stt` (whisper.cpp via `@fugood/whisper.node`)
|
|
58
287
|
|
|
59
|
-
|
|
288
|
+
Loads a Whisper GGML model for speech-to-text.
|
|
289
|
+
|
|
290
|
+
**`[generators.backend]`**
|
|
291
|
+
|
|
292
|
+
| Key | Type | Default | Notes |
|
|
293
|
+
| --------------------- | -------- | ----------------------------- | ---------------------------------- |
|
|
294
|
+
| `variant` | string | auto | `cuda` / `vulkan` / `default` |
|
|
295
|
+
| `variant_preference` | string[] | `["cuda","vulkan","default"]` | Probe order |
|
|
296
|
+
| `gpu_memory_fraction` | number | `0.85` | |
|
|
297
|
+
| `cpu_memory_fraction` | number | `0.5` | |
|
|
298
|
+
|
|
299
|
+
**`[generators.model]`** — common keys plus:
|
|
300
|
+
|
|
301
|
+
| Key | Type | Default | Notes |
|
|
302
|
+
| ------------------------- | ----------------------------- | -------------------------------- | ---------------------------------------------------- |
|
|
303
|
+
| `repo_id` | string | `"BricksDisplay/whisper-ggml"` | Defaulted (unlike ggml-llm) |
|
|
304
|
+
| `preferred_quantizations` | string[] | `["q8_0", <no-quant>, "q5_1"]` | Default fallback chain |
|
|
305
|
+
| `use_gpu` | boolean | `true` | Set to `false` to force-disable GPU even when available |
|
|
306
|
+
| `use_flash_attn` | `"on"` / `"off"` / `"auto"` / boolean | `"auto"` | `"auto"` enables flash-attn when GPU is in use. `true`/`false` are accepted as shortcuts for `"on"`/`"off"`. |
|
|
307
|
+
|
|
308
|
+
**Runtime extras** — under `[runtime]` for ggml-stt only:
|
|
309
|
+
|
|
310
|
+
| Key | Type | Notes |
|
|
311
|
+
| ------------- | ------ | ------------------------------------------- |
|
|
312
|
+
| `max_threads` | number | Caps the whisper.cpp thread count |
|
|
313
|
+
|
|
314
|
+
**Example**
|
|
315
|
+
|
|
316
|
+
```toml
|
|
60
317
|
[[generators]]
|
|
61
318
|
type = "ggml-stt"
|
|
62
319
|
[generators.backend]
|
|
63
|
-
variant_preference = ["
|
|
320
|
+
variant_preference = ["cuda", "vulkan", "default"]
|
|
64
321
|
[generators.model]
|
|
65
322
|
repo_id = "BricksDisplay/whisper-ggml"
|
|
66
|
-
filename = "ggml-
|
|
323
|
+
filename = "ggml-large-v3-turbo-q8_0.bin"
|
|
324
|
+
use_gpu = true
|
|
325
|
+
use_flash_attn = "on"
|
|
326
|
+
download = true
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
---
|
|
330
|
+
|
|
331
|
+
### `mlx-llm` (Apple Silicon, Python `mlx-lm` / `mlx-vlm` bridge)
|
|
332
|
+
|
|
333
|
+
Loads an MLX-format model on Apple Silicon. On first use, the backend creates a virtualenv at `{cache_dir}/mlx-env` and installs `mlx_lm_package`, `mlx_vlm_package`, plus `torch` and `torchvision` (required by some VLM processors). If an existing venv already has `mlx_vlm` and `torch` importable, the install step is skipped. There is no `[generators.backend]` section.
|
|
334
|
+
|
|
335
|
+
**`[generators.model]`** — common `repo_id` / `revision` / `download` plus:
|
|
336
|
+
|
|
337
|
+
| Key | Type | Default | Notes |
|
|
338
|
+
| ------------------ | ------------------- | --------- | ----------------------------------------------------------------------------- |
|
|
339
|
+
| `adapter_path` | string | — | Local LoRA adapter directory |
|
|
340
|
+
| `vlm` | `"auto"` / boolean | `"auto"` | Force VLM (`true`) vs text-only (`false`); `"auto"` infers from the repo |
|
|
341
|
+
| `tokenizer_config` | table | — | Forwarded to `mlx_lm.load(..., tokenizer_config=...)` |
|
|
342
|
+
| `model_config` | table | — | Forwarded to `mlx_lm.load(..., model_config=...)` |
|
|
343
|
+
|
|
344
|
+
`quantization`, `filename`, and `preferred_quantizations` are **not** used — the MLX repo itself determines the quantization.
|
|
345
|
+
|
|
346
|
+
**Runtime extras** — under `[runtime]` for mlx-llm:
|
|
347
|
+
|
|
348
|
+
| Key | Type | Default | Notes |
|
|
349
|
+
| ------------------- | ------ | ----------------------------- | -------------------------------------------------------------------- |
|
|
350
|
+
| `mlx_env_dir` | string | `{cache_dir}/mlx-env` | Location of the auto-managed Python venv |
|
|
351
|
+
| `mlx_lm_package` | string | `"mlx-lm==0.31.1"` | pip spec used when provisioning the venv |
|
|
352
|
+
| `mlx_vlm_package` | string | `"mlx-vlm==0.4.0"` | pip spec used when provisioning the venv |
|
|
353
|
+
| `session_cache.*` | table | enabled, `5GB`, 100 entries | Separate cache from ggml-llm (lives in `{cache_dir}/mlx-session-cache`) |
|
|
354
|
+
|
|
355
|
+
**Example**
|
|
356
|
+
|
|
357
|
+
```toml
|
|
358
|
+
[[generators]]
|
|
359
|
+
type = "mlx-llm"
|
|
360
|
+
[generators.model]
|
|
361
|
+
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
|
|
362
|
+
vlm = true
|
|
363
|
+
download = true
|
|
67
364
|
```
|
|
68
365
|
|
|
69
366
|
### Programmatic Usage
|
package/config/sample.toml
CHANGED
|
@@ -11,18 +11,21 @@
|
|
|
11
11
|
# HF_TOKEN = "your_huggingface_token_here"
|
|
12
12
|
# CUDA_VISIBLE_DEVICES = "0"
|
|
13
13
|
|
|
14
|
+
[autodiscover]
|
|
15
|
+
enabled = true
|
|
16
|
+
|
|
14
17
|
[server]
|
|
15
18
|
port = 2080
|
|
16
19
|
log_level = "info"
|
|
17
20
|
# max_body_size = "100MB" # Supports string (e.g., "100MB", "1GB") or number in bytes
|
|
18
21
|
|
|
19
22
|
[openai_compat]
|
|
20
|
-
|
|
23
|
+
enabled = true
|
|
21
24
|
# cors_allowed_origins = ["http://localhost:3000", "https://example.com"] # Restrict to specific origins
|
|
22
25
|
# cors_allowed_origins = "*" # Allow all origins (default)
|
|
23
26
|
|
|
24
27
|
[anthropic_messages]
|
|
25
|
-
|
|
28
|
+
enabled = true
|
|
26
29
|
# cors_allowed_origins = ["http://localhost:3000", "https://example.com"]
|
|
27
30
|
# cors_allowed_origins = "*"
|
|
28
31
|
|
|
package/lib/chunk-C7Qqr4sF.mjs
|
@@ -0,0 +1,2 @@
|
|
|
1
|
+
#!/usr/bin/env node
|
|
2
|
+
var e=Object.defineProperty,t=Object.getOwnPropertyDescriptor,n=Object.getOwnPropertyNames,r=Object.prototype.hasOwnProperty,i=(e,t)=>()=>(e&&(t=e(e=0)),t),a=(e,t)=>()=>(t||e((t={exports:{}}).exports,t),t.exports),o=(t,n)=>{let r={};for(var i in t)e(r,i,{get:t[i],enumerable:!0});return n&&e(r,Symbol.toStringTag,{value:`Module`}),r},s=(i,a,o,s)=>{if(a&&typeof a==`object`||typeof a==`function`)for(var c=n(a),l=0,u=c.length,d;l<u;l++)d=c[l],!r.call(i,d)&&d!==o&&e(i,d,{get:(e=>a[e]).bind(null,d),enumerable:!(s=t(a,d))||s.enumerable});return i},c=t=>r.call(t,`module.exports`)?t[`module.exports`]:s(e({},`__esModule`,{value:!0}),t);export{c as i,i as n,o as r,a as t};
|
package/lib/index.d.mts
CHANGED
|
@@ -1,8 +1,17 @@
|
|
|
1
1
|
|
|
2
2
|
import { AnyElysia, Elysia } from "elysia";
|
|
3
|
+
import * as node_stream_web0 from "node:stream/web";
|
|
3
4
|
import { ReadableStream } from "node:stream/web";
|
|
5
|
+
import crypto from "node:crypto";
|
|
4
6
|
import { EventEmitter } from "node:events";
|
|
7
|
+
import { estimateOnnxRuntimeMemory as estimateRuntimeMemory } from "@fugood/buttress-hardware-guardrails";
|
|
5
8
|
|
|
9
|
+
//#region ../buttress-backend-core/lib/types/test-caps-config.d.ts
|
|
10
|
+
declare function createTestCapsDefaultConfigResolver(defaultConfig: any, type: any): {
|
|
11
|
+
globalDefaults: any;
|
|
12
|
+
resolveDefaultConfig: (modelRepoId: any) => {};
|
|
13
|
+
};
|
|
14
|
+
//#endregion
|
|
6
15
|
//#region ../buttress-backend-core/lib/types/caps.d.ts
|
|
7
16
|
declare function getCapabilities(type: any, currentClientCapabilities?: any, options?: {}): Promise<any>;
|
|
8
17
|
//#endregion
|
|
@@ -100,8 +109,80 @@ declare function testGgmlSttCapabilities({
|
|
|
100
109
|
modelId: string | null;
|
|
101
110
|
defaultConfig: any | null;
|
|
102
111
|
}): Promise<void>;
|
|
112
|
+
//#endregion
|
|
113
|
+
//#region ../buttress-backend-core/lib/types/utils/onnx.d.ts
|
|
114
|
+
declare function resolveModelCacheDir(cacheDir: string, repoId: string, revision?: string): string;
|
|
115
|
+
declare function estimateOnnxModelSize({
|
|
116
|
+
repoId,
|
|
117
|
+
revision,
|
|
118
|
+
modelType,
|
|
119
|
+
dtype,
|
|
120
|
+
cacheDir,
|
|
121
|
+
baseUrl,
|
|
122
|
+
subfolder,
|
|
123
|
+
headers,
|
|
124
|
+
configJson
|
|
125
|
+
}: {
|
|
126
|
+
repoId: string;
|
|
127
|
+
revision?: string;
|
|
128
|
+
modelType?: string;
|
|
129
|
+
dtype?: string | Record<string, string>;
|
|
130
|
+
cacheDir?: string;
|
|
131
|
+
baseUrl?: string;
|
|
132
|
+
subfolder?: string;
|
|
133
|
+
headers?: Record<string, string>;
|
|
134
|
+
configJson?: Record<string, any>;
|
|
135
|
+
}): Promise<{
|
|
136
|
+
totalBytes: number;
|
|
137
|
+
files: Array<{
|
|
138
|
+
name: string;
|
|
139
|
+
dtype: string;
|
|
140
|
+
bytes: number;
|
|
141
|
+
}>;
|
|
142
|
+
warnings: string[];
|
|
143
|
+
source: "local" | "remote";
|
|
144
|
+
}>;
|
|
145
|
+
declare function resolveOnnxDownloadManifest({
|
|
146
|
+
repoId,
|
|
147
|
+
revision,
|
|
148
|
+
modelType,
|
|
149
|
+
dtype,
|
|
150
|
+
cacheDir,
|
|
151
|
+
baseUrl,
|
|
152
|
+
subfolder,
|
|
153
|
+
headers,
|
|
154
|
+
configJson
|
|
155
|
+
}: {
|
|
156
|
+
repoId: string;
|
|
157
|
+
revision?: string;
|
|
158
|
+
modelType?: string;
|
|
159
|
+
dtype?: string | Record<string, string>;
|
|
160
|
+
cacheDir: string;
|
|
161
|
+
baseUrl?: string;
|
|
162
|
+
subfolder?: string;
|
|
163
|
+
headers?: Record<string, string>;
|
|
164
|
+
configJson?: Record<string, any>;
|
|
165
|
+
}): Promise<{
|
|
166
|
+
modelDir: string;
|
|
167
|
+
files: Array<{
|
|
168
|
+
rfilename: string;
|
|
169
|
+
url: string;
|
|
170
|
+
localPath: string;
|
|
171
|
+
size: number;
|
|
172
|
+
}>;
|
|
173
|
+
config: Record<string, any> | null;
|
|
174
|
+
}>;
|
|
175
|
+
declare function startOnnxModelDownload(config: object, globalDownloadManager: object, options?: {
|
|
176
|
+
onProgress?: (p: number) => void;
|
|
177
|
+
onComplete?: (info: object) => void;
|
|
178
|
+
onError?: (err: Error) => void;
|
|
179
|
+
}): Promise<{
|
|
180
|
+
started: boolean;
|
|
181
|
+
localPath: string | null;
|
|
182
|
+
repoId: string | null;
|
|
183
|
+
}>;
|
|
103
184
|
declare namespace index_d_exports {
|
|
104
|
-
export { finalizeGenerator, generatorRegistry, getCapabilities, getModelIdentifier, ggmlLlm, ggmlStt, globalDownloadManager, mlxLlm, showModelsTable, showSttModelsTable, startGenerator, startModelDownload, status, testGgmlLlmCapabilities, testGgmlSttCapabilities };
|
|
185
|
+
export { createTestCapsDefaultConfigResolver, estimateOnnxModelSize, estimateRuntimeMemory, finalizeGenerator, generatorRegistry, getCapabilities, getModelIdentifier, ggmlLlm, ggmlStt, globalDownloadManager, mlxLlm, onnxStt, onnxTts, resolveModelCacheDir, resolveOnnxDownloadManifest, showModelsTable, showSttModelsTable, startGenerator, startModelDownload, startOnnxModelDownload, status, testGgmlLlmCapabilities, testGgmlSttCapabilities };
|
|
105
186
|
}
|
|
106
187
|
declare function startGenerator(type: any, config: any): Promise<{
|
|
107
188
|
id: any;
|
|
@@ -131,8 +212,47 @@ declare namespace mlxLlm {
|
|
|
131
212
|
function applyChatTemplate(id: any, property: any): Promise<any>;
|
|
132
213
|
function releaseContext(id: any, property: any): Promise<any>;
|
|
133
214
|
}
|
|
215
|
+
declare namespace onnxStt {
|
|
216
|
+
function initContext(id: any, property: any): Promise<any>;
|
|
217
|
+
/**
|
|
218
|
+
* @returns {import('node:stream/web').ReadableStream}
|
|
219
|
+
*/
|
|
220
|
+
function transcribe(id: any, property: any): node_stream_web0.ReadableStream;
|
|
221
|
+
function transcribeData(id: any, property: any): Promise<any>;
|
|
222
|
+
function releaseContext(id: any, property: any): Promise<any>;
|
|
223
|
+
}
|
|
224
|
+
declare namespace onnxTts {
|
|
225
|
+
function initContext(id: any, property: any): Promise<any>;
|
|
226
|
+
function addSpeaker(id: any, speaker: any): Promise<any>;
|
|
227
|
+
function synthesize(id: any, property: any): Promise<any>;
|
|
228
|
+
function releaseContext(id: any, property: any): Promise<any>;
|
|
229
|
+
}
|
|
134
230
|
declare namespace status {
|
|
135
|
-
export function getFullStatus():
|
|
231
|
+
export function getFullStatus(): {
|
|
232
|
+
timestamp: string;
|
|
233
|
+
ggmlLlm: any;
|
|
234
|
+
ggmlStt: any;
|
|
235
|
+
mlxLlm: any;
|
|
236
|
+
onnxStt: any;
|
|
237
|
+
onnxTts: {
|
|
238
|
+
generators: {
|
|
239
|
+
id: any;
|
|
240
|
+
type: any;
|
|
241
|
+
refCount: any;
|
|
242
|
+
repoId: any;
|
|
243
|
+
dtype: any;
|
|
244
|
+
provider: any;
|
|
245
|
+
device: any;
|
|
246
|
+
modelBytes: any;
|
|
247
|
+
vocoderRepoId: any;
|
|
248
|
+
pipelines: any;
|
|
249
|
+
}[];
|
|
250
|
+
history: {
|
|
251
|
+
modelLoads: any[];
|
|
252
|
+
syntheses: any[];
|
|
253
|
+
};
|
|
254
|
+
};
|
|
255
|
+
};
|
|
136
256
|
export function getGgmlLlmStatus(): any;
|
|
137
257
|
export function getGgmlSttStatus(): any;
|
|
138
258
|
export function getMlxLlmStatus(): any;
|
|
@@ -183,7 +303,7 @@ declare namespace globalDownloadManager {
|
|
|
183
303
|
* The download will be tracked by the global download manager so that
|
|
184
304
|
* initContext can wait for it if needed.
|
|
185
305
|
*
|
|
186
|
-
* @param {string} type - The generator type ('ggml-llm'
|
|
306
|
+
* @param {string} type - The generator type ('ggml-llm', 'ggml-stt', etc.)
|
|
187
307
|
* @param {Object} config - The generator configuration
|
|
188
308
|
* @param {Object} options - Options for the download
|
|
189
309
|
* @param {function} options.onProgress - Progress callback (0-1)
|
|
@@ -222,7 +342,7 @@ type RuntimeConfig = {
|
|
|
222
342
|
huggingface_token?: string;
|
|
223
343
|
session_cache?: SessionCacheConfig;
|
|
224
344
|
} & Record<string, any>;
|
|
225
|
-
type GeneratorType = 'ggml-llm' | 'ggml-stt' | 'mlx-llm';
|
|
345
|
+
type GeneratorType = 'ggml-llm' | 'ggml-stt' | 'mlx-llm' | 'onnx-stt' | 'onnx-tts';
|
|
226
346
|
type GeneratorConfig = {
|
|
227
347
|
type: GeneratorType;
|
|
228
348
|
} & Record<string, any>;
|
|
@@ -279,24 +399,41 @@ type Config = {
|
|
|
279
399
|
generators: GeneratorConfig[];
|
|
280
400
|
};
|
|
281
401
|
type GeneratorInfo = {
|
|
282
|
-
type: GeneratorType;
|
|
402
|
+
type: GeneratorType; /** Performance score 0–100 from buttress-hardware-guardrails. */
|
|
403
|
+
score?: number; /** Whether the host has an accelerator (GPU/Metal/etc) for this backend. */
|
|
404
|
+
hasGpu?: boolean; /** Usable memory in bytes for this backend (GPU when present, else CPU). */
|
|
405
|
+
usableBytes?: number;
|
|
283
406
|
} & Record<string, any>;
|
|
284
407
|
type ServerInfo = {
|
|
285
408
|
id: string;
|
|
286
409
|
name: string;
|
|
410
|
+
version: string;
|
|
287
411
|
address: string;
|
|
412
|
+
addresses?: string[];
|
|
288
413
|
port: number;
|
|
289
414
|
url: string;
|
|
290
415
|
generators: GeneratorInfo[];
|
|
291
416
|
authentication: {
|
|
292
417
|
required: boolean;
|
|
293
|
-
type: string;
|
|
418
|
+
type: string; /** Issuer key id (when type === 'workspace-jwt'). */
|
|
419
|
+
kid?: string; /** True when buttress is paired with a workspace. */
|
|
420
|
+
bound?: boolean;
|
|
421
|
+
}; /** Workspace identity (only present when paired). */
|
|
422
|
+
workspace?: {
|
|
423
|
+
id: string;
|
|
424
|
+
name?: string;
|
|
294
425
|
};
|
|
295
426
|
};
|
|
296
427
|
//#endregion
|
|
297
428
|
//#region src/autodiscover/types.d.ts
|
|
298
429
|
type GetServerInfoFn = () => ServerInfo;
|
|
299
430
|
//#endregion
|
|
431
|
+
//#region src/autodiscover/sign.d.ts
|
|
432
|
+
interface AnnounceSigner {
|
|
433
|
+
kid: string;
|
|
434
|
+
privateKey: crypto.KeyObject;
|
|
435
|
+
}
|
|
436
|
+
//#endregion
|
|
300
437
|
//#region src/autodiscover/index.d.ts
|
|
301
438
|
/**
|
|
302
439
|
* Autodiscover service that manages discovery transports.
|
|
@@ -306,9 +443,10 @@ type GetServerInfoFn = () => ServerInfo;
|
|
|
306
443
|
declare class AutodiscoverService {
|
|
307
444
|
private config;
|
|
308
445
|
private getServerInfo;
|
|
446
|
+
private signer;
|
|
309
447
|
private transports;
|
|
310
448
|
private started;
|
|
311
|
-
constructor(config: AutodiscoverConfig, getServerInfo: GetServerInfoFn);
|
|
449
|
+
constructor(config: AutodiscoverConfig, getServerInfo: GetServerInfoFn, signer: AnnounceSigner | null);
|
|
312
450
|
start(): Promise<void>;
|
|
313
451
|
stop(): Promise<void>;
|
|
314
452
|
}
|