ollama-agent-router 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 ollama-agent-router contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,366 @@
# ollama-agent-router

`ollama-agent-router` is a local HTTP and CLI gateway for Ollama. It exposes an OpenAI-compatible chat completion endpoint and routes each request to the best configured local model based on task type, queue depth, loaded model state, GPU/VRAM headroom, priority, and sync/async policy.

It is designed for machines that run several Ollama models with different strengths, for example a small triage model, one or more code models, and a larger exclusive reasoning model.

## Architecture

Request flow:

1. `POST /v1/chat/completions` receives an OpenAI-style request, or `POST {server.basePath}/v1/chat/completions` when a base path is configured.
2. The task classifier chooses a task type using deterministic heuristics.
3. The router scores configured candidate models from `routes`.
4. GPU state, loaded Ollama models, busy exclusive models, and queue depth are applied.
5. The request is either run synchronously, accepted as an async job, or rejected.
6. Sync calls are proxied to Ollama `/v1/chat/completions`.
7. Async jobs are held in process memory and executed by per-model `p-queue` queues.

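Steps 3 and 4 of the request flow can be pictured as a single scoring pass over the candidates. The sketch below is purely illustrative: the `Candidate` fields, the weights, and the function names are assumptions for this README, not the package's actual API.

```typescript
// Hypothetical routing sketch: score each candidate from `routes`,
// reward already-loaded models, penalize queue depth, and skip busy
// exclusive models. All names and weights here are illustrative.
interface Candidate {
  name: string;
  baseScore: number;      // static fit for the classified task type
  queueDepth: number;     // pending requests on this model's queue
  loaded: boolean;        // currently loaded per `ollama ps`
  exclusiveBusy: boolean; // exclusive model already serving a request
}

function scoreCandidate(c: Candidate): number {
  if (c.exclusiveBusy) return Number.NEGATIVE_INFINITY; // hard skip
  let score = c.baseScore;
  if (c.loaded) score += 50;  // avoid a cold model load
  score -= c.queueDepth * 10; // prefer idle queues
  return score;
}

function pickModel(candidates: Candidate[]): Candidate | undefined {
  const viable = candidates.filter(
    (c) => scoreCandidate(c) > Number.NEGATIVE_INFINITY,
  );
  viable.sort((a, b) => scoreCandidate(b) - scoreCandidate(a));
  return viable[0];
}
```

If no viable candidate remains (for example, every exclusive model is busy), step 5 decides between queuing an async job and rejecting the request.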
## Quick Start

```bash
npm install
npm run build
npm link
ollama-agent-router configure
ollama-agent-router serve --config ollama-agent-router.yaml
```

Then call:

```bash
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a TypeScript debounce function"}],
    "router": {"allowAsync": true}
  }'
```

## Ollama Setup

Install and start Ollama separately. Pull the models referenced by your config:

```bash
ollama pull qwen2.5-coder:7b
ollama pull deepseek-coder:6.7b
```

The router calls:

```text
POST {ollama.baseUrl}/v1/chat/completions
GET  {ollama.baseUrl}/api/tags
ollama ps
```

## GEX44 Example

`examples/gex44.yaml` is tuned for an RTX 4000 SFF Ada with 20 GB VRAM:

- `B-A-M-N/vibethinker:1.5b` for triage, simple chat, summarize, and classification.
- `qwen2.5-coder:7b` for code generation, fixes, tool use, and fallback agent work.
- `deepseek-coder:6.7b` for code review, generation, fixes, and second opinions.
- `gpt-oss:20b` for agentic reasoning, large context, planning, tool use, and complex debugging.

Start with:

```bash
ollama-agent-router serve --config examples/gex44.yaml
```

## Config Reference

Lookup order:

1. `--config path`
2. `./ollama-agent-router.yaml`
3. `~/.config/ollama-agent-router/config.yaml`
4. `/etc/ollama-agent-router/config.yaml`

Top-level sections:

- `server`: host, port, base path, HTTPS certificates, and JSON body limit.
- `ollama`: base URL, OpenAI-compatible path, native API path, keep-alive, timeout.
- `gpu`: provider, VRAM limits, GPU-only default, NVIDIA monitor command.
- `router`: default mode, heavy-load thresholds, classifier config.
- `jobs`: in-memory store, result TTL, retry count, cleanup interval.
- `models`: model specs including size, purpose, concurrency, exclusivity, tags.
- `routes`: task type to candidate model names.
- `queue`: global queue limits and default priority.

Task types:

```text
triage, simple_chat, summarize, code_generate, code_review, code_fix,
agentic_reasoning, large_context, tool_use, unknown
```

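As an illustration, a `routes` section wiring the GEX44 models to these task types might look like the fragment below. This is a sketch only; consult the bundled examples for the exact schema.

```yaml
routes:
  triage: [B-A-M-N/vibethinker:1.5b]
  simple_chat: [B-A-M-N/vibethinker:1.5b]
  summarize: [B-A-M-N/vibethinker:1.5b]
  code_generate: [qwen2.5-coder:7b, deepseek-coder:6.7b]
  code_review: [deepseek-coder:6.7b, qwen2.5-coder:7b]
  code_fix: [qwen2.5-coder:7b, deepseek-coder:6.7b]
  agentic_reasoning: [gpt-oss:20b]
  large_context: [gpt-oss:20b]
  tool_use: [qwen2.5-coder:7b, gpt-oss:20b]
  unknown: [qwen2.5-coder:7b]
```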
Server options:

```yaml
server:
  host: 127.0.0.1
  port: 11435
  basePath: /
  requestBodyLimit: 8mb
  https:
    enabled: false
    certPath:
    keyPath:
    caPath:
```

Set `server.port` to choose the listening port. Set `server.basePath` to expose every router endpoint under a prefix, for example `/ollama-router`; then chat completions move to `/ollama-router/v1/chat/completions`, health to `/ollama-router/health`, and jobs to `/ollama-router/v1/jobs/{jobId}`.

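The prefixing rule can be made concrete with a small helper. This function is hypothetical (the router applies the prefix internally); it only illustrates how `server.basePath` combines with endpoint paths:

```typescript
// Illustrative only: join server.basePath with an endpoint path.
// A base path of "/" leaves endpoints unchanged; any other prefix is
// prepended, with trailing slashes trimmed so they never double up.
function withBasePath(basePath: string, endpoint: string): string {
  const base = basePath === "/" ? "" : basePath.replace(/\/+$/, "");
  return `${base}${endpoint}`;
}
```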
To run HTTPS directly from the router, set `server.https.enabled: true` and provide PEM certificate and key paths:

```yaml
server:
  host: 0.0.0.0
  port: 11435
  basePath: /ollama-router
  requestBodyLimit: 8mb
  https:
    enabled: true
    certPath: /etc/ollama-agent-router/tls.crt
    keyPath: /etc/ollama-agent-router/tls.key
    caPath:
```

## API Examples

Sync-preferred request:

```json
{
  "model": "auto",
  "messages": [{"role": "user", "content": "Review this TypeScript function"}],
  "router": {
    "mode": "sync",
    "taskType": "code_review",
    "priority": "high",
    "requireGpuOnly": true
  }
}
```

The router returns a normal chat completion payload with an added top-level `router` object:

```json
{
  "id": "chatcmpl_x",
  "object": "chat.completion",
  "model": "deepseek-coder:6.7b",
  "choices": [],
  "router": {
    "mode": "sync",
    "taskType": "code_review",
    "selectedModel": "deepseek-coder:6.7b",
    "fallbackModels": ["qwen2.5-coder:7b"],
    "queueTimeMs": 4,
    "executionTimeMs": 1200,
    "decisionReason": "Selected deepseek-coder:6.7b for code_review with score 250.0"
  }
}
```

Status endpoints:

```bash
curl http://127.0.0.1:11435/health
curl http://127.0.0.1:11435/metrics
curl http://127.0.0.1:11435/v1/router/status
curl http://127.0.0.1:11435/v1/router/models
curl http://127.0.0.1:11435/v1/router/gpu
```

## Async Jobs

When the selected model is busy, or the router detects heavy load, and the request sets `allowAsync: true`, the API returns:

```json
{
  "id": "job_01JABCDEF123",
  "object": "router.job",
  "status": "queued",
  "message": "Heavy load. Job accepted for asynchronous processing."
}
```

Job endpoints:

```bash
curl http://127.0.0.1:11435/v1/jobs/{jobId}
curl http://127.0.0.1:11435/v1/jobs/{jobId}/result
curl -X DELETE http://127.0.0.1:11435/v1/jobs/{jobId}
```

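A client typically polls the first endpoint until the job leaves the queue, then fetches the result. The loop below is a hedged sketch: only `queued` appears in the payload above, so the other status names are assumptions, and the status fetcher is injected rather than tied to any HTTP library.

```typescript
// Hypothetical polling loop for /v1/jobs/{jobId}. "running",
// "completed", and "failed" are assumed status values; only "queued"
// is shown in the job payload above.
type JobStatus = "queued" | "running" | "completed" | "failed";

async function waitForJob(
  fetchStatus: () => Promise<JobStatus>,
  intervalMs = 500,
  maxAttempts = 120,
): Promise<JobStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status === "completed" || status === "failed") return status;
    // still queued or running: wait before the next poll
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("job did not reach a terminal state in time");
}
```

In practice `fetchStatus` would `GET /v1/jobs/{jobId}` and read the `status` field, then the client would `GET /v1/jobs/{jobId}/result` once `waitForJob` resolves.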
## CLI Usage

```bash
ollama-agent-router serve --config examples/gex44.yaml
ollama-agent-router init
ollama-agent-router init --wizard
ollama-agent-router configure
ollama-agent-router configure --detect
ollama-agent-router configure --output ./ollama-agent-router.yaml
ollama-agent-router configure --non-interactive --answers answers.yaml --output config.yaml
ollama-agent-router validate-config --config examples/gex44.yaml
ollama-agent-router status
ollama-agent-router models
ollama-agent-router gpu
ollama-agent-router jobs
ollama-agent-router job {jobId}
ollama-agent-router result {jobId}
ollama-agent-router cancel {jobId}
```

The short alias `oar` is installed with the same commands. For a router mounted under a base path, pass it to client commands:

```bash
ollama-agent-router --url https://127.0.0.1:11435 --base-path /ollama-router status
```

## Configuration Wizard

`ollama-agent-router configure` is a detect-first YAML generator. It tries to detect the local machine and asks mainly for confirmation:

- Ollama binary path and API URL, including `OLLAMA_HOST`
- installed Ollama models from `ollama list`
- loaded model state from `ollama ps`
- Linux NVIDIA GPU/VRAM from `nvidia-smi`
- macOS Apple Silicon/unified-memory defaults
- CPU cores, system RAM, platform, and architecture

Useful modes:

```bash
ollama-agent-router configure --detect
ollama-agent-router configure --dry-run
ollama-agent-router configure --overwrite
ollama-agent-router configure --non-interactive --answers answers.yaml --output config.yaml
```

On Linux with a working NVIDIA stack, the wizard enables GPU monitoring and defaults GPU-only routing to true. On macOS, it does not enable NVIDIA monitoring and defaults GPU-only routing to false because Apple Silicon uses unified memory rather than dedicated NVIDIA VRAM.

When no suitable NVIDIA GPU is detected, the generated config is CPU-friendly: per-model concurrency is capped at `1`, global concurrency defaults to `1`, queue limits are smaller, `router.defaultMode` stays `auto`, and heavy load is triggered after a shallow queue so clients using `allowAsync: true` are moved to async jobs earlier.

Minimal non-interactive answers file:

```yaml
models:
  mode: manual
  items:
    - name: qwen2.5-coder:7b
      role: code
      sizeGb: 4.7
```

## Installation

npm:

```bash
npm install -g ollama-agent-router
ollama-agent-router init
```

Homebrew:

```bash
brew install ExeconOne/tap/ollama-agent-router
```

Debian package from a release asset:

```bash
sudo apt install ./ollama-agent-router_0.1.0_all.deb
```

APT repository from GitHub Pages:

```bash
curl -fsSL https://execonone.github.io/ollama-agent-router/apt/gpg.key \
  | sudo gpg --dearmor -o /usr/share/keyrings/ollama-agent-router.gpg

echo "deb [signed-by=/usr/share/keyrings/ollama-agent-router.gpg] https://execonone.github.io/ollama-agent-router/apt stable main" \
  | sudo tee /etc/apt/sources.list.d/ollama-agent-router.list

sudo apt-get update
sudo apt-get install ollama-agent-router
```

## Development

```bash
npm install
npm run dev -- --config examples/gex44.yaml
npm run build
npm run typecheck
npm test
```

The project uses TypeScript, ESM, Express, zod, pino, p-queue, nanoid, and Vitest.

Design notes:

- CLI configuration wizard HLD: `docs/cli-configurator-hld.md`

## Release Guide

Releases are automated through GitHub Actions when pushing a version tag.
Publishing to npm uses Trusted Publishing/OIDC, so no `NPM_TOKEN` secret is required.

Configure npm Trusted Publishing for the package:

- Provider: GitHub Actions
- Organization or user: `ExeconOne`
- Repository: `ollama-agent-router`
- Workflow filename: `release.yml`
- Environment name: `github-pages`

Required repository secrets:

- `TAP_GITHUB_TOKEN`: GitHub token with write access to `ExeconOne/homebrew-tap`.
- `APT_GPG_PRIVATE_KEY`: ASCII-armored private GPG key.
- `APT_GPG_PASSPHRASE`: passphrase for that key, if any.

GitHub Pages must be enabled with source set to GitHub Actions.
The APT repository is always signed; releases fail if `APT_GPG_PRIVATE_KEY` is not configured.
The release workflow runs npm on Node.js 24 because npm Trusted Publishing requires npm CLI 11.5.1+ and Node.js 22.14.0+.

Release flow:

```bash
npm version patch
git push origin main
git push origin v0.1.1
```

The release workflow will:

1. Run typecheck, tests, and build.
2. Publish the package to npm.
3. Prune dev dependencies and build the `.deb` package with nFPM.
4. Create a GitHub Release with npm tarball and `.deb` assets.
5. Publish a signed APT repository under GitHub Pages at `/apt`.
6. Update `Formula/ollama-agent-router.rb` in `ExeconOne/homebrew-tap`.

Local dry-run before tagging:

```bash
npm run typecheck
npm test
npm run build
npm publish --dry-run
```

## Safety Notes

VRAM accounting is conservative but not perfect. Ollama can split a model across CPU and GPU when VRAM is tight. Set `router.requireGpuOnly` or `gpu.requireGpuOnlyByDefault` to block loaded models whose `ollama ps` processor column indicates a CPU/GPU split.

Large exclusive models should use `exclusive: true` and `maxConcurrent: 1`. Keep `vramSafetyReserveMb` high enough for context growth and system use. If you see CPU fallback, reduce concurrency, reduce context, or route heavy work to async jobs.

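Putting those recommendations together, a model entry for the large GEX44 model might look like the fragment below. It is illustrative; check the config schema for field names beyond `exclusive` and `maxConcurrent`, and for where `vramSafetyReserveMb` belongs.

```yaml
models:
  - name: gpt-oss:20b     # large exclusive reasoning model
    exclusive: true       # never share the GPU with other requests
    maxConcurrent: 1
```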
Async job state is intentionally in memory for v1. Restarting the router clears queued jobs, job history, and results.
package/dist/cli.d.ts ADDED
@@ -0,0 +1 @@
#!/usr/bin/env node