ollama-agent-router 0.1.0
- package/LICENSE +21 -0
- package/README.md +366 -0
- package/dist/cli.d.ts +1 -0
- package/dist/cli.js +1725 -0
- package/dist/cli.js.map +1 -0
- package/dist/index.d.ts +818 -0
- package/dist/index.js +1668 -0
- package/dist/index.js.map +1 -0
- package/docs/cli-configurator-hld.md +932 -0
- package/examples/gex44.yaml +111 -0
- package/package.json +80 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 ollama-agent-router contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,366 @@
# ollama-agent-router

`ollama-agent-router` is a local HTTP and CLI gateway for Ollama. It exposes an OpenAI-compatible chat completion endpoint and routes each request to the best configured local model based on task type, queue depth, loaded model state, GPU/VRAM headroom, priority, and sync/async policy.

It is designed for machines that run several Ollama models with different strengths, for example a small triage model, one or more code models, and a larger exclusive reasoning model.

## Architecture

Request flow (see the sketch after the list):

1. `POST /v1/chat/completions` receives an OpenAI-style request, or `POST {server.basePath}/v1/chat/completions` when a base path is configured.
2. The task classifier chooses a task type using deterministic heuristics.
3. The router scores configured candidate models from `routes`.
4. GPU state, loaded Ollama models, busy exclusive models, and queue depth are applied.
5. The request is either run synchronously, accepted as an async job, or rejected.
6. Sync calls are proxied to Ollama `/v1/chat/completions`.
7. Async jobs are held in process memory and executed by per-model `p-queue` queues.
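
In code, steps 2-5 reduce to a small decision function. The sketch below is illustrative only: the type and function names are invented for this README and are not the package's internal API.

```ts
// Illustrative sketch of the routing decision (steps 2-5); names are hypothetical.
type Decision =
  | { kind: "sync" | "async"; model: string }
  | { kind: "reject"; reason: string };

interface Candidate {
  name: string;
  score: number; // from task-type fit, queue depth, VRAM headroom, priority
  busy: boolean; // exclusive model currently running, or queue saturated
}

function route(
  candidates: Candidate[], // scored from `routes`, best first
  allowAsync: boolean,
  queueDepth: number,
  heavyLoadThreshold: number,
): Decision {
  const best = candidates[0];
  if (!best) return { kind: "reject", reason: "no candidate model for task type" };
  if (!best.busy && queueDepth < heavyLoadThreshold) {
    return { kind: "sync", model: best.name }; // proxied straight to Ollama
  }
  return allowAsync
    ? { kind: "async", model: best.name } // accepted as a router.job
    : { kind: "reject", reason: "heavy load and allowAsync is false" };
}
```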

## Quick Start

```bash
npm install
npm run build
npm link
ollama-agent-router configure
ollama-agent-router serve --config ollama-agent-router.yaml
```

Then call:

```bash
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a TypeScript debounce function"}],
    "router": {"allowAsync": true}
  }'
```

## Ollama Setup

Install and start Ollama separately. Pull the models referenced by your config:

```bash
ollama pull qwen2.5-coder:7b
ollama pull deepseek-coder:6.7b
```

The router calls:

```text
POST {ollama.baseUrl}/v1/chat/completions
GET {ollama.baseUrl}/api/tags
ollama ps
```
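
As a quick connectivity check, the native `/api/tags` endpoint can be queried from Node; the `models` array below is Ollama's documented response shape, while the base URL is whatever `ollama.baseUrl` points at:

```ts
// List the models Ollama has installed, via the native API the router also uses.
// Assumes Ollama's default address; match this to ollama.baseUrl in your config.
const baseUrl = "http://127.0.0.1:11434";

const res = await fetch(`${baseUrl}/api/tags`);
if (!res.ok) throw new Error(`Ollama not reachable: HTTP ${res.status}`);
const { models } = (await res.json()) as { models: { name: string }[] };
console.log(models.map((m) => m.name)); // e.g. [ 'qwen2.5-coder:7b', ... ]
```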

## GEX44 Example

`examples/gex44.yaml` is tuned for an RTX 4000 SFF Ada with 20 GB VRAM:

- `B-A-M-N/vibethinker:1.5b` for triage, simple chat, summarize, and classification.
- `qwen2.5-coder:7b` for code generation, fixes, tool use, and fallback agent work.
- `deepseek-coder:6.7b` for code review, generation, fixes, and second opinions.
- `gpt-oss:20b` for agentic reasoning, large context, planning, tool use, and complex debugging.

Start with:

```bash
ollama-agent-router serve --config examples/gex44.yaml
```

## Config Reference

Lookup order:

1. `--config path`
2. `./ollama-agent-router.yaml`
3. `~/.config/ollama-agent-router/config.yaml`
4. `/etc/ollama-agent-router/config.yaml`

Top-level sections (an illustrative `models`/`routes` pair follows the list):

- `server`: host, port, base path, HTTPS certificates, and JSON body limit.
- `ollama`: base URL, OpenAI-compatible path, native API path, keep-alive, timeout.
- `gpu`: provider, VRAM limits, GPU-only default, NVIDIA monitor command.
- `router`: default mode, heavy-load thresholds, classifier config.
- `jobs`: in-memory store, result TTL, retry count, cleanup interval.
- `models`: model specs including size, purpose, concurrency, exclusivity, tags.
- `routes`: task type to candidate model names.
- `queue`: global queue limits and default priority.
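
As an illustration of how `models` and `routes` fit together, here is a minimal hand-written pair. `exclusive`, `maxConcurrent`, and `sizeGb` appear elsewhere in this README; any other key is an assumption, so treat `examples/gex44.yaml` as the authoritative schema:

```yaml
# Illustrative sketch only -- see examples/gex44.yaml for the real schema.
models:
  - name: qwen2.5-coder:7b
    sizeGb: 4.7
    maxConcurrent: 2
    exclusive: false
  - name: deepseek-coder:6.7b
    sizeGb: 3.8
    maxConcurrent: 1
    exclusive: false
  - name: gpt-oss:20b
    maxConcurrent: 1
    exclusive: true          # large reasoning model runs alone
routes:
  code_review: [deepseek-coder:6.7b, qwen2.5-coder:7b]
  agentic_reasoning: [gpt-oss:20b]
```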

Task types:

```text
triage, simple_chat, summarize, code_generate, code_review, code_fix,
agentic_reasoning, large_context, tool_use, unknown
```
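
The classifier from the architecture section picks one of these values deterministically. A toy version of that kind of keyword heuristic (the real rules ship with the package and differ) could look like:

```ts
// Toy deterministic classifier in the spirit of the real one; patterns invented.
type TaskType =
  | "triage" | "simple_chat" | "summarize" | "code_generate" | "code_review"
  | "code_fix" | "agentic_reasoning" | "large_context" | "tool_use" | "unknown";

function classifyTask(lastUserMessage: string): TaskType {
  const text = lastUserMessage.toLowerCase();
  if (/\b(review|critique)\b/.test(text)) return "code_review";
  if (/\b(fix|bug|error|broken)\b/.test(text)) return "code_fix";
  if (/\b(write|implement|generate)\b/.test(text)) return "code_generate";
  if (/\bsummar/.test(text)) return "summarize";
  if (text.length > 8_000) return "large_context";
  if (text.length < 200) return "simple_chat";
  return "unknown";
}
```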

Server options:

```yaml
server:
  host: 127.0.0.1
  port: 11435
  basePath: /
  requestBodyLimit: 8mb
  https:
    enabled: false
    certPath:
    keyPath:
    caPath:
```

Set `server.port` to choose the listening port. Set `server.basePath` to expose every router endpoint under a prefix, for example `/ollama-router`; then chat completions move to `/ollama-router/v1/chat/completions`, health to `/ollama-router/health`, and jobs to `/ollama-router/v1/jobs/{jobId}`.
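
For example, with `basePath: /ollama-router`, a health probe from Node targets the prefixed path (the body format of `/health` is not specified here, so it is read as text):

```ts
// Probe the router's health endpoint under a configured base path.
const res = await fetch("http://127.0.0.1:11435/ollama-router/health");
console.log(res.status, await res.text());
```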

To run HTTPS directly from the router, set `server.https.enabled: true` and provide PEM certificate and key paths:

```yaml
server:
  host: 0.0.0.0
  port: 11435
  basePath: /ollama-router
  requestBodyLimit: 8mb
  https:
    enabled: true
    certPath: /etc/ollama-agent-router/tls.crt
    keyPath: /etc/ollama-agent-router/tls.key
    caPath:
```
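
Clients then need to trust the certificate. With a self-signed certificate or a private CA, one way to check the endpoint from Node is the built-in `https` module with an explicit `ca` option; the CA path below is a placeholder, not part of the config:

```ts
import { readFileSync } from "node:fs";
import https from "node:https";

// Trust the CA that signed tls.crt; adjust the placeholder path to your setup.
const ca = readFileSync("/etc/ollama-agent-router/ca.crt");

https
  .get({ host: "127.0.0.1", port: 11435, path: "/ollama-router/health", ca }, (res) => {
    console.log("health status:", res.statusCode);
  })
  .on("error", console.error);
```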

## API Examples

Sync-preferred request:

```json
{
  "model": "auto",
  "messages": [{"role": "user", "content": "Review this TypeScript function"}],
  "router": {
    "mode": "sync",
    "taskType": "code_review",
    "priority": "high",
    "requireGpuOnly": true
  }
}
```

The router returns a normal chat completion payload with an added top-level `router` object:

```json
{
  "id": "chatcmpl_x",
  "object": "chat.completion",
  "model": "deepseek-coder:6.7b",
  "choices": [],
  "router": {
    "mode": "sync",
    "taskType": "code_review",
    "selectedModel": "deepseek-coder:6.7b",
    "fallbackModels": ["qwen2.5-coder:7b"],
    "queueTimeMs": 4,
    "executionTimeMs": 1200,
    "decisionReason": "Selected deepseek-coder:6.7b for code_review with score 250.0"
  }
}
```
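
The same round trip from Node (global `fetch`, Node 18+), reading both the completion and the router's decision metadata:

```ts
// Send a sync-preferred request and inspect the added `router` metadata.
const res = await fetch("http://127.0.0.1:11435/v1/chat/completions", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    model: "auto",
    messages: [{ role: "user", content: "Review this TypeScript function" }],
    router: { mode: "sync", taskType: "code_review", priority: "high" },
  }),
});
const body = await res.json();
console.log(body.choices?.[0]?.message?.content);
console.log(body.router?.selectedModel, body.router?.decisionReason);
```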

Status endpoints:

```bash
curl http://127.0.0.1:11435/health
curl http://127.0.0.1:11435/metrics
curl http://127.0.0.1:11435/v1/router/status
curl http://127.0.0.1:11435/v1/router/models
curl http://127.0.0.1:11435/v1/router/gpu
```

## Async Jobs

When a selected model is busy or the router detects heavy load and `allowAsync=true`, the API returns:

```json
{
  "id": "job_01JABCDEF123",
  "object": "router.job",
  "status": "queued",
  "message": "Heavy load. Job accepted for asynchronous processing."
}
```

Job endpoints:

```bash
curl http://127.0.0.1:11435/v1/jobs/{jobId}
curl http://127.0.0.1:11435/v1/jobs/{jobId}/result
curl -X DELETE http://127.0.0.1:11435/v1/jobs/{jobId}
```
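
A client that was handed a job id can poll these endpoints until the completion is ready. The sketch below assumes the `result` endpoint returns a non-2xx status until the job finishes, which is an inference from the endpoint list rather than a documented contract:

```ts
// Poll an async job until its result is available, then return the payload.
async function waitForResult(jobId: string, base = "http://127.0.0.1:11435") {
  for (;;) {
    const result = await fetch(`${base}/v1/jobs/${jobId}/result`);
    if (result.ok) return result.json(); // finished chat completion payload
    const job = await (await fetch(`${base}/v1/jobs/${jobId}`)).json();
    console.log("job status:", job.status); // "queued" while waiting
    await new Promise((r) => setTimeout(r, 1000)); // back off for a second
  }
}
```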

## CLI Usage

```bash
ollama-agent-router serve --config examples/gex44.yaml
ollama-agent-router init
ollama-agent-router init --wizard
ollama-agent-router configure
ollama-agent-router configure --detect
ollama-agent-router configure --output ./ollama-agent-router.yaml
ollama-agent-router configure --non-interactive --answers answers.yaml --output config.yaml
ollama-agent-router validate-config --config examples/gex44.yaml
ollama-agent-router status
ollama-agent-router models
ollama-agent-router gpu
ollama-agent-router jobs
ollama-agent-router job {jobId}
ollama-agent-router result {jobId}
ollama-agent-router cancel {jobId}
```

The short alias `oar` is installed with the same commands. For a router mounted under a base path, pass it to client commands:

```bash
ollama-agent-router --url https://127.0.0.1:11435 --base-path /ollama-router status
```

## Configuration Wizard

`ollama-agent-router configure` is a detect-first YAML generator. It tries to detect the local machine and asks mainly for confirmation:

- Ollama binary path and API URL, including `OLLAMA_HOST`
- installed Ollama models from `ollama list`
- loaded model state from `ollama ps`
- Linux NVIDIA GPU/VRAM from `nvidia-smi`
- macOS Apple Silicon/unified-memory defaults
- CPU cores, system RAM, platform, and architecture

Useful modes:

```bash
ollama-agent-router configure --detect
ollama-agent-router configure --dry-run
ollama-agent-router configure --overwrite
ollama-agent-router configure --non-interactive --answers answers.yaml --output config.yaml
```

On Linux with a working NVIDIA stack, the wizard enables GPU monitoring and defaults GPU-only routing to true. On macOS, it does not enable NVIDIA monitoring and defaults GPU-only routing to false, because Apple Silicon uses unified memory rather than dedicated NVIDIA VRAM.

When no suitable NVIDIA GPU is detected, the generated config is CPU-friendly: per-model concurrency is capped at `1`, global concurrency defaults to `1`, queue limits are smaller, `router.defaultMode` stays `auto`, and heavy load is triggered after a shallow queue so clients using `allowAsync: true` are moved to async jobs earlier.

Minimal non-interactive answers file:

```yaml
models:
  mode: manual
  items:
    - name: qwen2.5-coder:7b
      role: code
      sizeGb: 4.7
```

## Installation

npm:

```bash
npm install -g ollama-agent-router
ollama-agent-router init
```

Homebrew:

```bash
brew install ExeconOne/tap/ollama-agent-router
```

Debian package from a release asset:

```bash
sudo apt install ./ollama-agent-router_0.1.0_all.deb
```

APT repository from GitHub Pages:

```bash
curl -fsSL https://execonone.github.io/ollama-agent-router/apt/gpg.key \
  | sudo gpg --dearmor -o /usr/share/keyrings/ollama-agent-router.gpg

echo "deb [signed-by=/usr/share/keyrings/ollama-agent-router.gpg] https://execonone.github.io/ollama-agent-router/apt stable main" \
  | sudo tee /etc/apt/sources.list.d/ollama-agent-router.list

sudo apt-get update
sudo apt-get install ollama-agent-router
```

## Development

```bash
npm install
npm run dev -- --config examples/gex44.yaml
npm run build
npm run typecheck
npm test
```

The project uses TypeScript, ESM, Express, zod, pino, p-queue, nanoid, and Vitest.

Design notes:

- CLI configuration wizard HLD: `docs/cli-configurator-hld.md`

## Release Guide

Releases are automated through GitHub Actions when pushing a version tag.
Publishing to npm uses Trusted Publishing/OIDC, so no `NPM_TOKEN` secret is required.

Configure npm Trusted Publishing for the package:

- Provider: GitHub Actions
- Organization or user: `ExeconOne`
- Repository: `ollama-agent-router`
- Workflow filename: `release.yml`
- Environment name: `github-pages`

Required repository secrets:

- `TAP_GITHUB_TOKEN`: GitHub token with write access to `ExeconOne/homebrew-tap`.
- `APT_GPG_PRIVATE_KEY`: ASCII-armored private GPG key.
- `APT_GPG_PASSPHRASE`: passphrase for that key, if any.

GitHub Pages must be enabled with source set to GitHub Actions.
The APT repository is always signed; releases fail if `APT_GPG_PRIVATE_KEY` is not configured.
The release workflow runs npm on Node.js 24 because npm Trusted Publishing requires npm CLI 11.5.1+ and Node.js 22.14.0+.

Release flow:

```bash
npm version patch
git push origin main
git push origin v0.1.1
```

The release workflow will:

1. Run typecheck, tests, and build.
2. Publish the package to npm.
3. Prune dev dependencies and build the `.deb` package with nFPM.
4. Create a GitHub Release with npm tarball and `.deb` assets.
5. Publish a signed APT repository under GitHub Pages at `/apt`.
6. Update `Formula/ollama-agent-router.rb` in `ExeconOne/homebrew-tap`.

Local dry-run before tagging:

```bash
npm run typecheck
npm test
npm run build
npm publish --dry-run
```

## Safety Notes

VRAM accounting is conservative but not perfect. Ollama can split a model across CPU and GPU when VRAM is tight. Set `router.requireGpuOnly` or `gpu.requireGpuOnlyByDefault` to block loaded models whose `ollama ps` processor column indicates a CPU/GPU split.
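
The same check can be reproduced by hand: shell out to `ollama ps` and look at the processor column. The column layout here is assumed from typical `ollama ps` output, where fully resident models report `100% GPU`:

```ts
import { execFileSync } from "node:child_process";

// Warn about loaded models that are split across CPU and GPU, per `ollama ps`.
const lines = execFileSync("ollama", ["ps"], { encoding: "utf8" })
  .trim()
  .split("\n")
  .slice(1); // drop the header row

for (const line of lines) {
  const name = line.split(/\s+/)[0];
  if (!/100%\s+GPU/.test(line)) {
    console.warn(`${name} is not fully on GPU; check VRAM headroom`);
  }
}
```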

Large exclusive models should use `exclusive: true` and `maxConcurrent: 1`. Keep `vramSafetyReserveMb` high enough for context growth and system use. If you see CPU fallback, reduce concurrency, reduce context, or route heavy work to async jobs.

Async job state is intentionally in memory for v1. Restarting the router clears queued jobs, job history, and results.
package/dist/cli.d.ts
ADDED
@@ -0,0 +1 @@
#!/usr/bin/env node