ollama-agent-router 0.1.7 → 0.1.9

package/README.md CHANGED
@@ -94,6 +94,221 @@ Start with:
94
94
  ollama-agent-router serve --config examples/gex44.yaml
95
95
  ```
96
96
 
97
+ `examples/gex44-secured.yaml` is the same hardware profile with the standalone plane locked down: API key required, anonymous access rejected, per-key rate limits, and the admin plane enabled on localhost. Use it as a starting point when the router is exposed beyond a single user or process.
98
+
99
+ ## Routing Algorithm
100
+
101
+ ### Candidate selection
102
+
103
+ For every request the router builds a candidate list from three sources, merged in order:
104
+
105
+ 1. `router.preferredModels` from the request — added first, regardless of `routes`.
106
+ 2. `routes[taskType]` — the ordered list for the classified task type.
107
+ 3. Any model whose `purpose` array contains the task type. This acts as a catch-all fallback. (The `tags` field is removed from the model schema and from candidate matching in this release.)
108
+
109
+ Models listed in `router.forbiddenModels` are dropped from the candidate list entirely.
110
+
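The merge order described above can be sketched roughly as follows. This is an illustration, not the package's code: `buildCandidates` and the shape of `ctx` are hypothetical names, while the loop order follows the `RoutingEngine` hunk in `dist/cli.js` further down, which in 0.1.9 matches on `purpose` only.

```javascript
// Hypothetical sketch: merge candidates in the documented order, then drop forbidden ones.
// A Set keeps first-insertion order, so earlier sources win on position.
function buildCandidates(config, ctx) {
  const names = new Set();
  for (const name of ctx.preferredModels) names.add(name);              // 1. request preferences
  for (const name of config.routes[ctx.taskType] ?? []) names.add(name); // 2. routes[taskType]
  for (const model of config.models) {                                   // 3. purpose match
    if (model.purpose.includes(ctx.taskType)) names.add(model.name);
  }
  for (const name of ctx.forbiddenModels) names.delete(name);            // forbidden models are removed
  return [...names];
}
```

Because the list is a set, a model named in several sources appears once, at the position of its first source.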
111
+ ### Blocking checks
112
+
113
+ Before scoring, each candidate is checked for hard blocks:
114
+
115
+ - **`gpu_only`** — `requireGpuOnly` is set (globally or per-request) and the model is not fully on GPU, has a CPU/GPU split in `ollama ps`, or there is not enough free VRAM to load it.
116
+ - **`busy`** — the model has `exclusive: true` and is already running, or `allowWhenBusy: false` and has reached `maxConcurrent`.
117
+
118
+ Blocked models are excluded from sync selection but can still be picked for async jobs.
119
+
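A minimal sketch of the two hard blocks, assuming a simplified view of runtime state. The names `blockReason`, `state.running`, `state.gpuResident`, and `state.freeMb` are hypothetical; how the router actually reads `ollama ps` is not shown in this diff.

```javascript
// Hypothetical sketch of the blocking checks; returns a block label or null.
function blockReason(model, state, opts) {
  const running = state.running[model.name] ?? 0;

  if (opts.requireGpuOnly) {
    const requiredMb = model.sizeGb * 1024 + opts.vramSafetyReserveMb;
    // Assumed encoding: true = fully on GPU, false = CPU or split, undefined = not loaded.
    const resident = state.gpuResident[model.name];
    const fitsOnGpu = resident === true || (resident === undefined && state.freeMb >= requiredMb);
    if (!fitsOnGpu) return "gpu_only";
  }

  // busy: exclusive model already running, or concurrency cap reached with allowWhenBusy off.
  if (model.exclusive && running > 0) return "busy";
  if (!model.allowWhenBusy && running >= model.maxConcurrent) return "busy";
  return null;
}
```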
120
+ ### Scoring
121
+
122
+ Every non-blocked candidate receives a numeric score. Higher score wins. Starting value: **100**.
123
+
124
+ | Component | Delta | Notes |
125
+ |---|---|---|
126
+ | Route position | `+50` for index 0, `−8` per step, floored at `0` | First entry in `routes[taskType]` gets the full bonus |
127
+ | `model.priority` | `+priority` | Set per model, 1–100 |
128
+ | `purpose` match | `+25` | Model's `purpose` array contains the task type |
129
+ | `preferredModels` | `+80` for index 0, `−10` per step | Request-level override |
130
+ | Already loaded in Ollama | **`+20`** | Model appears in `ollama ps` output |
131
+ | Heavy complexity + `costClass: high` | `+20` | Classifier returned `heavy`; rewards large models |
132
+ | Light complexity + `costClass: low` | `+15` | Classifier returned `light`; rewards small models |
133
+ | Free VRAM headroom | `+0..+25` | Scales with `(freeMb − requiredMb) / 512`, capped at 25 |
134
+ | Insufficient VRAM | **`−60`** | `model.sizeGb × 1024 + vramSafetyReserveMb > freeMb` |
135
+ | Queue depth | `−18 × queueDepth` | Per-model queue length |
136
+ | Running count | `−25 × running` | Per-model active executions |
137
+ | Exclusive + running | `−80 × running` additional | `exclusive: true` models penalised heavily while in use |
138
+
139
+ The candidate with the highest score is selected. The others appear in `fallbackModels` in the response.
140
+
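To make the table easier to check by hand, here is a sketch that restates it as one function. The route, priority, purpose, preferred, loaded, and heavy-complexity terms match the `RoutingEngine` lines visible in the `dist/cli.js` hunk below; the VRAM and load terms are taken from the table only, and the `ctx` field names are hypothetical.

```javascript
// Hypothetical restatement of the scoring table; not the package's actual function.
function scoreCandidate(model, ctx) {
  let score = 100; // starting value

  const routeIndex = ctx.routeNames.indexOf(model.name);
  if (routeIndex >= 0) score += Math.max(0, 50 - routeIndex * 8); // assumed 0 when not in the route

  score += model.priority;
  if (model.purpose.includes(ctx.taskType)) score += 25;

  const preferredIndex = ctx.preferredModels.indexOf(model.name);
  if (preferredIndex >= 0) score += 80 - preferredIndex * 10;

  if (ctx.loadedModels.includes(model.name)) score += 20;
  if (ctx.complexity === "heavy" && model.costClass === "high") score += 20;
  if (ctx.complexity === "light" && model.costClass === "low") score += 15;

  // VRAM: penalty when the model does not fit, otherwise a headroom bonus capped at 25.
  const requiredMb = model.sizeGb * 1024 + ctx.vramSafetyReserveMb;
  if (requiredMb > ctx.freeMb) score -= 60;
  else score += Math.min(25, (ctx.freeMb - requiredMb) / 512);

  const running = ctx.running[model.name] ?? 0;
  score -= 18 * (ctx.queueDepth[model.name] ?? 0);
  score -= 25 * running;
  if (model.exclusive) score -= 80 * running;
  return score;
}
```

Under these assumptions, one running job on an `exclusive` model costs 105 points in total, which is more than the route-position and purpose bonuses combined.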
141
+ ### Model config fields that affect routing
142
+
143
+ ```yaml
144
+ models:
145
+ - name: gpt-oss:20b
146
+ sizeGb: 14.0 # used for VRAM headroom calculation
147
+ purpose: [agentic_reasoning, large_context, planning, tool_use, complex_debugging]
148
+ # +25 score when task type matches; also adds model to the candidate list
149
+ priority: 95 # added directly to score; use to rank models of similar capability
150
+ maxConcurrent: 1 # hard cap on parallel executions
151
+ costClass: high # low | medium | high; matched against request complexity for a score bonus
152
+ exclusive: true # if running, gets −80 extra penalty per execution; only one at a time
153
+ allowWhenBusy: false # if false and maxConcurrent reached → blocked entirely
154
+ ```
155
+
156
+ **`purpose`**: declares what the model can do. If the array contains the request's task type, the model gets `+25` (applied once per request) and becomes a candidate even when it is not listed in `routes[taskType]`. Use it for every task type the model handles well, including secondary ones (e.g. add `agentic_reasoning` to a coder model that works as a capable fallback).
157
+
158
+ **`costClass`** — signals the relative weight of the model:
159
+ - `high`: gets `+20` when the classifier decides the request is complex (`heavy`). Intended for large reasoning models.
160
+ - `low`: gets `+15` when the request is simple (`light`). Intended for small triage/chat models.
161
+ - `medium`: no complexity bonus in either direction.
162
+
163
+ **`exclusive`** — intended for large models that cannot safely share GPU memory with another concurrent execution. While one request is running, the model accumulates `−80` per running job on top of the standard `−25`, making it effectively unselectable for sync requests until free.
164
+
165
+ ### `routes` config and its relation to scoring
166
+
167
+ ```yaml
168
+ routes:
169
+ agentic_reasoning: [gpt-oss:20b, qwen2.5-coder:7b]
170
+ ```
171
+
172
+ Order matters: `gpt-oss:20b` at index 0 gets `+50`, `qwen2.5-coder:7b` at index 1 gets `+42`. Each additional position costs `−8`.
173
+
174
+ A model does not need to be in `routes` to be selected: if it declares the task type in `purpose`, it still enters the candidate list (with a route-position score of 0).
175
+
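A small worked example of the route-position term, assuming the `Math.max(0, 50 - routeIndex * 8)` expression shown in the `dist/cli.js` hunk and a bonus of 0 for models outside the route (as stated above). `routeBonus` is a hypothetical helper name.

```javascript
// Hypothetical helper: route-position bonus for a model within routes[taskType].
function routeBonus(routeNames, modelName) {
  const index = routeNames.indexOf(modelName);
  return index < 0 ? 0 : Math.max(0, 50 - index * 8);
}

const route = ["gpt-oss:20b", "qwen2.5-coder:7b"];
console.log(routeBonus(route, "gpt-oss:20b"));      // 50
console.log(routeBonus(route, "qwen2.5-coder:7b")); // 42
console.log(routeBonus(route, "llama3.2:3b"));      // 0 (not in the route)
```

With this formula the bonus reaches 0 at index 7, so ordering beyond the seventh entry no longer affects the score.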
176
+ ### Sync vs async decision
177
+
178
+ After scoring, the router checks whether to run synchronously or push to the async queue:
179
+
180
+ 1. If `router.mode: async` — always async.
181
+ 2. If heavy load is detected (total queue depth ≥ `router.heavyLoadQueueDepth` **or** free VRAM < `router.heavyLoadGpuFreeMbThreshold`) and `allowAsync: true` — async.
182
+ 3. If the top-scored model is busy and `allowAsync: true` — async on that model.
183
+ 4. Otherwise — sync on the top-scored model.
184
+
185
+ `allowAsync` defaults to `true`. Set `"router": {"mode": "sync"}` in the request to force synchronous execution regardless of load.
186
+
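The four rules above can be sketched as a single decision function. This is an illustration under assumptions: `decideMode` and its parameter shapes are hypothetical, and treating a request-level `mode: "sync"` as an early return is one reading of the sentence above.

```javascript
// Hypothetical sketch of the sync/async decision.
function decideMode(router, load, limits, topModelBusy) {
  const allowAsync = router.allowAsync ?? true; // documented default
  if (router.mode === "async") return "async";  // rule 1
  if (router.mode === "sync") return "sync";    // forced sync regardless of load

  const heavyLoad =
    load.totalQueueDepth >= limits.heavyLoadQueueDepth ||
    load.freeMb < limits.heavyLoadGpuFreeMbThreshold;
  if (heavyLoad && allowAsync) return "async";     // rule 2
  if (topModelBusy && allowAsync) return "async";  // rule 3
  return "sync";                                   // rule 4
}
```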
187
+ ### Forcing a specific model
188
+
189
+ `preferredModels` adds `+80` to the first entry (and `10` less for each later entry), which usually makes it win unless it is blocked by the GPU-only or busy checks or outweighed by load penalties. `forbiddenModels` removes models from the candidate list entirely, which is useful when testing a specific model in isolation.
190
+
191
+ ### Request examples
192
+
193
+ **Explicit task type — let the router pick the best model for the task:**
194
+
195
+ ```bash
196
+ curl -s http://127.0.0.1:11435/v1/chat/completions \
197
+ -H 'content-type: application/json' \
198
+ -H 'authorization: Bearer <api-key>' \
199
+ -d '{
200
+ "model": "auto",
201
+ "messages": [{"role": "user", "content": "Plan a multi-service refactor"}],
202
+ "router": {
203
+ "taskType": "agentic_reasoning"
204
+ }
205
+ }'
206
+ ```
207
+
208
+ **Explicit task type with async fallback on heavy load:**
209
+
210
+ ```bash
211
+ curl -s http://127.0.0.1:11435/v1/chat/completions \
212
+ -H 'content-type: application/json' \
213
+ -H 'authorization: Bearer <api-key>' \
214
+ -d '{
215
+ "model": "auto",
216
+ "messages": [{"role": "user", "content": "Plan a multi-service refactor"}],
217
+ "router": {
218
+ "taskType": "agentic_reasoning",
219
+ "allowAsync": true
220
+ }
221
+ }'
222
+ ```
223
+
224
+ Returns `202` with a job id when load is high; `200` with the result when run synchronously.
225
+
226
+ **Force a specific model, block all others:**
227
+
228
+ ```bash
229
+ curl -s http://127.0.0.1:11435/v1/chat/completions \
230
+ -H 'content-type: application/json' \
231
+ -H 'authorization: Bearer <api-key>' \
232
+ -d '{
233
+ "model": "auto",
234
+ "messages": [{"role": "user", "content": "Review this PR diff"}],
235
+ "router": {
236
+ "taskType": "code_review",
237
+ "preferredModels": ["gpt-oss:20b"],
238
+ "forbiddenModels": ["qwen2.5-coder:7b", "deepseek-coder:6.7b"]
239
+ }
240
+ }'
241
+ ```
242
+
243
+ **Force sync, no async fallback even under load:**
244
+
245
+ ```bash
246
+ curl -s http://127.0.0.1:11435/v1/chat/completions \
247
+ -H 'content-type: application/json' \
248
+ -H 'authorization: Bearer <api-key>' \
249
+ -d '{
250
+ "model": "auto",
251
+ "messages": [{"role": "user", "content": "Fix the off-by-one error"}],
252
+ "router": {
253
+ "taskType": "code_fix",
254
+ "mode": "sync",
255
+ "allowAsync": false
256
+ }
257
+ }'
258
+ ```
259
+
260
+ **High priority request — jumps ahead in the queue:**
261
+
262
+ ```bash
263
+ curl -s http://127.0.0.1:11435/v1/chat/completions \
264
+ -H 'content-type: application/json' \
265
+ -H 'authorization: Bearer <api-key>' \
266
+ -d '{
267
+ "model": "auto",
268
+ "messages": [{"role": "user", "content": "Summarize this log"}],
269
+ "router": {
270
+ "taskType": "summarize",
271
+ "priority": "high"
272
+ }
273
+ }'
274
+ ```
275
+
276
+ **GPU-only — reject if model would run on CPU or with a CPU/GPU split:**
277
+
278
+ ```bash
279
+ curl -s http://127.0.0.1:11435/v1/chat/completions \
280
+ -H 'content-type: application/json' \
281
+ -H 'authorization: Bearer <api-key>' \
282
+ -d '{
283
+ "model": "auto",
284
+ "messages": [{"role": "user", "content": "Generate a REST API scaffold"}],
285
+ "router": {
286
+ "taskType": "code_generate",
287
+ "requireGpuOnly": true
288
+ }
289
+ }'
290
+ ```
291
+
292
+ Returns `503` if no GPU-only candidate is available.
293
+
294
+ **Check what the router decided** — every `200` response includes a `router` object:
295
+
296
+ ```json
297
+ {
298
+ "router": {
299
+ "mode": "sync",
300
+ "taskType": "agentic_reasoning",
301
+ "selectedModel": "gpt-oss:20b",
302
+ "fallbackModels": ["gpt-oss:20b", "qwen2.5-coder:7b"],
303
+ "queueTimeMs": 3,
304
+ "executionTimeMs": 8420,
305
+ "decisionReason": "Selected gpt-oss:20b for agentic_reasoning with score 290.0"
306
+ }
307
+ }
308
+ ```
309
+
310
+ `decisionReason` includes the winning score, which helps diagnose unexpected model selection — compare it against the scoring table above to see which component tipped the balance.
311
+
97
312
  ## Config Reference
98
313
 
99
314
  Lookup order:
@@ -252,17 +467,22 @@ This prevents the admin API from changing the rules that protect itself. When `a
252
467
 
253
468
  Admin API:
254
469
 
470
+ **Read the current managed access config**
471
+
255
472
  ```bash
256
473
  curl http://127.0.0.1:11435/v1/admin/access/config \
257
474
  -H 'authorization: Bearer admin-secret'
475
+ ```
476
+
477
+ **Replace the entire managed access config** (planes + all keys at once)
258
478
 
479
+ ```bash
259
480
  curl -X PUT http://127.0.0.1:11435/v1/admin/access/config \
260
481
  -H 'authorization: Bearer admin-secret' \
261
482
  -H 'content-type: application/json' \
262
483
  -d '{
263
484
  "expectedVersion": 1,
264
485
  "config": {
265
- "version": 1,
266
486
  "planes": {
267
487
  "standalone": {
268
488
  "enabled": true,
@@ -280,7 +500,51 @@ curl -X PUT http://127.0.0.1:11435/v1/admin/access/config \
280
500
  }'
281
501
  ```
282
502
 
283
- The admin `PUT` supports optimistic concurrency. If `expectedVersion` is present and does not match the active managed access config, the router returns `409`.
503
+ `expectedVersion` enables optimistic concurrency. If present and the value does not match the active managed config version, the router returns `409`.
504
+
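The version check can be pictured as below. This is a sketch only: `checkExpectedVersion` and the returned object shape are hypothetical, and the real handler is not part of this diff. Note that `addApiKey` and `revokeApiKey` in the `dist/cli.js` hunk each increment `version` by one, so a version read before a key change will no longer match afterwards.

```javascript
// Hypothetical sketch of the optimistic-concurrency check on PUT.
function checkExpectedVersion(activeVersion, expectedVersion) {
  // The check is skipped when the client omits expectedVersion.
  if (expectedVersion !== undefined && expectedVersion !== activeVersion) {
    return { status: 409 };
  }
  return { status: 200 };
}
```

A client that receives `409` would typically re-read the config with `GET`, reapply its change, and retry with the new version.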
505
+ **Add an API key**
506
+
507
+ Generate a key and its SHA-256 hash first:
508
+
509
+ ```bash
510
+ node -e "
511
+ const c = require('crypto'), k = 'onr-' + c.randomBytes(20).toString('hex');
512
+ console.log('key: ', k);
513
+ console.log('hash: sha256:' + c.createHash('sha256').update(k).digest('hex'));
514
+ "
515
+ ```
516
+
517
+ Then add the key:
518
+
519
+ ```bash
520
+ curl -X POST http://127.0.0.1:11435/v1/admin/access/keys \
521
+ -H 'authorization: Bearer admin-secret' \
522
+ -H 'content-type: application/json' \
523
+ -d '{
524
+ "id": "user-alice",
525
+ "name": "Alice",
526
+ "keyHash": "sha256:<hash>",
527
+ "scopes": ["standalone"],
528
+ "limits": {
529
+ "standalone": {"requests": 100, "windowSeconds": 60}
530
+ }
531
+ }'
532
+ ```
533
+
534
+ Returns `201` with the created key entry. Returns `409` if the `id` is already in use.
535
+
536
+ `scopes` controls which planes accept the key. Valid values are `standalone`, `runtimeAgent`, or both. `limits` is optional; when omitted, the plane's `defaultLimit` applies.
537
+
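As a convenience, the key generation and the request body can be built in one step and checked locally before calling the admin API. The sketch below reuses the two regular expressions and the scope enum from `apiKeySchema` in the `dist/cli.js` hunk; `buildKeyEntry` itself is a hypothetical helper, not part of the package.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: generate a key and a request body that should pass apiKeySchema.
function buildKeyEntry(id, name, scopes) {
  if (!/^[a-zA-Z0-9._:-]+$/.test(id)) throw new Error("invalid api key id: " + id);
  const allowed = ["standalone", "runtimeAgent"];
  if (scopes.length === 0 || !scopes.every((s) => allowed.includes(s))) {
    throw new Error("scopes must be a non-empty subset of " + allowed.join(", "));
  }
  const key = "onr-" + crypto.randomBytes(20).toString("hex"); // shown once to the user
  const keyHash = "sha256:" + crypto.createHash("sha256").update(key).digest("hex");
  return { key, entry: { id, name, keyHash, scopes } };
}

const { key, entry } = buildKeyEntry("user-alice", "Alice", ["standalone"]);
console.log("give to user:", key);
console.log("POST body:   ", JSON.stringify(entry));
```

Only the hash is sent to the router; the plaintext key stays with the caller.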
538
+ **Revoke an API key**
539
+
540
+ ```bash
541
+ curl -X DELETE http://127.0.0.1:11435/v1/admin/access/keys/user-alice \
542
+ -H 'authorization: Bearer admin-secret'
543
+ ```
544
+
545
+ Returns `200 { "revoked": { ... } }` with the removed key entry. Returns `404` if the id is not found. The change takes effect immediately without a restart.
546
+
547
+ All admin operations are written atomically to `access.managedConfigPath` and appended to the audit log when `auditLog: true`.
284
548
 
285
549
  For admin client certificate checks, enable HTTPS, configure `server.https.caPath`, and set:
286
550
 
package/dist/cli.js CHANGED
@@ -44,6 +44,17 @@ var protectedPlaneConfigSchema = z.object({
44
44
  }).default({ requireApiKey: false, anonymous: "allow" }),
45
45
  defaultLimit: trafficLimitSchema.optional()
46
46
  });
47
+ var apiKeySchema = z.object({
48
+ id: z.string().min(1).regex(/^[a-zA-Z0-9._:-]+$/, "api key id may contain only letters, numbers, dots, underscores, colons, and dashes"),
49
+ name: z.string().min(1).optional(),
50
+ keyHash: z.string().regex(/^sha256:[a-fA-F0-9]{64}$/, "keyHash must use sha256:<64 hex chars>"),
51
+ enabled: z.boolean().default(true),
52
+ scopes: z.array(z.enum(["standalone", "runtimeAgent"])).min(1),
53
+ limits: z.object({
54
+ standalone: trafficLimitSchema.optional(),
55
+ runtimeAgent: trafficLimitSchema.optional()
56
+ }).optional()
57
+ });
47
58
  var managedAccessConfigSchema = z.object({
48
59
  version: z.number().int().nonnegative().default(1),
49
60
  updatedAt: z.string().datetime().optional(),
@@ -60,19 +71,7 @@ var managedAccessConfigSchema = z.object({
60
71
  standalone: { enabled: true, auth: { requireApiKey: false, anonymous: "allow" } },
61
72
  runtimeAgent: { enabled: true, auth: { requireApiKey: false, anonymous: "allow" } }
62
73
  }),
63
- apiKeys: z.array(
64
- z.object({
65
- id: z.string().min(1).regex(/^[a-zA-Z0-9._:-]+$/, "api key id may contain only letters, numbers, dots, underscores, colons, and dashes"),
66
- name: z.string().min(1).optional(),
67
- keyHash: z.string().regex(/^sha256:[a-fA-F0-9]{64}$/, "keyHash must use sha256:<64 hex chars>"),
68
- enabled: z.boolean().default(true),
69
- scopes: z.array(z.enum(["standalone", "runtimeAgent"])).min(1),
70
- limits: z.object({
71
- standalone: trafficLimitSchema.optional(),
72
- runtimeAgent: trafficLimitSchema.optional()
73
- }).optional()
74
- })
75
- ).default([])
74
+ apiKeys: z.array(apiKeySchema).default([])
76
75
  });
77
76
  var defaultManagedAccessConfig = managedAccessConfigSchema.parse({});
78
77
  var adminPlaneConfigSchema = z.object({
@@ -159,8 +158,7 @@ var modelSpecSchema = z2.object({
159
158
  timeoutMs: z2.number().int().positive(),
160
159
  costClass: z2.enum(["low", "medium", "high"]).default("medium"),
161
160
  exclusive: z2.boolean().default(false),
162
- allowWhenBusy: z2.boolean().default(false),
163
- tags: z2.array(z2.string()).default([])
161
+ allowWhenBusy: z2.boolean().default(false)
164
162
  });
165
163
  var appConfigSchema = z2.object({
166
164
  server: z2.object({
@@ -364,7 +362,6 @@ models:
364
362
  costClass: low
365
363
  exclusive: false
366
364
  allowWhenBusy: true
367
- tags: [general]
368
365
  routes:
369
366
  triage: [llama3.2:3b]
370
367
  simple_chat: [llama3.2:3b]
@@ -962,8 +959,7 @@ function buildModelSpec(name, role, sizeGb, cpuOnly) {
962
959
  timeoutMs: heavy ? 3e5 : code ? 18e4 : 9e4,
963
960
  costClass: heavy ? "high" : code ? "medium" : "low",
964
961
  exclusive: heavy,
965
- allowWhenBusy: !heavy,
966
- tags: tagsForRole(role)
962
+ allowWhenBusy: !heavy
967
963
  };
968
964
  }
969
965
  function purposesForRole(role) {
@@ -981,21 +977,6 @@ function purposesForRole(role) {
981
977
  return ["triage", "simple_chat", "summarize"];
982
978
  }
983
979
  }
984
- function tagsForRole(role) {
985
- switch (role) {
986
- case "code":
987
- return ["code", "fallback"];
988
- case "review":
989
- return ["code", "review"];
990
- case "heavy":
991
- return ["reasoning", "large_context"];
992
- case "tool":
993
- return ["tool_use"];
994
- case "fast":
995
- default:
996
- return ["fast", "chat"];
997
- }
998
- }
999
980
  function generateRoutes(models) {
1000
981
  const fast = models.filter((model) => model.costClass === "low").map((model) => model.name);
1001
982
  const code = models.filter((model) => model.purpose.includes("code_generate")).map((model) => model.name);
@@ -1551,7 +1532,6 @@ var RoutingEngine = class {
1551
1532
  score += Math.max(0, 50 - routeIndex * 8);
1552
1533
  score += model.priority;
1553
1534
  if (model.purpose.includes(context.classification.taskType)) score += 25;
1554
- if (model.tags.includes(context.classification.taskType)) score += 15;
1555
1535
  if (preferredIndex >= 0) score += 80 - preferredIndex * 10;
1556
1536
  if (loaded) score += 20;
1557
1537
  if (context.classification.complexity === "heavy" && model.costClass === "high") score += 20;
@@ -1573,7 +1553,7 @@ var RoutingEngine = class {
1573
1553
  for (const name of context.router.preferredModels) names.add(name);
1574
1554
  for (const name of routeNames) names.add(name);
1575
1555
  for (const model of this.config.models) {
1576
- if (model.purpose.includes(context.classification.taskType) || model.tags.includes(context.classification.taskType)) {
1556
+ if (model.purpose.includes(context.classification.taskType)) {
1577
1557
  names.add(model.name);
1578
1558
  }
1579
1559
  }
@@ -1652,6 +1632,51 @@ var AccessControlStore = class {
1652
1632
  });
1653
1633
  return structuredClone(updated);
1654
1634
  }
1635
+ async addApiKey(input2) {
1636
+ const key = apiKeySchema.parse(input2);
1637
+ await this.enqueueWrite(async () => {
1638
+ if (this.managed.apiKeys.some((k) => k.id === key.id)) {
1639
+ throw new AccessHttpError(409, `API key with id '${key.id}' already exists`);
1640
+ }
1641
+ if (!this.access.managedConfigPath) {
1642
+ throw new AccessHttpError(500, "access.managedConfigPath is not configured");
1643
+ }
1644
+ const next = {
1645
+ ...this.managed,
1646
+ version: this.managed.version + 1,
1647
+ updatedAt: (/* @__PURE__ */ new Date()).toISOString(),
1648
+ apiKeys: [...this.managed.apiKeys, key]
1649
+ };
1650
+ await writeManagedAccessConfig(this.access.managedConfigPath, next);
1651
+ this.managed = next;
1652
+ this.access.managed = next;
1653
+ });
1654
+ return structuredClone(key);
1655
+ }
1656
+ async revokeApiKey(id) {
1657
+ let removed;
1658
+ await this.enqueueWrite(async () => {
1659
+ const idx = this.managed.apiKeys.findIndex((k) => k.id === id);
1660
+ if (idx === -1) {
1661
+ throw new AccessHttpError(404, `API key '${id}' not found`);
1662
+ }
1663
+ if (!this.access.managedConfigPath) {
1664
+ throw new AccessHttpError(500, "access.managedConfigPath is not configured");
1665
+ }
1666
+ removed = this.managed.apiKeys[idx];
1667
+ const next = {
1668
+ ...this.managed,
1669
+ version: this.managed.version + 1,
1670
+ updatedAt: (/* @__PURE__ */ new Date()).toISOString(),
1671
+ apiKeys: this.managed.apiKeys.filter((_, i) => i !== idx)
1672
+ };
1673
+ await writeManagedAccessConfig(this.access.managedConfigPath, next);
1674
+ this.managed = next;
1675
+ this.access.managed = next;
1676
+ this.limiter.clear();
1677
+ });
1678
+ return structuredClone(removed);
1679
+ }
1655
1680
  publicMiddleware(planeOrPlanes) {
1656
1681
  return (req, res, next) => {
1657
1682
  const planes = Array.isArray(planeOrPlanes) ? planeOrPlanes : [planeOrPlanes];
@@ -1962,6 +1987,26 @@ function createApp(config, deps) {
1962
1987
  next(error);
1963
1988
  }
1964
1989
  });
1990
+ api.post("/v1/admin/access/keys", adminAccess, async (req, res, next) => {
1991
+ try {
1992
+ const key = await access3.addApiKey(req.body);
1993
+ auditAdmin(config.access.admin, req, "success", "key_added", res.locals.admin?.remoteIp, key.id);
1994
+ res.status(201).json(key);
1995
+ } catch (error) {
1996
+ auditAdmin(config.access.admin, req, "failure", error instanceof Error ? error.message : String(error), res.locals.admin?.remoteIp);
1997
+ next(error);
1998
+ }
1999
+ });
2000
+ api.delete("/v1/admin/access/keys/:id", adminAccess, async (req, res, next) => {
2001
+ try {
2002
+ const revoked = await access3.revokeApiKey(req.params.id);
2003
+ auditAdmin(config.access.admin, req, "success", "key_revoked", res.locals.admin?.remoteIp, revoked.id);
2004
+ res.json({ revoked });
2005
+ } catch (error) {
2006
+ auditAdmin(config.access.admin, req, "failure", error instanceof Error ? error.message : String(error), res.locals.admin?.remoteIp);
2007
+ next(error);
2008
+ }
2009
+ });
1965
2010
  api.get("/v1/router/capabilities", runtimeAgentAccess, (_req, res) => {
1966
2011
  res.json(buildCapabilities(config));
1967
2012
  });