ollama-agent-router 0.1.5 → 0.1.7
- package/README.md +204 -3
- package/dist/cli.js +793 -134
- package/dist/cli.js.map +1 -1
- package/dist/index.d.ts +1591 -31
- package/dist/index.js +808 -137
- package/dist/index.js.map +1 -1
- package/docs/kong-runtime-contract-plan.md +415 -0
- package/examples/gex44.yaml +1 -0
- package/package.json +20 -9
package/README.md
@@ -1,9 +1,18 @@
 # ollama-agent-router
 
-
+Ollama Agent Router is a local LLM router for Ollama. It provides an OpenAI-compatible API gateway that routes agent and chat requests to the best local Ollama model based on task type, GPU/VRAM headroom, queue depth, loaded model state, and sync/async policy.
 
 It is designed for machines that run several Ollama models with different strengths, for example a small triage model, one or more code models, and a larger exclusive reasoning model.
 
+## Why use Ollama Agent Router?
+
+Use `ollama-agent-router` when you need:
+
+- an Ollama router for multiple local models
+- an Ollama agent router for coding agents and autonomous workflows
+- an OpenAI-compatible local LLM router
+- GPU-aware routing, queues, async jobs, and model selection
+
 ## Architecture
 
 Request flow:
@@ -97,6 +106,7 @@ Lookup order:
 Top-level sections:
 
 - `server`: host, port, base path, HTTPS certificates, and JSON body limit.
+- `access`: optional access control for standalone, runtime-agent, and admin planes.
 - `ollama`: base URL, OpenAI-compatible path, native API path, keep-alive, timeout.
 - `gpu`: provider, VRAM limits, GPU-only default, NVIDIA monitor command.
 - `router`: default mode, heavy-load thresholds, classifier config.
@@ -116,6 +126,7 @@ Server options:
 
 ```yaml
 server:
+  nodeId: local
   host: 127.0.0.1
   port: 11435
   basePath: /
@@ -127,12 +138,13 @@ server:
     caPath:
 ```
 
-Set `server.port` to choose the listening port. Set `server.basePath` to expose every router endpoint under a prefix, for example `/ollama-router`; then chat completions move to `/ollama-router/v1/chat/completions`, health to `/ollama-router/health`, and jobs to `/ollama-router/v1/jobs/{jobId}`.
+Set `server.nodeId` to a stable machine/runtime id when the router is used behind Kong. It is embedded in new async job ids so a gateway can route job status/result requests back to the right node-router. Allowed characters are letters, numbers, dots, and dashes. Set `server.port` to choose the listening port. Set `server.basePath` to expose every router endpoint under a prefix, for example `/ollama-router`; then chat completions move to `/ollama-router/v1/chat/completions`, health to `/ollama-router/health`, and jobs to `/ollama-router/v1/jobs/{jobId}`.
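The `nodeId` character rule and the `basePath` prefixing described in the added paragraph can be sketched as follows. This is a minimal illustration, not the package's code: `isValidNodeId` and `withBasePath` are hypothetical helper names, and the regular expression is only one reading of "letters, numbers, dots, and dashes".

```typescript
// Hypothetical helpers for illustration; not part of ollama-agent-router.

/** True when the id contains only letters, numbers, dots, and dashes. */
function isValidNodeId(nodeId: string): boolean {
  return /^[A-Za-z0-9.-]+$/.test(nodeId);
}

/** Prefix an endpoint path with a base path such as "/ollama-router". */
function withBasePath(basePath: string, endpoint: string): string {
  // Drop trailing slashes so "/" becomes "" and "/x/" becomes "/x".
  const prefix = basePath.replace(/\/+$/, "");
  const suffix = endpoint.startsWith("/") ? endpoint : `/${endpoint}`;
  return `${prefix}${suffix}`;
}

console.log(isValidNodeId("gex44-a"));
console.log(withBasePath("/ollama-router", "/v1/chat/completions"));
```

An id containing an underscore would be rejected under this reading, which keeps underscores free to act as separators inside job ids.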
 
 To run HTTPS directly from the router, set `server.https.enabled: true` and provide PEM certificate and key paths:
 
 ```yaml
 server:
+  nodeId: gex44-a
   host: 0.0.0.0
   port: 11435
   basePath: /ollama-router
@@ -144,6 +156,143 @@ server:
     caPath:
 ```
 
+## Access Planes
+
+The router can expose three separate access planes:
+
+- **Standalone plane**: the full local OpenAI-compatible router API, including `POST /v1/chat/completions` and job endpoints.
+- **Runtime agent plane**: machine-local endpoints used by Kong or another gateway, including `/v1/router/*` and selected-model execution.
+- **Admin plane**: access-management endpoints under `/v1/admin/access/*`.
+
+Access control is backward-compatible. If `access` is not configured, the standalone and runtime-agent planes stay enabled without API key requirements, matching earlier releases. The admin plane is disabled by default.
+
+API keys are sent with:
+
+```text
+Authorization: Bearer <api-key>
+```
+
+`x-api-key` is also accepted for clients that cannot set bearer tokens.
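A server-side reading of the two accepted headers might look like the sketch below. It is an assumption-laden illustration: `extractApiKey` and `HeaderMap` are hypothetical names, and giving the bearer token precedence over `x-api-key` is a guess, since the README does not say which wins when both are present.

```typescript
// Hypothetical sketch; the precedence between the two headers is an assumption.
// Node.js lowercases incoming header names, so lowercase keys are used here.
type HeaderMap = Record<string, string | undefined>;

function extractApiKey(headers: HeaderMap): string | undefined {
  const auth = headers["authorization"];
  const match = auth ? /^Bearer\s+(.+)$/i.exec(auth.trim()) : null;
  if (match) return match[1];
  // Fall back to x-api-key for clients that cannot set bearer tokens.
  const fallback = headers["x-api-key"]?.trim();
  return fallback ? fallback : undefined;
}

console.log(extractApiKey({ authorization: "Bearer secret-value" }));
```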
+
+Create SHA-256 key hashes before putting keys in config:
+
+```bash
+node -e "const crypto=require('crypto'); console.log('sha256:'+crypto.createHash('sha256').update(process.argv[1]).digest('hex'))" 'secret-value'
+```
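The same `sha256:<hex>` format can be produced and checked in TypeScript. The sketch below assumes a presented key is verified by hashing it and comparing against the stored value; `hashApiKey` and `matchesHash` are hypothetical names, and the constant-time comparison is a general precaution rather than something the README states.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helpers; the output format mirrors the shell one-liner.
function hashApiKey(key: string): string {
  return "sha256:" + createHash("sha256").update(key).digest("hex");
}

function matchesHash(presentedKey: string, storedHash: string): boolean {
  const a = Buffer.from(hashApiKey(presentedKey));
  const b = Buffer.from(storedHash.toLowerCase());
  // timingSafeEqual throws on unequal lengths, so check the length first.
  return a.length === b.length && timingSafeEqual(a, b);
}

const stored = hashApiKey("secret-value");
console.log(matchesHash("secret-value", stored));
```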
+
+Example access configuration:
+
+```yaml
+access:
+  managedConfigPath: ./ollama-agent-router.access.yaml
+  bootstrapIfMissing: true
+
+  admin:
+    enabled: true
+    allowedIps: [127.0.0.1, "::1", 10.0.0.0/8]
+    trustedProxy: false
+    apiKeyHashes:
+      - sha256:replace-with-admin-key-hash
+    clientCert:
+      required: false
+      allowedFingerprints: []
+      allowedSubjects: []
+    auditLog: true
+
+  managed:
+    version: 1
+    planes:
+      standalone:
+        enabled: true
+        auth:
+          requireApiKey: true
+          anonymous: reject
+        defaultLimit:
+          requests: 60
+          windowSeconds: 60
+      runtimeAgent:
+        enabled: true
+        auth:
+          requireApiKey: true
+          anonymous: reject
+        defaultLimit:
+          requests: 600
+          windowSeconds: 60
+    apiKeys:
+      - id: local-client
+        name: Local standalone client
+        keyHash: sha256:replace-with-client-key-hash
+        enabled: true
+        scopes: [standalone]
+        limits:
+          standalone:
+            requests: 120
+            windowSeconds: 60
+      - id: kong-runtime
+        name: Kong runtime caller
+        keyHash: sha256:replace-with-kong-key-hash
+        enabled: true
+        scopes: [runtimeAgent]
+        limits:
+          runtimeAgent:
+            requests: 2000
+            windowSeconds: 60
+```
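The `defaultLimit` and per-key `limits` fields suggest a requests-per-window rate limit. The sketch below shows one way such settings could be interpreted, under two loud assumptions: a per-key limit overrides the plane default, and counting uses a simple fixed window. Neither is confirmed by the README, and `effectiveLimit` and `FixedWindowLimiter` are hypothetical names.

```typescript
// Hypothetical sketch; override order and windowing strategy are assumptions.
interface Limit { requests: number; windowSeconds: number }
type Plane = "standalone" | "runtimeAgent";
interface ApiKeyEntry { id: string; scopes: Plane[]; limits?: Partial<Record<Plane, Limit>> }

/** Prefer the key's own limit for the plane, else the plane default. */
function effectiveLimit(planeDefault: Limit, key: ApiKeyEntry, plane: Plane): Limit {
  return key.limits?.[plane] ?? planeDefault;
}

class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  /** Returns true when the request fits in the current window for the bucket. */
  allow(bucket: string, limit: Limit, nowMs: number): boolean {
    if (limit.requests <= 0) return false;
    const windowMs = limit.windowSeconds * 1000;
    const current = this.windows.get(bucket);
    if (!current || nowMs - current.start >= windowMs) {
      this.windows.set(bucket, { start: nowMs, count: 1 }); // start a new window
      return true;
    }
    if (current.count >= limit.requests) return false;
    current.count += 1;
    return true;
  }
}

const localClient: ApiKeyEntry = {
  id: "local-client",
  scopes: ["standalone"],
  limits: { standalone: { requests: 120, windowSeconds: 60 } },
};
console.log(effectiveLimit({ requests: 60, windowSeconds: 60 }, localClient, "standalone"));
```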
+
+`access.managed` is the initial access policy. When `access.managedConfigPath` is set, the router loads that file at startup. If the file is missing and `bootstrapIfMissing: true`, it writes the initial policy to that path. Admin API changes are then written atomically to this managed YAML file and survive restarts.
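The README does not say how the atomic write is performed. A common technique, shown here purely as an assumption, is to write a temporary file in the same directory and rename it over the target, since a same-directory rename replaces the file in one step on POSIX filesystems. `writeFileAtomic` is a hypothetical helper name.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical sketch of a temp-file-plus-rename write; not the package's code.
function writeFileAtomic(targetPath: string, contents: string): void {
  // Keep the temp file next to the target so the rename stays on one filesystem.
  const tempPath = join(dirname(targetPath), `.${process.pid}.${Date.now()}.tmp`);
  writeFileSync(tempPath, contents, { encoding: "utf8", mode: 0o600 });
  renameSync(tempPath, targetPath);
}

// Usage against a throwaway directory.
const dir = mkdtempSync(join(tmpdir(), "access-"));
const target = join(dir, "ollama-agent-router.access.yaml");
writeFileAtomic(target, "version: 1\n");
writeFileAtomic(target, "version: 2\n");
console.log(readFileSync(target, "utf8").trim());
```

Readers of the target path see either the old or the new contents, never a partially written file.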
+
+The admin plane security settings are boot-only. They are intentionally not managed through the admin API:
+
+- `access.admin.allowedIps`
+- `access.admin.trustedProxy`
+- `access.admin.apiKeyHashes`
+- `access.admin.clientCert`
+
+This prevents the admin API from changing the rules that protect itself. When `access.admin.enabled: true`, `access.managedConfigPath` and at least one admin API key hash are required.
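The two startup requirements for an enabled admin plane can be expressed as a small check. This is a sketch under assumptions: `validateAdminBootConfig` and the `AccessBootConfig` shape are hypothetical and cover only the fields named above, and the error messages are invented for illustration.

```typescript
// Hypothetical validation sketch; field names follow the README's config keys.
interface AccessBootConfig {
  managedConfigPath?: string;
  admin?: { enabled?: boolean; apiKeyHashes?: string[] };
}

function validateAdminBootConfig(access: AccessBootConfig): string[] {
  const errors: string[] = [];
  const admin = access.admin;
  if (!admin || !admin.enabled) return errors; // admin plane is off by default
  if (!access.managedConfigPath) {
    errors.push("access.managedConfigPath is required when access.admin.enabled is true");
  }
  if ((admin.apiKeyHashes ?? []).length === 0) {
    errors.push("access.admin.apiKeyHashes needs at least one hash when access.admin.enabled is true");
  }
  return errors;
}

console.log(validateAdminBootConfig({ admin: { enabled: true } }));
```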
+
+Admin API:
+
+```bash
+curl http://127.0.0.1:11435/v1/admin/access/config \
+  -H 'authorization: Bearer admin-secret'
+
+curl -X PUT http://127.0.0.1:11435/v1/admin/access/config \
+  -H 'authorization: Bearer admin-secret' \
+  -H 'content-type: application/json' \
+  -d '{
+    "expectedVersion": 1,
+    "config": {
+      "version": 1,
+      "planes": {
+        "standalone": {
+          "enabled": true,
+          "auth": {"requireApiKey": true, "anonymous": "reject"},
+          "defaultLimit": {"requests": 60, "windowSeconds": 60}
+        },
+        "runtimeAgent": {
+          "enabled": true,
+          "auth": {"requireApiKey": true, "anonymous": "reject"},
+          "defaultLimit": {"requests": 600, "windowSeconds": 60}
+        }
+      },
+      "apiKeys": []
+    }
+  }'
+```
+
+The admin `PUT` supports optimistic concurrency. If `expectedVersion` is present and does not match the active managed access config, the router returns `409`.
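The version check behind the `409` can be sketched as a pure function. The names `applyPut`, `PutBody`, and `PutResult` are hypothetical, and the sketch deliberately does not bump the version after a successful write because the README does not describe how versions advance.

```typescript
// Hypothetical sketch of the expectedVersion check; not the package's handler.
interface ManagedConfig { version: number; planes: unknown; apiKeys: unknown[] }
interface PutBody { expectedVersion?: number; config: ManagedConfig }
type PutResult =
  | { status: 200; active: ManagedConfig }
  | { status: 409; activeVersion: number };

function applyPut(active: ManagedConfig, body: PutBody): PutResult {
  // Only enforce the check when the caller supplied expectedVersion.
  if (body.expectedVersion !== undefined && body.expectedVersion !== active.version) {
    return { status: 409, activeVersion: active.version };
  }
  return { status: 200, active: body.config };
}

const activeConfig: ManagedConfig = { version: 2, planes: {}, apiKeys: [] };
const stale = applyPut(activeConfig, {
  expectedVersion: 1,
  config: { version: 1, planes: {}, apiKeys: [] },
});
console.log(stale.status);
```

A client that receives `409` would typically re-read the config with `GET`, reapply its change, and retry with the current version.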
+
+For admin client certificate checks, enable HTTPS, configure `server.https.caPath`, and set:
+
+```yaml
+access:
+  admin:
+    clientCert:
+      required: true
+```
+
+The HTTPS server requests a client certificate and the admin middleware verifies that it is trusted. Optional fingerprint and subject allowlists can narrow trust further.
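How the allowlists narrow trust is not spelled out, so the following is only one plausible reading: the certificate must be trusted by the CA, and each non-empty allowlist must contain a match. `clientCertAllowed` and both interfaces are hypothetical. In Node.js the inputs could come from `tlsSocket.authorized` and `tlsSocket.getPeerCertificate()`, whose `fingerprint256` is colon-separated hex.

```typescript
// Hypothetical sketch; allowlist semantics (empty list = no restriction) are assumed.
interface PeerCertInfo { authorized: boolean; fingerprint256?: string; subjectCN?: string }
interface ClientCertPolicy {
  required: boolean;
  allowedFingerprints: string[];
  allowedSubjects: string[];
}

function clientCertAllowed(policy: ClientCertPolicy, peer: PeerCertInfo): boolean {
  if (!policy.required) return true;
  if (!peer.authorized) return false; // not signed by the configured CA
  // Compare fingerprints without colons and case-insensitively.
  const norm = (s: string): string => s.replace(/:/g, "").toLowerCase();
  if (policy.allowedFingerprints.length > 0) {
    const fp = peer.fingerprint256 ? norm(peer.fingerprint256) : "";
    if (!policy.allowedFingerprints.some((a) => norm(a) === fp)) return false;
  }
  if (policy.allowedSubjects.length > 0) {
    if (!peer.subjectCN || !policy.allowedSubjects.includes(peer.subjectCN)) return false;
  }
  return true;
}

console.log(
  clientCertAllowed(
    { required: true, allowedFingerprints: ["AB:CD"], allowedSubjects: [] },
    { authorized: true, fingerprint256: "ab:cd" },
  ),
);
```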
+
 ## API Examples
 
 Sync-preferred request:
@@ -187,17 +336,68 @@ Status endpoints:
 curl http://127.0.0.1:11435/health
 curl http://127.0.0.1:11435/metrics
 curl http://127.0.0.1:11435/v1/router/status
+curl http://127.0.0.1:11435/v1/router/capabilities
+curl http://127.0.0.1:11435/v1/router/runtime
 curl http://127.0.0.1:11435/v1/router/models
 curl http://127.0.0.1:11435/v1/router/gpu
 ```
 
+## Kong Runtime Agent API
+
+When used with [`kong-ollama-agent-router`](https://github.com/ExeconOne/kong-ollama-agent-router), this process acts as a local runtime agent. Kong owns public request validation, classification, model selection, and response enrichment. The node-router supplies machine-local state and executes the model selected by Kong.
+
+Kong-facing endpoints:
+
+```bash
+curl http://127.0.0.1:11435/v1/router/capabilities
+curl http://127.0.0.1:11435/v1/router/runtime
+curl -X POST http://127.0.0.1:11435/v1/router/execute
+curl -X POST http://127.0.0.1:11435/v1/router/jobs
+```
+
+`GET /v1/router/capabilities` returns the stable routing config snapshot: `nodeId`, package version, router defaults, GPU policy, queue defaults, configured models, and routes. It does not call Ollama or GPU probes, so Kong can cache it for longer periods.
+
+`GET /v1/router/runtime` returns volatile runtime state: Ollama reachability, loaded models, GPU snapshot, queue depth/running counts, and retained job counters. Kong should cache it only briefly.
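The different caching advice for the two endpoints can be illustrated with a small time-to-live cache on the gateway side. This is a sketch only: `TtlCache` is a hypothetical class, the loaders stand in for real HTTP calls, and the TTL values (five minutes and two seconds) are invented, since the README gives no numbers.

```typescript
// Hypothetical gateway-side cache; TTL values are illustrative assumptions.
class TtlCache<T> {
  private entry: { value: T; expiresAt: number } | undefined;
  private readonly ttlMs: number;
  private readonly load: () => T;

  constructor(ttlMs: number, load: () => T) {
    this.ttlMs = ttlMs;
    this.load = load;
  }

  /** Return the cached value, reloading once the TTL has elapsed. */
  get(nowMs: number = Date.now()): T {
    if (this.entry && nowMs < this.entry.expiresAt) return this.entry.value;
    const value = this.load();
    this.entry = { value, expiresAt: nowMs + this.ttlMs };
    return value;
  }
}

// Stand-in loaders; a real gateway would fetch the two endpoints here.
let capabilityLoads = 0;
let runtimeLoads = 0;
const capabilities = new TtlCache(5 * 60_000, () => ({ nodeId: "gex44-a", load: ++capabilityLoads }));
const runtime = new TtlCache(2_000, () => ({ queueDepth: 0, load: ++runtimeLoads }));

capabilities.get(0);
runtime.get(0);
capabilities.get(10_000); // still cached
runtime.get(10_000); // reloaded, the short TTL has expired
console.log(capabilityLoads, runtimeLoads);
```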
+
+`POST /v1/router/execute` runs a request on a model already selected by Kong. It does not classify or route again:
+
+```json
+{
+  "selectedModel": "deepseek-coder:6.7b",
+  "request": {
+    "model": "deepseek-coder:6.7b",
+    "messages": [{"role": "user", "content": "Review this TypeScript function"}],
+    "stream": false
+  },
+  "routerDecision": {
+    "taskType": "code_review",
+    "score": 250,
+    "reason": "Selected by Kong"
+  }
+}
+```
+
+The response is wrapped so Kong can add its own public `router` metadata:
+
+```json
+{
+  "result": {},
+  "nodeId": "gex44-a",
+  "selectedModel": "deepseek-coder:6.7b",
+  "queueTimeMs": 4,
+  "executionTimeMs": 1200
+}
+```
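The two JSON bodies above translate into types roughly as follows, with a sketch of how a gateway might unwrap the response. The interfaces only mirror the fields shown in the examples and may be incomplete, and `toPublicResponse` together with the shape of the public `router` object are assumptions, since the README does not define what Kong adds.

```typescript
// Types inferred from the example payloads; the real contract may have more fields.
interface RouterDecision { taskType: string; score: number; reason: string }
interface ExecuteRequest {
  selectedModel: string;
  request: { model: string; messages: { role: string; content: string }[]; stream: boolean };
  routerDecision: RouterDecision;
}
interface ExecuteResponse<R> {
  result: R;
  nodeId: string;
  selectedModel: string;
  queueTimeMs: number;
  executionTimeMs: number;
}

// Hypothetical gateway-side unwrapping; the public `router` shape is assumed.
function toPublicResponse<R extends object>(wrapped: ExecuteResponse<R>, decision: RouterDecision) {
  const { result, ...node } = wrapped;
  return { ...result, router: { ...decision, ...node } };
}

const body: ExecuteRequest = {
  selectedModel: "deepseek-coder:6.7b",
  request: {
    model: "deepseek-coder:6.7b",
    messages: [{ role: "user", content: "Review this TypeScript function" }],
    stream: false,
  },
  routerDecision: { taskType: "code_review", score: 250, reason: "Selected by Kong" },
};
const publicResponse = toPublicResponse(
  { result: { id: "chatcmpl-1" }, nodeId: "gex44-a", selectedModel: body.selectedModel, queueTimeMs: 4, executionTimeMs: 1200 },
  body.routerDecision,
);
console.log(JSON.stringify(publicResponse));
```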
+
+`POST /v1/router/jobs` creates an async job on the selected model. New job ids include the node id, for example `job_gex44-a_01JABCDEF123`, so Kong can route later `GET /v1/jobs/{jobId}` and `GET /v1/jobs/{jobId}/result` calls to the owning node-router.
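A gateway could recover the owning node from a job id along these lines. The `job_<nodeId>_<suffix>` layout is inferred from the single example id rather than from a documented format, and `nodeIdFromJobId`, `upstreamForJob`, and the upstream URLs are hypothetical. The parse relies on node ids being limited to letters, numbers, dots, and dashes, so the first underscore after the node id is unambiguous.

```typescript
// Hypothetical parser; the id layout is inferred from one example in the README.
function nodeIdFromJobId(jobId: string): string | undefined {
  const match = /^job_([A-Za-z0-9.-]+)_(.+)$/.exec(jobId);
  return match ? match[1] : undefined;
}

// Illustrative routing table from node id to node-router base URL.
const upstreams: Record<string, string> = {
  "gex44-a": "http://10.0.0.11:11435",
  "gex44-b": "http://10.0.0.12:11435",
};

function upstreamForJob(jobId: string): string | undefined {
  const nodeId = nodeIdFromJobId(jobId);
  return nodeId ? upstreams[nodeId] : undefined;
}

console.log(nodeIdFromJobId("job_gex44-a_01JABCDEF123")); // prints "gex44-a"
```

Ids that do not match the pattern return `undefined`, so a gateway would need its own fallback for jobs created before node ids were embedded.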
+
 ## Async Jobs
 
 When a selected model is busy or the router detects heavy load and `allowAsync=true`, the API returns:
 
 ```json
 {
-  "id": "
+  "id": "job_gex44-a_01JABCDEF123",
   "object": "router.job",
   "status": "queued",
   "message": "Heavy load. Job accepted for asynchronous processing."
@@ -322,6 +522,7 @@ The project uses TypeScript, ESM, Express, zod, pino, p-queue, nanoid, and Vites
 Design notes:
 
 - CLI configuration wizard HLD: `docs/cli-configurator-hld.md`
+- Kong runtime agent contract plan: `docs/kong-runtime-contract-plan.md`
 
 ## Release Guide
 