npm - ollama-agent-router - Versions diffs - 0.1.4 → 0.1.6 - Mend

ollama-agent-router 0.1.4 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.md +74 -5
package/dist/cli.js +235 -10
package/dist/cli.js.map +1 -1
package/dist/index.d.ts +95 -30
package/dist/index.js +231 -6
package/dist/index.js.map +1 -1
package/docs/kong-runtime-contract-plan.md +415 -0
package/examples/gex44.yaml +1 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -18,10 +18,25 @@ Request flow:
 ## Quick Start
+Install with Homebrew on macOS or Linux:
 ```bash
-npm install
-npm run build
-npm link
+brew install ExeconOne/tap/ollama-agent-router
+ollama-agent-router configure
+ollama-agent-router serve --config ollama-agent-router.yaml
+```
+Or install from the APT repository on Debian/Ubuntu:
+```bash
+curl -fsSL https://execonone.github.io/ollama-agent-router/apt/gpg.key \
+  | sudo gpg --dearmor -o /usr/share/keyrings/ollama-agent-router.gpg
+echo "deb [signed-by=/usr/share/keyrings/ollama-agent-router.gpg] https://execonone.github.io/ollama-agent-router/apt stable main" \
+  | sudo tee /etc/apt/sources.list.d/ollama-agent-router.list
+sudo apt-get update
+sudo apt-get install ollama-agent-router
 ollama-agent-router configure
 ollama-agent-router serve --config ollama-agent-router.yaml
 ```
@@ -101,6 +116,7 @@ Server options:
 ```yaml
 server:
+  nodeId: local
   host: 127.0.0.1
   port: 11435
   basePath: /
@@ -112,12 +128,13 @@ server:
     caPath:
 ```
-Set `server.port` to choose the listening port. Set `server.basePath` to expose every router endpoint under a prefix, for example `/ollama-router`; then chat completions move to `/ollama-router/v1/chat/completions`, health to `/ollama-router/health`, and jobs to `/ollama-router/v1/jobs/{jobId}`.
+Set `server.nodeId` to a stable machine/runtime id when the router is used behind Kong. It is embedded in new async job ids so a gateway can route job status/result requests back to the right node-router. Allowed characters are letters, numbers, dots, and dashes. Set `server.port` to choose the listening port. Set `server.basePath` to expose every router endpoint under a prefix, for example `/ollama-router`; then chat completions move to `/ollama-router/v1/chat/completions`, health to `/ollama-router/health`, and jobs to `/ollama-router/v1/jobs/{jobId}`.
 To run HTTPS directly from the router, set `server.https.enabled: true` and provide PEM certificate and key paths:
 ```yaml
 server:
+  nodeId: gex44-a
   host: 0.0.0.0
   port: 11435
   basePath: /ollama-router
@@ -172,17 +189,68 @@ Status endpoints:
 curl http://127.0.0.1:11435/health
 curl http://127.0.0.1:11435/metrics
 curl http://127.0.0.1:11435/v1/router/status
+curl http://127.0.0.1:11435/v1/router/capabilities
+curl http://127.0.0.1:11435/v1/router/runtime
 curl http://127.0.0.1:11435/v1/router/models
 curl http://127.0.0.1:11435/v1/router/gpu
 ```
+## Kong Runtime Agent API
+When used with `kong-ollama-router`, this process acts as a local runtime agent. Kong owns public request validation, classification, model selection, and response enrichment. The node-router supplies machine-local state and executes the model selected by Kong.
+Kong-facing endpoints:
+```bash
+curl http://127.0.0.1:11435/v1/router/capabilities
+curl http://127.0.0.1:11435/v1/router/runtime
+curl -X POST http://127.0.0.1:11435/v1/router/execute
+curl -X POST http://127.0.0.1:11435/v1/router/jobs
+```
+`GET /v1/router/capabilities` returns the stable routing config snapshot: `nodeId`, package version, router defaults, GPU policy, queue defaults, configured models, and routes. It does not call Ollama or GPU probes, so Kong can cache it for longer periods.
+`GET /v1/router/runtime` returns volatile runtime state: Ollama reachability, loaded models, GPU snapshot, queue depth/running counts, and retained job counters. Kong should cache it only briefly.
+`POST /v1/router/execute` runs a request on a model already selected by Kong. It does not classify or route again:
+```json
+{
+  "selectedModel": "deepseek-coder:6.7b",
+  "request": {
+    "model": "deepseek-coder:6.7b",
+    "messages": [{"role": "user", "content": "Review this TypeScript function"}],
+    "stream": false
+  },
+  "routerDecision": {
+    "taskType": "code_review",
+    "score": 250,
+    "reason": "Selected by Kong"
+  }
+}
+```
+The response is wrapped so Kong can add its own public `router` metadata:
+```json
+{
+  "result": {},
+  "nodeId": "gex44-a",
+  "selectedModel": "deepseek-coder:6.7b",
+  "queueTimeMs": 4,
+  "executionTimeMs": 1200
+}
+```
+`POST /v1/router/jobs` creates an async job on the selected model. New job ids include the node id, for example `job_gex44-a_01JABCDEF123`, so Kong can route later `GET /v1/jobs/{jobId}` and `GET /v1/jobs/{jobId}/result` calls to the owning node-router.
 ## Async Jobs
 When a selected model is busy or the router detects heavy load and `allowAsync=true`, the API returns:
 ```json
 {
-  "id": "job_01JABCDEF123",
+  "id": "job_gex44-a_01JABCDEF123",
   "object": "router.job",
   "status": "queued",
   "message": "Heavy load. Job accepted for asynchronous processing."
@@ -307,6 +375,7 @@ The project uses TypeScript, ESM, Express, zod, pino, p-queue, nanoid, and Vites
 Design notes:
 - CLI configuration wizard HLD: `docs/cli-configurator-hld.md`
+- Kong runtime agent contract plan: `docs/kong-runtime-contract-plan.md`
 ## Release Guide
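
The README change above says new job ids embed the node id (for example `job_gex44-a_01JABCDEF123`) so that a gateway can route status and result requests to the owning node-router. A minimal sketch of how a gateway-side component might recover the node id is shown below. The helper name `parseNodeId` is hypothetical and not part of the package. It relies on two facts from the diff: ids are built as `job_${nodeId}_${nanoid(16)}`, and `server.nodeId` is restricted to letters, numbers, dots, and dashes, so the first underscore after the `job_` prefix ends the node id.

```typescript
// Hypothetical helper (not part of ollama-agent-router): extract the node id
// from a job id shaped like job_<nodeId>_<suffix>.
const NODE_ID_PATTERN = /^[a-zA-Z0-9.-]+$/; // same character set as the server.nodeId schema

function parseNodeId(jobId: string): string | undefined {
  const prefix = "job_";
  if (!jobId.startsWith(prefix)) return undefined;
  const rest = jobId.slice(prefix.length);
  // nodeId cannot contain "_", so the first underscore separates it from the suffix.
  const separator = rest.indexOf("_");
  if (separator <= 0 || separator === rest.length - 1) return undefined;
  const nodeId = rest.slice(0, separator);
  return NODE_ID_PATTERN.test(nodeId) ? nodeId : undefined;
}

console.log(parseNodeId("job_gex44-a_01JABCDEF123"));
console.log(parseNodeId("job_01JABCDEF123"));
```

One caveat: the default nanoid alphabet includes `_` and `-`, so an id created by 0.1.4 (`job_<nanoid>`) can by chance contain an underscore and be misread as carrying a node id. A gateway that must handle jobs created before the upgrade would likely need an additional check, such as matching the parsed value against its list of known nodes.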

package/dist/cli.js CHANGED

@@ -2,7 +2,7 @@
 // src/cli-program.ts
 import { readFile as readFile4 } from "fs/promises";
-import { readFileSync } from "fs";
+import { readFileSync as readFileSync2 } from "fs";
 import { Command } from "commander";
 // src/config.ts
@@ -46,6 +46,7 @@ var modelSpecSchema = z.object({
 });
 var appConfigSchema = z.object({
   server: z.object({
+    nodeId: z.string().regex(/^[a-zA-Z0-9.-]+$/, "server.nodeId may contain only letters, numbers, dots, and dashes").default("local"),
     host: z.string().min(1),
     port: z.number().int().min(1).max(65535),
     basePath: z.string().min(1).default("/"),
@@ -152,6 +153,7 @@ async function writeDefaultConfig(path) {
   await writeFile(target, defaultConfigYaml, "utf8");
 }
 var defaultConfigYaml = `server:
+  nodeId: local
   host: 127.0.0.1
   port: 11435
   basePath: /
@@ -249,11 +251,13 @@ var StaticGpuMonitor = class {
   async snapshot() {
     if (this.config.provider === "none") return void 0;
     return {
+      provider: this.config.provider,
       name: this.config.name ?? "Configured GPU",
       vramTotalMb: this.config.vramTotalMb,
       vramUsedMb: 0,
       vramFreeMb: this.config.vramTotalMb,
-      utilizationPct: 0
+      utilizationPct: 0,
+      snapshotAgeMs: 0
     };
   }
 };
@@ -280,11 +284,13 @@ function parseNvidiaSmi(output2) {
   return output2.split(/\r?\n/).map((line) => line.trim()).filter(Boolean).map((line) => {
     const [name, total, used, free, utilization] = line.split(",").map((part) => part.trim());
     return {
+      provider: "nvidia",
       name,
       vramTotalMb: Number(total),
       vramUsedMb: Number(used),
       vramFreeMb: Number(free),
-      utilizationPct: Number(utilization)
+      utilizationPct: Number(utilization),
+      snapshotAgeMs: 0
     };
   }).filter((gpu) => gpu.name && Number.isFinite(gpu.vramTotalMb));
 }
@@ -341,6 +347,20 @@ var HttpOllamaClient = class {
       return [];
     }
   }
+  async health() {
+    const controller = new AbortController();
+    const timer = setTimeout(() => controller.abort(), 1e3);
+    try {
+      const response = await fetch(new URL(`${this.config.nativeApiBasePath}/tags`, this.config.baseUrl), {
+        signal: controller.signal
+      });
+      return response.ok;
+    } catch {
+      return false;
+    } finally {
+      clearTimeout(timer);
+    }
+  }
 };
 var OllamaHttpError = class extends Error {
   constructor(statusCode, payload) {
@@ -488,6 +508,7 @@ function generateConfigFromDetection(detection, answers = {}) {
   const serverHttps = typeof httpsAnswer === "boolean" ? { enabled: httpsAnswer } : { enabled: false, ...httpsAnswer ?? {} };
   const config = {
     server: {
+      nodeId: "local",
       host: "127.0.0.1",
       port: 11435,
       basePath: "/",
@@ -949,15 +970,17 @@ async function fileExists(path) {
 // src/job-store.ts
 import { nanoid } from "nanoid";
 var InMemoryJobStore = class {
-  constructor(config) {
+  constructor(config, nodeId = "local") {
     this.config = config;
+    this.nodeId = nodeId;
   }
   config;
+  nodeId;
   jobs = /* @__PURE__ */ new Map();
   create(input2) {
     const now = /* @__PURE__ */ new Date();
     const record = {
-      id: `job_${nanoid(16)}`,
+      id: `job_${this.nodeId}_${nanoid(16)}`,
       status: "queued",
       task_type: input2.taskType,
       selected_model: input2.selectedModel,
@@ -981,6 +1004,25 @@ var InMemoryJobStore = class {
   list(limit = 50) {
     return [...this.jobs.values()].sort((a, b) => b.created_at.localeCompare(a.created_at)).slice(0, limit).map((job) => ({ ...job }));
   }
+  summary() {
+    const counts = {
+      queued: 0,
+      running: 0,
+      succeededRetained: 0,
+      failedRetained: 0,
+      cancelledRetained: 0,
+      expiredRetained: 0
+    };
+    for (const job of this.jobs.values()) {
+      if (job.status === "queued") counts.queued += 1;
+      if (job.status === "running") counts.running += 1;
+      if (job.status === "succeeded") counts.succeededRetained += 1;
+      if (job.status === "failed") counts.failedRetained += 1;
+      if (job.status === "cancelled") counts.cancelledRetained += 1;
+      if (job.status === "expired") counts.expiredRetained += 1;
+    }
+    return counts;
+  }
   markRunning(id) {
     const job = this.jobs.get(id);
     if (!job || job.status !== "queued" && job.status !== "running") return this.get(id);
@@ -1152,6 +1194,7 @@ var QueueManager = class {
 // src/server.ts
 import http from "http";
 import https from "https";
+import { readFileSync } from "fs";
 import { readFile as readFile3 } from "fs/promises";
 import express from "express";
 import { pinoHttp } from "pino-http";
@@ -1403,6 +1446,7 @@ var logger = pino({
 });
 // src/server.ts
+var packageJson = JSON.parse(readFileSync(new URL("../package.json", import.meta.url), "utf8"));
 var chatRequestSchema = z3.object({
   model: z3.string().optional(),
   messages: z3.array(z3.object({ role: z3.string(), content: z3.unknown() })).min(1),
@@ -1419,6 +1463,32 @@ var chatRequestSchema = z3.object({
     requireGpuOnly: z3.boolean().optional()
   }).optional()
 }).passthrough();
+var classificationSchema = z3.object({
+  taskType: z3.enum(taskTypes).optional(),
+  complexity: z3.enum(["light", "medium", "heavy"]).optional(),
+  requiresLargeContext: z3.boolean().optional(),
+  requiresToolUse: z3.boolean().optional(),
+  confidence: z3.number().min(0).max(1).optional()
+}).optional();
+var routerDecisionSchema = z3.object({
+  taskType: z3.enum(taskTypes).optional(),
+  score: z3.number().optional(),
+  reason: z3.string().optional(),
+  priority: z3.enum(["low", "normal", "high"]).optional()
+}).passthrough().optional();
+var executeRequestSchema = z3.object({
+  selectedModel: z3.string().min(1),
+  request: chatRequestSchema,
+  priority: z3.enum(["low", "normal", "high"]).optional(),
+  routerDecision: routerDecisionSchema
+});
+var createRouterJobSchema = z3.object({
+  selectedModel: z3.string().min(1),
+  request: chatRequestSchema,
+  classification: classificationSchema,
+  priority: z3.enum(["low", "normal", "high"]).optional(),
+  routerDecision: routerDecisionSchema
+});
 function createApp(config, deps) {
   const app = express();
   const api = express.Router();
@@ -1430,12 +1500,29 @@ function createApp(config, deps) {
   api.get("/health", (_req, res) => {
     res.json({ status: "ok", service: "ollama-agent-router" });
   });
-  api.get("/metrics", (_req, res) => {
+  api.get("/metrics", async (_req, res) => {
     const snapshot = deps.queue.snapshot();
+    const jobSummary = deps.jobs.summary();
+    const jobsByStatusAndModel = countJobsByStatusAndModel(deps.jobs.list(Number.MAX_SAFE_INTEGER));
+    const [gpu, ollamaReachable] = await Promise.all([safeGpu(deps.gpu), safeOllamaReachable(deps.ollama)]);
     res.type("text/plain").send(
       [
         `oar_queue_global_queued ${snapshot.globalQueued}`,
         `oar_queue_global_running ${snapshot.globalRunning}`,
+        `oar_ollama_reachable ${ollamaReachable ? 1 : 0}`,
+        ...gpu ? [
+          `oar_gpu_vram_free_mb ${gpu.vramFreeMb}`,
+          `oar_gpu_utilization_pct ${gpu.utilizationPct}`
+        ] : [],
+        `oar_jobs_total{status="queued"} ${jobSummary.queued}`,
+        `oar_jobs_total{status="running"} ${jobSummary.running}`,
+        `oar_jobs_total{status="succeeded"} ${jobSummary.succeededRetained}`,
+        `oar_jobs_total{status="failed"} ${jobSummary.failedRetained}`,
+        `oar_jobs_total{status="cancelled"} ${jobSummary.cancelledRetained}`,
+        `oar_jobs_total{status="expired"} ${jobSummary.expiredRetained}`,
+        ...jobsByStatusAndModel.map(
+          (item) => `oar_jobs_total{status="${escapeMetricLabel(item.status)}",model="${escapeMetricLabel(item.model)}"} ${item.count}`
+        ),
         ...snapshot.byModel.flatMap((item) => [
           `oar_model_queue_depth{model="${escapeMetricLabel(item.model)}"} ${item.queued}`,
           `oar_model_running{model="${escapeMetricLabel(item.model)}"} ${item.running}`
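
The hunk above builds Prometheus-style sample lines by hand and passes label values through `escapeMetricLabel`, which (as shown later in this diff) escapes backslashes and double quotes. The Prometheus text exposition format also requires line feeds inside label values to be written as `\n`. The sketch below shows one way such lines could be formatted with all three escapes. `escapeLabelValue` and `formatSample` are illustrative names, not functions from the package.

```typescript
// Illustrative helpers (not part of ollama-agent-router).
// Escape a label value for the Prometheus text exposition format:
// backslash, double quote, and line feed must be escaped.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render one sample line, for example: name{key="value"} 3
function formatSample(name: string, labels: Record<string, string>, value: number): string {
  const parts = Object.entries(labels).map(
    ([key, labelValue]) => `${key}="${escapeLabelValue(labelValue)}"`
  );
  return parts.length > 0 ? `${name}{${parts.join(",")}} ${value}` : `${name} ${value}`;
}

console.log(formatSample("oar_jobs_total", { status: "queued", model: "deepseek-coder:6.7b" }, 2));
```

Model names come from configuration, so a newline in a label value is unlikely in practice. The extra escape is a defensive measure rather than a fix for an observed failure.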
@@ -1443,9 +1530,20 @@ function createApp(config, deps) {
       ].join("\n")
     );
   });
+  api.get("/v1/router/capabilities", (_req, res) => {
+    res.json(buildCapabilities(config));
+  });
+  api.get("/v1/router/runtime", async (_req, res, next) => {
+    try {
+      res.json(await buildRuntimeSnapshot(config, deps));
+    } catch (error) {
+      next(error);
+    }
+  });
   api.get("/v1/router/status", async (_req, res, next) => {
     try {
       res.json({
+        nodeId: config.server.nodeId,
         service: "ollama-agent-router",
         queue: deps.queue.snapshot(),
         gpu: await safeGpu(deps.gpu),
@@ -1499,6 +1597,63 @@ function createApp(config, deps) {
     if (!job) return res.status(404).json({ error: { message: "Job not found" } });
     return res.json(job);
   });
+  api.post("/v1/router/execute", async (req, res, next) => {
+    try {
+      const payload = executeRequestSchema.parse(req.body);
+      if (payload.request.stream) {
+        return res.status(400).json({ error: { message: "Streaming is not supported by ollama-agent-router v1" } });
+      }
+      const model = findConfiguredModel(config, payload.selectedModel);
+      if (!model) {
+        return res.status(404).json({ error: { message: `Unknown configured model: ${payload.selectedModel}` } });
+      }
+      const priorityName = payload.priority ?? payload.routerDecision?.priority ?? config.queue.defaultPriority;
+      const output2 = await deps.queue.runSync({
+        model,
+        request: payload.request,
+        priority: priorityWeights[priorityName],
+        timeoutMs: payload.request.router?.maxExecutionTimeMs ?? config.queue.timeoutMs
+      });
+      return res.json({
+        result: output2.result,
+        nodeId: config.server.nodeId,
+        selectedModel: model.name,
+        queueTimeMs: output2.queueTimeMs,
+        executionTimeMs: output2.executionTimeMs
+      });
+    } catch (error) {
+      next(error);
+    }
+  });
+  api.post("/v1/router/jobs", (req, res, next) => {
+    try {
+      const payload = createRouterJobSchema.parse(req.body);
+      if (payload.request.stream) {
+        return res.status(400).json({ error: { message: "Streaming is not supported by ollama-agent-router v1" } });
+      }
+      const model = findConfiguredModel(config, payload.selectedModel);
+      if (!model) {
+        return res.status(404).json({ error: { message: `Unknown configured model: ${payload.selectedModel}` } });
+      }
+      const classification = normalizeClassification(config, payload.classification);
+      const priorityName = payload.priority ?? payload.routerDecision?.priority ?? config.queue.defaultPriority;
+      const job = deps.queue.enqueueAsync({
+        model,
+        request: payload.request,
+        classification,
+        priority: priorityWeights[priorityName]
+      });
+      return res.status(202).json({
+        id: job.id,
+        status: "queued",
+        position: job.position,
+        nodeId: config.server.nodeId,
+        selectedModel: model.name
+      });
+    } catch (error) {
+      next(error);
+    }
+  });
   api.post("/v1/chat/completions", async (req, res, next) => {
     try {
       const request = chatRequestSchema.parse(req.body);
@@ -1567,11 +1722,74 @@ function createApp(config, deps) {
   app.use(normalizeBasePath(config.server.basePath), api);
   app.use((error, _req, res, _next) => {
     const message = error instanceof Error ? error.message : String(error);
-    const status = error instanceof z3.ZodError ? 400 : 500;
+    const status = error instanceof z3.ZodError ? 400 : error instanceof OllamaHttpError ? 502 : 500;
     res.status(status).json({ error: { message } });
   });
   return app;
 }
+function buildCapabilities(config) {
+  return {
+    nodeId: config.server.nodeId,
+    status: "ok",
+    version: packageJson.version,
+    router: config.router,
+    gpu: {
+      requireGpuOnlyByDefault: config.gpu.requireGpuOnlyByDefault,
+      vramSafetyReserveMb: config.gpu.vramSafetyReserveMb
+    },
+    queue: {
+      defaultPriority: config.queue.defaultPriority,
+      timeoutMs: config.queue.timeoutMs
+    },
+    models: config.models,
+    routes: config.routes
+  };
+}
+async function buildRuntimeSnapshot(config, deps) {
+  const [ollamaReachable, loadedModels, gpu] = await Promise.all([
+    safeOllamaReachable(deps.ollama),
+    safeLoadedModels(deps.ollama),
+    safeGpu(deps.gpu)
+  ]);
+  const status = ollamaReachable ? config.gpu.monitor.enabled && config.gpu.provider !== "none" && !gpu ? "degraded" : "ok" : "unavailable";
+  return {
+    nodeId: config.server.nodeId,
+    status,
+    timestamp: (/* @__PURE__ */ new Date()).toISOString(),
+    ollama: {
+      baseUrl: config.ollama.baseUrl,
+      reachable: ollamaReachable
+    },
+    gpu: gpu ? { provider: config.gpu.provider, snapshotAgeMs: 0, ...gpu } : void 0,
+    loadedModels,
+    queues: deps.queue.snapshot(),
+    jobs: deps.jobs.summary()
+  };
+}
+function findConfiguredModel(config, selectedModel) {
+  return config.models.find((model) => model.name === selectedModel);
+}
+function normalizeClassification(config, classification) {
+  return {
+    taskType: classification?.taskType ?? config.router.defaultTaskType,
+    complexity: classification?.complexity ?? "medium",
+    requiresLargeContext: classification?.requiresLargeContext ?? false,
+    requiresToolUse: classification?.requiresToolUse ?? false,
+    confidence: classification?.confidence ?? 1
+  };
+}
+function countJobsByStatusAndModel(jobs) {
+  const counts = /* @__PURE__ */ new Map();
+  for (const job of jobs) {
+    const status = job.status;
+    const model = job.selected_model ?? "unknown";
+    const key = `${status}\0${model}`;
+    const current = counts.get(key) ?? { status, model, count: 0 };
+    current.count += 1;
+    counts.set(key, current);
+  }
+  return [...counts.values()];
+}
 async function startServer(config, deps) {
   const app = createApp(config, deps);
   const server = await createHttpServer(config, app);
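
The `status` field in `buildRuntimeSnapshot` above is computed with a nested ternary that is easy to misread. The sketch below restates the same decision as early returns: `unavailable` when Ollama is unreachable, `degraded` when GPU monitoring is enabled for a provider other than `none` but no snapshot was obtained, and `ok` otherwise. The `StatusInputs` type and `deriveRuntimeStatus` name are assumptions made for illustration. The bundled code reads these values directly from `config` and the probe results.

```typescript
// Illustrative restatement of the status logic (names are hypothetical).
type RuntimeStatus = "ok" | "degraded" | "unavailable";

interface StatusInputs {
  ollamaReachable: boolean;   // result of the Ollama health probe
  gpuMonitorEnabled: boolean; // config.gpu.monitor.enabled
  gpuProvider: string;        // config.gpu.provider
  hasGpuSnapshot: boolean;    // whether the GPU probe returned a snapshot
}

function deriveRuntimeStatus(inputs: StatusInputs): RuntimeStatus {
  // Without Ollama the node cannot execute anything.
  if (!inputs.ollamaReachable) return "unavailable";
  // A GPU snapshot is only expected when monitoring is on and a provider is configured.
  const gpuExpected = inputs.gpuMonitorEnabled && inputs.gpuProvider !== "none";
  if (gpuExpected && !inputs.hasGpuSnapshot) return "degraded";
  return "ok";
}

console.log(
  deriveRuntimeStatus({
    ollamaReachable: true,
    gpuMonitorEnabled: true,
    gpuProvider: "nvidia",
    hasGpuSnapshot: false
  })
);
```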
@@ -1633,18 +1851,25 @@ async function safeGpu(gpu) {
     return void 0;
   }
 }
+async function safeOllamaReachable(ollama) {
+  try {
+    return await ollama.health();
+  } catch {
+    return false;
+  }
+}
 function escapeMetricLabel(label) {
   return label.replaceAll("\\", "\\\\").replaceAll('"', '\\"');
 }
 // src/cli-program.ts
-var packageJson = JSON.parse(readFileSync(new URL("../package.json", import.meta.url), "utf8"));
+var packageJson2 = JSON.parse(readFileSync2(new URL("../package.json", import.meta.url), "utf8"));
 function createProgram() {
   const program = new Command();
-  program.name("ollama-agent-router").alias("oar").description("Intelligent HTTP/CLI router for Ollama").version(packageJson.version, "-v, --version", "display version").option("-c, --config <path>", "config file path").option("-u, --url <url>", "router URL for client commands", "http://127.0.0.1:11435").option("--base-path <path>", "router API base path for client commands", "/");
+  program.name("ollama-agent-router").alias("oar").description("Intelligent HTTP/CLI router for Ollama").version(packageJson2.version, "-v, --version", "display version").option("-c, --config <path>", "config file path").option("-u, --url <url>", "router URL for client commands", "http://127.0.0.1:11435").option("--base-path <path>", "router API base path for client commands", "/");
   program.command("serve").description("start the router server").option("-c, --config <path>", "config file path").action(async (options) => {
     const { config, path } = await loadConfig(options.config ?? program.opts().config);
-    const jobs = new InMemoryJobStore(config.jobs);
+    const jobs = new InMemoryJobStore(config.jobs, config.server.nodeId);
     const ollama = new HttpOllamaClient(config.ollama);
     const gpu = new NvidiaGpuMonitor(config.gpu);
     const queue = new QueueManager(config, ollama, jobs);