@khanglvm/llm-router 2.5.1 → 2.6.0

package/CHANGELOG.md CHANGED
@@ -7,6 +7,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
  ## [Unreleased]
 
+ ## [2.6.0] - 2026-04-23
+
+ ### Added
+ - Local `llama.cpp` variants can now persist a per-model runtime profile, including auto-tuned presets and custom launch overrides, so each GGUF variant can run with settings that match its own size and context shape instead of sharing one global `llama-server` startup profile.
+ - The Web UI now exposes managed `llama.cpp` runtime health for Local Models, including tracked instance counts, healthy/stale summaries, and persisted runtime-profile data for each saved variant.
+
+ ### Changed
+ - Local variant requests are now resolved through a managed per-variant `llama.cpp` runtime layer that can reuse compatible instances, allocate fallback ports safely, and start the right runtime configuration for the specific model variant without exposing multi-process lifecycle management to the user.
+ - Hugging Face GGUF search/download flows now surface file size plus estimated runtime memory guidance directly in the Local Models workflow, making it easier to choose a viable quantization before download.
+
+ ### Fixed
+ - Managed `llama.cpp` runtimes now reconcile stale tracked instances before reuse, avoid reserving dead immediate-exit servers, and drain pending shutdown/startup edges more reliably so local per-model routing does not leave behind stale `llama-server` processes.
+
+ ## [2.5.2] - 2026-04-23
+
+ ### Fixed
+ - `yarn dev` now force-reclaims stale dev web-console listeners on startup and restarts matching stale dev routers so the next dev session takes over the sandbox cleanly instead of inheriting the old process.
+
  ## [2.5.1] - 2026-04-23
 
  ### Fixed
package/README.md CHANGED
@@ -29,7 +29,7 @@ llr ai-help # agent-oriented setup brief
  - **Model aliases with routing** — group models into stable alias names with weighted round-robin, quota-aware balancing, and automatic fallback
  - **Rate limiting** — set request caps per model or across all models over configurable time windows
  - **Coding tool routing** — one-click routing config for Codex CLI, Claude Code, Factory Droid, and AMP
- - **Dev sandbox** — `yarn dev` runs the console against a dedicated dev config/router port, highlights dev mode in terminal + UI, and can clone the production config into the sandbox for quick iteration
+ - **Dev sandbox** — `yarn dev` runs the console against a dedicated dev config/router port, highlights dev mode in terminal + UI, can clone the production config into the sandbox for quick iteration, and automatically reclaims stale dev listeners before the next session starts
  - **Claude native web tools** — local handling for Claude web search and page fetch requests, with selectable Claude Code web-search providers from the shared Web Search config
  - **Seamless local updates** — `llr update` keeps the fixed local router endpoint online, drains in-flight requests, and automatically retries through backend restart windows
  - **Web search** — built-in web search for AMP and other router-managed tools
@@ -44,6 +44,9 @@ Open `llr` and use the **Local Models** tab to manage local inference sources al
  - **Native macOS browsing** — use the built-in file picker to choose a single GGUF file, scan a folder recursively for GGUF models, or browse directly to a local `llama-server` binary
  - **Managed + attached model library** — stale or moved files stay visible instead of crashing the app, and can be repaired by locating the file again or removed cleanly
  - **Router-visible local variants** — create friendly model variants with bounded presets, context-window metadata, preload toggles, and Mac unified-memory fit guidance with clearer safe/tight recommendations
+ - **Per-variant llama.cpp tuning** — each local variant can store its own runtime profile so balanced, throughput, long-context, low-memory, or custom launch overrides do not fight over one shared global `llama-server` config
+ - **Managed per-model runtimes** — the router automatically starts, reuses, and stops the right `llama.cpp` instance for the requested local variant, with stale-runtime cleanup handled internally instead of asking the user to manage separate servers
+ - **GGUF size + memory guidance** — Hugging Face search results now show model file size plus estimated runtime memory fit guidance before download, helping choose viable quantizations faster
  - **Alias-ready local routing** — once saved, local variants behave like normal router models and can be used in aliases, capability flags, and fallback chains
 
  For v1, the managed download flow only searches public Hugging Face GGUF files and the fit guidance is tuned for Macs with unified memory.
@@ -60,7 +63,7 @@ That means `llr update` can install a new package version and gracefully swap th
  yarn dev
  ```
 
- Development mode uses the dedicated `~/.llm-router-dev.json` config and its own local router port so it can run alongside a startup-managed or manually started production router. The terminal and Web UI both show a dev-mode indicator, and the dev Web UI includes a one-click sync action to copy the current production config into the sandbox without changing the dev router binding.
+ Development mode uses the dedicated `~/.llm-router-dev.json` config and its own local router port so it can run alongside a startup-managed or manually started production router. The terminal and Web UI both show a dev-mode indicator, the dev Web UI includes a one-click sync action to copy the current production config into the sandbox without changing the dev router binding, and each new `yarn dev` run automatically takes over any stale dev web-console/router listeners from a prior session.
 
  ## Web UI
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@khanglvm/llm-router",
-   "version": "2.5.1",
+   "version": "2.6.0",
    "description": "LLM Router: single gateway endpoint for multi-provider LLMs with unified OpenAI+Anthropic format and seamless fallback",
    "keywords": [
      "llm-router",
@@ -0,0 +1,114 @@
+ import { getActiveRuntimeState } from "./instance-state.js";
+ import { reclaimPort } from "./port-reclaim.js";
+ import { startWebConsoleServer } from "./web-console-server.js";
+
+ const DEV_ROUTER_STOP_REASON = "Stopping the dev router because the dev web console exited.";
+
+ function normalizeHost(value) {
+   return String(value || "127.0.0.1").trim() || "127.0.0.1";
+ }
+
+ function shouldRestartStaleDevRouter(runtimeBeforeStart, runtimeAfterStart, snapshot) {
+   if (!runtimeBeforeStart || !runtimeAfterStart || !snapshot?.router?.running) return false;
+   if (Number(runtimeBeforeStart.pid) !== Number(runtimeAfterStart.pid)) return false;
+   if (snapshot?.config?.parseError) return false;
+   if (!Number(snapshot?.config?.providerCount)) return false;
+
+   const localServer = snapshot?.config?.localServer || {};
+   return Number(runtimeAfterStart.port) === Number(localServer.port)
+     && normalizeHost(runtimeAfterStart.host) === normalizeHost(localServer.host);
+ }
+
+ async function stopDevRouterAfterExit(server, onError) {
+   if (!server || typeof server.stopRouter !== "function") return;
+
+   try {
+     await server.stopRouter({
+       reason: DEV_ROUTER_STOP_REASON,
+       reclaimPortIfStopped: true
+     });
+   } catch (error) {
+     onError(`Failed stopping the dev router during shutdown: ${error instanceof Error ? error.message : String(error)}`);
+   }
+ }
+
+ export async function startManagedDevWebConsole(options = {}, deps = {}) {
+   const line = typeof deps.line === "function" ? deps.line : console.log;
+   const error = typeof deps.error === "function" ? deps.error : console.error;
+   const startWebConsoleServerFn = typeof deps.startWebConsoleServer === "function"
+     ? deps.startWebConsoleServer
+     : startWebConsoleServer;
+   const getActiveRuntimeStateFn = typeof deps.getActiveRuntimeState === "function"
+     ? deps.getActiveRuntimeState
+     : getActiveRuntimeState;
+   const reclaimPortFn = typeof deps.reclaimPort === "function"
+     ? deps.reclaimPort
+     : (args) => reclaimPort(args, deps);
+   const serverOptions = {
+     ...options,
+     devMode: true
+   };
+   const runtimeBeforeStart = await getActiveRuntimeStateFn().catch(() => null);
+
+   let server;
+   try {
+     server = await startWebConsoleServerFn(serverOptions);
+   } catch (startError) {
+     if (startError?.code !== "EADDRINUSE") throw startError;
+
+     const reclaimed = await reclaimPortFn({
+       port: serverOptions.port,
+       line,
+       error
+     });
+     if (!reclaimed?.ok) {
+       throw new Error(reclaimed?.errorMessage || `Failed to reclaim port ${serverOptions.port}.`);
+     }
+
+     line(`Port ${serverOptions.port} reclaimed successfully.`);
+     server = await startWebConsoleServerFn(serverOptions);
+   }
+
+   const startupSnapshot = typeof server.getSnapshot === "function"
+     ? await server.getSnapshot().catch(() => null)
+     : null;
+   const runtimeAfterStart = await getActiveRuntimeStateFn().catch(() => null);
+   if (shouldRestartStaleDevRouter(runtimeBeforeStart, runtimeAfterStart, startupSnapshot)
+     && typeof server.restartRouter === "function") {
+     await server.restartRouter(startupSnapshot.config.localServer);
+   }
+
+   let stopRouterPromise = null;
+   const ensureDevRouterStopped = () => {
+     if (stopRouterPromise) return stopRouterPromise;
+     stopRouterPromise = stopDevRouterAfterExit(server, error);
+     return stopRouterPromise;
+   };
+
+   const done = (async () => {
+     let result;
+     try {
+       result = await server.done;
+     } finally {
+       await ensureDevRouterStopped();
+     }
+     return result;
+   })();
+
+   let shutdownPromise = null;
+   const shutdown = async (reason = "dev-console-closed") => {
+     if (shutdownPromise) return shutdownPromise;
+     shutdownPromise = (async () => {
+       await ensureDevRouterStopped();
+       await server.close(reason);
+       return done;
+     })();
+     return shutdownPromise;
+   };
+
+   return {
+     ...server,
+     done,
+     shutdown
+   };
+ }
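The managed dev console above starts the web console and, if the port is still held by a stale listener, reclaims it and retries exactly once. A minimal self-contained sketch of that start/reclaim/retry pattern, with the server start and port reclaim stubbed out as injected functions (`startWithReclaim` and `makeFlakyStart` are illustrative names, not part of the package's API):

```javascript
// Retry a server start once after reclaiming the port, but only when the
// failure is a port conflict (EADDRINUSE); any other error is re-thrown.
async function startWithReclaim(startFn, reclaimFn, port) {
  try {
    return await startFn(port);
  } catch (err) {
    if (err?.code !== "EADDRINUSE") throw err; // only reclaim on port conflicts
    const reclaimed = await reclaimFn(port);
    if (!reclaimed?.ok) {
      throw new Error(reclaimed?.errorMessage || `Failed to reclaim port ${port}.`);
    }
    return startFn(port); // second attempt after the stale listener is gone
  }
}

// Simulate a stale listener: the first start fails with EADDRINUSE,
// reclaim "kills" it, and the retry succeeds.
function makeFlakyStart() {
  let stale = true;
  return {
    start: async (port) => {
      if (stale) {
        const err = new Error(`listen EADDRINUSE: ${port}`);
        err.code = "EADDRINUSE";
        throw err;
      }
      return { port, running: true };
    },
    reclaim: async () => {
      stale = false;
      return { ok: true };
    }
  };
}
```

The real implementation injects `reclaimPort` through `deps` the same way, which is what makes this path unit-testable.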
@@ -1,5 +1,6 @@
  import path from "node:path";
  import { promises as fs } from "node:fs";
+ import { estimateLlamacppRuntimeBytes } from "./llamacpp-runtime-profile.js";
 
  const HUGGING_FACE_API_URL = "https://huggingface.co/api/models";
  const HUGGING_FACE_BASE_URL = "https://huggingface.co";
@@ -154,6 +155,13 @@ export function shapeHuggingFaceGgufResults(files, systemInfo = {}) {
    expectedContextWindow: systemInfo?.expectedContextWindow
  }, systemInfo);
  const quantization = parseQuantizationFromFileName(file);
+ const estimatedRuntimeBytes = sizeBytes
+   ? estimateLlamacppRuntimeBytes({
+       sizeBytes,
+       contextWindow: systemInfo?.expectedContextWindow,
+       preset: status.fit === "tight" ? "memory-safe" : "balanced"
+     })
+   : undefined;
  const fitScore = status.fit === "safe" ? 30 : status.fit === "tight" ? 15 : status.fit === "unknown" ? 8 : -20;
  const rankingScore = fitScore
    + (status.disabled ? -100 : 0)
@@ -166,6 +174,10 @@ export function shapeHuggingFaceGgufResults(files, systemInfo = {}) {
  file,
  quantization,
  sizeBytes,
+ estimatedRuntimeBytes,
+ memoryLabel: estimatedRuntimeBytes
+   ? `${(estimatedRuntimeBytes / (1024 ** 3)).toFixed(1)} GB runtime est.`
+   : "Runtime estimate unavailable",
  disabled: status.disabled,
  disabledReason: status.reason,
  fit: status.fit,
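The `memoryLabel` shaped above is simply the byte estimate rendered with one decimal in GiB units (labelled "GB" in the UI), with a fallback string when no estimate exists. Restated as a stand-alone helper (the name `formatMemoryLabel` is illustrative, not an export of the package):

```javascript
// Format an estimated runtime byte count the way the search results do:
// bytes / 1024^3, one decimal place, or a fallback when the estimate is absent.
function formatMemoryLabel(estimatedRuntimeBytes) {
  return estimatedRuntimeBytes
    ? `${(estimatedRuntimeBytes / (1024 ** 3)).toFixed(1)} GB runtime est.`
    : "Runtime estimate unavailable";
}
```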
@@ -0,0 +1,202 @@
+ export function createLlamacppManagedRuntimeRegistry(deps = {}) {
+   const instances = new Map();
+   const inFlightStarts = new Map();
+   let nextPort = 39391;
+   const MIN_PORT = 1;
+   const MAX_PORT = 65535;
+
+   function resolveSpawnRuntime(overrides = {}) {
+     if (typeof overrides.spawnRuntime === "function") return overrides.spawnRuntime;
+     if (typeof deps.spawnRuntime === "function") return deps.spawnRuntime;
+     return async ({ host = "127.0.0.1", port } = {}) => ({
+       pid: undefined,
+       host,
+       port,
+       baseUrl: `http://${host}:${port}/v1`
+     });
+   }
+
+   function resolveWaitForHealthy(overrides = {}) {
+     if (typeof overrides.waitForHealthy === "function") return overrides.waitForHealthy;
+     if (typeof deps.waitForHealthy === "function") return deps.waitForHealthy;
+     return async (instance) => ({ ...instance, healthy: true });
+   }
+
+   function resolveListListeningPids(overrides = {}) {
+     if (typeof overrides.listListeningPids === "function") return overrides.listListeningPids;
+     if (typeof deps.listListeningPids === "function") return deps.listListeningPids;
+     return async () => [];
+   }
+
+   function resolveStopProcessByPid(overrides = {}) {
+     if (typeof overrides.stopProcessByPid === "function") return overrides.stopProcessByPid;
+     if (typeof deps.stopProcessByPid === "function") return deps.stopProcessByPid;
+     return async () => {};
+   }
+
+   function isTrackedInstanceReusable(instance) {
+     if (instance?.healthy !== true) return false;
+     const child = instance?.child;
+     if (child) {
+       return child.exitCode === null && child.killed !== true;
+     }
+     return true;
+   }
+
+   function isChildAlive(child) {
+     if (!child) return true;
+     return child.exitCode === null && child.killed !== true;
+   }
+
+   function normalizeRuntimePort(value, fallback = null) {
+     const parsed = Number(value);
+     if (!Number.isInteger(parsed) || parsed < MIN_PORT || parsed > MAX_PORT) return fallback;
+     return parsed;
+   }
+
+   function buildCompatibilityKey(variantKey, profileHash) {
+     return `${String(variantKey || "")}::${String(profileHash || "")}`;
+   }
+
+   function buildReservedPorts() {
+     const reserved = new Set();
+     for (const instance of instances.values()) {
+       if (!isChildAlive(instance?.child)) continue;
+       const port = normalizeRuntimePort(instance?.port);
+       if (port !== null) reserved.add(port);
+     }
+     for (const start of inFlightStarts.values()) {
+       const port = normalizeRuntimePort(start?.reservedPort);
+       if (port !== null) reserved.add(port);
+     }
+     return reserved;
+   }
+
+   function pruneDeadInstances() {
+     for (const [instanceId, instance] of instances.entries()) {
+       if (!isChildAlive(instance?.child)) {
+         instances.delete(instanceId);
+       }
+     }
+   }
+
+   function allocatePort(preferredPort) {
+     const reservedPorts = buildReservedPorts();
+     const preferred = normalizeRuntimePort(preferredPort);
+     if (preferred !== null && !reservedPorts.has(preferred)) {
+       if (preferred >= nextPort) nextPort = preferred + 1;
+       return preferred;
+     }
+
+     let port = Math.max(39391, nextPort);
+     if (port > MAX_PORT) {
+       port = 39391;
+     }
+     const startPort = port;
+     while (reservedPorts.has(port)) {
+       port += 1;
+       if (port > MAX_PORT) {
+         port = 39391;
+       }
+       if (port === startPort) {
+         throw new Error("No available managed runtime port.");
+       }
+     }
+
+     nextPort = port + 1;
+     return port;
+   }
+
+   async function ensureRuntimeForVariant({ variantKey, profileHash, launchArgs, preferredPort } = {}, runtimeDeps = {}) {
+     const spawnRuntime = resolveSpawnRuntime(runtimeDeps);
+     const waitForHealthy = resolveWaitForHealthy(runtimeDeps);
+     const compatibilityKey = buildCompatibilityKey(variantKey, profileHash);
+     pruneDeadInstances();
+
+     for (const instance of instances.values()) {
+       if (
+         instance.profileHash === profileHash
+         && instance.variantKey === variantKey
+         && isTrackedInstanceReusable(instance)
+       ) {
+         return instance;
+       }
+     }
+
+     const inFlight = inFlightStarts.get(compatibilityKey);
+     if (inFlight?.promise) {
+       return inFlight.promise;
+     }
+
+     const port = allocatePort(preferredPort);
+     const startPromise = (async () => {
+       const spawned = await spawnRuntime({ variantKey, profileHash, launchArgs, port });
+       const healthy = await waitForHealthy(spawned);
+       const assignedPort = normalizeRuntimePort(healthy?.port, port);
+       if (!isChildAlive(healthy?.child)) {
+         throw new Error("Managed runtime exited before becoming healthy.");
+       }
+       const instance = {
+         instanceId: `${variantKey}:${profileHash}:${assignedPort}`,
+         owner: "llm-router",
+         variantKey,
+         profileHash,
+         healthy: true,
+         ...healthy,
+         port: assignedPort
+       };
+       instances.set(instance.instanceId, instance);
+       return instance;
+     })().finally(() => {
+       inFlightStarts.delete(compatibilityKey);
+     });
+
+     inFlightStarts.set(compatibilityKey, { promise: startPromise, reservedPort: port });
+     return startPromise;
+   }
+
+   async function reconcile(runtimeDeps = {}) {
+     const listListeningPids = resolveListListeningPids(runtimeDeps);
+     const stopProcessByPid = resolveStopProcessByPid(runtimeDeps);
+     for (const [instanceId, instance] of instances.entries()) {
+       const probe = await listListeningPids(instance.port).catch(() => null);
+       const livePids = Array.isArray(probe)
+         ? probe
+         : (probe && typeof probe === "object" && Array.isArray(probe.pids) ? probe.pids : null);
+       const probeFailed = Boolean(probe && typeof probe === "object" && probe.ok === false);
+       if (probeFailed || !Array.isArray(livePids)) continue;
+       if (livePids.includes(instance.pid)) continue;
+       if (instance.owner === "llm-router") {
+         await stopProcessByPid(instance.pid).catch(() => {});
+       }
+       instances.delete(instanceId);
+     }
+   }
+
+   async function waitForInFlightStarts() {
+     while (inFlightStarts.size > 0) {
+       const pending = [...inFlightStarts.values()]
+         .map((entry) => entry?.promise)
+         .filter(Boolean)
+         .map((promise) => promise.catch(() => null));
+       if (pending.length === 0) return;
+       await Promise.all(pending);
+     }
+   }
+
+   return {
+     ensureRuntimeForVariant,
+     reconcile,
+     waitForInFlightStarts,
+     trackInstance: async (instance) => {
+       instances.set(instance.instanceId, { ...instance });
+     },
+     untrackInstance: async (instanceId) => {
+       instances.delete(instanceId);
+     },
+     clear: async () => {
+       instances.clear();
+     },
+     snapshot: () => [...instances.values()]
+   };
+ }
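Two behaviors of the registry above are easy to isolate: concurrent requests for the same variant/profile compatibility key join a single in-flight start, and ports reserved by live or starting instances are skipped when allocating the next one. A condensed self-contained sketch of both ideas (`makeRegistrySketch` and the fake `spawn` callback are illustrative stand-ins, not the package's exports):

```javascript
// Deduplicate concurrent starts per key and reserve ports so that
// parallel starts never collide — the core of ensureRuntimeForVariant.
function makeRegistrySketch(spawn) {
  const instances = new Map(); // compatibilityKey -> started instance
  const inFlight = new Map();  // compatibilityKey -> { promise, reservedPort }
  let nextPort = 39391;

  function allocatePort() {
    const reserved = new Set(
      [...instances.values()].map((i) => i.port)
        .concat([...inFlight.values()].map((s) => s.reservedPort))
    );
    let port = nextPort;
    while (reserved.has(port)) port += 1;
    nextPort = port + 1;
    return port;
  }

  async function ensure(key) {
    if (instances.has(key)) return instances.get(key);       // reuse healthy instance
    if (inFlight.has(key)) return inFlight.get(key).promise; // join pending start
    const port = allocatePort();                             // reserved for this start
    const promise = (async () => {
      const instance = await spawn({ key, port });
      instances.set(key, instance);
      return instance;
    })().finally(() => inFlight.delete(key));
    inFlight.set(key, { promise, reservedPort: port });
    return promise;
  }

  return { ensure };
}
```

The real registry adds health probing, dead-child pruning, and port wraparound on top of this core.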
@@ -0,0 +1,133 @@
+ function isPlainObject(value) {
+   return Boolean(value) && typeof value === "object" && !Array.isArray(value);
+ }
+
+ function normalizeString(value) {
+   return typeof value === "string" ? value.trim() : "";
+ }
+
+ function toGiB(bytes) {
+   return Math.round((Number(bytes || 0) / (1024 ** 3)) * 10) / 10;
+ }
+
+ function normalizePositiveInteger(value, fallback) {
+   const parsed = Number(value);
+   if (!Number.isFinite(parsed) || parsed <= 0) return fallback;
+   return Math.floor(parsed);
+ }
+
+ const LLAMACPP_PRESET_TUNING = Object.freeze({
+   balanced: Object.freeze({
+     canonicalPreset: "balanced",
+     batchSize: 64,
+     ubatchSize: 16,
+     gpuLayers: { darwin: 99, other: 0 },
+     penaltyRatio: 0.10,
+     noContBatching: false
+   }),
+   "long-context": Object.freeze({
+     canonicalPreset: "long-context",
+     batchSize: 32,
+     ubatchSize: 8,
+     gpuLayers: { darwin: 80, other: 0 },
+     penaltyRatio: 0.16,
+     noContBatching: false
+   }),
+   "low-memory": Object.freeze({
+     canonicalPreset: "low-memory",
+     batchSize: 32,
+     ubatchSize: 8,
+     gpuLayers: { darwin: 0, other: 0 },
+     penaltyRatio: 0.04,
+     noContBatching: true
+   }),
+   "fast-response": Object.freeze({
+     canonicalPreset: "fast-response",
+     batchSize: 16,
+     ubatchSize: 8,
+     gpuLayers: { darwin: 40, other: 0 },
+     penaltyRatio: 0.07,
+     noContBatching: false
+   }),
+   "cpu-safe": Object.freeze({
+     canonicalPreset: "cpu-safe",
+     batchSize: 32,
+     ubatchSize: 8,
+     gpuLayers: { darwin: 0, other: 0 },
+     penaltyRatio: 0.04,
+     noContBatching: true
+   })
+ });
+
+ function resolveCanonicalPreset(requestedPreset) {
+   const normalizedPreset = normalizeString(requestedPreset).toLowerCase();
+   if (normalizedPreset === "throughput") return LLAMACPP_PRESET_TUNING["fast-response"];
+   if (normalizedPreset === "memory-safe") return LLAMACPP_PRESET_TUNING["low-memory"];
+   return LLAMACPP_PRESET_TUNING[normalizedPreset] || LLAMACPP_PRESET_TUNING.balanced;
+ }
+
+ export function estimateLlamacppRuntimeBytes({
+   sizeBytes = 0,
+   contextWindow = 0,
+   preset = "balanced"
+ } = {}) {
+   const base = Number(sizeBytes || 0);
+   const contextBytes = Number(contextWindow || 0) * 163840;
+   const tuning = resolveCanonicalPreset(preset);
+   const presetPenalty = Math.floor(base * tuning.penaltyRatio);
+   return base + contextBytes + presetPenalty;
+ }
+
+ export function deriveLlamacppLaunchProfile({
+   variant,
+   baseModel,
+   system
+ } = {}) {
+   const requestedPreset = normalizeString(variant?.preset)
+     || normalizeString(variant?.runtimeProfile?.preset)
+     || "balanced";
+   const failureCategory = normalizeString(
+     variant?.runtimeProfile?.lastFailure?.category || variant?.runtimeStatus?.lastFailure?.category
+   );
+   const tuning = resolveCanonicalPreset(failureCategory === "metal-oom" ? "cpu-safe" : requestedPreset);
+   const preset = tuning.canonicalPreset;
+   const contextWindow = normalizePositiveInteger(variant?.contextWindow, 2048);
+   const overrides = isPlainObject(variant?.runtimeProfile?.overrides) ? variant.runtimeProfile.overrides : {};
+   const extraArgs = Array.isArray(variant?.runtimeProfile?.extraArgs)
+     ? variant.runtimeProfile.extraArgs.map((value) => normalizeString(value)).filter(Boolean)
+     : [];
+   const gpuLayers = Number.isFinite(Number(overrides.gpuLayers))
+     ? Math.floor(Number(overrides.gpuLayers))
+     : (system?.platform === "darwin" ? tuning.gpuLayers.darwin : tuning.gpuLayers.other);
+   const batchSize = Number.isFinite(Number(overrides.batchSize))
+     ? Math.floor(Number(overrides.batchSize))
+     : tuning.batchSize;
+   const ubatchSize = Number.isFinite(Number(overrides.ubatchSize))
+     ? Math.floor(Number(overrides.ubatchSize))
+     : tuning.ubatchSize;
+   const estimatedRuntimeBytes = estimateLlamacppRuntimeBytes({
+     sizeBytes: baseModel?.metadata?.sizeBytes,
+     contextWindow,
+     preset
+   });
+   const args = [
+     "-m", normalizeString(baseModel?.path),
+     "-a", normalizeString(variant?.id),
+     "-c", String(contextWindow),
+     "-np", "1",
+     "-b", String(batchSize),
+     "-ub", String(ubatchSize),
+     "--cache-ram", "0",
+     "--no-warmup"
+   ];
+
+   if (tuning.noContBatching) args.push("--no-cont-batching");
+   args.push("-ngl", String(gpuLayers), ...extraArgs);
+
+   return {
+     preset,
+     args: args.filter(Boolean),
+     estimatedRuntimeBytes,
+     memoryLabel: `${toGiB(estimatedRuntimeBytes)} GB`
+   };
+ }
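The memory estimate in `estimateLlamacppRuntimeBytes` above is the model file size, plus a flat 163,840 bytes per context-window token, plus a preset-dependent fraction of the file size (0.10 for `balanced`). A worked restatement of that formula for a hypothetical 4 GiB GGUF at an 8192-token context (the helper names `estimateBytes`/`toGiB` below re-state the formula for illustration and are not the package's exports):

```javascript
// estimated bytes = sizeBytes + contextWindow * 163840 + floor(sizeBytes * penaltyRatio)
function estimateBytes({ sizeBytes, contextWindow, penaltyRatio }) {
  return sizeBytes + contextWindow * 163840 + Math.floor(sizeBytes * penaltyRatio);
}

// Same rounding as the file's toGiB helper: one decimal place in GiB.
function toGiB(bytes) {
  return Math.round((bytes / (1024 ** 3)) * 10) / 10;
}

// A 4 GiB model under the balanced preset (penaltyRatio 0.10) at 8192 context:
// 4294967296 + 8192*163840 + floor(4294967296*0.10) = 6066641305 bytes ≈ 5.6 GiB.
const estimate = estimateBytes({
  sizeBytes: 4 * 1024 ** 3,
  contextWindow: 8192,
  penaltyRatio: 0.10
});
```

So roughly 1.3 GiB of the estimate comes from the context window alone, which is why the search flow picks the `memory-safe` preset when a file's fit is already tight.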