@khanglvm/llm-router 2.4.1 → 2.5.2

package/CHANGELOG.md CHANGED
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [2.5.2] - 2026-04-23
+
+### Fixed
+- `yarn dev` now force-reclaims stale dev web-console listeners on startup and restarts matching stale dev routers so the next dev session takes over the sandbox cleanly instead of inheriting the old process.
+
+## [2.5.1] - 2026-04-23
+
+### Fixed
+- Relaxed the live Claude Code publish smoke check so short affirmative routed replies such as `OK` or `好的` no longer fail `npm publish` when the end-to-end router path is otherwise healthy.
+
+## [2.5.0] - 2026-04-23
+
+### Added
+- Local Models can now use a native macOS file/folder picker to attach GGUF files in place, scan a selected folder recursively for GGUF artifacts, and browse directly to a local `llama-server` runtime binary.
+
+### Changed
+- Hugging Face GGUF search results for Local Models now rank quantizations more intelligently, show tighter Mac memory-fit guidance, and call out better long-context download choices for 64 GB Macs.
+- `llama.cpp` runtime detection now searches common local source-build locations in addition to `PATH` and Homebrew installs, and server validation now recognizes more `llama-server` help output variants including TurboQuant builds.
+
+### Fixed
+- OpenAI-to-Claude response translation now preserves Anthropic-compatible usage metadata such as `speed`, `service_tier`, cache counters, and tool-usage fields so Claude Code no longer trips over missing `usage.speed` on routed responses.
+
 ## [2.4.1] - 2026-04-19
 
 ### Fixed
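The 2.5.0 usage-metadata fix amounts to passing unknown usage fields through instead of rebuilding the object from token counts alone. A hypothetical sketch of that idea (not the router's actual translator; field and function names here are illustrative only):

```javascript
// Hypothetical sketch: map OpenAI-style token counts onto Anthropic-style
// names, and spread every remaining usage field (speed, service_tier,
// cache counters, tool-usage fields) through unchanged so downstream
// clients that expect them, like Claude Code, still find them.
function translateUsage(openAiUsage = {}) {
  const { prompt_tokens = 0, completion_tokens = 0, ...passThrough } = openAiUsage;
  return {
    input_tokens: prompt_tokens,
    output_tokens: completion_tokens,
    ...passThrough
  };
}

console.log(translateUsage({ prompt_tokens: 12, completion_tokens: 3, speed: "fast" }));
// → input_tokens 12, output_tokens 3, and speed preserved
```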
package/README.md CHANGED
@@ -29,13 +29,25 @@ llr ai-help # agent-oriented setup brief
 - **Model aliases with routing** — group models into stable alias names with weighted round-robin, quota-aware balancing, and automatic fallback
 - **Rate limiting** — set request caps per model or across all models over configurable time windows
 - **Coding tool routing** — one-click routing config for Codex CLI, Claude Code, Factory Droid, and AMP
-- **Dev sandbox** — `yarn dev` runs the console against a dedicated dev config/router port, highlights dev mode in terminal + UI, and can clone the production config into the sandbox for quick iteration
+- **Dev sandbox** — `yarn dev` runs the console against a dedicated dev config/router port, highlights dev mode in terminal + UI, can clone the production config into the sandbox for quick iteration, and automatically reclaims stale dev listeners before the next session starts
 - **Claude native web tools** — local handling for Claude web search and page fetch requests, with selectable Claude Code web-search providers from the shared Web Search config
 - **Seamless local updates** — `llr update` keeps the fixed local router endpoint online, drains in-flight requests, and automatically retries through backend restart windows
 - **Web search** — built-in web search for AMP and other router-managed tools
 - **Deployable** — run locally or deploy to Cloudflare Workers
 - **AI-agent friendly** — full CLI parity with `llr config --operation=...` so agents can configure everything programmatically
 
+## Local Models
+
+Open `llr` and use the **Local Models** tab to manage local inference sources alongside hosted providers.
+
+- **`llama.cpp` runtime** — detect or point at a local `llama-server`, attach GGUF files in place, or download public GGUF artifacts into the router-managed library under `~/.llm-router/local-models`
+- **Native macOS browsing** — use the built-in file picker to choose a single GGUF file, scan a folder recursively for GGUF models, or browse directly to a local `llama-server` binary
+- **Managed + attached model library** — stale or moved files stay visible instead of crashing the app, and can be repaired by locating the file again or removed cleanly
+- **Router-visible local variants** — create friendly model variants with bounded presets, context-window metadata, preload toggles, and Mac unified-memory fit guidance with clearer safe/tight recommendations
+- **Alias-ready local routing** — once saved, local variants behave like normal router models and can be used in aliases, capability flags, and fallback chains
+
+For v1, the managed download flow only searches public Hugging Face GGUF files and the fit guidance is tuned for Macs with unified memory.
+
 ## Local Runtime Reliability
 
 `llr start` keeps a small supervisor bound to the fixed local router port and runs the real router backend behind it on an internal loopback port.
@@ -48,7 +60,7 @@ That means `llr update` can install a new package version and gracefully swap th
 yarn dev
 ```
 
-Development mode uses the dedicated `~/.llm-router-dev.json` config and its own local router port so it can run alongside a startup-managed or manually started production router. The terminal and Web UI both show a dev-mode indicator, and the dev Web UI includes a one-click sync action to copy the current production config into the sandbox without changing the dev router binding.
+Development mode uses the dedicated `~/.llm-router-dev.json` config and its own local router port so it can run alongside a startup-managed or manually started production router. The terminal and Web UI both show a dev-mode indicator, the dev Web UI includes a one-click sync action to copy the current production config into the sandbox without changing the dev router binding, and each new `yarn dev` run automatically takes over any stale dev web-console/router listeners from a prior session.
 
 ## Web UI
 
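The README's unified-memory fit guidance reduces to ratio thresholds over the GGUF file size. A standalone sketch mirroring the thresholds in this release's `classifyGgufCandidateForMac` (the shipped classifier adds a stricter rule for 200K-context models, omitted here for brevity):

```javascript
// Standalone sketch of the memory-fit thresholds: a GGUF file above ~85%
// of unified memory is over budget, 40% or more is tight, anything
// smaller is a comfortable fit.
function fitForMac(sizeBytes, totalMemoryBytes) {
  const ratio = sizeBytes / totalMemoryBytes;
  if (ratio > 0.85) return "over-budget";
  if (ratio >= 0.4) return "tight";
  return "safe";
}

const GiB = 1024 ** 3;
console.log(fitForMac(20 * GiB, 64 * GiB)); // → "safe"  (~31% of 64 GB)
console.log(fitForMac(30 * GiB, 64 * GiB)); // → "tight" (~47%)
```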
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@khanglvm/llm-router",
-  "version": "2.4.1",
+  "version": "2.5.2",
   "description": "LLM Router: single gateway endpoint for multi-provider LLMs with unified OpenAI+Anthropic format and seamless fallback",
   "keywords": [
     "llm-router",
@@ -0,0 +1,114 @@
+import { getActiveRuntimeState } from "./instance-state.js";
+import { reclaimPort } from "./port-reclaim.js";
+import { startWebConsoleServer } from "./web-console-server.js";
+
+const DEV_ROUTER_STOP_REASON = "Stopping the dev router because the dev web console exited.";
+
+function normalizeHost(value) {
+  return String(value || "127.0.0.1").trim() || "127.0.0.1";
+}
+
+function shouldRestartStaleDevRouter(runtimeBeforeStart, runtimeAfterStart, snapshot) {
+  if (!runtimeBeforeStart || !runtimeAfterStart || !snapshot?.router?.running) return false;
+  if (Number(runtimeBeforeStart.pid) !== Number(runtimeAfterStart.pid)) return false;
+  if (snapshot?.config?.parseError) return false;
+  if (!Number(snapshot?.config?.providerCount)) return false;
+
+  const localServer = snapshot?.config?.localServer || {};
+  return Number(runtimeAfterStart.port) === Number(localServer.port)
+    && normalizeHost(runtimeAfterStart.host) === normalizeHost(localServer.host);
+}
+
+async function stopDevRouterAfterExit(server, onError) {
+  if (!server || typeof server.stopRouter !== "function") return;
+
+  try {
+    await server.stopRouter({
+      reason: DEV_ROUTER_STOP_REASON,
+      reclaimPortIfStopped: true
+    });
+  } catch (error) {
+    onError(`Failed stopping the dev router during shutdown: ${error instanceof Error ? error.message : String(error)}`);
+  }
+}
+
+export async function startManagedDevWebConsole(options = {}, deps = {}) {
+  const line = typeof deps.line === "function" ? deps.line : console.log;
+  const error = typeof deps.error === "function" ? deps.error : console.error;
+  const startWebConsoleServerFn = typeof deps.startWebConsoleServer === "function"
+    ? deps.startWebConsoleServer
+    : startWebConsoleServer;
+  const getActiveRuntimeStateFn = typeof deps.getActiveRuntimeState === "function"
+    ? deps.getActiveRuntimeState
+    : getActiveRuntimeState;
+  const reclaimPortFn = typeof deps.reclaimPort === "function"
+    ? deps.reclaimPort
+    : (args) => reclaimPort(args, deps);
+  const serverOptions = {
+    ...options,
+    devMode: true
+  };
+  const runtimeBeforeStart = await getActiveRuntimeStateFn().catch(() => null);
+
+  let server;
+  try {
+    server = await startWebConsoleServerFn(serverOptions);
+  } catch (startError) {
+    if (startError?.code !== "EADDRINUSE") throw startError;
+
+    const reclaimed = await reclaimPortFn({
+      port: serverOptions.port,
+      line,
+      error
+    });
+    if (!reclaimed?.ok) {
+      throw new Error(reclaimed?.errorMessage || `Failed to reclaim port ${serverOptions.port}.`);
+    }
+
+    line(`Port ${serverOptions.port} reclaimed successfully.`);
+    server = await startWebConsoleServerFn(serverOptions);
+  }
+
+  const startupSnapshot = typeof server.getSnapshot === "function"
+    ? await server.getSnapshot().catch(() => null)
+    : null;
+  const runtimeAfterStart = await getActiveRuntimeStateFn().catch(() => null);
+  if (shouldRestartStaleDevRouter(runtimeBeforeStart, runtimeAfterStart, startupSnapshot)
+    && typeof server.restartRouter === "function") {
+    await server.restartRouter(startupSnapshot.config.localServer);
+  }
+
+  let stopRouterPromise = null;
+  const ensureDevRouterStopped = () => {
+    if (stopRouterPromise) return stopRouterPromise;
+    stopRouterPromise = stopDevRouterAfterExit(server, error);
+    return stopRouterPromise;
+  };
+
+  const done = (async () => {
+    let result;
+    try {
+      result = await server.done;
+    } finally {
+      await ensureDevRouterStopped();
+    }
+    return result;
+  })();
+
+  let shutdownPromise = null;
+  const shutdown = async (reason = "dev-console-closed") => {
+    if (shutdownPromise) return shutdownPromise;
+    shutdownPromise = (async () => {
+      await ensureDevRouterStopped();
+      await server.close(reason);
+      return done;
+    })();
+    return shutdownPromise;
+  };
+
+  return {
+    ...server,
+    done,
+    shutdown
+  };
+}
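Both `ensureDevRouterStopped` and `shutdown` above use the same memoized-promise pattern: the first call starts the async teardown and caches the promise, so concurrent or repeated calls share one teardown instead of stopping the router twice. A minimal standalone sketch of that pattern:

```javascript
// Minimal sketch of the memoized-promise teardown: the first call starts
// the async task; later calls return the same cached promise, so the
// task body runs exactly once.
function once(task) {
  let pending = null;
  return () => {
    if (pending) return pending;
    pending = task();
    return pending;
  };
}

let runs = 0;
const stopOnce = once(async () => { runs += 1; });

const first = stopOnce();
const second = stopOnce();
console.log(first === second, runs); // → true 1
```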
@@ -0,0 +1,273 @@
+import path from "node:path";
+import { promises as fs } from "node:fs";
+
+const HUGGING_FACE_API_URL = "https://huggingface.co/api/models";
+const HUGGING_FACE_BASE_URL = "https://huggingface.co";
+const POTENTIAL_MODEL_ARTIFACT_PATTERN = /\.(gguf|safetensors|bin|pth|pt)$/i;
+const DEFAULT_EXPECTED_CONTEXT_WINDOW = 200000;
+
+function normalizeString(value) {
+  return typeof value === "string" ? value.trim() : "";
+}
+
+function normalizePositiveNumber(value) {
+  const parsed = Number(value);
+  if (!Number.isFinite(parsed) || parsed <= 0) return undefined;
+  return parsed;
+}
+
+function parseQuantizationFromFileName(fileName) {
+  const match = String(fileName || "").match(/(UD-[A-Z0-9_]+|IQ\d+_[A-Z]+|Q\d+_[A-Z0-9]+|Q\d+_0|MXFP4_MOE|BF16|F16|F32)/i);
+  return match ? match[1].toUpperCase() : "";
+}
+
+function scoreQuantization(fileName) {
+  const quantization = parseQuantizationFromFileName(fileName);
+  if (!quantization) return 0;
+  if (quantization.startsWith("Q5")) return 6;
+  if (quantization.startsWith("IQ")) return 5;
+  if (quantization === "Q4_K_M" || quantization === "Q4_K_S" || quantization.startsWith("Q4")) return 4;
+  if (quantization.startsWith("Q6")) return 3;
+  if (quantization.startsWith("Q8")) return 2;
+  if (quantization === "BF16" || quantization === "F16" || quantization === "F32") return 1;
+  return 1;
+}
+
+function buildCompatibilityBadges(fileName, fit, recommendation = "") {
+  const badges = [];
+  if (/\.gguf$/i.test(fileName)) badges.push("GGUF");
+  badges.push("llama.cpp");
+  if (fit === "safe") badges.push("Mac OK");
+  else if (fit === "tight") badges.push("Mac Tight");
+  else badges.push("Mac review");
+  if (/best fit/i.test(recommendation)) badges.push("Best fit");
+  return badges;
+}
+
+function isPotentialModelArtifact(fileName) {
+  return POTENTIAL_MODEL_ARTIFACT_PATTERN.test(String(fileName || ""));
+}
+
+function encodePathSegments(rawPath) {
+  return String(rawPath || "")
+    .split("/")
+    .filter(Boolean)
+    .map((segment) => encodeURIComponent(segment))
+    .join("/");
+}
+
+function extractHuggingFaceFiles(models = []) {
+  const files = [];
+  for (const model of Array.isArray(models) ? models : []) {
+    const repo = normalizeString(model?.id || model?.modelId);
+    if (!repo) continue;
+    for (const sibling of Array.isArray(model?.siblings) ? model.siblings : []) {
+      const file = normalizeString(sibling?.rfilename);
+      if (!file || !isPotentialModelArtifact(file)) continue;
+      files.push({
+        repo,
+        file,
+        size: normalizePositiveNumber(sibling?.size) ?? normalizePositiveNumber(sibling?.lfs?.size),
+        downloads: normalizePositiveNumber(model?.downloads) || 0,
+        likes: normalizePositiveNumber(model?.likes) || 0,
+        gguf: model?.gguf || undefined,
+        private: model?.private === true,
+        gated: model?.gated === true
+      });
+    }
+  }
+  return files;
+}
+
+export function classifyGgufCandidateForMac(candidate, { totalMemoryBytes } = {}) {
+  const fileName = normalizeString(candidate?.file || candidate?.rfilename);
+  const sizeBytes = normalizePositiveNumber(candidate?.sizeBytes ?? candidate?.size);
+  const expectedContextWindow = normalizePositiveNumber(candidate?.expectedContextWindow) || DEFAULT_EXPECTED_CONTEXT_WINDOW;
+
+  if (!/\.gguf$/i.test(fileName)) {
+    return {
+      fit: "unsupported",
+      disabled: true,
+      reason: "Not a GGUF file",
+      recommendation: "Unsupported for llama.cpp in v1."
+    };
+  }
+
+  if (sizeBytes && totalMemoryBytes && sizeBytes > Number(totalMemoryBytes) * 0.85) {
+    return {
+      fit: "over-budget",
+      disabled: true,
+      reason: "Too large for this Mac",
+      recommendation: "Skip this one on a 64 GB Mac."
+    };
+  }
+
+  if (!sizeBytes || !totalMemoryBytes) {
+    return {
+      fit: "unknown",
+      disabled: false,
+      reason: "",
+      recommendation: "Review memory fit manually before download."
+    };
+  }
+
+  const memoryRatio = sizeBytes / Number(totalMemoryBytes);
+  const quantScore = scoreQuantization(fileName);
+
+  if (expectedContextWindow >= 200000 && memoryRatio >= 0.5) {
+    return {
+      fit: "tight",
+      disabled: false,
+      reason: "200K context will be tight on this Mac",
+      recommendation: quantScore >= 2
+        ? "200K context needs review on a 64 GB Mac."
+        : "Large context and heavy quantization choice need review."
+    };
+  }
+
+  if (memoryRatio >= 0.4) {
+    return {
+      fit: "tight",
+      disabled: false,
+      reason: "Fits, but leaves limited unified memory headroom",
+      recommendation: "Reasonable fit, but memory headroom will be tight."
+    };
+  }
+
+  return {
+    fit: "safe",
+    disabled: false,
+    reason: "",
+    recommendation: quantScore >= 4
+      ? "Best fit for a 64 GB Mac and long-context testing."
+      : "Fits this Mac comfortably."
+  };
+}
+
+export function shapeHuggingFaceGgufResults(files, systemInfo = {}) {
+  const results = (Array.isArray(files) ? files : []).map((entry) => {
+    const file = normalizeString(entry?.file || entry?.rfilename);
+    const sizeBytes = normalizePositiveNumber(entry?.sizeBytes ?? entry?.size);
+    const status = classifyGgufCandidateForMac({
+      file,
+      sizeBytes,
+      expectedContextWindow: systemInfo?.expectedContextWindow
+    }, systemInfo);
+    const quantization = parseQuantizationFromFileName(file);
+    const fitScore = status.fit === "safe" ? 30 : status.fit === "tight" ? 15 : status.fit === "unknown" ? 8 : -20;
+    const rankingScore = fitScore
+      + (status.disabled ? -100 : 0)
+      + (scoreQuantization(file) * 10)
+      + Math.min(15, Math.log10(Number(entry?.downloads || 0) + 1) * 4)
+      + Math.min(8, Math.log10(Number(entry?.likes || 0) + 1) * 3)
+      - Math.min(12, (sizeBytes || 0) / (1024 ** 3));
+    return {
+      repo: normalizeString(entry?.repo || entry?.id || entry?.modelId),
+      file,
+      quantization,
+      sizeBytes,
+      disabled: status.disabled,
+      disabledReason: status.reason,
+      fit: status.fit,
+      recommendation: status.recommendation,
+      badges: buildCompatibilityBadges(file, status.fit, status.recommendation),
+      rankingScore
+    };
+  });
+
+  return results.sort((left, right) => {
+    if (right.rankingScore !== left.rankingScore) return right.rankingScore - left.rankingScore;
+    return String(left.file || "").localeCompare(String(right.file || ""));
+  });
+}
+
+export async function searchHuggingFaceGgufCandidates(query, {
+  limit = 20,
+  totalMemoryBytes,
+  expectedContextWindow = DEFAULT_EXPECTED_CONTEXT_WINDOW,
+  fetchImpl = fetch
+} = {}) {
+  const search = normalizeString(query);
+  const url = new URL(HUGGING_FACE_API_URL);
+  if (search) url.searchParams.set("search", search);
+  url.searchParams.set("limit", String(Math.max(1, Math.min(50, Number(limit) || 20))));
+  for (const field of ["siblings", "gguf", "downloads", "likes", "gated", "private"]) {
+    url.searchParams.append("expand[]", field);
+  }
+
+  const response = await fetchImpl(url, {
+    headers: {
+      accept: "application/json"
+    }
+  });
+  if (!response.ok) {
+    throw new Error(`Hugging Face search failed (${response.status}).`);
+  }
+
+  const payload = await response.json();
+  return shapeHuggingFaceGgufResults(
+    extractHuggingFaceFiles(payload),
+    { totalMemoryBytes, expectedContextWindow }
+  );
+}
+
+export function buildHuggingFaceFileDownloadUrl(repo, file) {
+  const normalizedRepo = encodePathSegments(repo);
+  const normalizedFile = encodePathSegments(file);
+  return `${HUGGING_FACE_BASE_URL}/${normalizedRepo}/resolve/main/${normalizedFile}?download=true`;
+}
+
+export async function downloadManagedHuggingFaceGguf({
+  repo,
+  file,
+  destinationPath
+} = {}, {
+  fetchImpl = fetch,
+  onProgress = () => {}
+} = {}) {
+  const targetRepo = normalizeString(repo);
+  const targetFile = normalizeString(file);
+  const outputPath = normalizeString(destinationPath);
+  if (!targetRepo || !targetFile || !outputPath) {
+    throw new Error("repo, file, and destinationPath are required.");
+  }
+
+  const url = buildHuggingFaceFileDownloadUrl(targetRepo, targetFile);
+  const response = await fetchImpl(url, {
+    headers: {
+      accept: "application/octet-stream"
+    }
+  });
+  if (!response.ok || !response.body) {
+    throw new Error(`Hugging Face download failed (${response.status}).`);
+  }
+
+  await fs.mkdir(path.dirname(outputPath), { recursive: true });
+  const tempPath = `${outputPath}.part`;
+  const fileHandle = await fs.open(tempPath, "w");
+  const totalBytes = normalizePositiveNumber(response.headers.get("content-length"));
+  let receivedBytes = 0;
+
+  try {
+    const reader = response.body.getReader();
+    while (true) {
+      const { value, done } = await reader.read();
+      if (done) break;
+      const chunk = value || new Uint8Array();
+      if (chunk.byteLength > 0) {
+        await fileHandle.write(chunk);
+        receivedBytes += chunk.byteLength;
+        onProgress({ receivedBytes, totalBytes });
+      }
+    }
+  } finally {
+    await fileHandle.close();
+  }
+
+  await fs.rename(tempPath, outputPath);
+  return {
+    filePath: outputPath,
+    sizeBytes: receivedBytes || totalBytes || undefined,
+    downloadUrl: url
+  };
+}
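The ranking in `shapeHuggingFaceGgufResults` keys off the quantization tag read straight from the GGUF file name. The parser, reproduced standalone for illustration (same regex as `parseQuantizationFromFileName` above):

```javascript
// The quantization tag (IQ4_XS, Q8_0, BF16, ...) is matched directly
// against the file name and upper-cased; unrecognized names yield "".
function parseQuant(fileName) {
  const match = String(fileName || "").match(/(UD-[A-Z0-9_]+|IQ\d+_[A-Z]+|Q\d+_[A-Z0-9]+|Q\d+_0|MXFP4_MOE|BF16|F16|F32)/i);
  return match ? match[1].toUpperCase() : "";
}

console.log(parseQuant("model-IQ4_XS.gguf")); // → "IQ4_XS"
console.log(parseQuant("model-Q8_0.gguf"));   // → "Q8_0"
console.log(parseQuant("model-bf16.gguf"));   // → "BF16"
```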