npm - llm-checker - Versions diffs - 3.7.0 → 3.7.4 - Mend

llm-checker 3.7.0 → 3.7.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.md +31 -1
package/bin/enhanced_cli.js +46 -0
package/bin/mcp-server.mjs +5 -0
package/package.json +1 -1
package/src/data/model-database.js +3 -1
package/src/data/registry-ingestors.js +20 -6
package/src/data/registry-recommender.js +122 -4
package/src/data/seed/models.db +0 -0
package/src/models/deterministic-selector.js +69 -36

package/README.md CHANGED

@@ -5,7 +5,7 @@
 **Intelligent Ollama Model Selector**
 AI-powered CLI that analyzes your hardware and recommends optimal LLM models.
-Deterministic scoring across **200+ Ollama models** and **7k+ variants** with a packaged SQLite catalog, live sync, and hardware-calibrated memory estimation.
+Deterministic scoring across a packaged **multi-source registry** (Hugging Face + Ollama + GPT4All, **33k+ exact artifacts**) and the Ollama catalog, with live sync, runtime targeting, and hardware-calibrated memory estimation.
 [![npm version](https://img.shields.io/npm/v/llm-checker?style=flat-square&color=0066FF)](https://www.npmjs.com/package/llm-checker)
 [![npm downloads](https://img.shields.io/npm/dm/llm-checker?style=flat-square&color=0066FF)](https://www.npmjs.com/package/llm-checker)
@@ -39,6 +39,7 @@ Choosing the right LLM for your hardware is complex. With thousands of model var
 | | Feature | Description |
 |:---:|---|---|
 | **200+** | Packaged Model Catalog | Ships with a synced Ollama SQLite catalog and can refresh from Ollama on demand |
+| **33k+** | Multi-Source Registry | Exact installable/downloadable artifacts from Hugging Face, Ollama, and GPT4All with per-source commands and runtime targeting |
 | **4D** | Scoring Engine | Quality, Speed, Fit, Context &mdash; weighted by use case |
 | **Multi-GPU** | Hardware Detection | Apple Silicon, NVIDIA CUDA, AMD ROCm, Intel Arc, CPU, integrated/dedicated inventory visibility |
 | **Calibrated** | Memory Estimation | Bytes-per-parameter formula validated against real Ollama sizes |
@@ -151,6 +152,14 @@ hash -r
 llm-checker --version
 ```
+### v3.7.0 Highlights
+- New **multi-source model registry**: a packaged snapshot of ~33,700 exact installable/downloadable artifacts from Hugging Face, Ollama, and GPT4All, with per-source commands (`hf download ...`, `ollama pull ...`).
+- `recommend` and `check` now draw candidates from the registry through one canonical deterministic scoring core, with `--runtime auto/ollama/vllm/mlx/llama.cpp/transformers` targeting; they fall back to the Ollama catalog when the registry is unavailable.
+- New `registry-sync`, `registry-search`, and `registry-recommend` commands.
+- Mixture-of-Experts models are sized by their **total** parameter count (all experts stay resident under Ollama/Metal/vLLM), so a large MoE can no longer falsely "fit" small hardware.
+- Carries the 3.6.1 batch: unified scoring across `check`/`recommend`/`smart-recommend` (#88), high-end/multi-GPU VRAM detection (#95), MCP server hardening (#97), and the Windows interactive-panel fixes (#86).
 ### v3.5.13 Highlights
 - Ships npm packages with a ready-to-use SQLite model catalog:
@@ -389,6 +398,27 @@ llm-checker search "qwen coder" --json
 | `search <query>` | Search the synced catalog with filters and intelligent scoring |
 | `smart-recommend` | Advanced recommendations using the full scoring engine |
+### Model Registry Commands (v3.7.0+)
+Exact installable/downloadable artifacts from a packaged multi-source registry (Hugging Face + Ollama + GPT4All).
+| Command | Description |
+|---------|-------------|
+| `registry-sync` | Sync the multi-source registry (Hugging Face, Ollama, GPT4All) |
+| `registry-search [query]` | Search exact artifacts with `--source`, `--format`, `--runtime`, `--quant`, `--max-size`, `--min-params`/`--max-params` filters |
+| `registry-recommend [query]` | Recommend the best exact artifacts for your hardware, with `--runtime auto/ollama/vllm/mlx/llama.cpp/transformers` targeting and `--category`/`--optimize` |
+```bash
+# Best coding artifacts across all sources, auto runtime
+llm-checker registry-recommend --category coding
+# Only Apple-native MLX artifacts
+llm-checker registry-recommend --category coding --runtime mlx
+# Search Hugging Face for vLLM-ready reasoning models under 24B
+llm-checker registry-search qwen --source huggingface --runtime vllm --max-params 24
+```
 ### Enterprise Policy Commands
 | Command | Description |

package/bin/enhanced_cli.js CHANGED

@@ -410,6 +410,30 @@ function parsePositiveNumberOption(value, fallback = null) {
     return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
 }
+// Allowed enum values for the registry commands. Invalid values must be rejected
+// with a clear error instead of silently returning "no results" or falling back
+// to the built-in catalog.
+const REGISTRY_SOURCES = ['ollama', 'huggingface', 'gpt4all'];
+const REGISTRY_FORMATS = ['gguf', 'safetensors', 'mlx', 'ollama', 'pytorch', 'pytorch_bin', 'ggml'];
+const REGISTRY_RUNTIMES = ['auto', 'all', '*', 'ollama', 'llama.cpp', 'transformers', 'vllm', 'mlx'];
+const REGISTRY_OPTIMIZE = ['balanced', 'speed', 'quality', 'context', 'coding'];
+function assertRegistryEnum(label, value, allowed) {
+    if (value === undefined || value === null || value === '') return;
+    if (!allowed.includes(String(value).toLowerCase())) {
+        const shown = allowed.filter((v) => !['all', '*'].includes(v)).join(', ');
+        throw new Error(`Invalid --${label} "${value}". Allowed: ${shown}`);
+    }
+}
+// Throws on the first invalid registry enum option. Returns nothing on success.
+function validateRegistryFilters(options = {}) {
+    assertRegistryEnum('source', options.source, REGISTRY_SOURCES);
+    assertRegistryEnum('format', options.format, REGISTRY_FORMATS);
+    assertRegistryEnum('runtime', options.runtime, REGISTRY_RUNTIMES);
+    assertRegistryEnum('optimize', options.optimize, REGISTRY_OPTIMIZE);
+}
 function truncateMiddle(value, maxLength = 48) {
     const text = String(value || '');
     if (text.length <= maxLength) return text;
@@ -4886,6 +4910,17 @@ program
     .option('-l, --limit <n>', 'Maximum number of results', '20')
     .option('-j, --json', 'Output as JSON')
     .action(async (query = '', options) => {
+        try {
+            validateRegistryFilters(options);
+        } catch (validationError) {
+            if (options.json) {
+                console.log(JSON.stringify({ error: validationError.message }, null, 2));
+            } else {
+                console.error(chalk.red(`✗ ${validationError.message}`));
+            }
+            process.exitCode = 1;
+            return;
+        }
         if (!options.json) showAsciiArt('registry-search');
         const ModelDatabase = require('../src/data/model-database');
@@ -4993,6 +5028,17 @@ program
     .option('-l, --limit <n>', 'Maximum number of recommendations', '10')
     .option('-j, --json', 'Output as JSON')
     .action(async (query = '', options) => {
+        try {
+            validateRegistryFilters(options);
+        } catch (validationError) {
+            if (options.json) {
+                console.log(JSON.stringify({ error: validationError.message }, null, 2));
+            } else {
+                console.error(chalk.red(`✗ ${validationError.message}`));
+            }
+            process.exitCode = 1;
+            return;
+        }
         if (!options.json) showAsciiArt('registry-recommend');
         const UnifiedDetector = require('../src/hardware/unified-detector');
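
The validation added above can be exercised in isolation. The following is a minimal sketch, not the package's test code: `assertRegistryEnum` and `REGISTRY_RUNTIMES` are copied from the hunk, while `tryValidate` is a hypothetical helper introduced only to capture the error message instead of throwing.

```javascript
// Copied from the hunk above so the sketch runs on its own.
const REGISTRY_RUNTIMES = ['auto', 'all', '*', 'ollama', 'llama.cpp', 'transformers', 'vllm', 'mlx'];

function assertRegistryEnum(label, value, allowed) {
    if (value === undefined || value === null || value === '') return;
    if (!allowed.includes(String(value).toLowerCase())) {
        const shown = allowed.filter((v) => !['all', '*'].includes(v)).join(', ');
        throw new Error(`Invalid --${label} "${value}". Allowed: ${shown}`);
    }
}

// Hypothetical helper (not in the package): return the error message, or null when valid.
function tryValidate(label, value, allowed) {
    try {
        assertRegistryEnum(label, value, allowed);
        return null;
    } catch (err) {
        return err.message;
    }
}

const okMixedCase = tryValidate('runtime', 'MLX', REGISTRY_RUNTIMES);     // matched case-insensitively
const okOmitted = tryValidate('runtime', undefined, REGISTRY_RUNTIMES);   // unset options are skipped
const okWildcard = tryValidate('runtime', '*', REGISTRY_RUNTIMES);        // accepted, but hidden from the message
const rejected = tryValidate('runtime', 'tensorrt', REGISTRY_RUNTIMES);   // unknown value is rejected

console.log({ okMixedCase, okOmitted, okWildcard, rejected });
```

The `all` and `*` aliases pass validation but are filtered out of the "Allowed:" list, so the error text only advertises concrete runtimes.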

package/bin/mcp-server.mjs CHANGED

@@ -290,9 +290,14 @@ const ALLOWED_CLI_COMMANDS = new Set([
   "sync",
   "search",
   "smart-recommend",
+  "registry-sync",
+  "registry-search",
+  "registry-recommend",
   "hw-detect",
 ]);
+export { ALLOWED_CLI_COMMANDS };
 // ============================================================================
 // MCP SERVER
 // ============================================================================

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "llm-checker",
-  "version": "3.7.0",
+  "version": "3.7.4",
   "description": "Intelligent CLI tool with AI-powered model selection that analyzes your hardware and recommends optimal LLM models for your system",
   "bin": {
     "llm-checker": "bin/cli.js",

package/src/data/model-database.js CHANGED

@@ -227,9 +227,11 @@ class ModelDatabase {
             CREATE INDEX IF NOT EXISTS idx_model_artifacts_source ON model_artifacts(source_id);
             CREATE INDEX IF NOT EXISTS idx_model_artifacts_format ON model_artifacts(format);
             CREATE INDEX IF NOT EXISTS idx_model_artifacts_quant ON model_artifacts(quantization);
-            CREATE INDEX IF NOT EXISTS idx_model_artifacts_runtime ON model_artifacts(runtime_support);
             CREATE INDEX IF NOT EXISTS idx_model_artifacts_size ON model_artifacts(size_gb);
             CREATE INDEX IF NOT EXISTS idx_model_artifacts_downloads ON model_artifacts(downloads DESC);
+            -- Drop a dead index from older DBs: runtime_support is a JSON blob only
+            -- queried with LIKE, so a B-tree index on it is never used.
+            DROP INDEX IF EXISTS idx_model_artifacts_runtime;
         `;
         if (this.useBetterSqlite) {

package/src/data/registry-ingestors.js CHANGED

@@ -146,8 +146,10 @@ function parseActiveParamsB(...values) {
 function inferQuantization(...values) {
     const text = values.map((value) => String(value || '')).join(' ');
-    const ggufQuant = text.match(/\b(IQ\d(?:_[A-Z0-9]+)?|Q\d(?:_[A-Z0-9]+){0,2}|F16|FP16|BF16|Q8_0)\b/i);
-    if (ggufQuant) return ggufQuant[1].toUpperCase().replace(/^F16$/, 'FP16');
+    // Note: F16/FP16/BF16 are PRECISIONS, not quantizations — they're handled by
+    // inferPrecision so a full-precision model isn't mislabeled as "quantized".
+    const ggufQuant = text.match(/\b(IQ\d(?:_[A-Z0-9]+)?|Q\d(?:_[A-Z0-9]+){0,2}|Q8_0)\b/i);
+    if (ggufQuant) return ggufQuant[1].toUpperCase();
     const bitQuant = text.match(/\b([234568])\s*[-_ ]?bit\b/i);
     if (bitQuant) return `${bitQuant[1]}bit`;
@@ -271,9 +273,16 @@ function getSiblingSizeBytes(sibling = {}) {
 function isModelArtifactFile(filename) {
     const lower = String(filename || '').toLowerCase();
     if (!lower) return false;
+    // Exclude non-model weight files that would otherwise be ingested as standalone
+    // "models": LoRA/PEFT adapters (a few MB but inherit the repo's param count) and
+    // optimizer/training state.
+    if (/(^|[/_-])adapter[_-]?(model|config)/.test(lower)) return false;
+    if (/(^|[/_-])(lora|optimizer|scheduler|rng_state|trainer_state|training_args)/.test(lower)) return false;
     if (lower.endsWith('.gguf')) return true;
     if (lower.endsWith('.safetensors')) return true;
-    if (/pytorch_model.*\.bin$/.test(lower)) return true;
+    if (/pytorch_model.*\.(bin)$/.test(lower)) return true;
+    // Mistral-style consolidated weights (consolidated.00.pth) were being dropped.
+    if (/(^|[/])consolidated.*\.(pt|pth|bin)$/.test(lower)) return true;
     if (/model.*\.(bin|pt|pth)$/.test(lower)) return true;
     if (/ggml.*\.bin$/.test(lower)) return true;
     return false;
@@ -398,11 +407,15 @@ function normalizeGpt4AllEntry(entry) {
     const repoMatch = url.match(/huggingface\.co\/([^/]+\/[^/]+)\/resolve\/([^/]+)\/(.+)$/);
     const repoId = repoMatch ? repoMatch[1] : `gpt4all/${name}`;
+    // When the download points at a Hugging Face repo, use that repo id as the
+    // canonical model id so the same model lines up across sources for dedup.
+    const canonicalModelId = repoMatch ? repoMatch[1] : name;
     const filename = repoMatch ? decodeURIComponent(repoMatch[3]) : (filenameCandidate || url.split('/').filter(Boolean).pop());
     const repoKey = makeScopedId('gpt4all', repoId);
     const tags = ['gpt4all', entry.type, entry.quant].filter(Boolean);
     const paramsB = parseParamsB(entry.parameters, name, filename);
-    const sizeBytes = Number(entry.filesize || entry.fileSize || entry.size || 0) || null;
+    // Sizes can arrive as comma-formatted strings ("8,000,000,000"); strip non-digits.
+    const sizeBytes = Number(String(entry.filesize ?? entry.fileSize ?? entry.size ?? 0).replace(/[^0-9.]/g, '')) || null;
     const format = inferFormat(filename, tags);
     return {
@@ -412,7 +425,7 @@ function normalizeGpt4AllEntry(entry) {
             source_id: 'gpt4all',
             repo_id: repoId,
             namespace: repoId.includes('/') ? repoId.split('/')[0] : 'gpt4all',
-            canonical_model_id: name,
+            canonical_model_id: canonicalModelId,
             display_name: name,
             url: repoMatch ? `https://huggingface.co/${repoId}` : url,
             license: entry.license || 'unknown',
@@ -434,7 +447,7 @@ function normalizeGpt4AllEntry(entry) {
             source_id: 'gpt4all',
             repo_key: repoKey,
             repo_id: repoId,
-            canonical_model_id: name,
+            canonical_model_id: canonicalModelId,
             artifact_name: filename || name,
             filename: filename || '',
             format,
@@ -746,6 +759,7 @@ module.exports = {
     inferFormat,
     inferQuantization,
     inferRuntimeSupport,
+    isModelArtifactFile,
     parseParamsB,
     buildHuggingFaceDownloadUrl
 };
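
To see what the updated file filter and size parsing accept, here is a small sketch under stated assumptions. `isModelArtifactFile` is copied from the new side of the hunk above; `parseSizeBytes` is a hypothetical wrapper around the new `sizeBytes` expression, and the sample filenames are invented for illustration.

```javascript
// Copied from the updated hunk so the sketch runs standalone.
function isModelArtifactFile(filename) {
    const lower = String(filename || '').toLowerCase();
    if (!lower) return false;
    // LoRA/PEFT adapters and optimizer/training state are excluded first.
    if (/(^|[/_-])adapter[_-]?(model|config)/.test(lower)) return false;
    if (/(^|[/_-])(lora|optimizer|scheduler|rng_state|trainer_state|training_args)/.test(lower)) return false;
    if (lower.endsWith('.gguf')) return true;
    if (lower.endsWith('.safetensors')) return true;
    if (/pytorch_model.*\.(bin)$/.test(lower)) return true;
    if (/(^|[/])consolidated.*\.(pt|pth|bin)$/.test(lower)) return true;
    if (/model.*\.(bin|pt|pth)$/.test(lower)) return true;
    if (/ggml.*\.bin$/.test(lower)) return true;
    return false;
}

// Hypothetical helper mirroring the new sizeBytes expression in normalizeGpt4AllEntry.
function parseSizeBytes(entry) {
    const raw = entry.filesize ?? entry.fileSize ?? entry.size ?? 0;
    return Number(String(raw).replace(/[^0-9.]/g, '')) || null;
}

// Invented filenames, chosen to hit each branch.
const verdicts = {
    adapter: isModelArtifactFile('weights/adapter_model.safetensors'), // false: adapter file
    optimizer: isModelArtifactFile('checkpoint/optimizer.pt'),         // false: training state
    consolidated: isModelArtifactFile('consolidated.00.pth'),          // true: consolidated weights
    shard: isModelArtifactFile('model-00001-of-00002.safetensors'),    // true: safetensors shard
    notes: isModelArtifactFile('notes.txt')                            // false: not a weight file
};

const commaSize = parseSizeBytes({ filesize: '8,000,000,000' });       // 8000000000
const missingSize = parseSizeBytes({});                                // null

console.log(verdicts, commaSize, missingSize);
```

The exclusion checks run before the extension checks, which is why an adapter file ending in `.safetensors` is still rejected.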

package/src/data/registry-recommender.js CHANGED

@@ -173,7 +173,12 @@ function artifactToSelectorModel(row) {
         .filter(Boolean)
         .map((tag) => String(tag).toLowerCase());
-    const sizeGB = Number(row.size_gb);
+    // A sharded weight file's size is only ONE shard, not the whole model. Don't
+    // let it stand in for the model's memory (that made a 56B model look like
+    // ~4.6GB and "fit" tiny hardware); leave size unset so memory estimates from
+    // the (total) parameter count instead.
+    const rawSizeGB = Number(row.size_gb);
+    const sizeGB = (!shardedFile && Number.isFinite(rawSizeGB) && rawSizeGB > 0) ? rawSizeGB : NaN;
     const sizeByQuant = Number.isFinite(sizeGB) && sizeGB > 0
         ? { [quant]: sizeGB }
         : {};
@@ -240,6 +245,84 @@ function dedupeRecommendationPool(models) {
     return [...deduped.values()];
 }
+// A source may trail the top score by up to this and still earn a guaranteed slot.
+const SOURCE_DIVERSITY_MARGIN = 15;
+// Never surface a model below this score purely for source diversity.
+const SOURCE_DIVERSITY_FLOOR = 55;
+// Group key that ignores quantization / shard / tag so variants of the SAME
+// model collapse together (e.g. all `qwen2.5-coder:7b-*` quants, or every
+// `layers-N.safetensors` shard of one HF repo).
+function modelDiversityKey(candidate) {
+    const meta = (candidate && candidate.meta) || {};
+    const name = String(meta.name || meta.model_identifier || '')
+        .toLowerCase()
+        .replace(/:.*$/, '')   // drop an ollama :tag
+        .replace(/\s+/g, ' ')
+        .trim();
+    const p = Number(meta.paramsB);
+    if (Number.isFinite(p) && p > 0) {
+        return `${name}|${Math.round(p * 10) / 10}`;
+    }
+    // Params unknown: do NOT bucket every unknown-size model of the same name
+    // together (that silently drops distinct models / sources). Keep them apart by
+    // source + identifier.
+    const src = String(meta.source || '').toLowerCase();
+    const id = String(meta.model_identifier || meta.name || '').toLowerCase();
+    return `${name}|na|${src}|${id}`;
+}
+// Collapse quant/shard/tag variants of the same model to a single best-scoring
+// entry, so the top picks are DISTINCT models instead of 12 quants of one.
+function collapseToDistinctModels(candidates) {
+    const best = new Map();
+    for (const c of Array.isArray(candidates) ? candidates : []) {
+        if (!c) continue;
+        const key = modelDiversityKey(c);
+        const cur = best.get(key);
+        if (!cur || (Number(c.score) || 0) > (Number(cur.score) || 0)) best.set(key, c);
+    }
+    return [...best.values()].sort((a, b) => (Number(b.score) || 0) - (Number(a.score) || 0));
+}
+// Guarantee that each source with a competitive candidate appears in the top
+// `limit`, so Hugging Face / GPT4All artifacts are visible when they score close
+// to Ollama. Diversity never promotes a clearly worse model (floor + margin gates).
+function applySourceDiversity(distinctSorted, limit) {
+    const list = Array.isArray(distinctSorted) ? distinctSorted : [];
+    if (list.length === 0) return [];
+    const max = Number(limit) > 0 ? Number(limit) : 10;
+    if (list.length <= max) return list.slice(0, max);
+    const topScore = Number(list[0].score) || 0;
+    // Reserve most slots for the genuine best-by-score so diversity can never
+    // displace several real top picks for several obscure sources. Only the tail
+    // (~40% of slots) is used to surface competitive alternate sources.
+    const guaranteed = Math.max(1, Math.ceil(max * 0.6));
+    const result = list.slice(0, guaranteed);
+    const chosen = new Set(result);
+    const present = new Set(result.map((c) => (c.meta && c.meta.source) || 'unknown'));
+    while (result.length < max) {
+        // Prefer the best candidate from a not-yet-shown source that is still
+        // competitive (within margin + above floor); otherwise the next best overall.
+        let pick = list.find((c) => {
+            if (chosen.has(c)) return false;
+            const src = (c.meta && c.meta.source) || 'unknown';
+            const score = Number(c.score) || 0;
+            return !present.has(src) && score >= SOURCE_DIVERSITY_FLOOR && score >= topScore - SOURCE_DIVERSITY_MARGIN;
+        });
+        if (!pick) pick = list.find((c) => !chosen.has(c));
+        if (!pick) break;
+        result.push(pick);
+        chosen.add(pick);
+        present.add((pick.meta && pick.meta.source) || 'unknown');
+    }
+    return result
+        .sort((a, b) => (Number(b.score) || 0) - (Number(a.score) || 0))
+        .slice(0, max);
+}
 function candidateToRecommendation(candidate) {
     const artifact = candidate.meta.artifact || {};
     return {
@@ -360,9 +443,32 @@ class RegistryRecommender {
         const selectorHardware = normalizeHardwareForSelector(options.hardware || {});
         const normalizedRuntime = runtimeFilter || 'auto';
+        // No registry artifacts matched the filters: return an empty result rather
+        // than letting the deterministic selector silently substitute its built-in
+        // catalog (which would mislabel non-registry models as "registry" rows).
+        if (modelPool.length === 0) {
+            return {
+                category,
+                runtime: normalizedRuntime,
+                rows,
+                modelPool,
+                result: {
+                    category,
+                    optimizeFor: this.selector.normalizeOptimizationObjective(options.optimizeFor || 'balanced'),
+                    runtime: normalizedRuntime,
+                    candidates: [],
+                    total_evaluated: 0,
+                    timestamp: new Date().toISOString()
+                }
+            };
+        }
+        // Rank a wider window than requested so we can collapse model variants and
+        // apply source diversity before trimming to the caller's limit.
+        const rankWindow = Math.max(limit * 8, 200);
         const result = runtimeFilter
             ? await this.selector.selectModels(category, {
-                topN: limit,
+                topN: rankWindow,
                 enableProbe: false,
                 silent: true,
                 optimizeFor: options.optimizeFor || 'balanced',
@@ -374,13 +480,20 @@ class RegistryRecommender {
             })
             : this.scoreAutoRuntimePool({
                 category,
-                limit,
+                limit: rankWindow,
                 targetCtx,
                 optimizeFor: options.optimizeFor || 'balanced',
                 hardware: selectorHardware,
                 modelPool
             });
+        // Collapse quant/shard variants to distinct models, then guarantee source
+        // diversity, and finally trim to the requested limit.
+        if (result && Array.isArray(result.candidates)) {
+            const distinct = collapseToDistinctModels(result.candidates);
+            result.candidates = applySourceDiversity(distinct, limit);
+        }
         return {
             category,
             runtime: normalizedRuntime,
@@ -493,7 +606,9 @@ class RegistryRecommender {
             optimizeFor: objective,
             runtime: 'auto',
             hardware: normalizedHardware,
-            candidates: candidates.slice(0, limit),
+            // Return a wide sorted window; selectCategory collapses variants and
+            // applies source diversity before trimming to the caller's limit.
+            candidates: candidates.slice(0, Math.max(limit, 2000)),
             total_evaluated: filtered.length,
             timestamp: new Date().toISOString()
         };
@@ -506,6 +621,9 @@ class RegistryRecommender {
 module.exports = {
     RegistryRecommender,
+    collapseToDistinctModels,
+    applySourceDiversity,
+    modelDiversityKey,
     artifactToSelectorModel,
     candidateToRecommendation,
     normalizeHardwareForSelector,
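
The collapse-then-diversify flow added in this file can be traced on a toy pool. This is a sketch rather than the package's tests: the three functions are condensed from the hunks above (comments trimmed, logic unchanged as far as the diff shows), and every model name and score in `pool` is made up.

```javascript
const SOURCE_DIVERSITY_MARGIN = 15;
const SOURCE_DIVERSITY_FLOOR = 55;

function modelDiversityKey(candidate) {
    const meta = (candidate && candidate.meta) || {};
    const name = String(meta.name || meta.model_identifier || '')
        .toLowerCase().replace(/:.*$/, '').replace(/\s+/g, ' ').trim();
    const p = Number(meta.paramsB);
    if (Number.isFinite(p) && p > 0) return `${name}|${Math.round(p * 10) / 10}`;
    const src = String(meta.source || '').toLowerCase();
    const id = String(meta.model_identifier || meta.name || '').toLowerCase();
    return `${name}|na|${src}|${id}`;
}

function collapseToDistinctModels(candidates) {
    const best = new Map();
    for (const c of Array.isArray(candidates) ? candidates : []) {
        if (!c) continue;
        const key = modelDiversityKey(c);
        const cur = best.get(key);
        if (!cur || (Number(c.score) || 0) > (Number(cur.score) || 0)) best.set(key, c);
    }
    return [...best.values()].sort((a, b) => (Number(b.score) || 0) - (Number(a.score) || 0));
}

function applySourceDiversity(distinctSorted, limit) {
    const list = Array.isArray(distinctSorted) ? distinctSorted : [];
    if (list.length === 0) return [];
    const max = Number(limit) > 0 ? Number(limit) : 10;
    if (list.length <= max) return list.slice(0, max);
    const topScore = Number(list[0].score) || 0;
    const guaranteed = Math.max(1, Math.ceil(max * 0.6));
    const result = list.slice(0, guaranteed);
    const chosen = new Set(result);
    const present = new Set(result.map((c) => (c.meta && c.meta.source) || 'unknown'));
    while (result.length < max) {
        let pick = list.find((c) => {
            if (chosen.has(c)) return false;
            const src = (c.meta && c.meta.source) || 'unknown';
            const score = Number(c.score) || 0;
            return !present.has(src) && score >= SOURCE_DIVERSITY_FLOOR && score >= topScore - SOURCE_DIVERSITY_MARGIN;
        });
        if (!pick) pick = list.find((c) => !chosen.has(c));
        if (!pick) break;
        result.push(pick);
        chosen.add(pick);
        present.add((pick.meta && pick.meta.source) || 'unknown');
    }
    return result.sort((a, b) => (Number(b.score) || 0) - (Number(a.score) || 0)).slice(0, max);
}

// Invented candidates: two quants of one model, three more Ollama models, and one each
// from Hugging Face and GPT4All.
const pool = [
    { score: 90, meta: { name: 'alpha:7b-q4', paramsB: 7, source: 'ollama' } },
    { score: 89, meta: { name: 'alpha:7b-q8', paramsB: 7, source: 'ollama' } },
    { score: 88, meta: { name: 'bravo:8b', paramsB: 8, source: 'ollama' } },
    { score: 86, meta: { name: 'charlie:7b', paramsB: 7, source: 'ollama' } },
    { score: 84, meta: { name: 'delta:7b', paramsB: 7, source: 'ollama' } },
    { score: 80, meta: { name: 'org/echo-7b', paramsB: 7, source: 'huggingface' } },
    { score: 60, meta: { name: 'foxtrot-7b', paramsB: 7, source: 'gpt4all' } }
];

const distinct = collapseToDistinctModels(pool);
const top4 = applySourceDiversity(distinct, 4).map((c) => c.meta.name);
console.log(top4);
```

With a limit of 4, the first three slots go to the best scores outright. The last slot goes to `org/echo-7b` (within 15 points of the top score and above the floor of 55) instead of the higher-scoring `delta:7b`, because no Hugging Face entry was shown yet. The GPT4All entry at 60 falls outside the margin and is not promoted.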

package/src/data/seed/models.db CHANGED

Binary file

package/src/models/deterministic-selector.js CHANGED

@@ -243,13 +243,12 @@ class DeterministicModelSelector {
             directVRAM ??
             0;
-        // Multi-GPU fallback when only per-GPU memory is known.
-        if (!explicitTotalVRAM && gpuCount > 1) {
-            if (vramPerGPU) {
-                vramGB = vramPerGPU * gpuCount;
-            } else if (directVRAM && Boolean(gpu.isMultiGPU || input.isMultiGPU)) {
-                vramGB = Math.max(directVRAM, directVRAM * gpuCount);
-            }
+        // Multi-GPU: only scale up when memory is known to be PER-GPU (vramPerGPU).
+        // A bare `vram`/`vramGB` is treated as the box total and never multiplied,
+        // so we don't double an already-total figure and falsely "fit" a model
+        // (e.g. a 2x24=48GB box must stay 48GB, not become 96GB).
+        if (!explicitTotalVRAM && gpuCount > 1 && vramPerGPU) {
+            vramGB = vramPerGPU * gpuCount;
         }
         let gpuType = gpu.type;
@@ -1152,6 +1151,17 @@ class DeterministicModelSelector {
             return explicitParams;
         }
+        // Use the variant's OWN artifact size to DISAMBIGUATE the model-level size
+        // list. A size-unknown variant (e.g. `:latest`) must not blindly inherit
+        // model_sizes[0]: for qwen3 (model_sizes ["30b","235b"]) that mislabeled a
+        // small qwen3:latest as 30B and poisoned the real qwen3:30b size map, making
+        // a 19GB model falsely "fit" a 16GB machine.
+        const artifactSizeGB = this.extractVariantSizeGB(variant, null);
+        const artifactParamsB =
+            (!this.isCloudVariantTag(variant.tag) && Number.isFinite(artifactSizeGB) && artifactSizeGB > 0)
+                ? this.inferParamsFromArtifactSizeGB(artifactSizeGB, quant)
+                : null;
         const metadataCandidates = this.extractParameterCandidates(
             ollamaModel.model_sizes,
             ollamaModel.parameters,
@@ -1159,12 +1169,23 @@ class DeterministicModelSelector {
             ollamaModel.parameter_count
         );
         if (metadataCandidates.length > 0) {
+            if (Number.isFinite(artifactParamsB) && artifactParamsB > 0) {
+                // Pick the listed size CLOSEST to what this variant's own artifact
+                // implies; if even the closest is far off, trust the artifact size.
+                let closest = metadataCandidates[0];
+                let bestDiff = Math.abs(closest - artifactParamsB);
+                for (const cand of metadataCandidates) {
+                    const diff = Math.abs(cand - artifactParamsB);
+                    if (diff < bestDiff) { bestDiff = diff; closest = cand; }
+                }
+                const tolerance = Math.max(2, closest * 0.5);
+                return bestDiff <= tolerance ? closest : artifactParamsB;
+            }
             return metadataCandidates[0];
         }
-        const artifactSizeGB = this.extractVariantSizeGB(variant, null);
-        if (!this.isCloudVariantTag(variant.tag) && Number.isFinite(artifactSizeGB) && artifactSizeGB > 0) {
-            return this.inferParamsFromArtifactSizeGB(artifactSizeGB, quant);
+        if (Number.isFinite(artifactParamsB) && artifactParamsB > 0) {
+            return artifactParamsB;
         }
         const modelArtifactSizeGB = this.extractArtifactSizeGBFromValue(ollamaModel.main_size);
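
The closest-candidate rule in the hunk above can be isolated as a pure function. A minimal sketch, assuming the artifact-implied parameter count is already known: `pickParamsB` is a hypothetical name, the inputs are illustrative, and the empty-list branch is a simplification (the real method continues with further fallbacks not shown in this diff).

```javascript
// Hypothetical standalone version of the selection rule; not the package's method.
function pickParamsB(metadataCandidates, artifactParamsB) {
    const hasArtifact = Number.isFinite(artifactParamsB) && artifactParamsB > 0;
    if (metadataCandidates.length === 0) return hasArtifact ? artifactParamsB : null;
    if (!hasArtifact) return metadataCandidates[0]; // nothing to disambiguate with

    // Pick the listed size closest to what the variant's own artifact implies.
    let closest = metadataCandidates[0];
    let bestDiff = Math.abs(closest - artifactParamsB);
    for (const cand of metadataCandidates) {
        const diff = Math.abs(cand - artifactParamsB);
        if (diff < bestDiff) { bestDiff = diff; closest = cand; }
    }
    // If even the closest listed size is far off, trust the artifact instead.
    const tolerance = Math.max(2, closest * 0.5);
    return bestDiff <= tolerance ? closest : artifactParamsB;
}

const listed = [30, 235]; // model-level size list, in billions of parameters

const smallVariant = pickParamsB(listed, 8);     // 22 away from 30, tolerance 15: keep 8
const nearThirty = pickParamsB(listed, 31.5);    // 1.5 away from 30: snap to 30
const nearLarge = pickParamsB(listed, 220);      // 15 away from 235, tolerance 117.5: snap to 235
const unknownSize = pickParamsB(listed, null);   // no artifact size: first listed value, 30

console.log({ smallVariant, nearThirty, nearLarge, unknownSize });
```

The tolerance scales with the chosen candidate (half its value, with a floor of 2), so large listed sizes absorb larger estimation error than small ones.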
@@ -1512,28 +1533,35 @@ class DeterministicModelSelector {
                 return false;
             }
+            // Guard against malformed external pool rows (a missing tags/modalities
+            // /name field used to throw and silently nuke the whole category).
+            const tags = Array.isArray(model.tags) ? model.tags : [];
+            const modalities = Array.isArray(model.modalities) ? model.modalities : [];
+            const name = String(model.name || model.model_identifier || '').toLowerCase();
+            const paramsB = Number(model.paramsB) || 0;
             switch (category) {
                 case 'coding':
-                    return model.tags.some(tag => ['coder', 'code', 'instruct'].includes(tag)) ||
-                           model.name.toLowerCase().includes('code');
+                    return tags.some(tag => ['coder', 'code', 'instruct'].includes(tag)) ||
+                           name.includes('code');
                 case 'multimodal':
-                    return model.modalities.includes('vision') ||
-                           model.tags.includes('vision');
+                    return modalities.includes('vision') ||
+                           tags.includes('vision');
                 case 'embeddings':
-                    return model.tags.includes('embedding') ||
-                           model.tags.includes('embeddings') ||
-                           model.name.toLowerCase().includes('embed') ||
-                           model.name.toLowerCase().includes('bge-') ||
-                           model.name.toLowerCase().includes('nomic-embed') ||
-                           model.name.toLowerCase().includes('all-minilm') ||
+                    return tags.includes('embedding') ||
+                           tags.includes('embeddings') ||
+                           name.includes('embed') ||
+                           name.includes('bge-') ||
+                           name.includes('nomic-embed') ||
+                           name.includes('all-minilm') ||
                            model.specialization === 'embeddings';
                 case 'reasoning':
-                    return model.tags.includes('instruct') ||
-                           model.paramsB >= 7; // Prefer larger models for reasoning
+                    return tags.includes('instruct') ||
+                           paramsB >= 7; // Prefer larger models for reasoning
                 default: // general, reading, summarization
                     return true; // Most models can handle these
             }
@@ -1711,15 +1739,19 @@ class DeterministicModelSelector {
             : (Number.isFinite(directVariantMatch) && directVariantMatch > 0 ? directVariantMatch : null);
         const parameterProfile = this.resolveMemoryParameterProfile(model);
-        const modeledWeightGB = parameterProfile.effectiveParamsB * bpp;
-        const preferSparseInferenceParams =
-            parameterProfile.isMoE &&
-            (parameterProfile.assumptionSource === 'moe_active_metadata' ||
-                parameterProfile.assumptionSource === 'moe_derived_expert_ratio');
-        const useObservedArtifactSize =
-            !preferSparseInferenceParams &&
-            Number.isFinite(observedWeightGB) &&
-            observedWeightGB > 0;
+        // Weight memory must account for ALL resident parameters. For MoE under
+        // Ollama / Metal / vLLM every expert is resident, so size the weights by
+        // the TOTAL parameter count (not the active count). Active params drive
+        // speed and KV-cache only. Sizing weights by active params used to make a
+        // 236B MoE look like ~14GB and falsely "fit" small hardware.
+        const weightParamsB =
+            parameterProfile.isMoE && Number.isFinite(parameterProfile.totalParamsB) && parameterProfile.totalParamsB > 0
+                ? parameterProfile.totalParamsB
+                : parameterProfile.effectiveParamsB;
+        const modeledWeightGB = weightParamsB * bpp;
+        // A real observed artifact size always wins for weight memory — never let
+        // an MoE "sparse inference" assumption discard a measured on-disk size.
+        const useObservedArtifactSize = Number.isFinite(observedWeightGB) && observedWeightGB > 0;
         const modelMemGB = useObservedArtifactSize ? observedWeightGB : modeledWeightGB;
         const effectiveCtx = Number.isFinite(Number(ctx)) && Number(ctx) > 0 ? Number(ctx) : 4096;
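
The effect of sizing MoE weights by total parameters is easiest to see with numbers. The sketch below is an illustration only: `estimateWeightMemory` is a hypothetical function that combines the weight-sizing rule above with the `memorySource` labels from this file's diff, and the bytes-per-parameter value and parameter counts are assumed, not taken from the package.

```javascript
// Hypothetical condensed version of the weight-memory decision; not the package's method.
function estimateWeightMemory(profile, bpp, observedWeightGB) {
    // MoE: all experts are resident, so weights are sized by TOTAL params.
    const weightParamsB =
        profile.isMoE && Number.isFinite(profile.totalParamsB) && profile.totalParamsB > 0
            ? profile.totalParamsB
            : profile.effectiveParamsB;
    const modeledWeightGB = weightParamsB * bpp;
    // A measured on-disk size always wins over the modeled figure.
    const useObserved = Number.isFinite(observedWeightGB) && observedWeightGB > 0;
    const usedMoeTotal = profile.isMoE && weightParamsB === profile.totalParamsB;
    return {
        modelMemGB: useObserved ? observedWeightGB : modeledWeightGB,
        memorySource: useObserved
            ? 'observed_artifact_size'
            : (usedMoeTotal ? 'moe_total_params' : 'estimated_from_params')
    };
}

const BPP = 0.5; // assumed bytes per parameter, chosen for round numbers

// Illustrative profiles: a large MoE (236B total, 21B active) and a 7B dense model.
const moe = { isMoE: true, totalParamsB: 236, effectiveParamsB: 21 };
const dense = { isMoE: false, totalParamsB: 7, effectiveParamsB: 7 };

const moeModeled = estimateWeightMemory(moe, BPP, null);    // 118 GB, 'moe_total_params'
const moeObserved = estimateWeightMemory(moe, BPP, 132.5);  // 132.5 GB, 'observed_artifact_size'
const denseModeled = estimateWeightMemory(dense, BPP, null); // 3.5 GB, 'estimated_from_params'

console.log(moeModeled, moeObserved, denseModeled);
```

Under the same assumed 0.5 bytes per parameter, sizing the MoE by its 21B active parameters would give 10.5 GB, which is the kind of false "fit" the change is meant to prevent.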
@@ -1729,9 +1761,10 @@ class DeterministicModelSelector {
         // Runtime overhead (Metal/CUDA context, buffers)
         const runtimeOverhead = useObservedArtifactSize ? 0.35 : 0.5;
+        const usedMoeTotal = parameterProfile.isMoE && weightParamsB === parameterProfile.totalParamsB;
         const memorySource = useObservedArtifactSize
             ? 'observed_artifact_size'
-            : (preferSparseInferenceParams ? 'moe_sparse_inference_params' : 'estimated_from_params');
+            : (usedMoeTotal ? 'moe_total_params' : 'estimated_from_params');
         return {
             parameterProfile,