npm - nano-brain - Versions diffs - 2026.3.8 → 2026.3.9 - Mend

nano-brain 2026.3.8 → 2026.3.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/openspec/changes/nano-brain-resource-optimization/proposal.md +43 -0
package/package.json +1 -2
package/src/embeddings.ts +8 -178
package/src/expansion.ts +2 -66
package/src/reranker.ts +3 -95
package/src/server.ts +0 -1

package/openspec/changes/nano-brain-resource-optimization/proposal.md ADDED

@@ -0,0 +1,43 @@
+## Why
+nano-brain runs as an MCP server inside the ai-sandbox-wrapper Docker container — the same container that hosts OpenCode (the AI coding agent), 4+ Pyright LSP servers, Playwright MCP, GraphQL inspector, database inspector, and sequential-thinking MCP. Current container profile: ~20GB RAM total, 8 vCPUs, but nano-brain competes with all co-tenants for resources. Observed state: 17.3GB of 20GB used, only 2.7GB available, 2.9GB swap active — the system is under severe memory pressure.
+nano-brain's ~650MB baseline RAM (reranker model alone is ~500MB, loaded eagerly regardless of usage) is a significant contributor. When nano-brain indexes a codebase, its synchronous Tree-sitter parsing and 30+ `readFileSync` calls on overlay filesystem (2-5x slower than native) block the Node.js event loop for 25-85 seconds, consuming CPU quota that should serve MCP tool requests from the AI agent. The OpenCode process itself uses ~950MB RSS — nano-brain's resource consumption directly degrades the agent's responsiveness.
+This is not a theoretical concern: the container is already swapping 2.9GB to disk, which means every memory allocation triggers page faults that slow all processes.
+## What Changes
+- **Lazy model loading**: Reranker model loaded on first search request (if reranking enabled) instead of eagerly at startup. Saves ~500MB when reranking is unused.
+- **Model disposal**: `dispose()` methods call actual native `model.dispose()` on node-llama-cpp. Cleanup handler in server.ts disposes all models on shutdown.
+- **Tree-sitter parser pooling**: Reuse Parser instances across files instead of `new Parser()` per file. Pool size bounded by language count.
+- **Hash memoization**: Workspace root hash computed once and cached. Eliminates 10+ redundant SHA-256 computations per command.
+- **SQLite pragma tuning**: Add `cache_size`, `mmap_size`, and `temp_store` pragmas for faster query performance.
+- **Query embedding cache eviction**: Cap `llm_cache` entries of type `qembed` with LRU eviction (max 500 entries).
+- **Inference thread limiting**: Expose `nThreads` config for node-llama-cpp to control CPU core usage. Critical in containers where cgroup CPU quota is shared.
+- **Single-context mode**: Config option to use 1 inference context per model instead of up to 4, trading throughput for ~60% RAM reduction per model. Default to 1 when container detected.
+- **Event loop yield points**: Insert `setImmediate()` between file processing iterations in codebase indexing and symbol extraction loops.
+- **Container-aware defaults**: When `isInsideContainer()` returns true (already detected via `host.ts`), automatically apply conservative defaults: single-context mode, lazy model loading, reduced SQLite cache, lower thread count. No config required — just works in Docker.
+## Capabilities
+### New Capabilities
+- `lazy-model-loading`: Defer model initialization to first use with configurable eager/lazy mode per model
+- `resource-limits`: Configurable inference thread count, context pool size, cache bounds, and SQLite tuning
+- `parser-pooling`: Reusable Tree-sitter parser instances with bounded pool per language
+- `container-aware-defaults`: Auto-detect Docker/container environment and apply conservative resource defaults
+### Modified Capabilities
+- `mcp-server`: Server cleanup handler disposes models; model status reflects lazy loading state (not-loaded → loading → loaded)
+- `storage-limits`: Query embedding cache gains LRU eviction bound; SQLite pragma tuning affects storage performance characteristics
+## Impact
+- **Memory**: Baseline RAM reduced from ~650MB to ~150MB when reranker is lazy-loaded and single-context mode is used. Frees ~500MB for OpenCode and other co-tenant MCP servers in the shared container. Reduces swap pressure (currently 2.9GB swapped).
+- **CPU**: Event loop blocking reduced by 70-80% during indexing via yield points and parser pooling. MCP tool requests from the AI agent remain responsive during background indexing — critical since nano-brain shares the container's 8 vCPUs with OpenCode, 4 Pyright instances, and other MCP servers.
+- **Container co-tenancy**: Auto-detect Docker environment via existing `isInsideContainer()` in `host.ts` and apply conservative defaults. nano-brain becomes a good neighbor in the shared ai-sandbox-wrapper container.
+- **Overlay FS**: Async file I/O mitigates the 2-5x latency penalty of Docker overlay filesystem on `readFileSync` calls.
+- **Files changed**: `server.ts` (model lifecycle, cleanup), `store.ts` (pragmas, cache eviction), `embeddings.ts` (dispose, thread config, context pool), `reranker.ts` (lazy loading, dispose), `treesitter.ts` (parser pool), `codebase.ts` (yield points), `symbols.ts` (yield points), `types.ts` (config interfaces), `search.ts` (lazy provider resolution), `host.ts` (resource limit detection)
+- **Config**: New `resources` section in `config.yml` with `threads`, `contextPoolSize`, `lazyModels`, `cacheMaxEntries` fields. All optional — container-aware defaults kick in automatically.
+- **No breaking changes**: All optimizations are backward compatible. Existing non-container deployments keep current behavior unless explicitly configured.
+- **No new dependencies**: All changes use existing node-llama-cpp APIs, Node.js built-ins, and the existing `isInsideContainer()` detection.
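
The "lazy model loading" item in the proposal, together with the `not-loaded → loading → loaded` status it mentions, can be illustrated with a minimal sketch. Everything here (`LazyModel`, `ModelStatus`, the `load` callback) is a hypothetical name chosen for illustration and is not taken from the nano-brain source; the sketch only assumes that model initialisation is an async function.

```typescript
// Hypothetical sketch of lazy loading with a status field.
// None of these names come from the nano-brain source.
type ModelStatus = 'not-loaded' | 'loading' | 'loaded';

class LazyModel<T> {
  private status: ModelStatus = 'not-loaded';
  private pending: Promise<T> | null = null;

  // `load` stands in for whatever expensive initialisation the model needs.
  constructor(private readonly load: () => Promise<T>) {}

  getStatus(): ModelStatus {
    return this.status;
  }

  // The first call starts loading; later and concurrent calls share the same promise.
  get(): Promise<T> {
    if (this.pending === null) {
      this.status = 'loading';
      this.pending = this.load().then(
        (model) => {
          this.status = 'loaded';
          return model;
        },
        (error) => {
          // Reset so that a later call can retry after a failed load.
          this.status = 'not-loaded';
          this.pending = null;
          throw error;
        },
      );
    }
    return this.pending;
  }
}
```

With this shape, a server that never receives a rerank request never pays the load cost, and a status command can report `getStatus()` without triggering a load.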

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "nano-brain",
-  "version": "2026.3.8",
+  "version": "2026.3.9",
   "type": "module",
   "bin": {
     "nano-brain": "./bin/cli.js"
@@ -22,7 +22,6 @@
     "better-sqlite3": "^12.6.2",
     "chokidar": "^5.0.0",
     "fast-glob": "^3.3.3",
-    "node-llama-cpp": "^3.3.3",
     "sqlite-vec": "^0.1.7-alpha.2",
     "tree-sitter": "^0.22.4",
     "tree-sitter-javascript": "^0.23.1",

package/src/embeddings.ts CHANGED

@@ -1,7 +1,6 @@
-import { getLlama } from 'node-llama-cpp';
 import { promises as fs } from 'fs';
 import { join, dirname } from 'path';
-import { homedir, cpus } from 'os';
+import { homedir } from 'os';
 import type { EmbeddingResult, EmbeddingConfig } from './types.js';
 import { log } from './logger.js';
 import { resolveHostUrl } from './host.js';
@@ -16,94 +15,9 @@ export interface EmbeddingProvider {
 }
 export interface EmbeddingProviderOptions {
-  modelPath?: string;
-  cacheDir?: string;
   embeddingConfig?: EmbeddingConfig;
 }
-const DEFAULT_MODEL_URI = 'hf:nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf';
-const MODEL_NAME = 'nomic-embed-text-v1.5';
-const DIMENSIONS = 768;
-interface ParsedModelURI {
-  org: string;
-  repo: string;
-  file: string;
-}
-function parseModelURI(uri: string): ParsedModelURI | null {
-  const match = uri.match(/^hf:([^/]+)\/([^/]+)\/(.+\.gguf)$/);
-  if (!match) return null;
-  return {
-    org: match[1],
-    repo: match[2],
-    file: match[3],
-  };
-}
-async function downloadModel(url: string, destPath: string): Promise<void> {
-  console.log(`Downloading model from ${url}...`);
-  await fs.mkdir(dirname(destPath), { recursive: true });
-  const response = await fetch(url);
-  if (!response.ok) {
-    throw new Error(`Failed to download model: ${response.statusText}`);
-  }
-  const totalSize = parseInt(response.headers.get('content-length') || '0', 10);
-  let downloadedSize = 0;
-  const tempPath = `${destPath}.tmp`;
-  const fileHandle = await fs.open(tempPath, 'w');
-  try {
-    const reader = response.body?.getReader();
-    if (!reader) throw new Error('No response body');
-    while (true) {
-      const { done, value } = await reader.read();
-      if (done) break;
-      await fileHandle.write(value);
-      downloadedSize += value.length;
-      if (totalSize > 0) {
-        const percent = ((downloadedSize / totalSize) * 100).toFixed(1);
-        process.stdout.write(`\rDownload progress: ${percent}%`);
-      }
-    }
-    console.log('\nDownload complete');
-  } finally {
-    await fileHandle.close();
-  }
-  await fs.rename(tempPath, destPath);
-}
-export async function resolveModelPath(
-  uri: string,
-  cacheDir?: string
-): Promise<string> {
-  const parsed = parseModelURI(uri);
-  if (!parsed) {
-    throw new Error(`Invalid model URI format: ${uri}`);
-  }
-  const baseDir = cacheDir || join(homedir(), '.nano-brain', 'models');
-  const modelPath = join(baseDir, parsed.org, parsed.repo, parsed.file);
-  try {
-    await fs.access(modelPath);
-    return modelPath;
-  } catch {
-    const url = `https://huggingface.co/${parsed.org}/${parsed.repo}/resolve/main/${parsed.file}`;
-    await downloadModel(url, modelPath);
-    return modelPath;
-  }
-}
 function formatQueryPrompt(query: string): string {
   return `search_query: ${query}`;
 }
@@ -163,7 +77,7 @@ export async function checkOpenAIHealth(
 class OllamaEmbeddingProvider implements EmbeddingProvider {
   private url: string;
   private model: string;
-  private dimensions: number = DIMENSIONS;
+  private dimensions: number = 768;
   private maxChars: number = 6000;
   private contextTokens: number = 0;
@@ -455,75 +369,6 @@ class OpenAICompatibleEmbeddingProvider implements EmbeddingProvider {
   }
 }
-class EmbeddingProviderImpl implements EmbeddingProvider {
-  private contexts: any[] = [];
-  private currentContextIndex = 0;
-  constructor(
-    private model: any,
-    private parallelism: number
-  ) {}
-  async initialize(): Promise<void> {
-    for (let i = 0; i < this.parallelism; i++) {
-      const context = await this.model.createEmbeddingContext();
-      this.contexts.push(context);
-    }
-  }
-  async embed(text: string): Promise<EmbeddingResult> {
-    const context = this.contexts[0];
-    const result = await context.getEmbeddingFor(text);
-    return {
-      embedding: Array.from(result.vector),
-      model: MODEL_NAME,
-      dimensions: DIMENSIONS,
-    };
-  }
-  async embedBatch(texts: string[]): Promise<EmbeddingResult[]> {
-    const results: EmbeddingResult[] = [];
-    const batchSize = Math.min(4, this.parallelism);
-    for (let i = 0; i < texts.length; i += batchSize) {
-      const batch = texts.slice(i, i + batchSize);
-      const batchPromises = batch.map(async (text, idx) => {
-        const contextIdx = idx % this.contexts.length;
-        const context = this.contexts[contextIdx];
-        const result = await context.getEmbeddingFor(text);
-        return {
-          embedding: Array.from(result.vector) as number[],
-          model: MODEL_NAME,
-          dimensions: DIMENSIONS,
-        };
-      });
-      const batchResults = await Promise.all(batchPromises);
-      results.push(...batchResults);
-    }
-    return results;
-  }
-  getDimensions(): number {
-    return DIMENSIONS;
-  }
-  getModel(): string {
-    return MODEL_NAME;
-  }
-  getMaxChars(): number {
-    return 6000;
-  }
-  dispose(): void {
-    this.contexts = [];
-  }
-}
 export async function createEmbeddingProvider(
   options?: EmbeddingProviderOptions
 ): Promise<EmbeddingProvider | null> {
@@ -576,29 +421,14 @@ export async function createEmbeddingProvider(
         console.error('[embedding] Ollama explicitly configured but not reachable, no fallback');
         return null;
       }
-      log('embedding', 'createEmbeddingProvider ollama unreachable fallback=local');
-      console.warn('[embedding] Falling back to local node-llama-cpp...');
+      log('embedding', 'createEmbeddingProvider ollama unreachable no-fallback');
+      console.warn('[embedding] Ollama not reachable, no fallback available');
     }
   }
-  // Fallback to local node-llama-cpp
-  try {
-    const modelUri = options?.modelPath || DEFAULT_MODEL_URI;
-    const modelPath = await resolveModelPath(modelUri, options?.cacheDir);
-    const llama = await getLlama();
-    const model = await llama.loadModel({ modelPath });
-    const cpuCount = cpus().length;
-    const parallelism = Math.max(1, Math.min(4, Math.floor(cpuCount / 4)));
-    const provider = new EmbeddingProviderImpl(model, parallelism);
-    await provider.initialize();
-    log('embedding', 'createEmbeddingProvider selected=local model=' + MODEL_NAME);
-    console.error(`[embedding] Using local provider: ${MODEL_NAME}`);
-    return provider;
-  } catch (error) {
-    log('embedding', 'createEmbeddingProvider local failed');
-    console.warn('Failed to load embedding model:', error instanceof Error ? error.message : String(error));
-    return null;
-  }
+  log('embedding', 'createEmbeddingProvider no provider available');
+  console.error('[embedding] No embedding provider available. Configure openai or ollama in config.yml');
+  return null;
 }
-export { formatQueryPrompt, formatDocumentPrompt, parseModelURI };
+export { formatQueryPrompt, formatDocumentPrompt };
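
With the local fallback removed, `createEmbeddingProvider` returns `null` whenever neither an OpenAI-compatible endpoint nor Ollama is usable, so callers presumably need a path that works without vectors. The sketch below shows one way a caller might handle that. `MinimalEmbeddingProvider`, `SearchBackends`, and `searchWithOptionalEmbeddings` are hypothetical names for illustration; only the "provider may be `null`" contract comes from the diff.

```typescript
// Hypothetical caller-side sketch; only the "provider may be null" contract
// is taken from the diff above.
interface MinimalEmbeddingProvider {
  // Assumed subset of the real provider interface.
  embed(text: string): Promise<{ embedding: number[] }>;
}

// Placeholder search functions; the real project may structure these differently.
interface SearchBackends {
  keywordSearch(query: string): Promise<string[]>;
  vectorSearch(queryVector: number[]): Promise<string[]>;
}

async function searchWithOptionalEmbeddings(
  query: string,
  provider: MinimalEmbeddingProvider | null,
  backends: SearchBackends,
): Promise<{ mode: 'keyword-only' | 'hybrid'; hits: string[] }> {
  const keywordHits = await backends.keywordSearch(query);
  if (provider === null) {
    // No embedding provider configured: degrade to keyword search alone.
    return { mode: 'keyword-only', hits: keywordHits };
  }
  const { embedding } = await provider.embed(query);
  const vectorHits = await backends.vectorSearch(embedding);
  // Merge while preserving first-seen order and dropping duplicates.
  const hits = Array.from(new Set([...keywordHits, ...vectorHits]));
  return { mode: 'hybrid', hits };
}
```

The simple de-duplicating merge is a placeholder; a real hybrid search would more likely combine the two lists with a scoring scheme.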

package/src/expansion.ts CHANGED

@@ -1,6 +1,3 @@
-import { getLlama } from 'node-llama-cpp';
-import { resolveModelPath } from './embeddings.js';
 export interface QueryExpander {
   expand(query: string): Promise<string[]>;
   dispose(): void;
@@ -11,69 +8,8 @@ export interface QueryExpanderOptions {
   cacheDir?: string;
 }
-const DEFAULT_MODEL_URI = 'hf:tobi/qmd-query-expansion-1.7B-GGUF/qmd-query-expansion-1.7B-Q8_0.gguf';
-const MODEL_NAME = 'qmd-query-expansion-1.7B';
-class QueryExpanderImpl implements QueryExpander {
-  constructor(
-    private model: any,
-    private context: any
-  ) {}
-  async expand(query: string): Promise<string[]> {
-    try {
-      const prompt = `Generate 2 alternative search queries for: ${query}\n\n1.`;
-      const result = await this.context.evaluate([prompt], {
-        maxTokens: 200,
-        temperature: 0.7,
-      });
-      const generated = result?.text || '';
-      const lines = generated.split('\n').filter(line => line.trim());
-      const variants: string[] = [];
-      for (const line of lines) {
-        const match = line.match(/^\d+\.\s*(.+)$/);
-        if (match && match[1]) {
-          variants.push(match[1].trim());
-        }
-      }
-      if (variants.length >= 2) {
-        return variants.slice(0, 2);
-      }
-      return [query];
-    } catch (error) {
-      console.warn('Query expansion failed:', error instanceof Error ? error.message : String(error));
-      return [query];
-    }
-  }
-  dispose(): void {
-    this.context = null;
-  }
-}
 export async function createQueryExpander(
-  options?: QueryExpanderOptions
+  _options?: QueryExpanderOptions
 ): Promise<QueryExpander | null> {
-  try {
-    const modelUri = options?.modelPath || DEFAULT_MODEL_URI;
-    const modelPath = await resolveModelPath(modelUri, options?.cacheDir);
-    const llama = await getLlama();
-    const model = await llama.loadModel({ modelPath });
-    const context = await model.createContext({
-      contextSize: 2048,
-    });
-    return new QueryExpanderImpl(model, context);
-  } catch (error) {
-    console.warn('Failed to load query expander model:', error instanceof Error ? error.message : String(error));
-    return null;
-  }
+  return null;
 }

package/src/reranker.ts CHANGED

@@ -1,6 +1,3 @@
-import { getLlama } from 'node-llama-cpp';
-import { cpus } from 'os';
-import { resolveModelPath } from './embeddings.js';
 import type { RerankResult, RerankDocument } from './types.js';
 import { log } from './logger.js';
@@ -14,98 +11,9 @@ export interface RerankerOptions {
   cacheDir?: string;
 }
-const DEFAULT_MODEL_URI = 'hf:gpustack/bge-reranker-v2-m3-GGUF/bge-reranker-v2-m3-Q4_K_M.gguf';
-const MODEL_NAME = 'bge-reranker-v2-m3';
-const CONTEXT_SIZE = 8192;
-function sigmoid(x: number): number {
-  return 1 / (1 + Math.exp(-x));
-}
-class RerankerImpl implements Reranker {
-  private contexts: any[] = [];
-  constructor(
-    private model: any,
-    private parallelism: number
-  ) {}
-  async initialize(): Promise<void> {
-    log('reranker', 'initializing with parallelism=' + this.parallelism);
-    for (let i = 0; i < this.parallelism; i++) {
-      const context = await this.model.createContext({
-        contextSize: CONTEXT_SIZE,
-      });
-      this.contexts.push(context);
-    }
-  }
-  async rerank(query: string, documents: RerankDocument[]): Promise<RerankResult> {
-    log('reranker', 'rerank query="' + query.slice(0, 50) + '" docs=' + documents.length);
-    const scoredDocs: Array<{ file: string; score: number; index: number }> = [];
-    const batchSize = Math.min(4, this.parallelism);
-    log('reranker', 'batch size=' + batchSize);
-    for (let i = 0; i < documents.length; i += batchSize) {
-      const batch = documents.slice(i, i + batchSize);
-      const batchPromises = batch.map(async (doc, idx) => {
-        const contextIdx = idx % this.contexts.length;
-        const context = this.contexts[contextIdx];
-        const prompt = `Query: ${query}\nDocument: ${doc.text}`;
-        const result = await context.evaluate([prompt]);
-        const rawScore = result?.logits?.[0] || 0;
-        const normalizedScore = sigmoid(rawScore);
-        return {
-          file: doc.file,
-          score: normalizedScore,
-          index: doc.index,
-        };
-      });
-      const batchResults = await Promise.all(batchPromises);
-      scoredDocs.push(...batchResults);
-    }
-    scoredDocs.sort((a, b) => b.score - a.score);
-    log('reranker', 'rerank complete results=' + scoredDocs.length);
-    return {
-      results: scoredDocs,
-      model: MODEL_NAME,
-    };
-  }
-  dispose(): void {
-    this.contexts = [];
-  }
-}
 export async function createReranker(
-  options?: RerankerOptions
+  _options?: RerankerOptions
 ): Promise<Reranker | null> {
-  try {
-    log('reranker', 'loading reranker model');
-    const modelUri = options?.modelPath || DEFAULT_MODEL_URI;
-    const modelPath = await resolveModelPath(modelUri, options?.cacheDir);
-    const llama = await getLlama();
-    const model = await llama.loadModel({ modelPath });
-    const cpuCount = cpus().length;
-    const parallelism = Math.max(1, Math.min(4, Math.floor(cpuCount / 4)));
-    const reranker = new RerankerImpl(model, parallelism);
-    await reranker.initialize();
-    log('reranker', 'reranker model loaded successfully');
-    return reranker;
-  } catch (error) {
-    log('reranker', 'failed to load reranker model: ' + (error instanceof Error ? error.message : String(error)));
-    console.warn('Failed to load reranker model:', error instanceof Error ? error.message : String(error));
-    return null;
-  }
+  log('reranker', 'local reranker removed — use external reranker or rely on BM25+vector fusion');
+  return null;
 }
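
The new log message points users at "BM25+vector fusion" in place of the removed local reranker. The diff does not show how nano-brain fuses the two result lists, so the sketch below uses reciprocal rank fusion (RRF), a common technique chosen here purely as an assumption. The function name, the constant `k = 60`, and the file names are illustrative only.

```typescript
// Reciprocal rank fusion (RRF): score(d) = sum over lists of 1 / (k + rank(d)).
// Using RRF and k = 60 is an assumption for illustration, not nano-brain's
// confirmed method.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: two ranked lists of file names, best match first.
const bm25Ranking = ['auth.ts', 'db.ts', 'cache.ts'];
const vectorRanking = ['db.ts', 'queue.ts', 'auth.ts'];
const fused = reciprocalRankFusion([bm25Ranking, vectorRanking]);
// fused is ['db.ts', 'auth.ts', 'queue.ts', 'cache.ts']
```

RRF needs only rank positions, not comparable scores, which is why it is often used to combine lexical and vector results without a reranking model.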

package/src/server.ts CHANGED

@@ -98,7 +98,6 @@ export function formatStatus(
       lines.push(`  - Model available: ${hasModel ? '✅ yes' : '❌ not found — run: ollama pull ' + embeddingHealth.model}`)
     } else {
       lines.push(`  - Status: ❌ unreachable (${embeddingHealth.error})`)
-      lines.push(`  - Fallback: local GGUF (node-llama-cpp)`)
     }
   }
   if (codebaseStats) {