npm - @tobilu/qmd - Versions diffs - 1.1.1 → 1.1.2 - Mend

@tobilu/qmd 1.1.1 → 1.1.2

Files changed (10)

package/CHANGELOG.md CHANGED

@@ -2,6 +2,57 @@
 ## [Unreleased]
+## [1.1.2] - 2026-03-07
+13 community PRs merged. GPU initialization replaced with node-llama-cpp's
+built-in `autoAttempt` — deleting ~220 lines of manual fallback code and
+fixing GPU issues reported across 10+ PRs in one shot. Reranking is faster
+through chunk deduplication and a parallelism cap that prevents VRAM
+exhaustion.
+### Changes
+- **GPU init**: use node-llama-cpp's `build: "autoAttempt"` instead of manual
+  GPU backend detection. Automatically tries Metal/CUDA/Vulkan and falls back
+  gracefully. #310 (thanks @giladgd — the node-llama-cpp author)
+- **Query `--explain`**: `qmd query --explain` exposes retrieval score traces
+  — backend scores, per-list RRF contributions, top-rank bonus, reranker
+  score, and final blended score. Works in JSON and CLI output. #242
+  (thanks @vyalamar)
+- **Collection ignore patterns**: `ignore: ["Sessions/**", "*.tmp"]` in
+  collection config to exclude files from indexing. #304 (thanks @sebkouba)
+- **Multilingual embeddings**: `QMD_EMBED_MODEL` env var lets you swap in
+  models like Qwen3-Embedding for non-English collections. #273 (thanks
+  @daocoding)
+- **Configurable expansion context**: `QMD_EXPAND_CONTEXT_SIZE` env var
+  (default 2048) — previously used the model's full 40960-token window,
+  wasting VRAM. #313 (thanks @0xble)
+- **`candidateLimit` exposed**: `-C` / `--candidate-limit` flag and MCP
+  parameter to tune how many candidates reach the reranker. #255 (thanks
+  @pandysp)
+- **MCP multi-session**: HTTP transport now supports multiple concurrent
+  client sessions, each with its own server instance. #286 (thanks @joelev)
+### Fixes
+- **Reranking performance**: cap parallel rerank contexts at 4 to prevent
+  VRAM exhaustion on high-core machines. Deduplicate identical chunk texts
+  before reranking — same content from different files now shares a single
+  reranker call. Cache scores by content hash instead of file path.
+- Deactivate stale docs when all files are removed from a collection and
+  `qmd update` is run. #312 (thanks @0xble)
+- Handle emoji-only filenames (`🐘.md` → `1f418.md`) instead of crashing.
+  #308 (thanks @debugerman)
+- Skip unreadable files during indexing (e.g. iCloud-evicted files returning
+  EAGAIN) instead of crashing. #253 (thanks @jimmynail)
+- Suppress progress bar escape sequences when stderr is not a TTY. #230
+  (thanks @dgilperez)
+- Emit format-appropriate empty output (`[]` for JSON, CSV header for CSV,
+  etc.) instead of plain text "No results." #228 (thanks @amsminn)
+- Correct Windows sqlite-vec package name (`sqlite-vec-windows-x64`) and add
+  `sqlite-vec-linux-arm64`. #225 (thanks @ilepn)
+- Fix claude plugin setup CLI commands in README. #311 (thanks @gi11es)
 ## [1.1.1] - 2026-03-06
 ### Fixes
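
Note on the emoji-filename fix listed above: the changelog gives the mapping `🐘.md` → `1f418.md`, which is the hex value of the emoji's Unicode code point. The package's own implementation is not part of this diff, so the helper below is only a sketch of that mapping. The function name, the restriction to emoji-only stems, and the `-` separator between multiple code points are assumptions.

```javascript
// Hypothetical helper (not from the package): turn an emoji-only file stem
// into the hex code points of its characters, keeping the extension.
function emojiStemToHex(filename) {
  const dot = filename.lastIndexOf(".");
  const stem = dot > 0 ? filename.slice(0, dot) : filename;
  const ext = dot > 0 ? filename.slice(dot) : "";
  // Array.from iterates by code point, so a surrogate pair stays one element.
  const hex = Array.from(stem, (ch) => ch.codePointAt(0).toString(16));
  // Assumption: join multiple code points with "-".
  return hex.join("-") + ext;
}

console.log(emojiStemToHex("🐘.md")); // 1f418.md
```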

package/README.md CHANGED

@@ -97,8 +97,8 @@ Although the tool works perfectly fine when you just tell your agent to use it o
 **Claude Code** — Install the plugin (recommended):
 ```bash
-claude marketplace add tobi/qmd
-claude plugin add qmd@qmd
+claude plugin marketplace add tobi/qmd
+claude plugin install qmd@qmd
 ```
 Or configure MCP manually in `~/.claude/settings.json`:
@@ -252,12 +252,34 @@ QMD uses three local GGUF models (auto-downloaded on first use):
 | Model | Purpose | Size |
 |-------|---------|------|
-| `embeddinggemma-300M-Q8_0` | Vector embeddings | ~300MB |
+| `embeddinggemma-300M-Q8_0` | Vector embeddings (default) | ~300MB |
 | `qwen3-reranker-0.6b-q8_0` | Re-ranking | ~640MB |
 | `qmd-query-expansion-1.7B-q4_k_m` | Query expansion (fine-tuned) | ~1.1GB |
 Models are downloaded from HuggingFace and cached in `~/.cache/qmd/models/`.
+### Custom Embedding Model
+Override the default embedding model via the `QMD_EMBED_MODEL` environment variable.
+This is useful for multilingual corpora (e.g. Chinese, Japanese, Korean) where
+`embeddinggemma-300M` has limited coverage.
+```sh
+# Use Qwen3-Embedding-0.6B for better multilingual (CJK) support
+export QMD_EMBED_MODEL="hf:Qwen/Qwen3-Embedding-0.6B-GGUF/qwen3-embedding-0.6b-q8_0.gguf"
+# After changing the model, re-embed all collections:
+qmd embed -f
+```
+Supported model families:
+- **embeddinggemma** (default) — English-optimized, small footprint
+- **Qwen3-Embedding** — Multilingual (119 languages including CJK), MTEB top-ranked
+> **Note:** When switching embedding models, you must re-index with `qmd embed -f`
+> since vectors are not cross-compatible between models. The prompt format is
+> automatically adjusted for each model family.
 ## Installation
 ```sh
@@ -366,6 +388,7 @@ qmd query "user authentication"
 --min-score <num>  # Minimum score threshold (default: 0)
 --full             # Show full document content
 --line-numbers     # Add line numbers to output
+--explain          # Include retrieval score traces (query, JSON/CLI output)
 --index <name>     # Use named index
 # Output formats (for search and multi-get)
@@ -428,6 +451,9 @@ qmd search --md --full "error handling"
 # JSON output for scripting
 qmd query --json "quarterly reports"
+# Inspect how each result was scored (RRF + rerank blend)
+qmd query --json --explain "quarterly reports"
 # Use separate index for different knowledge base
 qmd --index work search "quarterly reports"
 ```
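
Note on `--explain`: the README hunk above says the flag exposes per-list RRF contributions. To make that term concrete, here is a minimal sketch of textbook reciprocal rank fusion that records each list's contribution. The function name, the constant `k = 60`, and the output shape are assumptions; qmd's actual weights, top-rank bonus, and reranker blend are not shown in this diff and are not reproduced here.

```javascript
// Textbook reciprocal rank fusion with a per-list trace.
// Assumption: k = 60 and equal list weights; qmd may use different values.
function rrfWithTrace(rankedLists, k = 60) {
  const traces = new Map();
  for (const [listName, ids] of Object.entries(rankedLists)) {
    ids.forEach((id, i) => {
      const contribution = 1 / (k + i + 1); // ranks are 1-based
      const trace = traces.get(id) ?? { id, score: 0, contributions: {} };
      trace.contributions[listName] = contribution;
      trace.score += contribution;
      traces.set(id, trace);
    });
  }
  // Highest fused score first.
  return [...traces.values()].sort((a, b) => b.score - a.score);
}

const fused = rrfWithTrace({
  lex: ["a.md", "b.md", "c.md"],
  vec: ["b.md", "c.md"],
});
console.log(fused.map((t) => t.id)); // [ 'b.md', 'c.md', 'a.md' ]
```

A document that appears in both lists (`b.md`, `c.md`) outranks one that tops a single list (`a.md`), which is the kind of effect a score trace makes visible.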

package/dist/collections.d.ts CHANGED

@@ -16,6 +16,7 @@ export type ContextMap = Record<string, string>;
 export interface Collection {
     path: string;
     pattern: string;
+    ignore?: string[];
     context?: ContextMap;
     update?: string;
     includeByDefault?: boolean;
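
Note on `ignore`: the new optional field takes glob patterns such as the changelog's `["Sessions/**", "*.tmp"]`. The matcher qmd uses is not visible in this diff, so the following is a simplified sketch of how such patterns could be applied to collection-relative paths. The helper names and the exact glob semantics (`*` stays within one path segment, `**` crosses segments) are assumptions; a real glob library may treat `*.tmp` in nested folders differently.

```javascript
// Simplified glob-to-RegExp conversion (assumed semantics, not qmd's matcher).
function globToRegExp(glob) {
  let re = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === "*" && glob[i + 1] === "*") {
      re += ".*"; // "**" may cross directory separators
      i++;
    } else if (ch === "*") {
      re += "[^/]*"; // "*" stays within one path segment
    } else if (ch === "?") {
      re += "[^/]";
    } else {
      re += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${re}$`);
}

// True when a collection-relative path matches any ignore pattern.
function isIgnored(relativePath, ignore = []) {
  return ignore.some((pattern) => globToRegExp(pattern).test(relativePath));
}

// Example Collection-shaped object; the path and pattern values are made up.
const collection = { path: "/notes", pattern: "**/*.md", ignore: ["Sessions/**", "*.tmp"] };
console.log(isIgnored("Sessions/2026/standup.md", collection.ignore)); // true
console.log(isIgnored("projects/plan.md", collection.ignore)); // false
```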

package/dist/llm.d.ts CHANGED

@@ -4,16 +4,23 @@
  * Provides embeddings, text generation, and reranking using local GGUF models.
  */
 import { type Token as LlamaToken } from "node-llama-cpp";
+/**
+ * Detect if a model URI uses the Qwen3-Embedding format.
+ * Qwen3-Embedding uses a different prompting style than nomic/embeddinggemma.
+ */
+export declare function isQwen3EmbeddingModel(modelUri: string): boolean;
 /**
  * Format a query for embedding.
- * Uses nomic-style task prefix format for embeddinggemma.
+ * Uses nomic-style task prefix format for embeddinggemma (default).
+ * Uses Qwen3-Embedding instruct format when a Qwen embedding model is active.
  */
-export declare function formatQueryForEmbedding(query: string): string;
+export declare function formatQueryForEmbedding(query: string, modelUri?: string): string;
 /**
  * Format a document for embedding.
- * Uses nomic-style format with title and text fields.
+ * Uses nomic-style format with title and text fields (default).
+ * Qwen3-Embedding encodes documents as raw text without special prefixes.
  */
-export declare function formatDocForEmbedding(text: string, title?: string): string;
+export declare function formatDocForEmbedding(text: string, title?: string, modelUri?: string): string;
 /**
  * Token with log probability
  */
@@ -130,7 +137,7 @@ export type RerankDocument = {
 };
 export declare const LFM2_GENERATE_MODEL = "hf:LiquidAI/LFM2-1.2B-GGUF/LFM2-1.2B-Q4_K_M.gguf";
 export declare const LFM2_INSTRUCT_MODEL = "hf:LiquidAI/LFM2.5-1.2B-Instruct-GGUF/LFM2.5-1.2B-Instruct-Q4_K_M.gguf";
-export declare const DEFAULT_EMBED_MODEL_URI = "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
+export declare const DEFAULT_EMBED_MODEL_URI: string;
 export declare const DEFAULT_RERANK_MODEL_URI = "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf";
 export declare const DEFAULT_GENERATE_MODEL_URI = "hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf";
 export declare const DEFAULT_MODEL_CACHE_DIR: string;
@@ -183,6 +190,11 @@ export type LlamaCppConfig = {
     generateModel?: string;
     rerankModel?: string;
     modelCacheDir?: string;
+    /**
+     * Context size used for query expansion generation contexts.
+     * Default: 2048. Can also be set via QMD_EXPAND_CONTEXT_SIZE.
+     */
+    expandContextSize?: number;
     /**
      * Inactivity timeout in ms before unloading contexts (default: 5 minutes, 0 to disable).
      *
@@ -210,6 +222,7 @@ export declare class LlamaCpp implements LLM {
     private generateModelUri;
     private rerankModelUri;
     private modelCacheDir;
+    private expandContextSize;
     private embedModelLoadPromise;
     private generateModelLoadPromise;
     private rerankModelLoadPromise;
@@ -319,6 +332,7 @@ export declare class LlamaCpp implements LLM {
         includeLexical?: boolean;
     }): Promise<Queryable[]>;
     private static readonly RERANK_TEMPLATE_OVERHEAD;
+    private static readonly RERANK_TARGET_DOCS_PER_CONTEXT;
     rerank(query: string, documents: RerankDocument[], options?: RerankOptions): Promise<RerankResult>;
     /**
      * Get device/GPU info for status display.

package/dist/llm.js CHANGED

@@ -3,25 +3,43 @@
  *
  * Provides embeddings, text generation, and reranking using local GGUF models.
  */
-import { getLlama, getLlamaGpuTypes, resolveModelFile, LlamaChatSession, LlamaLogLevel, } from "node-llama-cpp";
+import { getLlama, resolveModelFile, LlamaChatSession, LlamaLogLevel, } from "node-llama-cpp";
 import { homedir } from "os";
 import { join } from "path";
 import { existsSync, mkdirSync, statSync, unlinkSync, readdirSync, readFileSync, writeFileSync } from "fs";
 // =============================================================================
 // Embedding Formatting Functions
 // =============================================================================
+/**
+ * Detect if a model URI uses the Qwen3-Embedding format.
+ * Qwen3-Embedding uses a different prompting style than nomic/embeddinggemma.
+ */
+export function isQwen3EmbeddingModel(modelUri) {
+    return /qwen.*embed/i.test(modelUri) || /embed.*qwen/i.test(modelUri);
+}
 /**
  * Format a query for embedding.
- * Uses nomic-style task prefix format for embeddinggemma.
+ * Uses nomic-style task prefix format for embeddinggemma (default).
+ * Uses Qwen3-Embedding instruct format when a Qwen embedding model is active.
  */
-export function formatQueryForEmbedding(query) {
+export function formatQueryForEmbedding(query, modelUri) {
+    const uri = modelUri ?? process.env.QMD_EMBED_MODEL ?? DEFAULT_EMBED_MODEL;
+    if (isQwen3EmbeddingModel(uri)) {
+        return `Instruct: Retrieve relevant documents for the given query\nQuery: ${query}`;
+    }
     return `task: search result | query: ${query}`;
 }
 /**
  * Format a document for embedding.
- * Uses nomic-style format with title and text fields.
+ * Uses nomic-style format with title and text fields (default).
+ * Qwen3-Embedding encodes documents as raw text without special prefixes.
  */
-export function formatDocForEmbedding(text, title) {
+export function formatDocForEmbedding(text, title, modelUri) {
+    const uri = modelUri ?? process.env.QMD_EMBED_MODEL ?? DEFAULT_EMBED_MODEL;
+    if (isQwen3EmbeddingModel(uri)) {
+        // Qwen3-Embedding: documents are raw text, no task prefix
+        return title ? `${title}\n${text}` : text;
+    }
     return `title: ${title || "none"} | text: ${text}`;
 }
 // =============================================================================
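
Note on the hunk above: the prompt format now depends on the model URI. The snippet below restates the three functions in self-contained form, with the URI passed explicitly instead of read from `QMD_EMBED_MODEL`, so the resulting strings can be inspected. The logic is copied from the hunk; the `Explicit` suffix on the function names and the sample query and document text are illustrative only.

```javascript
function isQwen3EmbeddingModel(modelUri) {
  return /qwen.*embed/i.test(modelUri) || /embed.*qwen/i.test(modelUri);
}

// Same branches as formatQueryForEmbedding, minus the env/default lookup.
function formatQueryExplicit(query, modelUri) {
  if (isQwen3EmbeddingModel(modelUri)) {
    return `Instruct: Retrieve relevant documents for the given query\nQuery: ${query}`;
  }
  return `task: search result | query: ${query}`;
}

// Same branches as formatDocForEmbedding, minus the env/default lookup.
function formatDocExplicit(text, title, modelUri) {
  if (isQwen3EmbeddingModel(modelUri)) {
    return title ? `${title}\n${text}` : text;
  }
  return `title: ${title || "none"} | text: ${text}`;
}

const gemma = "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
const qwen = "hf:Qwen/Qwen3-Embedding-0.6B-GGUF/qwen3-embedding-0.6b-q8_0.gguf";

console.log(formatQueryExplicit("rate limits", gemma)); // task: search result | query: rate limits
console.log(formatDocExplicit("Body text", undefined, gemma)); // title: none | text: Body text
console.log(JSON.stringify(formatDocExplicit("Body text", "Title", qwen))); // "Title\nBody text"
```

The detection is a substring match on the URI, so the default reranker URI (which contains `Qwen3-Reranker` but no `embed`) is not treated as a Qwen3 embedding model.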
@@ -29,7 +47,8 @@ export function formatDocForEmbedding(text, title) {
 // =============================================================================
 // HuggingFace model URIs for node-llama-cpp
 // Format: hf:<user>/<repo>/<file>
-const DEFAULT_EMBED_MODEL = "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
+// Override via QMD_EMBED_MODEL env var (e.g. hf:Qwen/Qwen3-Embedding-0.6B-GGUF/qwen3-embedding-0.6b-q8_0.gguf)
+const DEFAULT_EMBED_MODEL = process.env.QMD_EMBED_MODEL ?? "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
 const DEFAULT_RERANK_MODEL = "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf";
 // const DEFAULT_GENERATE_MODEL = "hf:ggml-org/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf";
 const DEFAULT_GENERATE_MODEL = "hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf";
@@ -126,6 +145,24 @@ export async function pullModels(models, options = {}) {
  */
 // Default inactivity timeout: 5 minutes (keep models warm during typical search sessions)
 const DEFAULT_INACTIVITY_TIMEOUT_MS = 5 * 60 * 1000;
+const DEFAULT_EXPAND_CONTEXT_SIZE = 2048;
+function resolveExpandContextSize(configValue) {
+    if (configValue !== undefined) {
+        if (!Number.isInteger(configValue) || configValue <= 0) {
+            throw new Error(`Invalid expandContextSize: ${configValue}. Must be a positive integer.`);
+        }
+        return configValue;
+    }
+    const envValue = process.env.QMD_EXPAND_CONTEXT_SIZE?.trim();
+    if (!envValue)
+        return DEFAULT_EXPAND_CONTEXT_SIZE;
+    const parsed = Number.parseInt(envValue, 10);
+    if (!Number.isInteger(parsed) || parsed <= 0) {
+        process.stderr.write(`QMD Warning: invalid QMD_EXPAND_CONTEXT_SIZE="${envValue}", using default ${DEFAULT_EXPAND_CONTEXT_SIZE}.\n`);
+        return DEFAULT_EXPAND_CONTEXT_SIZE;
+    }
+    return parsed;
+}
 export class LlamaCpp {
     llama = null;
     embedModel = null;
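
Note on `resolveExpandContextSize`: the precedence is config value, then `QMD_EXPAND_CONTEXT_SIZE`, then the 2048 default. An invalid config value throws, while an invalid environment value only warns and falls back. The sketch below restates that logic with the environment passed in as a plain object (an assumption made here for easy experimentation; the real function reads `process.env`) and with the stderr warning omitted.

```javascript
const DEFAULT_EXPAND_CONTEXT_SIZE = 2048;

// Restatement of the hunk above; `env` is a plain object instead of process.env.
function resolveExpandContextSizeFrom(configValue, env = {}) {
  if (configValue !== undefined) {
    if (!Number.isInteger(configValue) || configValue <= 0) {
      throw new Error(`Invalid expandContextSize: ${configValue}. Must be a positive integer.`);
    }
    return configValue; // explicit config always wins
  }
  const envValue = env.QMD_EXPAND_CONTEXT_SIZE?.trim();
  if (!envValue) return DEFAULT_EXPAND_CONTEXT_SIZE;
  const parsed = Number.parseInt(envValue, 10);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    return DEFAULT_EXPAND_CONTEXT_SIZE; // the real code also writes a warning to stderr
  }
  return parsed;
}

console.log(resolveExpandContextSizeFrom(4096, { QMD_EXPAND_CONTEXT_SIZE: "1024" })); // 4096
console.log(resolveExpandContextSizeFrom(undefined, { QMD_EXPAND_CONTEXT_SIZE: " 1024 " })); // 1024
console.log(resolveExpandContextSizeFrom(undefined, { QMD_EXPAND_CONTEXT_SIZE: "abc" })); // 2048
```

One consequence of using `Number.parseInt`: a value with a numeric prefix such as `"4096tokens"` is accepted as 4096 rather than rejected.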
@@ -137,6 +174,7 @@ export class LlamaCpp {
     generateModelUri;
     rerankModelUri;
     modelCacheDir;
+    expandContextSize;
     // Ensure we don't load the same model/context concurrently (which can allocate duplicate VRAM).
     embedModelLoadPromise = null;
     generateModelLoadPromise = null;
@@ -152,6 +190,7 @@ export class LlamaCpp {
         this.generateModelUri = config.generateModel || DEFAULT_GENERATE_MODEL;
         this.rerankModelUri = config.rerankModel || DEFAULT_RERANK_MODEL;
         this.modelCacheDir = config.modelCacheDir || MODEL_CACHE_DIR;
+        this.expandContextSize = resolveExpandContextSize(config.expandContextSize);
         this.inactivityTimeoutMs = config.inactivityTimeoutMs ?? DEFAULT_INACTIVITY_TIMEOUT_MS;
         this.disposeModelsOnInactivity = config.disposeModelsOnInactivity ?? false;
     }
@@ -249,27 +288,12 @@ export class LlamaCpp {
      */
     async ensureLlama() {
         if (!this.llama) {
-            // Detect available GPU types and use the best one.
-            // We can't rely on gpu:"auto" — it returns false even when CUDA is available
-            // (likely a binary/build config issue in node-llama-cpp).
-            // @ts-expect-error node-llama-cpp API compat
-            const gpuTypes = await getLlamaGpuTypes();
-            // Prefer CUDA > Metal > Vulkan > CPU
-            const preferred = ["cuda", "metal", "vulkan"].find(g => gpuTypes.includes(g));
-            let llama;
-            if (preferred) {
-                try {
-                    llama = await getLlama({ gpu: preferred, logLevel: LlamaLogLevel.error });
-                }
-                catch {
-                    llama = await getLlama({ gpu: false, logLevel: LlamaLogLevel.error });
-                    process.stderr.write(`QMD Warning: ${preferred} reported available but failed to initialize. Falling back to CPU.\n`);
-                }
-            }
-            else {
-                llama = await getLlama({ gpu: false, logLevel: LlamaLogLevel.error });
-            }
-            if (!llama.gpu) {
+            const llama = await getLlama({
+                // "autoAttempt": node-llama-cpp tries Metal/CUDA/Vulkan and falls back gracefully
+                build: "autoAttempt",
+                logLevel: LlamaLogLevel.error
+            });
+            if (llama.gpu === false) {
                 process.stderr.write("QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.\n");
             }
             this.llama = llama;
@@ -466,7 +490,7 @@ export class LlamaCpp {
         if (this.rerankContexts.length === 0) {
             const model = await this.ensureRerankModel();
             // ~960 MB per context with flash attention at contextSize 2048
-            const n = await this.computeParallelism(1000);
+            const n = Math.min(await this.computeParallelism(1000), 4);
             const threads = await this.threadsPerContext(n);
             for (let i = 0; i < n; i++) {
                 try {
@@ -668,8 +692,10 @@ export class LlamaCpp {
       `
         });
         const prompt = `/no_think Expand this search query: ${query}`;
-        // Create fresh context for each call
-        const genContext = await this.generateModel.createContext();
+        // Create a bounded context for expansion to prevent large default VRAM allocations.
+        const genContext = await this.generateModel.createContext({
+            contextSize: this.expandContextSize,
+        });
         const sequence = genContext.getSequence();
         const session = new LlamaChatSession({ contextSequence: sequence });
         try {
@@ -733,6 +759,7 @@ export class LlamaCpp {
     }
     // Qwen3 reranker chat template overhead (system prompt, tags, separators)
     static RERANK_TEMPLATE_OVERHEAD = 200;
+    static RERANK_TARGET_DOCS_PER_CONTEXT = 10;
     async rerank(query, documents, options = {}) {
         // Ping activity at start to keep models alive during this operation
         this.touchActivity();
@@ -742,41 +769,61 @@ export class LlamaCpp {
         // Budget = contextSize - template overhead - query tokens
         const queryTokens = model.tokenize(query).length;
         const maxDocTokens = LlamaCpp.RERANK_CONTEXT_SIZE - LlamaCpp.RERANK_TEMPLATE_OVERHEAD - queryTokens;
+        const truncationCache = new Map();
         const truncatedDocs = documents.map((doc) => {
+            const cached = truncationCache.get(doc.text);
+            if (cached !== undefined) {
+                return cached === doc.text ? doc : { ...doc, text: cached };
+            }
             const tokens = model.tokenize(doc.text);
-            if (tokens.length <= maxDocTokens)
+            const truncatedText = tokens.length <= maxDocTokens
+                ? doc.text
+                : model.detokenize(tokens.slice(0, maxDocTokens));
+            truncationCache.set(doc.text, truncatedText);
+            if (truncatedText === doc.text)
                 return doc;
-            const truncatedText = model.detokenize(tokens.slice(0, maxDocTokens));
             return { ...doc, text: truncatedText };
         });
-        // Build a map from document text to original indices (for lookup after sorting)
-        const textToDoc = new Map();
+        // Deduplicate identical effective texts before scoring.
+        // This avoids redundant work for repeated chunks and fixes collisions where
+        // multiple docs map to the same chunk text.
+        const textToDocs = new Map();
         truncatedDocs.forEach((doc, index) => {
-            textToDoc.set(doc.text, { file: doc.file, index });
+            const existing = textToDocs.get(doc.text);
+            if (existing) {
+                existing.push({ file: doc.file, index });
+            }
+            else {
+                textToDocs.set(doc.text, [{ file: doc.file, index }]);
+            }
         });
         // Extract just the text for ranking
-        const texts = truncatedDocs.map((doc) => doc.text);
+        const texts = Array.from(textToDocs.keys());
         // Split documents across contexts for parallel evaluation.
         // Each context has its own sequence with a lock, so parallelism comes
         // from multiple contexts evaluating different chunks simultaneously.
-        const n = contexts.length;
-        const chunkSize = Math.ceil(texts.length / n);
-        const chunks = Array.from({ length: n }, (_, i) => texts.slice(i * chunkSize, (i + 1) * chunkSize)).filter(chunk => chunk.length > 0);
-        const allScores = await Promise.all(chunks.map((chunk, i) => contexts[i].rankAll(query, chunk)));
+        const activeContextCount = Math.max(1, Math.min(contexts.length, Math.ceil(texts.length / LlamaCpp.RERANK_TARGET_DOCS_PER_CONTEXT)));
+        const activeContexts = contexts.slice(0, activeContextCount);
+        const chunkSize = Math.ceil(texts.length / activeContexts.length);
+        const chunks = Array.from({ length: activeContexts.length }, (_, i) => texts.slice(i * chunkSize, (i + 1) * chunkSize)).filter(chunk => chunk.length > 0);
+        const allScores = await Promise.all(chunks.map((chunk, i) => activeContexts[i].rankAll(query, chunk)));
         // Reassemble scores in original order and sort
         const flatScores = allScores.flat();
         const ranked = texts
             .map((text, i) => ({ document: text, score: flatScores[i] }))
             .sort((a, b) => b.score - a.score);
-        // Map back to our result format using the text-to-doc map
-        const results = ranked.map((item) => {
-            const docInfo = textToDoc.get(item.document);
-            return {
-                file: docInfo.file,
-                score: item.score,
-                index: docInfo.index,
-            };
-        });
+        // Map back to our result format.
+        const results = [];
+        for (const item of ranked) {
+            const docInfos = textToDocs.get(item.document) ?? [];
+            for (const docInfo of docInfos) {
+                results.push({
+                    file: docInfo.file,
+                    score: item.score,
+                    index: docInfo.index,
+                });
+            }
+        }
         return {
             results,
             model: this.rerankModelUri,
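
Note on the rerank hunk above: identical chunk texts are grouped, each unique text is scored once, and the score is then fanned back out to every document that shared the text. The number of contexts used is also scaled to roughly 10 texts per context. The sketch below reproduces that flow with a synchronous stand-in scorer so it runs without a model. `rerankDeduped`, its parameter order, and the length-based scorer are hypothetical; the real code awaits `rankAll` on separate contexts in parallel and truncates texts first.

```javascript
// Sketch of dedupe-then-score. `rankAll(query, texts)` returns one score per text.
function rerankDeduped(query, docs, contextCount, rankAll, targetDocsPerContext = 10) {
  // Group documents by their text so each unique text is scored once.
  const textToDocs = new Map();
  docs.forEach((doc, index) => {
    const entry = { file: doc.file, index };
    const existing = textToDocs.get(doc.text);
    if (existing) existing.push(entry);
    else textToDocs.set(doc.text, [entry]);
  });
  const texts = Array.from(textToDocs.keys());

  // Use only as many contexts as the workload justifies.
  const active = Math.max(1, Math.min(contextCount, Math.ceil(texts.length / targetDocsPerContext)));
  const chunkSize = Math.ceil(texts.length / active);
  const chunks = Array.from({ length: active }, (_, i) =>
    texts.slice(i * chunkSize, (i + 1) * chunkSize)
  ).filter((chunk) => chunk.length > 0);

  // The real code runs these in parallel with Promise.all; here it is sequential.
  const flatScores = chunks.map((chunk) => rankAll(query, chunk)).flat();

  const ranked = texts
    .map((text, i) => ({ text, score: flatScores[i] }))
    .sort((a, b) => b.score - a.score);

  // Fan each score back out to every document that shared the text.
  const results = [];
  for (const item of ranked) {
    for (const info of textToDocs.get(item.text) ?? []) {
      results.push({ file: info.file, score: item.score, index: info.index });
    }
  }
  return { results, scoredTexts: texts.length };
}

// Stand-in scorer: longer text scores higher. Purely illustrative.
const byLength = (_query, texts) => texts.map((t) => t.length);
const out = rerankDeduped("q", [
  { file: "a.md", text: "hello world" },
  { file: "b.md", text: "hello world" },
  { file: "c.md", text: "hi" },
], 4, byLength);
console.log(out.scoredTexts); // 2
console.log(out.results.map((r) => r.file)); // [ 'a.md', 'b.md', 'c.md' ]
```

Keeping a list of documents per text, rather than a single entry, is what prevents the earlier collision where a later document with the same chunk text overwrote an earlier one.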
@@ -1033,7 +1080,8 @@ let defaultLlamaCpp = null;
  */
 export function getDefaultLlamaCpp() {
     if (!defaultLlamaCpp) {
-        defaultLlamaCpp = new LlamaCpp();
+        const embedModel = process.env.QMD_EMBED_MODEL;
+        defaultLlamaCpp = new LlamaCpp(embedModel ? { embedModel } : {});
     }
     return defaultLlamaCpp;
 }

package/dist/mcp.js CHANGED

@@ -12,6 +12,7 @@ import { fileURLToPath } from "url";
 import { McpServer, ResourceTemplate } from "@modelcontextprotocol/sdk/server/mcp.js";
 import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
 import { WebStandardStreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/webStandardStreamableHttp.js";
+import { isInitializeRequest } from "@modelcontextprotocol/sdk/types.js";
 import { z } from "zod";
 import { createStore, extractSnippet, addLineNumbers, structuredSearch, DEFAULT_MULTI_GET_MAX_BYTES, } from "./store.js";
 import { getCollection, getGlobalContext, getDefaultCollectionNames } from "./collections.js";
@@ -233,9 +234,10 @@ Intent-aware lex (C++ performance, not sports):
             searches: z.array(subSearchSchema).min(1).max(10).describe("Typed sub-queries to execute (lex/vec/hyde). First gets 2x weight."),
             limit: z.number().optional().default(10).describe("Max results (default: 10)"),
             minScore: z.number().optional().default(0).describe("Min relevance 0-1 (default: 0)"),
+            candidateLimit: z.number().optional().describe("Maximum candidates to rerank (default: 40, lower = faster but may miss results)"),
             collections: z.array(z.string()).optional().describe("Filter to collections (OR match)"),
         },
-    }, async ({ searches, limit, minScore, collections }) => {
+    }, async ({ searches, limit, minScore, candidateLimit, collections }) => {
         // Map to internal format
         const subSearches = searches.map(s => ({
             type: s.type,
@@ -247,6 +249,7 @@ Intent-aware lex (C++ performance, not sports):
             collections: effectiveCollections.length > 0 ? effectiveCollections : undefined,
             limit,
             minScore,
+            candidateLimit,
         });
         // Use first lex or vec query for snippet extraction
         const primaryQuery = searches.find(s => s.type === 'lex')?.query
@@ -425,12 +428,27 @@ export async function startMcpServer() {
  */
 export async function startMcpHttpServer(port, options) {
     const store = createStore();
-    const mcpServer = createMcpServer(store);
-    const transport = new WebStandardStreamableHTTPServerTransport({
-        sessionIdGenerator: () => randomUUID(),
-        enableJsonResponse: true,
-    });
-    await mcpServer.connect(transport);
+    // Session map: each client gets its own McpServer + Transport pair (MCP spec requirement).
+    // The store is shared — it's stateless SQLite, safe for concurrent access.
+    const sessions = new Map();
+    async function createSession() {
+        const transport = new WebStandardStreamableHTTPServerTransport({
+            sessionIdGenerator: () => randomUUID(),
+            enableJsonResponse: true,
+            onsessioninitialized: (sessionId) => {
+                sessions.set(sessionId, transport);
+                log(`${ts()} New session ${sessionId} (${sessions.size} active)`);
+            },
+        });
+        const server = createMcpServer(store);
+        await server.connect(transport);
+        transport.onclose = () => {
+            if (transport.sessionId) {
+                sessions.delete(transport.sessionId);
+            }
+        };
+        return transport;
+    }
     const startTime = Date.now();
     const quiet = options?.quiet ?? false;
     /** Format timestamp for request logging */
@@ -500,6 +518,7 @@ export async function startMcpHttpServer(port, options) {
                     collections: effectiveCollections.length > 0 ? effectiveCollections : undefined,
                     limit: params.limit ?? 10,
                     minScore: params.minScore ?? 0,
+                    candidateLimit: params.candidateLimit,
                 });
                 // Use first lex or vec query for snippet extraction
                 const primaryQuery = params.searches.find((s) => s.type === 'lex')?.query
@@ -531,6 +550,34 @@ export async function startMcpHttpServer(port, options) {
                     if (typeof v === "string")
                         headers[k] = v;
                 }
+                // Route to existing session or create new one on initialize
+                const sessionId = headers["mcp-session-id"];
+                let transport;
+                if (sessionId) {
+                    const existing = sessions.get(sessionId);
+                    if (!existing) {
+                        nodeRes.writeHead(404, { "Content-Type": "application/json" });
+                        nodeRes.end(JSON.stringify({
+                            jsonrpc: "2.0",
+                            error: { code: -32001, message: "Session not found" },
+                            id: body?.id ?? null,
+                        }));
+                        return;
+                    }
+                    transport = existing;
+                }
+                else if (isInitializeRequest(body)) {
+                    transport = await createSession();
+                }
+                else {
+                    nodeRes.writeHead(400, { "Content-Type": "application/json" });
+                    nodeRes.end(JSON.stringify({
+                        jsonrpc: "2.0",
+                        error: { code: -32000, message: "Bad Request: Missing session ID" },
+                        id: body?.id ?? null,
+                    }));
+                    return;
+                }
                 const request = new Request(url, { method: "POST", headers, body: rawBody });
                 const response = await transport.handleRequest(request, { parsedBody: body });
                 nodeRes.writeHead(response.status, Object.fromEntries(response.headers));
@@ -539,12 +586,33 @@ export async function startMcpHttpServer(port, options) {
                 return;
             }
             if (pathname === "/mcp") {
-                const url = `http://localhost:${port}${pathname}`;
                 const headers = {};
                 for (const [k, v] of Object.entries(nodeReq.headers)) {
                     if (typeof v === "string")
                         headers[k] = v;
                 }
+                // GET/DELETE must have a valid session
+                const sessionId = headers["mcp-session-id"];
+                if (!sessionId) {
+                    nodeRes.writeHead(400, { "Content-Type": "application/json" });
+                    nodeRes.end(JSON.stringify({
+                        jsonrpc: "2.0",
+                        error: { code: -32000, message: "Bad Request: Missing session ID" },
+                        id: null,
+                    }));
+                    return;
+                }
+                const transport = sessions.get(sessionId);
+                if (!transport) {
+                    nodeRes.writeHead(404, { "Content-Type": "application/json" });
+                    nodeRes.end(JSON.stringify({
+                        jsonrpc: "2.0",
+                        error: { code: -32001, message: "Session not found" },
+                        id: null,
+                    }));
+                    return;
+                }
+                const url = `http://localhost:${port}${pathname}`;
                 const rawBody = nodeReq.method !== "GET" && nodeReq.method !== "HEAD" ? await collectBody(nodeReq) : undefined;
                 const request = new Request(url, { method: nodeReq.method || "GET", headers, ...(rawBody ? { body: rawBody } : {}) });
                 const response = await transport.handleRequest(request);
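
Note on the two `/mcp` hunks above: together they define a small routing table keyed on the `mcp-session-id` header. The pure function below restates those rules so they can be read in one place. The function name and the returned `action` labels are hypothetical; the status codes, JSON-RPC error codes, and messages are taken from the hunks.

```javascript
// Restatement of the session routing rules; `sessions` is any Map keyed by session ID.
function routeMcpRequest(method, sessionId, isInitialize, sessions) {
  if (sessionId) {
    // A session ID was supplied: it must refer to a live session.
    if (sessions.has(sessionId)) return { action: "reuse", sessionId };
    return { action: "reject", status: 404, code: -32001, message: "Session not found" };
  }
  // No session ID: only a POST carrying an initialize request may open a session.
  if (method === "POST" && isInitialize) return { action: "create" };
  return { action: "reject", status: 400, code: -32000, message: "Bad Request: Missing session ID" };
}

const live = new Map([["abc", {}]]);
console.log(routeMcpRequest("POST", undefined, true, live).action); // create
console.log(routeMcpRequest("GET", "abc", false, live).action); // reuse
console.log(routeMcpRequest("DELETE", "gone", false, live).status); // 404
console.log(routeMcpRequest("GET", undefined, false, live).status); // 400
```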
@@ -571,7 +639,10 @@ export async function startMcpHttpServer(port, options) {
         if (stopping)
             return;
         stopping = true;
-        await transport.close();
+        for (const transport of sessions.values()) {
+            await transport.close();
+        }
+        sessions.clear();
         httpServer.close();
         store.close();
         await disposeDefaultLlamaCpp();