@tobilu/qmd 1.1.1 → 1.1.5

package/CHANGELOG.md CHANGED
@@ -2,6 +2,88 @@

  ## [Unreleased]

+ ## [1.1.5] - 2026-03-07
+
+ Ambiguous queries like "performance" now produce dramatically better results
+ when the caller knows what they mean. The new `intent` parameter steers all
+ five pipeline stages — expansion, strong-signal bypass, chunk selection,
+ reranking, and snippet extraction — without searching on its own. Design and
+ original implementation by Ilya Grigorik (@vyalamar) in #180.
+
+ ### Changes
+
+ - **Intent parameter**: optional `intent` string disambiguates queries across
+ the entire search pipeline. Available via CLI (`--intent` flag or `intent:`
+ line in query documents), MCP (`intent` field on the query tool), and
+ programmatic API. Adapted from PR #180 (thanks @vyalamar).
+ - **Query expansion**: when intent is provided, the expansion LLM prompt
+ includes `Query intent: {intent}`, matching the finetune training data
+ format for better-aligned expansions.
+ - **Reranking**: intent is prepended to the rerank query so Qwen3-Reranker
+ scores with domain context.
+ - **Chunk selection**: intent terms scored at 0.5× weight alongside query
+ terms (1.0×) when selecting the best chunk per document for reranking.
+ - **Snippet extraction**: intent terms scored at 0.3× weight to nudge
+ snippets toward intent-relevant lines without overriding query anchoring.
+ - **Strong-signal bypass disabled with intent**: when intent is provided, the
+ BM25 strong-signal shortcut is skipped — the obvious keyword match may not
+ be what the caller wants.
+ - **MCP instructions**: callers are now guided to provide `intent` on every
+ search call for disambiguation.
+ - **Query document syntax**: `intent:` recognized as a line type. At most one
+ per document, cannot appear alone. Grammar updated in `docs/SYNTAX.md`.
+
+ ## [1.1.2] - 2026-03-07
+
+ 13 community PRs merged. GPU initialization replaced with node-llama-cpp's
+ built-in `autoAttempt` — deleting ~220 lines of manual fallback code and
+ fixing GPU issues reported across 10+ PRs in one shot. Reranking is faster
+ through chunk deduplication and a parallelism cap that prevents VRAM
+ exhaustion.
+
+ ### Changes
+
+ - **GPU init**: use node-llama-cpp's `build: "autoAttempt"` instead of manual
+ GPU backend detection. Automatically tries Metal/CUDA/Vulkan and falls back
+ gracefully. #310 (thanks @giladgd — the node-llama-cpp author)
+ - **Query `--explain`**: `qmd query --explain` exposes retrieval score traces
+ — backend scores, per-list RRF contributions, top-rank bonus, reranker
+ score, and final blended score. Works in JSON and CLI output. #242
+ (thanks @vyalamar)
+ - **Collection ignore patterns**: `ignore: ["Sessions/**", "*.tmp"]` in
+ collection config to exclude files from indexing. #304 (thanks @sebkouba)
+ - **Multilingual embeddings**: `QMD_EMBED_MODEL` env var lets you swap in
+ models like Qwen3-Embedding for non-English collections. #273 (thanks
+ @daocoding)
+ - **Configurable expansion context**: `QMD_EXPAND_CONTEXT_SIZE` env var
+ (default 2048) — previously used the model's full 40960-token window,
+ wasting VRAM. #313 (thanks @0xble)
+ - **`candidateLimit` exposed**: `-C` / `--candidate-limit` flag and MCP
+ parameter to tune how many candidates reach the reranker. #255 (thanks
+ @pandysp)
+ - **MCP multi-session**: HTTP transport now supports multiple concurrent
+ client sessions, each with its own server instance. #286 (thanks @joelev)
+
+ ### Fixes
+
+ - **Reranking performance**: cap parallel rerank contexts at 4 to prevent
+ VRAM exhaustion on high-core machines. Deduplicate identical chunk texts
+ before reranking — same content from different files now shares a single
+ reranker call. Cache scores by content hash instead of file path.
+ - Deactivate stale docs when all files are removed from a collection and
+ `qmd update` is run. #312 (thanks @0xble)
+ - Handle emoji-only filenames (`🐘.md` → `1f418.md`) instead of crashing.
+ #308 (thanks @debugerman)
+ - Skip unreadable files during indexing (e.g. iCloud-evicted files returning
+ EAGAIN) instead of crashing. #253 (thanks @jimmynail)
+ - Suppress progress bar escape sequences when stderr is not a TTY. #230
+ (thanks @dgilperez)
+ - Emit format-appropriate empty output (`[]` for JSON, CSV header for CSV,
+ etc.) instead of plain text "No results." #228 (thanks @amsminn)
+ - Correct Windows sqlite-vec package name (`sqlite-vec-windows-x64`) and add
+ `sqlite-vec-linux-arm64`. #225 (thanks @ilepn)
+ - Fix claude plugin setup CLI commands in README. #311 (thanks @gi11es)
+
  ## [1.1.1] - 2026-03-06

  ### Fixes
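
The 1.1.5 entry describes weighted term matching: query terms count at 1.0×, while intent terms count at 0.5× for chunk selection and 0.3× for snippet extraction. The sketch below illustrates that idea only. It assumes lowercase whitespace tokenization and substring matching, and `splitTerms`, `scoreText`, and `pickBest` are hypothetical names; the package's actual tokenization and scoring are not visible in this changelog.

```javascript
// Hypothetical helper: split text into lowercase terms.
function splitTerms(text) {
  return (text ?? "").toLowerCase().split(/\s+/).filter(Boolean);
}

// Each query term found in the candidate adds 1.0;
// each intent term found adds intentWeight (0.5 for chunks, 0.3 for snippets).
function scoreText(candidate, query, intent, intentWeight) {
  const haystack = candidate.toLowerCase();
  let score = 0;
  for (const term of splitTerms(query)) {
    if (haystack.includes(term)) score += 1.0;
  }
  for (const term of splitTerms(intent)) {
    if (haystack.includes(term)) score += intentWeight;
  }
  return score;
}

// Return the index of the highest-scoring candidate (first one wins ties).
function pickBest(candidates, query, intent, intentWeight) {
  let bestIndex = 0;
  let bestScore = -Infinity;
  candidates.forEach((candidate, i) => {
    const score = scoreText(candidate, query, intent, intentWeight);
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  });
  return bestIndex;
}

const sampleChunks = [
  "Database performance tuning and index design",
  "Employee performance review schedule",
];
// Without intent both chunks match "performance" equally, so the first wins.
console.log(pickBest(sampleChunks, "performance", undefined, 0.5)); // 0
// With intent, the second chunk gains 0.5 for "employee" and 0.5 for "review".
console.log(pickBest(sampleChunks, "performance", "annual employee review", 0.5)); // 1
```

Because intent terms carry less weight than query terms, a chunk that matches only the intent cannot outrank one that matches the query unless several intent terms hit, which is consistent with the changelog's goal of nudging results without overriding query anchoring.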
package/README.md CHANGED
@@ -97,8 +97,8 @@ Although the tool works perfectly fine when you just tell your agent to use it o
  **Claude Code** — Install the plugin (recommended):

  ```bash
- claude marketplace add tobi/qmd
- claude plugin add qmd@qmd
+ claude plugin marketplace add tobi/qmd
+ claude plugin install qmd@qmd
  ```

  Or configure MCP manually in `~/.claude/settings.json`:
@@ -252,12 +252,34 @@ QMD uses three local GGUF models (auto-downloaded on first use):

  | Model | Purpose | Size |
  |-------|---------|------|
- | `embeddinggemma-300M-Q8_0` | Vector embeddings | ~300MB |
+ | `embeddinggemma-300M-Q8_0` | Vector embeddings (default) | ~300MB |
  | `qwen3-reranker-0.6b-q8_0` | Re-ranking | ~640MB |
  | `qmd-query-expansion-1.7B-q4_k_m` | Query expansion (fine-tuned) | ~1.1GB |

  Models are downloaded from HuggingFace and cached in `~/.cache/qmd/models/`.

+ ### Custom Embedding Model
+
+ Override the default embedding model via the `QMD_EMBED_MODEL` environment variable.
+ This is useful for multilingual corpora (e.g. Chinese, Japanese, Korean) where
+ `embeddinggemma-300M` has limited coverage.
+
+ ```sh
+ # Use Qwen3-Embedding-0.6B for better multilingual (CJK) support
+ export QMD_EMBED_MODEL="hf:Qwen/Qwen3-Embedding-0.6B-GGUF/qwen3-embedding-0.6b-q8_0.gguf"
+
+ # After changing the model, re-embed all collections:
+ qmd embed -f
+ ```
+
+ Supported model families:
+ - **embeddinggemma** (default) — English-optimized, small footprint
+ - **Qwen3-Embedding** — Multilingual (119 languages including CJK), MTEB top-ranked
+
+ > **Note:** When switching embedding models, you must re-index with `qmd embed -f`
+ > since vectors are not cross-compatible between models. The prompt format is
+ > automatically adjusted for each model family.
+
  ## Installation

  ```sh
@@ -366,6 +388,7 @@ qmd query "user authentication"
  --min-score <num> # Minimum score threshold (default: 0)
  --full # Show full document content
  --line-numbers # Add line numbers to output
+ --explain # Include retrieval score traces (query, JSON/CLI output)
  --index <name> # Use named index

  # Output formats (for search and multi-get)
@@ -428,6 +451,9 @@ qmd search --md --full "error handling"
  # JSON output for scripting
  qmd query --json "quarterly reports"

+ # Inspect how each result was scored (RRF + rerank blend)
+ qmd query --json --explain "quarterly reports"
+
  # Use separate index for different knowledge base
  qmd --index work search "quarterly reports"
  ```
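
The `--explain` flag reports per-list RRF contributions. For readers unfamiliar with reciprocal rank fusion, the sketch below shows the standard formula, in which each list contributes 1 / (k + rank) to a document's fused score. The constant `k = 60`, the function name `rrfContributions`, and the input shape are assumptions for illustration; the package's own constants, its top-rank bonus, and the rerank blend are not shown in this diff.

```javascript
// Standard reciprocal rank fusion: contribution = 1 / (k + rank), rank starting at 1.
// `lists` maps a list name (e.g. "bm25", "vector") to an ordered array of doc ids.
function rrfContributions(lists, k = 60) {
  const byDoc = new Map();
  for (const [listName, docIds] of Object.entries(lists)) {
    docIds.forEach((docId, i) => {
      const rank = i + 1;
      const contribution = 1 / (k + rank);
      const entry = byDoc.get(docId) ?? { total: 0, contributions: {} };
      entry.contributions[listName] = contribution;
      entry.total += contribution;
      byDoc.set(docId, entry);
    });
  }
  return byDoc;
}

const trace = rrfContributions({
  bm25: ["a.md", "b.md"],
  vector: ["b.md", "c.md"],
});
// "b.md" appears in both lists, so it has two contributions and the highest total.
console.log(trace.get("b.md"));
```

A document found by both backends accumulates two contributions, which is why a per-list breakdown is useful when debugging why a result ranked where it did.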
@@ -16,6 +16,7 @@ export type ContextMap = Record<string, string>;
  export interface Collection {
  path: string;
  pattern: string;
+ ignore?: string[];
  context?: ContextMap;
  update?: string;
  includeByDefault?: boolean;
@@ -28,6 +28,7 @@ export type FormatOptions = {
  query?: string;
  useColor?: boolean;
  lineNumbers?: boolean;
+ intent?: string;
  };
  /**
  * Add line numbers to text content.
package/dist/formatter.js CHANGED
@@ -55,7 +55,7 @@ export function searchResultsToJson(results, opts = {}) {
  const output = results.map(row => {
  const bodyStr = row.body || "";
  let body = opts.full ? bodyStr : undefined;
- let snippet = !opts.full ? extractSnippet(bodyStr, query, 300, row.chunkPos).snippet : undefined;
+ let snippet = !opts.full ? extractSnippet(bodyStr, query, 300, row.chunkPos, undefined, opts.intent).snippet : undefined;
  if (opts.lineNumbers) {
  if (body)
  body = addLineNumbers(body);
@@ -82,7 +82,7 @@ export function searchResultsToCsv(results, opts = {}) {
  const header = "docid,score,file,title,context,line,snippet";
  const rows = results.map(row => {
  const bodyStr = row.body || "";
- const { line, snippet } = extractSnippet(bodyStr, query, 500, row.chunkPos);
+ const { line, snippet } = extractSnippet(bodyStr, query, 500, row.chunkPos, undefined, opts.intent);
  let content = opts.full ? bodyStr : snippet;
  if (opts.lineNumbers && content) {
  content = addLineNumbers(content);
@@ -121,7 +121,7 @@ export function searchResultsToMarkdown(results, opts = {}) {
  content = bodyStr;
  }
  else {
- content = extractSnippet(bodyStr, query, 500, row.chunkPos).snippet;
+ content = extractSnippet(bodyStr, query, 500, row.chunkPos, undefined, opts.intent).snippet;
  }
  if (opts.lineNumbers) {
  content = addLineNumbers(content);
@@ -138,7 +138,7 @@ export function searchResultsToXml(results, opts = {}) {
  const items = results.map(row => {
  const titleAttr = row.title ? ` title="${escapeXml(row.title)}"` : "";
  const bodyStr = row.body || "";
- let content = opts.full ? bodyStr : extractSnippet(bodyStr, query, 500, row.chunkPos).snippet;
+ let content = opts.full ? bodyStr : extractSnippet(bodyStr, query, 500, row.chunkPos, undefined, opts.intent).snippet;
  if (opts.lineNumbers) {
  content = addLineNumbers(content);
  }
package/dist/llm.d.ts CHANGED
@@ -4,16 +4,23 @@
  * Provides embeddings, text generation, and reranking using local GGUF models.
  */
  import { type Token as LlamaToken } from "node-llama-cpp";
+ /**
+ * Detect if a model URI uses the Qwen3-Embedding format.
+ * Qwen3-Embedding uses a different prompting style than nomic/embeddinggemma.
+ */
+ export declare function isQwen3EmbeddingModel(modelUri: string): boolean;
  /**
  * Format a query for embedding.
- * Uses nomic-style task prefix format for embeddinggemma.
+ * Uses nomic-style task prefix format for embeddinggemma (default).
+ * Uses Qwen3-Embedding instruct format when a Qwen embedding model is active.
  */
- export declare function formatQueryForEmbedding(query: string): string;
+ export declare function formatQueryForEmbedding(query: string, modelUri?: string): string;
  /**
  * Format a document for embedding.
- * Uses nomic-style format with title and text fields.
+ * Uses nomic-style format with title and text fields (default).
+ * Qwen3-Embedding encodes documents as raw text without special prefixes.
  */
- export declare function formatDocForEmbedding(text: string, title?: string): string;
+ export declare function formatDocForEmbedding(text: string, title?: string, modelUri?: string): string;
  /**
  * Token with log probability
  */
@@ -130,7 +137,7 @@ export type RerankDocument = {
  };
  export declare const LFM2_GENERATE_MODEL = "hf:LiquidAI/LFM2-1.2B-GGUF/LFM2-1.2B-Q4_K_M.gguf";
  export declare const LFM2_INSTRUCT_MODEL = "hf:LiquidAI/LFM2.5-1.2B-Instruct-GGUF/LFM2.5-1.2B-Instruct-Q4_K_M.gguf";
- export declare const DEFAULT_EMBED_MODEL_URI = "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
+ export declare const DEFAULT_EMBED_MODEL_URI: string;
  export declare const DEFAULT_RERANK_MODEL_URI = "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf";
  export declare const DEFAULT_GENERATE_MODEL_URI = "hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf";
  export declare const DEFAULT_MODEL_CACHE_DIR: string;
@@ -183,6 +190,11 @@ export type LlamaCppConfig = {
  generateModel?: string;
  rerankModel?: string;
  modelCacheDir?: string;
+ /**
+ * Context size used for query expansion generation contexts.
+ * Default: 2048. Can also be set via QMD_EXPAND_CONTEXT_SIZE.
+ */
+ expandContextSize?: number;
  /**
  * Inactivity timeout in ms before unloading contexts (default: 2 minutes, 0 to disable).
  *
@@ -210,6 +222,7 @@ export declare class LlamaCpp implements LLM {
  private generateModelUri;
  private rerankModelUri;
  private modelCacheDir;
+ private expandContextSize;
  private embedModelLoadPromise;
  private generateModelLoadPromise;
  private rerankModelLoadPromise;
@@ -317,8 +330,10 @@ export declare class LlamaCpp implements LLM {
  expandQuery(query: string, options?: {
  context?: string;
  includeLexical?: boolean;
+ intent?: string;
  }): Promise<Queryable[]>;
  private static readonly RERANK_TEMPLATE_OVERHEAD;
+ private static readonly RERANK_TARGET_DOCS_PER_CONTEXT;
  rerank(query: string, documents: RerankDocument[], options?: RerankOptions): Promise<RerankResult>;
  /**
  * Get device/GPU info for status display.
package/dist/llm.js CHANGED
@@ -3,25 +3,43 @@
  *
  * Provides embeddings, text generation, and reranking using local GGUF models.
  */
- import { getLlama, getLlamaGpuTypes, resolveModelFile, LlamaChatSession, LlamaLogLevel, } from "node-llama-cpp";
+ import { getLlama, resolveModelFile, LlamaChatSession, LlamaLogLevel, } from "node-llama-cpp";
  import { homedir } from "os";
  import { join } from "path";
  import { existsSync, mkdirSync, statSync, unlinkSync, readdirSync, readFileSync, writeFileSync } from "fs";
  // =============================================================================
  // Embedding Formatting Functions
  // =============================================================================
+ /**
+ * Detect if a model URI uses the Qwen3-Embedding format.
+ * Qwen3-Embedding uses a different prompting style than nomic/embeddinggemma.
+ */
+ export function isQwen3EmbeddingModel(modelUri) {
+ return /qwen.*embed/i.test(modelUri) || /embed.*qwen/i.test(modelUri);
+ }
  /**
  * Format a query for embedding.
- * Uses nomic-style task prefix format for embeddinggemma.
+ * Uses nomic-style task prefix format for embeddinggemma (default).
+ * Uses Qwen3-Embedding instruct format when a Qwen embedding model is active.
  */
- export function formatQueryForEmbedding(query) {
+ export function formatQueryForEmbedding(query, modelUri) {
+ const uri = modelUri ?? process.env.QMD_EMBED_MODEL ?? DEFAULT_EMBED_MODEL;
+ if (isQwen3EmbeddingModel(uri)) {
+ return `Instruct: Retrieve relevant documents for the given query\nQuery: ${query}`;
+ }
  return `task: search result | query: ${query}`;
  }
  /**
  * Format a document for embedding.
- * Uses nomic-style format with title and text fields.
+ * Uses nomic-style format with title and text fields (default).
+ * Qwen3-Embedding encodes documents as raw text without special prefixes.
  */
- export function formatDocForEmbedding(text, title) {
+ export function formatDocForEmbedding(text, title, modelUri) {
+ const uri = modelUri ?? process.env.QMD_EMBED_MODEL ?? DEFAULT_EMBED_MODEL;
+ if (isQwen3EmbeddingModel(uri)) {
+ // Qwen3-Embedding: documents are raw text, no task prefix
+ return title ? `${title}\n${text}` : text;
+ }
  return `title: ${title || "none"} | text: ${text}`;
  }
  // =============================================================================
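
To make the two prompt formats concrete, the sketch below copies the detection and formatting logic from this hunk into standalone functions and prints what each model family receives. As a simplification for illustration, the model URI is a required argument here instead of falling back to `QMD_EMBED_MODEL`; `formatQuery` and `formatDoc` are shortened stand-in names, not the package's exports.

```javascript
// Same regex test as in the hunk above.
function isQwen3EmbeddingModel(modelUri) {
  return /qwen.*embed/i.test(modelUri) || /embed.*qwen/i.test(modelUri);
}

// Stand-in for formatQueryForEmbedding with an explicit model URI.
function formatQuery(query, modelUri) {
  if (isQwen3EmbeddingModel(modelUri)) {
    return `Instruct: Retrieve relevant documents for the given query\nQuery: ${query}`;
  }
  return `task: search result | query: ${query}`;
}

// Stand-in for formatDocForEmbedding with an explicit model URI.
function formatDoc(text, title, modelUri) {
  if (isQwen3EmbeddingModel(modelUri)) {
    return title ? `${title}\n${text}` : text;
  }
  return `title: ${title || "none"} | text: ${text}`;
}

const GEMMA_URI = "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
const QWEN_URI = "hf:Qwen/Qwen3-Embedding-0.6B-GGUF/qwen3-embedding-0.6b-q8_0.gguf";

console.log(formatQuery("auth flow", GEMMA_URI));
// task: search result | query: auth flow
console.log(formatDoc("Body text", undefined, GEMMA_URI));
// title: none | text: Body text
console.log(JSON.stringify(formatQuery("auth flow", QWEN_URI)));
// "Instruct: Retrieve relevant documents for the given query\nQuery: auth flow"
console.log(JSON.stringify(formatDoc("Body text", "Intro", QWEN_URI)));
// "Intro\nBody text"
```

Note that detection is a substring heuristic on the URI: any model path containing both "qwen" and "embed" (in either order, case-insensitive) is treated as Qwen3-Embedding, so a renamed local file would fall back to the nomic-style format.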
@@ -29,7 +47,8 @@ export function formatDocForEmbedding(text, title) {
  // =============================================================================
  // HuggingFace model URIs for node-llama-cpp
  // Format: hf:<user>/<repo>/<file>
- const DEFAULT_EMBED_MODEL = "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
+ // Override via QMD_EMBED_MODEL env var (e.g. hf:Qwen/Qwen3-Embedding-0.6B-GGUF/qwen3-embedding-0.6b-q8_0.gguf)
+ const DEFAULT_EMBED_MODEL = process.env.QMD_EMBED_MODEL ?? "hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf";
  const DEFAULT_RERANK_MODEL = "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf";
  // const DEFAULT_GENERATE_MODEL = "hf:ggml-org/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf";
  const DEFAULT_GENERATE_MODEL = "hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf";
@@ -126,6 +145,24 @@ export async function pullModels(models, options = {}) {
  */
  // Default inactivity timeout: 5 minutes (keep models warm during typical search sessions)
  const DEFAULT_INACTIVITY_TIMEOUT_MS = 5 * 60 * 1000;
+ const DEFAULT_EXPAND_CONTEXT_SIZE = 2048;
+ function resolveExpandContextSize(configValue) {
+ if (configValue !== undefined) {
+ if (!Number.isInteger(configValue) || configValue <= 0) {
+ throw new Error(`Invalid expandContextSize: ${configValue}. Must be a positive integer.`);
+ }
+ return configValue;
+ }
+ const envValue = process.env.QMD_EXPAND_CONTEXT_SIZE?.trim();
+ if (!envValue)
+ return DEFAULT_EXPAND_CONTEXT_SIZE;
+ const parsed = Number.parseInt(envValue, 10);
+ if (!Number.isInteger(parsed) || parsed <= 0) {
+ process.stderr.write(`QMD Warning: invalid QMD_EXPAND_CONTEXT_SIZE="${envValue}", using default ${DEFAULT_EXPAND_CONTEXT_SIZE}.\n`);
+ return DEFAULT_EXPAND_CONTEXT_SIZE;
+ }
+ return parsed;
+ }
  export class LlamaCpp {
  llama = null;
  embedModel = null;
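
One detail of the environment parsing in this hunk: `Number.parseInt` stops at the first non-digit character, so a value such as `4096MB` parses to 4096 and is accepted without a warning. If stricter handling were wanted, a digits-only check before parsing is one option. The sketch below is an illustrative variant, not the package's code, and `parsePositiveInt` is a hypothetical name.

```javascript
// Hypothetical stricter parser: accept only strings made entirely of digits.
function parsePositiveInt(raw, fallback) {
  const value = (raw ?? "").trim();
  if (value === "") return fallback;
  // Reject trailing garbage such as "4096MB" that Number.parseInt would tolerate.
  if (!/^\d+$/.test(value)) return fallback;
  const parsed = Number.parseInt(value, 10);
  return Number.isSafeInteger(parsed) && parsed > 0 ? parsed : fallback;
}

console.log(Number.parseInt("4096MB", 10)); // 4096 (lenient built-in behavior)
console.log(parsePositiveInt("4096MB", 2048)); // 2048 (rejected, falls back)
console.log(parsePositiveInt(" 4096 ", 2048)); // 4096
console.log(parsePositiveInt("0", 2048)); // 2048
console.log(parsePositiveInt(undefined, 2048)); // 2048
```

Whether the lenient behavior matters in practice depends on how the variable is set; the point is only that the warning path in the hunk does not fire for values with a numeric prefix.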
@@ -137,6 +174,7 @@ export class LlamaCpp {
  generateModelUri;
  rerankModelUri;
  modelCacheDir;
+ expandContextSize;
  // Ensure we don't load the same model/context concurrently (which can allocate duplicate VRAM).
  embedModelLoadPromise = null;
  generateModelLoadPromise = null;
@@ -152,6 +190,7 @@ export class LlamaCpp {
  this.generateModelUri = config.generateModel || DEFAULT_GENERATE_MODEL;
  this.rerankModelUri = config.rerankModel || DEFAULT_RERANK_MODEL;
  this.modelCacheDir = config.modelCacheDir || MODEL_CACHE_DIR;
+ this.expandContextSize = resolveExpandContextSize(config.expandContextSize);
  this.inactivityTimeoutMs = config.inactivityTimeoutMs ?? DEFAULT_INACTIVITY_TIMEOUT_MS;
  this.disposeModelsOnInactivity = config.disposeModelsOnInactivity ?? false;
  }
@@ -249,27 +288,12 @@ export class LlamaCpp {
  */
  async ensureLlama() {
  if (!this.llama) {
- // Detect available GPU types and use the best one.
- // We can't rely on gpu:"auto" — it returns false even when CUDA is available
- // (likely a binary/build config issue in node-llama-cpp).
- // @ts-expect-error node-llama-cpp API compat
- const gpuTypes = await getLlamaGpuTypes();
- // Prefer CUDA > Metal > Vulkan > CPU
- const preferred = ["cuda", "metal", "vulkan"].find(g => gpuTypes.includes(g));
- let llama;
- if (preferred) {
- try {
- llama = await getLlama({ gpu: preferred, logLevel: LlamaLogLevel.error });
- }
- catch {
- llama = await getLlama({ gpu: false, logLevel: LlamaLogLevel.error });
- process.stderr.write(`QMD Warning: ${preferred} reported available but failed to initialize. Falling back to CPU.\n`);
- }
- }
- else {
- llama = await getLlama({ gpu: false, logLevel: LlamaLogLevel.error });
- }
- if (!llama.gpu) {
+ const llama = await getLlama({
+ // attempt to build
+ build: "autoAttempt",
+ logLevel: LlamaLogLevel.error
+ });
+ if (llama.gpu === false) {
  process.stderr.write("QMD Warning: no GPU acceleration, running on CPU (slow). Run 'qmd status' for details.\n");
  }
  this.llama = llama;
@@ -466,7 +490,7 @@ export class LlamaCpp {
  if (this.rerankContexts.length === 0) {
  const model = await this.ensureRerankModel();
  // ~960 MB per context with flash attention at contextSize 2048
- const n = await this.computeParallelism(1000);
+ const n = Math.min(await this.computeParallelism(1000), 4);
  const threads = await this.threadsPerContext(n);
  for (let i = 0; i < n; i++) {
  try {
@@ -667,9 +691,14 @@ export class LlamaCpp {
  content ::= [^\\n]+
  `
  });
- const prompt = `/no_think Expand this search query: ${query}`;
- // Create fresh context for each call
- const genContext = await this.generateModel.createContext();
+ const intent = options.intent;
+ const prompt = intent
+ ? `/no_think Expand this search query: ${query}\nQuery intent: ${intent}`
+ : `/no_think Expand this search query: ${query}`;
+ // Create a bounded context for expansion to prevent large default VRAM allocations.
+ const genContext = await this.generateModel.createContext({
+ contextSize: this.expandContextSize,
+ });
  const sequence = genContext.getSequence();
  const session = new LlamaChatSession({ contextSequence: sequence });
  try {
@@ -733,6 +762,7 @@ export class LlamaCpp {
  }
  // Qwen3 reranker chat template overhead (system prompt, tags, separators)
  static RERANK_TEMPLATE_OVERHEAD = 200;
+ static RERANK_TARGET_DOCS_PER_CONTEXT = 10;
  async rerank(query, documents, options = {}) {
  // Ping activity at start to keep models alive during this operation
  this.touchActivity();
@@ -742,41 +772,61 @@ export class LlamaCpp {
  // Budget = contextSize - template overhead - query tokens
  const queryTokens = model.tokenize(query).length;
  const maxDocTokens = LlamaCpp.RERANK_CONTEXT_SIZE - LlamaCpp.RERANK_TEMPLATE_OVERHEAD - queryTokens;
+ const truncationCache = new Map();
  const truncatedDocs = documents.map((doc) => {
+ const cached = truncationCache.get(doc.text);
+ if (cached !== undefined) {
+ return cached === doc.text ? doc : { ...doc, text: cached };
+ }
  const tokens = model.tokenize(doc.text);
- if (tokens.length <= maxDocTokens)
+ const truncatedText = tokens.length <= maxDocTokens
+ ? doc.text
+ : model.detokenize(tokens.slice(0, maxDocTokens));
+ truncationCache.set(doc.text, truncatedText);
+ if (truncatedText === doc.text)
  return doc;
- const truncatedText = model.detokenize(tokens.slice(0, maxDocTokens));
  return { ...doc, text: truncatedText };
  });
- // Build a map from document text to original indices (for lookup after sorting)
- const textToDoc = new Map();
+ // Deduplicate identical effective texts before scoring.
+ // This avoids redundant work for repeated chunks and fixes collisions where
+ // multiple docs map to the same chunk text.
+ const textToDocs = new Map();
  truncatedDocs.forEach((doc, index) => {
- textToDoc.set(doc.text, { file: doc.file, index });
+ const existing = textToDocs.get(doc.text);
+ if (existing) {
+ existing.push({ file: doc.file, index });
+ }
+ else {
+ textToDocs.set(doc.text, [{ file: doc.file, index }]);
+ }
  });
  // Extract just the text for ranking
- const texts = truncatedDocs.map((doc) => doc.text);
+ const texts = Array.from(textToDocs.keys());
  // Split documents across contexts for parallel evaluation.
  // Each context has its own sequence with a lock, so parallelism comes
  // from multiple contexts evaluating different chunks simultaneously.
- const n = contexts.length;
- const chunkSize = Math.ceil(texts.length / n);
- const chunks = Array.from({ length: n }, (_, i) => texts.slice(i * chunkSize, (i + 1) * chunkSize)).filter(chunk => chunk.length > 0);
- const allScores = await Promise.all(chunks.map((chunk, i) => contexts[i].rankAll(query, chunk)));
+ const activeContextCount = Math.max(1, Math.min(contexts.length, Math.ceil(texts.length / LlamaCpp.RERANK_TARGET_DOCS_PER_CONTEXT)));
+ const activeContexts = contexts.slice(0, activeContextCount);
+ const chunkSize = Math.ceil(texts.length / activeContexts.length);
+ const chunks = Array.from({ length: activeContexts.length }, (_, i) => texts.slice(i * chunkSize, (i + 1) * chunkSize)).filter(chunk => chunk.length > 0);
+ const allScores = await Promise.all(chunks.map((chunk, i) => activeContexts[i].rankAll(query, chunk)));
  // Reassemble scores in original order and sort
  const flatScores = allScores.flat();
  const ranked = texts
  .map((text, i) => ({ document: text, score: flatScores[i] }))
  .sort((a, b) => b.score - a.score);
- // Map back to our result format using the text-to-doc map
- const results = ranked.map((item) => {
- const docInfo = textToDoc.get(item.document);
- return {
- file: docInfo.file,
- score: item.score,
- index: docInfo.index,
- };
- });
+ // Map back to our result format.
+ const results = [];
+ for (const item of ranked) {
+ const docInfos = textToDocs.get(item.document) ?? [];
+ for (const docInfo of docInfos) {
+ results.push({
+ file: docInfo.file,
+ score: item.score,
+ index: docInfo.index,
+ });
+ }
+ }
  return {
  results,
  model: this.rerankModelUri,
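
The rerank hunk above does two things: it scores each distinct chunk text once and fans the score back out to every document that shares it, and it activates only as many contexts as the workload needs (about 10 texts per context). The standalone sketch below mirrors that logic with a synchronous stand-in scorer in place of the real reranker. `rerankDeduped`, `activeContextCount`, and the length-based scorer are illustrative assumptions, not package APIs.

```javascript
// Mirrors RERANK_TARGET_DOCS_PER_CONTEXT from the diff.
const TARGET_DOCS_PER_CONTEXT = 10;

// How many contexts to use for a given number of unique texts (always at least 1).
function activeContextCount(availableContexts, uniqueTexts) {
  return Math.max(1, Math.min(availableContexts, Math.ceil(uniqueTexts / TARGET_DOCS_PER_CONTEXT)));
}

// Score each distinct text once, then emit one result per original document.
// scoreFn is a stand-in for the reranker call.
function rerankDeduped(docs, scoreFn) {
  const textToDocs = new Map();
  docs.forEach((doc, index) => {
    const group = textToDocs.get(doc.text);
    if (group) group.push({ file: doc.file, index });
    else textToDocs.set(doc.text, [{ file: doc.file, index }]);
  });
  const ranked = Array.from(textToDocs.keys())
    .map((text) => ({ text, score: scoreFn(text) }))
    .sort((a, b) => b.score - a.score);
  const results = [];
  for (const item of ranked) {
    for (const info of textToDocs.get(item.text)) {
      results.push({ file: info.file, index: info.index, score: item.score });
    }
  }
  return results;
}

let scorerCalls = 0;
const sampleDocs = [
  { file: "a.md", text: "same chunk" },
  { file: "b.md", text: "same chunk" },
  { file: "c.md", text: "a longer, different chunk" },
];
// Toy scorer: longer text scores higher. Counts how often it is invoked.
const ranked = rerankDeduped(sampleDocs, (text) => { scorerCalls++; return text.length; });

console.log(scorerCalls); // 2 (three documents, two distinct texts)
console.log(ranked.map((r) => r.file)); // [ 'c.md', 'a.md', 'b.md' ]
console.log(activeContextCount(4, 7), activeContextCount(4, 25), activeContextCount(4, 100)); // 1 3 4
```

The previous single-value map kept only the last document for a given text, so duplicate chunks from different files collapsed into one result; storing an array per text is what lets `a.md` and `b.md` both appear with the same score.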
@@ -1033,7 +1083,8 @@ let defaultLlamaCpp = null;
  */
  export function getDefaultLlamaCpp() {
  if (!defaultLlamaCpp) {
- defaultLlamaCpp = new LlamaCpp();
+ const embedModel = process.env.QMD_EMBED_MODEL;
+ defaultLlamaCpp = new LlamaCpp(embedModel ? { embedModel } : {});
  }
  return defaultLlamaCpp;
  }