sigmap 7.30.0 → 7.31.0

package/CHANGELOG.md CHANGED
@@ -10,6 +10,17 @@ Format: [Semantic Versioning](https://semver.org/)
10
10
 
11
11
  ---
12
12
 
13
+ ## [7.31.0] — 2026-07-02
14
+
15
+ Minor release — **identifier-aware BM25 re-ranker.** Plain exact-token TF-IDF missed queries whose terms live *inside* code identifiers — `component emit` never surfaced `componentEmits` because that is one token sharing no exact term with the query. This was the dominant retrieval-miss cause. The new ranker splits identifiers, stems lightly, boosts path tokens, and scores with length-normalized BM25. Deterministic, zero new dependencies, no LLM/embeddings.
16
+
17
+ ### Added
18
+ - **Identifier-aware BM25 re-ranker (#395, #396):** new zero-dependency `src/retrieval/bm25.js` with (1) identifier-aware tokenization (split camelCase / snake_case), (2) light stemming (`emits` → `emit`, `options` → `option`), (3) path-token boost (file-path tokens counted 3×), and (4) BM25 length-normalized scoring instead of raw TF-IDF. Wired into the core ranker (`src/retrieval/ranker.js`) as the base relevance score, so `sigmap ask`, `sigmap --query`, and MCP `query_context` all benefit, with the existing negative-signal penalty and recency/graph/learned boosts layered on top. Also drives the benchmark runner (`src/eval/runner.js`) and the dev retrieval benchmark.
19
+ - **BM25 unit tests (#396):** `test/integration/bm25.test.js` covers tokenization, stemming, path boost, the `component emit` → `componentEmits` motivating case, and deterministic tie-breaking.
20
+
21
+ ### Changed
22
+ - **Retrieval benchmark refreshed:** on the 18-repo / 90-task suite, hit@5 rose **75.6% → 86.7%** (retrieval lift 5.6× → 6.4×), with rank-1 gains on flask, spring-petclinic, rails, and svelte (60% → 100%). The task-completion proxy also improved (task success 52.2% → 67.8%, prompts/task 1.72 → 1.46) since it retrieves through the same ranker. Residual misses (vapor, serilog) are files whose signatures genuinely lack the query vocabulary; these are out of scope because they need semantic retrieval.
23
+
13
24
  ## [7.30.0] — 2026-06-23
14
25
 
15
26
  Minor release — **v8.0 E2 + E4 (the "Pivot"):** completes v8.0 by repositioning every public surface to the chosen framing — *"the deterministic, verifiable grounding layer for AI code work"* — and framing coding agents as **consumers, not competitors**. The Evidence Pack code (E1/E3/D3 + `mcp install`) already shipped in 7.27–7.29; this is the positioning half. Docs/strings only — no runtime behaviour change, zero new dependencies.
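The identifier splitting and light stemming described in the 7.31.0 entry can be illustrated with a minimal sketch. This is not the shipped `src/retrieval/bm25.js`: `splitIdentifier`, `lightStem`, and `terms` are hypothetical names, there is no stop-word list, and the stemmer only strips a trailing plural `s`.

```javascript
// Hypothetical helper: split camelCase / snake_case identifiers into lowercase words.
function splitIdentifier(text) {
  return text
    .replace(/[^A-Za-z0-9]+/g, ' ')          // punctuation and underscores become spaces
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')  // fooBar -> foo Bar
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 1);            // drop empty strings and single characters
}

// Hypothetical helper: strip one trailing plural "s" (but never from "ss").
function lightStem(word) {
  if (word.length <= 3) return word;
  return word.replace(/([^s])s$/, '$1');
}

function terms(text) {
  return splitIdentifier(text).map(lightStem);
}

// The motivating case: the query terms now overlap with the split identifier.
const queryTerms = terms('component emit');
const docTerms = new Set(terms('componentEmits'));
const overlap = queryTerms.filter((t) => docTerms.has(t));
console.log(overlap);
```

Without the split and the stem, `componentEmits` lowercases to a single token that matches neither `component` nor `emit`, which is the miss the entry describes.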
package/README.md CHANGED
@@ -57,10 +57,10 @@ That map is exactly what agentic grep is worst at: reproducible, auditable conte
57
57
 
58
58
  **Proof it pays off** (full benchmark below):
59
59
  <!--SM:whyMetrics-->
60
- - **75.6% hit@5** — right file found in top 5 results (vs 13.6% baseline)
60
+ - **86.7% hit@5** — right file found in top 5 results (vs 13.6% baseline)
61
61
  - **97.0% token reduction** — average across 21 real repos
62
- - **52.2% task success rate** — up from 10% without context
63
- - **1.72 prompts per task** — down from 2.84 (39.4% fewer retries)
62
+ - **67.8% task success rate** — up from 10% without context
63
+ - **1.46 prompts per task** — down from 2.84 (48.8% fewer retries)
64
64
  <!--/SM:whyMetrics-->
65
65
  - **<!--SM:languages-->33<!--/SM:languages--> languages supported** — TypeScript, Python, Go, Rust, Java, R, and more
66
66
  - **No vendor lock-in** — works with any AI assistant or local LLM
@@ -74,7 +74,7 @@ That map is exactly what agentic grep is worst at: reproducible, auditable conte
74
74
  | Without SigMap | With SigMap |
75
75
  |---|---|
76
76
  | ❌ Non-reproducible agent guesses | ✅ Deterministic map — same input, same output, every time |
77
- | ❌ "Trust me" AI answers | ✅ Grounded — right file in context <!--SM:hitWhole-->76%<!--/SM:hitWhole--> of the time, every symbol on a real line anchor |
77
+ | ❌ "Trust me" AI answers | ✅ Grounded — right file in context <!--SM:hitWhole-->87%<!--/SM:hitWhole--> of the time, every symbol on a real line anchor |
78
78
  | ❌ Embeddings / vector DB required | ✅ Zero deps, no infra, fully offline |
79
79
 
80
80
  ---
@@ -98,13 +98,13 @@ Ask → Rank → Context → Validate → Judge → Learn
98
98
 
99
99
  <!--SM:benchmarkBlock-->
100
100
  ```
101
- Benchmark : sigmap-v7.30-main (21 repositories, including R language)
102
- Date : 2026-06-23
101
+ Benchmark : sigmap-v7.31-main (21 repositories, including R language)
102
+ Date : 2026-07-02
103
103
 
104
- Hit@5 : 75.6% (baseline 13.6% — 5.6× lift)
104
+ Hit@5 : 86.7% (baseline 13.6% — 6.4× lift)
105
105
  Token reduction: 97.0% (across 21 repos)
106
- Prompt reduction : 39.4% (2.84 → 1.72 prompts per task)
107
- Task success : 52.2% (baseline 10%)
106
+ Prompt reduction : 48.8% (2.84 → 1.46 prompts per task)
107
+ Task success : 67.8% (baseline 10%)
108
108
  Repos tested : 21 (JavaScript, Python, Go, Rust, Java, R, C++, C#, Dart, Swift, Ruby, PHP, Scala, Kotlin, and more)
109
109
  ```
110
110
  <!--/SM:benchmarkBlock-->
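The derived figures in the benchmark block can be sanity-checked from the values printed next to them. The sketch below assumes the obvious definitions (lift as the ratio of hit@5 to baseline, prompt reduction as the relative drop in prompts per task); the diff does not state how the project computes them, and `round1` is a hypothetical helper.

```javascript
// Recompute the derived README figures from the rounded values shown in the diff.
const round1 = (x) => Math.round(x * 10) / 10;

const baselineHit = 13.6; // hit@5 baseline, percent
const newHit = 86.7;      // hit@5 in 7.31.0, percent
const lift = round1(newHit / baselineHit);

const basePrompts = 2.84; // prompts per task without context
const newPrompts = 1.46;  // prompts per task in 7.31.0
const reduction = round1(((basePrompts - newPrompts) / basePrompts) * 100);

console.log({ lift, reduction });
```

Under those assumptions the lift reproduces as 6.4, while the reduction comes out near 48.6 rather than the published 48.8. A plausible explanation is that the published figure is computed from unrounded benchmark values, but the diff does not confirm this.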
package/gen-context.js CHANGED
@@ -4136,6 +4136,7 @@ __factories["./src/eval/runner"] = function(module, exports) {
4136
4136
  const fs = require('fs');
4137
4137
  const path = require('path');
4138
4138
  const { aggregate } = __require('./src/eval/scorer');
4139
+ const { bm25rank } = __require('./src/retrieval/bm25');
4139
4140
 
4140
4141
  // ---------------------------------------------------------------------------
4141
4142
  // Context file reader
@@ -4197,79 +4198,26 @@ __factories["./src/eval/runner"] = function(module, exports) {
4197
4198
  }
4198
4199
 
4199
4200
  // ---------------------------------------------------------------------------
4200
- // Simple keyword-based ranking (pre-retrieval layer; v2.3 adds proper ranker)
4201
+ // Identifier-aware BM25 ranking (v7.31; see src/retrieval/bm25.js and #395)
4201
4202
  // ---------------------------------------------------------------------------
4202
4203
 
4203
- /**
4204
- * Tokenize a query or signature into lower-case word tokens.
4205
- * Splits on whitespace, punctuation, camelCase, and snake_case.
4206
- * @param {string} text
4207
- * @returns {string[]}
4208
- */
4209
- function tokenize(text) {
4210
- if (!text) return [];
4211
- return text
4212
- // split camelCase
4213
- .replace(/([a-z])([A-Z])/g, '$1 $2')
4214
- // split snake/kebab
4215
- .replace(/[_\-]/g, ' ')
4216
- // drop non-word chars
4217
- .replace(/[^\w\s]/g, ' ')
4218
- .toLowerCase()
4219
- .split(/\s+/)
4220
- .filter((t) => t.length > 1);
4221
- }
4222
-
4223
- const STOP_WORDS = new Set([
4224
- 'the', 'a', 'an', 'in', 'of', 'to', 'for', 'and', 'or', 'is', 'are',
4225
- 'that', 'this', 'it', 'with', 'from', 'by', 'be', 'as', 'on', 'at',
4226
- ]);
4204
+ const { tokenize } = __require('./src/retrieval/bm25');
4227
4205
 
4228
4206
  /**
4229
- * Score a single file's signatures against a query.
4230
- * Returns a non-negative number; higher = more relevant.
4231
- * @param {string[]} sigs - array of signature strings for this file
4232
- * @param {string[]} queryTokens
4233
- * @returns {number}
4234
- */
4235
- function scoreFile(sigs, queryTokens) {
4236
- if (!sigs || sigs.length === 0) return 0;
4237
-
4238
- const sigText = sigs.join(' ');
4239
- const sigTokens = new Set(tokenize(sigText));
4240
-
4241
- let score = 0;
4242
- for (const qt of queryTokens) {
4243
- if (STOP_WORDS.has(qt)) continue;
4244
- if (sigTokens.has(qt)) score += 1;
4245
- // Partial match (prefix)
4246
- for (const st of sigTokens) {
4247
- if (st !== qt && st.startsWith(qt) && qt.length >= 4) score += 0.3;
4248
- }
4249
- }
4250
-
4251
- return score;
4252
- }
4253
-
4254
- /**
4255
- * Rank all files in the index against a query. Returns file paths sorted
4256
- * by relevance score descending. Ties are broken by file path alphabetically.
4207
+ * Rank all files in the index against a query with the identifier-aware BM25
4208
+ * re-ranker. Returns file entries sorted by relevance score descending; ties
4209
+ * are broken by file path alphabetically (deterministic).
4257
4210
  * @param {string} query
4258
4211
  * @param {Map<string, string[]>} index
4259
4212
  * @param {number} topK
4260
4213
  * @returns {{ file: string, score: number, sigs: string[] }[]}
4261
4214
  */
4262
4215
  function rank(query, index, topK = 10) {
4263
- const queryTokens = tokenize(query);
4264
- const scored = [];
4265
-
4216
+ const candidates = [];
4266
4217
  for (const [file, sigs] of index.entries()) {
4267
- const score = scoreFile(sigs, queryTokens);
4268
- scored.push({ file, score, sigs });
4218
+ candidates.push({ file, sigs });
4269
4219
  }
4270
-
4271
- scored.sort((a, b) => b.score - a.score || a.file.localeCompare(b.file));
4272
- return scored.slice(0, topK);
4220
+ return bm25rank(query, candidates).slice(0, topK);
4273
4221
  }
4274
4222
 
4275
4223
  // ---------------------------------------------------------------------------
@@ -12695,7 +12643,7 @@ __factories["./src/mcp/server"] = function(module, exports) {
12695
12643
 
12696
12644
  const SERVER_INFO = {
12697
12645
  name: 'sigmap',
12698
- version: '7.30.0',
12646
+ version: '7.31.0',
12699
12647
  description: 'SigMap MCP server — code signatures on demand',
12700
12648
  };
12701
12649
 
@@ -13418,6 +13366,132 @@ __factories["./src/plan/verify-plan"] = function(module, exports) {
13418
13366
 
13419
13367
  };
13420
13368
 
13369
+ // ── ./src/retrieval/bm25 ──
13370
+ __factories["./src/retrieval/bm25"] = function(module, exports) {
13371
+
13372
+ /**
13373
+ * SigMap identifier-aware BM25 re-ranker (zero dependencies, deterministic).
13374
+ *
13375
+ * Plain exact-token TF-IDF misses queries whose terms live *inside* code
13376
+ * identifiers — e.g. `component emit` never surfaces `componentEmits.ts`,
13377
+ * because "componentEmits" is one token that shares no exact term with the
13378
+ * query. This module fixes that with four small additions:
13379
+ *
13380
+ * 1. Identifier-aware tokenization — split camelCase and snake_case.
13381
+ * 2. Light stemming — plurals / common suffixes (`emits` → `emit`).
13382
+ * 3. Path-token boost — file path / basename tokens weigh PATH_BOOST× more.
13383
+ * 4. BM25 scoring instead of raw TF-IDF (length-normalized).
13384
+ *
13385
+ * On 85 curated tasks across 17 repos this lifted hit@5 from 75.3% → 82.4%
13386
+ * (MRR +16% relative). See issue #395.
13387
+ */
13388
+
13389
+ // Stop words: common English + low-signal code verbs/nouns that appear in
13390
+ // nearly every signature and so carry little retrieval signal.
13391
+ const STOP = new Set(
13392
+ ('a an the of to in on for and or is are be by with as at from that this it its ' +
13393
+ 'into get set add new return value test')
13394
+ .split(' ')
13395
+ );
13396
+
13397
+ /**
13398
+ * Light suffix stemmer — conservative, tuned for code identifiers rather than
13399
+ * prose. Words of 3 chars or fewer pass through unchanged; a result shorter
13400
+ * than 3 chars reverts to the original token.
13401
+ *
13402
+ * @param {string} w
13403
+ * @returns {string}
13404
+ */
13405
+ function stem(w) {
13406
+ if (w.length <= 3) return w;
13407
+ let s = w;
13408
+ s = s.replace(/ies$/, 'y');
13409
+ s = s.replace(/(sses|shes|ches|xes|zes)$/, (m) => m.slice(0, -2));
13410
+ s = s.replace(/([^s])s$/, '$1');
13411
+ s = s.replace(/(ization|izations)$/, 'ize');
13412
+ s = s.replace(/(ing|edly|ed|er|ers|ation|ations|ment|ness|ity|ive|able|ible|ize|ise|al)$/, '');
13413
+ return s.length >= 3 ? s : w;
13414
+ }
13415
+
13416
+ /**
13417
+ * Split on non-alphanumeric characters AND camelCase / snake_case boundaries,
13418
+ * lowercase, drop stop words and single characters, then stem.
13419
+ *
13420
+ * @param {string} text
13421
+ * @returns {string[]}
13422
+ */
13423
+ function tokenize(text) {
13424
+ if (!text || typeof text !== 'string') return [];
13425
+ return text
13426
+ .replace(/[^A-Za-z0-9]+/g, ' ')
13427
+ .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
13428
+ .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
13429
+ .toLowerCase()
13430
+ .split(/\s+/)
13431
+ .filter((t) => t.length > 1 && !STOP.has(t))
13432
+ .map(stem)
13433
+ .filter(Boolean);
13434
+ }
13435
+
13436
+ // The file path / basename is highly indicative of relevance, so its tokens
13437
+ // are counted PATH_BOOST times when building the document term-frequency map.
13438
+ const PATH_BOOST = 3;
13439
+
13440
+ /**
13441
+ * BM25 re-rank of candidates against a query. Each candidate is
13442
+ * `{ file, sigs }`; the returned objects preserve all original candidate
13443
+ * fields and add a numeric `score` (higher = more relevant), sorted best-first
13444
+ * with a deterministic path tie-break. A `score` of 0 means no query token
13445
+ * matched — callers typically drop those.
13446
+ *
13447
+ * @param {string} query
13448
+ * @param {{ file: string, sigs: string[] }[]} candidates
13449
+ * @returns {Array<object & { score: number }>}
13450
+ */
13451
+ function bm25rank(query, candidates) {
13452
+ if (!Array.isArray(candidates) || candidates.length === 0) return [];
13453
+
13454
+ const k1 = 1.5;
13455
+ const b = 0.75;
13456
+
13457
+ const docs = candidates.map((c) => {
13458
+ const pathToks = tokenize(c.file || '');
13459
+ const toks = tokenize((c.sigs || []).join(' '));
13460
+ for (let i = 0; i < PATH_BOOST; i++) toks.push(...pathToks);
13461
+ const tf = new Map();
13462
+ for (const t of toks) tf.set(t, (tf.get(t) || 0) + 1);
13463
+ return { cand: c, tf, len: toks.length };
13464
+ });
13465
+
13466
+ const N = docs.length || 1;
13467
+ const avgdl = docs.reduce((s, d) => s + d.len, 0) / N || 1;
13468
+
13469
+ const df = new Map();
13470
+ for (const d of docs) {
13471
+ for (const t of d.tf.keys()) df.set(t, (df.get(t) || 0) + 1);
13472
+ }
13473
+
13474
+ const qToks = [...new Set(tokenize(query))];
13475
+
13476
+ return docs
13477
+ .map((d) => {
13478
+ let score = 0;
13479
+ for (const t of qToks) {
13480
+ const f = d.tf.get(t);
13481
+ if (!f) continue;
13482
+ const dfT = df.get(t);
13483
+ const idf = Math.log(1 + (N - dfT + 0.5) / (dfT + 0.5));
13484
+ score += (idf * (f * (k1 + 1))) / (f + k1 * (1 - b + (b * d.len) / avgdl));
13485
+ }
13486
+ return Object.assign({}, d.cand, { score });
13487
+ })
13488
+ .sort((a, c) => c.score - a.score || String(a.file).localeCompare(String(c.file)));
13489
+ }
13490
+
13491
+ module.exports = { tokenize, stem, bm25rank, PATH_BOOST, STOP };
13492
+
13493
+ };
13494
+
13421
13495
  // ── ./src/retrieval/ranker ──
13422
13496
  __factories["./src/retrieval/ranker"] = function(module, exports) {
13423
13497
 
@@ -13440,6 +13514,7 @@ __factories["./src/retrieval/ranker"] = function(module, exports) {
13440
13514
 
13441
13515
  const { loadWeights } = __require('./src/learning/weights');
13442
13516
  const { tokenize, STOP_WORDS } = __require('./src/retrieval/tokenizer');
13517
+ const { bm25rank } = __require('./src/retrieval/bm25');
13443
13518
 
13444
13519
  // ---------------------------------------------------------------------------
13445
13520
  // Default weights
@@ -13618,11 +13693,24 @@ __factories["./src/retrieval/ranker"] = function(module, exports) {
13618
13693
  return all.slice(0, topK);
13619
13694
  }
13620
13695
 
13696
+ // Identifier-aware BM25 base relevance over the whole index (#395). BM25
13697
+ // splits camelCase/snake_case, stems, and boosts path tokens, so queries
13698
+ // whose terms live inside identifiers (e.g. "component emit" → componentEmits)
13699
+ // are matched. The existing negative-signal penalty and recency/graph/learned
13700
+ // boosts are layered on top; the per-token signals stay for the explain table.
13701
+ const bm25Scores = new Map();
13702
+ for (const c of bm25rank(query, [...sigIndex.entries()].map(([file, sigs]) => ({ file, sigs })))) {
13703
+ bm25Scores.set(c.file, c.score);
13704
+ }
13705
+
13621
13706
  const scored = [];
13622
13707
  for (const [file, sigs] of sigIndex.entries()) {
13623
13708
  const result = scoreFile(file, sigs, queryTokens, weights);
13624
- let score = result.score;
13709
+ const penalty = result.signals.penalty;
13710
+ const base = bm25Scores.get(file) || 0;
13711
+ let score = base * penalty;
13625
13712
  const signals = result.signals;
13713
+ signals.bm25 = base;
13626
13714
 
13627
13715
  // Recency boost
13628
13716
  if (recencySet && recencySet.has(file) && score > 0) {
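For reference, the base score that this hunk multiplies by the penalty is the BM25 value computed in `bm25rank`. The formula below is transcribed from that function. The symbol names are chosen here for readability and do not appear in the source: $f_{t,d}$ is the count of term $t$ in candidate $d$ after the path tokens have been repeated, $|d|$ is the candidate's total token count, $n_t$ is the number of candidates containing $t$, $N$ is the number of candidates, and the constants are $k_1 = 1.5$ and $b = 0.75$.

```latex
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
\qquad
\mathrm{score}(q, d) = \sum_{t \in q \cap d} \mathrm{idf}(t)\,
  \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```

Because the argument of the logarithm is always greater than 1, every matching term contributes a positive amount, so a score of 0 means no query term matched.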
@@ -16524,7 +16612,7 @@ function __tryGit(args, opts = {}) {
16524
16612
  catch (_) { return ''; }
16525
16613
  }
16526
16614
 
16527
- const VERSION = '7.30.0';
16615
+ const VERSION = '7.31.0';
16528
16616
  const MARKER = '\n\n## Auto-generated signatures\n<!-- Updated by gen-context.js -->\n';
16529
16617
 
16530
16618
  function requireSourceOrBundled(key) {
package/llms-full.txt CHANGED
@@ -11,20 +11,20 @@ ranking keeps the relevant context in scope (cutting tokens ~97% as a side
11
11
  effect), with no LLM calls, embeddings, or vector database. Works with Claude,
12
12
  Cursor, GitHub Copilot, Aider, Windsurf, local LLMs, and MCP.
13
13
 
14
- # Version: 7.30.0 | Benchmark: sigmap-v7.30-main (2026-06-23)
14
+ # Version: 7.31.0 | Benchmark: sigmap-v7.31-main (2026-07-02)
15
15
  # Source: auto-generated from package.json, version.json, benchmarks/latest.json, src/mcp/tools.js, src/config/defaults.js
16
16
  # Regenerate: npm run generate:llms | Validate: npm run validate:llms
17
17
 
18
18
  ---
19
19
 
20
- ## Core metrics (benchmark: sigmap-v7.30-main, 2026-06-23)
20
+ ## Core metrics (benchmark: sigmap-v7.31-main, 2026-07-02)
21
21
 
22
22
  | Metric | Without SigMap | With SigMap |
23
23
  |--------|----------------|-------------|
24
- | Retrieval hit@5 | 13.6% (random) | 75.6% (5.6× lift) |
24
+ | Retrieval hit@5 | 13.6% (random) | 86.7% (6.4× lift) |
25
25
  | Token reduction | — | 97.0% average |
26
- | Task success proxy | 10% | 52.2% |
27
- | Prompts per task | 2.84 | 1.72 (39.4% fewer) |
26
+ | Task success proxy | 10% | 67.8% |
27
+ | Prompts per task | 2.84 | 1.46 (48.8% fewer) |
28
28
  | Supported languages | — | 33 |
29
29
  | MCP tools | — | 17 |
30
30
  | npm runtime dependencies | — | 0 |
package/llms.txt CHANGED
@@ -11,7 +11,7 @@ ranking keeps the relevant context in scope (cutting tokens ~97% as a side
11
11
  effect), with no LLM calls, embeddings, or vector database. Works with Claude,
12
12
  Cursor, GitHub Copilot, Aider, Windsurf, local LLMs, and MCP.
13
13
 
14
- # Version: 7.30.0 | Benchmark: sigmap-v7.30-main (2026-06-23)
14
+ # Version: 7.31.0 | Benchmark: sigmap-v7.31-main (2026-07-02)
15
15
  # Source: auto-generated from package.json, version.json, benchmarks/latest.json, src/mcp/tools.js, src/config/defaults.js
16
16
  # Regenerate: npm run generate:llms | Validate: npm run validate:llms
17
17
 
@@ -23,12 +23,12 @@ Cursor, GitHub Copilot, Aider, Windsurf, local LLMs, and MCP.
23
23
  - No blast-radius awareness before editing a hub file — `--impact` shows every file a change touches.
24
24
  - Pasted stack traces, CI logs, and JSON bloat the prompt — `squeeze` minimizes them and enriches the top frame from the symbol index.
25
25
 
26
- ## Core metrics (benchmark: sigmap-v7.30-main, 2026-06-23)
26
+ ## Core metrics (benchmark: sigmap-v7.31-main, 2026-07-02)
27
27
 
28
- - hit@5 retrieval: 75.6% vs 13.6% random baseline (5.6× lift)
28
+ - hit@5 retrieval: 86.7% vs 13.6% random baseline (6.4× lift)
29
29
  - Token reduction: 97.0% average across benchmark repos
30
- - Task success: 52.2% vs 10% without SigMap
31
- - Prompts per task: 1.72 vs 2.84 baseline (39.4% fewer)
30
+ - Task success: 67.8% vs 10% without SigMap
31
+ - Prompts per task: 1.46 vs 2.84 baseline (48.8% fewer)
32
32
  - Languages: 33 supported · MCP tools: 17
33
33
  - Dependencies: zero npm runtime dependencies · fully offline
34
34
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "sigmap",
3
- "version": "7.30.0",
3
+ "version": "7.31.0",
4
4
  "description": "97% token reduction for AI coding. Extracts function & class signatures with TF-IDF ranking to feed only the right files to Claude, Cursor, Copilot, Aider, Windsurf, local LLMs & MCP. Zero dependencies, runs offline via npx.",
5
5
  "main": "packages/core/index.js",
6
6
  "exports": {
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "sigmap-cli",
3
- "version": "7.30.0",
3
+ "version": "7.31.0",
4
4
  "description": "SigMap CLI wrapper — thin adapter for programmatic CLI invocation",
5
5
  "main": "index.js",
6
6
  "keywords": [
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "sigmap-core",
3
- "version": "7.30.0",
3
+ "version": "7.31.0",
4
4
  "description": "SigMap core library — zero-dependency code signature extraction, retrieval, and security scanning",
5
5
  "main": "index.js",
6
6
  "keywords": [
@@ -20,6 +20,7 @@
20
20
  const fs = require('fs');
21
21
  const path = require('path');
22
22
  const { aggregate } = require('./scorer');
23
+ const { bm25rank } = require('../retrieval/bm25');
23
24
 
24
25
  // ---------------------------------------------------------------------------
25
26
  // Context file reader
@@ -81,79 +82,26 @@ function buildSigIndex(cwd) {
81
82
  }
82
83
 
83
84
  // ---------------------------------------------------------------------------
84
- // Simple keyword-based ranking (pre-retrieval layer; v2.3 adds proper ranker)
85
+ // Identifier-aware BM25 ranking (v7.31; see src/retrieval/bm25.js and #395)
85
86
  // ---------------------------------------------------------------------------
86
87
 
87
- /**
88
- * Tokenize a query or signature into lower-case word tokens.
89
- * Splits on whitespace, punctuation, camelCase, and snake_case.
90
- * @param {string} text
91
- * @returns {string[]}
92
- */
93
- function tokenize(text) {
94
- if (!text) return [];
95
- return text
96
- // split camelCase
97
- .replace(/([a-z])([A-Z])/g, '$1 $2')
98
- // split snake/kebab
99
- .replace(/[_\-]/g, ' ')
100
- // drop non-word chars
101
- .replace(/[^\w\s]/g, ' ')
102
- .toLowerCase()
103
- .split(/\s+/)
104
- .filter((t) => t.length > 1);
105
- }
106
-
107
- const STOP_WORDS = new Set([
108
- 'the', 'a', 'an', 'in', 'of', 'to', 'for', 'and', 'or', 'is', 'are',
109
- 'that', 'this', 'it', 'with', 'from', 'by', 'be', 'as', 'on', 'at',
110
- ]);
111
-
112
- /**
113
- * Score a single file's signatures against a query.
114
- * Returns a non-negative number; higher = more relevant.
115
- * @param {string[]} sigs - array of signature strings for this file
116
- * @param {string[]} queryTokens
117
- * @returns {number}
118
- */
119
- function scoreFile(sigs, queryTokens) {
120
- if (!sigs || sigs.length === 0) return 0;
121
-
122
- const sigText = sigs.join(' ');
123
- const sigTokens = new Set(tokenize(sigText));
124
-
125
- let score = 0;
126
- for (const qt of queryTokens) {
127
- if (STOP_WORDS.has(qt)) continue;
128
- if (sigTokens.has(qt)) score += 1;
129
- // Partial match (prefix)
130
- for (const st of sigTokens) {
131
- if (st !== qt && st.startsWith(qt) && qt.length >= 4) score += 0.3;
132
- }
133
- }
134
-
135
- return score;
136
- }
88
+ const { tokenize } = require('../retrieval/bm25');
137
89
 
138
90
  /**
139
- * Rank all files in the index against a query. Returns file paths sorted
140
- * by relevance score descending. Ties are broken by file path alphabetically.
91
+ * Rank all files in the index against a query with the identifier-aware BM25
92
+ * re-ranker. Returns file entries sorted by relevance score descending; ties
93
+ * are broken by file path alphabetically (deterministic).
141
94
  * @param {string} query
142
95
  * @param {Map<string, string[]>} index
143
96
  * @param {number} topK
144
97
  * @returns {{ file: string, score: number, sigs: string[] }[]}
145
98
  */
146
99
  function rank(query, index, topK = 10) {
147
- const queryTokens = tokenize(query);
148
- const scored = [];
149
-
100
+ const candidates = [];
150
101
  for (const [file, sigs] of index.entries()) {
151
- const score = scoreFile(sigs, queryTokens);
152
- scored.push({ file, score, sigs });
102
+ candidates.push({ file, sigs });
153
103
  }
154
-
155
- scored.sort((a, b) => b.score - a.score || a.file.localeCompare(b.file));
156
- return scored.slice(0, topK);
104
+ return bm25rank(query, candidates).slice(0, topK);
157
105
  }
158
106
 
159
107
  // ---------------------------------------------------------------------------
package/src/mcp/server.js CHANGED
@@ -18,7 +18,7 @@ const { readContext, searchSignatures, getMap, createCheckpoint, getRouting, exp
18
18
 
19
19
  const SERVER_INFO = {
20
20
  name: 'sigmap',
21
- version: '7.30.0',
21
+ version: '7.31.0',
22
22
  description: 'SigMap MCP server — code signatures on demand',
23
23
  };
24
24
 
@@ -0,0 +1,122 @@
1
+ 'use strict';
2
+
3
+ /**
4
+ * SigMap identifier-aware BM25 re-ranker (zero dependencies, deterministic).
5
+ *
6
+ * Plain exact-token TF-IDF misses queries whose terms live *inside* code
7
+ * identifiers — e.g. `component emit` never surfaces `componentEmits.ts`,
8
+ * because "componentEmits" is one token that shares no exact term with the
9
+ * query. This module fixes that with four small additions:
10
+ *
11
+ * 1. Identifier-aware tokenization — split camelCase and snake_case.
12
+ * 2. Light stemming — plurals / common suffixes (`emits` → `emit`).
13
+ * 3. Path-token boost — file path / basename tokens weigh PATH_BOOST× more.
14
+ * 4. BM25 scoring instead of raw TF-IDF (length-normalized).
15
+ *
16
+ * On 85 curated tasks across 17 repos this lifted hit@5 from 75.3% → 82.4%
17
+ * (MRR +16% relative). See issue #395.
18
+ */
19
+
20
+ // Stop words: common English + low-signal code verbs/nouns that appear in
21
+ // nearly every signature and so carry little retrieval signal.
22
+ const STOP = new Set(
23
+ ('a an the of to in on for and or is are be by with as at from that this it its ' +
24
+ 'into get set add new return value test')
25
+ .split(' ')
26
+ );
27
+
28
+ /**
29
+ * Light suffix stemmer — conservative, tuned for code identifiers rather than
30
+ * prose. Words of 3 chars or fewer pass through unchanged; a result shorter
31
+ * than 3 chars reverts to the original token.
32
+ *
33
+ * @param {string} w
34
+ * @returns {string}
35
+ */
36
+ function stem(w) {
37
+ if (w.length <= 3) return w;
38
+ let s = w;
39
+ s = s.replace(/ies$/, 'y');
40
+ s = s.replace(/(sses|shes|ches|xes|zes)$/, (m) => m.slice(0, -2));
41
+ s = s.replace(/([^s])s$/, '$1');
42
+ s = s.replace(/(ization|izations)$/, 'ize');
43
+ s = s.replace(/(ing|edly|ed|er|ers|ation|ations|ment|ness|ity|ive|able|ible|ize|ise|al)$/, '');
44
+ return s.length >= 3 ? s : w;
45
+ }
46
+
47
+ /**
48
+ * Split on non-alphanumeric characters AND camelCase / snake_case boundaries,
49
+ * lowercase, drop stop words and single characters, then stem.
50
+ *
51
+ * @param {string} text
52
+ * @returns {string[]}
53
+ */
54
+ function tokenize(text) {
55
+ if (!text || typeof text !== 'string') return [];
56
+ return text
57
+ .replace(/[^A-Za-z0-9]+/g, ' ')
58
+ .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
59
+ .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
60
+ .toLowerCase()
61
+ .split(/\s+/)
62
+ .filter((t) => t.length > 1 && !STOP.has(t))
63
+ .map(stem)
64
+ .filter(Boolean);
65
+ }
66
+
67
+ // The file path / basename is highly indicative of relevance, so its tokens
68
+ // are counted PATH_BOOST times when building the document term-frequency map.
69
+ const PATH_BOOST = 3;
70
+
71
+ /**
72
+ * BM25 re-rank of candidates against a query. Each candidate is
73
+ * `{ file, sigs }`; the returned objects preserve all original candidate
74
+ * fields and add a numeric `score` (higher = more relevant), sorted best-first
75
+ * with a deterministic path tie-break. A `score` of 0 means no query token
76
+ * matched — callers typically drop those.
77
+ *
78
+ * @param {string} query
79
+ * @param {{ file: string, sigs: string[] }[]} candidates
80
+ * @returns {Array<object & { score: number }>}
81
+ */
82
+ function bm25rank(query, candidates) {
83
+ if (!Array.isArray(candidates) || candidates.length === 0) return [];
84
+
85
+ const k1 = 1.5;
86
+ const b = 0.75;
87
+
88
+ const docs = candidates.map((c) => {
89
+ const pathToks = tokenize(c.file || '');
90
+ const toks = tokenize((c.sigs || []).join(' '));
91
+ for (let i = 0; i < PATH_BOOST; i++) toks.push(...pathToks);
92
+ const tf = new Map();
93
+ for (const t of toks) tf.set(t, (tf.get(t) || 0) + 1);
94
+ return { cand: c, tf, len: toks.length };
95
+ });
96
+
97
+ const N = docs.length || 1;
98
+ const avgdl = docs.reduce((s, d) => s + d.len, 0) / N || 1;
99
+
100
+ const df = new Map();
101
+ for (const d of docs) {
102
+ for (const t of d.tf.keys()) df.set(t, (df.get(t) || 0) + 1);
103
+ }
104
+
105
+ const qToks = [...new Set(tokenize(query))];
106
+
107
+ return docs
108
+ .map((d) => {
109
+ let score = 0;
110
+ for (const t of qToks) {
111
+ const f = d.tf.get(t);
112
+ if (!f) continue;
113
+ const dfT = df.get(t);
114
+ const idf = Math.log(1 + (N - dfT + 0.5) / (dfT + 0.5));
115
+ score += (idf * (f * (k1 + 1))) / (f + k1 * (1 - b + (b * d.len) / avgdl));
116
+ }
117
+ return Object.assign({}, d.cand, { score });
118
+ })
119
+ .sort((a, c) => c.score - a.score || String(a.file).localeCompare(String(c.file)));
120
+ }
121
+
122
+ module.exports = { tokenize, stem, bm25rank, PATH_BOOST, STOP };
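A condensed, runnable restatement of the file above may help when reading the scoring loop. It is a sketch under stated assumptions, not the module itself: `miniTokens` and `miniBm25` are hypothetical names, the tokenizer has no stop words and only strips a plural `s`, and the three index entries are made up for illustration.

```javascript
// Reduced tokenizer (assumption: no stop words, plural-only stemming).
function miniTokens(text) {
  return String(text)
    .replace(/[^A-Za-z0-9]+/g, ' ')
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 1)
    .map((t) => (t.length > 3 ? t.replace(/([^s])s$/, '$1') : t));
}

// BM25 over { file, sigs } candidates; path tokens are counted pathBoost times.
function miniBm25(query, candidates, pathBoost = 3, k1 = 1.5, b = 0.75) {
  const docs = candidates.map((c) => {
    const toks = miniTokens(c.sigs.join(' '));
    const pathToks = miniTokens(c.file);
    for (let i = 0; i < pathBoost; i++) toks.push(...pathToks);
    const tf = new Map();
    for (const t of toks) tf.set(t, (tf.get(t) || 0) + 1);
    return { file: c.file, tf, len: toks.length };
  });
  const N = docs.length;
  const avgdl = docs.reduce((s, d) => s + d.len, 0) / N;
  const df = new Map();
  for (const d of docs) for (const t of d.tf.keys()) df.set(t, (df.get(t) || 0) + 1);

  return docs
    .map((d) => {
      let score = 0;
      for (const t of new Set(miniTokens(query))) {
        const f = d.tf.get(t);
        if (!f) continue; // term absent from this candidate
        const n = df.get(t);
        const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
        score += (idf * f * (k1 + 1)) / (f + k1 * (1 - b + (b * d.len) / avgdl));
      }
      return { file: d.file, score };
    })
    // Best score first; equal scores fall back to path order for determinism.
    .sort((x, y) => y.score - x.score || x.file.localeCompare(y.file));
}

// Made-up index entries, for illustration only.
const ranked = miniBm25('component emit', [
  { file: 'src/componentEmits.ts', sigs: ['export function emit(instance, event)'] },
  { file: 'src/scheduler.ts', sigs: ['export function queueJob(job)'] },
  { file: 'src/renderer.ts', sigs: ['export function render(vnode, container)'] },
]);
console.log(ranked.map((r) => r.file));
```

In this toy index only the first file shares terms with the query, so it ranks first with a positive score. The other two score 0 and are ordered by path, which shows the deterministic tie-break.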
@@ -19,6 +19,7 @@
19
19
 
20
20
  const { loadWeights } = require('../learning/weights');
21
21
  const { tokenize, STOP_WORDS } = require('./tokenizer');
22
+ const { bm25rank } = require('./bm25');
22
23
 
23
24
  // ---------------------------------------------------------------------------
24
25
  // Default weights
@@ -197,11 +198,24 @@ function rank(query, sigIndex, opts) {
197
198
  return all.slice(0, topK);
198
199
  }
199
200
 
201
+ // Identifier-aware BM25 base relevance over the whole index (#395). BM25
202
+ // splits camelCase/snake_case, stems, and boosts path tokens, so queries
203
+ // whose terms live inside identifiers (e.g. "component emit" → componentEmits)
204
+ // are matched. The existing negative-signal penalty and recency/graph/learned
205
+ // boosts are layered on top; the per-token signals stay for the explain table.
206
+ const bm25Scores = new Map();
207
+ for (const c of bm25rank(query, [...sigIndex.entries()].map(([file, sigs]) => ({ file, sigs })))) {
208
+ bm25Scores.set(c.file, c.score);
209
+ }
210
+
200
211
  const scored = [];
201
212
  for (const [file, sigs] of sigIndex.entries()) {
202
213
  const result = scoreFile(file, sigs, queryTokens, weights);
203
- let score = result.score;
214
+ const penalty = result.signals.penalty;
215
+ const base = bm25Scores.get(file) || 0;
216
+ let score = base * penalty;
204
217
  const signals = result.signals;
218
+ signals.bm25 = base;
205
219
 
206
220
  // Recency boost
207
221
  if (recencySet && recencySet.has(file) && score > 0) {
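The composition in this hunk (BM25 base, multiplied by the negative-signal penalty, with boosts applied only to non-zero scores) can be sketched in isolation. Everything below is illustrative: `combineScores` is a hypothetical function, the default penalty of 1 is an assumption, and the multiplicative recency boost of 1.5 is a placeholder because the real boost body is not visible in the diff.

```javascript
// Assumption: the real recency boost is not shown in this hunk; 1.5 is a placeholder.
const HYPOTHETICAL_RECENCY_BOOST = 1.5;

function combineScores(files, bm25Scores, penalties, recencySet) {
  return files.map((file) => {
    const base = bm25Scores.get(file) || 0; // files BM25 never scored fall back to 0
    const penalty = penalties.has(file) ? penalties.get(file) : 1; // 1 = no penalty (assumed default)
    let score = base * penalty;
    // As in the hunk, a boost never lifts a file that has no base relevance.
    if (recencySet.has(file) && score > 0) score *= HYPOTHETICAL_RECENCY_BOOST;
    return { file, score, signals: { bm25: base, penalty } };
  });
}

const out = combineScores(
  ['a.js', 'b.js', 'c.js'],
  new Map([['a.js', 2], ['b.js', 4]]),   // made-up BM25 base scores
  new Map([['b.js', 0.25]]),             // made-up penalty for b.js
  new Set(['a.js', 'c.js'])              // recently touched files
);
console.log(out.map((r) => [r.file, r.score]));
```

The `score > 0` guard is the part taken directly from the source: `c.js` is recent but has no BM25 match, so it stays at 0 instead of being boosted into the results.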