sigmap 7.29.0 → 7.31.0
- package/CHANGELOG.md +24 -0
- package/README.md +62 -14
- package/gen-context.js +154 -66
- package/llms-full.txt +16 -14
- package/llms.txt +19 -16
- package/package.json +1 -1
- package/packages/cli/package.json +1 -1
- package/packages/core/package.json +1 -1
- package/src/eval/runner.js +9 -61
- package/src/format/llms-txt.js +1 -1
- package/src/mcp/server.js +1 -1
- package/src/retrieval/bm25.js +122 -0
- package/src/retrieval/ranker.js +15 -1
package/CHANGELOG.md
CHANGED
@@ -10,6 +10,30 @@ Format: [Semantic Versioning](https://semver.org/)
 
 ---
 
+## [7.31.0] — 2026-07-02
+
+Minor release — **identifier-aware BM25 re-ranker.** Plain exact-token TF-IDF missed queries whose terms live *inside* code identifiers — `component emit` never surfaced `componentEmits` because that is one token sharing no exact term with the query. This was the dominant retrieval-miss cause. The new ranker splits identifiers, stems lightly, boosts path tokens, and scores with length-normalized BM25. Deterministic, zero new dependencies, no LLM/embeddings.
+
+### Added
+- **Identifier-aware BM25 re-ranker (#395, #396):** new zero-dependency `src/retrieval/bm25.js` with (1) identifier-aware tokenization (split camelCase / snake_case), (2) light stemming (`emits` → `emit`, `options` → `option`), (3) path-token boost (filename weighed 3×), and (4) BM25 length-normalized scoring instead of raw TF-IDF. Wired into the core ranker (`src/retrieval/ranker.js`) as the base relevance score — so `sigmap ask`, `sigmap --query`, and MCP `query_context` all benefit — with the existing negative-signal penalty and recency/graph/learned boosts layered on top. Also drives the benchmark runner (`src/eval/runner.js`) and the dev retrieval benchmark.
+- **BM25 unit tests (#396):** `test/integration/bm25.test.js` covers tokenization, stemming, path boost, the `component emit` → `componentEmits` motivating case, and deterministic tie-breaking.
+
+### Changed
+- **Retrieval benchmark refreshed:** on the 18-repo / 90-task suite, hit@5 rose **75.6% → 86.7%** (retrieval lift 5.6× → 6.4×), with rank-1 gains on flask, spring-petclinic, rails, and svelte (60% → 100%). The task-completion proxy also improved (task success 52.2% → 67.8%, prompts/task 1.72 → 1.46) since it retrieves through the same ranker. Residual misses (vapor, serilog) are files whose signatures genuinely lack the query vocabulary — out of scope, they need semantic retrieval.
+
+## [7.30.0] — 2026-06-23
+
+Minor release — **v8.0 E2 + E4 (the "Pivot"):** completes v8.0 by repositioning every public surface to the chosen framing — *"the deterministic, verifiable grounding layer for AI code work"* — and framing coding agents as **consumers, not competitors**. The Evidence Pack code (E1/E3/D3 + `mcp install`) already shipped in 7.27–7.29; this is the positioning half. Docs/strings only — no runtime behaviour change, zero new dependencies.
+
+### Added
+- **Agent recipes (#389):** new README "Agent recipes" section with copy-paste setup for Claude Code, Cursor, Cline, Continue, Aider, OpenHands, and Codex CLI — each via `sigmap mcp install <client>` or a deterministic Evidence Pack, positioning agents as consumers of SigMap's map.
+- **Surface docs for shipped commands (#389):** README now documents `sigmap evidence` (deterministic Evidence Pack JSON/Markdown) and `sigmap doctor` (setup diagnostics), which shipped in code but were undocumented.
+- **Repositioning gate (#389):** `test/integration/repositioning.test.js` makes the pivot non-regressable — asserts the grounding-layer framing on README/`llms.txt`/docs `<title>`, recipes for every named agent, and the documented commands.
+
+### Changed
+- **E2 repositioning (#389):** README tagline, "What is SigMap?", "Why SigMap?" (token reduction demoted to proof) and the compare table; `docs/index.html` title/meta/keywords/JSON-LD + hero (and the stale `softwareVersion` 5.8.0 → current); `llms.txt`/`llms-full.txt` regenerated from `scripts/llms-manual.mjs`; the per-project adapter tagline in `src/format/llms-txt.js` (bundle rebuilt, reproducible); `docs/_config.yml`. The literal `context-engine` remains only inside the published JetBrains plugin URL slug.
+- **Structure guards updated (#389):** `readme-structure.test.js` tagline/compare-table assertions moved to the new copy; `version-json.test.js` now derives the docs `softwareVersion` from `version.json` instead of a hardcoded stale value.
+
 ## [7.29.0] — 2026-06-23
 
 Minor release — **v8.0 E4:** one-command, per-client MCP install so a cold user reaches a working MCP setup fast (the v8.0 <5-minute-quickstart exit gate).
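The motivating case from the 7.31.0 entry (`component emit` versus `componentEmits`) can be reproduced in a few lines. The sketch below is a simplified stand-in rather than the shipped `src/retrieval/bm25.js`: `splitIdentifier` and `lightStem` are hypothetical names, there is no stop-word list, and the stemmer only strips a single trailing `s`.

```javascript
// Simplified illustration of identifier-aware tokenization.
// splitIdentifier and lightStem are hypothetical helpers, not the package's API.
function lightStem(word) {
  // Strip one trailing "s" from words longer than 3 chars (but leave "ss" alone).
  return word.length > 3 && /[^s]s$/.test(word) ? word.slice(0, -1) : word;
}

function splitIdentifier(text) {
  return text
    .replace(/[^A-Za-z0-9]+/g, ' ')         // snake_case, kebab-case, punctuation
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2') // camelCase boundary
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 1)            // drop empties and single characters
    .map(lightStem);
}

const queryTokens = splitIdentifier('component emit');
const docTokens = splitIdentifier('componentEmits');
const overlap = queryTokens.filter((t) => docTokens.includes(t));

console.log(queryTokens);    // [ 'component', 'emit' ]
console.log(docTokens);      // [ 'component', 'emit' ]
console.log(overlap.length); // 2
```

Without the camelCase split and the stem step, `componentEmits` would lowercase to the single token `componentemits` and share nothing with the query, which is the miss the changelog describes.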
package/README.md
CHANGED
@@ -4,7 +4,7 @@
 
 # ⚡ SigMap
 
-**SigMap
+**SigMap is the deterministic, verifiable grounding layer for AI code work.**
 
 [](https://www.npmjs.com/package/sigmap)
 [](https://www.npmjs.com/package/sigmap)
@@ -35,7 +35,9 @@ Zero config. Zero dependencies. Under 10 seconds.
 
 ## What is SigMap?
 
-SigMap
+SigMap builds a **deterministic, auditable signature-and-evidence map** of your codebase — no LLM calls, no embeddings, byte-stable output — so AI agents, CI, and reviewers can *trust and verify* which files and symbols are real before acting. Same repo in, same map out, every time.
+
+That map is exactly what agentic grep is worst at: reproducible, auditable context an agent can consume without a copy-paste, and a grounding check that proves an AI answer is anchored to real signatures and line numbers. Token reduction comes for free — but trust is the point.
 
 **Model-agnostic.** Works with:
 - **Cloud LLMs:** Claude, GPT-4, Copilot, Gemini
@@ -48,17 +50,22 @@ SigMap extracts function and class signatures from your codebase and feeds the r
 
 ## Why SigMap?
 
+**Deterministic and verifiable — the two things an agentic-grep loop can't give you:**
+- **Deterministic** — no LLM calls, no agent loop; the same repo always produces a byte-identical map you can diff, cache, and gate in CI.
+- **Auditable & grounded** — every file and symbol traces to a real line anchor; `sigmap verify-ai-output` flags any AI claim that isn't.
+- **Zero dependencies** — `npx sigmap` on any machine; no embeddings, no vector DB, no hosted service, fully offline.
+
+**Proof it pays off** (full benchmark below):
 <!--SM:whyMetrics-->
-- **
+- **86.7% hit@5** — right file found in top 5 results (vs 13.6% baseline)
 - **97.0% token reduction** — average across 21 real repos
-- **
-- **1.
+- **67.8% task success rate** — up from 10% without context
+- **1.46 prompts per task** — down from 2.84 (48.8% fewer retries)
 <!--/SM:whyMetrics-->
 - **<!--SM:languages-->33<!--/SM:languages--> languages supported** — TypeScript, Python, Go, Rust, Java, R, and more
 - **No vendor lock-in** — works with any AI assistant or local LLM
 - **No API costs** — use local models (Ollama, llama.cpp, vLLM) with zero token fees
 - **Full privacy** — keep your code and context on your machine
-- **Zero npm dependencies** — `npx sigmap` on any machine
 
 ---
 
@@ -66,9 +73,9 @@ SigMap extracts function and class signatures from your codebase and feeds the r
 
 | Without SigMap | With SigMap |
 |---|---|
-| ❌
-| ❌
-| ❌ Embeddings / vector DB required | ✅
+| ❌ Non-reproducible agent guesses | ✅ Deterministic map — same input, same output, every time |
+| ❌ "Trust me" AI answers | ✅ Grounded — right file in context <!--SM:hitWhole-->87%<!--/SM:hitWhole--> of the time, every symbol on a real line anchor |
+| ❌ Embeddings / vector DB required | ✅ Zero deps, no infra, fully offline |
 
 ---
 
@@ -91,13 +98,13 @@ Ask → Rank → Context → Validate → Judge → Learn
 
 <!--SM:benchmarkBlock-->
 ```
-Benchmark : sigmap-v7.
-Date : 2026-
+Benchmark : sigmap-v7.31-main (21 repositories, including R language)
+Date : 2026-07-02
 
-Hit@5 :
+Hit@5 : 86.7% (baseline 13.6% — 6.4× lift)
 Token reduction: 97.0% (across 21 repos)
-Prompt reduction :
-Task success :
+Prompt reduction : 48.8% (2.84 → 1.46 prompts per task)
+Task success : 67.8% (baseline 10%)
 Repos tested : 21 (JavaScript, Python, Go, Rust, Java, R, C++, C#, Dart, Swift, Ruby, PHP, Scala, Kotlin, and more)
 ```
 <!--/SM:benchmarkBlock-->
@@ -216,6 +223,47 @@ sigmap create "<task>" # run the whole pipeline: scaffold → verify
 
 ---
 
+## Evidence Pack & diagnostics
+
+The **Evidence Pack** is the consumable, machine-readable replacement for "paste this into your prompt" — a deterministic JSON artifact (with a Markdown handoff mode) that an agent or CI step reads directly, with zero copy-paste:
+
+```bash
+sigmap evidence "how does auth work" # → .context/evidence-pack.json (deterministic, byte-stable)
+sigmap evidence "how does auth work" --markdown # Markdown handoff to stdout
+sigmap doctor # diagnose config, index, freshness, coverage, MCP wiring — with fixes
+```
+
+Each pack carries the ranked files, the symbols and line anchors that justify them, the token budget, the dropped files (and why), and the grounding summary — so a consumer can trust and audit the context instead of guessing.
+
+---
+
+## Agent recipes
+
+SigMap treats coding agents as **consumers, not competitors**: it hands them a deterministic, auditable map the agent can read on demand. Wire any of them up once, then let the agent pull context or consume an Evidence Pack.
+
+| Agent | One-time setup | How it consumes SigMap |
+|---|---|---|
+| **Claude Code** | `sigmap mcp install claude` | 17 MCP tools (`search_signatures`, `get_lines`, `get_diff_context`…) |
+| **Cursor** | `sigmap mcp install cursor` | MCP tools, plus the `cursor` adapter writes `.cursorrules` |
+| **Cline** | `sigmap mcp install cursor` | Reads `.cursorrules`; same MCP server |
+| **Continue** | `sigmap mcp install vscode` | MCP tools inside the Continue extension |
+| **Aider** | `sigmap --adapter openai` | Reads `.github/openai-context.md` before a session |
+| **OpenHands** | `sigmap evidence "<task>"` | Consumes `.context/evidence-pack.json` directly |
+| **Codex CLI** | `sigmap mcp install codex` | MCP tools, plus the `codex` adapter writes `AGENTS.md` |
+
+```bash
+# Pattern 1 — give the agent live, on-demand access (MCP)
+sigmap mcp install claude # one of: claude|cursor|windsurf|vscode|zed|codex|gemini|opencode|mcp
+# add --global for a user-level install
+
+# Pattern 2 — hand the agent a deterministic Evidence Pack (no MCP, no copy-paste)
+sigmap evidence "implement rate limiting" --markdown # or read .context/evidence-pack.json
+```
+
+See [`sigmap mcp list`](https://sigmap.io/guide/cli.html) for every supported client.
+
+---
+
 ## Try it
 
 ```bash
package/gen-context.js
CHANGED
@@ -4136,6 +4136,7 @@ __factories["./src/eval/runner"] = function(module, exports) {
 const fs = require('fs');
 const path = require('path');
 const { aggregate } = __require('./src/eval/scorer');
+const { bm25rank } = __require('./src/retrieval/bm25');
 
 // ---------------------------------------------------------------------------
 // Context file reader
@@ -4197,79 +4198,26 @@ __factories["./src/eval/runner"] = function(module, exports) {
 }
 
 // ---------------------------------------------------------------------------
-//
+// Identifier-aware BM25 ranking (v7.31; see src/retrieval/bm25.js and #395)
 // ---------------------------------------------------------------------------
 
-
- * Tokenize a query or signature into lower-case word tokens.
- * Splits on whitespace, punctuation, camelCase, and snake_case.
- * @param {string} text
- * @returns {string[]}
- */
-function tokenize(text) {
-  if (!text) return [];
-  return text
-    // split camelCase
-    .replace(/([a-z])([A-Z])/g, '$1 $2')
-    // split snake/kebab
-    .replace(/[_\-]/g, ' ')
-    // drop non-word chars
-    .replace(/[^\w\s]/g, ' ')
-    .toLowerCase()
-    .split(/\s+/)
-    .filter((t) => t.length > 1);
-}
-
-const STOP_WORDS = new Set([
-  'the', 'a', 'an', 'in', 'of', 'to', 'for', 'and', 'or', 'is', 'are',
-  'that', 'this', 'it', 'with', 'from', 'by', 'be', 'as', 'on', 'at',
-]);
+const { tokenize } = __require('./src/retrieval/bm25');
 
 /**
- *
- * Returns
- *
- * @param {string[]} queryTokens
- * @returns {number}
- */
-function scoreFile(sigs, queryTokens) {
-  if (!sigs || sigs.length === 0) return 0;
-
-  const sigText = sigs.join(' ');
-  const sigTokens = new Set(tokenize(sigText));
-
-  let score = 0;
-  for (const qt of queryTokens) {
-    if (STOP_WORDS.has(qt)) continue;
-    if (sigTokens.has(qt)) score += 1;
-    // Partial match (prefix)
-    for (const st of sigTokens) {
-      if (st !== qt && st.startsWith(qt) && qt.length >= 4) score += 0.3;
-    }
-  }
-
-  return score;
-}
-
-/**
- * Rank all files in the index against a query. Returns file paths sorted
- * by relevance score descending. Ties are broken by file path alphabetically.
+ * Rank all files in the index against a query with the identifier-aware BM25
+ * re-ranker. Returns file entries sorted by relevance score descending; ties
+ * are broken by file path alphabetically (deterministic).
  * @param {string} query
  * @param {Map<string, string[]>} index
  * @param {number} topK
  * @returns {{ file: string, score: number, sigs: string[] }[]}
  */
 function rank(query, index, topK = 10) {
-  const
-  const scored = [];
-
+  const candidates = [];
   for (const [file, sigs] of index.entries()) {
-
-    scored.push({ file, score, sigs });
+    candidates.push({ file, sigs });
   }
-
-  scored.sort((a, b) => b.score - a.score || a.file.localeCompare(b.file));
-  return scored.slice(0, topK);
+  return bm25rank(query, candidates).slice(0, topK);
 }
 
 // ---------------------------------------------------------------------------
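The doc comment on `rank` promises a deterministic order: score descending, ties broken by file path. A minimal sketch of that comparator follows, using made-up entries; `sortRanked` is a hypothetical helper and the file names and scores are illustrative only.

```javascript
// Deterministic ordering: score descending, then file path ascending.
// The entries below are made-up data for illustration only.
function sortRanked(entries) {
  return [...entries].sort(
    (a, b) => b.score - a.score || a.file.localeCompare(b.file)
  );
}

const shuffledA = [
  { file: 'src/zeta.js', score: 1.2 },
  { file: 'src/alpha.js', score: 1.2 },
  { file: 'src/mid.js', score: 3.4 },
];
const shuffledB = [shuffledA[1], shuffledA[2], shuffledA[0]];

// Both input orders yield the same ranking.
const orderA = sortRanked(shuffledA).map((e) => e.file);
const orderB = sortRanked(shuffledB).map((e) => e.file);
console.log(orderA); // [ 'src/mid.js', 'src/alpha.js', 'src/zeta.js' ]
console.log(orderA.join() === orderB.join()); // true
```

Because the tie-break depends only on the path, the result does not depend on the iteration order of the index, which is what makes a top-K slice reproducible across runs.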
@@ -9837,7 +9785,7 @@ __factories["./src/format/llms-txt"] = function(module, exports) {
 
   const lines = [
     '# SigMap Context Index',
-    `> Generated by SigMap v${sigmapVersion} —
+    `> Generated by SigMap v${sigmapVersion} — the deterministic, verifiable grounding layer for AI code work`,
     '',
     '## Project',
     `- Name: ${name}`,
@@ -12695,7 +12643,7 @@ __factories["./src/mcp/server"] = function(module, exports) {
 
 const SERVER_INFO = {
   name: 'sigmap',
-  version: '7.
+  version: '7.31.0',
   description: 'SigMap MCP server — code signatures on demand',
 };
 
@@ -13418,6 +13366,132 @@ __factories["./src/plan/verify-plan"] = function(module, exports) {
 
 };
 
+// ── ./src/retrieval/bm25 ──
+__factories["./src/retrieval/bm25"] = function(module, exports) {
+
+/**
+ * SigMap identifier-aware BM25 re-ranker (zero dependencies, deterministic).
+ *
+ * Plain exact-token TF-IDF misses queries whose terms live *inside* code
+ * identifiers — e.g. `component emit` never surfaces `componentEmits.ts`,
+ * because "componentEmits" is one token that shares no exact term with the
+ * query. This module fixes that with four small additions:
+ *
+ * 1. Identifier-aware tokenization — split camelCase and snake_case.
+ * 2. Light stemming — plurals / common suffixes (`emits` → `emit`).
+ * 3. Path-token boost — file path / basename tokens weigh PATH_BOOST× more.
+ * 4. BM25 scoring instead of raw TF-IDF (length-normalized).
+ *
+ * On 85 curated tasks across 17 repos this lifted hit@5 from 75.3% → 82.4%
+ * (MRR +16% relative). See issue #395.
+ */
+
+// Stop words: common English + low-signal code verbs/nouns that appear in
+// nearly every signature and so carry little retrieval signal.
+const STOP = new Set(
+  ('a an the of to in on for and or is are be by with as at from that this it its ' +
+    'into get set add new return value test')
+    .split(' ')
+);
+
+/**
+ * Light suffix stemmer — conservative, tuned for code identifiers rather than
+ * prose. Words of 3 chars or fewer pass through unchanged; a result shorter
+ * than 3 chars reverts to the original token.
+ *
+ * @param {string} w
+ * @returns {string}
+ */
+function stem(w) {
+  if (w.length <= 3) return w;
+  let s = w;
+  s = s.replace(/ies$/, 'y');
+  s = s.replace(/(sses|shes|ches|xes|zes)$/, (m) => m.slice(0, -2));
+  s = s.replace(/([^s])s$/, '$1');
+  s = s.replace(/(ization|izations)$/, 'ize');
+  s = s.replace(/(ing|edly|ed|er|ers|ation|ations|ment|ness|ity|ive|able|ible|ize|ise|al)$/, '');
+  return s.length >= 3 ? s : w;
+}
+
+/**
+ * Split on non-alphanumeric characters AND camelCase / snake_case boundaries,
+ * lowercase, drop stop words and single characters, then stem.
+ *
+ * @param {string} text
+ * @returns {string[]}
+ */
+function tokenize(text) {
+  if (!text || typeof text !== 'string') return [];
+  return text
+    .replace(/[^A-Za-z0-9]+/g, ' ')
+    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
+    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
+    .toLowerCase()
+    .split(/\s+/)
+    .filter((t) => t.length > 1 && !STOP.has(t))
+    .map(stem)
+    .filter(Boolean);
+}
+
+// The file path / basename is highly indicative of relevance, so its tokens
+// are counted PATH_BOOST times when building the document term-frequency map.
+const PATH_BOOST = 3;
+
+/**
+ * BM25 re-rank of candidates against a query. Each candidate is
+ * `{ file, sigs }`; the returned objects preserve all original candidate
+ * fields and add a numeric `score` (higher = more relevant), sorted best-first
+ * with a deterministic path tie-break. A `score` of 0 means no query token
+ * matched — callers typically drop those.
+ *
+ * @param {string} query
+ * @param {{ file: string, sigs: string[] }[]} candidates
+ * @returns {Array<object & { score: number }>}
+ */
+function bm25rank(query, candidates) {
+  if (!Array.isArray(candidates) || candidates.length === 0) return [];
+
+  const k1 = 1.5;
+  const b = 0.75;
+
+  const docs = candidates.map((c) => {
+    const pathToks = tokenize(c.file || '');
+    const toks = tokenize((c.sigs || []).join(' '));
+    for (let i = 0; i < PATH_BOOST; i++) toks.push(...pathToks);
+    const tf = new Map();
+    for (const t of toks) tf.set(t, (tf.get(t) || 0) + 1);
+    return { cand: c, tf, len: toks.length };
+  });
+
+  const N = docs.length || 1;
+  const avgdl = docs.reduce((s, d) => s + d.len, 0) / N || 1;
+
+  const df = new Map();
+  for (const d of docs) {
+    for (const t of d.tf.keys()) df.set(t, (df.get(t) || 0) + 1);
+  }
+
+  const qToks = [...new Set(tokenize(query))];
+
+  return docs
+    .map((d) => {
+      let score = 0;
+      for (const t of qToks) {
+        const f = d.tf.get(t);
+        if (!f) continue;
+        const dfT = df.get(t);
+        const idf = Math.log(1 + (N - dfT + 0.5) / (dfT + 0.5));
+        score += (idf * (f * (k1 + 1))) / (f + k1 * (1 - b + (b * d.len) / avgdl));
+      }
+      return Object.assign({}, d.cand, { score });
+    })
+    .sort((a, c) => c.score - a.score || String(a.file).localeCompare(String(c.file)));
+}
+
+module.exports = { tokenize, stem, bm25rank, PATH_BOOST, STOP };
+
+};
+
 // ── ./src/retrieval/ranker ──
 __factories["./src/retrieval/ranker"] = function(module, exports) {
 
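The scoring line in `bm25rank` is the usual BM25 term weight with `k1 = 1.5`, `b = 0.75`, and a smoothed IDF. The sketch below isolates that per-term formula so its two main effects (length normalization and rarity weighting) can be checked on toy numbers. `bm25TermScore` is a hypothetical helper and the corpus statistics are invented for illustration.

```javascript
// Per-term BM25 contribution, using the same formula and constants as the diff
// (k1 = 1.5, b = 0.75). bm25TermScore is a hypothetical helper for illustration.
function bm25TermScore({ f, docLen, avgdl, N, df, k1 = 1.5, b = 0.75 }) {
  if (!f) return 0; // term absent from the document
  const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
  return (idf * (f * (k1 + 1))) / (f + k1 * (1 - b + (b * docLen) / avgdl));
}

// Toy corpus: 10 documents, the term occurs in 2 of them, average length 20 tokens.
const corpus = { N: 10, df: 2, avgdl: 20 };

const shortDoc = bm25TermScore({ f: 3, docLen: 10, ...corpus });
const longDoc = bm25TermScore({ f: 3, docLen: 40, ...corpus });
const rareTerm = bm25TermScore({ f: 3, docLen: 20, N: 10, df: 1, avgdl: 20 });
const commonTerm = bm25TermScore({ f: 3, docLen: 20, N: 10, df: 9, avgdl: 20 });

console.log(shortDoc > longDoc);    // true: same term frequency, shorter document wins
console.log(rareTerm > commonTerm); // true: rarer terms carry a higher IDF
```

The choice of `f: 3` mirrors what a single path-token match looks like in the module: path tokens are appended `PATH_BOOST` (3) times, so a term that appears once in the file path and nowhere in the signatures has a term frequency of 3.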
@@ -13440,6 +13514,7 @@ __factories["./src/retrieval/ranker"] = function(module, exports) {
 
 const { loadWeights } = __require('./src/learning/weights');
 const { tokenize, STOP_WORDS } = __require('./src/retrieval/tokenizer');
+const { bm25rank } = __require('./src/retrieval/bm25');
 
 // ---------------------------------------------------------------------------
 // Default weights
@@ -13618,11 +13693,24 @@ __factories["./src/retrieval/ranker"] = function(module, exports) {
     return all.slice(0, topK);
   }
 
+  // Identifier-aware BM25 base relevance over the whole index (#395). BM25
+  // splits camelCase/snake_case, stems, and boosts path tokens, so queries
+  // whose terms live inside identifiers (e.g. "component emit" → componentEmits)
+  // are matched. The existing negative-signal penalty and recency/graph/learned
+  // boosts are layered on top; the per-token signals stay for the explain table.
+  const bm25Scores = new Map();
+  for (const c of bm25rank(query, [...sigIndex.entries()].map(([file, sigs]) => ({ file, sigs })))) {
+    bm25Scores.set(c.file, c.score);
+  }
+
   const scored = [];
   for (const [file, sigs] of sigIndex.entries()) {
     const result = scoreFile(file, sigs, queryTokens, weights);
-
+    const penalty = result.signals.penalty;
+    const base = bm25Scores.get(file) || 0;
+    let score = base * penalty;
     const signals = result.signals;
+    signals.bm25 = base;
 
     // Recency boost
     if (recencySet && recencySet.has(file) && score > 0) {
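The hunk above swaps the base relevance for a BM25 lookup and multiplies it by the existing penalty before any boosts run. The sketch below models only that `base * penalty` step. It assumes the penalty is a multiplier at or below 1, which the diff suggests but does not show; `combineBaseScores`, the file names, and all numbers are hypothetical.

```javascript
// Models only the "base * penalty" step from the ranker hunk.
// Scores and penalties are made-up; combineBaseScores is a hypothetical helper.
function combineBaseScores(bm25Scores, penalties) {
  const out = [];
  for (const [file, penalty] of penalties) {
    const base = bm25Scores.get(file) || 0; // files with no BM25 match score 0
    out.push({ file, score: base * penalty, signals: { bm25: base, penalty } });
  }
  return out;
}

const bm25Scores = new Map([
  ['src/componentEmits.ts', 4.0],
  ['test/componentEmits.spec.ts', 4.0],
]);
const penalties = new Map([
  ['src/componentEmits.ts', 1],
  ['test/componentEmits.spec.ts', 0.5],
  ['src/unrelated.ts', 1],
]);

const combined = combineBaseScores(bm25Scores, penalties);
console.log(combined.map((c) => [c.file, c.score]));
```

A file that BM25 did not match keeps a score of 0, and the `score > 0` guard visible on the recency boost means such a file is not lifted by recency alone.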
@@ -16524,7 +16612,7 @@ function __tryGit(args, opts = {}) {
   catch (_) { return ''; }
 }
 
-const VERSION = '7.
+const VERSION = '7.31.0';
 const MARKER = '\n\n## Auto-generated signatures\n<!-- Updated by gen-context.js -->\n';
 
 function requireSourceOrBundled(key) {
@@ -19091,7 +19179,7 @@ function main() {
   }
 
   const shareText = [
-    'Generated with SigMap —
+    'Generated with SigMap — the deterministic, verifiable grounding layer for AI code work',
     `${reduction}% fewer tokens · ${hitAt5}% retrieval accuracy · 6× better results`,
     'https://sigmap.io',
   ].join('\n');
package/llms-full.txt
CHANGED
@@ -1,28 +1,30 @@
 # SigMap — Complete LLM Reference
 
->
->
-
-SigMap is
-signatures from a codebase and
-
-
-
-
-
+> The deterministic, verifiable grounding layer for AI code work.
+> A reproducible signature-and-evidence map that agents, CI, and reviewers can trust and audit. No embeddings, no vector DB, fully offline.
+
+SigMap is the deterministic, verifiable grounding layer for AI code work. It
+extracts function and class signatures from a codebase and builds a byte-stable
+signature-and-evidence map that agents, CI, and reviewers can trust and audit —
+proving which files and symbols are real before acting. Deterministic TF-IDF
+ranking keeps the relevant context in scope (cutting tokens ~97% as a side
+effect), with no LLM calls, embeddings, or vector database. Works with Claude,
+Cursor, GitHub Copilot, Aider, Windsurf, local LLMs, and MCP.
+
+# Version: 7.31.0 | Benchmark: sigmap-v7.31-main (2026-07-02)
 # Source: auto-generated from package.json, version.json, benchmarks/latest.json, src/mcp/tools.js, src/config/defaults.js
 # Regenerate: npm run generate:llms | Validate: npm run validate:llms
 
 ---
 
-## Core metrics (benchmark: sigmap-v7.
+## Core metrics (benchmark: sigmap-v7.31-main, 2026-07-02)
 
 | Metric | Without SigMap | With SigMap |
 |--------|----------------|-------------|
-| Retrieval hit@5 | 13.6% (random) |
+| Retrieval hit@5 | 13.6% (random) | 86.7% (6.4× lift) |
 | Token reduction | — | 97.0% average |
-| Task success proxy | 10% |
-| Prompts per task | 2.84 | 1.
+| Task success proxy | 10% | 67.8% |
+| Prompts per task | 2.84 | 1.46 (48.8% fewer) |
 | Supported languages | — | 33 |
 | MCP tools | — | 17 |
 | npm runtime dependencies | — | 0 |
package/llms.txt
CHANGED
@@ -1,15 +1,17 @@
 # SigMap
 
->
->
-
-SigMap is
-signatures from a codebase and
-
-
-
-
-
+> The deterministic, verifiable grounding layer for AI code work.
+> A reproducible signature-and-evidence map that agents, CI, and reviewers can trust and audit. No embeddings, no vector DB, fully offline.
+
+SigMap is the deterministic, verifiable grounding layer for AI code work. It
+extracts function and class signatures from a codebase and builds a byte-stable
+signature-and-evidence map that agents, CI, and reviewers can trust and audit —
+proving which files and symbols are real before acting. Deterministic TF-IDF
+ranking keeps the relevant context in scope (cutting tokens ~97% as a side
+effect), with no LLM calls, embeddings, or vector database. Works with Claude,
+Cursor, GitHub Copilot, Aider, Windsurf, local LLMs, and MCP.
+
+# Version: 7.31.0 | Benchmark: sigmap-v7.31-main (2026-07-02)
 # Source: auto-generated from package.json, version.json, benchmarks/latest.json, src/mcp/tools.js, src/config/defaults.js
 # Regenerate: npm run generate:llms | Validate: npm run validate:llms
 
@@ -21,12 +23,12 @@ Claude, Cursor, GitHub Copilot, Aider, Windsurf, local LLMs, and MCP.
 - No blast-radius awareness before editing a hub file — `--impact` shows every file a change touches.
 - Pasted stack traces, CI logs, and JSON bloat the prompt — `squeeze` minimizes them and enriches the top frame from the symbol index.
 
-## Core metrics (benchmark: sigmap-v7.
+## Core metrics (benchmark: sigmap-v7.31-main, 2026-07-02)
 
-- hit@5 retrieval:
+- hit@5 retrieval: 86.7% vs 13.6% random baseline (6.4× lift)
 - Token reduction: 97.0% average across benchmark repos
-- Task success:
-- Prompts per task: 1.
+- Task success: 67.8% vs 10% without SigMap
+- Prompts per task: 1.46 vs 2.84 baseline (48.8% fewer)
 - Languages: 33 supported · MCP tools: 17
 - Dependencies: zero npm runtime dependencies · fully offline
 
@@ -52,5 +54,6 @@ npx sigmap --mcp # start the MCP server over stdio
 - [Benchmark dataset (Zenodo)](https://doi.org/10.5281/zenodo.19898842)
 - [Full LLM reference](https://sigmap.io/llms-full.txt)
 
-SigMap —
-
+SigMap — the deterministic, verifiable grounding layer for AI code work. The
+reproducible signature-and-evidence map agents, CI, and reviewers can audit,
+which agentic grep cannot produce.
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "sigmap",
-  "version": "7.
+  "version": "7.31.0",
   "description": "97% token reduction for AI coding. Extracts function & class signatures with TF-IDF ranking to feed only the right files to Claude, Cursor, Copilot, Aider, Windsurf, local LLMs & MCP. Zero dependencies, runs offline via npx.",
   "main": "packages/core/index.js",
   "exports": {
package/src/eval/runner.js
CHANGED
@@ -20,6 +20,7 @@
 const fs = require('fs');
 const path = require('path');
 const { aggregate } = require('./scorer');
+const { bm25rank } = require('../retrieval/bm25');
 
 // ---------------------------------------------------------------------------
 // Context file reader
|
|
@@ -81,79 +82,26 @@ function buildSigIndex(cwd) {
|
|
|
81
82
|
}
|
|
82
83
|
|
|
83
84
|
// ---------------------------------------------------------------------------
|
|
84
|
-
//
|
|
85
|
+
// Identifier-aware BM25 ranking (v7.31; see src/retrieval/bm25.js and #395)
|
|
85
86
|
// ---------------------------------------------------------------------------
|
|
86
87
|
|
|
87
|
-
|
|
88
|
-
* Tokenize a query or signature into lower-case word tokens.
|
|
89
|
-
* Splits on whitespace, punctuation, camelCase, and snake_case.
|
|
90
|
-
* @param {string} text
|
|
91
|
-
* @returns {string[]}
|
|
92
|
-
*/
|
|
93
|
-
function tokenize(text) {
|
|
94
|
-
if (!text) return [];
|
|
95
|
-
return text
|
|
96
|
-
// split camelCase
|
|
97
|
-
.replace(/([a-z])([A-Z])/g, '$1 $2')
|
|
98
|
-
// split snake/kebab
|
|
99
|
-
.replace(/[_\-]/g, ' ')
|
|
100
|
-
// drop non-word chars
|
|
101
|
-
.replace(/[^\w\s]/g, ' ')
|
|
102
|
-
.toLowerCase()
|
|
103
|
-
.split(/\s+/)
|
|
104
|
-
.filter((t) => t.length > 1);
|
|
105
|
-
}
|
|
106
|
-
|
|
107
|
-
const STOP_WORDS = new Set([
|
|
108
|
-
'the', 'a', 'an', 'in', 'of', 'to', 'for', 'and', 'or', 'is', 'are',
|
|
109
|
-
'that', 'this', 'it', 'with', 'from', 'by', 'be', 'as', 'on', 'at',
|
|
110
|
-
]);
|
|
111
|
-
|
|
112
|
-
/**
|
|
113
|
-
* Score a single file's signatures against a query.
|
|
114
|
-
* Returns a non-negative number; higher = more relevant.
|
|
115
|
-
* @param {string[]} sigs - array of signature strings for this file
|
|
116
|
-
* @param {string[]} queryTokens
|
|
117
|
-
* @returns {number}
|
|
118
|
-
*/
|
|
119
|
-
function scoreFile(sigs, queryTokens) {
|
|
120
|
-
if (!sigs || sigs.length === 0) return 0;
|
|
121
|
-
|
|
122
|
-
const sigText = sigs.join(' ');
|
|
123
|
-
const sigTokens = new Set(tokenize(sigText));
|
|
124
|
-
|
|
125
|
-
let score = 0;
|
|
126
|
-
for (const qt of queryTokens) {
|
|
127
|
-
if (STOP_WORDS.has(qt)) continue;
|
|
128
|
-
if (sigTokens.has(qt)) score += 1;
|
|
129
|
-
// Partial match (prefix)
|
|
130
|
-
for (const st of sigTokens) {
|
|
131
|
-
if (st !== qt && st.startsWith(qt) && qt.length >= 4) score += 0.3;
|
|
132
|
-
}
|
|
133
|
-
}
|
|
134
|
-
|
|
135
|
-
return score;
|
|
136
|
-
}
|
|
88
|
+
const { tokenize } = require('../retrieval/bm25');
|
|
137
89
|
|
|
138
90
|
/**
|
|
139
|
-
* Rank all files in the index against a query
|
|
140
|
-
*
|
|
91
|
+
* Rank all files in the index against a query with the identifier-aware BM25
|
|
92
|
+
* re-ranker. Returns file entries sorted by relevance score descending; ties
|
|
93
|
+
* are broken by file path alphabetically (deterministic).
|
|
141
94
|
* @param {string} query
|
|
142
95
|
* @param {Map<string, string[]>} index
|
|
143
96
|
* @param {number} topK
|
|
144
97
|
* @returns {{ file: string, score: number, sigs: string[] }[]}
|
|
145
98
|
*/
|
|
146
99
|
function rank(query, index, topK = 10) {
|
|
147
|
-
const
|
|
148
|
-
const scored = [];
|
|
149
|
-
|
|
100
|
+
const candidates = [];
|
|
150
101
|
for (const [file, sigs] of index.entries()) {
|
|
151
|
-
|
|
152
|
-
scored.push({ file, score, sigs });
|
|
102
|
+
candidates.push({ file, sigs });
|
|
153
103
|
}
|
|
154
|
-
|
|
155
|
-
scored.sort((a, b) => b.score - a.score || a.file.localeCompare(b.file));
|
|
156
|
-
return scored.slice(0, topK);
|
|
104
|
+
return bm25rank(query, candidates).slice(0, topK);
|
|
157
105
|
}
|
|
158
106
|
|
|
159
107
|
// ---------------------------------------------------------------------------
|
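Note on the tokenizer swap above: the tokenizer removed from `src/eval/runner.js` already split camelCase but did not stem, so the query term `emit` reached `emits` only through the 0.3 prefix bonus in the removed `scoreFile`. The sketch below copies both tokenizers from this diff (the old one from the `-` lines, the new one from `src/retrieval/bm25.js`) and compares exact token overlap. `exactMatches` is a hypothetical helper written for this illustration and is not part of the package.

```javascript
'use strict';

// Old tokenizer, copied from the lines removed from src/eval/runner.js.
function oldTokenize(text) {
  if (!text) return [];
  return text
    .replace(/([a-z])([A-Z])/g, '$1 $2') // split camelCase
    .replace(/[_\-]/g, ' ')              // split snake/kebab
    .replace(/[^\w\s]/g, ' ')            // drop non-word chars
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 1);
}

// Stop words, stemmer and tokenizer, copied from the added src/retrieval/bm25.js.
const STOP = new Set(
  ('a an the of to in on for and or is are be by with as at from that this it its ' +
    'into get set add new return value test').split(' ')
);

function stem(w) {
  if (w.length <= 3) return w;
  let s = w;
  s = s.replace(/ies$/, 'y');
  s = s.replace(/(sses|shes|ches|xes|zes)$/, (m) => m.slice(0, -2));
  s = s.replace(/([^s])s$/, '$1');
  s = s.replace(/(ization|izations)$/, 'ize');
  s = s.replace(/(ing|edly|ed|er|ers|ation|ations|ment|ness|ity|ive|able|ible|ize|ise|al)$/, '');
  return s.length >= 3 ? s : w;
}

function newTokenize(text) {
  if (!text || typeof text !== 'string') return [];
  return text
    .replace(/[^A-Za-z0-9]+/g, ' ')
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 1 && !STOP.has(t))
    .map(stem)
    .filter(Boolean);
}

// Hypothetical helper: query tokens that appear verbatim among the identifier's tokens.
function exactMatches(tokenizeFn, query, identifier) {
  const idToks = new Set(tokenizeFn(identifier));
  return tokenizeFn(query).filter((t) => idToks.has(t));
}

console.log(exactMatches(oldTokenize, 'component emit', 'componentEmits')); // [ 'component' ]
console.log(exactMatches(newTokenize, 'component emit', 'componentEmits')); // [ 'component', 'emit' ]
```

With stemming applied to both the query and the identifier, `emits` and `emit` collapse to the same token, so both query terms count as exact matches.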
package/src/format/llms-txt.js
CHANGED
|
@@ -29,7 +29,7 @@ function format(context, cwd, writtenFiles, sigmapVersion) {
|
|
|
29
29
|
|
|
30
30
|
const lines = [
|
|
31
31
|
'# SigMap Context Index',
|
|
32
|
-
`> Generated by SigMap v${sigmapVersion} —
|
|
32
|
+
`> Generated by SigMap v${sigmapVersion} — the deterministic, verifiable grounding layer for AI code work`,
|
|
33
33
|
'',
|
|
34
34
|
'## Project',
|
|
35
35
|
`- Name: ${name}`,
|
package/src/mcp/server.js
CHANGED
|
package/src/retrieval/bm25.js
|
@@ -0,0 +1,122 @@
|
|
|
1
|
+
'use strict';
|
|
2
|
+
|
|
3
|
+
/**
|
|
4
|
+
* SigMap identifier-aware BM25 re-ranker (zero dependencies, deterministic).
|
|
5
|
+
*
|
|
6
|
+
* Plain exact-token TF-IDF misses queries whose terms live *inside* code
|
|
7
|
+
* identifiers — e.g. `component emit` never surfaces `componentEmits.ts`,
|
|
8
|
+
* because "componentEmits" is one token that shares no exact term with the
|
|
9
|
+
* query. This module fixes that with four small additions:
|
|
10
|
+
*
|
|
11
|
+
* 1. Identifier-aware tokenization — split camelCase and snake_case.
|
|
12
|
+
* 2. Light stemming — plurals / common suffixes (`emits` → `emit`).
|
|
13
|
+
* 3. Path-token boost — file path / basename tokens weigh PATH_BOOST× more.
|
|
14
|
+
* 4. BM25 scoring instead of raw TF-IDF (length-normalized).
|
|
15
|
+
*
|
|
16
|
+
* On 85 curated tasks across 17 repos this lifted hit@5 from 75.3% → 82.4%
|
|
17
|
+
* (MRR +16% relative). See issue #395.
|
|
18
|
+
*/
|
|
19
|
+
|
|
20
|
+
// Stop words: common English + low-signal code verbs/nouns that appear in
|
|
21
|
+
// nearly every signature and so carry little retrieval signal.
|
|
22
|
+
const STOP = new Set(
|
|
23
|
+
('a an the of to in on for and or is are be by with as at from that this it its ' +
|
|
24
|
+
'into get set add new return value test')
|
|
25
|
+
.split(' ')
|
|
26
|
+
);
|
|
27
|
+
|
|
28
|
+
/**
|
|
29
|
+
* Light suffix stemmer — conservative, tuned for code identifiers rather than
|
|
30
|
+
* prose. Words of 3 chars or fewer pass through unchanged; a result shorter
|
|
31
|
+
* than 3 chars reverts to the original token.
|
|
32
|
+
*
|
|
33
|
+
* @param {string} w
|
|
34
|
+
* @returns {string}
|
|
35
|
+
*/
|
|
36
|
+
function stem(w) {
|
|
37
|
+
if (w.length <= 3) return w;
|
|
38
|
+
let s = w;
|
|
39
|
+
s = s.replace(/ies$/, 'y');
|
|
40
|
+
s = s.replace(/(sses|shes|ches|xes|zes)$/, (m) => m.slice(0, -2));
|
|
41
|
+
s = s.replace(/([^s])s$/, '$1');
|
|
42
|
+
s = s.replace(/(ization|izations)$/, 'ize');
|
|
43
|
+
s = s.replace(/(ing|edly|ed|er|ers|ation|ations|ment|ness|ity|ive|able|ible|ize|ise|al)$/, '');
|
|
44
|
+
return s.length >= 3 ? s : w;
|
|
45
|
+
}
|
|
46
|
+
|
|
47
|
+
/**
|
|
48
|
+
* Split on non-alphanumeric characters AND camelCase / snake_case boundaries,
|
|
49
|
+
* lowercase, drop stop words and single characters, then stem.
|
|
50
|
+
*
|
|
51
|
+
* @param {string} text
|
|
52
|
+
* @returns {string[]}
|
|
53
|
+
*/
|
|
54
|
+
function tokenize(text) {
|
|
55
|
+
if (!text || typeof text !== 'string') return [];
|
|
56
|
+
return text
|
|
57
|
+
.replace(/[^A-Za-z0-9]+/g, ' ')
|
|
58
|
+
.replace(/([a-z0-9])([A-Z])/g, '$1 $2')
|
|
59
|
+
.replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
|
|
60
|
+
.toLowerCase()
|
|
61
|
+
.split(/\s+/)
|
|
62
|
+
.filter((t) => t.length > 1 && !STOP.has(t))
|
|
63
|
+
.map(stem)
|
|
64
|
+
.filter(Boolean);
|
|
65
|
+
}
|
|
66
|
+
|
|
67
|
+
// The file path / basename is highly indicative of relevance, so its tokens
|
|
68
|
+
// are counted PATH_BOOST times when building the document term-frequency map.
|
|
69
|
+
const PATH_BOOST = 3;
|
|
70
|
+
|
|
71
|
+
/**
|
|
72
|
+
* BM25 re-rank of candidates against a query. Each candidate is
|
|
73
|
+
* `{ file, sigs }`; the returned objects preserve all original candidate
|
|
74
|
+
* fields and add a numeric `score` (higher = more relevant), sorted best-first
|
|
75
|
+
* with a deterministic path tie-break. A `score` of 0 means no query token
|
|
76
|
+
* matched — callers typically drop those.
|
|
77
|
+
*
|
|
78
|
+
* @param {string} query
|
|
79
|
+
* @param {{ file: string, sigs: string[] }[]} candidates
|
|
80
|
+
* @returns {Array<object & { score: number }>}
|
|
81
|
+
*/
|
|
82
|
+
function bm25rank(query, candidates) {
|
|
83
|
+
if (!Array.isArray(candidates) || candidates.length === 0) return [];
|
|
84
|
+
|
|
85
|
+
const k1 = 1.5;
|
|
86
|
+
const b = 0.75;
|
|
87
|
+
|
|
88
|
+
const docs = candidates.map((c) => {
|
|
89
|
+
const pathToks = tokenize(c.file || '');
|
|
90
|
+
const toks = tokenize((c.sigs || []).join(' '));
|
|
91
|
+
for (let i = 0; i < PATH_BOOST; i++) toks.push(...pathToks);
|
|
92
|
+
const tf = new Map();
|
|
93
|
+
for (const t of toks) tf.set(t, (tf.get(t) || 0) + 1);
|
|
94
|
+
return { cand: c, tf, len: toks.length };
|
|
95
|
+
});
|
|
96
|
+
|
|
97
|
+
const N = docs.length || 1;
|
|
98
|
+
const avgdl = docs.reduce((s, d) => s + d.len, 0) / N || 1;
|
|
99
|
+
|
|
100
|
+
const df = new Map();
|
|
101
|
+
for (const d of docs) {
|
|
102
|
+
for (const t of d.tf.keys()) df.set(t, (df.get(t) || 0) + 1);
|
|
103
|
+
}
|
|
104
|
+
|
|
105
|
+
const qToks = [...new Set(tokenize(query))];
|
|
106
|
+
|
|
107
|
+
return docs
|
|
108
|
+
.map((d) => {
|
|
109
|
+
let score = 0;
|
|
110
|
+
for (const t of qToks) {
|
|
111
|
+
const f = d.tf.get(t);
|
|
112
|
+
if (!f) continue;
|
|
113
|
+
const dfT = df.get(t);
|
|
114
|
+
const idf = Math.log(1 + (N - dfT + 0.5) / (dfT + 0.5));
|
|
115
|
+
score += (idf * (f * (k1 + 1))) / (f + k1 * (1 - b + (b * d.len) / avgdl));
|
|
116
|
+
}
|
|
117
|
+
return Object.assign({}, d.cand, { score });
|
|
118
|
+
})
|
|
119
|
+
.sort((a, c) => c.score - a.score || String(a.file).localeCompare(String(c.file)));
|
|
120
|
+
}
|
|
121
|
+
|
|
122
|
+
module.exports = { tokenize, stem, bm25rank, PATH_BOOST, STOP };
|
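Note on the scoring in the file above: a minimal sketch of the same BM25 arithmetic (`k1 = 1.5`, `b = 0.75`, path tokens counted `PATH_BOOST` times) on a tiny corpus. It assumes the documents are already tokenized; the three files, their token lists, and the `scoreCorpus` name are hypothetical and chosen only for illustration. They are not taken from the package or its benchmark.

```javascript
'use strict';

const K1 = 1.5;
const B = 0.75;
const PATH_BOOST = 3;

// Hypothetical, hand-tokenized corpus (illustration only).
const corpus = [
  { file: 'componentEmits.ts', pathToks: ['component', 'emit', 'ts'], sigToks: ['normalize', 'emit', 'option'] },
  { file: 'renderer.ts', pathToks: ['render', 'ts'], sigToks: ['render', 'component', 'tree'] },
  { file: 'scheduler.ts', pathToks: ['schedule', 'ts'], sigToks: ['queue', 'job', 'flush'] },
];

// Score every document against the query tokens:
//   idf(t)   = ln(1 + (N - df + 0.5) / (df + 0.5))
//   score(d) = sum over query tokens of idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len / avgdl))
function scoreCorpus(queryToks, docs) {
  const bags = docs.map((d) => {
    const toks = [...d.sigToks];
    // Path tokens are appended PATH_BOOST times, raising both tf and document length.
    for (let i = 0; i < PATH_BOOST; i++) toks.push(...d.pathToks);
    const tf = new Map();
    for (const t of toks) tf.set(t, (tf.get(t) || 0) + 1);
    return { file: d.file, tf, len: toks.length };
  });

  const N = bags.length;
  const avgdl = bags.reduce((sum, d) => sum + d.len, 0) / N;

  // Document frequency: in how many documents each token appears.
  const df = new Map();
  for (const d of bags) {
    for (const t of d.tf.keys()) df.set(t, (df.get(t) || 0) + 1);
  }

  return bags
    .map((d) => {
      let score = 0;
      for (const t of new Set(queryToks)) {
        const f = d.tf.get(t);
        if (!f) continue; // token absent from this document
        const n = df.get(t);
        const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
        score += (idf * f * (K1 + 1)) / (f + K1 * (1 - B + (B * d.len) / avgdl));
      }
      return { file: d.file, score };
    })
    // Best first; ties broken alphabetically by path for determinism.
    .sort((a, c) => c.score - a.score || a.file.localeCompare(c.file));
}

const ranked = scoreCorpus(['component', 'emit'], corpus);
for (const r of ranked) console.log(r.file, r.score.toFixed(3));
```

In this toy corpus `componentEmits.ts` ranks first because both query tokens appear in its boosted path, `renderer.ts` matches only `component`, and `scheduler.ts` scores 0 because no query token appears in it.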
package/src/retrieval/ranker.js
CHANGED
|
@@ -19,6 +19,7 @@
|
|
|
19
19
|
|
|
20
20
|
const { loadWeights } = require('../learning/weights');
|
|
21
21
|
const { tokenize, STOP_WORDS } = require('./tokenizer');
|
|
22
|
+
const { bm25rank } = require('./bm25');
|
|
22
23
|
|
|
23
24
|
// ---------------------------------------------------------------------------
|
|
24
25
|
// Default weights
|
|
@@ -197,11 +198,24 @@ function rank(query, sigIndex, opts) {
|
|
|
197
198
|
return all.slice(0, topK);
|
|
198
199
|
}
|
|
199
200
|
|
|
201
|
+
// Identifier-aware BM25 base relevance over the whole index (#395). BM25
|
|
202
|
+
// splits camelCase/snake_case, stems, and boosts path tokens, so queries
|
|
203
|
+
// whose terms live inside identifiers (e.g. "component emit" → componentEmits)
|
|
204
|
+
// are matched. The existing negative-signal penalty and recency/graph/learned
|
|
205
|
+
// boosts are layered on top; the per-token signals stay for the explain table.
|
|
206
|
+
const bm25Scores = new Map();
|
|
207
|
+
for (const c of bm25rank(query, [...sigIndex.entries()].map(([file, sigs]) => ({ file, sigs })))) {
|
|
208
|
+
bm25Scores.set(c.file, c.score);
|
|
209
|
+
}
|
|
210
|
+
|
|
200
211
|
const scored = [];
|
|
201
212
|
for (const [file, sigs] of sigIndex.entries()) {
|
|
202
213
|
const result = scoreFile(file, sigs, queryTokens, weights);
|
|
203
|
-
|
|
214
|
+
const penalty = result.signals.penalty;
|
|
215
|
+
const base = bm25Scores.get(file) || 0;
|
|
216
|
+
let score = base * penalty;
|
|
204
217
|
const signals = result.signals;
|
|
218
|
+
signals.bm25 = base;
|
|
205
219
|
|
|
206
220
|
// Recency boost
|
|
207
221
|
if (recencySet && recencySet.has(file) && score > 0) {
|