sweet-search 2.5.9 → 2.5.10

package/README.md CHANGED
@@ -233,25 +233,24 @@ The win is **harness-adaptive**: where the native loop is disciplined (Claude Co
233
233
  <a id="bench-paper-type"></a>
234
234
  ### 📄 3. Paper-type retrieval benchmarks - *academic NL→code IR*
235
235
 
236
- > [!WARNING]
237
- > ⚠️ **THESE NUMBERS ARE STALE - TREAT THEM AS A FLOOR, NOT THE CURRENT SCORE.** ⚠️
238
- > Several results below were measured on builds that predate major accuracy work
239
- > (late-interaction correctness fixes, HNSW tuning, the May 2026 ranking overhaul).
240
- > Every benchmark is being re-run on the current engine and this table will be
241
- > replaced with fresh numbers. Until then, expect the real results to be **higher**.
236
+ > [!NOTE]
237
+ > 🔄 **Refreshed on the current engine (June 2026).** AdvTest, CoIR, CoSQA, and M2CRB were just
238
+ > re-run on the latest build - the one with the late-interaction correctness fixes, HNSW tuning,
239
+ > and the May 2026 ranking overhaul - and every one of them moved **up**. GCSN, CoSQA+, and CLARC
240
+ > were already current. Reproduction artifacts are in [`eval/results/`](eval/results/).
242
241
 
243
242
  Every number below is the **`ss-search` pipeline end-to-end** - the same binary you install, querying
244
- against the **full corpus** (no 99-distractor shortcuts), measured at 26–41 ms p50 on an M3 Max.
243
+ against the **full corpus** (no 99-distractor shortcuts), on an M3 Max.
245
244
 
246
245
  | 📚 Benchmark | 🔍 What it tests | # Queries | 🎯 MRR@10 |
247
246
  |-----------|---------------|--------:|-------:|
248
247
  | 🌐 **GenCodeSearchNet** | NL→code, 6 languages | 6,000 | **86.6** |
249
- | 🗺️ **M2CRB** | multilingual NL→code (ES/PT/DE/FR → Py/Java/JS) | 2,814 | **60.2** |
250
- | 🐍 CoSQA (test split) | web queries → Python | 500 | 97.0 |
248
+ | 🗺️ **M2CRB** | multilingual NL→code (ES/PT/DE/FR → Py/Java/JS) | 2,814 | **65.9** |
249
+ | 🐍 CoSQA (test split) | web queries → Python | 500 | 98.8 |
251
250
  | 🐍 CoSQA+ | web queries → Python, multi-match | 20,604 | 72.1 |
252
251
  | ⚙️ CLARC | NL→C/C++ (systems code) | 1,245 | 67.4 |
253
- | 🛡️ AdvTest † | adversarially renamed Python | 1,000 | 91.5 |
254
- | 🌍 CoIR † | 10 datasets, 14 languages | 4,500 | 57.3 |
252
+ | 🛡️ AdvTest | adversarially renamed Python | 1,000 | **99.1** |
253
+ | 🌍 CoIR | 10 datasets, 14 languages | 4,500 | **72.4** |
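For readers unfamiliar with the metric in the last column, here is a minimal sketch of how MRR@10 is conventionally computed. It is not taken from the package's `eval/` code; the `runs` shape and the function name are assumptions made for illustration, and the result is scaled to 0-100 to match the table.

```javascript
// Mean reciprocal rank at cutoff k, scaled to 0-100.
// `runs` is a hypothetical shape: one entry per query, holding the ranked
// result ids and the set of ids that count as relevant for that query.
function mrrAtK(runs, k = 10) {
  if (runs.length === 0) return 0;
  let sum = 0;
  for (const { ranked, relevant } of runs) {
    // Only the first k results are eligible; a later hit scores 0.
    const idx = ranked.slice(0, k).findIndex((id) => relevant.has(id));
    if (idx !== -1) sum += 1 / (idx + 1); // ranks are 1-based
  }
  return (100 * sum) / runs.length;
}

const runs = [
  { ranked: ['a', 'b', 'c'], relevant: new Set(['a']) }, // hit at rank 1 -> 1
  { ranked: ['x', 'y', 'z'], relevant: new Set(['y']) }, // hit at rank 2 -> 0.5
  { ranked: ['p', 'q', 'r'], relevant: new Set(['m']) }, // miss -> 0
];
console.log(mrrAtK(runs).toFixed(1)); // 50.0
```

Ranking against the full corpus rather than 99 sampled distractors only changes what goes into `ranked`; the formula stays the same.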
255
254
 
256
255
  **GenCodeSearchNet: the strongest result published anywhere, as far as we can tell.** The benchmark's
257
256
  own paper tops out at MRR ≤ 0.42 for its fine-tuned baselines (and ≤ 0.10 on the cross-lingual subsets),
@@ -260,15 +259,15 @@ query**. sweet-search scores **0.866**, retrieving from the entire 6,000-documen
260
259
 
261
260
  **M2CRB: best published number, no fine-tuning.** The benchmark paper's best model - a CodeBERT
262
261
  *fine-tuned on the task's training mix* - reaches 52.7 (auMRRc, a metric averaged over smaller retrieval
263
- pools). sweet-search reaches **60.2 full-corpus MRR@10 out of the box**, on Spanish, Portuguese, German,
262
+ pools). sweet-search reaches **65.9 full-corpus MRR@10 out of the box**, on Spanish, Portuguese, German,
264
263
  and French queries.
265
264
 
266
265
  <details>
267
- <summary><b>Methodology & staleness flags</b></summary>
266
+ <summary><b>Methodology & build dates</b></summary>
268
267
 
269
268
  - **Reproduction:** result artifacts live in `eval/results/`; rerun via `eval/run_all.js`.
270
269
  - **Protocol note:** published baselines for GCSN and CoSQA-style benchmarks typically rank the gold snippet against 99 sampled distractors. All sweet-search numbers rank against the full benchmark corpus - strictly harder.
271
- - **† Staleness:** AdvTest and CoIR were last run on the February 2026 build - before the late-interaction correctness fixes, HNSW tuning, and the May ranking work. They likely understate the current engine; re-runs are queued. CoSQA/M2CRB are from the April build; GCSN, CoSQA+, and CLARC are current (May 2026).
270
+ - **Build dates:** AdvTest, CoIR, CoSQA, and M2CRB were re-run on the **June 2026** engine (0 errors on each); GCSN, CoSQA+, and CLARC are from the May 2026 build. All numbers reflect the current late-interaction pipeline - the correctness fixes, HNSW tuning, and May ranking overhaul. The June re-runs all improved over their earlier builds (AdvTest 91.5→99.1, CoIR 57.3→72.4, CoSQA 97.0→98.8, M2CRB 60.2→65.9).
272
271
  - **Honesty corner:** CrossCodeEval - cross-file *completion context* retrieval, a different task than NL search - sits at 0.12. We don't optimize for it and report it anyway.
273
272
  - Dates and per-language breakdowns: [`docs/BENCHMARKS_EXPLAINED.md`](docs/BENCHMARKS_EXPLAINED.md).
274
273
 
@@ -540,31 +539,43 @@ without another search.
540
539
 
541
540
  ## 🧠 An Agent Prompt That Was Evolved, Not Written
542
541
 
543
- Giving an agent six tools is easy. Getting it to *stop grepping in circles* is not.
542
+ Shipping six tools is easy. Getting an agent to *stop grepping in circles* is the hard part.
544
543
 
545
- `sweet-search init` installs a ~1k-token system prompt that encodes a complete search discipline -
546
- and it wasn't hand-written. It was **evolved with a GEPA-style optimization loop**: reflective mutation
547
- by one model family, scored on a dual Pareto front (accuracy × cost) across two *different* production
548
- targets, then validated on held-out probes and on **model families that were never part of the
549
- optimization**, and finally hand-hardened with a correctness editing pass.
544
+ So `sweet-search init` installs a ~1k-token system prompt that we **didn't write** - we *grew* it.
545
+ A GEPA-style loop mutated candidate prompts, scored each on a dual Pareto front (**accuracy × cost**)
546
+ against **two different production agents at once** - Claude Code (Sonnet) and Codex (GPT-5.5) - kept the
547
+ survivors, and repeated. A final correctness pass hardened the winner. ~1k tokens, one job: teach the
548
+ agent to search *well*.
550
549
 
551
- What it teaches:
550
+ **🎓 The five rules it encodes:**
552
551
 
553
- - **Cheapest tool first** - hold an exact identifier? One `ss-grep`, trust the top hit, stop. No semantic search "just to confirm."
554
- - **Trust the ranking** - confirm with at most one narrow read, never a re-run of a hit that already matched.
555
- - **Absence is an answer** - two complementary empty probes (one semantic, one lexical) settle a negative; no third synonym, no `find`/`ls` spiral.
556
- - **No raw-shell escape** - the #1 token-waster we found in trajectory analysis is agents abandoning the index for dozens of raw `grep`/`find` calls after one empty result. The prompt closes that door explicitly.
557
- - **A reasoning checkpoint** - before a third probe, the agent must state what it has established and what its blind spot is.
552
+ | | Rule | What it kills |
553
+ |--|--|--|
554
+ | 🥇 | **Cheapest tool first** | Got an exact symbol? One `ss-grep`, trust the top hit, stop - no semantic search "just to confirm." |
555
+ | 🎯 | **Trust the ranking** | At most one narrow read to confirm; never re-run a hit that already matched. |
556
+ | 🚫 | **Absence is an answer** | Two empty probes (one semantic, one lexical) settle a negative - no third synonym, no `find`/`ls` spiral. |
557
+ | ⛔ | **No raw-shell escape** | The #1 token-waster in our trace analysis: agents bailing to dozens of raw `grep`/`find` calls after one miss. Door closed. |
558
+ | 📝 | **Think before you dig** | Before a third probe, the agent states what it knows and what its blind spot is. |
559
+
560
+ **🧾 The receipts** - *held-out discipline throughout: a dev set to iterate on, a held-out set touched only at milestones, a sealed vault opened exactly once.*
561
+
562
+ | Validation gate | Result |
563
+ |--|--|
564
+ | 🎯 **Held-out** (30 probes × both agents) | joint score *(worst of the two)* **0.988** |
565
+ | 🌍 **Out-of-distribution** (8 languages never seen in the loop) | **0.952** - *every* language ≥ 0.79, zero weak spots |
566
+ | 🛡️ **Adversarial counter-probes** | **1.00 / 1.00** |
567
+ | 🔀 **Held-out model families** (never optimized on) | MiMo **0.988** · Qwen **0.980** - it generalizes, it doesn't memorize |
568
+ | 🧩 **Paraphrase robustness** (reword the prompt, same behavior) | correctness-weighted **0.95 / 0.93** |
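The "joint score (worst of the two)" and the accuracy × cost Pareto front can be illustrated with a small sketch. This is not the project's optimization code: the candidate ids, target names, and numbers below are invented, and the README's "dual" front may well be kept per target rather than on the joint score as done here.

```javascript
// Hypothetical candidates: per-target accuracy plus one measured cost figure.
const candidates = [
  { id: 'A', accuracy: { claude: 0.99, codex: 0.97 }, cost: 10 },
  { id: 'B', accuracy: { claude: 0.96, codex: 0.98 }, cost: 6 },
  { id: 'C', accuracy: { claude: 0.99, codex: 0.90 }, cost: 8 },
];

// Maximin: a candidate is only as good as its worse target.
function jointScore(accuracyByTarget) {
  return Math.min(...Object.values(accuracyByTarget));
}

// Pareto front over (joint accuracy up, cost down):
// keep every candidate that no other candidate dominates.
function paretoFront(cands) {
  const scored = cands.map((c) => ({ id: c.id, joint: jointScore(c.accuracy), cost: c.cost }));
  const dominates = (a, b) =>
    a.joint >= b.joint && a.cost <= b.cost && (a.joint > b.joint || a.cost < b.cost);
  return scored.filter((c) => !scored.some((o) => dominates(o, c)));
}

console.log(paretoFront(candidates).map((c) => c.id)); // [ 'A', 'B' ]
```

Candidate C looks strong on one target but drops out: its joint score is 0.90, and B is both more accurate on that measure and cheaper.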
558
569
 
559
570
  <details>
560
- <summary><b>How it was validated</b></summary>
561
-
562
- - **Optimization targets:** two frontier model families in production harnesses (Claude Code and Codex-style CLIs), scored jointly so the prompt can't overfit to one model's quirks.
563
- - **Selection:** dual Pareto fronts over per-probe accuracy and measured cost; candidates gated by paraphrase-invariance (the prompt's behavior must survive rewording).
564
- - **Held-out discipline:** a dev probe set for iteration, a held-out set checked only at milestones, and a sealed vault set opened exactly once. Joint maximin on held-out: **0.988**; out-of-distribution probes: **0.95+**; vault: **0.963** - 2.5 pp below held-out, well inside the pre-registered 15% acceptance gate.
565
- - **Held-out model families (HOMP):** the final prompt passed on two model families from different vendors that were never used during evolution - evidence the routing rules generalize, not memorize.
566
- - All figures are from the in-repo evaluation program (internal probe suites; see [`docs/PHASE7.md`](docs/PHASE7.md)); the benchmark suite that will make these externally reproducible is in progress.
567
- - Installation is idempotent and marker-delimited: re-running `init` updates the managed block in `CLAUDE.md` / `AGENTS.md` / `GEMINI.md` / `.cursor/rules` without touching anything else you wrote.
571
+ <summary><b>🔬 How it was actually built (the honest version)</b></summary>
572
+
573
+ - **Seeds → survivors:** 15 hand-authored seed prompts entered a reflective-evolution loop (an agent reads the *real* tool-call traces, proposes one targeted edit, we keep what helps). Operators included trajectory crossover, structural pivots, tool-name masking, and a pruner that fights prompt bloat.
574
+ - **Two targets, jointly:** every candidate was scored on **both** Claude Code/Sonnet **and** Codex/GPT-5.5 with Maximin discipline (a prompt is only as good as its *worse* target), so it can't overfit one model's quirks.
575
+ - **What actually won:** not clever phrasing - **terseness** (a shorter prompt re-sent every turn is cheaper), a **leaner tool mix** (grep/read over heavy semantic blocks that fatten the transcript), and **decisiveness on no-match** (stop spiraling). We report this plainly because it's what the traces showed.
576
+ - **The correctness pass:** the shipped prompt ("M++") is the cost-winner plus 7 edits that fix factual descriptions of the tools - routing byte-identical, accuracy held, cost unchanged. A lateral move that buys honesty.
577
+ - **Held-out everything:** dev to iterate, held-out checked only at milestones, a sealed vault opened once, plus held-out *model families* (MiMo, Qwen) and a reasoning-mode replay (MiniMax **0.963**) it never trained against. Figures: [`docs/PHASE7.md`](docs/PHASE7.md) (internal probe suites; an externally-reproducible suite is in progress).
578
+ - **Idempotent install:** `init` writes a marker-delimited block into `CLAUDE.md` / `AGENTS.md` / `GEMINI.md` / `.cursor/rules` - re-run it freely, it never touches anything else you wrote.
568
579
 
569
580
  </details>
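The idempotent, marker-delimited install described in the list above could work roughly as sketched below. The marker strings and the function name are made up for illustration (the diff does not show the real ones), and the sketch works on strings rather than touching `CLAUDE.md` on disk.

```javascript
// Hypothetical markers; the ones `sweet-search init` actually writes are not shown in this diff.
const BEGIN = '<!-- managed:begin -->';
const END = '<!-- managed:end -->';

// Insert or refresh the managed block, leaving all other text untouched.
function upsertManagedBlock(fileText, body) {
  const block = `${BEGIN}\n${body}\n${END}`;
  const start = fileText.indexOf(BEGIN);
  const end = start === -1 ? -1 : fileText.indexOf(END, start + BEGIN.length);
  if (start !== -1 && end !== -1) {
    // Block already present: replace only the span between (and including) the markers.
    return fileText.slice(0, start) + block + fileText.slice(end + END.length);
  }
  if (fileText.length === 0) return block + '\n';
  // No block yet: append it after the user's own content.
  const sep = fileText.endsWith('\n') ? '\n' : '\n\n';
  return fileText + sep + block + '\n';
}

const once = upsertManagedBlock('# My notes\n', 'prompt v1');
const twice = upsertManagedBlock(once, 'prompt v1');
console.log(once === twice); // true
```

Running it again with a new body swaps the block contents in place, which is the property that makes a re-run safe.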
570
581
 
@@ -130,14 +130,22 @@ let deferredLogs = []; // lines held back while parallel bars run (flus
130
130
 
131
131
  function renderBar(current, total, label) {
132
132
  const ratio = total > 0 ? Math.max(0, Math.min(1, current / total)) : 1;
133
- const eighths = Math.round(ratio * BAR_WIDTH * 8);
133
+ const head = `${label}:`.padEnd(LABEL_COL); // right border aligns across phases
134
+ const pct = (ratio * 100).toFixed(1).padStart(5);
135
+ const prefix = `${head}[`;
136
+ // Size the bar so the whole line fits the terminal width - a wrapped line would
137
+ // span two physical rows and break the cursor-up redraw math (→ duplicate bars).
138
+ // Drop the (current/total) counts first when the terminal is too cramped.
139
+ const cols = process.stdout.columns || 80;
140
+ let suffix = `] ${pct}% (${current}/${total})`;
141
+ if (cols - prefix.length - suffix.length - 1 < 6) suffix = `] ${pct}%`;
142
+ const width = Math.max(1, Math.min(BAR_WIDTH, cols - prefix.length - suffix.length - 1));
143
+ const eighths = Math.round(ratio * width * 8);
134
144
  const full = Math.floor(eighths / 8);
135
145
  const partial = SUB_BLOCKS[eighths % 8];
136
146
  const bar = '█'.repeat(full) + partial;
137
- const empty = '░'.repeat(Math.max(0, BAR_WIDTH - full - (partial ? 1 : 0)));
138
- const head = `${label}:`.padEnd(LABEL_COL); // right border aligns across phases
139
- const pct = (ratio * 100).toFixed(1).padStart(5);
140
- return `${colors.cyan}${head}[${bar}${empty}] ${pct}% (${current}/${total})${colors.reset}`;
147
+ const empty = '░'.repeat(Math.max(0, width - full - (partial ? 1 : 0)));
148
+ return `${colors.cyan}${prefix}${bar}${empty}${suffix}${colors.reset}`;
141
149
  }
142
150
 
143
151
  // (Re)draw the live region in place (the `log-update` pattern). Invariant: the
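To see the width-fitting change in isolation, the sketch below restates the new `renderBar` logic as a standalone function without colors. `BAR_WIDTH`, `LABEL_COL`, and `SUB_BLOCKS` are defined outside this hunk, so the values here are assumptions, and the terminal width is passed in as `cols` instead of being read from `process.stdout.columns`.

```javascript
// Assumed values; the real constants live outside the hunk shown above.
const BAR_WIDTH = 30;
const LABEL_COL = 12;
const SUB_BLOCKS = ['', '▏', '▎', '▍', '▌', '▋', '▊', '▉'];

function renderBarPlain(current, total, label, cols = 80) {
  const ratio = total > 0 ? Math.max(0, Math.min(1, current / total)) : 1;
  const prefix = `${`${label}:`.padEnd(LABEL_COL)}[`;
  const pct = (ratio * 100).toFixed(1).padStart(5);
  let suffix = `] ${pct}% (${current}/${total})`;
  // Too cramped for a useful bar: drop the (current/total) counts first.
  if (cols - prefix.length - suffix.length - 1 < 6) suffix = `] ${pct}%`;
  // Shrink the bar so prefix + bar + suffix stays under the terminal width.
  const width = Math.max(1, Math.min(BAR_WIDTH, cols - prefix.length - suffix.length - 1));
  const eighths = Math.round(ratio * width * 8);
  const full = Math.floor(eighths / 8);
  const partial = SUB_BLOCKS[eighths % 8];
  const empty = '░'.repeat(Math.max(0, width - full - (partial ? 1 : 0)));
  return `${prefix}${'█'.repeat(full)}${partial}${empty}${suffix}`;
}

console.log(renderBarPlain(5, 10, 'embed', 80));
console.log(renderBarPlain(5, 10, 'embed', 30)); // narrower bar, counts dropped
```

On very narrow terminals the bar bottoms out at one cell, so a long label can still overflow; the sketch, like the diff, only trims the bar and the counts.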
@@ -1654,7 +1654,7 @@ export class LateInteractionIndex {
1654
1654
  this._segmentDir = segDir;
1655
1655
  this._currentSegment = new Map();
1656
1656
  await this._saveAliasSidecar();
1657
- console.log(`LateInteraction: Saved ${this.documents.size} documents across ${this._segments.length} segments`);
1657
+ if (process.env.DEBUG) console.log(`LateInteraction: Saved ${this.documents.size} documents across ${this._segments.length} segments`);
1658
1658
  return;
1659
1659
  }
1660
1660
  }
@@ -1743,7 +1743,7 @@ export class LateInteractionIndex {
1743
1743
  this._currentSegment = new Map();
1744
1744
 
1745
1745
  await this._saveAliasSidecar();
1746
- console.log(`LateInteraction: Saved ${this.documents.size} documents across ${newSegments.length} segments`);
1746
+ if (process.env.DEBUG) console.log(`LateInteraction: Saved ${this.documents.size} documents across ${newSegments.length} segments`);
1747
1747
  return;
1748
1748
  }
1749
1749
 
@@ -1820,7 +1820,10 @@ export class LateInteractionIndex {
1820
1820
  await this._saveAliasSidecar();
1821
1821
 
1822
1822
  const sizeMB = (bytesWritten / 1024 / 1024).toFixed(2);
1823
- console.log(`LateInteraction: Saved ${this.documents.size} documents (${sizeMB} MB)`);
1823
+ // DEBUG-only: this prints during the indexer's parallel embed+LI progress region;
1824
+ // a direct write here moves the cursor and duplicates a bar. The indexer's
1825
+ // "✓ Late interaction index built: N docs (X MB)" line already reports this.
1826
+ if (process.env.DEBUG) console.log(`LateInteraction: Saved ${this.documents.size} documents (${sizeMB} MB)`);
1824
1827
  }
1825
1828
 
1826
1829
  /**
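The three hunks above repeat the same `if (process.env.DEBUG) console.log(...)` guard. One way to express that pattern once is a small helper, sketched below; `debugLog` is a hypothetical name and is not part of the package. The `env` and `sink` parameters exist only to make the sketch testable.

```javascript
// Hypothetical helper; the diff inlines the guard at each call site instead.
function debugLog(message, env = process.env, sink = console.log) {
  if (!env.DEBUG) return false; // unset or empty string: stay quiet
  sink(message);
  return true;
}

const seen = [];
debugLog('LateInteraction: Saved 3 documents', {}, (m) => seen.push(m));
debugLog('LateInteraction: Saved 3 documents', { DEBUG: '1' }, (m) => seen.push(m));
console.log(seen.length); // 1
```

Environment variables are strings, so any non-empty value enables the output, including `DEBUG=0`.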
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "sweet-search",
3
- "version": "2.5.9",
3
+ "version": "2.5.10",
4
4
  "description": "Sweet Search - SOTA Hybrid Code Search Engine with WASM CatBoost Query Router, Semantic/Lexical/Structural Search, and Multilingual Support",
5
5
  "type": "module",
6
6
  "main": "core/search/sweet-search.js",
@@ -163,12 +163,12 @@
163
163
  },
164
164
  "optionalDependencies": {
165
165
  "usearch": "^2.21.4",
166
- "@sweet-search/native-darwin-arm64": "2.5.9",
167
- "@sweet-search/native-darwin-x64": "2.5.9",
168
- "@sweet-search/native-linux-arm64-gnu": "2.5.9",
169
- "@sweet-search/native-linux-arm64-gnu-cuda": "2.5.9",
170
- "@sweet-search/native-linux-x64-gnu": "2.5.9",
171
- "@sweet-search/native-linux-x64-gnu-cuda": "2.5.9"
166
+ "@sweet-search/native-darwin-arm64": "2.5.10",
167
+ "@sweet-search/native-darwin-x64": "2.5.10",
168
+ "@sweet-search/native-linux-arm64-gnu": "2.5.10",
169
+ "@sweet-search/native-linux-arm64-gnu-cuda": "2.5.10",
170
+ "@sweet-search/native-linux-x64-gnu": "2.5.10",
171
+ "@sweet-search/native-linux-x64-gnu-cuda": "2.5.10"
172
172
  },
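The hunk above bumps six native packages in lockstep with the main version. A release check along the lines below could catch a missed bump; this is a sketch under assumptions, not a script shipped by the package, and `findVersionDrift` is an invented name.

```javascript
// Report scoped native packages whose pinned version differs from the package version.
function findVersionDrift(pkg, scope = '@sweet-search/native-') {
  const deps = pkg.optionalDependencies || {};
  return Object.entries(deps)
    .filter(([name, range]) => name.startsWith(scope) && range !== pkg.version)
    .map(([name, range]) => `${name}: ${range} != ${pkg.version}`);
}

// Trimmed-down example manifest with one stale pin.
const examplePkg = {
  version: '2.5.10',
  optionalDependencies: {
    usearch: '^2.21.4', // outside the scope, ignored
    '@sweet-search/native-darwin-arm64': '2.5.10',
    '@sweet-search/native-linux-x64-gnu': '2.5.9',
  },
};
console.log(findVersionDrift(examplePkg));
// [ '@sweet-search/native-linux-x64-gnu: 2.5.9 != 2.5.10' ]
```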
173
173
  "engines": {
174
174
  "node": ">=18.0.0"
@@ -34,8 +34,8 @@ function run() {
34
34
  const L2 = '▄▄█ ▀▄█▄▀ ██▄ ██▄ █ ▄▄█ ██▄ █▀█ ██▄ █▄▄ █▀█';
35
35
  const msg = [
36
36
  '',
37
- ` ${c('1;38;5;213', L1)}`,
38
- ` ${c('1;38;5;213', L2)}`,
37
+ ` ${c('1;38;5;135', L1)}`,
38
+ ` ${c('1;38;5;135', L2)}`,
39
39
  '',
40
40
  ` ${c('1', 'Get started:')}`,
41
41
  ` ${c('36', 'sweet-search init')} set up the current project`,
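The only change in this hunk is the banner color, from 256-color index 213 to 135. The `c` helper itself is not shown in the diff, so the version below is an assumed minimal form that wraps text in a standard SGR escape sequence: `1` is bold, `38;5;N` selects foreground color N from the 256-color palette, and `0` resets.

```javascript
// Assumed shape of the helper; the real `c` may also check for TTY or NO_COLOR.
function c(codes, text) {
  return `\x1b[${codes}m${text}\x1b[0m`;
}

console.log(c('1;38;5;213', 'old banner color'));
console.log(c('1;38;5;135', 'new banner color'));
console.log(c('36', 'sweet-search init')); // plain cyan, as in the "Get started" lines
```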