npm - sweet-search - Versions diffs - 2.5.9 → 2.5.10 - Mend

sweet-search 2.5.9 → 2.5.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5)

package/README.md +45 -34
package/core/indexing/indexer-utils.js +13 -5
package/core/ranking/late-interaction-index.js +6 -3
package/package.json +7 -7
package/scripts/postinstall-banner.js +2 -2

package/README.md CHANGED

@@ -233,25 +233,24 @@ The win is **harness-adaptive**: where the native loop is disciplined (Claude Co
 <a id="bench-paper-type"></a>
 ### 📄 3. Paper-type retrieval benchmarks — *academic NL→code IR*
-> [!WARNING]
-> ⚠️ **THESE NUMBERS ARE STALE — TREAT THEM AS A FLOOR, NOT THE CURRENT SCORE.** ⚠️
-> Several results below were measured on builds that predate major accuracy work
-> (late-interaction correctness fixes, HNSW tuning, the May 2026 ranking overhaul).
-> Every benchmark is being re-run on the current engine and this table will be
-> replaced with fresh numbers. Until then, expect the real results to be **higher**.
+> [!NOTE]
+> 🔄 **Refreshed on the current engine (June 2026).** AdvTest, CoIR, CoSQA, and M2CRB were just
+> re-run on the latest build — the one with the late-interaction correctness fixes, HNSW tuning,
+> and the May 2026 ranking overhaul — and every one of them moved **up**. GCSN, CoSQA+, and CLARC
+> were already current. Reproduction artifacts are in [`eval/results/`](eval/results/).
 Every number below is the **`ss-search` pipeline end-to-end** — the same binary you install, querying
-against the **full corpus** (no 99-distractor shortcuts), measured at 26–41 ms p50 on an M3 Max.
+against the **full corpus** (no 99-distractor shortcuts), on an M3 Max.
 | 📚 Benchmark | 🔍 What it tests | # Queries | 🎯 MRR@10 |
 |-----------|---------------|--------:|-------:|
 | 🌐 **GenCodeSearchNet** | NL→code, 6 languages | 6,000 | **86.6** |
-| 🗺️ **M2CRB** | multilingual NL→code (ES/PT/DE/FR → Py/Java/JS) | 2,814 | **60.2** |
-| 🐍 CoSQA (test split) | web queries → Python | 500 | 97.0 |
+| 🗺️ **M2CRB** | multilingual NL→code (ES/PT/DE/FR → Py/Java/JS) | 2,814 | **65.9** |
+| 🐍 CoSQA (test split) | web queries → Python | 500 | 98.8 |
 | 🐍 CoSQA+ | web queries → Python, multi-match | 20,604 | 72.1 |
 | ⚙️ CLARC | NL→C/C++ (systems code) | 1,245 | 67.4 |
-| 🛡️ AdvTest † | adversarially renamed Python | 1,000 | 91.5 |
-| 🌍 CoIR † | 10 datasets, 14 languages | 4,500 | 57.3 |
+| 🛡️ AdvTest | adversarially renamed Python | 1,000 | **99.1** |
+| 🌍 CoIR | 10 datasets, 14 languages | 4,500 | **72.4** |
 **GenCodeSearchNet: the strongest result published anywhere, as far as we can tell.** The benchmark's
 own paper tops out at MRR ≤ 0.42 for its fine-tuned baselines (and ≤ 0.10 on the cross-lingual subsets),
@@ -260,15 +259,15 @@ query**. sweet-search scores **0.866**, retrieving from the entire 6,000-documen
 **M2CRB: best published number, no fine-tuning.** The benchmark paper's best model — a CodeBERT
 *fine-tuned on the task's training mix* — reaches 52.7 (auMRRc, a metric averaged over smaller retrieval
-pools). sweet-search reaches **60.2 full-corpus MRR@10 out of the box**, on Spanish, Portuguese, German,
+pools). sweet-search reaches **65.9 full-corpus MRR@10 out of the box**, on Spanish, Portuguese, German,
 and French queries.
 <details>
-<summary><b>Methodology & staleness flags</b></summary>
+<summary><b>Methodology & build dates</b></summary>
 - **Reproduction:** result artifacts live in `eval/results/`; rerun via `eval/run_all.js`.
 - **Protocol note:** published baselines for GCSN and CoSQA-style benchmarks typically rank the gold snippet against 99 sampled distractors. All sweet-search numbers rank against the full benchmark corpus — strictly harder.
-- **† Staleness:** AdvTest and CoIR were last run on the February 2026 build — before the late-interaction correctness fixes, HNSW tuning, and the May ranking work. They likely understate the current engine; re-runs are queued. CoSQA/M2CRB are from the April build; GCSN, CoSQA+, and CLARC are current (May 2026).
+- **Build dates:** AdvTest, CoIR, CoSQA, and M2CRB were re-run on the **June 2026** engine (0 errors on each); GCSN, CoSQA+, and CLARC are from the May 2026 build. All numbers reflect the current late-interaction pipeline — the correctness fixes, HNSW tuning, and May ranking overhaul. The June re-runs all improved over their earlier builds (AdvTest 91.5→99.1, CoIR 57.3→72.4, CoSQA 97.0→98.8, M2CRB 60.2→65.9).
 - **Honesty corner:** CrossCodeEval — cross-file *completion context* retrieval, a different task than NL search — sits at 0.12. We don't optimize for it and report it anyway.
 - Dates and per-language breakdowns: [`docs/BENCHMARKS_EXPLAINED.md`](docs/BENCHMARKS_EXPLAINED.md).
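
For readers unfamiliar with the metric in the table above, here is a minimal sketch of how MRR@10 is conventionally computed. It assumes one gold document per query and a ranked list of document ids per query. The helper name `mrrAtK` and the demo data are hypothetical; this is not the code in `eval/run_all.js`. The table appears to report the value multiplied by 100 (the README gives GenCodeSearchNet as both 86.6 and 0.866).

```javascript
// Hypothetical helper: mean reciprocal rank truncated at rank k.
// `runs` is an array of { ranked: string[], gold: string }.
function mrrAtK(runs, k = 10) {
  if (runs.length === 0) return 0;
  let sum = 0;
  for (const { ranked, gold } of runs) {
    // 0-based position of the gold id within the top k, or -1 if it is not there.
    const pos = ranked.slice(0, k).indexOf(gold);
    if (pos !== -1) sum += 1 / (pos + 1);
  }
  return sum / runs.length;
}

// Illustrative data only, not benchmark output.
const demoRuns = [
  { ranked: ['a', 'b', 'c'], gold: 'a' },       // gold at rank 1 -> 1
  { ranked: ['x', 'y', 'z'], gold: 'y' },       // gold at rank 2 -> 0.5
  { ranked: ['p', 'q', 'r'], gold: 'missing' }, // gold not retrieved -> 0
];
console.log((mrrAtK(demoRuns) * 100).toFixed(1)); // 50.0
```

Under this definition, ranking against the full corpus instead of 99 sampled distractors can only keep or worsen the gold document's rank, which is why the protocol note calls it strictly harder.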
@@ -540,31 +539,43 @@ without another search.
 ## 🧠 An Agent Prompt That Was Evolved, Not Written
-Giving an agent six tools is easy. Getting it to *stop grepping in circles* is not.
+Shipping six tools is easy. Getting an agent to *stop grepping in circles* is the hard part.
-`sweet-search init` installs a ~1k-token system prompt that encodes a complete search discipline —
-and it wasn't hand-written. It was **evolved with a GEPA-style optimization loop**: reflective mutation
-by one model family, scored on a dual Pareto front (accuracy × cost) across two *different* production
-targets, then validated on held-out probes and on **model families that were never part of the
-optimization**, and finally hand-hardened with a correctness editing pass.
+So `sweet-search init` installs a ~1k-token system prompt that we **didn't write** — we *grew* it.
+A GEPA-style loop mutated candidate prompts, scored each on a dual Pareto front (**accuracy × cost**)
+against **two different production agents at once** — Claude Code (Sonnet) and Codex (GPT-5.5) — kept the
+survivors, and repeated. A final correctness pass hardened the winner. ~1k tokens, one job: teach the
+agent to search *well*.
-What it teaches:
+**🎓 The five rules it encodes:**
-- **Cheapest tool first** — hold an exact identifier? One `ss-grep`, trust the top hit, stop. No semantic search "just to confirm."
-- **Trust the ranking** — confirm with at most one narrow read, never a re-run of a hit that already matched.
-- **Absence is an answer** — two complementary empty probes (one semantic, one lexical) settle a negative; no third synonym, no `find`/`ls` spiral.
-- **No raw-shell escape** — the #1 token-waster we found in trajectory analysis is agents abandoning the index for dozens of raw `grep`/`find` calls after one empty result. The prompt closes that door explicitly.
-- **A reasoning checkpoint** — before a third probe, the agent must state what it has established and what its blind spot is.
+| | Rule | What it kills |
+|--|--|--|
+| 🥇 | **Cheapest tool first** | Got an exact symbol? One `ss-grep`, trust the top hit, stop — no semantic search "just to confirm." |
+| 🎯 | **Trust the ranking** | At most one narrow read to confirm; never re-run a hit that already matched. |
+| 🚫 | **Absence is an answer** | Two empty probes (one semantic, one lexical) settle a negative — no third synonym, no `find`/`ls` spiral. |
+| ⛔ | **No raw-shell escape** | The #1 token-waster in our trace analysis: agents bailing to dozens of raw `grep`/`find` calls after one miss. Door closed. |
+| 📝 | **Think before you dig** | Before a third probe, the agent states what it knows and what its blind spot is. |
+**🧾 The receipts** — *held-out discipline throughout: a dev set to iterate on, a held-out set touched only at milestones, a sealed vault opened exactly once.*
+| Validation gate | Result |
+|--|--|
+| 🎯 **Held-out** (30 probes × both agents) | joint score *(worst of the two)* **0.988** |
+| 🌍 **Out-of-distribution** (8 languages never seen in the loop) | **0.952** — *every* language ≥ 0.79, zero weak spots |
+| 🛡️ **Adversarial counter-probes** | **1.00 / 1.00** |
+| 🔀 **Held-out model families** (never optimized on) | MiMo **0.988** · Qwen **0.980** — it generalizes, it doesn't memorize |
+| 🧩 **Paraphrase robustness** (reword the prompt, same behavior) | correctness-weighted **0.95 / 0.93** |
 <details>
-<summary><b>How it was validated</b></summary>
-- **Optimization targets:** two frontier model families in production harnesses (Claude Code and Codex-style CLIs), scored jointly so the prompt can't overfit to one model's quirks.
-- **Selection:** dual Pareto fronts over per-probe accuracy and measured cost; candidates gated by paraphrase-invariance (the prompt's behavior must survive rewording).
-- **Held-out discipline:** a dev probe set for iteration, a held-out set checked only at milestones, and a sealed vault set opened exactly once. Joint maximin on held-out: **0.988**; out-of-distribution probes: **0.95+**; vault: **0.963** — 2.5 pp below held-out, well inside the pre-registered 15% acceptance gate.
-- **Held-out model families (HOMP):** the final prompt passed on two model families from different vendors that were never used during evolution — evidence the routing rules generalize, not memorize.
-- All figures are from the in-repo evaluation program (internal probe suites; see [`docs/PHASE7.md`](docs/PHASE7.md)); the benchmark suite that will make these externally reproducible is in progress.
-- Installation is idempotent and marker-delimited: re-running `init` updates the managed block in `CLAUDE.md` / `AGENTS.md` / `GEMINI.md` / `.cursor/rules` without touching anything else you wrote.
+<summary><b>🔬 How it was actually built (the honest version)</b></summary>
+- **Seeds → survivors:** 15 hand-authored seed prompts entered a reflective-evolution loop (an agent reads the *real* tool-call traces, proposes one targeted edit, we keep what helps). Operators included trajectory crossover, structural pivots, tool-name masking, and a pruner that fights prompt bloat.
+- **Two targets, jointly:** every candidate was scored on **both** Claude Code/Sonnet **and** Codex/GPT-5.5 with Maximin discipline (a prompt is only as good as its *worse* target), so it can't overfit one model's quirks.
+- **What actually won:** not clever phrasing — **terseness** (a shorter prompt re-sent every turn is cheaper), a **leaner tool mix** (grep/read over heavy semantic blocks that fatten the transcript), and **decisiveness on no-match** (stop spiraling). We report this plainly because it's what the traces showed.
+- **The correctness pass:** the shipped prompt ("M++") is the cost-winner plus 7 edits that fix factual descriptions of the tools — routing byte-identical, accuracy held, cost unchanged. A lateral move that buys honesty.
+- **Held-out everything:** dev to iterate, held-out checked only at milestones, a sealed vault opened once, plus held-out *model families* (MiMo, Qwen) and a reasoning-mode replay (MiniMax **0.963**) it never trained against. Figures: [`docs/PHASE7.md`](docs/PHASE7.md) (internal probe suites; an externally-reproducible suite is in progress).
+- **Idempotent install:** `init` writes a marker-delimited block into `CLAUDE.md` / `AGENTS.md` / `GEMINI.md` / `.cursor/rules` — re-run it freely, it never touches anything else you wrote.
 </details>
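
The "joint score (worst of the two)" and "Maximin discipline" wording in the new README text can be illustrated with a short sketch. It assumes each candidate prompt has one accuracy score per target agent. All names and numbers below are hypothetical, and the sketch covers only the accuracy side: the README describes the real selection as a dual Pareto front over accuracy and cost.

```javascript
// Joint score under maximin: a candidate is only as good as its worst target.
function jointScore(scoresByTarget) {
  const values = Object.values(scoresByTarget);
  if (values.length === 0) throw new Error('need at least one target score');
  return Math.min(...values);
}

// Pick the candidate whose worst-case score is highest.
function pickMaximin(candidates) {
  let best = null;
  for (const candidate of candidates) {
    const joint = jointScore(candidate.scores);
    if (best === null || joint > best.joint) best = { id: candidate.id, joint };
  }
  return best;
}

// Illustrative numbers only, not taken from the evaluation.
const demoCandidates = [
  { id: 'A', scores: { targetOne: 0.99, targetTwo: 0.80 } }, // strong on one target, weak on the other
  { id: 'B', scores: { targetOne: 0.92, targetTwo: 0.90 } }, // balanced
];
console.log(pickMaximin(demoCandidates)); // { id: 'B', joint: 0.9 }
```

Candidate A has the higher average, but B wins because its weaker target still scores 0.90. That is the sense in which joint scoring discourages overfitting to one model.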

package/core/indexing/indexer-utils.js CHANGED

@@ -130,14 +130,22 @@ let deferredLogs = [];          // lines held back while parallel bars run (flus
 function renderBar(current, total, label) {
   const ratio = total > 0 ? Math.max(0, Math.min(1, current / total)) : 1;
-  const eighths = Math.round(ratio * BAR_WIDTH * 8);
+  const head = `${label}:`.padEnd(LABEL_COL);            // right border aligns across phases
+  const pct = (ratio * 100).toFixed(1).padStart(5);
+  const prefix = `${head}[`;
+  // Size the bar so the whole line fits the terminal width — a wrapped line would
+  // span two physical rows and break the cursor-up redraw math (→ duplicate bars).
+  // Drop the (current/total) counts first when the terminal is too cramped.
+  const cols = process.stdout.columns || 80;
+  let suffix = `] ${pct}% (${current}/${total})`;
+  if (cols - prefix.length - suffix.length - 1 < 6) suffix = `] ${pct}%`;
+  const width = Math.max(1, Math.min(BAR_WIDTH, cols - prefix.length - suffix.length - 1));
+  const eighths = Math.round(ratio * width * 8);
   const full = Math.floor(eighths / 8);
   const partial = SUB_BLOCKS[eighths % 8];
   const bar = '█'.repeat(full) + partial;
-  const empty = '░'.repeat(Math.max(0, BAR_WIDTH - full - (partial ? 1 : 0)));
-  const head = `${label}:`.padEnd(LABEL_COL);            // right border aligns across phases
-  const pct = (ratio * 100).toFixed(1).padStart(5);
-  return `${colors.cyan}${head}[${bar}${empty}] ${pct}% (${current}/${total})${colors.reset}`;
+  const empty = '░'.repeat(Math.max(0, width - full - (partial ? 1 : 0)));
+  return `${colors.cyan}${prefix}${bar}${empty}${suffix}${colors.reset}`;
 }
 // (Re)draw the live region in place (the `log-update` pattern). Invariant: the
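
The width-fitting arithmetic added to `renderBar` can be tried in isolation. The sketch below mirrors the sizing steps from the hunk but takes the terminal width as a parameter instead of reading `process.stdout.columns`. `DEMO_BAR_WIDTH` and `DEMO_LABEL_COL` are hypothetical stand-ins, since the real `BAR_WIDTH` and `LABEL_COL` values are not shown in the diff, and `fitBar` is an illustrative name.

```javascript
// Hypothetical stand-ins: the real BAR_WIDTH / LABEL_COL values are not in the diff.
const DEMO_BAR_WIDTH = 40;
const DEMO_LABEL_COL = 24;

// Returns the bar width and suffix chosen for a terminal `cols` columns wide.
function fitBar(label, current, total, cols) {
  const ratio = total > 0 ? Math.max(0, Math.min(1, current / total)) : 1;
  const head = `${label}:`.padEnd(DEMO_LABEL_COL);
  const prefix = `${head}[`;
  const pct = (ratio * 100).toFixed(1).padStart(5);
  // Prefer the long suffix; drop the (current/total) counts if fewer than 6 bar cells would remain.
  let suffix = `] ${pct}% (${current}/${total})`;
  if (cols - prefix.length - suffix.length - 1 < 6) suffix = `] ${pct}%`;
  // Leave one spare column so the line never reaches the right edge.
  const width = Math.max(1, Math.min(DEMO_BAR_WIDTH, cols - prefix.length - suffix.length - 1));
  return { width, suffix, lineLength: prefix.length + width + suffix.length };
}

const wide = fitBar('Embedding', 50, 200, 80);
console.log(wide.width, wide.lineLength); // 37 79
const cramped = fitBar('Embedding', 50, 200, 45);
console.log(cramped.width, cramped.lineLength); // 11 44
```

With these stand-in constants, an 80-column terminal shrinks the bar from 40 to 37 cells, and a 45-column terminal drops the counts. Because the width is floored at 1, a terminal narrower than the prefix plus the short suffix would still wrap in this sketch.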

package/core/ranking/late-interaction-index.js CHANGED

@@ -1654,7 +1654,7 @@ export class LateInteractionIndex {
           this._segmentDir = segDir;
           this._currentSegment = new Map();
           await this._saveAliasSidecar();
-          console.log(`LateInteraction: Saved ${this.documents.size} documents across ${this._segments.length} segments`);
+          if (process.env.DEBUG) console.log(`LateInteraction: Saved ${this.documents.size} documents across ${this._segments.length} segments`);
           return;
         }
       }
@@ -1743,7 +1743,7 @@ export class LateInteractionIndex {
       this._currentSegment = new Map();
       await this._saveAliasSidecar();
-      console.log(`LateInteraction: Saved ${this.documents.size} documents across ${newSegments.length} segments`);
+      if (process.env.DEBUG) console.log(`LateInteraction: Saved ${this.documents.size} documents across ${newSegments.length} segments`);
       return;
     }
@@ -1820,7 +1820,10 @@ export class LateInteractionIndex {
     await this._saveAliasSidecar();
     const sizeMB = (bytesWritten / 1024 / 1024).toFixed(2);
-    console.log(`LateInteraction: Saved ${this.documents.size} documents (${sizeMB} MB)`);
+    // DEBUG-only: this prints during the indexer's parallel embed+LI progress region;
+    // a direct write here moves the cursor and duplicates a bar. The indexer's
+    // "✓ Late interaction index built: N docs (X MB)" line already reports this.
+    if (process.env.DEBUG) console.log(`LateInteraction: Saved ${this.documents.size} documents (${sizeMB} MB)`);
   }
   /**
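
All three hunks in this file apply the same `if (process.env.DEBUG) console.log(...)` guard. A minimal sketch of that pattern as a reusable helper follows. The `makeDebugLog` name and the injectable `env` and `sink` parameters are assumptions added for illustration and are not part of the package.

```javascript
// Hypothetical helper wrapping the guard used in the diff.
// `env` and `sink` are injectable so the behaviour can be checked without touching process.env.
function makeDebugLog(env = process.env, sink = console.log) {
  return (...args) => {
    if (env.DEBUG) sink(...args);
  };
}

const seen = [];
const record = (message) => seen.push(message);
makeDebugLog({}, record)('hidden');                 // DEBUG unset -> suppressed
makeDebugLog({ DEBUG: '' }, record)('hidden too');  // empty string is falsy -> suppressed
makeDebugLog({ DEBUG: '1' }, record)('shown');
makeDebugLog({ DEBUG: '0' }, record)('also shown'); // any non-empty string is truthy
console.log(seen); // [ 'shown', 'also shown' ]
```

Environment variables are strings, so a plain truthiness check treats `DEBUG=0` or `DEBUG=false` as enabled. Only an unset or empty variable silences the message.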

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "sweet-search",
-  "version": "2.5.9",
+  "version": "2.5.10",
   "description": "Sweet Search - SOTA Hybrid Code Search Engine with WASM CatBoost Query Router, Semantic/Lexical/Structural Search, and Multilingual Support",
   "type": "module",
   "main": "core/search/sweet-search.js",
@@ -163,12 +163,12 @@
   },
   "optionalDependencies": {
     "usearch": "^2.21.4",
-    "@sweet-search/native-darwin-arm64": "2.5.9",
-    "@sweet-search/native-darwin-x64": "2.5.9",
-    "@sweet-search/native-linux-arm64-gnu": "2.5.9",
-    "@sweet-search/native-linux-arm64-gnu-cuda": "2.5.9",
-    "@sweet-search/native-linux-x64-gnu": "2.5.9",
-    "@sweet-search/native-linux-x64-gnu-cuda": "2.5.9"
+    "@sweet-search/native-darwin-arm64": "2.5.10",
+    "@sweet-search/native-darwin-x64": "2.5.10",
+    "@sweet-search/native-linux-arm64-gnu": "2.5.10",
+    "@sweet-search/native-linux-arm64-gnu-cuda": "2.5.10",
+    "@sweet-search/native-linux-x64-gnu": "2.5.10",
+    "@sweet-search/native-linux-x64-gnu-cuda": "2.5.10"
   },
   "engines": {
     "node": ">=18.0.0"

package/scripts/postinstall-banner.js CHANGED

@@ -34,8 +34,8 @@ function run() {
   const L2 = '▄▄█ ▀▄█▄▀ ██▄ ██▄  █   ▄▄█ ██▄ █▀█ ██▄ █▄▄ █▀█';
   const msg = [
     '',
-    `  ${c('1;38;5;213', L1)}`,
-    `  ${c('1;38;5;213', L2)}`,
+    `  ${c('1;38;5;135', L1)}`,
+    `  ${c('1;38;5;135', L2)}`,
     '',
     `  ${c('1', 'Get started:')}`,
     `    ${c('36', 'sweet-search init')}        set up the current project`,
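
The only change in this file is the colour argument, `1;38;5;213` to `1;38;5;135`. The sketch below decodes what those numbers select. The body of `c` is not shown in the diff, so the version here is an assumed shape (wrap the text in an SGR escape sequence, then reset). `xtermCubeToRgb` is a hypothetical helper that uses the standard xterm 256-colour cube levels.

```javascript
// Assumed shape of the banner's `c` helper: apply SGR codes, then reset.
function c(codes, text) {
  return `\x1b[${codes}m${text}\x1b[0m`;
}

// Indices 16..231 of the xterm 256-colour palette form a 6x6x6 RGB cube.
const CUBE_LEVELS = [0, 95, 135, 175, 215, 255];
function xtermCubeToRgb(index) {
  if (index < 16 || index > 231) throw new RangeError('not a colour-cube index');
  const n = index - 16;
  return [
    CUBE_LEVELS[Math.floor(n / 36)],
    CUBE_LEVELS[Math.floor(n / 6) % 6],
    CUBE_LEVELS[n % 6],
  ];
}

console.log(xtermCubeToRgb(213)); // [ 255, 135, 255 ]
console.log(xtermCubeToRgb(135)); // [ 175, 95, 255 ]
console.log(c('1;38;5;135', 'sweet-search')); // bold text in palette colour 135
```

In the code string, `1` requests bold and `38;5;N` selects foreground colour N from the 256-colour palette. On terminals using the standard palette, the banner therefore moves from a light pink to a purple.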