@metaharness/darwin 0.2.5 → 0.2.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/CHANGELOG.md CHANGED

@@ -2,6 +2,14 @@
 All notable changes to this package. Dates UTC.
+## 0.2.7 — 2026-06-20
+- LEARNINGS.md brought current with the full batch-verified arc: §5 N-tier ladder (29.3→40.3→**58.3%**), §6 capability floor now the rigorous local-14b full-300 number (4.7→6.7%), verdict updated (paradigm reaches 58.3%, both frontiers exhausted). Agentic loop (ADR-153) now implemented + unit-tested (`bench/swebench/agentic-loop.mjs` + `solve-agentic.mjs`) — the next-arc architecture, shipped as code.
+## 0.2.6 — 2026-06-19
+- **3-tier hybrid = 175/300 = 58.3%** [52.7,63.8] on full SWE-bench Lite (ADR-154), VERIFIED (55/55 sage-added reproduced). v4-pro(88)->sonnet Scholar(+33)->opus Sage(+54). 7.6x the 7.7% baseline; conservative lower bound (Sage partial). Blended ~$0.74/instance.
 ## 0.2.5 — 2026-06-19
 - New ceiling (ADR-152): **v4-pro + Scholar hybrid = 121/300 = 40.3%** [34.9,46.0] on full SWE-bench Lite — 5.2x the 7.7% baseline. Two levers stack: stronger cheap base (v4-pro, 88/300) + frontier-tail escalation (sonnet-4 recovers 33/212). Blended ~$0.39/instance.
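
The ladder arithmetic in these changelog entries can be checked with a short sketch. The counts (88, +33, +54) and blended costs ($0.39, $0.74) come from the entries above. The function and field names are illustrative, and the $/resolve figures rest on the assumption that "blended $/instance" means total spend divided by all 300 instances, which the changelog does not state.

```javascript
// Figures copied from the changelog entries above (full SWE-bench Lite, 300 instances).
const TOTAL = 300;
const ladder = [
  // The changelog gives no per-instance cost for the base tier alone, so it stays null.
  { tier: "v4-pro base + repair", added: 88, blendedUsdPerInstance: null },
  { tier: "+ sonnet Scholar (2-tier)", added: 33, blendedUsdPerInstance: 0.39 },
  { tier: "+ opus Sage (3-tier)", added: 54, blendedUsdPerInstance: 0.74 },
];

const round1 = (x) => Math.round(x * 10) / 10;
const round2 = (x) => Math.round(x * 100) / 100;

// Hypothetical helper: walks the ladder and reports cumulative resolve-rate
// plus blended and marginal cost per resolved instance.
function summarizeLadder(steps, total) {
  let resolved = 0;
  let prevSpend = null;
  return steps.map((step) => {
    // Instances left unresolved by earlier tiers, i.e. eligible for escalation.
    // (The changelog marks the Sage run as partial, so not all were necessarily attempted.)
    const residualIn = total - resolved;
    resolved += step.added;
    // ASSUMPTION: total spend = blended $/instance x total instances.
    const spend =
      step.blendedUsdPerInstance === null ? null : step.blendedUsdPerInstance * total;
    const row = {
      tier: step.tier,
      residualIn,
      resolved,
      ratePct: round1((100 * resolved) / total),
      usdPerResolve: spend === null ? null : round2(spend / resolved),
      marginalUsdPerResolve:
        spend === null || prevSpend === null
          ? null
          : round2((spend - prevSpend) / step.added),
    };
    prevSpend = spend;
    return row;
  });
}

const ladderRows = summarizeLadder(ladder, TOTAL);
console.table(ladderRows);
```

Under that assumption, the second tier enters with a residual of 212 (matching "33/212" above). The blended cost per resolve rises from about $0.97 at two tiers to about $1.27 at three, and the third tier's marginal cost is about $1.94 per added resolve.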

package/LEARNINGS.md CHANGED

@@ -31,17 +31,23 @@ recommended harness patterns. The headline: **the harness, not the model, is the
 - **Recommendation:** default to the cheapest model that clears the task; reserve frontier models for
   measured capability gaps. Track **$/resolve**, not just resolve-rate.
-## 5. Barbarian & Scholar — tier the models, escalate only the residual (33.3% at ~6× less cost)
-- Cheap base (deepseek + repair) banks the easy 46/300; a frontier "Scholar" (sonnet-4 + repair)
-  escalated **only to the 254 it failed** cracks 55 more → **100/300 = 33.3%**, blended
-  ~$0.34/instance vs ~$2 to run frontier on all 300.
-- **Recommendation:** two-tier orchestration — cheap sweep, then frontier on the residual — is far
-  more cost-efficient than one strong model everywhere (you'd waste 5/6 of frontier spend re-solving
-  what cheap already gets).
+## 5. Barbarian & Scholar — tier the models, escalate only the residual (up to 58.3%)
+- Cheap base banks the easy wins; a frontier "Scholar" escalated **only to the residual it failed**
+  cracks more; a 3rd "Sage" tier escalates again. Each tier pays only for the shrinking tail. The
+  batch-verified ladder on full SWE-bench Lite (300):
+  - v4-pro base + repair: 88/300 = 29.3%
+  - + sonnet-4 Scholar on the tail (2-tier): 121/300 = **40.3%** [34.9, 46.0], ~$0.39/inst
+  - + opus-4.8 Sage on the residual (3-tier): 175/300 = **58.3%** [52.7, 63.8], ~$0.74/inst
+- **Recommendation:** N-tier cheap→frontier escalation is far more cost-efficient than one strong
+  model everywhere (you'd waste most of frontier spend re-solving what cheap already gets). Returns
+  diminish per tier at rising $/resolve — stop where the residual's marginal cost exceeds its value.
 ## 6. The repair lift is model-bound below a capability floor (~14B)
-- Repair did nothing for a 7B (4%→4%) but lifted a 14B (8%→12%) and doubled a hosted model. The loop
-  needs the model to *occasionally* produce a correct-ish patch to converge toward.
+- Batch-verified on full-300: repair lifts a local 14B only **+2pp (4.7% → 6.7%** [4.4, 10.1]) — and
+  108/300 of its attempts were empty/invalid diffs the model couldn't emit, so the loop had nothing to
+  iterate on. The *same harness* on a hosted model reaches 29.3%. The loop needs the model to
+  *occasionally* produce a correct-ish patch to converge toward; below that floor, repair recovers
+  little.
 - **Recommendation:** don't expect harness scaffolding to rescue a model below the task's reasoning
   floor; pick the smallest model *above* it, then let the harness multiply it.
@@ -58,6 +64,10 @@ recommended harness patterns. The headline: **the harness, not the model, is the
 ---
-Verdict: this paradigm (localize + search/replace + repair + tiered escalation) tops out ~33% on
-SWE-bench Lite with a cheap base. The 65–88% agentic-SOTA tier needs a **multi-step autonomous agent**
-(bash, dir-navigation, long-horizon discovery) — an architecture change, not more knob-tuning.
+Verdict: this paradigm (localize + search/replace + repair + tiered escalation) reaches a
+batch-verified **58.3%** on SWE-bench Lite via cheap-base + 3-tier frontier escalation — 7.6× the
+7.7% open-loop baseline. Both within-paradigm frontiers are now exhausted: hosted (3rd-tier escalation
+at steeply rising $/resolve) and local (the §6 capability floor). The 65–88% agentic-SOTA tier needs a
+**multi-step autonomous agent** (read/grep/run-tests/edit/discovery loop) — an architecture change,
+not more knob-tuning. That loop is now implemented + unit-tested (ADR-153: `bench/swebench/
+agentic-loop.mjs` + `solve-agentic.mjs`); its at-scale number is the next arc.
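
The escalation pattern that §5 of LEARNINGS.md describes (each tier pays only for the shrinking tail) can be sketched as below. This is a minimal illustration, not the package's implementation. `runLadder`, the tier objects, and the `solve` callbacks are hypothetical names, and the stub solvers stand in for real model calls, which would be asynchronous.

```javascript
// Hypothetical sketch of N-tier escalation: each tier attempts only the
// instances that every earlier (cheaper) tier failed to resolve.
// `tiers` is ordered cheapest first; `solve(instance)` is a stand-in that
// returns true when that tier's patch passes the instance's tests.
function runLadder(instances, tiers) {
  let residual = [...instances];
  const report = [];
  for (const { name, solve } of tiers) {
    const attempted = residual.length;
    // Keep only the instances this tier could not resolve.
    const stillFailing = residual.filter((instance) => !solve(instance));
    report.push({ name, attempted, resolved: attempted - stillFailing.length });
    residual = stillFailing;
    if (residual.length === 0) break; // nothing left to escalate
  }
  return { report, unresolved: residual };
}

// Toy run: stub solvers keyed on a made-up difficulty score.
const toyInstances = [1, 2, 3, 4, 5, 6].map((difficulty) => ({ difficulty }));
const toyTiers = [
  { name: "cheap-base", solve: (i) => i.difficulty <= 2 },
  { name: "scholar", solve: (i) => i.difficulty <= 4 },
  { name: "sage", solve: (i) => i.difficulty <= 5 },
];
const toyResult = runLadder(toyInstances, toyTiers);
console.table(toyResult.report);
```

In the toy run the tiers attempt 6, 4, and 2 instances respectively, so the most expensive tier is billed for the smallest set. A cost-aware variant could stop before a tier whose expected marginal $/resolve exceeds the value of a resolve, in line with the recommendation above.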

package/README.md CHANGED

@@ -286,6 +286,7 @@ context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0
 | **+ closed-loop repair (test-feedback, ≤3)** | 46/300 = **15.3%** | **[11.7, 19.8]** | 149 |
 | **+ swap base → deepseek-v4-pro (cheap)** | 88/300 = **29.3%** | **[24.5, 34.7]** | 151 |
 | **+ v4-pro + Scholar hybrid** | 121/300 = **40.3%** | **[34.9, 46.0]** | 152 |
+| **+ Sage (opus-4.8) — 3-tier** | 175/300 = **58.3%** | **[52.7, 63.8]** | 154 |
 | **+ Barbarian&Scholar hybrid (cheap+frontier tail)** | 100/300 = **33.3%** | **[28.2, 38.8]** | 148 |
 The closed-loop repair loop **~doubles** the resolve-rate (7.7% → 15.3%) on the *same cheap model*
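
The bracketed ranges in the table rows above are consistent with 95% Wilson score intervals on k/300. The README excerpt shown here does not name the method, so treat that as an inference. The sketch below uses the standard Wilson formula with z = 1.96; the function name is illustrative.

```javascript
// Wilson score interval for a binomial proportion (successes out of n trials).
// z = 1.96 corresponds to a two-sided 95% interval.
function wilsonInterval(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

// Format a proportion as a percentage with one decimal place.
const asPct = (x) => (100 * x).toFixed(1);

// Resolved counts taken from the table rows in this hunk.
for (const k of [46, 88, 121, 175]) {
  const [lo, hi] = wilsonInterval(k, 300).map(asPct);
  console.log(`${k}/300 -> [${lo}, ${hi}]`);
}
```

For 175/300 this gives [52.7, 63.8], and for 121/300 it gives [34.9, 46.0], matching the table.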

package/package.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "@metaharness/darwin",
-  "version": "0.2.5",
-  "description": "An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker): 7.7% open-loop -> 15.3% +repair -> 29.3% (deepseek-v4-pro base) -> 40.3% v4-pro+frontier-tail hybrid, ~$0.01-$0.39/instance (vs $1-20 for frontier agents). The harness, not the model, is the lever. Dependency-free (Node built-ins).",
+  "version": "0.2.7",
+  "description": "An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker, verified): 7.7% open-loop -> 15.3% +repair -> 29.3% (v4-pro base) -> 40.3% 2-tier -> 58.3% 3-tier cheap->frontier escalation, ~$0.01-$0.74/instance (vs $1-20 for frontier agents). The harness, not the model, is the lever. Dependency-free (Node built-ins).",
   "type": "module",
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",