@metaharness/darwin 0.2.5 → 0.2.7

package/CHANGELOG.md CHANGED
@@ -2,6 +2,14 @@
 
  All notable changes to this package. Dates UTC.
 
+ ## 0.2.7 — 2026-06-20
+
+ - LEARNINGS.md brought current with the full batch-verified arc: §5 N-tier ladder (29.3→40.3→**58.3%**), §6 capability floor now the rigorous local-14b full-300 number (4.7→6.7%), verdict updated (paradigm reaches 58.3%, both frontiers exhausted). Agentic loop (ADR-153) now implemented + unit-tested (`bench/swebench/agentic-loop.mjs` + `solve-agentic.mjs`) — the next-arc architecture, shipped as code.
+
+ ## 0.2.6 — 2026-06-19
+
+ - **3-tier hybrid = 175/300 = 58.3%** [52.7,63.8] on full SWE-bench Lite (ADR-154), VERIFIED (55/55 sage-added reproduced). v4-pro(88)->sonnet Scholar(+33)->opus Sage(+54). 7.6x the 7.7% baseline; conservative lower bound (Sage partial). Blended ~$0.74/instance.
+
  ## 0.2.5 — 2026-06-19
 
  - New ceiling (ADR-152): **v4-pro + Scholar hybrid = 121/300 = 40.3%** [34.9,46.0] on full SWE-bench Lite — 5.2x the 7.7% baseline. Two levers stack: stronger cheap base (v4-pro, 88/300) + frontier-tail escalation (sonnet-4 recovers 33/212). Blended ~$0.39/instance.
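
The bracketed ranges in these entries appear consistent with a 95% Wilson score interval on k/300. That is an inference from the numbers, not something the changelog states. A minimal sketch, where `wilsonInterval` is a hypothetical helper and not part of the package:

```javascript
// Hypothetical helper (not from the package): 95% Wilson score interval
// for k successes out of n trials, returned in percent.
// Assumption: the changelog's brackets were computed this way.
function wilsonInterval(k, n, z = 1.96) {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [100 * (center - half), 100 * (center + half)];
}

// Resolve counts taken from the 0.2.5 and 0.2.6 entries.
for (const [label, k] of [["2-tier", 121], ["3-tier", 175]]) {
  const [lo, hi] = wilsonInterval(k, 300);
  console.log(`${label}: ${k}/300 -> [${lo.toFixed(1)}, ${hi.toFixed(1)}]`);
}
```

Under that assumption the two printed ranges round to the published [34.9, 46.0] and [52.7, 63.8].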
package/LEARNINGS.md CHANGED
@@ -31,17 +31,23 @@ recommended harness patterns. The headline: **the harness, not the model, is the
  - **Recommendation:** default to the cheapest model that clears the task; reserve frontier models for
  measured capability gaps. Track **$/resolve**, not just resolve-rate.
 
- ## 5. Barbarian & Scholar — tier the models, escalate only the residual (33.3% at ~6× less cost)
- - Cheap base (deepseek + repair) banks the easy 46/300; a frontier "Scholar" (sonnet-4 + repair)
- escalated **only to the 254 it failed** cracks 55 more → **100/300 = 33.3%**, blended
- ~$0.34/instance vs ~$2 to run frontier on all 300.
- - **Recommendation:** two-tier orchestration (cheap sweep, then frontier on the residual) is far
- more cost-efficient than one strong model everywhere (you'd waste 5/6 of frontier spend re-solving
- what cheap already gets).
+ ## 5. Barbarian & Scholar — tier the models, escalate only the residual (up to 58.3%)
+ - Cheap base banks the easy wins; a frontier "Scholar" escalated **only to the residual it failed**
+ cracks more; a 3rd "Sage" tier escalates again. Each tier pays only for the shrinking tail. The
+ batch-verified ladder on full SWE-bench Lite (300):
+ - v4-pro base + repair: 88/300 = 29.3%
+ - + sonnet-4 Scholar on the tail (2-tier): 121/300 = **40.3%** [34.9, 46.0], ~$0.39/inst
+ - + opus-4.8 Sage on the residual (3-tier): 175/300 = **58.3%** [52.7, 63.8], ~$0.74/inst
+ - **Recommendation:** N-tier cheap→frontier escalation is far more cost-efficient than one strong
+ model everywhere (you'd waste most of frontier spend re-solving what cheap already gets). Returns
+ diminish per tier at rising $/resolve — stop where the residual's marginal cost exceeds its value.
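
The escalation pattern described in §5 can be sketched roughly as follows. This is an illustrative sketch only: `escalate`, the tier objects, and their `solve` callbacks are hypothetical stand-ins, not the package's API, and the toy solvers below do not reproduce the benchmark numbers.

```javascript
// Minimal sketch of N-tier escalation (hypothetical names throughout).
// Each tier is attempted only on the instances every earlier tier failed.
function escalate(instances, tiers) {
  let residual = [...instances];
  const resolved = new Map(); // instance id -> name of the tier that resolved it
  const attempts = [];        // how many instances each tier was charged for

  for (const tier of tiers) {
    if (residual.length === 0) break;
    attempts.push({ tier: tier.name, attempted: residual.length });
    const stillFailing = [];
    for (const inst of residual) {
      // `solve` is assumed to return true when the instance's tests pass.
      if (tier.solve(inst)) resolved.set(inst.id, tier.name);
      else stillFailing.push(inst);
    }
    residual = stillFailing; // only failures move on to the next, costlier tier
  }
  return { resolved, residual, attempts };
}

// Toy usage with fake solvers: each tier can handle a slightly larger id range.
const instances = Array.from({ length: 10 }, (_, i) => ({ id: i }));
const tiers = [
  { name: "base", solve: (x) => x.id < 3 },
  { name: "scholar", solve: (x) => x.id < 5 },
  { name: "sage", solve: (x) => x.id < 8 },
];
const result = escalate(instances, tiers);
console.log(result.attempts);
console.log(`resolved ${result.resolved.size}, unresolved ${result.residual.length}`);
```

In the toy run each tier is charged for a smaller set than the one before it (10, then 7, then 5 instances), which is the property the recommendation relies on.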
 
  ## 6. The repair lift is model-bound below a capability floor (~14B)
- - Repair did nothing for a 7B (4%→4%) but lifted a 14B (8%→12%) and doubled a hosted model. The loop
- needs the model to *occasionally* produce a correct-ish patch to converge toward.
+ - Batch-verified on full-300: repair lifts a local 14B only **+2pp (4.7% → 6.7%** [4.4, 10.1]) and
+ 108/300 of its attempts were empty/invalid diffs the model couldn't emit, so the loop had nothing to
+ iterate on. The *same harness* on a hosted model reaches 29.3%. The loop needs the model to
+ *occasionally* produce a correct-ish patch to converge toward; below that floor, repair recovers
+ little.
  - **Recommendation:** don't expect harness scaffolding to rescue a model below the task's reasoning
  floor; pick the smallest model *above* it, then let the harness multiply it.
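
Why an empty or invalid diff leaves the loop with "nothing to iterate on" can be shown with a small sketch. Everything here is hypothetical: `repairLoop`, `propose`, and `runTests` are illustrative stand-ins rather than the package's functions, and the budget of one initial attempt plus up to 3 repair rounds is an assumption based on the README's "≤3" label.

```javascript
// Hypothetical closed-loop repair sketch (not the package's implementation).
// propose(feedback) returns a patch string; runTests(patch) returns
// { passed, log }. Both are supplied by the caller.
function repairLoop(propose, runTests, maxRepairs = 3) {
  let feedback = null;
  for (let round = 0; round <= maxRepairs; round++) {
    const patch = propose(feedback);
    if (typeof patch !== "string" || patch.trim() === "") {
      // Empty/invalid diff: there is no test run, so no signal to feed back.
      feedback = null;
      continue;
    }
    const result = runTests(patch);
    if (result.passed) return { resolved: true, rounds: round, patch };
    feedback = result.log; // failing test output steers the next proposal
  }
  return { resolved: false, rounds: maxRepairs, patch: null };
}

// Toy models: one fixes its patch once it sees feedback, one never emits a diff.
const runTests = (patch) =>
  patch === "fixed-patch" ? { passed: true, log: "" } : { passed: false, log: "AssertionError" };
const aboveFloor = (feedback) => (feedback ? "fixed-patch" : "buggy-patch");
const belowFloor = () => "";

const good = repairLoop(aboveFloor, runTests);
const bad = repairLoop(belowFloor, runTests);
console.log(good.resolved, good.rounds);
console.log(bad.resolved);
```

The second toy model burns its whole budget without ever producing test feedback, which mirrors the failure mode §6 describes for models below the floor.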
 
@@ -58,6 +64,10 @@ recommended harness patterns. The headline: **the harness, not the model, is the
 
  ---
 
- Verdict: this paradigm (localize + search/replace + repair + tiered escalation) tops out ~33% on
- SWE-bench Lite with a cheap base. The 65–88% agentic-SOTA tier needs a **multi-step autonomous agent**
- (bash, dir-navigation, long-horizon discovery), an architecture change, not more knob-tuning.
+ Verdict: this paradigm (localize + search/replace + repair + tiered escalation) reaches a
+ batch-verified **58.3%** on SWE-bench Lite via cheap-base + 3-tier frontier escalation, 7.6× the
+ 7.7% open-loop baseline. Both within-paradigm frontiers are now exhausted: hosted (3rd-tier escalation
+ at steeply rising $/resolve) and local (the §6 capability floor). The 65–88% agentic-SOTA tier needs a
+ **multi-step autonomous agent** (read/grep/run-tests/edit/discovery loop) — an architecture change,
+ not more knob-tuning. That loop is now implemented + unit-tested (ADR-153: `bench/swebench/
+ agentic-loop.mjs` + `solve-agentic.mjs`); its at-scale number is the next arc.
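
The "steeply rising $/resolve" claim can be checked with back-of-envelope arithmetic from the ladder figures quoted in this diff. The sketch assumes the blended per-instance cost is averaged over all 300 instances, which is a reading of the source rather than something it states, and the inputs are themselves approximate ("~").

```javascript
// Figures quoted in the diff; blended costs are approximate.
const N = 300;
const ladder = [
  { name: "2-tier (v4-pro + Scholar)", resolved: 121, blendedPerInstance: 0.39 },
  { name: "3-tier (+ Sage)", resolved: 175, blendedPerInstance: 0.74 },
];

// Average cost per resolved instance at a given rung.
function costPerResolve(step) {
  return (step.blendedPerInstance * N) / step.resolved;
}

// Extra spend divided by extra resolves when moving from one rung to the next.
function marginalCostPerResolve(prev, next) {
  const extraSpend = (next.blendedPerInstance - prev.blendedPerInstance) * N;
  return extraSpend / (next.resolved - prev.resolved);
}

for (const step of ladder) {
  console.log(`${step.name}: ~$${costPerResolve(step).toFixed(2)} per resolve`);
}
const sageMarginal = marginalCostPerResolve(ladder[0], ladder[1]);
console.log(`Sage tier marginal: ~$${sageMarginal.toFixed(2)} per extra resolve`);
```

On these inputs the average cost per resolve rises from roughly $0.97 to $1.27, while each additional Sage-tier resolve costs roughly $1.94, about twice the 2-tier average.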
package/README.md CHANGED
@@ -286,6 +286,7 @@ context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0
  | **+ closed-loop repair (test-feedback, ≤3)** | 46/300 = **15.3%** | **[11.7, 19.8]** | 149 |
  | **+ swap base → deepseek-v4-pro (cheap)** | 88/300 = **29.3%** | **[24.5, 34.7]** | 151 |
  | **+ v4-pro + Scholar hybrid** | 121/300 = **40.3%** | **[34.9, 46.0]** | 152 |
+ | **+ Sage (opus-4.8) — 3-tier** | 175/300 = **58.3%** | **[52.7, 63.8]** | 154 |
  | **+ Barbarian&Scholar hybrid (cheap+frontier tail)** | 100/300 = **33.3%** | **[28.2, 38.8]** | 148 |
 
  The closed-loop repair loop **~doubles** the resolve-rate (7.7% → 15.3%) on the *same cheap model*
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
  "name": "@metaharness/darwin",
- "version": "0.2.5",
- "description": "An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker): 7.7% open-loop -> 15.3% +repair -> 29.3% (deepseek-v4-pro base) -> 40.3% v4-pro+frontier-tail hybrid, ~$0.01-$0.39/instance (vs $1-20 for frontier agents). The harness, not the model, is the lever. Dependency-free (Node built-ins).",
+ "version": "0.2.7",
+ "description": "An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker, verified): 7.7% open-loop -> 15.3% +repair -> 29.3% (v4-pro base) -> 40.3% 2-tier -> 58.3% 3-tier cheap->frontier escalation, ~$0.01-$0.74/instance (vs $1-20 for frontier agents). The harness, not the model, is the lever. Dependency-free (Node built-ins).",
  "type": "module",
  "main": "./dist/index.js",
  "types": "./dist/index.d.ts",