@metaharness/darwin 0.2.2 → 0.2.3

package/CHANGELOG.md CHANGED
@@ -2,6 +2,10 @@
 
  All notable changes to this package. Dates UTC.
 
+ ## 0.2.3 — 2026-06-19
+
+ - Add `LEARNINGS.md` — the measured findings distilled into harness defaults (repair=2x lever, cost-routing, Barbarian&Scholar tiering, format-contract, batch-eval discipline, capability floor).
+
  ## 0.2.2 — 2026-06-19
 
  - **Docs: full SWE-bench Lite (300) evidence ladder** now in the description + README, all official `swebench` Docker harness, batch-verified:
package/LEARNINGS.md ADDED
@@ -0,0 +1,63 @@
+ # What the benchmarks taught us → harness defaults
+
+ Empirical findings from the full SWE-bench Lite (300) arc (official `swebench` Docker harness,
+ batch-verified — see `bench/results/RESULTS.md`). These are the *measured* reasons behind the
+ recommended harness patterns. The headline: **the harness, not the model, is the dominant lever.**
+
+ ## 1. Closed-loop repair (test feedback) is the #1 lever — ~2× for free
+ - open-loop single-shot: **7.7%** → + closed-loop repair (run the failing tests, feed the
+ traceback back, retry ≤3): **15.3%** — on the *same cheap model*, ~$0.01/instance.
+ - **Recommendation:** make iteration against ground-truth (compiler/tests) first-class. A model
+ that can *see why it failed* beats a smarter model that can't. Prefer `retryPolicy` configs that
+ consume real failure signal over blind retries.
+
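Illustrative sketch for section 1 (not code shipped in this package): a minimal closed-loop repair loop in TypeScript, assuming hypothetical `generate` and `runTests` callbacks. It reads "retry ≤3" as up to three retries after the first attempt, which is an assumption, and it does not model the package's `retryPolicy` config, whose shape this diff does not show.

```typescript
// Hypothetical types: none of these names come from @metaharness/darwin.
type RepairTestResult = { passed: boolean; traceback: string };

interface RepairOutcome {
  patch: string | null; // the patch that passed, or null if none did
  attempts: number; // total generations, including the first
}

/**
 * Closed-loop repair: generate a patch, run the tests, and on failure feed
 * the traceback into the next generation. Synchronous for brevity; a real
 * harness would await model and sandbox calls.
 */
function repairLoop(
  generate: (feedback: string | null) => string,
  runTests: (patch: string) => RepairTestResult,
  maxRetries = 3, // assumed reading of "retry ≤3": retries after the first attempt
): RepairOutcome {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const patch = generate(feedback);
    const result = runTests(patch);
    if (result.passed) return { patch, attempts: attempt };
    feedback = result.traceback; // real failure signal, not a blind retry
  }
  return { patch: null, attempts: maxRetries + 1 };
}
```

The first call receives `null` feedback (the open-loop case); every later call sees the previous traceback, which is the difference between repair and a blind retry.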
+ ## 2. Localization fixes retrieval, not emission — beware the "emission wall"
+ - LLM file-localization lifted gold-file recall **+15pp** but resolve-rate stayed flat (8.0%).
+ The bottleneck was *writing a valid patch*, not *finding the file*.
+ - **Recommendation:** measure where you actually lose (selection vs emission) before optimizing
+ retrieval. Don't assume better context = better output.
+
+ ## 3. Format contract + fit-in-context unblocks weak/local models (0 → 13/25 applied)
+ - A small local model emitted prose summaries instead of edits until the harness (a) served enough
+ context window, (b) carried the search/replace **format contract in a system message + worked
+ example**, and (c) **shrank per-file context to fit the window** (truncation silently dropped the
+ instruction). Apply-rate went 0 → ~50%.
+ - **Recommendation:** put the output-format contract in a *system* role with an example; size the
+ prompt to the model's real context; never let truncation eat the instruction.
+
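Illustrative sketch for section 3 (not code shipped in this package): one way to keep the format contract out of truncation's reach is to reserve space for the contract and the task first and trim only the per-file context. The `buildPrompt` helper and its types are hypothetical, and characters stand in for tokens, which is a rough assumption.

```typescript
// Hypothetical shapes; the package's real prompt builder is not shown in this diff.
type ChatMessage = { role: "system" | "user"; content: string };
type SourceFile = { path: string; text: string };

/**
 * Build a two-message prompt whose total size stays within `budgetChars`.
 * The contract (format rules plus worked example) and the task are reserved
 * first; only file context is trimmed, so truncation cannot drop the instruction.
 */
function buildPrompt(
  contract: string,
  task: string,
  files: SourceFile[],
  budgetChars: number,
): ChatMessage[] {
  const remaining = budgetChars - contract.length - task.length;
  if (remaining < 0) {
    throw new Error("contract and task alone exceed the context budget");
  }
  // Split what is left evenly across files (a deliberately simple policy).
  const share = files.length > 0 ? Math.floor(remaining / files.length) : 0;
  const sections = files.map((file) => {
    const header = `### ${file.path}\n`;
    const room = share - header.length - 1; // -1 for the trailing newline
    return room > 0 ? header + file.text.slice(0, room) + "\n" : "";
  });
  return [
    { role: "system", content: contract },
    { role: "user", content: sections.join("") + task },
  ];
}
```

Failing loudly when the contract and task do not fit is a design choice here: a silent truncation is the failure mode the finding warns about.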
+ ## 4. Cheap-first + cost-aware routing — 31× cheaper per resolve
+ - Router probe: `pareto-code`→deepseek-v4-pro resolved at **$0.21/resolve** vs `fusion`→opus-4.8 at
+ **$6.57/resolve** — same task, 31× cost gap for +1 resolve.
+ - **Recommendation:** default to the cheapest model that clears the task; reserve frontier models for
+ measured capability gaps. Track **$/resolve**, not just resolve-rate.
+
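Illustrative sketch for section 4 (not code shipped in this package): tracking $/resolve and picking the cheapest route that still clears a required number of resolves. The `RouteRun` shape and both function names are hypothetical.

```typescript
// Hypothetical record for one routed run; not a type from the package.
type RouteRun = { route: string; totalCostUsd: number; resolved: number };

/** Dollars spent per resolved instance; Infinity when nothing resolved. */
function costPerResolve(run: RouteRun): number {
  return run.resolved > 0 ? run.totalCostUsd / run.resolved : Infinity;
}

/**
 * Cheapest route by $/resolve among those that resolve at least `minResolved`
 * instances. Routes with zero resolves are never eligible.
 */
function cheapestClearing(runs: RouteRun[], minResolved: number): RouteRun | undefined {
  const floor = Math.max(1, minResolved);
  return runs
    .filter((run) => run.resolved >= floor)
    .sort((a, b) => costPerResolve(a) - costPerResolve(b))[0];
}
```

With the per-resolve figures quoted above, 6.57 / 0.21 ≈ 31.3, which is where the 31× gap comes from.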
+ ## 5. Barbarian & Scholar — tier the models, escalate only the residual (33.3% at ~6× less cost)
+ - Cheap base (deepseek + repair) banks the easy 46/300; a frontier "Scholar" (sonnet-4 + repair)
+ escalated **only to the 254 it failed** cracks 55 more → **100/300 = 33.3%**, blended
+ ~$0.34/instance vs ~$2 to run frontier on all 300.
+ - **Recommendation:** two-tier orchestration — cheap sweep, then frontier on the residual — is far
+ more cost-efficient than one strong model everywhere (you'd waste 5/6 of frontier spend re-solving
+ what cheap already gets).
+
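Illustrative sketch for section 5 (not code shipped in this package): the two-tier pattern as a cheap sweep followed by a frontier pass over the residual only. The `TierSolver` signature is hypothetical; its boolean stands in for a batch-evaluated "resolved" verdict, and the sketch makes no attempt to reproduce the cost figures above.

```typescript
// Hypothetical solver signature: returns true when the instance resolves.
type TierSolver = (instanceId: string) => boolean;

interface TierReport {
  resolvedCheap: string[]; // banked by the cheap tier
  resolvedFrontier: string[]; // cracked only after escalation
  unresolved: string[];
  frontierCalls: number; // how many instances paid frontier cost
}

/** Cheap sweep over everything, then the frontier tier on the residual only. */
function tieredSolve(ids: string[], cheap: TierSolver, frontier: TierSolver): TierReport {
  const resolvedCheap: string[] = [];
  const residual: string[] = [];
  for (const id of ids) {
    (cheap(id) ? resolvedCheap : residual).push(id);
  }
  const resolvedFrontier: string[] = [];
  const unresolved: string[] = [];
  for (const id of residual) {
    (frontier(id) ? resolvedFrontier : unresolved).push(id);
  }
  return { resolvedCheap, resolvedFrontier, unresolved, frontierCalls: residual.length };
}
```

`frontierCalls` is the number to watch: it counts only the instances the cheap tier failed, which is what keeps frontier spend off problems already solved.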
+ ## 6. The repair lift is model-bound below a capability floor (~14B)
+ - Repair did nothing for a 7B (4%→4%) but lifted a 14B (8%→12%) and doubled a hosted model. The loop
+ needs the model to *occasionally* produce a correct-ish patch to converge toward.
+ - **Recommendation:** don't expect harness scaffolding to rescue a model below the task's reasoning
+ floor; pick the smallest model *above* it, then let the harness multiply it.
+
+ ## 7. Methodology: only batch-eval on final predictions is authoritative
+ - In-loop "resolved" counters drifted from clean batch eval by 1.5–5× (both directions — flaky
+ passes over-count; Docker-hang false-negatives under-count). Every reported number here is a
+ fresh batch eval on the final saved predictions.
+ - **Recommendation:** never report the in-loop signal; re-evaluate the artifact you'd actually ship.
+
+ ## 8. Engineering robustness (or your run lies to you)
+ - Concurrency clones rate-limit (6-wide anon GitHub clones → 63 fetch failures): **cap at 2–3**,
+ retry-with-backoff, free each clone. One instance (`psf__requests-2317`) reliably hangs Docker
+ past timeout → known-flaky exclusion (`bench/swebench/KNOWN_FLAKY.md`). Watch for wedged containers.
+
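Illustrative sketch for section 8 (not code shipped in this package): capping in-flight clone jobs and retrying each with backoff. The helper names, the exponential schedule, and the default delays are assumptions; the source only says "cap at 2–3" and "retry-with-backoff". Freeing each clone afterwards is left to the caller's task.

```typescript
// Hypothetical helpers; delays and limits are illustrative defaults.
const sleepMs = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Retry `task` with exponential backoff; rethrows the last error. */
async function withBackoff<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= retries) throw err;
      await sleepMs(baseDelayMs * Math.pow(2, attempt));
    }
  }
}

/** Run `tasks` with at most `limit` in flight, preserving result order. */
async function runCapped<T>(tasks: Array<() => Promise<T>>, limit = 2): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  const queue = tasks.map((run, index) => ({ run, index }));
  async function worker(): Promise<void> {
    // shift() runs synchronously, so two workers never take the same job.
    for (let job = queue.shift(); job !== undefined; job = queue.shift()) {
      results[job.index] = await job.run();
    }
  }
  const workers: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, tasks.length); i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

A caller would wrap each clone in `withBackoff` and pass the wrapped jobs to `runCapped` with a limit of 2 or 3. If any job still fails after its retries, `runCapped` rejects while the other workers keep draining the queue.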
+ ---
+
+ Verdict: this paradigm (localize + search/replace + repair + tiered escalation) tops out ~33% on
+ SWE-bench Lite with a cheap base. The 65–88% agentic-SOTA tier needs a **multi-step autonomous agent**
+ (bash, dir-navigation, long-horizon discovery) — an architecture change, not more knob-tuning.
package/README.md CHANGED
@@ -273,7 +273,7 @@ substrate is **no longer deferred** — it shipped (ADR-106→141) and now runs
 
  ### Canonical SWE-bench Lite (real, official harness — ADR-142–149)
 
- > Full reproducible evidence + per-run provenance: [`bench/results/RESULTS.md`](bench/results/RESULTS.md) · known-flaky exclusions: [`bench/swebench/KNOWN_FLAKY.md`](bench/swebench/KNOWN_FLAKY.md)
+ > Full reproducible evidence: [`bench/results/RESULTS.md`](bench/results/RESULTS.md) · measured best-practices: [`LEARNINGS.md`](LEARNINGS.md) · known-flaky exclusions: [`bench/swebench/KNOWN_FLAKY.md`](bench/swebench/KNOWN_FLAKY.md)
 
  Run on the **full 300** SWE-bench Lite (test) instances, scored by the **official
  `swebench` Docker harness** — no cherry-picking, tight CIs. Solver = relevance-ranked
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@metaharness/darwin",
- "version": "0.2.2",
+ "version": "0.2.3",
  "description": "An LLM supercharger and cost optimizer: freeze your model, evolve the harness around it. Measured on full SWE-bench Lite (300, official swebench Docker): open-loop 7.7% -> +repair 15.3% -> +Barbarian&Scholar hybrid 33.3%, at ~$0.01-$0.34/instance (vs $1-20 for frontier agents). Mutate->sandbox->score->archive. Dependency-free (Node built-ins).",
  "type": "module",
  "main": "./dist/index.js",
@@ -19,7 +19,8 @@
  "README.md",
  "SECURITY.md",
  "LICENSE",
- "CHANGELOG.md"
+ "CHANGELOG.md",
+ "LEARNINGS.md"
  ],
  "scripts": {
  "build": "tsc",