@metaharness/darwin 0.2.2 → 0.2.4
- package/CHANGELOG.md +8 -0
- package/LEARNINGS.md +63 -0
- package/README.md +2 -1
- package/package.json +4 -3
package/CHANGELOG.md
CHANGED
@@ -2,6 +2,14 @@

 All notable changes to this package. Dates UTC.

+## 0.2.4 — 2026-06-19
+
+- New result (ADR-151): swapping the cheap base deepseek-V3 -> deepseek-v4-pro nearly doubles the repair floor **15.3% -> 29.3%** [24.5,34.7] on full SWE-bench Lite (same harness). Falsifies "paradigm exhausted regardless of model IQ."
+
+## 0.2.3 — 2026-06-19
+
+- Add `LEARNINGS.md` — the measured findings distilled into harness defaults (repair=2x lever, cost-routing, Barbarian&Scholar tiering, format-contract, batch-eval discipline, capability floor).
+
 ## 0.2.2 — 2026-06-19

 - **Docs: full SWE-bench Lite (300) evidence ladder** now in the description + README, all official `swebench` Docker harness, batch-verified:
package/LEARNINGS.md
ADDED
@@ -0,0 +1,63 @@
+# What the benchmarks taught us → harness defaults
+
+Empirical findings from the full SWE-bench Lite (300) arc (official `swebench` Docker harness,
+batch-verified — see `bench/results/RESULTS.md`). These are the *measured* reasons behind the
+recommended harness patterns. The headline: **the harness, not the model, is the dominant lever.**
+
+## 1. Closed-loop repair (test feedback) is the #1 lever — ~2× for free
+- open-loop single-shot: **7.7%** → + closed-loop repair (run the failing tests, feed the
+traceback back, retry ≤3): **15.3%** — on the *same cheap model*, ~$0.01/instance.
+- **Recommendation:** make iteration against ground-truth (compiler/tests) first-class. A model
+that can *see why it failed* beats a smarter model that can't. Prefer `retryPolicy` configs that
+consume real failure signal over blind retries.
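The closed-loop repair described in section 1 can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: `repairLoop`, `propose`, `runTests`, and the result shapes are hypothetical names, and "retry ≤3" is read here as one initial attempt plus up to three repair attempts.

```typescript
// Hypothetical shapes; none of these names come from @metaharness/darwin.
type TestResult = { passed: boolean; traceback: string };
type RepairOutcome = { patch: string | null; attempts: number };

// propose: asks the model for a patch, given the previous failure output
//          (an empty string on the first attempt).
// runTests: applies the patch and runs the failing tests against it.
function repairLoop(
  propose: (feedback: string) => string,
  runTests: (patch: string) => TestResult,
  maxRetries: number = 3
): RepairOutcome {
  let feedback = "";
  // One initial attempt plus up to maxRetries repair attempts (an assumption).
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    const patch = propose(feedback);
    const result = runTests(patch);
    if (result.passed) return { patch, attempts: attempt };
    // Feed the real failure signal back instead of retrying blindly.
    feedback = result.traceback;
  }
  return { patch: null, attempts: maxRetries + 1 };
}
```

The point of the design is that `propose` receives the traceback from the previous attempt, so each retry is conditioned on ground truth rather than being a fresh sample.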
+
+## 2. Localization fixes retrieval, not emission — beware the "emission wall"
+- LLM file-localization lifted gold-file recall **+15pp** but resolve-rate stayed flat (8.0%).
+The bottleneck was *writing a valid patch*, not *finding the file*.
+- **Recommendation:** measure where you actually lose (selection vs emission) before optimizing
+retrieval. Don't assume better context = better output.
+
+## 3. Format contract + fit-in-context unblocks weak/local models (0 → 13/25 applied)
+- A small local model emitted prose summaries instead of edits until the harness (a) served enough
+context window, (b) carried the search/replace **format contract in a system message + worked
+example**, and (c) **shrank per-file context to fit the window** (truncation silently dropped the
+instruction). Apply-rate went 0 → ~50%.
+- **Recommendation:** put the output-format contract in a *system* role with an example; size the
+prompt to the model's real context; never let truncation eat the instruction.
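One way to picture the recommendation in section 3 is a prompt builder that keeps the format contract in the system message and trims only the per-file context. The sketch below is an assumption-laden illustration: `buildPrompt`, `FORMAT_CONTRACT`, and the message shape are hypothetical, the contract wording is invented for the example, and the budget is counted in characters as a crude stand-in for tokens.

```typescript
type ChatMessage = { role: "system" | "user"; content: string };
type SourceFile = { path: string; text: string };

// Hypothetical contract text with a worked example; the wording the real
// harness uses is not shown in the source.
const FORMAT_CONTRACT = [
  "Reply ONLY with search/replace edits in this form:",
  "<<<<<<< SEARCH",
  "return a - b",
  "=======",
  "return a + b",
  ">>>>>>> REPLACE",
].join("\n");

// budgetChars approximates the model's context window in characters (assumption).
function buildPrompt(issue: string, files: SourceFile[], budgetChars: number): ChatMessage[] {
  // The contract and the issue text are never truncated.
  const fixed = FORMAT_CONTRACT.length + issue.length;
  const remaining = Math.max(0, budgetChars - fixed);
  // Each file costs "## " + path + "\n" plus a two-character separator.
  const headerChars = files.reduce((n, f) => n + f.path.length + 6, 0);
  const perFile =
    files.length > 0 ? Math.floor(Math.max(0, remaining - headerChars) / files.length) : 0;
  // Only the file bodies shrink to fit the window.
  const context = files.map((f) => `## ${f.path}\n${f.text.slice(0, perFile)}`).join("\n\n");
  return [
    { role: "system", content: FORMAT_CONTRACT },
    { role: "user", content: `${issue}\n\n${context}` },
  ];
}
```

Because the trimming happens before the messages are assembled, an oversized file can never push the instruction out of the window.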
+
+## 4. Cheap-first + cost-aware routing — 31× cheaper per resolve
+- Router probe: `pareto-code`→deepseek-v4-pro resolved at **$0.21/resolve** vs `fusion`→opus-4.8 at
+**$6.57/resolve** — same task, 31× cost gap for +1 resolve.
+- **Recommendation:** default to the cheapest model that clears the task; reserve frontier models for
+measured capability gaps. Track **$/resolve**, not just resolve-rate.
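The $/resolve metric from section 4 is simple arithmetic, sketched here under stated assumptions: `costPerResolve`, `pickRoute`, and `RouteStats` are hypothetical helper names, not part of the package. Only the two per-resolve figures at the bottom come from the text above.

```typescript
// Dollars spent per resolved instance; Infinity when nothing was resolved.
function costPerResolve(totalCostUsd: number, resolved: number): number {
  return resolved > 0 ? totalCostUsd / resolved : Infinity;
}

type RouteStats = { route: string; resolved: number; totalCostUsd: number };

// Cheapest route by $/resolve among those that clear a minimum resolve count.
function pickRoute(routes: RouteStats[], minResolved: number): RouteStats | undefined {
  let best: RouteStats | undefined;
  for (const r of routes) {
    if (r.resolved < minResolved) continue;
    if (
      !best ||
      costPerResolve(r.totalCostUsd, r.resolved) <
        costPerResolve(best.totalCostUsd, best.resolved)
    ) {
      best = r;
    }
  }
  return best;
}

// Ratio of the two $/resolve figures quoted above.
const quotedGap = 6.57 / 0.21; // about 31.3
```

Filtering on `minResolved` first and ranking on cost second mirrors the stated default: the cheapest model that clears the task.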
+
+## 5. Barbarian & Scholar — tier the models, escalate only the residual (33.3% at ~6× less cost)
+- Cheap base (deepseek + repair) banks the easy 46/300; a frontier "Scholar" (sonnet-4 + repair)
+escalated **only to the 254 it failed** cracks 55 more → **100/300 = 33.3%**, blended
+~$0.34/instance vs ~$2 to run frontier on all 300.
+- **Recommendation:** two-tier orchestration — cheap sweep, then frontier on the residual — is far
+more cost-efficient than one strong model everywhere (you'd waste 5/6 of frontier spend re-solving
+what cheap already gets).
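The two-tier orchestration in section 5 amounts to a cheap sweep followed by escalation of the residual. A minimal sketch, assuming a flat per-instance price for each tier; `runTiered` and `Solver` are hypothetical names, and any prices passed in are illustrative inputs rather than measured values.

```typescript
// Hypothetical solver shape: a flat per-instance cost and a pass/fail result.
type Solver = { name: string; costUsd: number; solve: (id: string) => boolean };

type TieredReport = { resolved: number; escalated: number; costPerInstanceUsd: number };

function runTiered(ids: string[], cheap: Solver, frontier: Solver): TieredReport {
  let resolved = 0;
  let escalated = 0;
  let totalCostUsd = 0;
  for (const id of ids) {
    // Every instance gets the cheap sweep first.
    totalCostUsd += cheap.costUsd;
    if (cheap.solve(id)) {
      resolved++;
      continue;
    }
    // Only the residual pays for the frontier model.
    escalated++;
    totalCostUsd += frontier.costUsd;
    if (frontier.solve(id)) resolved++;
  }
  return {
    resolved,
    escalated,
    costPerInstanceUsd: ids.length > 0 ? totalCostUsd / ids.length : 0,
  };
}
```

The blended cost depends on how many instances the cheap tier banks, which is why the sweep order matters more than the choice of frontier model.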
+
+## 6. The repair lift is model-bound below a capability floor (~14B)
+- Repair did nothing for a 7B (4%→4%) but lifted a 14B (8%→12%) and doubled a hosted model. The loop
+needs the model to *occasionally* produce a correct-ish patch to converge toward.
+- **Recommendation:** don't expect harness scaffolding to rescue a model below the task's reasoning
+floor; pick the smallest model *above* it, then let the harness multiply it.
+
+## 7. Methodology: only batch-eval on final predictions is authoritative
+- In-loop "resolved" counters drifted from clean batch eval by 1.5–5× (both directions — flaky
+passes over-count; Docker-hang false-negatives under-count). Every reported number here is a
+fresh batch eval on the final saved predictions.
+- **Recommendation:** never report the in-loop signal; re-evaluate the artifact you'd actually ship.
+
+## 8. Engineering robustness (or your run lies to you)
+- Concurrency clones rate-limit (6-wide anon GitHub clones → 63 fetch failures): **cap at 2–3**,
+retry-with-backoff, free each clone. One instance (`psf__requests-2317`) reliably hangs Docker
+past timeout → known-flaky exclusion (`bench/swebench/KNOWN_FLAKY.md`). Watch for wedged containers.
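The "cap at 2–3, retry-with-backoff" advice in section 8 could look something like the sketch below. All three helpers (`backoffDelays`, `withRetry`, `mapWithLimit`) are hypothetical names, the delay values are illustrative, and `sleep` is injected by the caller so the sketch does not depend on any particular timer API.

```typescript
// Exponential backoff schedule in milliseconds, capped at capMs.
function backoffDelays(retries: number, baseMs: number = 1000, capMs: number = 30000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) delays.push(Math.min(capMs, baseMs * Math.pow(2, i)));
  return delays;
}

// Retries fn after each delay in delaysMs; rethrows once the schedule is used up.
async function withRetry<R>(
  fn: () => Promise<R>,
  delaysMs: number[],
  sleep: (ms: number) => Promise<void>
): Promise<R> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= delaysMs.length) throw err;
      await sleep(delaysMs[attempt]);
    }
  }
}

// Runs fn over items with at most `limit` calls in flight; results keep input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers: Promise<void>[] = [];
  for (let w = 0; w < Math.min(limit, items.length); w++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

A clone step would then be wrapped as `mapWithLimit(repos, 2, (repo) => withRetry(() => clone(repo), backoffDelays(4), sleep))`, where `clone` and `sleep` are supplied by the caller.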
+
+---
+
+Verdict: this paradigm (localize + search/replace + repair + tiered escalation) tops out ~33% on
+SWE-bench Lite with a cheap base. The 65–88% agentic-SOTA tier needs a **multi-step autonomous agent**
+(bash, dir-navigation, long-horizon discovery) — an architecture change, not more knob-tuning.
package/README.md
CHANGED
@@ -273,7 +273,7 @@ substrate is **no longer deferred** — it shipped (ADR-106→141) and now runs

 ### Canonical SWE-bench Lite (real, official harness — ADR-142–149)

-> Full reproducible evidence
+> Full reproducible evidence: [`bench/results/RESULTS.md`](bench/results/RESULTS.md) · measured best-practices: [`LEARNINGS.md`](LEARNINGS.md) · known-flaky exclusions: [`bench/swebench/KNOWN_FLAKY.md`](bench/swebench/KNOWN_FLAKY.md)

 Run on the **full 300** SWE-bench Lite (test) instances, scored by the **official
 `swebench` Docker harness** — no cherry-picking, tight CIs. Solver = relevance-ranked
@@ -284,6 +284,7 @@ context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0
 | baseline (open-loop, single-shot) | 23/300 = **7.7%** | [5.2, 11.2] | 144 |
 | + LLM localization | 24/300 = **8.0%** | [5.4, 11.6] | 146 |
 | **+ closed-loop repair (test-feedback, ≤3)** | 46/300 = **15.3%** | **[11.7, 19.8]** | 149 |
+| **+ swap base → deepseek-v4-pro (cheap)** | 88/300 = **29.3%** | **[24.5, 34.7]** | 151 |
 | **+ Barbarian&Scholar hybrid (cheap+frontier tail)** | 100/300 = **33.3%** | **[28.2, 38.8]** | 148 |

 The closed-loop repair loop **~doubles** the resolve-rate (7.7% → 15.3%) on the *same cheap model*
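The bracketed intervals in the README table are numerically consistent with 95% Wilson score intervals on n = 300, although the source does not name the method it used. The sketch below is one way to reproduce them; `wilsonInterval` is a hypothetical helper, and z = 1.96 is an assumption.

```typescript
// Wilson score interval for a binomial proportion; returns [lower, upper] as fractions.
function wilsonInterval(successes: number, n: number, z: number = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

// Format an interval the way the table does, as percentages with one decimal place.
function asPercentRange(interval: [number, number]): string {
  return "[" + (interval[0] * 100).toFixed(1) + ", " + (interval[1] * 100).toFixed(1) + "]";
}

const newRowCi = asPercentRange(wilsonInterval(88, 300)); // "[24.5, 34.7]"
```

The same call with 46, 23, or 100 successes out of 300 gives the intervals listed for the repair, baseline, and hybrid rows.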
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
 "name": "@metaharness/darwin",
-"version": "0.2.
-"description": "An LLM supercharger and cost optimizer: freeze
+"version": "0.2.4",
+"description": "An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker): cheap-model repair 15.3% (deepseek-V3) -> 29.3% (deepseek-v4-pro) -> 33.3% Barbarian&Scholar hybrid, ~$0.01-$0.34/instance (vs $1-20 for frontier agents). The harness, not the model, is the lever. Dependency-free (Node built-ins).",
 "type": "module",
 "main": "./dist/index.js",
 "types": "./dist/index.d.ts",
@@ -19,7 +19,8 @@
 "README.md",
 "SECURITY.md",
 "LICENSE",
-"CHANGELOG.md"
+"CHANGELOG.md",
+"LEARNINGS.md"
 ],
 "scripts": {
 "build": "tsc",