@metaharness/darwin 0.2.3 → 0.2.5
- package/CHANGELOG.md +8 -0
- package/README.md +2 -0
- package/package.json +2 -2
package/CHANGELOG.md
CHANGED
@@ -2,6 +2,14 @@
 
 All notable changes to this package. Dates UTC.
 
+## 0.2.5 — 2026-06-19
+
+- New ceiling (ADR-152): **v4-pro + Scholar hybrid = 121/300 = 40.3%** [34.9,46.0] on full SWE-bench Lite — 5.2x the 7.7% baseline. Two levers stack: stronger cheap base (v4-pro, 88/300) + frontier-tail escalation (sonnet-4 recovers 33/212). Blended ~$0.39/instance.
+
+## 0.2.4 — 2026-06-19
+
+- New result (ADR-151): swapping the cheap base deepseek-V3 -> deepseek-v4-pro nearly doubles the repair floor **15.3% -> 29.3%** [24.5,34.7] on full SWE-bench Lite (same harness). Falsifies "paradigm exhausted regardless of model IQ."
+
 ## 0.2.3 — 2026-06-19
 
 - Add `LEARNINGS.md` — the measured findings distilled into harness defaults (repair=2x lever, cost-routing, Barbarian&Scholar tiering, format-contract, batch-eval discipline, capability floor).
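The 0.2.5 entry describes two stacked tiers: the cheap base resolves 88 of 300 instances, the remaining 212 are escalated, and the frontier model recovers 33 of them. The sketch below only reproduces that arithmetic. `stackTiers` and `HybridResult` are hypothetical names chosen for illustration; they are not part of the `@metaharness/darwin` API.

```typescript
// Hypothetical helper for illustration; not an export of @metaharness/darwin.
interface HybridResult {
  escalated: number;    // instances the cheap base failed, passed to the frontier tier
  resolved: number;     // total resolved across both tiers
  rate: number;         // resolved / total
  tailRecovery: number; // share of escalated instances the frontier tier recovered
}

function stackTiers(total: number, baseResolved: number, tailRecovered: number): HybridResult {
  const escalated = total - baseResolved;
  if (total <= 0 || baseResolved < 0 || escalated < 0 || tailRecovered < 0 || tailRecovered > escalated) {
    throw new RangeError("counts are inconsistent");
  }
  const resolved = baseResolved + tailRecovered;
  return {
    escalated,
    resolved,
    rate: resolved / total,
    // Guard against division by zero when the base resolves everything.
    tailRecovery: escalated === 0 ? 0 : tailRecovered / escalated,
  };
}

// Counts taken from the 0.2.5 changelog entry above.
const hybrid = stackTiers(300, 88, 33);
console.log(hybrid.escalated);                           // 212
console.log(hybrid.resolved);                            // 121
console.log((hybrid.rate * 100).toFixed(1) + "%");       // 40.3%
console.log((hybrid.tailRecovery * 100).toFixed(1) + "%"); // 15.6%
```

The 15.6% tail recovery is derived here from the 33/212 figure; the changelog itself does not state it as a percentage.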
package/README.md
CHANGED
@@ -284,6 +284,8 @@ context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0
 | baseline (open-loop, single-shot) | 23/300 = **7.7%** | [5.2, 11.2] | 144 |
 | + LLM localization | 24/300 = **8.0%** | [5.4, 11.6] | 146 |
 | **+ closed-loop repair (test-feedback, ≤3)** | 46/300 = **15.3%** | **[11.7, 19.8]** | 149 |
+| **+ swap base → deepseek-v4-pro (cheap)** | 88/300 = **29.3%** | **[24.5, 34.7]** | 151 |
+| **+ v4-pro + Scholar hybrid** | 121/300 = **40.3%** | **[34.9, 46.0]** | 152 |
 | **+ Barbarian&Scholar hybrid (cheap+frontier tail)** | 100/300 = **33.3%** | **[28.2, 38.8]** | 148 |
 
 The closed-loop repair loop **~doubles** the resolve-rate (7.7% → 15.3%) on the *same cheap model*
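The bracketed ranges in the table are not labeled in this hunk. They appear consistent with a 95% Wilson score interval on a binomial proportion with n = 300, so the sketch below assumes that method; the README may use a different one. `wilsonInterval` is a hypothetical helper, not a function of the package.

```typescript
// Assumption: the table's intervals are 95% Wilson score intervals (z = 1.96).
// Hypothetical helper for illustration; not an export of @metaharness/darwin.
function wilsonInterval(successes: number, n: number, z: number = 1.96): [number, number] {
  if (n <= 0 || successes < 0 || successes > n) {
    throw new RangeError("need 0 <= successes <= n and n > 0");
  }
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

// Resolved counts copied from the table rows above.
const rows: { label: string; resolved: number }[] = [
  { label: "baseline", resolved: 23 },
  { label: "+ closed-loop repair", resolved: 46 },
  { label: "+ deepseek-v4-pro base", resolved: 88 },
  { label: "+ v4-pro + Scholar hybrid", resolved: 121 },
];

for (const { label, resolved } of rows) {
  const [lo, hi] = wilsonInterval(resolved, 300);
  console.log(`${label}: [${(lo * 100).toFixed(1)}, ${(hi * 100).toFixed(1)}]`);
}
// baseline: [5.2, 11.2]
// + closed-loop repair: [11.7, 19.8]
// + deepseek-v4-pro base: [24.5, 34.7]
// + v4-pro + Scholar hybrid: [34.9, 46.0]
```

Under that assumption the computed bounds match the table to one decimal place, which is why the Wilson form was chosen here over the simpler normal approximation.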
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
 "name": "@metaharness/darwin",
-"version": "0.2.
-"description": "An LLM supercharger and cost optimizer: freeze
+"version": "0.2.5",
+"description": "An LLM supercharger and cost optimizer: freeze the model class, evolve the harness. Measured on full SWE-bench Lite (300, official swebench Docker): 7.7% open-loop -> 15.3% +repair -> 29.3% (deepseek-v4-pro base) -> 40.3% v4-pro+frontier-tail hybrid, ~$0.01-$0.39/instance (vs $1-20 for frontier agents). The harness, not the model, is the lever. Dependency-free (Node built-ins).",
 "type": "module",
 "main": "./dist/index.js",
 "types": "./dist/index.d.ts",