@metaharness/darwin 0.2.0 → 0.2.1
- package/README.md +54 -22
- package/package.json +10 -7
package/README.md
CHANGED
@@ -1,17 +1,26 @@
 # @metaharness/darwin
 
->
-
-
-
-
-
-
-
-
-
-
-
+> **An LLM supercharger and cost optimizer.** Keep your model frozen — evolve the
+> harness around it so a *cheap* model performs like an expensive one, for a fraction
+> of the cost.
+
+Darwin Mode makes the LLM you already use **measurably better and cheaper** by
+evolving the *operating system around it* — planner, context builder, reviewer,
+retry/tool/memory/score policy — instead of paying for a bigger model. It mutates one
+surface at a time, tests each change in a sandbox, and keeps only what *measurably*
+improves, building an archive of successful descendants. No weight updates, no
+fine-tuning — just a population, a benchmark, and an archive.
+
+**Why it pays off (measured, not marketing):**
+- **Cheap beats frontier.** On a 15-model × 6-language execution benchmark, DeepSeek-V3
+($0.4/Mtok) tops quality-per-dollar — and the harness, not the model, is the lever (ADR-085).
+- **Real bug-fixing for pennies.** Resolves real **SWE-bench Lite** issues at **~$0.01/instance**
+with a sub-$1/Mtok model (ADR-142–146) — vs. $1–20/instance for frontier-model agents.
+- **The harness is the multiplier.** Evolving context-window/selection/retry policy lifts a
+fixed model's measured outcomes (e.g. `finalScore 0.765 → 0.985`, ADR-103) — same model, better results.
+
+This follows the **Darwin Gödel Machine** lineage: iteratively mutate the source of a
+coding agent, then *empirically validate* each variant.
 
 ```
 repo
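The added README text above describes a loop that mutates one surface at a time, tests the change in a sandbox, and archives only measured improvements. The following is a minimal sketch of that idea, not the package's implementation: every name here (`Surface`, `Variant`, `evolve`, `toyMutate`, `toyEvaluate`) is hypothetical, and the toy scorer merely stands in for a sandboxed benchmark run.

```typescript
// Hypothetical sketch: none of these names come from @metaharness/darwin.
type Surface = "planner" | "contextBuilder" | "reviewer" | "retryPolicy";
type Surfaces = Record<Surface, string>;
type Variant = { id: number; parentId: number | null; surfaces: Surfaces; score: number };

type Mutate = (parent: Variant, surface: Surface, generation: number) => Surfaces;
type Evaluate = (surfaces: Surfaces) => number;

// Greedy loop: mutate exactly one surface per generation, score the candidate,
// and archive it only if it strictly beats the current best.
function evolve(
  seed: Variant,
  surfaceOrder: Surface[],
  generations: number,
  mutate: Mutate,
  evaluate: Evaluate,
): Variant[] {
  const archive: Variant[] = [seed];
  let best = seed;
  for (let gen = 0; gen < generations; gen++) {
    const surface = surfaceOrder[gen % surfaceOrder.length];
    const candidate = mutate(best, surface, gen);
    const score = evaluate(candidate); // stand-in for a sandboxed benchmark run
    if (score > best.score) {
      best = { id: archive.length, parentId: best.id, surfaces: candidate, score };
      archive.push(best);
    }
  }
  return archive;
}

const SURFACES: Surface[] = ["planner", "contextBuilder", "reviewer", "retryPolicy"];
const seed: Variant = {
  id: 0,
  parentId: null,
  score: 0,
  surfaces: { planner: "", contextBuilder: "", reviewer: "", retryPolicy: "" },
};

// Toy mutator: even generations append a helpful token, odd ones a harmful token.
const toyMutate: Mutate = (parent, surface, gen) => ({
  ...parent.surfaces,
  [surface]: parent.surfaces[surface] + (gen % 2 === 0 ? "+" : "-"),
});

// Toy scorer: helpful tokens minus harmful tokens across all surfaces.
const toyEvaluate: Evaluate = (s) => {
  const text = SURFACES.map((k) => s[k]).join("");
  return (text.split("+").length - 1) - (text.split("-").length - 1);
};

const archive = evolve(seed, SURFACES, 6, toyMutate, toyEvaluate);
console.log(archive.map((v) => `${v.id}<-${v.parentId} score=${v.score}`).join("\n"));
```

In this toy run the odd-generation mutations lower the score and are discarded, so the archive only ever grows with strict improvements. A real population-based search would likely keep more than one lineage; the greedy single-parent loop is a simplification chosen for brevity.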
@@ -244,9 +253,10 @@ By default the sandbox runs the **repo's test command**, which is independent of
 the harness surfaces — so the behavioural manifold is degenerate (measured:
 `nicheEntropy = 0`, ADR-099). `sandboxMode: 'mock'` (ADR-102) instead runs a
 **deterministic surface-driven agent loop**, so a variant's traces depend on its
-surface content and the manifold comes alive.
-
-deferred
+surface content and the manifold comes alive. `sandboxMode: 'agent'` (ADR-106)
+runs a variant's **real surface code** in a child process. The real-LLM-on-real-code
+substrate is **no longer deferred** — it shipped (ADR-106→141) and now runs on
+**canonical SWE-bench Lite** (ADR-142+, below).
 
 ### Validated results (real, reproducible — see `bench/results/`)
 
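The hunk above reports `nicheEntropy = 0` when every variant behaves identically. The README excerpt does not define the metric, so the sketch below assumes it is the Shannon entropy (in bits) of how variants are distributed over behavioural niches. The function name `shannonEntropyBits` and the niche labels are hypothetical; this only illustrates why a degenerate manifold would yield zero under that assumption.

```typescript
// Assumption: "niche entropy" is read here as Shannon entropy over niche occupancy.
// The package may compute it differently; this is an illustrative sketch only.
function shannonEntropyBits(nicheIds: string[]): number {
  if (nicheIds.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const id of nicheIds) {
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }
  let entropy = 0;
  counts.forEach((count) => {
    const p = count / nicheIds.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

// Every variant produces the same trace: one occupied niche, zero entropy.
const degenerate = shannonEntropyBits(["same", "same", "same", "same"]);

// Traces depend on surface content: four variants spread over four niches.
const diverse = shannonEntropyBits(["a", "b", "c", "d"]);

console.log({ degenerate, diverse });
```

Under this reading, a sandbox whose outcome ignores the harness surfaces puts every variant in one niche, which is consistent with the measured zero quoted in the README.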
@@ -261,15 +271,37 @@ deferred**.
 - **Polyglot model frontier** (ADR-085): 15 models × 6 languages, execution-scored;
 DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar — cheap beats frontier for code.
 
+### Canonical SWE-bench Lite (real, official harness — ADR-142–146)
+
+Run on the **full 300** SWE-bench Lite (test) instances, scored by the **official
+`swebench` Docker harness** — no cherry-picking, tight CIs. Solver = relevance-ranked
+context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0.01/instance.
+
+| config | resolved | Wilson 95% CI | ADR |
+|---|---|---|---|
+| baseline (open-loop, single-shot) | 23/300 = **7.7%** | [5.2, 11.2] | 144 |
+| + LLM localization | 24/300 = **8.0%** | [5.4, 11.6] | 146 |
+| + closed-loop repair (test-feedback) | *measuring (full 300)* | — | 149 |
+
+Honest framing: this is a **cheap-model, single-shot baseline** ($0.4/Mtok, <1¢/instance) —
+leaderboard leaders hit 65–88% on Verified using iterative agentic loops + frontier models at
+$1–20/instance. Localization lifted file-selection recall **44.7% → 59.7%** but resolve-rate held
+flat — the bottleneck relocated to *patch emission* (ADR-146). The repair loop and a hybrid
+cheap→frontier escalation (ADR-148) are the measured next levers. Every number is reproducible
+under `bench/swebench/`.
+
 ## Status
 
-**Working
-`DeterministicMutator` is seeded and signature-preserving;
-(ADR-085) is the production LLM `CodeGenerator`, behind the
-`validateGeneratedCode` gate. The safety boundary, scorer, archive, and bench
-layer are kernel code. The frontier
-
-
+**Working, empirically validated on both the mock substrate *and* canonical
+SWE-bench Lite.** The `DeterministicMutator` is seeded and signature-preserving;
+the `OpenRouterMutator` (ADR-085) is the production LLM `CodeGenerator`, behind the
+*same* `validateGeneratedCode` gate. The safety boundary, scorer, archive, and bench
+layer are kernel code. The real-LLM-on-real-code frontier (once deferred) is now
+**measured**: a reproducible **7.7% [5.2–11.2%]** open-loop baseline on the full
+SWE-bench Lite (ADR-144), with localization (146), the repair loop (149), and a
+hybrid cheap→frontier escalation (148) as the active levers. Darwin Mode also ships
+**integrated into the `metaharness` scaffolder** — `npx metaharness <name>` produces
+a harness with `npm run evolve` out of the box (ADR-147).
 
 ## License
 
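The Wilson 95% intervals quoted in the SWE-bench Lite table above can be checked by hand. The sketch below uses the standard Wilson score interval with z = 1.96. The helper names `wilsonInterval` and `pct` are illustrative and not part of the package, and the package's own benchmark scripts may compute the interval differently.

```typescript
// Standard Wilson score interval for a binomial proportion.
// Returns [lower, upper] as fractions in [0, 1].
function wilsonInterval(successes: number, trials: number, z = 1.96): [number, number] {
  if (trials <= 0) throw new RangeError("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials));
  return [center - half, center + half];
}

// Format a fraction as a percentage with one decimal place.
const pct = (x: number): string => (100 * x).toFixed(1);

const baseline = wilsonInterval(23, 300);     // baseline row (ADR-144)
const localization = wilsonInterval(24, 300); // + LLM localization row (ADR-146)

console.log(`baseline:     [${pct(baseline[0])}, ${pct(baseline[1])}]`);         // [5.2, 11.2]
console.log(`localization: [${pct(localization[0])}, ${pct(localization[1])}]`); // [5.4, 11.6]
```

Both rows reproduce the table's bounds. The two intervals overlap almost entirely, which matches the README's statement that the resolve rate held flat after localization.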
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
 "name": "@metaharness/darwin",
-"version": "0.2.
-"description": "
+"version": "0.2.1",
+"description": "An LLM supercharger and cost optimizer: keep your model frozen and evolve the harness around it so a cheap model performs like an expensive one — measurably better, far cheaper. Mutate→sandbox→score→archive; promote only safe, measured wins. Validated on real SWE-bench Lite (~$0.01/instance). Dependency-free (Node built-ins).",
 "type": "module",
 "main": "./dist/index.js",
 "types": "./dist/index.d.ts",
@@ -26,15 +26,18 @@
 "lint": "tsc --noEmit"
 },
 "keywords": [
+"llm",
+"cost-optimization",
+"llm-optimizer",
+"cheap-llm",
+"compute-arbitrage",
 "agent-harness",
-"darwin-mode",
 "self-improvement",
 "evolutionary-search",
+"swe-bench",
 "metaharness",
-"
-"
-"sandbox",
-"benchmark"
+"darwin-mode",
+"sandbox"
 ],
 "author": "rUv <ruv@ruv.net>",
 "license": "MIT",