@metaharness/darwin 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2) hide show
  1. package/README.md +54 -22
  2. package/package.json +10 -7
package/README.md CHANGED
@@ -1,17 +1,26 @@
1
1
  # @metaharness/darwin
2
2
 
3
- > Darwin Mode **the model is frozen; the harness evolves.**
4
-
5
- Bounded, empirical, population-based self-improvement of an agent harness
6
- (ADR-070…075). "Self-improving agents" is widely misread as "the model trains
7
- itself." Darwin Mode ships the practical version: an agent **modifies its own
8
- harness**, runs benchmarks in a sandbox, keeps the variants that *measurably*
9
- improve, and builds an **archive of successful descendants**. The foundation
10
- model never changes what evolves is the operating system around it (planner,
11
- context builder, reviewer, retry/tool/memory/score policy). This follows the
12
- **Darwin Gödel Machine** lineage: iteratively mutate the source of a coding
13
- agent, then *empirically validate* each variant — no weight updates, just a
14
- population, a benchmark, and an archive.
3
+ > **An LLM supercharger and cost optimizer.** Keep your model frozen evolve the
4
+ > harness around it so a *cheap* model performs like an expensive one, for a fraction
5
+ > of the cost.
6
+
7
+ Darwin Mode makes the LLM you already use **measurably better and cheaper** by
8
+ evolving the *operating system around it* planner, context builder, reviewer,
9
+ retry/tool/memory/score policy instead of paying for a bigger model. It mutates one
10
+ surface at a time, tests each change in a sandbox, and keeps only what *measurably*
11
+ improves, building an archive of successful descendants. No weight updates, no
12
+ fine-tuning just a population, a benchmark, and an archive.
13
+
14
+ **Why it pays off (measured, not marketing):**
15
+ - **Cheap beats frontier.** On a 15-model × 6-language execution benchmark, DeepSeek-V3
16
+ ($0.4/Mtok) tops quality-per-dollar — and the harness, not the model, is the lever (ADR-085).
17
+ - **Real bug-fixing for pennies.** Resolves real **SWE-bench Lite** issues at **~$0.01/instance**
18
+ with a sub-$1/Mtok model (ADR-142–146) — vs. $1–20/instance for frontier-model agents.
19
+ - **The harness is the multiplier.** Evolving context-window/selection/retry policy lifts a
20
+ fixed model's measured outcomes (e.g. `finalScore 0.765 → 0.985`, ADR-103) — same model, better results.
21
+
22
+ This follows the **Darwin Gödel Machine** lineage: iteratively mutate the source of a
23
+ coding agent, then *empirically validate* each variant.
15
24
 
16
25
  ```
17
26
  repo
@@ -244,9 +253,10 @@ By default the sandbox runs the **repo's test command**, which is independent of
244
253
  the harness surfaces — so the behavioural manifold is degenerate (measured:
245
254
  `nicheEntropy = 0`, ADR-099). `sandboxMode: 'mock'` (ADR-102) instead runs a
246
255
  **deterministic surface-driven agent loop**, so a variant's traces depend on its
247
- surface content and the manifold comes alive. The full real-agent substrate
248
- (surfaces driving an LLM on real coding tasks SWE-bench) is **ADR-101/098,
249
- deferred**.
256
+ surface content and the manifold comes alive. `sandboxMode: 'agent'` (ADR-106)
257
+ runs a variant's **real surface code** in a child process. The real-LLM-on-real-code
258
+ substrate is **no longer deferred** — it shipped (ADR-106→141) and now runs on
259
+ **canonical SWE-bench Lite** (ADR-142+, below).
250
260
 
251
261
  ### Validated results (real, reproducible — see `bench/results/`)
252
262
 
@@ -261,15 +271,37 @@ deferred**.
261
271
  - **Polyglot model frontier** (ADR-085): 15 models × 6 languages, execution-scored;
262
272
  DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar — cheap beats frontier for code.
263
273
 
274
+ ### Canonical SWE-bench Lite (real, official harness — ADR-142–146)
275
+
276
+ Run on the **full 300** SWE-bench Lite (test) instances, scored by the **official
277
+ `swebench` Docker harness** — no cherry-picking, tight CIs. Solver = relevance-ranked
278
+ context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0.01/instance.
279
+
280
+ | config | resolved | Wilson 95% CI | ADR |
281
+ |---|---|---|---|
282
+ | baseline (open-loop, single-shot) | 23/300 = **7.7%** | [5.2, 11.2] | 144 |
283
+ | + LLM localization | 24/300 = **8.0%** | [5.4, 11.6] | 146 |
284
+ | + closed-loop repair (test-feedback) | *measuring (full 300)* | — | 149 |
285
+
286
+ Honest framing: this is a **cheap-model, single-shot baseline** ($0.4/Mtok, <1¢/instance) —
287
+ leaderboard leaders hit 65–88% on Verified using iterative agentic loops + frontier models at
288
+ $1–20/instance. Localization lifted file-selection recall **44.7% → 59.7%** but resolve-rate held
289
+ flat — the bottleneck relocated to *patch emission* (ADR-146). The repair loop and a hybrid
290
+ cheap→frontier escalation (ADR-148) are the measured next levers. Every number is reproducible
291
+ under `bench/swebench/`.
292
+
264
293
  ## Status
265
294
 
266
- **Working prototype, empirically validated (mock substrate).** The
267
- `DeterministicMutator` is seeded and signature-preserving; the `OpenRouterMutator`
268
- (ADR-085) is the production LLM `CodeGenerator`, behind the *same*
269
- `validateGeneratedCode` gate. The safety boundary, scorer, archive, and bench
270
- layer are kernel code. The frontier is **Tier-2 real-agent execution**
271
- (ADR-101/098): surfaces driving a real LLM on real tasks, which turns the
272
- validated mock instrument into real-world SWE results.
295
+ **Working, empirically validated on both the mock substrate *and* canonical
296
+ SWE-bench Lite.** The `DeterministicMutator` is seeded and signature-preserving;
297
+ the `OpenRouterMutator` (ADR-085) is the production LLM `CodeGenerator`, behind the
298
+ *same* `validateGeneratedCode` gate. The safety boundary, scorer, archive, and bench
299
+ layer are kernel code. The real-LLM-on-real-code frontier (once deferred) is now
300
+ **measured**: a reproducible **7.7% [5.2–11.2%]** open-loop baseline on the full
301
+ SWE-bench Lite (ADR-144), with localization (146), the repair loop (149), and a
302
+ hybrid cheap→frontier escalation (148) as the active levers. Darwin Mode also ships
303
+ **integrated into the `metaharness` scaffolder** — `npx metaharness <name>` produces
304
+ a harness with `npm run evolve` out of the box (ADR-147).
273
305
 
274
306
  ## License
275
307
 
package/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "@metaharness/darwin",
3
- "version": "0.2.0",
4
- "description": "Darwin Mode (ADR-070…075) bounded, empirical, population-based self-improvement of an agent harness. Generate child harness variants, sandbox-score them, archive the lineage, and promote only measured, safe wins. The model is frozen; the harness evolves. Dependency-free (Node built-ins only).",
3
+ "version": "0.2.1",
4
+ "description": "An LLM supercharger and cost optimizer: keep your model frozen and evolve the harness around it so a cheap model performs like an expensive one — measurably better, far cheaper. Mutate→sandbox→score→archive; promote only safe, measured wins. Validated on real SWE-bench Lite (~$0.01/instance). Dependency-free (Node built-ins).",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",
7
7
  "types": "./dist/index.d.ts",
@@ -26,15 +26,18 @@
26
26
  "lint": "tsc --noEmit"
27
27
  },
28
28
  "keywords": [
29
+ "llm",
30
+ "cost-optimization",
31
+ "llm-optimizer",
32
+ "cheap-llm",
33
+ "compute-arbitrage",
29
34
  "agent-harness",
30
- "darwin-mode",
31
35
  "self-improvement",
32
36
  "evolutionary-search",
37
+ "swe-bench",
33
38
  "metaharness",
34
- "dgm",
35
- "archive",
36
- "sandbox",
37
- "benchmark"
39
+ "darwin-mode",
40
+ "sandbox"
38
41
  ],
39
42
  "author": "rUv <ruv@ruv.net>",
40
43
  "license": "MIT",