npm - @metaharness/darwin - Versions diffs - 0.2.0 → 0.2.1 - Mend

@metaharness/darwin 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2) hide show

package/README.md +54 -22
package/package.json +10 -7

package/README.md CHANGED Viewed

@@ -1,17 +1,26 @@
 # @metaharness/darwin
-> Darwin Mode — **the model is frozen; the harness evolves.**
-Bounded, empirical, population-based self-improvement of an agent harness
-(ADR-070…075). "Self-improving agents" is widely misread as "the model trains
-itself." Darwin Mode ships the practical version: an agent **modifies its own
-harness**, runs benchmarks in a sandbox, keeps the variants that *measurably*
-improve, and builds an **archive of successful descendants**. The foundation
-model never changes — what evolves is the operating system around it (planner,
-context builder, reviewer, retry/tool/memory/score policy). This follows the
-**Darwin Gödel Machine** lineage: iteratively mutate the source of a coding
-agent, then *empirically validate* each variant — no weight updates, just a
-population, a benchmark, and an archive.
+> **An LLM supercharger and cost optimizer.** Keep your model frozen — evolve the
+> harness around it so a *cheap* model performs like an expensive one, for a fraction
+> of the cost.
+Darwin Mode makes the LLM you already use **measurably better and cheaper** by
+evolving the *operating system around it* — planner, context builder, reviewer,
+retry/tool/memory/score policy — instead of paying for a bigger model. It mutates one
+surface at a time, tests each change in a sandbox, and keeps only what *measurably*
+improves, building an archive of successful descendants. No weight updates, no
+fine-tuning — just a population, a benchmark, and an archive.
+**Why it pays off (measured, not marketing):**
+- **Cheap beats frontier.** On a 15-model × 6-language execution benchmark, DeepSeek-V3
+  ($0.4/Mtok) tops quality-per-dollar — and the harness, not the model, is the lever (ADR-085).
+- **Real bug-fixing for pennies.** Resolves real **SWE-bench Lite** issues at **~$0.01/instance**
+  with a sub-$1/Mtok model (ADR-142–146) — vs. $1–20/instance for frontier-model agents.
+- **The harness is the multiplier.** Evolving context-window/selection/retry policy lifts a
+  fixed model's measured outcomes (e.g. `finalScore 0.765 → 0.985`, ADR-103) — same model, better results.
+This follows the **Darwin Gödel Machine** lineage: iteratively mutate the source of a
+coding agent, then *empirically validate* each variant.
 ```
 repo
@@ -244,9 +253,10 @@ By default the sandbox runs the **repo's test command**, which is independent of
 the harness surfaces — so the behavioural manifold is degenerate (measured:
 `nicheEntropy = 0`, ADR-099). `sandboxMode: 'mock'` (ADR-102) instead runs a
 **deterministic surface-driven agent loop**, so a variant's traces depend on its
-surface content and the manifold comes alive. The full real-agent substrate
-(surfaces driving an LLM on real coding tasks → SWE-bench) is **ADR-101/098,
-deferred**.
+surface content and the manifold comes alive. `sandboxMode: 'agent'` (ADR-106)
+runs a variant's **real surface code** in a child process. The real-LLM-on-real-code
+substrate is **no longer deferred** — it shipped (ADR-106→141) and now runs on
+**canonical SWE-bench Lite** (ADR-142+, below).
 ### Validated results (real, reproducible — see `bench/results/`)
@@ -261,15 +271,37 @@ deferred**.
 - **Polyglot model frontier** (ADR-085): 15 models × 6 languages, execution-scored;
   DeepSeek-V3 ($0.4/Mtok) tops quality-per-dollar — cheap beats frontier for code.
+### Canonical SWE-bench Lite (real, official harness — ADR-142–146)
+Run on the **full 300** SWE-bench Lite (test) instances, scored by the **official
+`swebench` Docker harness** — no cherry-picking, tight CIs. Solver = relevance-ranked
+context + symbol-aware localization + search/replace patch, `deepseek-chat`, ~$0.01/instance.
+| config | resolved | Wilson 95% CI | ADR |
+|---|---|---|---|
+| baseline (open-loop, single-shot) | 23/300 = **7.7%** | [5.2, 11.2] | 144 |
+| + LLM localization | 24/300 = **8.0%** | [5.4, 11.6] | 146 |
+| + closed-loop repair (test-feedback) | *measuring (full 300)* | — | 149 |
+Honest framing: this is a **cheap-model, single-shot baseline** ($0.4/Mtok, <1¢/instance) —
+leaderboard leaders hit 65–88% on Verified using iterative agentic loops + frontier models at
+$1–20/instance. Localization lifted file-selection recall **44.7% → 59.7%** but resolve-rate held
+flat — the bottleneck relocated to *patch emission* (ADR-146). The repair loop and a hybrid
+cheap→frontier escalation (ADR-148) are the measured next levers. Every number is reproducible
+under `bench/swebench/`.
 ## Status
-**Working prototype, empirically validated (mock substrate).** The
-`DeterministicMutator` is seeded and signature-preserving; the `OpenRouterMutator`
-(ADR-085) is the production LLM `CodeGenerator`, behind the *same*
-`validateGeneratedCode` gate. The safety boundary, scorer, archive, and bench
-layer are kernel code. The frontier is **Tier-2 real-agent execution**
-(ADR-101/098): surfaces driving a real LLM on real tasks, which turns the
-validated mock instrument into real-world SWE results.
+**Working, empirically validated on both the mock substrate *and* canonical
+SWE-bench Lite.** The `DeterministicMutator` is seeded and signature-preserving;
+the `OpenRouterMutator` (ADR-085) is the production LLM `CodeGenerator`, behind the
+*same* `validateGeneratedCode` gate. The safety boundary, scorer, archive, and bench
+layer are kernel code. The real-LLM-on-real-code frontier (once deferred) is now
+**measured**: a reproducible **7.7% [5.2–11.2%]** open-loop baseline on the full
+SWE-bench Lite (ADR-144), with localization (146), the repair loop (149), and a
+hybrid cheap→frontier escalation (148) as the active levers. Darwin Mode also ships
+**integrated into the `metaharness` scaffolder** — `npx metaharness <name>` produces
+a harness with `npm run evolve` out of the box (ADR-147).
 ## License

package/package.json CHANGED Viewed

@@ -1,7 +1,7 @@
 {
   "name": "@metaharness/darwin",
-  "version": "0.2.0",
-  "description": "Darwin Mode (ADR-070…075) — bounded, empirical, population-based self-improvement of an agent harness. Generate child harness variants, sandbox-score them, archive the lineage, and promote only measured, safe wins. The model is frozen; the harness evolves. Dependency-free (Node built-ins only).",
+  "version": "0.2.1",
+  "description": "An LLM supercharger and cost optimizer: keep your model frozen and evolve the harness around it so a cheap model performs like an expensive one — measurably better, far cheaper. Mutate→sandbox→score→archive; promote only safe, measured wins. Validated on real SWE-bench Lite (~$0.01/instance). Dependency-free (Node built-ins).",
   "type": "module",
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
@@ -26,15 +26,18 @@
     "lint": "tsc --noEmit"
   },
   "keywords": [
+    "llm",
+    "cost-optimization",
+    "llm-optimizer",
+    "cheap-llm",
+    "compute-arbitrage",
     "agent-harness",
-    "darwin-mode",
     "self-improvement",
     "evolutionary-search",
+    "swe-bench",
     "metaharness",
-    "dgm",
-    "archive",
-    "sandbox",
-    "benchmark"
+    "darwin-mode",
+    "sandbox"
   ],
   "author": "rUv <ruv@ruv.net>",
   "license": "MIT",