harness-evolver 2.2.0 → 2.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,228 +1,173 @@
1
+ <p align="center">
2
+ <img src="assets/banner.jpg" alt="Harness Evolver" width="100%">
3
+ </p>
4
+
1
5
  # Harness Evolver
2
6
 
3
- End-to-end optimization of LLM agent harnesses, inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026).
7
+ <p align="center">
8
+ <a href="https://www.npmjs.com/package/harness-evolver"><img src="https://img.shields.io/npm/v/harness-evolver?style=for-the-badge&color=blueviolet" alt="npm"></a>
9
+ <a href="https://github.com/raphaelchristi/harness-evolver/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License: MIT"></a>
10
+ <a href="https://arxiv.org/abs/2603.28052"><img src="https://img.shields.io/badge/Paper-Meta--Harness-FFD700?style=for-the-badge" alt="Paper"></a>
11
+ <a href="https://github.com/raphaelchristi/harness-evolver"><img src="https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge" alt="Built by Raphael Valdetaro"></a>
12
+ </p>
4
13
 
5
- **The harness is the 80% factor.** Changing just the scaffolding around a fixed LLM can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.
14
+ **Autonomous harness optimization for LLM agents.** Point it at any codebase, and Harness Evolver will evolve the scaffolding around your LLM (prompts, retrieval, routing, output parsing) using a multi-agent loop inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026).
6
15
 
7
- ## Install
16
+ The harness is the 80% factor. Changing just the scaffolding can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates that search.
8
17
 
9
- ```bash
10
- npx harness-evolver@latest
11
- ```
18
+ ---
12
19
 
13
- Select your runtime (Claude Code, Cursor, Codex, Windsurf) and scope (global/local). Then **restart your AI coding agent** for the skills to appear.
14
-
15
- ## Prerequisites
16
-
17
- ### API Keys (set in your shell before launching Claude Code)
18
-
19
- The harness you're evolving may call LLM APIs. Set the keys your harness needs:
20
+ ## Install
20
21
 
21
22
  ```bash
22
- # Required: at least one LLM provider
23
- export ANTHROPIC_API_KEY="sk-ant-..." # For Claude-based harnesses
24
- export OPENAI_API_KEY="sk-..." # For OpenAI-based harnesses
25
- export GEMINI_API_KEY="AIza..." # For Gemini-based harnesses
26
- export OPENROUTER_API_KEY="sk-or-..." # For OpenRouter (multi-model)
27
-
28
- # Optional: enhanced tracing
29
- export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
23
+ npx harness-evolver@latest
30
24
  ```
31
25
 
32
- The plugin auto-detects which keys are available during `/harness-evolver:init` and shows them. The proposer agent knows which APIs are available and uses them accordingly.
33
-
34
- **No API key needed for the example** — the classifier example uses keyword matching (mock mode), no LLM calls.
35
-
36
- ### Optional: Enhanced Integrations
26
+ > Works with Claude Code, Cursor, Codex, and Windsurf. Restart your agent after install.
37
27
 
38
- ```bash
39
- # LangSmith — rich trace analysis for the proposer
40
- uv tool install langsmith-cli && langsmith-cli auth login
41
-
42
- # Context7 — up-to-date library documentation for the proposer
43
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
44
-
45
- # LangChain Docs — LangChain/LangGraph-specific documentation
46
- claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
47
- ```
28
+ ---
48
29
 
49
30
  ## Quick Start
50
31
 
51
- ### Try the Example (no API key needed)
52
-
53
32
  ```bash
54
- # 1. Copy the example
55
- cp -r ~/.harness-evolver/examples/classifier ./my-classifier
56
- cd my-classifier
57
-
58
- # 2. Open Claude Code
33
+ cd my-llm-project
59
34
  claude
60
35
 
61
- # 3. Initialize auto-detects harness.py, eval.py, tasks/
62
- /harness-evolver:init
63
-
64
- # 4. Run the evolution loop
65
- /harness-evolver:evolve --iterations 3
66
-
67
- # 5. Check progress
68
- /harness-evolver:status
36
+ /harness-evolver:init # scans code, creates eval + tasks if missing
37
+ /harness-evolver:evolve # runs the optimization loop
38
+ /harness-evolver:status # check progress anytime
69
39
  ```
70
40
 
71
- ### Use with Your Own Project
41
+ **Zero-config mode:** If your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
72
42
 
73
- ```bash
74
- cd my-llm-project
75
- claude
76
-
77
- # Init scans your project, identifies the entry point,
78
- # and helps create harness wrapper + eval + tasks if missing
79
- /harness-evolver:init
80
-
81
- # Run optimization
82
- /harness-evolver:evolve --iterations 10
83
- ```
43
+ ---
84
44
 
85
- The init skill adapts to your project — if you have `graph.py` instead of `harness.py`, it creates a thin wrapper. If you don't have an eval script, it helps you write one.
45
+ ## How It Works
86
46
 
87
- ## Available Commands
47
+ <table>
48
+ <tr>
49
+ <td><b>5 Proposers</b></td>
50
+ <td>Each iteration spawns 5 parallel agents with different strategies: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), plus two adaptive slots (failure-targeted, creative, or efficiency, chosen from per-task failure analysis). Best candidate wins.</td>
51
+ </tr>
52
+ <tr>
53
+ <td><b>Full Traces</b></td>
54
+ <td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Proposers read actual LLM prompts and responses.</td>
55
+ </tr>
56
+ <tr>
57
+ <td><b>Critic</b></td>
58
+ <td>Auto-triggers when scores jump suspiciously fast. Analyzes eval quality, detects gaming, proposes stricter evaluation. Prevents false convergence.</td>
59
+ </tr>
60
+ <tr>
61
+ <td><b>Architect</b></td>
62
+ <td>Auto-triggers on stagnation or regression. Recommends topology changes (single-call → RAG, chain → ReAct, etc.) with concrete migration steps.</td>
63
+ </tr>
64
+ <tr>
65
+ <td><b>Judge</b></td>
66
+ <td>LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed.</td>
67
+ </tr>
68
+ </table>
69
+
70
+ ---
71
+
72
+ ## Commands
88
73
 
89
74
  | Command | What it does |
90
75
  |---|---|
91
76
  | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
92
- | `/harness-evolver:evolve` | Run the autonomous optimization loop |
93
- | `/harness-evolver:status` | Show progress (scores, iterations, stagnation) |
77
+ | `/harness-evolver:evolve` | Run the autonomous optimization loop (5 parallel proposers) |
78
+ | `/harness-evolver:status` | Show progress, scores, stagnation detection |
94
79
  | `/harness-evolver:compare` | Diff two versions with per-task analysis |
95
80
  | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
96
- | `/harness-evolver:deploy` | Copy the best harness back to your project |
97
-
98
- ## How It Works
81
+ | `/harness-evolver:deploy` | Promote the best harness back to your project |
82
+ | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
83
+ | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
99
84
 
100
- ```
101
- ┌─────────────────────────────┐
102
- │ /harness-evolver:evolve │
103
- │ (orchestrator skill) │
104
- └──────────┬──────────────────┘
105
-
106
- ┌────────────────┼────────────────┐
107
- ▼ ▼ ▼
108
- ┌──────────┐ ┌────────────┐ ┌──────────┐
109
- │ PROPOSE │ │ EVALUATE │ │ UPDATE │
110
- │ proposer │ │ evaluate.py│ │ state.py │
111
- │ agent │ │ + eval.py │ │ │
112
- └──────────┘ └────────────┘ └──────────┘
113
- │ │ │
114
- ▼ ▼ ▼
115
- harnesses/ traces/ summary.json
116
- v{N}/ per-task STATE.md
117
- harness.py stdout/stderr PROPOSER_HISTORY.md
118
- proposal.md timing.json
119
- scores.json
120
- ```
85
+ ---
121
86
 
122
- 1. **Propose** — A proposer agent reads all prior candidates' code, execution traces, and scores. Diagnoses failure modes via counterfactual analysis and writes a new harness.
123
- 2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). Your eval script scores the results.
124
- 3. **Update** — State files are updated with the new score, parent lineage, and regression detection.
125
- 4. **Repeat** — Until N iterations, stagnation (3 rounds without >1% improvement), or target score reached.
87
+ ## Agents
126
88
 
127
- ## The Harness Contract
89
+ | Agent | Role | Color |
90
+ |---|---|---|
91
+ | **Proposer** | Evolves the harness code based on trace analysis | Green |
92
+ | **Architect** | Recommends multi-agent topology (ReAct, RAG, hierarchical, etc.) | Blue |
93
+ | **Critic** | Evaluates eval quality, detects gaming, proposes stricter scoring | Red |
94
+ | **Judge** | LLM-as-judge scoring — works without expected answers | Yellow |
95
+ | **TestGen** | Generates synthetic test cases from code analysis | Cyan |
128
96
 
129
- A harness is **any executable** that accepts:
97
+ ---
130
98
 
131
- ```bash
132
- python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
133
- ```
134
-
135
- - `--input`: JSON with `{id, input, metadata}` (never sees expected answers)
136
- - `--output`: JSON with `{id, output}`
137
- - `--traces-dir`: optional directory for rich traces
138
- - `--config`: optional JSON with evolvable parameters
99
+ ## Integrations
139
100
 
140
- The eval script is also any executable:
101
+ <table>
102
+ <tr>
103
+ <td><b>LangSmith</b></td>
104
+ <td>Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via <code>langsmith-cli</code>. Processed into readable format per iteration.</td>
105
+ </tr>
106
+ <tr>
107
+ <td><b>Context7</b></td>
108
+ <td>Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis.</td>
109
+ </tr>
110
+ <tr>
111
+ <td><b>LangChain Docs</b></td>
112
+ <td>LangChain/LangGraph-specific documentation search via MCP.</td>
113
+ </tr>
114
+ </table>
141
115
 
142
116
  ```bash
143
- python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json
144
- ```
145
-
146
- Works with **any language, any framework, any domain**.
147
-
148
- ## Project Structure (after init)
149
-
150
- ```
151
- .harness-evolver/ # Created by /harness-evolver:init
152
- ├── config.json # Project config (harness cmd, eval, API keys detected)
153
- ├── summary.json # Source of truth (versions, scores, parents)
154
- ├── STATE.md # Human-readable status
155
- ├── PROPOSER_HISTORY.md # Log of all proposals and outcomes
156
- ├── baseline/ # Original harness (read-only)
157
- │ └── harness.py
158
- ├── eval/
159
- │ ├── eval.py # Your scoring script
160
- │ └── tasks/ # Test cases
161
- └── harnesses/
162
- └── v001/
163
- ├── harness.py # Evolved candidate
164
- ├── proposal.md # Why this version was created
165
- ├── scores.json # How it scored
166
- └── traces/ # Full execution traces
167
- ├── stdout.log
168
- ├── stderr.log
169
- ├── timing.json
170
- └── task_001/
171
- ├── input.json
172
- └── output.json
117
+ # Optional: installed during npx setup, or add manually:
118
+ uv tool install langsmith-cli && langsmith-cli auth login
119
+ claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
120
+ claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
173
121
  ```
174
122
 
175
- ## The Proposer
123
+ ---
176
124
 
177
- The core of the system. 4-phase workflow from the Meta-Harness paper:
178
-
179
- | Phase | What it does |
180
- |---|---|
181
- | **Orient** | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
182
- | **Diagnose** | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
183
- | **Propose** | Write new harness. Prefer additive changes after regressions. |
184
- | **Document** | Write `proposal.md` with evidence. Update history. |
185
-
186
- **7 rules:** evidence-based changes, conservative after regression, don't repeat mistakes, one hypothesis at a time, maintain interface, prefer readability, use available API keys from environment.
187
-
188
- ## Integrations
125
+ ## The Harness Contract
189
126
 
190
- ### LangSmith (optional, recommended for LangChain/LangGraph harnesses)
127
+ A harness is **any executable**:
191
128
 
192
129
  ```bash
193
- export LANGSMITH_API_KEY=lsv2_...
194
- uv tool install langsmith-cli && langsmith-cli auth login
130
+ python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
195
131
  ```
196
132
 
197
- When detected, the plugin:
198
- - Sets `LANGCHAIN_TRACING_V2=true` automatically — all LLM calls are traced
199
- - The proposer queries traces directly via `langsmith-cli`:
133
+ Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
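For concreteness, a minimal conforming harness might look like the sketch below. The uppercase echo logic is a placeholder for your real pipeline; only the CLI contract (`--input`, `--output`, optional `--traces-dir`/`--config`) and the `{id, input, metadata}` / `{id, output}` JSON shapes come from this document.

```python
import argparse
import json
import pathlib
import sys

def run_task(task, config=None):
    # Placeholder logic: echo the input uppercased. Swap in your real
    # pipeline (LLM calls, retrieval, parsing) here.
    return {"id": task["id"], "output": str(task["input"]).upper()}

def main(argv):
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)    # {id, input, metadata}
    p.add_argument("--output", required=True)   # {id, output}
    p.add_argument("--traces-dir", default=None)  # optional rich traces
    p.add_argument("--config", default=None)      # optional evolvable params
    args = p.parse_args(argv)

    task = json.loads(pathlib.Path(args.input).read_text())
    config = json.loads(pathlib.Path(args.config).read_text()) if args.config else {}
    result = run_task(task, config)
    pathlib.Path(args.output).write_text(json.dumps(result))

# Run main() only when invoked as a script with arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```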
200
134
 
201
- ```bash
202
- langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
203
- langsmith-cli --json runs stats --project harness-evolver-v003
204
- ```
135
+ ---
205
136
 
206
- ### Context7 (optional, recommended for any library-heavy harness)
137
+ ## Evolution Loop
207
138
 
208
- ```bash
209
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
139
+ ```
140
+ /harness-evolver:evolve
141
+
142
+ ├─ 1. Gather LangSmith traces (processed into readable format)
143
+ ├─ 2. Spawn 5 proposers in parallel (exploit/explore/crossover/prompt/retrieval)
144
+ ├─ 3. Validate all candidates
145
+ ├─ 4. Evaluate all candidates
146
+ ├─ 4.5 Judge (if using LLM-as-judge eval)
147
+ ├─ 5. Select winner (highest combined_score)
148
+ ├─ 6. Report results
149
+ ├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
150
+ ├─ 7. Auto-trigger Architect (if regression or stagnation)
151
+ └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
210
152
  ```
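The numbered steps above can be sketched as a plain control loop. This is illustrative only: `spawn_proposers`, `evaluate`, `run_critic`, and `run_architect` are stand-ins for the real skill steps, and the 0.3 jump threshold is the one this document names for the critic trigger.

```python
import random

# Stand-ins for the real skill steps (proposer agents, eval script, critic, architect).
def spawn_proposers(state, n=5):
    base = state["best_score"]
    return [{"combined_score": max(0.0, min(1.0, base + random.uniform(-0.1, 0.1)))}
            for _ in range(n)]

def validate(candidate): return True          # step 3
def evaluate(candidate): return candidate     # step 4
def run_critic(candidate): pass               # step 6.5
def run_architect(state): pass                # step 7

def evolve(state, max_iterations=10, target=1.0):
    for _ in range(max_iterations):
        candidates = [c for c in spawn_proposers(state) if validate(c)]  # steps 2-3
        scored = [evaluate(c) for c in candidates]                       # step 4
        winner = max(scored, key=lambda c: c["combined_score"])          # step 5
        jump = winner["combined_score"] - state["best_score"]
        if jump > 0.3 or winner["combined_score"] >= 1.0:                # step 6.5
            run_critic(winner)
        if jump <= 0:                                                    # step 7
            run_architect(state)
        state["history"].append(winner["combined_score"])
        state["best_score"] = max(state["best_score"], winner["combined_score"])
        if state["best_score"] >= target:                                # step 8
            break
    return state
```

The real loop also stops on stagnation after the architect has run; that check is omitted here for brevity.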
211
153
 
212
- The plugin detects your stack via AST analysis (17 libraries: LangChain, LangGraph, OpenAI, Anthropic, ChromaDB, FastAPI, etc.) and instructs the proposer to consult current docs before proposing API changes.
154
+ ---
213
155
 
214
- ## Development
156
+ ## API Keys
157
+
158
+ Set in your shell before launching Claude Code:
215
159
 
216
160
  ```bash
217
- # Run all tests (41 tests, stdlib-only)
218
- python3 -m unittest discover -s tests -v
161
+ export GEMINI_API_KEY="AIza..." # Gemini-based harnesses
162
+ export ANTHROPIC_API_KEY="sk-ant-..." # Claude-based harnesses
163
+ export OPENAI_API_KEY="sk-..." # OpenAI-based harnesses
164
+ export OPENROUTER_API_KEY="sk-or-..." # Multi-model via OpenRouter
165
+ export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
166
+ ```
219
167
 
220
- # Test example manually
221
- python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json
168
+ The plugin auto-detects available keys. No key needed for the included example.
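Key detection can be as simple as checking the environment for the variables listed above. A sketch, not the plugin's actual implementation:

```python
import os

PROVIDER_KEYS = ["GEMINI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY",
                 "OPENROUTER_API_KEY", "LANGSMITH_API_KEY"]

def detect_keys(env=None):
    # Report which keys are set without ever exposing their values.
    env = os.environ if env is None else env
    return {k: bool(env.get(k)) for k in PROVIDER_KEYS}
```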
222
169
 
223
- # Install locally for development
224
- node bin/install.js
225
- ```
170
+ ---
226
171
 
227
172
  ## Comparison
228
173
 
@@ -230,17 +175,23 @@ node bin/install.js
230
175
  |---|---|---|---|---|
231
176
  | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
232
177
  | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
233
- | **Domain** | TerminalBench-2 | Coding benchmarks | Dev workflow | **Any domain** |
234
- | **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx`** |
178
+ | **Candidates/iter** | 1 | 1 | N/A | **5 parallel** |
179
+ | **Auto-critique** | No | No | No | **Yes (critic + judge)** |
180
+ | **Architecture** | Fixed | Fixed | N/A | **Auto-recommended** |
235
181
  | **LangSmith** | No | No | No | **Yes** |
236
182
  | **Context7** | No | No | No | **Yes** |
183
+ | **Zero-config** | No | No | No | **Yes** |
184
+
185
+ ---
237
186
 
238
187
  ## References
239
188
 
240
- - [Meta-Harness paper (arxiv 2603.28052)](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
241
- - [Design Spec](docs/specs/2026-03-31-harness-evolver-design.md)
242
- - [LangSmith Integration](docs/specs/2026-03-31-langsmith-integration.md)
243
- - [Context7 Integration](docs/specs/2026-03-31-context7-integration.md)
189
+ - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
190
+ - [Darwin Gödel Machine](https://sakana.ai/dgm/) — Sakana AI (parallel evolution architecture)
191
+ - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind (population-based code evolution)
192
+ - [Agent Skills Specification](https://agentskills.io) — Open standard for AI agent skills
193
+
194
+ ---
244
195
 
245
196
  ## License
246
197
 
@@ -15,12 +15,15 @@ every file listed there before performing any other actions. These files are you
15
15
 
16
16
  ## Strategy Injection
17
17
 
18
- Your prompt may contain a `<strategy>` block defining your evolutionary role:
19
- - **exploitation**: Make targeted, conservative fixes to the current best
20
- - **exploration**: Try fundamentally different approaches, be bold
21
- - **crossover**: Combine strengths from two parent versions
18
+ Your prompt contains a `<strategy>` block defining your approach. Follow it:
19
+
20
+ - **exploitation**: Conservative fix on current best. Small, targeted changes.
21
+ - **exploration**: Bold, fundamentally different approach. High risk, high reward.
22
+ - **crossover**: Combine strengths from two parent versions.
23
+ - **failure-targeted**: Fix SPECIFIC failing tasks listed in the strategy. Read their traces, understand the root cause, fix that capability. You are free to change ANYTHING needed.
24
+ - **creative**: Try something unexpected — different algorithms, architecture, libraries.
25
+ - **efficiency**: Same quality but fewer tokens, faster, simpler code.
22
26
 
23
- Follow the strategy. It determines your risk tolerance and parent selection.
24
27
  If no strategy block is present, default to exploitation (conservative improvement).
25
28
 
26
29
  ## Context7 — Enrich Your Knowledge
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
- "version": "2.2.0",
3
+ "version": "2.4.0",
4
4
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
5
5
  "author": "Raphael Valdetaro",
6
6
  "license": "MIT",
@@ -52,21 +52,137 @@ LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*
52
52
 
53
53
  If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
54
54
 
55
- **Step 2: Gather traces from the discovered project**
55
+ **Step 2: Gather raw traces from the discovered project**
56
56
 
57
57
  ```bash
58
58
  if [ -n "$LS_PROJECT" ]; then
59
- langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
60
- langsmith-cli --json runs list --project "$LS_PROJECT" --fields id,name,inputs,outputs,latency_ms,total_tokens --limit 20 > .harness-evolver/langsmith_runs.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
59
+ langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
61
60
  langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
62
61
  echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
63
62
  else
64
- echo "[]" > .harness-evolver/langsmith_diagnosis.json
63
+ echo "[]" > /tmp/langsmith_raw.json
65
64
  echo "{}" > .harness-evolver/langsmith_stats.json
66
65
  fi
67
66
  ```
68
67
 
69
- These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
68
+ **Step 3: Process raw LangSmith data into a readable format for proposers**
69
+
70
+ The raw LangSmith data contains LangChain-serialized messages that are hard to read. Process it into a clean summary:
71
+
72
+ ```bash
73
+ python3 -c "
74
+ import json, sys
75
+
76
+ raw = json.load(open('/tmp/langsmith_raw.json'))
77
+ if not raw:
78
+ json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
79
+ sys.exit(0)
80
+
81
+ clean = []
82
+ for r in raw:
83
+ entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
84
+
85
+ # Extract readable prompt from LangChain serialized inputs
86
+ inputs = r.get('inputs', {})
87
+ if isinstance(inputs, dict) and 'messages' in inputs:
88
+ msgs = inputs['messages']
89
+ for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
90
+ for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
91
+ if isinstance(msg, dict):
92
+ kwargs = msg.get('kwargs', msg)
93
+ content = kwargs.get('content', '')
94
+ msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
95
+ if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
96
+ entry['user_message'] = str(content)[:300]
97
+ elif 'System' in str(msg_type):
98
+ entry['system_prompt_preview'] = str(content)[:200]
99
+
100
+ # Extract readable output
101
+ outputs = r.get('outputs', {})
102
+ if isinstance(outputs, dict) and 'generations' in outputs:
103
+ gens = outputs['generations']
104
+ if gens and isinstance(gens, list) and gens[0]:
105
+ gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
106
+ if isinstance(gen, dict):
107
+ msg = gen.get('message', gen)
108
+ if isinstance(msg, dict):
109
+ kwargs = msg.get('kwargs', msg)
110
+ entry['llm_response'] = str(kwargs.get('content', ''))[:300]
111
+
112
+ clean.append(entry)
113
+
114
+ json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
115
+ print(f'Processed {len(clean)} LangSmith runs into readable format')
116
+ " 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
117
+ ```
118
+
119
+ The resulting `langsmith_runs.json` has clean, readable entries:
120
+ ```json
121
+ [
122
+ {
123
+ "name": "ChatGoogleGenerativeAI",
124
+ "tokens": 1332,
125
+ "error": null,
126
+ "user_message": "Analyze this text: Good morning everyone...",
127
+ "system_prompt_preview": "You are a content moderator...",
128
+ "llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
129
+ }
130
+ ]
131
+ ```
132
+
133
+ These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
134
+
135
+ ### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
136
+
137
+ Before spawning proposers, analyze which tasks are failing and cluster them:
138
+
139
+ ```bash
140
+ python3 -c "
141
+ import json, os, sys
142
+
143
+ # Find best version scores
144
+ summary = json.load(open('.harness-evolver/summary.json'))
145
+ best = summary['best']['version']
146
+ scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
147
+ if not os.path.exists(scores_path):
148
+ scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
149
+
150
+ if not scores_path or not os.path.exists(scores_path):
151
+ print('NO_SCORES')
152
+ sys.exit(0)
153
+
154
+ scores = json.load(open(scores_path))
155
+ tasks_dir = '.harness-evolver/eval/tasks/'
156
+ failures = {}
157
+
158
+ for tid, tdata in scores.get('per_task', {}).items():
159
+ score = tdata.get('score', 0)
160
+ if score < 0.7:
161
+ tfile = os.path.join(tasks_dir, tid + '.json')
162
+ cat = 'unknown'
163
+ if os.path.exists(tfile):
164
+ task = json.load(open(tfile))
165
+ meta = task.get('metadata', {})
166
+ cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
167
+ failures.setdefault(cat, []).append({'id': tid, 'score': score})
168
+
169
+ if not failures:
170
+ print('ALL_PASSING')
171
+ else:
172
+ sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
173
+ for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
174
+ task_ids = [t['id'] for t in tasks]
175
+ avg_score = sum(t['score'] for t in tasks) / len(tasks)
176
+ print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
177
+ " 2>/dev/null
178
+ ```
179
+
180
+ Parse the output:
181
+ - If `NO_SCORES` or `ALL_PASSING`: D gets "creative" brief, E gets "efficiency" brief
182
+ - If clusters found: D targets cluster 1, E targets cluster 2
183
+ - If only 1 cluster: D targets it, E gets "creative" brief
184
+
185
+ Save clusters for use in step 2.
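Parsing those printed lines into briefings can be sketched as follows. The `CLUSTER_n|category|ids|avg` line format comes from the script above; the helper names are hypothetical:

```python
import json

def parse_clusters(output):
    # Turn `CLUSTER_n|category|["task_ids"]|avg` lines into briefing dicts.
    clusters = []
    for line in output.strip().splitlines():
        if line.startswith("CLUSTER_"):
            _, category, ids, avg = line.split("|")
            clusters.append({"category": category,
                             "task_ids": json.loads(ids),
                             "avg_score": float(avg)})
    return clusters

def briefs_for_d_and_e(output):
    # Apply the routing rules above: clusters go to D then E,
    # with "creative"/"efficiency" as the fallback briefs.
    clusters = parse_clusters(output)
    if not clusters:                      # NO_SCORES or ALL_PASSING
        return ("creative", "efficiency")
    if len(clusters) == 1:
        return (clusters[0], "creative")
    return (clusters[0], clusters[1])
```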
70
186
 
71
187
  ### 2. Propose (3 parallel candidates)
72
188
 
@@ -76,7 +192,10 @@ This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
76
192
  Determine parents for each strategy:
77
193
  - **Exploiter parent**: current best version (from summary.json `best.version`)
78
194
  - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
79
- - **Crossover parents**: best version + a different high-scorer from a different lineage
195
+ - **Crossover parents**:
196
+ - Parent A = current best version
197
+ - Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
198
+ If no champion file exists, fall back to a non-best version from the archive.
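A sketch of that parent-selection logic. The `best.version` field of `summary.json` is referenced elsewhere in this document; the `per_task_champion.json` payload (`{"version": ...}`) and the `history` entries with `version`/`score` fields are assumed schemas for illustration:

```python
import json
import pathlib

def crossover_parents(root=".harness-evolver"):
    # Parent A: current best version from summary.json.
    summary = json.loads(pathlib.Path(root, "summary.json").read_text())
    parent_a = summary["best"]["version"]

    # Parent B: per-task champion from the previous iteration, if recorded.
    champ = pathlib.Path(root, "per_task_champion.json")
    if champ.exists():
        parent_b = json.loads(champ.read_text())["version"]
    else:
        # Fallback: a non-best archived version that scored above zero.
        others = [h["version"] for h in summary.get("history", [])
                  if h.get("score", 0) > 0 and h["version"] != parent_a]
        parent_b = others[-1] if others else parent_a
    return parent_a, parent_b
```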
80
199
 
81
200
  Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
82
201
 
@@ -198,35 +317,131 @@ Agent(
198
317
 
199
318
  **Also spawn these additional candidates:**
200
319
 
201
- **Candidate D (Prompt Specialist)** — `run_in_background: true`:
202
- Same as Exploiter but with a different focus:
320
+ **Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
321
+
322
+ If failure clusters were found in step 1.8:
323
+ ```
324
+ Agent(
325
+ subagent_type: "harness-evolver-proposer",
326
+ description: "Proposer D: fix {cluster_1_category} failures",
327
+ run_in_background: true,
328
+ prompt: |
329
+ <strategy>
330
+ APPROACH: failure-targeted
331
+ Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
332
+ They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
333
+ Read the traces of these specific tasks to understand WHY they fail.
334
+ Your changes should improve these tasks WITHOUT regressing others.
335
+ You are free to change anything — prompts, code, retrieval, architecture —
336
+ whatever is needed to fix THIS specific failure mode.
337
+ </strategy>
338
+
339
+ <objective>
340
+ Propose harness version {version}d targeting {cluster_1_category} failures.
341
+ </objective>
342
+
343
+ <files_to_read>
344
+ - .harness-evolver/summary.json
345
+ - .harness-evolver/PROPOSER_HISTORY.md
346
+ - .harness-evolver/config.json
347
+ - .harness-evolver/harnesses/{best_version}/harness.py
348
+ - .harness-evolver/harnesses/{best_version}/scores.json
349
+ - .harness-evolver/langsmith_runs.json (if exists)
350
+ - .harness-evolver/architecture.json (if exists)
351
+ </files_to_read>
352
+
353
+ <output>
354
+ Create directory .harness-evolver/harnesses/{version}d/ containing:
355
+ - harness.py, config.json, proposal.md
356
+ </output>
357
+ )
358
+ ```
359
+
360
+ If ALL_PASSING (no failures):
361
+ ```
362
+ Agent(
363
+ subagent_type: "harness-evolver-proposer",
364
+ description: "Proposer D: creative approach",
365
+ run_in_background: true,
366
+ prompt: |
367
+ <strategy>
368
+ APPROACH: creative
369
+ All tasks are scoring well. Try something UNEXPECTED:
370
+ - Different algorithm or library
371
+ - Completely different prompt architecture
372
+ - Novel error handling or output validation
373
+ - Something no one would think of
374
+ The goal is to discover improvements that incremental fixes would miss.
375
+ </strategy>
376
+ ...same files_to_read and output as above...
377
+ )
378
+ ```
379
+
380
+ **Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
381
+
382
+ If a second failure cluster exists:
203
383
  ```
204
- <strategy>
205
- APPROACH: prompt-engineering
206
- You are the PROMPT SPECIALIST. Focus ONLY on improving the system prompt,
207
- few-shot examples, output format instructions, and prompt structure.
208
- Do NOT change the retrieval logic, pipeline structure, or code architecture.
209
- </strategy>
384
+ Agent(
385
+ subagent_type: "harness-evolver-proposer",
386
+ description: "Proposer E: fix {cluster_2_category} failures",
387
+ run_in_background: true,
388
+ prompt: |
389
+ <strategy>
390
+ APPROACH: failure-targeted
391
+ Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
392
+ They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
393
+ Read the traces of these specific tasks to understand WHY they fail.
394
+ Your changes should improve these tasks WITHOUT regressing others.
395
+ You are free to change anything — prompts, code, retrieval, architecture —
396
+ whatever is needed to fix THIS specific failure mode.
397
+ </strategy>
398
+
399
+ <objective>
400
+ Propose harness version {version}e targeting {cluster_2_category} failures.
401
+ </objective>
402
+
403
+ <files_to_read>
404
+ - .harness-evolver/summary.json
405
+ - .harness-evolver/PROPOSER_HISTORY.md
406
+ - .harness-evolver/config.json
407
+ - .harness-evolver/harnesses/{best_version}/harness.py
408
+ - .harness-evolver/harnesses/{best_version}/scores.json
409
+ - .harness-evolver/langsmith_runs.json (if exists)
410
+ - .harness-evolver/architecture.json (if exists)
411
+ </files_to_read>
412
+
413
+ <output>
414
+ Create directory .harness-evolver/harnesses/{version}e/ containing:
415
+ - harness.py, config.json, proposal.md
416
+ </output>
417
+ )
210
418
  ```
211
- Output to: `.harness-evolver/harnesses/{version}d/`
212
419
 
213
- **Candidate E (Data/Retrieval Specialist)** — `run_in_background: true`:
420
+ If no second cluster (or ALL_PASSING):
214
421
  ```
- <strategy>
- APPROACH: retrieval-optimization
- You are the RETRIEVAL SPECIALIST. Focus ONLY on improving how data is
- retrieved, filtered, ranked, and presented to the LLM.
- Do NOT change the system prompt text or output formatting.
- Improve: search logic, relevance scoring, cross-domain retrieval, chunking.
- </strategy>
+ Agent(
+   subagent_type: "harness-evolver-proposer",
+   description: "Proposer E: efficiency optimization",
+   run_in_background: true,
+   prompt: |
+     <strategy>
+     APPROACH: efficiency
+     Maintain the current quality but optimize for:
+     - Fewer LLM tokens (shorter prompts, less context)
+     - Faster execution (reduce unnecessary steps)
+     - Simpler code (remove redundant logic)
+     - Better error handling (graceful degradation)
+     Do NOT sacrifice accuracy for speed — same quality, less cost.
+     </strategy>
+     ...same files_to_read and output as above...
+ )
  ```
- Output to: `.harness-evolver/harnesses/{version}e/`

  Wait for all 5 to complete. The background agents will notify when done.
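The launch-and-wait pattern above can be sketched generically. This is a minimal illustration only: the real loop spawns background proposer agents via the Agent tool, not Python threads, and `propose_candidate` is a hypothetical stand-in for one proposer run.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for spawning one background proposer agent.
def propose_candidate(suffix: str, strategy: str) -> str:
    return f"candidate {suffix} ({strategy}) proposed"

# The five strategies described in this section.
strategies = {"a": "exploit", "b": "explore", "c": "crossover",
              "d": "failure-targeted", "e": "efficiency"}

# Launch all candidates concurrently, then block until every one finishes.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {s: pool.submit(propose_candidate, s, strat)
               for s, strat in strategies.items()}
    results = {s: f.result() for s, f in futures.items()}

for suffix, msg in sorted(results.items()):
    print(msg)
```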

  **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses the baseline as both parents, with the instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.

- **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, replace Candidate D with a "Radical" strategy that rewrites the harness from scratch.
+ **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, step 1.8 will naturally shift D and E toward failure-targeted or creative strategies based on actual task performance.

  ### 3. Validate All Candidates

@@ -281,16 +496,67 @@ Wait for `## JUDGE COMPLETE`.

  If eval_type is NOT "pending-judge", the eval.py already produced real scores — skip this step.

- ### 5. Select Winner + Update State
+ ### 5. Select Winner + Track Per-Task Champions
+
+ **5a. Find overall winner (highest combined_score):**
+
+ Compare all evaluated candidates. The winner is the one with the highest combined_score.
+
+ **5b. Find per-task champion (candidate that beats the winner on most individual tasks):**
+
+ ```bash
+ python3 -c "
+ import json, os
+
+ version = '{version}'
+ candidates = {}
+ for suffix in ['a', 'b', 'c', 'd', 'e']:
+     path = f'.harness-evolver/harnesses/{version}{suffix}/scores.json'
+     if os.path.exists(path):
+         candidates[suffix] = json.load(open(path))
+
+ if not candidates:
+     print('NO_CANDIDATES')
+     exit()
+
+ # Overall winner
+ winner_suffix = max(candidates, key=lambda s: candidates[s].get('combined_score', 0))
+ winner_score = candidates[winner_suffix]['combined_score']
+ print(f'WINNER: {winner_suffix} (score: {winner_score:.3f})')
+
+ # Per-task champion: which NON-WINNER candidate beats the winner on the most tasks?
+ task_wins = {}
+ winner_tasks = candidates[winner_suffix].get('per_task', {})
+ for suffix, data in candidates.items():
+     if suffix == winner_suffix:
+         continue
+     wins = 0
+     for tid, tdata in data.get('per_task', {}).items():
+         winner_task_score = winner_tasks.get(tid, {}).get('score', 0)
+         if tdata.get('score', 0) > winner_task_score:
+             wins += 1
+     if wins > 0:
+         task_wins[suffix] = wins
+
+ if task_wins:
+     champion_suffix = max(task_wins, key=task_wins.get)
+     print(f'PER_TASK_CHAMPION: {champion_suffix} (beats winner on {task_wins[champion_suffix]} tasks)')
+     # Save champion info for next iteration's crossover parent
+     with open('.harness-evolver/per_task_champion.json', 'w') as f:
+         json.dump({'suffix': champion_suffix, 'version': f'{version}{champion_suffix}', 'task_wins': task_wins[champion_suffix]}, f)
+ else:
+     print('NO_CHAMPION: winner dominates all tasks')
+ " 2>/dev/null
+ ```
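The champion script assumes each `scores.json` exposes a top-level `combined_score` plus a `per_task` map of per-task scores. An illustrative shape follows; the field names are inferred from the keys the script reads, and the real files may carry additional fields.

```python
import json

# Illustrative scores.json content, not the authoritative schema.
example_scores = {
    "combined_score": 0.82,
    "per_task": {
        "task_01": {"score": 1.0},
        "task_02": {"score": 0.5},
        "task_03": {"score": 0.0},
    },
}
print(json.dumps(example_scores, indent=2))
```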
 
- Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
+ **5c. Promote winner and report ALL candidates:**

- Rename the winner directory to the official version name:
+ Rename winner directory to official version:
  ```bash
  mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
  ```
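A defensive variant of the promotion step, shown as a sketch rather than part of the official loop: it refuses to clobber an already-promoted directory. The `v3` and `b` values are hypothetical placeholders for `{version}` and `{winning_suffix}`.

```python
import os
import shutil

# Hypothetical placeholders for {version} and {winning_suffix}.
version, winning_suffix = "v3", "b"
src = f".harness-evolver/harnesses/{version}{winning_suffix}"
dst = f".harness-evolver/harnesses/{version}"

os.makedirs(src, exist_ok=True)  # stand-in for the evaluated candidate dir

# Refuse to overwrite an already-promoted version directory.
if os.path.exists(dst):
    print(f"refusing to overwrite {dst}")
else:
    shutil.move(src, dst)
    print(f"promoted {src} -> {dst}")
```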
 
- Update state with the winner:
+ Update state:
  ```bash
  python3 $TOOLS/state.py update \
    --base-dir .harness-evolver \
@@ -299,13 +565,17 @@ python3 $TOOLS/state.py update \
    --proposal .harness-evolver/harnesses/{version}/proposal.md
  ```

- Report ALL candidates:
+ Report ALL candidates with their scores and strategies:
  ```
- Iteration {i}/{N} — 3 candidates evaluated:
- {version}a (exploit): {score_a} — {1-line summary from proposal.md}
- {version}b (explore): {score_b} — {1-line summary}
- {version}c (cross): {score_c} — {1-line summary}
- Winner: {version}{suffix} ({score}) promoted to {version}
+ Iteration {i}/{N} — {num_candidates} candidates evaluated:
+ {version}a (exploit): {score_a} — {summary}
+ {version}b (explore): {score_b} — {summary}
+ {version}c (crossover): {score_c} — {summary}
+ {version}d ({strategy_d}): {score_d} — {summary}
+ {version}e ({strategy_e}): {score_e} — {summary}
+
+ Winner: {version}{suffix} ({score})
+ Per-task champion: {champion_suffix} (beats winner on {N} tasks) — saved for next crossover
  ```

  Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
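Because `per_task_champion.json` survives into the next iteration, the crossover proposer can pick its parents along these lines (a sketch; `best_version` is a hypothetical placeholder for `{best_version}`):

```python
import json
import os

champ_path = ".harness-evolver/per_task_champion.json"
best_version = "v3"  # hypothetical placeholder for {best_version}

# Prefer the per-task champion as the second crossover parent;
# fall back to the best version itself when no champion was saved
# (e.g. iteration 1, or the winner dominated every task).
if os.path.exists(champ_path):
    champ = json.load(open(champ_path))
    parents = (best_version, champ["version"])
else:
    parents = (best_version, best_version)

print(parents)
```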