harness-evolver 2.2.0 → 2.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,228 +1,173 @@
1
+ <p align="center">
2
+ <img src="assets/banner.jpg" alt="Harness Evolver" width="100%">
3
+ </p>
4
+
1
5
  # Harness Evolver
2
6
 
3
- End-to-end optimization of LLM agent harnesses, inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026).
7
+ <p align="center">
8
+ <a href="https://www.npmjs.com/package/harness-evolver"><img src="https://img.shields.io/npm/v/harness-evolver?style=for-the-badge&color=blueviolet" alt="npm"></a>
9
+ <a href="https://github.com/raphaelchristi/harness-evolver/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License: MIT"></a>
10
+ <a href="https://arxiv.org/abs/2603.28052"><img src="https://img.shields.io/badge/Paper-Meta--Harness-FFD700?style=for-the-badge" alt="Paper"></a>
11
+ <a href="https://github.com/raphaelchristi/harness-evolver"><img src="https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge" alt="Built by Raphael Valdetaro"></a>
12
+ </p>
4
13
 
5
- **The harness is the 80% factor.** Changing just the scaffolding around a fixed LLM can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.
14
+ **Autonomous harness optimization for LLM agents.** Point it at any codebase, and Harness Evolver will evolve the scaffolding around your LLM (prompts, retrieval, routing, output parsing) using a multi-agent loop inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026).
6
15
 
7
- ## Install
16
+ The harness is the 80% factor. Changing just the scaffolding can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates that search.
8
17
 
9
- ```bash
10
- npx harness-evolver@latest
11
- ```
18
+ ---
12
19
 
13
- Select your runtime (Claude Code, Cursor, Codex, Windsurf) and scope (global/local). Then **restart your AI coding agent** for the skills to appear.
14
-
15
- ## Prerequisites
16
-
17
- ### API Keys (set in your shell before launching Claude Code)
18
-
19
- The harness you're evolving may call LLM APIs. Set the keys your harness needs:
20
+ ## Install
20
21
 
21
22
  ```bash
22
- # Required: at least one LLM provider
23
- export ANTHROPIC_API_KEY="sk-ant-..." # For Claude-based harnesses
24
- export OPENAI_API_KEY="sk-..." # For OpenAI-based harnesses
25
- export GEMINI_API_KEY="AIza..." # For Gemini-based harnesses
26
- export OPENROUTER_API_KEY="sk-or-..." # For OpenRouter (multi-model)
27
-
28
- # Optional: enhanced tracing
29
- export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
23
+ npx harness-evolver@latest
30
24
  ```
31
25
 
32
- The plugin auto-detects which keys are available during `/harness-evolver:init` and shows them. The proposer agent knows which APIs are available and uses them accordingly.
33
-
34
- **No API key needed for the example** — the classifier example uses keyword matching (mock mode), no LLM calls.
35
-
36
- ### Optional: Enhanced Integrations
26
+ > Works with Claude Code, Cursor, Codex, and Windsurf. Restart your agent after install.
37
27
 
38
- ```bash
39
- # LangSmith — rich trace analysis for the proposer
40
- uv tool install langsmith-cli && langsmith-cli auth login
41
-
42
- # Context7 — up-to-date library documentation for the proposer
43
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
44
-
45
- # LangChain Docs — LangChain/LangGraph-specific documentation
46
- claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
47
- ```
28
+ ---
48
29
 
49
30
  ## Quick Start
50
31
 
51
- ### Try the Example (no API key needed)
52
-
53
32
  ```bash
54
- # 1. Copy the example
55
- cp -r ~/.harness-evolver/examples/classifier ./my-classifier
56
- cd my-classifier
57
-
58
- # 2. Open Claude Code
33
+ cd my-llm-project
59
34
  claude
60
35
 
61
- # 3. Initialize auto-detects harness.py, eval.py, tasks/
62
- /harness-evolver:init
63
-
64
- # 4. Run the evolution loop
65
- /harness-evolver:evolve --iterations 3
66
-
67
- # 5. Check progress
68
- /harness-evolver:status
36
+ /harness-evolver:init # scans code, creates eval + tasks if missing
37
+ /harness-evolver:evolve # runs the optimization loop
38
+ /harness-evolver:status # check progress anytime
69
39
  ```
70
40
 
71
- ### Use with Your Own Project
41
+ **Zero-config mode:** If your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
72
42
 
73
- ```bash
74
- cd my-llm-project
75
- claude
76
-
77
- # Init scans your project, identifies the entry point,
78
- # and helps create harness wrapper + eval + tasks if missing
79
- /harness-evolver:init
80
-
81
- # Run optimization
82
- /harness-evolver:evolve --iterations 10
83
- ```
43
+ ---
84
44
 
85
- The init skill adapts to your project — if you have `graph.py` instead of `harness.py`, it creates a thin wrapper. If you don't have an eval script, it helps you write one.
45
+ ## How It Works
86
46
 
87
- ## Available Commands
47
+ <table>
48
+ <tr>
49
+ <td><b>5 Proposers</b></td>
50
+ <td>Each iteration spawns 5 parallel agents with different strategies: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), plus two adaptive slots (failure-targeted, creative, or efficiency, chosen from per-task failure analysis). Best candidate wins.</td>
51
+ </tr>
52
+ <tr>
53
+ <td><b>Full Traces</b></td>
54
+ <td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Proposers read actual LLM prompts and responses.</td>
55
+ </tr>
56
+ <tr>
57
+ <td><b>Critic</b></td>
58
+ <td>Auto-triggers when scores jump suspiciously fast. Analyzes eval quality, detects gaming, proposes stricter evaluation. Prevents false convergence.</td>
59
+ </tr>
60
+ <tr>
61
+ <td><b>Architect</b></td>
62
+ <td>Auto-triggers on stagnation or regression. Recommends topology changes (single-call → RAG, chain → ReAct, etc.) with concrete migration steps.</td>
63
+ </tr>
64
+ <tr>
65
+ <td><b>Judge</b></td>
66
+ <td>LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed.</td>
67
+ </tr>
68
+ </table>
69
+
70
+ ---
71
+
72
+ ## Commands
88
73
 
89
74
  | Command | What it does |
90
75
  |---|---|
91
76
  | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
92
- | `/harness-evolver:evolve` | Run the autonomous optimization loop |
93
- | `/harness-evolver:status` | Show progress (scores, iterations, stagnation) |
77
+ | `/harness-evolver:evolve` | Run the autonomous optimization loop (5 parallel proposers) |
78
+ | `/harness-evolver:status` | Show progress, scores, stagnation detection |
94
79
  | `/harness-evolver:compare` | Diff two versions with per-task analysis |
95
80
  | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
96
- | `/harness-evolver:deploy` | Copy the best harness back to your project |
97
-
98
- ## How It Works
81
+ | `/harness-evolver:deploy` | Promote the best harness back to your project |
82
+ | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
83
+ | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
99
84
 
100
- ```
101
- ┌─────────────────────────────┐
102
- │ /harness-evolver:evolve │
103
- │ (orchestrator skill) │
104
- └──────────┬──────────────────┘
105
-
106
- ┌────────────────┼────────────────┐
107
- ▼ ▼ ▼
108
- ┌──────────┐ ┌────────────┐ ┌──────────┐
109
- │ PROPOSE │ │ EVALUATE │ │ UPDATE │
110
- │ proposer │ │ evaluate.py│ │ state.py │
111
- │ agent │ │ + eval.py │ │ │
112
- └──────────┘ └────────────┘ └──────────┘
113
- │ │ │
114
- ▼ ▼ ▼
115
- harnesses/ traces/ summary.json
116
- v{N}/ per-task STATE.md
117
- harness.py stdout/stderr PROPOSER_HISTORY.md
118
- proposal.md timing.json
119
- scores.json
120
- ```
85
+ ---
121
86
 
122
- 1. **Propose** — A proposer agent reads all prior candidates' code, execution traces, and scores. Diagnoses failure modes via counterfactual analysis and writes a new harness.
123
- 2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). Your eval script scores the results.
124
- 3. **Update** — State files are updated with the new score, parent lineage, and regression detection.
125
- 4. **Repeat** — Until N iterations, stagnation (3 rounds without >1% improvement), or target score reached.
87
+ ## Agents
126
88
 
127
- ## The Harness Contract
89
+ | Agent | Role | Color |
90
+ |---|---|---|
91
+ | **Proposer** | Evolves the harness code based on trace analysis | Green |
92
+ | **Architect** | Recommends multi-agent topology (ReAct, RAG, hierarchical, etc.) | Blue |
93
+ | **Critic** | Evaluates eval quality, detects gaming, proposes stricter scoring | Red |
94
+ | **Judge** | LLM-as-judge scoring — works without expected answers | Yellow |
95
+ | **TestGen** | Generates synthetic test cases from code analysis | Cyan |
128
96
 
129
- A harness is **any executable** that accepts:
97
+ ---
130
98
 
131
- ```bash
132
- python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
133
- ```
134
-
135
- - `--input`: JSON with `{id, input, metadata}` (never sees expected answers)
136
- - `--output`: JSON with `{id, output}`
137
- - `--traces-dir`: optional directory for rich traces
138
- - `--config`: optional JSON with evolvable parameters
99
+ ## Integrations
139
100
 
140
- The eval script is also any executable:
101
+ <table>
102
+ <tr>
103
+ <td><b>LangSmith</b></td>
104
+ <td>Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via <code>langsmith-cli</code>. Processed into readable format per iteration.</td>
105
+ </tr>
106
+ <tr>
107
+ <td><b>Context7</b></td>
108
+ <td>Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis.</td>
109
+ </tr>
110
+ <tr>
111
+ <td><b>LangChain Docs</b></td>
112
+ <td>LangChain/LangGraph-specific documentation search via MCP.</td>
113
+ </tr>
114
+ </table>
141
115
 
142
116
  ```bash
143
- python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json
144
- ```
145
-
146
- Works with **any language, any framework, any domain**.
147
-
148
- ## Project Structure (after init)
149
-
150
- ```
151
- .harness-evolver/ # Created by /harness-evolver:init
152
- ├── config.json # Project config (harness cmd, eval, API keys detected)
153
- ├── summary.json # Source of truth (versions, scores, parents)
154
- ├── STATE.md # Human-readable status
155
- ├── PROPOSER_HISTORY.md # Log of all proposals and outcomes
156
- ├── baseline/ # Original harness (read-only)
157
- │ └── harness.py
158
- ├── eval/
159
- │ ├── eval.py # Your scoring script
160
- │ └── tasks/ # Test cases
161
- └── harnesses/
162
- └── v001/
163
- ├── harness.py # Evolved candidate
164
- ├── proposal.md # Why this version was created
165
- ├── scores.json # How it scored
166
- └── traces/ # Full execution traces
167
- ├── stdout.log
168
- ├── stderr.log
169
- ├── timing.json
170
- └── task_001/
171
- ├── input.json
172
- └── output.json
117
+ # Optional: installed during npx setup, or add manually:
118
+ uv tool install langsmith-cli && langsmith-cli auth login
119
+ claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
120
+ claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
173
121
  ```
174
122
 
175
- ## The Proposer
123
+ ---
176
124
 
177
- The core of the system. 4-phase workflow from the Meta-Harness paper:
178
-
179
- | Phase | What it does |
180
- |---|---|
181
- | **Orient** | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
182
- | **Diagnose** | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
183
- | **Propose** | Write new harness. Prefer additive changes after regressions. |
184
- | **Document** | Write `proposal.md` with evidence. Update history. |
185
-
186
- **7 rules:** evidence-based changes, conservative after regression, don't repeat mistakes, one hypothesis at a time, maintain interface, prefer readability, use available API keys from environment.
187
-
188
- ## Integrations
125
+ ## The Harness Contract
189
126
 
190
- ### LangSmith (optional, recommended for LangChain/LangGraph harnesses)
127
+ A harness is **any executable**:
191
128
 
192
129
  ```bash
193
- export LANGSMITH_API_KEY=lsv2_...
194
- uv tool install langsmith-cli && langsmith-cli auth login
130
+ python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
195
131
  ```
196
132
 
197
- When detected, the plugin:
198
- - Sets `LANGCHAIN_TRACING_V2=true` automatically — all LLM calls are traced
199
- - The proposer queries traces directly via `langsmith-cli`:
133
+ Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
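For concreteness, a minimal conforming harness might look like the sketch below. The uppercase echo logic is a placeholder for your real pipeline; only the CLI contract (`--input`, `--output`, optional `--traces-dir`/`--config`) and the `{id, input, metadata}` / `{id, output}` JSON shapes come from this document.

```python
import argparse
import json
import pathlib
import sys

def run_task(task, config=None):
    # Placeholder logic: echo the input uppercased. Swap in your real
    # pipeline (LLM calls, retrieval, parsing) here.
    return {"id": task["id"], "output": str(task["input"]).upper()}

def main(argv):
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)    # {id, input, metadata}
    p.add_argument("--output", required=True)   # {id, output}
    p.add_argument("--traces-dir", default=None)  # optional rich traces
    p.add_argument("--config", default=None)      # optional evolvable params
    args = p.parse_args(argv)

    task = json.loads(pathlib.Path(args.input).read_text())
    config = json.loads(pathlib.Path(args.config).read_text()) if args.config else {}
    result = run_task(task, config)
    pathlib.Path(args.output).write_text(json.dumps(result))

# Run main() only when invoked as a script with arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```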
200
134
 
201
- ```bash
202
- langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
203
- langsmith-cli --json runs stats --project harness-evolver-v003
204
- ```
135
+ ---
205
136
 
206
- ### Context7 (optional, recommended for any library-heavy harness)
137
+ ## Evolution Loop
207
138
 
208
- ```bash
209
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
139
+ ```
140
+ /harness-evolver:evolve
141
+
142
+ ├─ 1. Gather LangSmith traces (processed into readable format)
143
+ ├─ 2. Spawn 5 proposers in parallel (exploit/explore/crossover/prompt/retrieval)
144
+ ├─ 3. Validate all candidates
145
+ ├─ 4. Evaluate all candidates
146
+ ├─ 4.5 Judge (if using LLM-as-judge eval)
147
+ ├─ 5. Select winner (highest combined_score)
148
+ ├─ 6. Report results
149
+ ├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
150
+ ├─ 7. Auto-trigger Architect (if regression or stagnation)
151
+ └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
210
152
  ```
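The numbered steps above can be sketched as a plain control loop. This is illustrative only: `spawn_proposers`, `evaluate`, `run_critic`, and `run_architect` are stand-ins for the real skill steps, and the 0.3 jump threshold is the one this document names for the critic trigger.

```python
import random

# Stand-ins for the real skill steps (proposer agents, eval script, critic, architect).
def spawn_proposers(state, n=5):
    base = state["best_score"]
    return [{"combined_score": max(0.0, min(1.0, base + random.uniform(-0.1, 0.1)))}
            for _ in range(n)]

def validate(candidate): return True          # step 3
def evaluate(candidate): return candidate     # step 4
def run_critic(candidate): pass               # step 6.5
def run_architect(state): pass                # step 7

def evolve(state, max_iterations=10, target=1.0):
    for _ in range(max_iterations):
        candidates = [c for c in spawn_proposers(state) if validate(c)]  # steps 2-3
        scored = [evaluate(c) for c in candidates]                       # step 4
        winner = max(scored, key=lambda c: c["combined_score"])          # step 5
        jump = winner["combined_score"] - state["best_score"]
        if jump > 0.3 or winner["combined_score"] >= 1.0:                # step 6.5
            run_critic(winner)
        if jump <= 0:                                                    # step 7
            run_architect(state)
        state["history"].append(winner["combined_score"])
        state["best_score"] = max(state["best_score"], winner["combined_score"])
        if state["best_score"] >= target:                                # step 8
            break
    return state
```

The real loop also stops on stagnation after the architect has run; that check is omitted here for brevity.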
211
153
 
212
- The plugin detects your stack via AST analysis (17 libraries: LangChain, LangGraph, OpenAI, Anthropic, ChromaDB, FastAPI, etc.) and instructs the proposer to consult current docs before proposing API changes.
154
+ ---
213
155
 
214
- ## Development
156
+ ## API Keys
157
+
158
+ Set in your shell before launching Claude Code:
215
159
 
216
160
  ```bash
217
- # Run all tests (41 tests, stdlib-only)
218
- python3 -m unittest discover -s tests -v
161
+ export GEMINI_API_KEY="AIza..." # Gemini-based harnesses
162
+ export ANTHROPIC_API_KEY="sk-ant-..." # Claude-based harnesses
163
+ export OPENAI_API_KEY="sk-..." # OpenAI-based harnesses
164
+ export OPENROUTER_API_KEY="sk-or-..." # Multi-model via OpenRouter
165
+ export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
166
+ ```
219
167
 
220
- # Test example manually
221
- python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json
168
+ The plugin auto-detects available keys. No key needed for the included example.
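Key detection can be as simple as checking the environment for the variables listed above. A sketch, not the plugin's actual implementation:

```python
import os

PROVIDER_KEYS = ["GEMINI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY",
                 "OPENROUTER_API_KEY", "LANGSMITH_API_KEY"]

def detect_keys(env=None):
    # Report which keys are set without ever exposing their values.
    env = os.environ if env is None else env
    return {k: bool(env.get(k)) for k in PROVIDER_KEYS}
```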
222
169
 
223
- # Install locally for development
224
- node bin/install.js
225
- ```
170
+ ---
226
171
 
227
172
  ## Comparison
228
173
 
@@ -230,17 +175,23 @@ node bin/install.js
230
175
  |---|---|---|---|---|
231
176
  | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
232
177
  | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
233
- | **Domain** | TerminalBench-2 | Coding benchmarks | Dev workflow | **Any domain** |
234
- | **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx`** |
178
+ | **Candidates/iter** | 1 | 1 | N/A | **5 parallel** |
179
+ | **Auto-critique** | No | No | No | **Yes (critic + judge)** |
180
+ | **Architecture** | Fixed | Fixed | N/A | **Auto-recommended** |
235
181
  | **LangSmith** | No | No | No | **Yes** |
236
182
  | **Context7** | No | No | No | **Yes** |
183
+ | **Zero-config** | No | No | No | **Yes** |
184
+
185
+ ---
237
186
 
238
187
  ## References
239
188
 
240
- - [Meta-Harness paper (arxiv 2603.28052)](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
241
- - [Design Spec](docs/specs/2026-03-31-harness-evolver-design.md)
242
- - [LangSmith Integration](docs/specs/2026-03-31-langsmith-integration.md)
243
- - [Context7 Integration](docs/specs/2026-03-31-context7-integration.md)
189
+ - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
190
+ - [Darwin Gödel Machine](https://sakana.ai/dgm/) — Sakana AI (parallel evolution architecture)
191
+ - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind (population-based code evolution)
192
+ - [Agent Skills Specification](https://agentskills.io) — Open standard for AI agent skills
193
+
194
+ ---
244
195
 
245
196
  ## License
246
197
 
@@ -15,12 +15,15 @@ every file listed there before performing any other actions. These files are you
15
15
 
16
16
  ## Strategy Injection
17
17
 
18
- Your prompt may contain a `<strategy>` block defining your evolutionary role:
19
- - **exploitation**: Make targeted, conservative fixes to the current best
20
- - **exploration**: Try fundamentally different approaches, be bold
21
- - **crossover**: Combine strengths from two parent versions
18
+ Your prompt contains a `<strategy>` block defining your approach. Follow it:
19
+
20
+ - **exploitation**: Conservative fix on current best. Small, targeted changes.
21
+ - **exploration**: Bold, fundamentally different approach. High risk, high reward.
22
+ - **crossover**: Combine strengths from two parent versions.
23
+ - **failure-targeted**: Fix SPECIFIC failing tasks listed in the strategy. Read their traces, understand the root cause, fix that capability. You are free to change ANYTHING needed.
24
+ - **creative**: Try something unexpected — different algorithms, architecture, libraries.
25
+ - **efficiency**: Same quality but fewer tokens, faster, simpler code.
22
26
 
23
- Follow the strategy. It determines your risk tolerance and parent selection.
24
27
  If no strategy block is present, default to exploitation (conservative improvement).
25
28
 
26
29
  ## Context7 — Enrich Your Knowledge
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
- "version": "2.2.0",
3
+ "version": "2.4.0",
4
4
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
5
5
  "author": "Raphael Valdetaro",
6
6
  "license": "MIT",
@@ -52,21 +52,137 @@ LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*
52
52
 
53
53
  If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
54
54
 
55
- **Step 2: Gather traces from the discovered project**
55
+ **Step 2: Gather raw traces from the discovered project**
56
56
 
57
57
  ```bash
58
58
  if [ -n "$LS_PROJECT" ]; then
59
- langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
60
- langsmith-cli --json runs list --project "$LS_PROJECT" --fields id,name,inputs,outputs,latency_ms,total_tokens --limit 20 > .harness-evolver/langsmith_runs.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
59
+ langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
61
60
  langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
62
61
  echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
63
62
  else
64
- echo "[]" > .harness-evolver/langsmith_diagnosis.json
63
+ echo "[]" > /tmp/langsmith_raw.json
65
64
  echo "{}" > .harness-evolver/langsmith_stats.json
66
65
  fi
67
66
  ```
68
67
 
69
- These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
68
+ **Step 3: Process raw LangSmith data into a readable format for proposers**
69
+
70
+ The raw LangSmith data contains LangChain-serialized messages that are hard to read. Process it into a clean summary:
71
+
72
+ ```bash
73
+ python3 -c "
74
+ import json, sys
75
+
76
+ raw = json.load(open('/tmp/langsmith_raw.json'))
77
+ if not raw:
78
+ json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
79
+ sys.exit(0)
80
+
81
+ clean = []
82
+ for r in raw:
83
+ entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
84
+
85
+ # Extract readable prompt from LangChain serialized inputs
86
+ inputs = r.get('inputs', {})
87
+ if isinstance(inputs, dict) and 'messages' in inputs:
88
+ msgs = inputs['messages']
89
+ for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
90
+ for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
91
+ if isinstance(msg, dict):
92
+ kwargs = msg.get('kwargs', msg)
93
+ content = kwargs.get('content', '')
94
+ msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
95
+ if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
96
+ entry['user_message'] = str(content)[:300]
97
+ elif 'System' in str(msg_type):
98
+ entry['system_prompt_preview'] = str(content)[:200]
99
+
100
+ # Extract readable output
101
+ outputs = r.get('outputs', {})
102
+ if isinstance(outputs, dict) and 'generations' in outputs:
103
+ gens = outputs['generations']
104
+ if gens and isinstance(gens, list) and gens[0]:
105
+ gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
106
+ if isinstance(gen, dict):
107
+ msg = gen.get('message', gen)
108
+ if isinstance(msg, dict):
109
+ kwargs = msg.get('kwargs', msg)
110
+ entry['llm_response'] = str(kwargs.get('content', ''))[:300]
111
+
112
+ clean.append(entry)
113
+
114
+ json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
115
+ print(f'Processed {len(clean)} LangSmith runs into readable format')
116
+ " 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
117
+ ```
118
+
119
+ The resulting `langsmith_runs.json` has clean, readable entries:
120
+ ```json
121
+ [
122
+ {
123
+ "name": "ChatGoogleGenerativeAI",
124
+ "tokens": 1332,
125
+ "error": null,
126
+ "user_message": "Analyze this text: Good morning everyone...",
127
+ "system_prompt_preview": "You are a content moderator...",
128
+ "llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
129
+ }
130
+ ]
131
+ ```
132
+
133
+ These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
134
+
135
+ ### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
136
+
137
+ Before spawning proposers, analyze which tasks are failing and cluster them:
138
+
139
+ ```bash
140
+ python3 -c "
141
+ import json, os, sys
142
+
143
+ # Find best version scores
144
+ summary = json.load(open('.harness-evolver/summary.json'))
145
+ best = summary['best']['version']
146
+ scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
147
+ if not os.path.exists(scores_path):
148
+ scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
149
+
150
+ if not scores_path or not os.path.exists(scores_path):
151
+ print('NO_SCORES')
152
+ sys.exit(0)
153
+
154
+ scores = json.load(open(scores_path))
155
+ tasks_dir = '.harness-evolver/eval/tasks/'
156
+ failures = {}
157
+
158
+ for tid, tdata in scores.get('per_task', {}).items():
159
+ score = tdata.get('score', 0)
160
+ if score < 0.7:
161
+ tfile = os.path.join(tasks_dir, tid + '.json')
162
+ cat = 'unknown'
163
+ if os.path.exists(tfile):
164
+ task = json.load(open(tfile))
165
+ meta = task.get('metadata', {})
166
+ cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
167
+ failures.setdefault(cat, []).append({'id': tid, 'score': score})
168
+
169
+ if not failures:
170
+ print('ALL_PASSING')
171
+ else:
172
+ sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
173
+ for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
174
+ task_ids = [t['id'] for t in tasks]
175
+ avg_score = sum(t['score'] for t in tasks) / len(tasks)
176
+ print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
177
+ " 2>/dev/null
178
+ ```
179
+
180
+ Parse the output:
181
+ - If `NO_SCORES` or `ALL_PASSING`: D gets "creative" brief, E gets "efficiency" brief
182
+ - If clusters found: D targets cluster 1, E targets cluster 2
183
+ - If only 1 cluster: D targets it, E gets "creative" brief
184
+
185
+ Save clusters for use in step 2.
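Parsing those printed lines into briefings can be sketched as follows. The `CLUSTER_n|category|ids|avg` line format comes from the script above; the helper names are hypothetical:

```python
import json

def parse_clusters(output):
    # Turn `CLUSTER_n|category|["task_ids"]|avg` lines into briefing dicts.
    clusters = []
    for line in output.strip().splitlines():
        if line.startswith("CLUSTER_"):
            _, category, ids, avg = line.split("|")
            clusters.append({"category": category,
                             "task_ids": json.loads(ids),
                             "avg_score": float(avg)})
    return clusters

def briefs_for_d_and_e(output):
    # Apply the routing rules above: clusters go to D then E,
    # with "creative"/"efficiency" as the fallback briefs.
    clusters = parse_clusters(output)
    if not clusters:                      # NO_SCORES or ALL_PASSING
        return ("creative", "efficiency")
    if len(clusters) == 1:
        return (clusters[0], "creative")
    return (clusters[0], clusters[1])
```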
70
186
 
71
187
  ### 2. Propose (3 parallel candidates)
72
188
 
@@ -76,7 +192,10 @@ This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
76
192
  Determine parents for each strategy:
77
193
  - **Exploiter parent**: current best version (from summary.json `best.version`)
78
194
  - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
79
- - **Crossover parents**: best version + a different high-scorer from a different lineage
195
+ - **Crossover parents**:
196
+ - Parent A = current best version
197
+ - Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
198
+ If no champion file exists, fall back to a non-best version from the archive.
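A sketch of that parent-selection logic. The `best.version` field of `summary.json` is referenced elsewhere in this document; the `per_task_champion.json` payload (`{"version": ...}`) and the `history` entries with `version`/`score` fields are assumed schemas for illustration:

```python
import json
import pathlib

def crossover_parents(root=".harness-evolver"):
    # Parent A: current best version from summary.json.
    summary = json.loads(pathlib.Path(root, "summary.json").read_text())
    parent_a = summary["best"]["version"]

    # Parent B: per-task champion from the previous iteration, if recorded.
    champ = pathlib.Path(root, "per_task_champion.json")
    if champ.exists():
        parent_b = json.loads(champ.read_text())["version"]
    else:
        # Fallback: a non-best archived version that scored above zero.
        others = [h["version"] for h in summary.get("history", [])
                  if h.get("score", 0) > 0 and h["version"] != parent_a]
        parent_b = others[-1] if others else parent_a
    return parent_a, parent_b
```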
80
199
 
81
200
  Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
82
201
 
@@ -198,35 +317,131 @@ Agent(
198
317
 
199
318
  **Also spawn these additional candidates:**
200
319
 
201
- **Candidate D (Prompt Specialist)** — `run_in_background: true`:
202
- Same as Exploiter but with a different focus:
320
+ **Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
321
+
322
+ If failure clusters were found in step 1.8:
323
+ ```
324
+ Agent(
325
+ subagent_type: "harness-evolver-proposer",
326
+ description: "Proposer D: fix {cluster_1_category} failures",
327
+ run_in_background: true,
328
+ prompt: |
329
+ <strategy>
330
+ APPROACH: failure-targeted
331
+ Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
332
+ They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
333
+ Read the traces of these specific tasks to understand WHY they fail.
334
+ Your changes should improve these tasks WITHOUT regressing others.
335
+ You are free to change anything — prompts, code, retrieval, architecture —
336
+ whatever is needed to fix THIS specific failure mode.
337
+ </strategy>
338
+
339
+ <objective>
340
+ Propose harness version {version}d targeting {cluster_1_category} failures.
341
+ </objective>
342
+
343
+ <files_to_read>
344
+ - .harness-evolver/summary.json
345
+ - .harness-evolver/PROPOSER_HISTORY.md
346
+ - .harness-evolver/config.json
347
+ - .harness-evolver/harnesses/{best_version}/harness.py
348
+ - .harness-evolver/harnesses/{best_version}/scores.json
349
+ - .harness-evolver/langsmith_runs.json (if exists)
350
+ - .harness-evolver/architecture.json (if exists)
351
+ </files_to_read>
352
+
353
+ <output>
354
+ Create directory .harness-evolver/harnesses/{version}d/ containing:
355
+ - harness.py, config.json, proposal.md
356
+ </output>
357
+ )
358
+ ```
359
+
360
+ If ALL_PASSING (no failures):
361
+ ```
362
+ Agent(
363
+ subagent_type: "harness-evolver-proposer",
364
+ description: "Proposer D: creative approach",
365
+ run_in_background: true,
366
+ prompt: |
367
+ <strategy>
368
+ APPROACH: creative
369
+ All tasks are scoring well. Try something UNEXPECTED:
370
+ - Different algorithm or library
371
+ - Completely different prompt architecture
372
+ - Novel error handling or output validation
373
+ - Something no one would think of
374
+ The goal is to discover improvements that incremental fixes would miss.
375
+ </strategy>
376
+ ...same files_to_read and output as above...
377
+ )
378
+ ```
379
+
380
+ **Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
381
+
382
+ If a second failure cluster exists:
203
383
  ```
204
- <strategy>
205
- APPROACH: prompt-engineering
206
- You are the PROMPT SPECIALIST. Focus ONLY on improving the system prompt,
207
- few-shot examples, output format instructions, and prompt structure.
208
- Do NOT change the retrieval logic, pipeline structure, or code architecture.
209
- </strategy>
384
+ Agent(
385
+ subagent_type: "harness-evolver-proposer",
386
+ description: "Proposer E: fix {cluster_2_category} failures",
387
+ run_in_background: true,
388
+ prompt: |
389
+ <strategy>
390
+ APPROACH: failure-targeted
391
+ Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
392
+ They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
393
+ Read the traces of these specific tasks to understand WHY they fail.
394
+ Your changes should improve these tasks WITHOUT regressing others.
395
+ You are free to change anything — prompts, code, retrieval, architecture —
396
+ whatever is needed to fix THIS specific failure mode.
397
+ </strategy>
398
+
399
+ <objective>
400
+ Propose harness version {version}e targeting {cluster_2_category} failures.
401
+ </objective>
402
+
403
+ <files_to_read>
404
+ - .harness-evolver/summary.json
405
+ - .harness-evolver/PROPOSER_HISTORY.md
406
+ - .harness-evolver/config.json
407
+ - .harness-evolver/harnesses/{best_version}/harness.py
408
+ - .harness-evolver/harnesses/{best_version}/scores.json
409
+ - .harness-evolver/langsmith_runs.json (if exists)
410
+ - .harness-evolver/architecture.json (if exists)
411
+ </files_to_read>
412
+
413
+ <output>
414
+ Create directory .harness-evolver/harnesses/{version}e/ containing:
415
+ - harness.py, config.json, proposal.md
416
+ </output>
417
+ )
210
418
  ```
211
- Output to: `.harness-evolver/harnesses/{version}d/`
212
419
 
213
- **Candidate E (Data/Retrieval Specialist)** — `run_in_background: true`:
420
+ If no second cluster (or ALL_PASSING):
214
421
  ```
- <strategy>
- APPROACH: retrieval-optimization
- You are the RETRIEVAL SPECIALIST. Focus ONLY on improving how data is
- retrieved, filtered, ranked, and presented to the LLM.
- Do NOT change the system prompt text or output formatting.
- Improve: search logic, relevance scoring, cross-domain retrieval, chunking.
- </strategy>
+ Agent(
+   subagent_type: "harness-evolver-proposer",
+   description: "Proposer E: efficiency optimization",
+   run_in_background: true,
+   prompt: |
+     <strategy>
+     APPROACH: efficiency
+     Maintain the current quality but optimize for:
+     - Fewer LLM tokens (shorter prompts, less context)
+     - Faster execution (reduce unnecessary steps)
+     - Simpler code (remove redundant logic)
+     - Better error handling (graceful degradation)
+     Do NOT sacrifice accuracy for speed — same quality, less cost.
+     </strategy>
+     ...same files_to_read and output as above...
+ )
  ```
- Output to: `.harness-evolver/harnesses/{version}e/`

  Wait for all 5 to complete. The background agents will notify when done.
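The launch-and-wait pattern above can be sketched generically. This is a minimal illustration only: the real loop spawns background proposer agents via the Agent tool, not Python threads, and `propose_candidate` is a hypothetical stand-in for one proposer run.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for spawning one background proposer agent.
def propose_candidate(suffix: str, strategy: str) -> str:
    return f"candidate {suffix} ({strategy}) proposed"

# The five strategies described in this section.
strategies = {"a": "exploit", "b": "explore", "c": "crossover",
              "d": "failure-targeted", "e": "efficiency"}

# Launch all candidates concurrently, then block until every one finishes.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {s: pool.submit(propose_candidate, s, strat)
               for s, strat in strategies.items()}
    results = {s: f.result() for s, f in futures.items()}

for suffix, msg in sorted(results.items()):
    print(msg)
```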

  **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses the baseline as both parents, with the instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.

- **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, replace Candidate D with a "Radical" strategy that rewrites the harness from scratch.
+ **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, step 1.8 will naturally shift D and E toward failure-targeted or creative strategies based on actual task performance.

  ### 3. Validate All Candidates

@@ -281,16 +496,67 @@ Wait for `## JUDGE COMPLETE`.

  If eval_type is NOT "pending-judge", the eval.py already produced real scores — skip this step.

- ### 5. Select Winner + Update State
+ ### 5. Select Winner + Track Per-Task Champions
+
+ **5a. Find overall winner (highest combined_score):**
+
+ Compare all evaluated candidates. The winner is the one with the highest combined_score.
+
+ **5b. Find per-task champion (candidate that beats the winner on most individual tasks):**
+
+ ```bash
+ python3 -c "
+ import json, os
+
+ version = '{version}'
+ candidates = {}
+ for suffix in ['a', 'b', 'c', 'd', 'e']:
+     path = f'.harness-evolver/harnesses/{version}{suffix}/scores.json'
+     if os.path.exists(path):
+         candidates[suffix] = json.load(open(path))
+
+ if not candidates:
+     print('NO_CANDIDATES')
+     exit()
+
+ # Overall winner
+ winner_suffix = max(candidates, key=lambda s: candidates[s].get('combined_score', 0))
+ winner_score = candidates[winner_suffix]['combined_score']
+ print(f'WINNER: {winner_suffix} (score: {winner_score:.3f})')
+
+ # Per-task champion: which NON-WINNER candidate beats the winner on the most tasks?
+ task_wins = {}
+ winner_tasks = candidates[winner_suffix].get('per_task', {})
+ for suffix, data in candidates.items():
+     if suffix == winner_suffix:
+         continue
+     wins = 0
+     for tid, tdata in data.get('per_task', {}).items():
+         winner_task_score = winner_tasks.get(tid, {}).get('score', 0)
+         if tdata.get('score', 0) > winner_task_score:
+             wins += 1
+     if wins > 0:
+         task_wins[suffix] = wins
+
+ if task_wins:
+     champion_suffix = max(task_wins, key=task_wins.get)
+     print(f'PER_TASK_CHAMPION: {champion_suffix} (beats winner on {task_wins[champion_suffix]} tasks)')
+     # Save champion info for next iteration's crossover parent
+     with open('.harness-evolver/per_task_champion.json', 'w') as f:
+         json.dump({'suffix': champion_suffix, 'version': f'{version}{champion_suffix}', 'task_wins': task_wins[champion_suffix]}, f)
+ else:
+     print('NO_CHAMPION: winner dominates all tasks')
+ " 2>/dev/null
+ ```
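The champion script assumes each `scores.json` exposes a top-level `combined_score` plus a `per_task` map of per-task scores. An illustrative shape follows; the field names are inferred from the keys the script reads, and the real files may carry additional fields.

```python
import json

# Illustrative scores.json content, not the authoritative schema.
example_scores = {
    "combined_score": 0.82,
    "per_task": {
        "task_01": {"score": 1.0},
        "task_02": {"score": 0.5},
        "task_03": {"score": 0.0},
    },
}
print(json.dumps(example_scores, indent=2))
```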
 
- Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
+ **5c. Promote winner and report ALL candidates:**

- Rename the winner directory to the official version name:
+ Rename winner directory to official version:
  ```bash
  mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
  ```
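A defensive variant of the promotion step, shown as a sketch rather than part of the official loop: it refuses to clobber an already-promoted directory. The `v3` and `b` values are hypothetical placeholders for `{version}` and `{winning_suffix}`.

```python
import os
import shutil

# Hypothetical placeholders for {version} and {winning_suffix}.
version, winning_suffix = "v3", "b"
src = f".harness-evolver/harnesses/{version}{winning_suffix}"
dst = f".harness-evolver/harnesses/{version}"

os.makedirs(src, exist_ok=True)  # stand-in for the evaluated candidate dir

# Refuse to overwrite an already-promoted version directory.
if os.path.exists(dst):
    print(f"refusing to overwrite {dst}")
else:
    shutil.move(src, dst)
    print(f"promoted {src} -> {dst}")
```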
 
- Update state with the winner:
+ Update state:
  ```bash
  python3 $TOOLS/state.py update \
    --base-dir .harness-evolver \
@@ -299,13 +565,17 @@ python3 $TOOLS/state.py update \
    --proposal .harness-evolver/harnesses/{version}/proposal.md
  ```

- Report ALL candidates:
+ Report ALL candidates with their scores and strategies:
  ```
- Iteration {i}/{N} — 3 candidates evaluated:
- {version}a (exploit): {score_a} — {1-line summary from proposal.md}
- {version}b (explore): {score_b} — {1-line summary}
- {version}c (cross): {score_c} — {1-line summary}
- Winner: {version}{suffix} ({score}) promoted to {version}
+ Iteration {i}/{N} — {num_candidates} candidates evaluated:
+ {version}a (exploit): {score_a} — {summary}
+ {version}b (explore): {score_b} — {summary}
+ {version}c (crossover): {score_c} — {summary}
+ {version}d ({strategy_d}): {score_d} — {summary}
+ {version}e ({strategy_e}): {score_e} — {summary}
+
+ Winner: {version}{suffix} ({score})
+ Per-task champion: {champion_suffix} (beats winner on {N} tasks) — saved for next crossover
  ```

  Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
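Because `per_task_champion.json` survives into the next iteration, the crossover proposer can pick its parents along these lines (a sketch; `best_version` is a hypothetical placeholder for `{best_version}`):

```python
import json
import os

champ_path = ".harness-evolver/per_task_champion.json"
best_version = "v3"  # hypothetical placeholder for {best_version}

# Prefer the per-task champion as the second crossover parent;
# fall back to the best version itself when no champion was saved
# (e.g. iteration 1, or the winner dominated every task).
if os.path.exists(champ_path):
    champ = json.load(open(champ_path))
    parents = (best_version, champ["version"])
else:
    parents = (best_version, best_version)

print(parents)
```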