harness-evolver 2.3.0 → 2.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +128 -177
- package/agents/harness-evolver-proposer.md +8 -5
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +236 -30
package/README.md
CHANGED
@@ -1,228 +1,173 @@
+<p align="center">
+<img src="assets/banner.jpg" alt="Harness Evolver" width="100%">
+</p>
+
 # Harness Evolver
 
-
+<p align="center">
+<a href="https://www.npmjs.com/package/harness-evolver"><img src="https://img.shields.io/npm/v/harness-evolver?style=for-the-badge&color=blueviolet" alt="npm"></a>
+<a href="https://github.com/raphaelchristi/harness-evolver/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License: MIT"></a>
+<a href="https://arxiv.org/abs/2603.28052"><img src="https://img.shields.io/badge/Paper-Meta--Harness-FFD700?style=for-the-badge" alt="Paper"></a>
+<a href="https://github.com/raphaelchristi/harness-evolver"><img src="https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge" alt="Built by Raphael Valdetaro"></a>
+</p>
 
-**
+**Autonomous harness optimization for LLM agents.** Point at any codebase, and Harness Evolver will evolve the scaffolding around your LLM — prompts, retrieval, routing, output parsing — using a multi-agent loop inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026).
 
-
+The harness is the 80% factor. Changing just the scaffolding can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates that search.
 
-
-npx harness-evolver@latest
-```
+---
 
-
-
-## Prerequisites
-
-### API Keys (set in your shell before launching Claude Code)
-
-The harness you're evolving may call LLM APIs. Set the keys your harness needs:
+## Install
 
 ```bash
-
-export ANTHROPIC_API_KEY="sk-ant-..."    # For Claude-based harnesses
-export OPENAI_API_KEY="sk-..."           # For OpenAI-based harnesses
-export GEMINI_API_KEY="AIza..."          # For Gemini-based harnesses
-export OPENROUTER_API_KEY="sk-or-..."    # For OpenRouter (multi-model)
-
-# Optional: enhanced tracing
-export LANGSMITH_API_KEY="lsv2_pt_..."   # Auto-enables LangSmith tracing
+npx harness-evolver@latest
 ```
 
-
-
-**No API key needed for the example** — the classifier example uses keyword matching (mock mode), no LLM calls.
-
-### Optional: Enhanced Integrations
+> Works with Claude Code, Cursor, Codex, and Windsurf. Restart your agent after install.
 
-
-# LangSmith — rich trace analysis for the proposer
-uv tool install langsmith-cli && langsmith-cli auth login
-
-# Context7 — up-to-date library documentation for the proposer
-claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
-
-# LangChain Docs — LangChain/LangGraph-specific documentation
-claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
-```
+---
 
 ## Quick Start
 
-### Try the Example (no API key needed)
-
 ```bash
-
-cp -r ~/.harness-evolver/examples/classifier ./my-classifier
-cd my-classifier
-
-# 2. Open Claude Code
+cd my-llm-project
 claude
 
-#
-/harness-evolver:
-
-# 4. Run the evolution loop
-/harness-evolver:evolve --iterations 3
-
-# 5. Check progress
-/harness-evolver:status
+/harness-evolver:init      # scans code, creates eval + tasks if missing
+/harness-evolver:evolve    # runs the optimization loop
+/harness-evolver:status    # check progress anytime
 ```
 
-
+**Zero-config mode:** If your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
 
-
-cd my-llm-project
-claude
-
-# Init scans your project, identifies the entry point,
-# and helps create harness wrapper + eval + tasks if missing
-/harness-evolver:init
-
-# Run optimization
-/harness-evolver:evolve --iterations 10
-```
+---
 
-
+## How It Works
 
-
+<table>
+<tr>
+<td><b>5 Proposers</b></td>
+<td>Each iteration spawns 5 parallel agents with different strategies: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), prompt specialist, retrieval specialist. Best candidate wins.</td>
+</tr>
+<tr>
+<td><b>Full Traces</b></td>
+<td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Proposers read actual LLM prompts and responses.</td>
+</tr>
+<tr>
+<td><b>Critic</b></td>
+<td>Auto-triggers when scores jump suspiciously fast. Analyzes eval quality, detects gaming, proposes stricter evaluation. Prevents false convergence.</td>
+</tr>
+<tr>
+<td><b>Architect</b></td>
+<td>Auto-triggers on stagnation or regression. Recommends topology changes (single-call → RAG, chain → ReAct, etc.) with concrete migration steps.</td>
+</tr>
+<tr>
+<td><b>Judge</b></td>
+<td>LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed.</td>
+</tr>
+</table>
+
+---
+
+## Commands
 
 | Command | What it does |
 |---|---|
 | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
-| `/harness-evolver:evolve` | Run the autonomous optimization loop |
-| `/harness-evolver:status` | Show progress
+| `/harness-evolver:evolve` | Run the autonomous optimization loop (5 parallel proposers) |
+| `/harness-evolver:status` | Show progress, scores, stagnation detection |
 | `/harness-evolver:compare` | Diff two versions with per-task analysis |
 | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
-| `/harness-evolver:deploy` |
-
-
+| `/harness-evolver:deploy` | Promote the best harness back to your project |
+| `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
+| `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
 
-
-┌─────────────────────────────┐
-│   /harness-evolver:evolve   │
-│    (orchestrator skill)     │
-└──────────┬──────────────────┘
-           │
-  ┌────────────────┼────────────────┐
-  ▼                ▼                ▼
-┌──────────┐  ┌────────────┐  ┌──────────┐
-│ PROPOSE  │  │  EVALUATE  │  │  UPDATE  │
-│ proposer │  │ evaluate.py│  │ state.py │
-│  agent   │  │  + eval.py │  │          │
-└──────────┘  └────────────┘  └──────────┘
-     │              │               │
-     ▼              ▼               ▼
- harnesses/      traces/       summary.json
-   v{N}/         per-task      STATE.md
-  harness.py   stdout/stderr   PROPOSER_HISTORY.md
-  proposal.md  timing.json
-  scores.json
-```
+---
 
-
-2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). Your eval script scores the results.
-3. **Update** — State files are updated with the new score, parent lineage, and regression detection.
-4. **Repeat** — Until N iterations, stagnation (3 rounds without >1% improvement), or target score reached.
+## Agents
 
-
+| Agent | Role | Color |
+|---|---|---|
+| **Proposer** | Evolves the harness code based on trace analysis | Green |
+| **Architect** | Recommends multi-agent topology (ReAct, RAG, hierarchical, etc.) | Blue |
+| **Critic** | Evaluates eval quality, detects gaming, proposes stricter scoring | Red |
+| **Judge** | LLM-as-judge scoring — works without expected answers | Yellow |
+| **TestGen** | Generates synthetic test cases from code analysis | Cyan |
 
-
+---
 
-
-python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
-```
-
-- `--input`: JSON with `{id, input, metadata}` (never sees expected answers)
-- `--output`: JSON with `{id, output}`
-- `--traces-dir`: optional directory for rich traces
-- `--config`: optional JSON with evolvable parameters
+## Integrations
 
-
+<table>
+<tr>
+<td><b>LangSmith</b></td>
+<td>Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via <code>langsmith-cli</code>. Processed into readable format per iteration.</td>
+</tr>
+<tr>
+<td><b>Context7</b></td>
+<td>Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis.</td>
+</tr>
+<tr>
+<td><b>LangChain Docs</b></td>
+<td>LangChain/LangGraph-specific documentation search via MCP.</td>
+</tr>
+</table>
 
 ```bash
-
-
-
-
-
-## Project Structure (after init)
-
-```
-.harness-evolver/              # Created by /harness-evolver:init
-├── config.json                # Project config (harness cmd, eval, API keys detected)
-├── summary.json               # Source of truth (versions, scores, parents)
-├── STATE.md                   # Human-readable status
-├── PROPOSER_HISTORY.md        # Log of all proposals and outcomes
-├── baseline/                  # Original harness (read-only)
-│   └── harness.py
-├── eval/
-│   ├── eval.py                # Your scoring script
-│   └── tasks/                 # Test cases
-└── harnesses/
-    └── v001/
        ├── harness.py         # Evolved candidate
        ├── proposal.md        # Why this version was created
        ├── scores.json        # How it scored
        └── traces/            # Full execution traces
            ├── stdout.log
            ├── stderr.log
            ├── timing.json
            └── task_001/
                ├── input.json
                └── output.json
+# Optional — install during npx setup or manually:
+uv tool install langsmith-cli && langsmith-cli auth login
+claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
+claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
 ```
 
-
+---
 
-The
-
-| Phase | What it does |
-|---|---|
-| **Orient** | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
-| **Diagnose** | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
-| **Propose** | Write new harness. Prefer additive changes after regressions. |
-| **Document** | Write `proposal.md` with evidence. Update history. |
-
-**7 rules:** evidence-based changes, conservative after regression, don't repeat mistakes, one hypothesis at a time, maintain interface, prefer readability, use available API keys from environment.
-
-## Integrations
+## The Harness Contract
 
-
+A harness is **any executable**:
 
 ```bash
-
-uv tool install langsmith-cli && langsmith-cli auth login
+python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
 ```
 
-
-- Sets `LANGCHAIN_TRACING_V2=true` automatically — all LLM calls are traced
-- The proposer queries traces directly via `langsmith-cli`:
+Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
 
-
-langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
-langsmith-cli --json runs stats --project harness-evolver-v003
-```
+---
 
-
+## Evolution Loop
 
-```
-
+```
+/harness-evolver:evolve
+  │
+  ├─ 1. Gather LangSmith traces (processed into readable format)
+  ├─ 2. Spawn 5 proposers in parallel (exploit/explore/crossover/prompt/retrieval)
+  ├─ 3. Validate all candidates
+  ├─ 4. Evaluate all candidates
+  ├─ 4.5 Judge (if using LLM-as-judge eval)
+  ├─ 5. Select winner (highest combined_score)
+  ├─ 6. Report results
+  ├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
+  ├─ 7. Auto-trigger Architect (if regression or stagnation)
+  └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
 ```
 
-
+---
 
-##
+## API Keys
+
+Set in your shell before launching Claude Code:
 
 ```bash
-#
-
+export GEMINI_API_KEY="AIza..."          # Gemini-based harnesses
+export ANTHROPIC_API_KEY="sk-ant-..."    # Claude-based harnesses
+export OPENAI_API_KEY="sk-..."           # OpenAI-based harnesses
+export OPENROUTER_API_KEY="sk-or-..."    # Multi-model via OpenRouter
+export LANGSMITH_API_KEY="lsv2_pt_..."   # Auto-enables LangSmith tracing
+```
 
-
-python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json
+The plugin auto-detects available keys. No key needed for the included example.
 
-
-node bin/install.js
-```
+---
 
 ## Comparison
 
@@ -230,17 +175,23 @@ node bin/install.js
 |---|---|---|---|---|
 | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
 | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
-| **
-| **
+| **Candidates/iter** | 1 | 1 | N/A | **5 parallel** |
+| **Auto-critique** | No | No | No | **Yes (critic + judge)** |
+| **Architecture** | Fixed | Fixed | N/A | **Auto-recommended** |
 | **LangSmith** | No | No | No | **Yes** |
 | **Context7** | No | No | No | **Yes** |
+| **Zero-config** | No | No | No | **Yes** |
+
+---
 
 ## References
 
-- [Meta-Harness
-- [
-- [
-- [
+- [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
+- [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI (parallel evolution architecture)
+- [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind (population-based code evolution)
+- [Agent Skills Specification](https://agentskills.io) — Open standard for AI agent skills
+
+---
 
 ## License
 
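The harness contract in the new README (`python3 harness.py --input task.json --output result.json`) is simple enough to sketch. The keyword table and classification logic below are invented for illustration, in the spirit of the bundled mock classifier; only the I/O shapes — `{id, input, metadata}` in, `{id, output}` out — come from the README.

```python
import json

# Hypothetical keyword table — a stand-in for real harness logic (no LLM call).
DEFAULT_CONFIG = {"keywords": {"refund": "billing", "crash": "bug"}}

def run_harness(task, config=None):
    """Input shape: {id, input, metadata} (never sees expected answers).
    Output shape: {id, output}."""
    config = config or DEFAULT_CONFIG
    text = str(task["input"]).lower()
    # First matching keyword wins; fall back to a catch-all label.
    label = next((lbl for kw, lbl in config["keywords"].items() if kw in text), "other")
    return {"id": task["id"], "output": label}

def main(input_path, output_path, config_path=None):
    """File-based wrapper matching `--input task.json --output result.json`."""
    with open(input_path) as f:
        task = json.load(f)
    config = json.load(open(config_path)) if config_path else None
    with open(output_path, "w") as f:
        json.dump(run_harness(task, config), f)

result = run_harness({"id": "task_001", "input": "App crash on login", "metadata": {}})
```

Because the contract is just flags plus JSON shapes, an evolved candidate can swap out everything inside `run_harness` while staying evaluable.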
package/agents/harness-evolver-proposer.md
CHANGED

@@ -15,12 +15,15 @@ every file listed there before performing any other actions. These files are you
 
 ## Strategy Injection
 
-Your prompt
-
-- **
-- **
+Your prompt contains a `<strategy>` block defining your approach. Follow it:
+
+- **exploitation**: Conservative fix on current best. Small, targeted changes.
+- **exploration**: Bold, fundamentally different approach. High risk, high reward.
+- **crossover**: Combine strengths from two parent versions.
+- **failure-targeted**: Fix SPECIFIC failing tasks listed in the strategy. Read their traces, understand the root cause, fix that capability. You are free to change ANYTHING needed.
+- **creative**: Try something unexpected — different algorithms, architecture, libraries.
+- **efficiency**: Same quality but fewer tokens, faster, simpler code.
 
-Follow the strategy. It determines your risk tolerance and parent selection.
 If no strategy block is present, default to exploitation (conservative improvement).
 
 ## Context7 — Enrich Your Knowledge
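The `<strategy>` injection contract above is mechanical enough to sketch. The helper below is hypothetical (not part of the plugin); the six approach names and the documented fallback to exploitation come from the diff.

```python
import re

# Approach names from the proposer's Strategy Injection section.
VALID = {"exploitation", "exploration", "crossover",
         "failure-targeted", "creative", "efficiency"}

def extract_strategy(prompt):
    """Read APPROACH from a <strategy> block; default to exploitation."""
    block = re.search(r"<strategy>(.*?)</strategy>", prompt, re.DOTALL)
    if not block:
        return "exploitation"  # documented default when no block is present
    approach = re.search(r"APPROACH:\s*([\w-]+)", block.group(1))
    return (approach.group(1) if approach and approach.group(1) in VALID
            else "exploitation")

s = extract_strategy("<strategy>\nAPPROACH: failure-targeted\nFix tasks...\n</strategy>")
```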
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -132,6 +132,58 @@ The resulting `langsmith_runs.json` has clean, readable entries:
 
 These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
 
+### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
+
+Before spawning proposers, analyze which tasks are failing and cluster them:
+
+```bash
+python3 -c "
+import json, os, sys
+
+# Find best version scores
+summary = json.load(open('.harness-evolver/summary.json'))
+best = summary['best']['version']
+scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
+if not os.path.exists(scores_path):
+    scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
+
+if not scores_path or not os.path.exists(scores_path):
+    print('NO_SCORES')
+    sys.exit(0)
+
+scores = json.load(open(scores_path))
+tasks_dir = '.harness-evolver/eval/tasks/'
+failures = {}
+
+for tid, tdata in scores.get('per_task', {}).items():
+    score = tdata.get('score', 0)
+    if score < 0.7:
+        tfile = os.path.join(tasks_dir, tid + '.json')
+        cat = 'unknown'
+        if os.path.exists(tfile):
+            task = json.load(open(tfile))
+            meta = task.get('metadata', {})
+            cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
+        failures.setdefault(cat, []).append({'id': tid, 'score': score})
+
+if not failures:
+    print('ALL_PASSING')
+else:
+    sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
+    for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
+        task_ids = [t['id'] for t in tasks]
+        avg_score = sum(t['score'] for t in tasks) / len(tasks)
+        print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
+" 2>/dev/null
+```
+
+Parse the output:
+- If `NO_SCORES` or `ALL_PASSING`: D gets "creative" brief, E gets "efficiency" brief
+- If clusters found: D targets cluster 1, E targets cluster 2
+- If only 1 cluster: D targets it, E gets "creative" brief
+
+Save clusters for use in step 2.
+
 ### 2. Propose (3 parallel candidates)
 
 Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
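The step-1.8 script prints either `NO_SCORES`, `ALL_PASSING`, or `CLUSTER_{i}|{category}|{task_ids_json}|{avg}` lines. The briefing rules above could be applied mechanically; `parse_clusters` is a hypothetical helper sketching that, not part of the skill.

```python
import json

def parse_clusters(output):
    """Turn step-1.8 stdout into strategy briefs for candidates D and E."""
    lines = [l for l in output.strip().splitlines() if l.strip()]
    if lines in (["NO_SCORES"], ["ALL_PASSING"]):
        # No failures to target: D goes creative, E goes efficiency.
        return {"D": {"approach": "creative"}, "E": {"approach": "efficiency"}}
    clusters = []
    for line in lines:
        if line.startswith("CLUSTER_"):
            _, cat, ids, avg = line.split("|")
            clusters.append({"category": cat, "task_ids": json.loads(ids),
                             "avg": float(avg)})
    if not clusters:  # unexpected output: fall back to the no-failure briefs
        return {"D": {"approach": "creative"}, "E": {"approach": "efficiency"}}
    briefs = {"D": {"approach": "failure-targeted", **clusters[0]}}
    # Second cluster goes to E; with only one cluster, E goes creative.
    briefs["E"] = ({"approach": "failure-targeted", **clusters[1]}
                   if len(clusters) > 1 else {"approach": "creative"})
    return briefs

briefs = parse_clusters('CLUSTER_1|format|["task_003", "task_007"]|0.40')
```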
@@ -140,7 +192,10 @@ This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
 
 Determine parents for each strategy:
 - **Exploiter parent**: current best version (from summary.json `best.version`)
 - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
-- **Crossover parents**:
+- **Crossover parents**:
+  - Parent A = current best version
+  - Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
+    If no champion file exists, fall back to a non-best version from the archive.
 
 Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
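The parent-selection rules above can be sketched as follows. `best.version` and the per-task champion file come from the diff; the `history`, `score`, and `offspring` fields on summary.json entries are assumptions about its schema made for illustration.

```python
def pick_parents(summary, champion=None):
    """Exploit = best; explore = non-best, scored >0, fewest offspring;
    crossover = (best, previous per-task champion or a fallback)."""
    best = summary["best"]["version"]
    others = [v for v in summary["history"]
              if v["version"] != best and v["score"] > 0]
    explorer = (min(others, key=lambda v: v.get("offspring", 0))["version"]
                if others else best)
    # Champion file missing -> fall back to a non-best archive version.
    cross_b = champion["version"] if champion else explorer
    return {"exploit": best, "explore": explorer, "crossover": (best, cross_b)}

summary = {"best": {"version": "v003"},
           "history": [{"version": "v001", "score": 0.5, "offspring": 2},
                       {"version": "v002", "score": 0.6, "offspring": 0},
                       {"version": "v003", "score": 0.8, "offspring": 1}]}
parents = pick_parents(summary, {"version": "v003d"})
```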
@@ -262,35 +317,131 @@ Agent(
 
 **Also spawn these additional candidates:**
 
-**Candidate D (
-
+**Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
+
+If failure clusters were found in step 1.8:
 ```
-
-Output to: `.harness-evolver/harnesses/{version}d/`
 
-
+Agent(
+  subagent_type: "harness-evolver-proposer",
+  description: "Proposer D: fix {cluster_1_category} failures",
+  run_in_background: true,
+  prompt: |
+    <strategy>
+    APPROACH: failure-targeted
+    Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
+    They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
+    Read the traces of these specific tasks to understand WHY they fail.
+    Your changes should improve these tasks WITHOUT regressing others.
+    You are free to change anything — prompts, code, retrieval, architecture —
+    whatever is needed to fix THIS specific failure mode.
+    </strategy>
+
+    <objective>
+    Propose harness version {version}d targeting {cluster_1_category} failures.
+    </objective>
+
+    <files_to_read>
+    - .harness-evolver/summary.json
+    - .harness-evolver/PROPOSER_HISTORY.md
+    - .harness-evolver/config.json
+    - .harness-evolver/harnesses/{best_version}/harness.py
+    - .harness-evolver/harnesses/{best_version}/scores.json
+    - .harness-evolver/langsmith_runs.json (if exists)
+    - .harness-evolver/architecture.json (if exists)
+    </files_to_read>
+
+    <output>
+    Create directory .harness-evolver/harnesses/{version}d/ containing:
+    - harness.py, config.json, proposal.md
+    </output>
+)
+If ALL_PASSING (no failures):
 ```
-
-
-
-
-
-
-
+Agent(
+  subagent_type: "harness-evolver-proposer",
+  description: "Proposer D: creative approach",
+  run_in_background: true,
+  prompt: |
+    <strategy>
+    APPROACH: creative
+    All tasks are scoring well. Try something UNEXPECTED:
+    - Different algorithm or library
+    - Completely different prompt architecture
+    - Novel error handling or output validation
+    - Something no one would think of
+    The goal is to discover improvements that incremental fixes would miss.
+    </strategy>
+    ...same files_to_read and output as above...
+)
+```
+
+**Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
+
+If a second failure cluster exists:
+```
+Agent(
+  subagent_type: "harness-evolver-proposer",
+  description: "Proposer E: fix {cluster_2_category} failures",
+  run_in_background: true,
+  prompt: |
+    <strategy>
+    APPROACH: failure-targeted
+    Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
+    They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
+    Read the traces of these specific tasks to understand WHY they fail.
+    Your changes should improve these tasks WITHOUT regressing others.
+    You are free to change anything — prompts, code, retrieval, architecture —
+    whatever is needed to fix THIS specific failure mode.
+    </strategy>
+
+    <objective>
+    Propose harness version {version}e targeting {cluster_2_category} failures.
+    </objective>
+
+    <files_to_read>
+    - .harness-evolver/summary.json
+    - .harness-evolver/PROPOSER_HISTORY.md
+    - .harness-evolver/config.json
+    - .harness-evolver/harnesses/{best_version}/harness.py
+    - .harness-evolver/harnesses/{best_version}/scores.json
+    - .harness-evolver/langsmith_runs.json (if exists)
+    - .harness-evolver/architecture.json (if exists)
+    </files_to_read>
+
+    <output>
+    Create directory .harness-evolver/harnesses/{version}e/ containing:
+    - harness.py, config.json, proposal.md
+    </output>
+)
+```
+
+If no second cluster (or ALL_PASSING):
+```
+Agent(
+  subagent_type: "harness-evolver-proposer",
+  description: "Proposer E: efficiency optimization",
+  run_in_background: true,
+  prompt: |
+    <strategy>
+    APPROACH: efficiency
+    Maintain the current quality but optimize for:
+    - Fewer LLM tokens (shorter prompts, less context)
+    - Faster execution (reduce unnecessary steps)
+    - Simpler code (remove redundant logic)
+    - Better error handling (graceful degradation)
+    Do NOT sacrifice accuracy for speed — same quality, less cost.
+    </strategy>
+    ...same files_to_read and output as above...
+)
 ```
-Output to: `.harness-evolver/harnesses/{version}e/`
 
 Wait for all 5 to complete. The background agents will notify when done.
 
 **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses baseline as both parents but with instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.
 
-**On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating,
+**On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, step 1.8 will naturally shift D and E toward failure-targeted or creative strategies based on actual task performance.
 
 ### 3. Validate All Candidates
 
@@ -345,16 +496,67 @@ Wait for `## JUDGE COMPLETE`.
 
 If eval_type is NOT "pending-judge", the eval.py already produced real scores — skip this step.
 
-### 5. Select Winner +
+### 5. Select Winner + Track Per-Task Champions
+
+**5a. Find overall winner (highest combined_score):**
+
+Compare all evaluated candidates. The winner is the one with highest combined_score.
+
+**5b. Find per-task champion (candidate that beats the winner on most individual tasks):**
+
+```bash
+python3 -c "
+import json, os
+
+version = '{version}'
+candidates = {}
+for suffix in ['a', 'b', 'c', 'd', 'e']:
+    path = f'.harness-evolver/harnesses/{version}{suffix}/scores.json'
+    if os.path.exists(path):
+        candidates[suffix] = json.load(open(path))
+
+if not candidates:
+    print('NO_CANDIDATES')
+    exit()
+
+# Overall winner
+winner_suffix = max(candidates, key=lambda s: candidates[s].get('combined_score', 0))
+winner_score = candidates[winner_suffix]['combined_score']
+print(f'WINNER: {winner_suffix} (score: {winner_score:.3f})')
+
+# Per-task champion: which NON-WINNER candidate beats the winner on the most tasks?
+task_wins = {}
+winner_tasks = candidates[winner_suffix].get('per_task', {})
+for suffix, data in candidates.items():
+    if suffix == winner_suffix:
+        continue
+    wins = 0
+    for tid, tdata in data.get('per_task', {}).items():
+        winner_task_score = winner_tasks.get(tid, {}).get('score', 0)
+        if tdata.get('score', 0) > winner_task_score:
+            wins += 1
+    if wins > 0:
+        task_wins[suffix] = wins
+
+if task_wins:
+    champion_suffix = max(task_wins, key=task_wins.get)
+    print(f'PER_TASK_CHAMPION: {champion_suffix} (beats winner on {task_wins[champion_suffix]} tasks)')
+    # Save champion info for next iteration's crossover parent
+    with open('.harness-evolver/per_task_champion.json', 'w') as f:
+        json.dump({'suffix': champion_suffix, 'version': f'{version}{champion_suffix}', 'task_wins': task_wins[champion_suffix]}, f)
+else:
+    print('NO_CHAMPION: winner dominates all tasks')
+" 2>/dev/null
+```
 
-
+**5c. Promote winner and report ALL candidates:**
 
-Rename
+Rename winner directory to official version:
 ```bash
 mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
 ```
 
-Update state
+Update state:
 ```bash
 python3 $TOOLS/state.py update \
   --base-dir .harness-evolver \
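The step-5b champion rule can be exercised on in-memory fixtures. The two candidates below are invented; the `combined_score` / `per_task.{task}.score` shape follows the scores.json fields used in step 5b.

```python
def per_task_champion(candidates, winner):
    """Which non-winner beats the winner on the most individual tasks?"""
    winner_tasks = candidates[winner]["per_task"]
    wins = {}
    for suffix, data in candidates.items():
        if suffix == winner:
            continue
        n = sum(1 for tid, t in data["per_task"].items()
                if t["score"] > winner_tasks.get(tid, {}).get("score", 0))
        if n:
            wins[suffix] = n
    # None means the winner dominates every task.
    return max(wins, key=wins.get) if wins else None

cands = {
    "a": {"combined_score": 0.9, "per_task": {"t1": {"score": 1.0}, "t2": {"score": 0.5}}},
    "b": {"combined_score": 0.8, "per_task": {"t1": {"score": 0.6}, "t2": {"score": 0.9}}},
}
winner = max(cands, key=lambda s: cands[s]["combined_score"])  # overall winner
champ = per_task_champion(cands, winner)  # "b" beats "a" on t2 only
```

Keeping such a champion as next iteration's crossover parent is what lets a losing candidate's per-task strengths survive into the archive.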
@@ -363,13 +565,17 @@ python3 $TOOLS/state.py update \
   --proposal .harness-evolver/harnesses/{version}/proposal.md
 ```
 
-Report ALL candidates:
+Report ALL candidates with their scores and strategies:
 ```
-Iteration {i}/{N} —
-{version}a (exploit):
-{version}b (explore):
-{version}c (
-
+Iteration {i}/{N} — {num_candidates} candidates evaluated:
+{version}a (exploit):   {score_a} — {summary}
+{version}b (explore):   {score_b} — {summary}
+{version}c (crossover): {score_c} — {summary}
+{version}d ({strategy_d}): {score_d} — {summary}
+{version}e ({strategy_e}): {score_e} — {summary}
+
+Winner: {version}{suffix} ({score})
+Per-task champion: {champion_suffix} (beats winner on {N} tasks) — saved for next crossover
 ```
 
 Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
|