harness-evolver 2.9.0 → 3.0.0
This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the versions as they appear in their respective public registries.
- package/README.md +62 -117
- package/agents/evolver-architect.md +53 -0
- package/agents/evolver-critic.md +44 -0
- package/agents/evolver-proposer.md +128 -0
- package/agents/evolver-testgen.md +67 -0
- package/bin/install.js +181 -171
- package/package.json +7 -7
- package/skills/deploy/SKILL.md +49 -56
- package/skills/evolve/SKILL.md +180 -700
- package/skills/setup/SKILL.md +182 -0
- package/skills/status/SKILL.md +23 -21
- package/tools/read_results.py +240 -0
- package/tools/run_eval.py +202 -0
- package/tools/seed_from_traces.py +36 -8
- package/tools/setup.py +393 -0
- package/tools/trace_insights.py +86 -14
- package/agents/harness-evolver-architect.md +0 -173
- package/agents/harness-evolver-critic.md +0 -132
- package/agents/harness-evolver-judge.md +0 -110
- package/agents/harness-evolver-proposer.md +0 -317
- package/agents/harness-evolver-testgen.md +0 -112
- package/examples/classifier/README.md +0 -25
- package/examples/classifier/config.json +0 -3
- package/examples/classifier/eval.py +0 -58
- package/examples/classifier/harness.py +0 -111
- package/examples/classifier/tasks/task_001.json +0 -1
- package/examples/classifier/tasks/task_002.json +0 -1
- package/examples/classifier/tasks/task_003.json +0 -1
- package/examples/classifier/tasks/task_004.json +0 -1
- package/examples/classifier/tasks/task_005.json +0 -1
- package/examples/classifier/tasks/task_006.json +0 -1
- package/examples/classifier/tasks/task_007.json +0 -1
- package/examples/classifier/tasks/task_008.json +0 -1
- package/examples/classifier/tasks/task_009.json +0 -1
- package/examples/classifier/tasks/task_010.json +0 -1
- package/skills/architect/SKILL.md +0 -93
- package/skills/compare/SKILL.md +0 -73
- package/skills/critic/SKILL.md +0 -67
- package/skills/diagnose/SKILL.md +0 -96
- package/skills/import-traces/SKILL.md +0 -102
- package/skills/init/SKILL.md +0 -253
- package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
- package/tools/eval_llm_judge.py +0 -233
- package/tools/eval_passthrough.py +0 -55
- package/tools/evaluate.py +0 -255
- package/tools/import_traces.py +0 -229
- package/tools/init.py +0 -531
- package/tools/llm_api.py +0 -125
- package/tools/state.py +0 -219
- package/tools/test_growth.py +0 -230
- package/tools/trace_logger.py +0 -42
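The 3.0.0 tools replace the `.harness-evolver/config.json` + `summary.json` pair with a single top-level `.evolver.json`. The bookkeeping the new evolve loop performs against it can be sketched as below; the key names (`best_experiment`, `best_score`, `iterations`) are taken from the `python3 -c` one-liners in the SKILL.md diff, while the helper function names and the full schema written by `evolver:setup` are assumptions for illustration.

```python
# Sketch of the .evolver.json bookkeeping in the 3.0.0 evolve loop.
# Keys mirror the diff's one-liners; helpers are hypothetical names.
config = {
    "best_experiment": "v003c",  # example value, not from the diff
    "best_score": 0.842,
    "iterations": 3,
}

def next_version(cfg: dict) -> str:
    """'Get Next Version' step: v{iterations+1:03d}."""
    return f"v{cfg['iterations'] + 1:03d}"

def should_stop(cfg: dict, target_score) -> bool:
    """Early-stop check from the interactive 'Target' question."""
    return target_score is not None and cfg["best_score"] >= target_score

print(next_version(config))      # -> v004
print(should_stop(config, 0.8))  # -> True (0.842 >= 0.8)
print(should_stop(config, None)) # -> False ("No limit" option)
```

Each of the five candidate experiments (`v004a` … `v004e`) is then named from this version tag plus a strategy suffix.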
package/skills/evolve/SKILL.md
CHANGED
|
@@ -1,843 +1,323 @@
|
|
|
1
1
|
---
|
|
2
|
-
name:
|
|
3
|
-
description: "Use when the user wants to run the optimization loop, improve
|
|
2
|
+
name: evolver:evolve
|
|
3
|
+
description: "Use when the user wants to run the optimization loop, improve agent performance, evolve the agent, or iterate on quality. Requires .evolver.json to exist (run evolver:setup first)."
|
|
4
4
|
argument-hint: "[--iterations N]"
|
|
5
5
|
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
|
|
6
6
|
---
|
|
7
7
|
|
|
8
|
-
# /
|
|
8
|
+
# /evolver:evolve
|
|
9
9
|
|
|
10
|
-
Run the autonomous propose-evaluate-iterate loop.
|
|
10
|
+
Run the autonomous propose-evaluate-iterate loop using LangSmith as the evaluation backend and git worktrees for isolation.
|
|
11
11
|
|
|
12
12
|
## Prerequisites
|
|
13
13
|
|
|
14
|
-
`.
|
|
14
|
+
`.evolver.json` must exist. If not, tell user to run `evolver:setup`.
|
|
15
15
|
|
|
16
16
|
## Resolve Tool Path
|
|
17
17
|
|
|
18
18
|
```bash
|
|
19
|
-
TOOLS=$([ -d ".
|
|
19
|
+
TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")
|
|
20
20
|
```
|
|
21
21
|
|
|
22
22
|
## Parse Arguments
|
|
23
23
|
|
|
24
|
-
- `--iterations N` (default:
|
|
25
|
-
- Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
|
|
24
|
+
- `--iterations N` (default: from interactive question or 5)
|
|
26
25
|
|
|
27
26
|
## Pre-Loop: Interactive Configuration
|
|
28
27
|
|
|
29
|
-
If no `--iterations` argument was provided, ask the user
|
|
28
|
+
If no `--iterations` argument was provided, ask the user:
|
|
30
29
|
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
30
|
+
```json
|
|
31
|
+
{
|
|
32
|
+
"questions": [
|
|
33
|
+
{
|
|
34
|
+
"question": "How many evolution iterations?",
|
|
35
|
+
"header": "Iterations",
|
|
36
|
+
"multiSelect": false,
|
|
37
|
+
"options": [
|
|
38
|
+
{"label": "3 (quick)", "description": "Fast exploration, good for testing. ~15 min."},
|
|
39
|
+
{"label": "5 (balanced)", "description": "Good trade-off between speed and quality. ~30 min."},
|
|
40
|
+
{"label": "10 (thorough)", "description": "Deep optimization with adaptive strategies. ~1 hour."}
|
|
41
|
+
]
|
|
42
|
+
},
|
|
43
|
+
{
|
|
44
|
+
"question": "Stop early if score reaches?",
|
|
45
|
+
"header": "Target",
|
|
46
|
+
"multiSelect": false,
|
|
47
|
+
"options": [
|
|
48
|
+
{"label": "0.8 (good enough)", "description": "Stop when the agent is reasonably good"},
|
|
49
|
+
{"label": "0.9 (high quality)", "description": "Stop when quality is high"},
|
|
50
|
+
{"label": "0.95 (near perfect)", "description": "Push for near-perfect scores"},
|
|
51
|
+
{"label": "No limit", "description": "Run all iterations regardless of score"}
|
|
52
|
+
]
|
|
53
|
+
}
|
|
54
|
+
]
|
|
55
|
+
}
|
|
48
56
|
```
|
|
49
57
|
|
|
50
|
-
Apply the answers:
|
|
51
|
-
- Set iterations from question 1 (3, 5, or 10)
|
|
52
|
-
- Set target_score from question 2 (0.8, 0.9, 0.95, or None)
|
|
53
|
-
|
|
54
|
-
If `--iterations` WAS provided as argument, skip these questions and use the argument value.
|
|
55
|
-
|
|
56
58
|
## The Loop
|
|
57
59
|
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
### 1. Get Next Version
|
|
61
|
-
|
|
60
|
+
Read config:
|
|
62
61
|
```bash
|
|
63
|
-
python3 -c "import json;
|
|
62
|
+
python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Best: {c[\"best_experiment\"]} ({c[\"best_score\"]:.3f}), Iterations: {c[\"iterations\"]}')"
|
|
64
63
|
```
|
|
65
64
|
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
|
|
69
|
-
|
|
70
|
-
```bash
|
|
71
|
-
PROD_PROJECT=$(python3 -c "
|
|
72
|
-
import json, os
|
|
73
|
-
c = json.load(open('.harness-evolver/config.json'))
|
|
74
|
-
print(c.get('eval', {}).get('production_project', ''))
|
|
75
|
-
" 2>/dev/null)
|
|
76
|
-
if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
|
|
77
|
-
python3 $TOOLS/seed_from_traces.py \
|
|
78
|
-
--project "$PROD_PROJECT" \
|
|
79
|
-
--output-md .harness-evolver/production_seed.md \
|
|
80
|
-
--output-json .harness-evolver/production_seed.json \
|
|
81
|
-
--limit 100 2>/dev/null
|
|
82
|
-
fi
|
|
83
|
-
```
|
|
84
|
-
|
|
85
|
-
The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
|
|
86
|
-
|
|
87
|
-
### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
|
|
88
|
-
|
|
89
|
-
**Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
|
|
90
|
-
|
|
91
|
-
**Step 1: Find the actual LangSmith project name**
|
|
92
|
-
|
|
93
|
-
```bash
|
|
94
|
-
langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
|
|
95
|
-
```
|
|
96
|
-
|
|
97
|
-
This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
|
|
98
|
-
|
|
99
|
-
```bash
|
|
100
|
-
LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
|
|
101
|
-
```
|
|
102
|
-
|
|
103
|
-
If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
|
|
65
|
+
For each iteration:
|
|
104
66
|
|
|
105
|
-
|
|
67
|
+
### 1. Get Next Version
|
|
106
68
|
|
|
107
69
|
```bash
|
|
108
|
-
|
|
109
|
-
langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
|
|
110
|
-
langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
|
|
111
|
-
echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
|
|
112
|
-
else
|
|
113
|
-
echo "[]" > /tmp/langsmith_raw.json
|
|
114
|
-
echo "{}" > .harness-evolver/langsmith_stats.json
|
|
115
|
-
fi
|
|
70
|
+
python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"iterations\"]+1:03d}')"
|
|
116
71
|
```
|
|
117
72
|
|
|
118
|
-
|
|
73
|
+
### 1.5. Gather Trace Insights
|
|
119
74
|
|
|
120
|
-
|
|
75
|
+
Run trace insights from the best experiment:
|
|
121
76
|
|
|
122
77
|
```bash
|
|
123
|
-
python3 -c "
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
if not raw:
|
|
128
|
-
json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
|
|
129
|
-
sys.exit(0)
|
|
130
|
-
|
|
131
|
-
clean = []
|
|
132
|
-
for r in raw:
|
|
133
|
-
entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
|
|
134
|
-
|
|
135
|
-
# Extract readable prompt from LangChain serialized inputs
|
|
136
|
-
inputs = r.get('inputs', {})
|
|
137
|
-
if isinstance(inputs, dict) and 'messages' in inputs:
|
|
138
|
-
msgs = inputs['messages']
|
|
139
|
-
for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
|
|
140
|
-
for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
|
|
141
|
-
if isinstance(msg, dict):
|
|
142
|
-
kwargs = msg.get('kwargs', msg)
|
|
143
|
-
content = kwargs.get('content', '')
|
|
144
|
-
msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
|
|
145
|
-
if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
|
|
146
|
-
entry['user_message'] = str(content)[:300]
|
|
147
|
-
elif 'System' in str(msg_type):
|
|
148
|
-
entry['system_prompt_preview'] = str(content)[:200]
|
|
149
|
-
|
|
150
|
-
# Extract readable output
|
|
151
|
-
outputs = r.get('outputs', {})
|
|
152
|
-
if isinstance(outputs, dict) and 'generations' in outputs:
|
|
153
|
-
gens = outputs['generations']
|
|
154
|
-
if gens and isinstance(gens, list) and gens[0]:
|
|
155
|
-
gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
|
|
156
|
-
if isinstance(gen, dict):
|
|
157
|
-
msg = gen.get('message', gen)
|
|
158
|
-
if isinstance(msg, dict):
|
|
159
|
-
kwargs = msg.get('kwargs', msg)
|
|
160
|
-
entry['llm_response'] = str(kwargs.get('content', ''))[:300]
|
|
161
|
-
|
|
162
|
-
clean.append(entry)
|
|
163
|
-
|
|
164
|
-
json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
|
|
165
|
-
print(f'Processed {len(clean)} LangSmith runs into readable format')
|
|
166
|
-
" 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
|
|
167
|
-
```
|
|
168
|
-
|
|
169
|
-
The resulting `langsmith_runs.json` has clean, readable entries:
|
|
170
|
-
```json
|
|
171
|
-
[
|
|
172
|
-
{
|
|
173
|
-
"name": "ChatGoogleGenerativeAI",
|
|
174
|
-
"tokens": 1332,
|
|
175
|
-
"error": null,
|
|
176
|
-
"user_message": "Analise este texto: Bom dia pessoal...",
|
|
177
|
-
"system_prompt_preview": "Você é um moderador de conteúdo...",
|
|
178
|
-
"llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
|
|
179
|
-
}
|
|
180
|
-
]
|
|
78
|
+
BEST=$(python3 -c "import json; print(json.load(open('.evolver.json'))['best_experiment'])")
|
|
79
|
+
python3 $TOOLS/trace_insights.py \
|
|
80
|
+
--from-experiment "$BEST" \
|
|
81
|
+
--output trace_insights.json 2>/dev/null
|
|
181
82
|
```
|
|
182
83
|
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
### 1.6. Generate Trace Insights (systematic analysis)
|
|
186
|
-
|
|
187
|
-
If LangSmith traces were gathered, run systematic analysis to cluster errors, analyze token usage, and cross-reference with scores:
|
|
84
|
+
If a production project is configured, also gather production insights:
|
|
188
85
|
|
|
189
86
|
```bash
|
|
190
|
-
|
|
191
|
-
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
--
|
|
196
|
-
--
|
|
197
|
-
--scores "$SCORES_PATH" \
|
|
198
|
-
--tasks-dir .harness-evolver/eval/tasks/ \
|
|
199
|
-
--output .harness-evolver/trace_insights.json 2>/dev/null
|
|
87
|
+
PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
|
|
88
|
+
if [ -n "$PROD" ] && [ ! -f "production_seed.json" ]; then
|
|
89
|
+
python3 $TOOLS/seed_from_traces.py \
|
|
90
|
+
--project "$PROD" --use-sdk \
|
|
91
|
+
--output-md production_seed.md \
|
|
92
|
+
--output-json production_seed.json \
|
|
93
|
+
--limit 100 2>/dev/null
|
|
200
94
|
fi
|
|
201
95
|
```
|
|
202
96
|
|
|
203
|
-
|
|
204
|
-
- `error_clusters`: grouped error patterns with counts
|
|
205
|
-
- `token_analysis`: score distribution by token usage bucket (low/medium/high)
|
|
206
|
-
- `hypotheses`: data-driven theories about failure causes
|
|
207
|
-
- `top_issues`: highest-impact problems sorted by severity
|
|
208
|
-
|
|
209
|
-
This file is included in all proposers' `<files_to_read>` so they have structured diagnostic data.
|
|
97
|
+
### 1.8. Analyze Per-Task Failures
|
|
210
98
|
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
Before spawning proposers, analyze which tasks are failing and cluster them:
|
|
99
|
+
Read the best experiment results and cluster failures:
|
|
214
100
|
|
|
215
101
|
```bash
|
|
216
|
-
python3
|
|
217
|
-
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
summary = json.load(open('.harness-evolver/summary.json'))
|
|
221
|
-
best = summary['best']['version']
|
|
222
|
-
scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
|
|
223
|
-
if not os.path.exists(scores_path):
|
|
224
|
-
scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
|
|
225
|
-
|
|
226
|
-
if not scores_path or not os.path.exists(scores_path):
|
|
227
|
-
print('NO_SCORES')
|
|
228
|
-
sys.exit(0)
|
|
229
|
-
|
|
230
|
-
scores = json.load(open(scores_path))
|
|
231
|
-
tasks_dir = '.harness-evolver/eval/tasks/'
|
|
232
|
-
failures = {}
|
|
233
|
-
|
|
234
|
-
for tid, tdata in scores.get('per_task', {}).items():
|
|
235
|
-
score = tdata.get('score', 0)
|
|
236
|
-
if score < 0.7:
|
|
237
|
-
tfile = os.path.join(tasks_dir, tid + '.json')
|
|
238
|
-
cat = 'unknown'
|
|
239
|
-
if os.path.exists(tfile):
|
|
240
|
-
task = json.load(open(tfile))
|
|
241
|
-
meta = task.get('metadata', {})
|
|
242
|
-
cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
|
|
243
|
-
failures.setdefault(cat, []).append({'id': tid, 'score': score})
|
|
244
|
-
|
|
245
|
-
if not failures:
|
|
246
|
-
print('ALL_PASSING')
|
|
247
|
-
else:
|
|
248
|
-
sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
|
|
249
|
-
for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
|
|
250
|
-
task_ids = [t['id'] for t in tasks]
|
|
251
|
-
avg_score = sum(t['score'] for t in tasks) / len(tasks)
|
|
252
|
-
print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
|
|
253
|
-
" 2>/dev/null
|
|
102
|
+
python3 $TOOLS/read_results.py \
|
|
103
|
+
--experiment "$BEST" \
|
|
104
|
+
--config .evolver.json \
|
|
105
|
+
--output best_results.json 2>/dev/null
|
|
254
106
|
```
|
|
255
107
|
|
|
256
|
-
Parse
|
|
257
|
-
|
|
258
|
-
- If clusters found: D targets cluster 1, E targets cluster 2
|
|
259
|
-
- If only 1 cluster: D targets it, E gets "creative" brief
|
|
260
|
-
|
|
261
|
-
Save clusters for use in step 2.
|
|
262
|
-
|
|
263
|
-
### 2. Propose (3 parallel candidates)
|
|
264
|
-
|
|
265
|
-
Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
|
|
266
|
-
This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
|
|
267
|
-
|
|
268
|
-
Determine parents for each strategy:
|
|
269
|
-
- **Exploiter parent**: current best version (from summary.json `best.version`)
|
|
270
|
-
- **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
|
|
271
|
-
- **Crossover parents**:
|
|
272
|
-
- Parent A = current best version
|
|
273
|
-
- Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
|
|
274
|
-
If no champion file exists, fall back to a non-best version from the archive.
|
|
275
|
-
|
|
276
|
-
Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
|
|
277
|
-
|
|
278
|
-
**Candidate A (Exploiter)** — `run_in_background: true`:
|
|
279
|
-
```
|
|
280
|
-
Agent(
|
|
281
|
-
subagent_type: "harness-evolver-proposer",
|
|
282
|
-
description: "Proposer A (exploit): targeted fix for {version}",
|
|
283
|
-
run_in_background: true,
|
|
284
|
-
prompt: |
|
|
285
|
-
<strategy>
|
|
286
|
-
APPROACH: exploitation
|
|
287
|
-
You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
|
|
288
|
-
the highest-impact failing tasks. Base your work on the current best version.
|
|
289
|
-
Do NOT restructure the code. Do NOT change the architecture.
|
|
290
|
-
Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
|
|
291
|
-
</strategy>
|
|
108
|
+
Parse `best_results.json` to find failing examples (score < 0.7). Group by metadata or error pattern.
|
|
109
|
+
Generate adaptive briefings for Candidates D and E (same logic as v2).
|
|
292
110
|
|
|
293
|
-
|
|
294
|
-
Propose harness version {version}a that improves on {best_score}.
|
|
295
|
-
</objective>
|
|
111
|
+
### 2. Spawn 5 Proposers in Parallel
|
|
296
112
|
|
|
297
|
-
|
|
298
|
-
- .harness-evolver/summary.json
|
|
299
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
300
|
-
- .harness-evolver/config.json
|
|
301
|
-
- .harness-evolver/harnesses/{best_version}/harness.py
|
|
302
|
-
- .harness-evolver/harnesses/{best_version}/scores.json
|
|
303
|
-
- .harness-evolver/harnesses/{best_version}/proposal.md
|
|
304
|
-
- .harness-evolver/langsmith_diagnosis.json (if exists)
|
|
305
|
-
- .harness-evolver/langsmith_stats.json (if exists)
|
|
306
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
307
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
308
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
309
|
-
- .harness-evolver/architecture.json (if exists)
|
|
310
|
-
</files_to_read>
|
|
113
|
+
Each proposer runs in a **git worktree** via Claude Code's native `isolation: "worktree"` parameter.
|
|
311
114
|
|
|
312
|
-
|
|
313
|
-
Create directory .harness-evolver/harnesses/{version}a/ containing:
|
|
314
|
-
- harness.py, config.json, proposal.md
|
|
315
|
-
</output>
|
|
316
|
-
)
|
|
317
|
-
```
|
|
115
|
+
**Candidate A (Exploit)** — `run_in_background: true`:
|
|
318
116
|
|
|
319
|
-
**Candidate B (Explorer)** — `run_in_background: true`:
|
|
320
117
|
```
|
|
321
118
|
Agent(
|
|
322
|
-
subagent_type: "
|
|
323
|
-
description: "Proposer
|
|
119
|
+
subagent_type: "evolver-proposer",
|
|
120
|
+
description: "Proposer A: exploit best version",
|
|
121
|
+
isolation: "worktree",
|
|
324
122
|
run_in_background: true,
|
|
325
123
|
prompt: |
|
|
326
|
-
<strategy>
|
|
327
|
-
APPROACH: exploration
|
|
328
|
-
You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
|
|
329
|
-
Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
|
|
330
|
-
Consider: different retrieval strategy, different prompt structure,
|
|
331
|
-
different output parsing, different error handling philosophy.
|
|
332
|
-
Be bold. A creative failure teaches more than a timid success.
|
|
333
|
-
</strategy>
|
|
334
|
-
|
|
335
124
|
<objective>
|
|
336
|
-
|
|
125
|
+
Improve the agent code to score higher on the evaluation dataset.
|
|
126
|
+
You are working in an isolated git worktree — modify any file freely.
|
|
337
127
|
</objective>
|
|
338
128
|
|
|
339
|
-
<files_to_read>
|
|
340
|
-
- .harness-evolver/summary.json
|
|
341
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
342
|
-
- .harness-evolver/config.json
|
|
343
|
-
- .harness-evolver/baseline/harness.py
|
|
344
|
-
- .harness-evolver/harnesses/{explorer_parent}/harness.py
|
|
345
|
-
- .harness-evolver/harnesses/{explorer_parent}/scores.json
|
|
346
|
-
- .harness-evolver/langsmith_diagnosis.json (if exists)
|
|
347
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
348
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
349
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
350
|
-
- .harness-evolver/architecture.json (if exists)
|
|
351
|
-
</files_to_read>
|
|
352
|
-
|
|
353
|
-
<output>
|
|
354
|
-
Create directory .harness-evolver/harnesses/{version}b/ containing:
|
|
355
|
-
- harness.py, config.json, proposal.md
|
|
356
|
-
</output>
|
|
357
|
-
)
|
|
358
|
-
```
|
|
359
|
-
|
|
360
|
-
**Candidate C (Crossover)** — blocks (last one):
|
|
361
|
-
```
|
|
362
|
-
Agent(
|
|
363
|
-
subagent_type: "harness-evolver-proposer",
|
|
364
|
-
description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
|
|
365
|
-
prompt: |
|
|
366
129
|
<strategy>
|
|
367
|
-
APPROACH:
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
- {parent_b} (score: {score_b}): {summary of what it does well}
|
|
371
|
-
Take the best elements from each and merge them into a single harness.
|
|
130
|
+
APPROACH: exploitation
|
|
131
|
+
Make targeted improvements to the current best version.
|
|
132
|
+
Focus on the specific failures identified in the results.
|
|
372
133
|
</strategy>
|
|
373
134
|
|
|
374
|
-
<objective>
|
|
375
|
-
Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
|
|
376
|
-
</objective>
|
|
377
|
-
|
|
378
135
|
<files_to_read>
|
|
379
|
-
- .
|
|
380
|
-
- .
|
|
381
|
-
- .
|
|
382
|
-
- .
|
|
383
|
-
- .
|
|
384
|
-
- .harness-evolver/harnesses/{parent_b}/harness.py
|
|
385
|
-
- .harness-evolver/harnesses/{parent_b}/scores.json
|
|
386
|
-
- .harness-evolver/langsmith_diagnosis.json (if exists)
|
|
387
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
388
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
389
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
390
|
-
- .harness-evolver/architecture.json (if exists)
|
|
136
|
+
- .evolver.json
|
|
137
|
+
- trace_insights.json (if exists)
|
|
138
|
+
- production_seed.json (if exists)
|
|
139
|
+
- best_results.json (if exists)
|
|
140
|
+
- {entry point file from .evolver.json}
|
|
391
141
|
</files_to_read>
|
|
392
142
|
|
|
393
|
-
<
|
|
394
|
-
|
|
395
|
-
|
|
396
|
-
|
|
397
|
-
|
|
398
|
-
|
|
399
|
-
|
|
400
|
-
**Also spawn these additional candidates:**
|
|
401
|
-
|
|
402
|
-
**Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
|
|
403
|
-
|
|
404
|
-
If failure clusters were found in step 1.8:
|
|
405
|
-
```
|
|
406
|
-
Agent(
|
|
407
|
-
subagent_type: "harness-evolver-proposer",
|
|
408
|
-
description: "Proposer D: fix {cluster_1_category} failures",
|
|
409
|
-
run_in_background: true,
|
|
410
|
-
prompt: |
|
|
411
|
-
<strategy>
|
|
412
|
-
APPROACH: failure-targeted
|
|
413
|
-
Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
|
|
414
|
-
They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
|
|
415
|
-
Read the traces of these specific tasks to understand WHY they fail.
|
|
416
|
-
Your changes should improve these tasks WITHOUT regressing others.
|
|
417
|
-
You are free to change anything — prompts, code, retrieval, architecture —
|
|
418
|
-
whatever is needed to fix THIS specific failure mode.
|
|
419
|
-
</strategy>
|
|
420
|
-
|
|
421
|
-
<objective>
|
|
422
|
-
Propose harness version {version}d targeting {cluster_1_category} failures.
|
|
423
|
-
</objective>
|
|
424
|
-
|
|
425
|
-
<files_to_read>
|
|
426
|
-
- .harness-evolver/summary.json
|
|
427
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
428
|
-
- .harness-evolver/config.json
|
|
429
|
-
- .harness-evolver/harnesses/{best_version}/harness.py
|
|
430
|
-
- .harness-evolver/harnesses/{best_version}/scores.json
|
|
431
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
432
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
433
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
434
|
-
- .harness-evolver/architecture.json (if exists)
|
|
435
|
-
</files_to_read>
|
|
143
|
+
<context>
|
|
144
|
+
Best experiment: {best_experiment} (score: {best_score})
|
|
145
|
+
Framework: {framework}
|
|
146
|
+
Entry point: {entry_point}
|
|
147
|
+
Evaluators: {evaluators}
|
|
148
|
+
Failing examples: {failing_example_summary}
|
|
149
|
+
</context>
|
|
436
150
|
|
|
437
151
|
<output>
|
|
438
|
-
|
|
439
|
-
|
|
152
|
+
1. Modify the code to improve performance
|
|
153
|
+
2. Commit your changes with a descriptive message
|
|
154
|
+
3. Write proposal.md explaining what you changed and why
|
|
440
155
|
</output>
|
|
441
156
|
)
|
|
442
157
|
```
|
|
443
158
|
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
Agent(
|
|
447
|
-
subagent_type: "harness-evolver-proposer",
|
|
448
|
-
description: "Proposer D: creative approach",
|
|
449
|
-
run_in_background: true,
|
|
450
|
-
prompt: |
|
|
451
|
-
<strategy>
|
|
452
|
-
APPROACH: creative
|
|
453
|
-
All tasks are scoring well. Try something UNEXPECTED:
|
|
454
|
-
- Different algorithm or library
|
|
455
|
-
- Completely different prompt architecture
|
|
456
|
-
- Novel error handling or output validation
|
|
457
|
-
- Something no one would think of
|
|
458
|
-
The goal is to discover improvements that incremental fixes would miss.
|
|
459
|
-
</strategy>
|
|
460
|
-
...same files_to_read and output as above...
|
|
461
|
-
)
|
|
462
|
-
```
|
|
463
|
-
|
|
464
|
-
**Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
|
|
465
|
-
|
|
466
|
-
If a second failure cluster exists:
|
|
467
|
-
```
|
|
468
|
-
Agent(
|
|
469
|
-
subagent_type: "harness-evolver-proposer",
|
|
470
|
-
description: "Proposer E: fix {cluster_2_category} failures",
|
|
471
|
-
run_in_background: true,
|
|
472
|
-
prompt: |
|
|
473
|
-
<strategy>
|
|
474
|
-
APPROACH: failure-targeted
|
|
475
|
-
Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
|
|
476
|
-
They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
|
|
477
|
-
Read the traces of these specific tasks to understand WHY they fail.
|
|
478
|
-
Your changes should improve these tasks WITHOUT regressing others.
|
|
479
|
-
You are free to change anything — prompts, code, retrieval, architecture —
|
|
480
|
-
whatever is needed to fix THIS specific failure mode.
|
|
481
|
-
</strategy>
|
|
482
|
-
|
|
483
|
-
<objective>
|
|
484
|
-
Propose harness version {version}e targeting {cluster_2_category} failures.
|
|
485
|
-
</objective>
|
|
486
|
-
|
|
487
|
-
<files_to_read>
|
|
488
|
-
- .harness-evolver/summary.json
|
|
489
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
490
|
-
- .harness-evolver/config.json
|
|
491
|
-
- .harness-evolver/harnesses/{best_version}/harness.py
|
|
492
|
-
- .harness-evolver/harnesses/{best_version}/scores.json
|
|
493
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
494
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
495
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
496
|
-
- .harness-evolver/architecture.json (if exists)
|
|
497
|
-
</files_to_read>
|
|
498
|
-
|
|
499
|
-
<output>
|
|
500
|
-
Create directory .harness-evolver/harnesses/{version}e/ containing:
|
|
501
|
-
- harness.py, config.json, proposal.md
|
|
502
|
-
</output>
|
|
503
|
-
)
|
|
504
|
-
```
|
|
159
|
+
**Candidate B (Explorer)** — `run_in_background: true`:
|
|
160
|
+
Same structure but `APPROACH: exploration` — bold, fundamentally different approach.
|
|
505
161
|
|
|
506
|
-
|
|
507
|
-
|
|
508
|
-
|
|
509
|
-
subagent_type: "harness-evolver-proposer",
|
|
510
|
-
description: "Proposer E: efficiency optimization",
|
|
511
|
-
run_in_background: true,
|
|
512
|
-
prompt: |
|
|
513
|
-
<strategy>
|
|
514
|
-
APPROACH: efficiency
|
|
515
|
-
Maintain the current quality but optimize for:
|
|
516
|
-
- Fewer LLM tokens (shorter prompts, less context)
|
|
517
|
-
- Faster execution (reduce unnecessary steps)
|
|
518
|
-
- Simpler code (remove redundant logic)
|
|
519
|
-
- Better error handling (graceful degradation)
|
|
520
|
-
Do NOT sacrifice accuracy for speed — same quality, less cost.
|
|
521
|
-
</strategy>
|
|
522
|
-
...same files_to_read and output as above...
|
|
523
|
-
)
|
|
524
|
-
```
|
|
162
|
+
**Candidate C (Crossover)** — `run_in_background: true`:
|
|
163
|
+
Same structure but `APPROACH: crossover` — combine strengths from previous iterations.
|
|
164
|
+
Include git log of recent changes so it can see what was tried.
|
|
525
165
|
|
|
526
|
-
|
|
166
|
+
**Candidates D and E (Failure-Targeted)** — `run_in_background: true`:
|
|
167
|
+
Same structure but `APPROACH: failure-targeted` with specific failing example clusters.
|
|
168
|
+
If ALL_PASSING: D gets `creative`, E gets `efficiency`.
|
|
527
169
|
|
|
528
|
-
|
|
170
|
+
Wait for all 5 to complete.
|
|
529
171
|
|
|
530
|
-
|
|
172
|
+
### 3. Evaluate Each Candidate
|
|
531
173
|
|
|
532
|
-
|
|
174
|
+
For each worktree that has changes (proposer committed something):
|
|
533
175
|
|
|
534
|
-
For each candidate (a, b, c, d, e):
|
|
535
176
|
```bash
|
|
536
|
-
python3 $TOOLS/
|
|
177
|
+
python3 $TOOLS/run_eval.py \
|
|
178
|
+
--config .evolver.json \
|
|
179
|
+
--worktree-path {worktree_path} \
|
|
180
|
+
--experiment-prefix v{NNN}{suffix} \
|
|
181
|
+
--timeout 120
|
|
537
182
|
```
|
|
538
183
|
|
|
539
|
-
|
|
184
|
+
Each candidate becomes a separate LangSmith experiment.
|
|
540
185
|
|
|
541
|
-
### 4.
|
|
186
|
+
### 4. Compare All Candidates
|
|
542
187
|
|
|
543
|
-
For each valid candidate:
|
|
544
188
|
```bash
python3 $TOOLS/read_results.py \
  --experiments "v{NNN}a,v{NNN}b,v{NNN}c,v{NNN}d,v{NNN}e" \
  --config .evolver.json \
  --output comparison.json
```

Parse `comparison.json`:
- `comparison.winner` — highest combined score
- `comparison.champion` — per-task champion (for next crossover)
- `comparison.all_candidates` — all scores for reporting
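
A minimal sketch of reading those fields, assuming `comparison.json` has the shape described by the bullets above (the sample data is invented for illustration):

```python
import json

# Invented sample of what read_results.py might write; the three top-level
# fields match the ones listed above.
sample = {
    "winner": "v001b",
    "champion": "v001d",
    "all_candidates": {"v001a": 0.61, "v001b": 0.74, "v001c": 0.58,
                       "v001d": 0.70, "v001e": 0.66},
}
with open('comparison.json', 'w') as f:
    json.dump(sample, f)

comparison = json.load(open('comparison.json'))
winner = comparison['winner']             # highest combined score
champion = comparison.get('champion')     # crossover parent next iteration
winner_score = comparison['all_candidates'][winner]
print(winner, winner_score, champion)
```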

### 5. Merge Winner

If the winner scored higher than the current best:

```bash
# Get the winning worktree's branch
WINNER_BRANCH={winning_worktree_branch}

# Merge into main
git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{suffix} (score: {score})"
```
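
The "scored higher than the current best" gate is a plain comparison against the `best_score` recorded in `.evolver.json`; a sketch with illustrative stand-in values:

```python
# Gate the merge on improvement over the recorded best. The best_score
# field name matches the .evolver.json update in this section; the
# numbers are illustrative stand-ins, not real results.
config = {'best_score': 0.70}   # stands in for json.load(open('.evolver.json'))
winner_score = 0.74             # from comparison.json

should_merge = winner_score > config['best_score']
print(should_merge)
```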

Update `.evolver.json`:
```python
import json
c = json.load(open('.evolver.json'))
c['best_experiment'] = '{winner_experiment}'
c['best_score'] = {winner_score}
c['iterations'] = c['iterations'] + 1
c['history'].append({
    'version': 'v{NNN}',
    'experiment': '{winner_experiment}',
    'score': {winner_score}
})
json.dump(c, open('.evolver.json', 'w'), indent=2)
```

Report ALL candidates:
```
Iteration {i}/{N} — 5 candidates evaluated:
v{NNN}a (exploit): {score_a} — {summary}
v{NNN}b (explore): {score_b} — {summary}
v{NNN}c (crossover): {score_c} — {summary}
v{NNN}d ({strategy}): {score_d} — {summary}
v{NNN}e ({strategy}): {score_e} — {summary}

Winner: v{NNN}{suffix} ({score}) — merged into main
Per-task champion: {champion} (beats winner on {N} tasks)
```

### 5.5. Test Suite Growth

If previously-failing examples now pass, add regression examples to the dataset:

```bash
python3 -c "
from langsmith import Client
import json

client = Client()
config = json.load(open('.evolver.json'))

# Find examples that improved significantly
# (score went from <0.5 to >0.8 between iterations)
# Generate variations and add to dataset
# client.create_examples(dataset_id=config['dataset_id'], examples=[...])
print('Test suite growth: added N regression examples')
" 2>/dev/null
```
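
The selection rule sketched in the comments above (score crossed from below 0.5 to above 0.8 between iterations) amounts to a filter over per-example scores; the score dicts here are illustrative:

```python
# Pick regression candidates: examples whose score crossed both
# thresholds between the previous and current iteration.
prev_scores = {"ex1": 0.2, "ex2": 0.9, "ex3": 0.4}   # illustrative
curr_scores = {"ex1": 0.85, "ex2": 0.95, "ex3": 0.6}

regression_candidates = [
    ex for ex, curr in curr_scores.items()
    if prev_scores.get(ex, 0.0) < 0.5 and curr > 0.8
]
print(regression_candidates)
```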
### 6. Report

Print: `Iteration {i}/{N}: v{NNN} scored {score} (best: {best} at {best_score})`

### 6.5. Auto-trigger Critic

If score jumped >0.3 from previous iteration OR reached target in <3 iterations:

Spawn the critic agent to analyze evaluator quality:

```
Agent(
subagent_type: "evolver-critic",
description: "Critic: check evaluator gaming",
prompt: |
<objective>
EVAL GAMING DETECTED: Score jumped from {prev_score} to {score}.
Check if the LangSmith evaluators are being gamed.
</objective>
<files_to_read>
- .evolver.json
- comparison.json
- trace_insights.json
</files_to_read>
)
```
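
The two trigger conditions can be sketched as a single predicate; the 0.3 jump and the 3-iteration threshold come from the text above, while the function name and sample values are illustrative:

```python
# Critic trigger: a suspicious score jump, or hitting the target
# suspiciously fast, both suggest the evaluators may be gamed.
def should_trigger_critic(prev_score, score, iterations, target_score):
    jumped = (score - prev_score) > 0.3
    too_fast = score >= target_score and iterations < 3
    return jumped or too_fast

print(should_trigger_critic(0.5, 0.9, 5, 0.95))   # jump of 0.4
print(should_trigger_critic(0.8, 0.85, 2, 0.85))  # target reached in 2 iterations
```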

### 7. Auto-trigger Architect

If 3 consecutive iterations within 1% OR score dropped:

```
Agent(
subagent_type: "evolver-architect",
description: "Architect: recommend topology change",
prompt: |
<objective>
The evolution loop has stagnated after {iterations} iterations.
Analyze the architecture and recommend changes.
</objective>
<files_to_read>
- .evolver.json
- trace_insights.json
- {entry point and related source files}
</files_to_read>
)
```
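
The stagnation half of the trigger (3 consecutive iterations within 1%) can be sketched against the `history` list kept in `.evolver.json`; the function name and sample scores are illustrative:

```python
# Stagnation: the last three recorded scores all sit within 1% of the
# best of them. Entries mirror the history items in .evolver.json.
def is_stagnant(history, window=3, tolerance=0.01):
    if len(history) < window:
        return False
    recent = [h['score'] for h in history[-window:]]
    return (max(recent) - min(recent)) <= tolerance * max(recent)

print(is_stagnant([{'score': 0.700}, {'score': 0.701}, {'score': 0.703}]))
```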
### 8. Check Stop Conditions

- **Target**: `score >= target_score` → stop
- **N reached**: done
- **Stagnation post-architect**: 3 more iterations without improvement → stop

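Combined, the stop conditions above amount to a short check run after each iteration; the function and parameter names are illustrative, the thresholds come from the list:

```python
# Evaluate the three stop conditions in order; return a reason or None.
def should_stop(score, target_score, iteration, max_iterations,
                stagnant_after_architect):
    if score >= target_score:
        return 'target reached'
    if iteration >= max_iterations:
        return 'iteration budget exhausted'
    if stagnant_after_architect:
        return 'stagnation after architect'
    return None

print(should_stop(0.96, 0.95, 4, 10, False))
```
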
## When Loop Ends — Final Report
- Best version and score
- Improvement over baseline (absolute and %)
- Total iterations run
- Key changes made (git log from baseline to current)
- LangSmith experiment URLs for comparison
- Suggest: `/evolver:deploy` to finalize
|