harness-evolver 2.9.1 → 3.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +62 -117
- package/agents/evolver-architect.md +53 -0
- package/agents/evolver-critic.md +44 -0
- package/agents/evolver-proposer.md +128 -0
- package/agents/evolver-testgen.md +67 -0
- package/bin/install.js +181 -171
- package/package.json +7 -7
- package/skills/deploy/SKILL.md +49 -56
- package/skills/evolve/SKILL.md +156 -687
- package/skills/setup/SKILL.md +182 -0
- package/skills/status/SKILL.md +23 -21
- package/tools/read_results.py +240 -0
- package/tools/run_eval.py +202 -0
- package/tools/seed_from_traces.py +36 -8
- package/tools/setup.py +393 -0
- package/tools/trace_insights.py +86 -14
- package/agents/harness-evolver-architect.md +0 -173
- package/agents/harness-evolver-critic.md +0 -132
- package/agents/harness-evolver-judge.md +0 -110
- package/agents/harness-evolver-proposer.md +0 -317
- package/agents/harness-evolver-testgen.md +0 -112
- package/examples/classifier/README.md +0 -25
- package/examples/classifier/config.json +0 -3
- package/examples/classifier/eval.py +0 -58
- package/examples/classifier/harness.py +0 -111
- package/examples/classifier/tasks/task_001.json +0 -1
- package/examples/classifier/tasks/task_002.json +0 -1
- package/examples/classifier/tasks/task_003.json +0 -1
- package/examples/classifier/tasks/task_004.json +0 -1
- package/examples/classifier/tasks/task_005.json +0 -1
- package/examples/classifier/tasks/task_006.json +0 -1
- package/examples/classifier/tasks/task_007.json +0 -1
- package/examples/classifier/tasks/task_008.json +0 -1
- package/examples/classifier/tasks/task_009.json +0 -1
- package/examples/classifier/tasks/task_010.json +0 -1
- package/skills/architect/SKILL.md +0 -93
- package/skills/compare/SKILL.md +0 -73
- package/skills/critic/SKILL.md +0 -67
- package/skills/diagnose/SKILL.md +0 -96
- package/skills/import-traces/SKILL.md +0 -102
- package/skills/init/SKILL.md +0 -293
- package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
- package/tools/eval_llm_judge.py +0 -233
- package/tools/eval_passthrough.py +0 -55
- package/tools/evaluate.py +0 -255
- package/tools/import_traces.py +0 -229
- package/tools/init.py +0 -531
- package/tools/llm_api.py +0 -125
- package/tools/state.py +0 -219
- package/tools/test_growth.py +0 -230
- package/tools/trace_logger.py +0 -42
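A recurring theme in the diff below is that 3.0.0 replaces the `.harness-evolver/` state directory with a single `.evolver.json` file that the skills read with inline `python3 -c` one-liners. A minimal sketch of that state handling, assuming only the field names that appear in the diff (`best_experiment`, `best_score`, `iterations`); the helper names are hypothetical, not part of the package:

```python
import json
import os
import tempfile

def load_state(path=".evolver.json"):
    """Load the evolver state file (hypothetical helper; field names
    best_experiment, best_score, iterations come from the diff)."""
    with open(path) as f:
        return json.load(f)

def next_version(state):
    """Next version tag, mirroring the skill's f'v{iterations+1:03d}'."""
    return f"v{state['iterations'] + 1:03d}"

def summarize(state):
    """Mirror of the 'Read config' one-liner in the new SKILL.md."""
    return (f"Best: {state['best_experiment']} "
            f"({state['best_score']:.3f}), Iterations: {state['iterations']}")

# Demo against a throwaway state file
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, ".evolver.json")
with open(path, "w") as f:
    json.dump({"best_experiment": "v004b", "best_score": 0.873,
               "iterations": 4}, f)

state = load_state(path)
print(next_version(state))   # v005
print(summarize(state))      # Best: v004b (0.873), Iterations: 4
```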
package/skills/evolve/SKILL.md
CHANGED
@@ -1,34 +1,31 @@
 ---
-name:
-description: "Use when the user wants to run the optimization loop, improve
+name: evolver:evolve
+description: "Use when the user wants to run the optimization loop, improve agent performance, evolve the agent, or iterate on quality. Requires .evolver.json to exist (run evolver:setup first)."
 argument-hint: "[--iterations N]"
 allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
 ---
 
-# /
+# /evolver:evolve
 
-Run the autonomous propose-evaluate-iterate loop.
+Run the autonomous propose-evaluate-iterate loop using LangSmith as the evaluation backend and git worktrees for isolation.
 
 ## Prerequisites
 
-`.
+`.evolver.json` must exist. If not, tell user to run `evolver:setup`.
 
 ## Resolve Tool Path
 
 ```bash
-TOOLS=$([ -d ".
+TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")
 ```
 
 ## Parse Arguments
 
-- `--iterations N` (default:
-- Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
+- `--iterations N` (default: from interactive question or 5)
 
 ## Pre-Loop: Interactive Configuration
 
-If no `--iterations` argument was provided, ask the user
-
-Use AskUserQuestion with TWO questions in a single call (simple single-select, no preview needed):
+If no `--iterations` argument was provided, ask the user:
 
 ```json
 {
@@ -38,7 +35,7 @@ Use AskUserQuestion with TWO questions in a single call (simple single-select, n
         "header": "Iterations",
         "multiSelect": false,
         "options": [
-          {"label": "3 (quick)", "description": "Fast exploration, good for testing
+          {"label": "3 (quick)", "description": "Fast exploration, good for testing. ~15 min."},
           {"label": "5 (balanced)", "description": "Good trade-off between speed and quality. ~30 min."},
           {"label": "10 (thorough)", "description": "Deep optimization with adaptive strategies. ~1 hour."}
         ]
@@ -48,7 +45,7 @@ Use AskUserQuestion with TWO questions in a single call (simple single-select, n
         "header": "Target",
         "multiSelect": false,
         "options": [
-          {"label": "0.8 (good enough)", "description": "Stop when the
+          {"label": "0.8 (good enough)", "description": "Stop when the agent is reasonably good"},
           {"label": "0.9 (high quality)", "description": "Stop when quality is high"},
           {"label": "0.95 (near perfect)", "description": "Push for near-perfect scores"},
           {"label": "No limit", "description": "Run all iterations regardless of score"}
@@ -58,797 +55,269 @@ Use AskUserQuestion with TWO questions in a single call (simple single-select, n
|
|
|
58
55
|
}
|
|
59
56
|
```
|
|
60
57
|
|
|
61
|
-
Apply the answers:
|
|
62
|
-
- Set iterations from question 1 (3, 5, or 10)
|
|
63
|
-
- Set target_score from question 2 (0.8, 0.9, 0.95, or None)
|
|
64
|
-
|
|
65
|
-
If `--iterations` WAS provided as argument, skip these questions and use the argument value.
|
|
66
|
-
|
|
67
58
|
## The Loop
|
|
68
59
|
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
### 1. Get Next Version
|
|
72
|
-
|
|
60
|
+
Read config:
|
|
73
61
|
```bash
|
|
74
|
-
python3 -c "import json;
|
|
62
|
+
python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Best: {c[\"best_experiment\"]} ({c[\"best_score\"]:.3f}), Iterations: {c[\"iterations\"]}')"
|
|
75
63
|
```
|
|
76
64
|
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
|
|
80
|
-
|
|
81
|
-
```bash
|
|
82
|
-
PROD_PROJECT=$(python3 -c "
|
|
83
|
-
import json, os
|
|
84
|
-
c = json.load(open('.harness-evolver/config.json'))
|
|
85
|
-
print(c.get('eval', {}).get('production_project', ''))
|
|
86
|
-
" 2>/dev/null)
|
|
87
|
-
if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
|
|
88
|
-
python3 $TOOLS/seed_from_traces.py \
|
|
89
|
-
--project "$PROD_PROJECT" \
|
|
90
|
-
--output-md .harness-evolver/production_seed.md \
|
|
91
|
-
--output-json .harness-evolver/production_seed.json \
|
|
92
|
-
--limit 100 2>/dev/null
|
|
93
|
-
fi
|
|
94
|
-
```
|
|
95
|
-
|
|
96
|
-
The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
|
|
97
|
-
|
|
98
|
-
### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
|
|
99
|
-
|
|
100
|
-
**Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
|
|
101
|
-
|
|
102
|
-
**Step 1: Find the actual LangSmith project name**
|
|
103
|
-
|
|
104
|
-
```bash
|
|
105
|
-
langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
|
|
106
|
-
```
|
|
107
|
-
|
|
108
|
-
This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
|
|
109
|
-
|
|
110
|
-
```bash
|
|
111
|
-
LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
|
|
112
|
-
```
|
|
113
|
-
|
|
114
|
-
If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
|
|
65
|
+
For each iteration:
|
|
115
66
|
|
|
116
|
-
|
|
67
|
+
### 1. Get Next Version
|
|
117
68
|
|
|
118
69
|
```bash
|
|
119
|
-
|
|
120
|
-
langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
|
|
121
|
-
langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
|
|
122
|
-
echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
|
|
123
|
-
else
|
|
124
|
-
echo "[]" > /tmp/langsmith_raw.json
|
|
125
|
-
echo "{}" > .harness-evolver/langsmith_stats.json
|
|
126
|
-
fi
|
|
70
|
+
python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"iterations\"]+1:03d}')"
|
|
127
71
|
```
|
|
128
72
|
|
|
129
|
-
|
|
73
|
+
### 1.5. Gather Trace Insights
|
|
130
74
|
|
|
131
|
-
|
|
75
|
+
Run trace insights from the best experiment:
|
|
132
76
|
|
|
133
77
|
```bash
|
|
134
|
-
python3 -c "
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
if not raw:
|
|
139
|
-
json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
|
|
140
|
-
sys.exit(0)
|
|
141
|
-
|
|
142
|
-
clean = []
|
|
143
|
-
for r in raw:
|
|
144
|
-
entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
|
|
145
|
-
|
|
146
|
-
# Extract readable prompt from LangChain serialized inputs
|
|
147
|
-
inputs = r.get('inputs', {})
|
|
148
|
-
if isinstance(inputs, dict) and 'messages' in inputs:
|
|
149
|
-
msgs = inputs['messages']
|
|
150
|
-
for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
|
|
151
|
-
for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
|
|
152
|
-
if isinstance(msg, dict):
|
|
153
|
-
kwargs = msg.get('kwargs', msg)
|
|
154
|
-
content = kwargs.get('content', '')
|
|
155
|
-
msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
|
|
156
|
-
if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
|
|
157
|
-
entry['user_message'] = str(content)[:300]
|
|
158
|
-
elif 'System' in str(msg_type):
|
|
159
|
-
entry['system_prompt_preview'] = str(content)[:200]
|
|
160
|
-
|
|
161
|
-
# Extract readable output
|
|
162
|
-
outputs = r.get('outputs', {})
|
|
163
|
-
if isinstance(outputs, dict) and 'generations' in outputs:
|
|
164
|
-
gens = outputs['generations']
|
|
165
|
-
if gens and isinstance(gens, list) and gens[0]:
|
|
166
|
-
gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
|
|
167
|
-
if isinstance(gen, dict):
|
|
168
|
-
msg = gen.get('message', gen)
|
|
169
|
-
if isinstance(msg, dict):
|
|
170
|
-
kwargs = msg.get('kwargs', msg)
|
|
171
|
-
entry['llm_response'] = str(kwargs.get('content', ''))[:300]
|
|
172
|
-
|
|
173
|
-
clean.append(entry)
|
|
174
|
-
|
|
175
|
-
json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
|
|
176
|
-
print(f'Processed {len(clean)} LangSmith runs into readable format')
|
|
177
|
-
" 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
|
|
178
|
-
```
|
|
179
|
-
|
|
180
|
-
The resulting `langsmith_runs.json` has clean, readable entries:
|
|
181
|
-
```json
|
|
182
|
-
[
|
|
183
|
-
{
|
|
184
|
-
"name": "ChatGoogleGenerativeAI",
|
|
185
|
-
"tokens": 1332,
|
|
186
|
-
"error": null,
|
|
187
|
-
"user_message": "Analise este texto: Bom dia pessoal...",
|
|
188
|
-
"system_prompt_preview": "Você é um moderador de conteúdo...",
|
|
189
|
-
"llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
|
|
190
|
-
}
|
|
191
|
-
]
|
|
78
|
+
BEST=$(python3 -c "import json; print(json.load(open('.evolver.json'))['best_experiment'])")
|
|
79
|
+
python3 $TOOLS/trace_insights.py \
|
|
80
|
+
--from-experiment "$BEST" \
|
|
81
|
+
--output trace_insights.json 2>/dev/null
|
|
192
82
|
```
|
|
193
83
|
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
### 1.6. Generate Trace Insights (systematic analysis)
|
|
197
|
-
|
|
198
|
-
If LangSmith traces were gathered, run systematic analysis to cluster errors, analyze token usage, and cross-reference with scores:
|
|
84
|
+
If a production project is configured, also gather production insights:
|
|
199
85
|
|
|
200
86
|
```bash
|
|
201
|
-
|
|
202
|
-
|
|
203
|
-
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
--
|
|
207
|
-
--
|
|
208
|
-
--scores "$SCORES_PATH" \
|
|
209
|
-
--tasks-dir .harness-evolver/eval/tasks/ \
|
|
210
|
-
--output .harness-evolver/trace_insights.json 2>/dev/null
|
|
87
|
+
PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
|
|
88
|
+
if [ -n "$PROD" ] && [ ! -f "production_seed.json" ]; then
|
|
89
|
+
python3 $TOOLS/seed_from_traces.py \
|
|
90
|
+
--project "$PROD" --use-sdk \
|
|
91
|
+
--output-md production_seed.md \
|
|
92
|
+
--output-json production_seed.json \
|
|
93
|
+
--limit 100 2>/dev/null
|
|
211
94
|
fi
|
|
212
95
|
```
|
|
213
96
|
|
|
214
|
-
|
|
215
|
-
- `error_clusters`: grouped error patterns with counts
|
|
216
|
-
- `token_analysis`: score distribution by token usage bucket (low/medium/high)
|
|
217
|
-
- `hypotheses`: data-driven theories about failure causes
|
|
218
|
-
- `top_issues`: highest-impact problems sorted by severity
|
|
97
|
+
### 1.8. Analyze Per-Task Failures
|
|
219
98
|
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
|
|
223
|
-
|
|
224
|
-
Before spawning proposers, analyze which tasks are failing and cluster them:
|
|
99
|
+
Read the best experiment results and cluster failures:
|
|
225
100
|
|
|
226
101
|
```bash
|
|
227
|
-
python3
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
summary = json.load(open('.harness-evolver/summary.json'))
|
|
232
|
-
best = summary['best']['version']
|
|
233
|
-
scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
|
|
234
|
-
if not os.path.exists(scores_path):
|
|
235
|
-
scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
|
|
236
|
-
|
|
237
|
-
if not scores_path or not os.path.exists(scores_path):
|
|
238
|
-
print('NO_SCORES')
|
|
239
|
-
sys.exit(0)
|
|
240
|
-
|
|
241
|
-
scores = json.load(open(scores_path))
|
|
242
|
-
tasks_dir = '.harness-evolver/eval/tasks/'
|
|
243
|
-
failures = {}
|
|
244
|
-
|
|
245
|
-
for tid, tdata in scores.get('per_task', {}).items():
|
|
246
|
-
score = tdata.get('score', 0)
|
|
247
|
-
if score < 0.7:
|
|
248
|
-
tfile = os.path.join(tasks_dir, tid + '.json')
|
|
249
|
-
cat = 'unknown'
|
|
250
|
-
if os.path.exists(tfile):
|
|
251
|
-
task = json.load(open(tfile))
|
|
252
|
-
meta = task.get('metadata', {})
|
|
253
|
-
cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
|
|
254
|
-
failures.setdefault(cat, []).append({'id': tid, 'score': score})
|
|
255
|
-
|
|
256
|
-
if not failures:
|
|
257
|
-
print('ALL_PASSING')
|
|
258
|
-
else:
|
|
259
|
-
sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
|
|
260
|
-
for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
|
|
261
|
-
task_ids = [t['id'] for t in tasks]
|
|
262
|
-
avg_score = sum(t['score'] for t in tasks) / len(tasks)
|
|
263
|
-
print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
|
|
264
|
-
" 2>/dev/null
|
|
102
|
+
python3 $TOOLS/read_results.py \
|
|
103
|
+
--experiment "$BEST" \
|
|
104
|
+
--config .evolver.json \
|
|
105
|
+
--output best_results.json 2>/dev/null
|
|
265
106
|
```
|
|
266
107
|
|
|
267
|
-
Parse
|
|
268
|
-
|
|
269
|
-
- If clusters found: D targets cluster 1, E targets cluster 2
|
|
270
|
-
- If only 1 cluster: D targets it, E gets "creative" brief
|
|
271
|
-
|
|
272
|
-
Save clusters for use in step 2.
|
|
273
|
-
|
|
274
|
-
### 2. Propose (3 parallel candidates)
|
|
108
|
+
Parse `best_results.json` to find failing examples (score < 0.7). Group by metadata or error pattern.
|
|
109
|
+
Generate adaptive briefings for Candidates D and E (same logic as v2).
|
|
275
110
|
|
|
276
|
-
|
|
277
|
-
This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
|
|
111
|
+
### 2. Spawn 5 Proposers in Parallel
|
|
278
112
|
|
|
279
|
-
|
|
280
|
-
- **Exploiter parent**: current best version (from summary.json `best.version`)
|
|
281
|
-
- **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
|
|
282
|
-
- **Crossover parents**:
|
|
283
|
-
- Parent A = current best version
|
|
284
|
-
- Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
|
|
285
|
-
If no champion file exists, fall back to a non-best version from the archive.
|
|
113
|
+
Each proposer runs in a **git worktree** via Claude Code's native `isolation: "worktree"` parameter.
|
|
286
114
|
|
|
287
|
-
|
|
115
|
+
**Candidate A (Exploit)** — `run_in_background: true`:
|
|
288
116
|
|
|
289
|
-
**Candidate A (Exploiter)** — `run_in_background: true`:
|
|
290
117
|
```
|
|
291
118
|
Agent(
|
|
292
|
-
subagent_type: "
|
|
293
|
-
description: "Proposer A
|
|
119
|
+
subagent_type: "evolver-proposer",
|
|
120
|
+
description: "Proposer A: exploit best version",
|
|
121
|
+
isolation: "worktree",
|
|
294
122
|
run_in_background: true,
|
|
295
123
|
prompt: |
|
|
296
|
-
<strategy>
|
|
297
|
-
APPROACH: exploitation
|
|
298
|
-
You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
|
|
299
|
-
the highest-impact failing tasks. Base your work on the current best version.
|
|
300
|
-
Do NOT restructure the code. Do NOT change the architecture.
|
|
301
|
-
Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
|
|
302
|
-
</strategy>
|
|
303
|
-
|
|
304
124
|
<objective>
|
|
305
|
-
|
|
125
|
+
Improve the agent code to score higher on the evaluation dataset.
|
|
126
|
+
You are working in an isolated git worktree — modify any file freely.
|
|
306
127
|
</objective>
|
|
307
128
|
|
|
308
|
-
<files_to_read>
|
|
309
|
-
- .harness-evolver/summary.json
|
|
310
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
311
|
-
- .harness-evolver/config.json
|
|
312
|
-
- .harness-evolver/harnesses/{best_version}/harness.py
|
|
313
|
-
- .harness-evolver/harnesses/{best_version}/scores.json
|
|
314
|
-
- .harness-evolver/harnesses/{best_version}/proposal.md
|
|
315
|
-
- .harness-evolver/langsmith_diagnosis.json (if exists)
|
|
316
|
-
- .harness-evolver/langsmith_stats.json (if exists)
|
|
317
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
318
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
319
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
320
|
-
- .harness-evolver/architecture.json (if exists)
|
|
321
|
-
</files_to_read>
|
|
322
|
-
|
|
323
|
-
<output>
|
|
324
|
-
Create directory .harness-evolver/harnesses/{version}a/ containing:
|
|
325
|
-
- harness.py, config.json, proposal.md
|
|
326
|
-
</output>
|
|
327
|
-
)
|
|
328
|
-
```
|
|
329
|
-
|
|
330
|
-
**Candidate B (Explorer)** — `run_in_background: true`:
|
|
331
|
-
```
|
|
332
|
-
Agent(
|
|
333
|
-
subagent_type: "harness-evolver-proposer",
|
|
334
|
-
description: "Proposer B (explore): bold change from {explorer_parent}",
|
|
335
|
-
run_in_background: true,
|
|
336
|
-
prompt: |
|
|
337
|
-
<strategy>
|
|
338
|
-
APPROACH: exploration
|
|
339
|
-
You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
|
|
340
|
-
Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
|
|
341
|
-
Consider: different retrieval strategy, different prompt structure,
|
|
342
|
-
different output parsing, different error handling philosophy.
|
|
343
|
-
Be bold. A creative failure teaches more than a timid success.
|
|
344
|
-
</strategy>
|
|
345
|
-
|
|
346
|
-
<objective>
|
|
347
|
-
Propose harness version {version}b that takes a different approach.
|
|
348
|
-
</objective>
|
|
349
|
-
|
|
350
|
-
<files_to_read>
|
|
351
|
-
- .harness-evolver/summary.json
|
|
352
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
353
|
-
- .harness-evolver/config.json
|
|
354
|
-
- .harness-evolver/baseline/harness.py
|
|
355
|
-
- .harness-evolver/harnesses/{explorer_parent}/harness.py
|
|
356
|
-
- .harness-evolver/harnesses/{explorer_parent}/scores.json
|
|
357
|
-
- .harness-evolver/langsmith_diagnosis.json (if exists)
|
|
358
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
359
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
360
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
361
|
-
- .harness-evolver/architecture.json (if exists)
|
|
362
|
-
</files_to_read>
|
|
363
|
-
|
|
364
|
-
<output>
|
|
365
|
-
Create directory .harness-evolver/harnesses/{version}b/ containing:
|
|
366
|
-
- harness.py, config.json, proposal.md
|
|
367
|
-
</output>
|
|
368
|
-
)
|
|
369
|
-
```
|
|
370
|
-
|
|
371
|
-
**Candidate C (Crossover)** — blocks (last one):
|
|
372
|
-
```
|
|
373
|
-
Agent(
|
|
374
|
-
subagent_type: "harness-evolver-proposer",
|
|
375
|
-
description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
|
|
376
|
-
prompt: |
|
|
377
129
|
<strategy>
|
|
378
|
-
APPROACH:
|
|
379
|
-
|
|
380
|
-
|
|
381
|
-
- {parent_b} (score: {score_b}): {summary of what it does well}
|
|
382
|
-
Take the best elements from each and merge them into a single harness.
|
|
130
|
+
APPROACH: exploitation
|
|
131
|
+
Make targeted improvements to the current best version.
|
|
132
|
+
Focus on the specific failures identified in the results.
|
|
383
133
|
</strategy>
|
|
384
134
|
|
|
385
|
-
<objective>
|
|
386
|
-
Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
|
|
387
|
-
</objective>
|
|
388
|
-
|
|
389
135
|
<files_to_read>
|
|
390
|
-
- .
|
|
391
|
-
- .
|
|
392
|
-
- .
|
|
393
|
-
- .
|
|
394
|
-
- .
|
|
395
|
-
- .harness-evolver/harnesses/{parent_b}/harness.py
|
|
396
|
-
- .harness-evolver/harnesses/{parent_b}/scores.json
|
|
397
|
-
- .harness-evolver/langsmith_diagnosis.json (if exists)
|
|
398
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
399
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
400
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
401
|
-
- .harness-evolver/architecture.json (if exists)
|
|
136
|
+
- .evolver.json
|
|
137
|
+
- trace_insights.json (if exists)
|
|
138
|
+
- production_seed.json (if exists)
|
|
139
|
+
- best_results.json (if exists)
|
|
140
|
+
- {entry point file from .evolver.json}
|
|
402
141
|
</files_to_read>
|
|
403
142
|
|
|
404
|
-
<
|
|
405
|
-
|
|
406
|
-
|
|
407
|
-
|
|
408
|
-
|
|
409
|
-
|
|
410
|
-
|
|
411
|
-
**Also spawn these additional candidates:**
|
|
412
|
-
|
|
413
|
-
**Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
|
|
414
|
-
|
|
415
|
-
If failure clusters were found in step 1.8:
|
|
416
|
-
```
|
|
417
|
-
Agent(
|
|
418
|
-
subagent_type: "harness-evolver-proposer",
|
|
419
|
-
description: "Proposer D: fix {cluster_1_category} failures",
|
|
420
|
-
run_in_background: true,
|
|
421
|
-
prompt: |
|
|
422
|
-
<strategy>
|
|
423
|
-
APPROACH: failure-targeted
|
|
424
|
-
Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
|
|
425
|
-
They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
|
|
426
|
-
Read the traces of these specific tasks to understand WHY they fail.
|
|
427
|
-
Your changes should improve these tasks WITHOUT regressing others.
|
|
428
|
-
You are free to change anything — prompts, code, retrieval, architecture —
|
|
429
|
-
whatever is needed to fix THIS specific failure mode.
|
|
430
|
-
</strategy>
|
|
431
|
-
|
|
432
|
-
<objective>
|
|
433
|
-
Propose harness version {version}d targeting {cluster_1_category} failures.
|
|
434
|
-
</objective>
|
|
435
|
-
|
|
436
|
-
<files_to_read>
|
|
437
|
-
- .harness-evolver/summary.json
|
|
438
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
439
|
-
- .harness-evolver/config.json
|
|
440
|
-
- .harness-evolver/harnesses/{best_version}/harness.py
|
|
441
|
-
- .harness-evolver/harnesses/{best_version}/scores.json
|
|
442
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
443
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
444
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
445
|
-
- .harness-evolver/architecture.json (if exists)
|
|
446
|
-
</files_to_read>
|
|
143
|
+
<context>
|
|
144
|
+
Best experiment: {best_experiment} (score: {best_score})
|
|
145
|
+
Framework: {framework}
|
|
146
|
+
Entry point: {entry_point}
|
|
147
|
+
Evaluators: {evaluators}
|
|
148
|
+
Failing examples: {failing_example_summary}
|
|
149
|
+
</context>
|
|
447
150
|
|
|
448
151
|
<output>
|
|
449
|
-
|
|
450
|
-
|
|
152
|
+
1. Modify the code to improve performance
|
|
153
|
+
2. Commit your changes with a descriptive message
|
|
154
|
+
3. Write proposal.md explaining what you changed and why
|
|
451
155
|
</output>
|
|
452
156
|
)
|
|
453
157
|
```
|
|
454
158
|
|
|
455
|
-
|
|
456
|
-
|
|
457
|
-
Agent(
|
|
458
|
-
subagent_type: "harness-evolver-proposer",
|
|
459
|
-
description: "Proposer D: creative approach",
|
|
460
|
-
run_in_background: true,
|
|
461
|
-
prompt: |
|
|
462
|
-
<strategy>
|
|
463
|
-
APPROACH: creative
|
|
464
|
-
All tasks are scoring well. Try something UNEXPECTED:
|
|
465
|
-
- Different algorithm or library
|
|
466
|
-
- Completely different prompt architecture
|
|
467
|
-
- Novel error handling or output validation
|
|
468
|
-
- Something no one would think of
|
|
469
|
-
The goal is to discover improvements that incremental fixes would miss.
|
|
470
|
-
</strategy>
|
|
471
|
-
...same files_to_read and output as above...
|
|
472
|
-
)
|
|
473
|
-
```
|
|
474
|
-
|
|
475
|
-
**Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
|
|
476
|
-
|
|
477
|
-
If a second failure cluster exists:
|
|
478
|
-
```
|
|
479
|
-
Agent(
|
|
480
|
-
subagent_type: "harness-evolver-proposer",
|
|
481
|
-
description: "Proposer E: fix {cluster_2_category} failures",
|
|
482
|
-
run_in_background: true,
|
|
483
|
-
prompt: |
|
|
484
|
-
<strategy>
|
|
485
|
-
APPROACH: failure-targeted
|
|
486
|
-
Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
|
|
487
|
-
They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
|
|
488
|
-
Read the traces of these specific tasks to understand WHY they fail.
|
|
489
|
-
Your changes should improve these tasks WITHOUT regressing others.
|
|
490
|
-
You are free to change anything — prompts, code, retrieval, architecture —
|
|
491
|
-
whatever is needed to fix THIS specific failure mode.
|
|
492
|
-
</strategy>
|
|
493
|
-
|
|
494
|
-
<objective>
|
|
495
|
-
Propose harness version {version}e targeting {cluster_2_category} failures.
|
|
496
|
-
</objective>
|
|
497
|
-
|
|
498
|
-
<files_to_read>
|
|
499
|
-
- .harness-evolver/summary.json
|
|
500
|
-
- .harness-evolver/PROPOSER_HISTORY.md
|
|
501
|
-
- .harness-evolver/config.json
|
|
502
|
-
- .harness-evolver/harnesses/{best_version}/harness.py
|
|
503
|
-
- .harness-evolver/harnesses/{best_version}/scores.json
|
|
504
|
-
- .harness-evolver/langsmith_runs.json (if exists)
|
|
505
|
-
- .harness-evolver/trace_insights.json (if exists)
|
|
506
|
-
- .harness-evolver/production_seed.json (if exists)
|
|
507
|
-
- .harness-evolver/architecture.json (if exists)
|
|
508
|
-
</files_to_read>
|
|
509
|
-
|
|
510
|
-
<output>
|
|
511
|
-
Create directory .harness-evolver/harnesses/{version}e/ containing:
|
|
512
|
-
- harness.py, config.json, proposal.md
|
|
513
|
-
</output>
|
|
514
|
-
)
|
|
515
|
-
```
|
|
159
|
+
**Candidate B (Explorer)** — `run_in_background: true`:
|
|
160
|
+
Same structure but `APPROACH: exploration` — bold, fundamentally different approach.
|
|
516
161
|
|
|
517
|
-
|
|
518
|
-
|
|
519
|
-
|
|
520
|
-
subagent_type: "harness-evolver-proposer",
|
|
521
|
-
description: "Proposer E: efficiency optimization",
|
|
522
|
-
run_in_background: true,
|
|
523
|
-
prompt: |
|
|
524
|
-
<strategy>
|
|
525
|
-
APPROACH: efficiency
|
|
526
|
-
Maintain the current quality but optimize for:
|
|
527
|
-
- Fewer LLM tokens (shorter prompts, less context)
|
|
528
|
-
- Faster execution (reduce unnecessary steps)
|
|
529
|
-
- Simpler code (remove redundant logic)
|
|
530
|
-
- Better error handling (graceful degradation)
|
|
531
|
-
Do NOT sacrifice accuracy for speed — same quality, less cost.
|
|
532
|
-
</strategy>
|
|
533
|
-
...same files_to_read and output as above...
|
|
534
|
-
)
|
|
535
|
-
```
|
|
162
|
+
**Candidate C (Crossover)** — `run_in_background: true`:
|
|
163
|
+
Same structure but `APPROACH: crossover` — combine strengths from previous iterations.
|
|
164
|
+
Include git log of recent changes so it can see what was tried.
|
|
536
165
|
|
|
537
|
-
|
|
166
|
+
**Candidates D and E (Failure-Targeted)** — `run_in_background: true`:
|
|
167
|
+
Same structure but `APPROACH: failure-targeted` with specific failing example clusters.
|
|
168
|
+
If ALL_PASSING: D gets `creative`, E gets `efficiency`.
|
|
538
169
|
|
|
539
|
-
|
|
170
|
+
Wait for all 5 to complete.
|
|
540
171
|
|
|
541
|
-
|
|
172
|
+
### 3. Evaluate Each Candidate
|
|
542
173
|
|
|
543
|
-
|
|
174
|
+
For each worktree that has changes (proposer committed something):
|
|
544
175
|
|
|
545
|
-
For each candidate (a, b, c, d, e):
|
|
546
176
|
```bash
|
|
547
|
-
python3 $TOOLS/
|
|
177
|
+
python3 $TOOLS/run_eval.py \
|
|
178
|
+
--config .evolver.json \
|
|
179
|
+
--worktree-path {worktree_path} \
|
|
180
|
+
--experiment-prefix v{NNN}{suffix} \
|
|
181
|
+
--timeout 120
|
|
548
182
|
```
|
|
549
183
|
|
|
550
|
-
|
|
184
|
+
Each candidate becomes a separate LangSmith experiment.
|
|
551
185
|
|
|
552
|
-
### 4.
|
|
186
|
+
### 4. Compare All Candidates
|
|
553
187
|
|
|
554
|
-
For each valid candidate:
|
|
555
188
|
```bash
|
|
556
|
-
python3 $TOOLS/
|
|
557
|
-
--
|
|
558
|
-
--config .
|
|
559
|
-
--
|
|
560
|
-
--eval .harness-evolver/eval/eval.py \
|
|
561
|
-
--traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
|
|
562
|
-
--scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
|
|
563
|
-
--timeout 60
|
|
189
|
+
python3 $TOOLS/read_results.py \
|
|
190
|
+
--experiments "v{NNN}a,v{NNN}b,v{NNN}c,v{NNN}d,v{NNN}e" \
|
|
191
|
+
--config .evolver.json \
|
|
192
|
+
--output comparison.json
|
|
564
193
|
```
|
|
565
194
|
|
|
566
|
-
|
|
567
|
-
|
|
568
|
-
|
|
195
|
+
Parse `comparison.json`:
|
|
196
|
+
- `comparison.winner` — highest combined score
|
|
197
|
+
- `comparison.champion` — per-task champion (for next crossover)
|
|
198
|
+
- `comparison.all_candidates` — all scores for reporting
|
|
569
199
|
|
|
-
-
-```
-Agent(
-  subagent_type: "harness-evolver-judge",
-  description: "Judge: score {version}{suffix} outputs",
-  prompt: |
-    <objective>
-    Score the outputs of harness version {version}{suffix} across all {N} tasks.
-    </objective>
+### 5. Merge Winner

-
-    - .harness-evolver/harnesses/{version}{suffix}/scores.json
-    - .harness-evolver/eval/tasks/ (read all task files)
-    </files_to_read>
-
-    <output>
-    Overwrite .harness-evolver/harnesses/{version}{suffix}/scores.json with real scores.
-    </output>
-)
-```
-
-Wait for `## JUDGE COMPLETE`.
-
-If eval_type is NOT "pending-judge", the eval.py already produced real scores — skip this step.
-
-### 5. Select Winner + Track Per-Task Champions
-
-**5a. Find overall winner (highest combined_score):**
-
-Compare all evaluated candidates. The winner is the one with highest combined_score.
-
-**5b. Find per-task champion (candidate that beats the winner on most individual tasks):**
+If the winner scored higher than the current best:

 ```bash
-
-
-
-version = '{version}'
-candidates = {}
-for suffix in ['a', 'b', 'c', 'd', 'e']:
-    path = f'.harness-evolver/harnesses/{version}{suffix}/scores.json'
-    if os.path.exists(path):
-        candidates[suffix] = json.load(open(path))
-
-if not candidates:
-    print('NO_CANDIDATES')
-    exit()
-
-# Overall winner
-winner_suffix = max(candidates, key=lambda s: candidates[s].get('combined_score', 0))
-winner_score = candidates[winner_suffix]['combined_score']
-print(f'WINNER: {winner_suffix} (score: {winner_score:.3f})')
-
-# Per-task champion: which NON-WINNER candidate beats the winner on the most tasks?
-task_wins = {}
-winner_tasks = candidates[winner_suffix].get('per_task', {})
-for suffix, data in candidates.items():
-    if suffix == winner_suffix:
-        continue
-    wins = 0
-    for tid, tdata in data.get('per_task', {}).items():
-        winner_task_score = winner_tasks.get(tid, {}).get('score', 0)
-        if tdata.get('score', 0) > winner_task_score:
-            wins += 1
-    if wins > 0:
-        task_wins[suffix] = wins
-
-if task_wins:
-    champion_suffix = max(task_wins, key=task_wins.get)
-    print(f'PER_TASK_CHAMPION: {champion_suffix} (beats winner on {task_wins[champion_suffix]} tasks)')
-    # Save champion info for next iteration's crossover parent
-    with open('.harness-evolver/per_task_champion.json', 'w') as f:
-        json.dump({'suffix': champion_suffix, 'version': f'{version}{champion_suffix}', 'task_wins': task_wins[champion_suffix]}, f)
-else:
-    print('NO_CHAMPION: winner dominates all tasks')
-" 2>/dev/null
-```
-
-**5c. Promote winner and report ALL candidates:**
+# Get the winning worktree's branch
+WINNER_BRANCH={winning_worktree_branch}

-
-
-mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
+# Merge into main
+git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{suffix} (score: {score})"
 ```

-Update
-```
-
-
-
-
-
+Update `.evolver.json`:
+```python
+import json
+c = json.load(open('.evolver.json'))
+c['best_experiment'] = '{winner_experiment}'
+c['best_score'] = {winner_score}
+c['iterations'] = c['iterations'] + 1
+c['history'].append({
+    'version': 'v{NNN}',
+    'experiment': '{winner_experiment}',
+    'score': {winner_score}
+})
+json.dump(c, open('.evolver.json', 'w'), indent=2)
 ```

-Report ALL candidates
-```
-Iteration {i}/{N} — {num_candidates} candidates evaluated:
-{version}a (exploit): {score_a} — {summary}
-{version}b (explore): {score_b} — {summary}
-{version}c (crossover): {score_c} — {summary}
-{version}d ({strategy_d}): {score_d} — {summary}
-{version}e ({strategy_e}): {score_e} — {summary}
-
-Winner: {version}{suffix} ({score})
-Per-task champion: {champion_suffix} (beats winner on {N} tasks) — saved for next crossover
+Report ALL candidates:
 ```
+Iteration {i}/{N} — 5 candidates evaluated:
+v{NNN}a (exploit): {score_a} — {summary}
+v{NNN}b (explore): {score_b} — {summary}
+v{NNN}c (crossover): {score_c} — {summary}
+v{NNN}d ({strategy}): {score_d} — {summary}
+v{NNN}e ({strategy}): {score_e} — {summary}

-
+Winner: v{NNN}{suffix} ({score}) — merged into main
+Per-task champion: {champion} (beats winner on {N} tasks)
+```

-### 5.5. Test Suite Growth
+### 5.5. Test Suite Growth

-
-Generate regression tasks to lock in improvements and prevent future regressions:
+If previously-failing examples now pass, add regression examples to the dataset:

 ```bash
-
+python3 -c "
+from langsmith import Client
 import json
-s = json.load(open('.harness-evolver/summary.json'))
-versions = s.get('versions', [])
-print(versions[-2]['version'] if len(versions) >= 2 else '')
-" 2>/dev/null)
-if [ -n "$PREV_BEST" ] && [ -f ".harness-evolver/harnesses/$PREV_BEST/scores.json" ]; then
-  python3 $TOOLS/test_growth.py \
-    --current-scores .harness-evolver/harnesses/{version}/scores.json \
-    --previous-scores ".harness-evolver/harnesses/$PREV_BEST/scores.json" \
-    --tasks-dir .harness-evolver/eval/tasks/ \
-    --output-dir .harness-evolver/eval/tasks/ \
-    --max-total-tasks 60 2>/dev/null
-fi
-```

-
+client = Client()
+config = json.load(open('.evolver.json'))

-
-
-
+# Find examples that improved significantly
+# (score went from <0.5 to >0.8 between iterations)
+# Generate variations and add to dataset
+# client.create_examples(dataset_id=config['dataset_id'], examples=[...])
+print('Test suite growth: added N regression examples')
+" 2>/dev/null
+```
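The "improved significantly" filter described in the comments above can be made concrete. A minimal sketch, assuming per-example scores from the previous and current iterations are available as dicts keyed by example id (the function name and the dict inputs are illustrative; the real ids and scores come from the LangSmith experiments):

```python
def regression_candidates(prev_scores, curr_scores, low=0.5, high=0.8):
    """Example ids whose score went from below `low` to above `high`
    between iterations — candidates for new regression examples."""
    return sorted(
        eid for eid, curr in curr_scores.items()
        if curr > high and prev_scores.get(eid, 0.0) < low
    )
```

Only the ids returned here would be passed on to `client.create_examples`.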

 ### 6. Report

-
+Print: `Iteration {i}/{N}: v{NNN} scored {score} (best: {best} at {best_score})`

-### 6.5. Auto-trigger Critic
+### 6.5. Auto-trigger Critic

-
-- Did the score jump >0.3 from parent version?
-- Did we reach 1.0 in fewer than 3 total iterations?
+If score jumped >0.3 from previous iteration OR reached target in <3 iterations:

-
-
-```bash
-python3 $TOOLS/evaluate.py run \
-  --harness .harness-evolver/harnesses/{version}/harness.py \
-  --tasks-dir .harness-evolver/eval/tasks/ \
-  --eval .harness-evolver/eval/eval.py \
-  --traces-dir /tmp/critic-check/ \
-  --scores /tmp/critic-check-scores.json \
-  --timeout 60
-```
-
-Dispatch the critic agent:
+Spawn the critic agent to analyze evaluator quality:

 ```
 Agent(
-  subagent_type: "
-  description: "Critic:
+  subagent_type: "evolver-critic",
+  description: "Critic: check evaluator gaming",
   prompt: |
     <objective>
-    EVAL GAMING DETECTED: Score jumped from {
-
+    EVAL GAMING DETECTED: Score jumped from {prev_score} to {score}.
+    Check if the LangSmith evaluators are being gamed.
     </objective>

     <files_to_read>
-    - .
-    - .
-    - .
-    - .harness-evolver/harnesses/{version}/harness.py
-    - .harness-evolver/harnesses/{version}/proposal.md
-    - .harness-evolver/config.json
-    - .harness-evolver/langsmith_stats.json (if exists)
+    - .evolver.json
+    - comparison.json
+    - trace_insights.json
     </files_to_read>
-
-    <output>
-    Write:
-    - .harness-evolver/critic_report.md
-    - .harness-evolver/eval/eval_improved.py (if weaknesses found)
-    </output>
-
-    <success_criteria>
-    - Identifies specific weaknesses in eval.py with task/output examples
-    - If gaming detected, shows exact tasks that expose the weakness
-    - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
-    - Re-scores the best version with improved eval to show the difference
-    </success_criteria>
 )
 ```
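The trigger condition for this step ("jumped >0.3 from previous iteration OR reached target in <3 iterations") can be sketched as a predicate. A sketch only; the loop supplies the real numbers from `.evolver.json`:

```python
def should_trigger_critic(prev_score, score, iterations, target_score):
    """True when the improvement looks suspicious: a >0.3 single-step
    jump, or hitting the target score in fewer than 3 iterations."""
    jumped = (score - prev_score) > 0.3
    too_fast = score >= target_score and iterations < 3
    return jumped or too_fast
```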

-
-
-If critic wrote `eval_improved.py`:
-- Re-score the best harness with the improved eval
-- Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
-- **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
-- Re-run baseline with new eval and update `summary.json`
-- Print: "Eval upgraded. Resuming evolution with stricter eval."
-- **Continue the loop** with the new eval
-
-If critic did NOT write `eval_improved.py` (eval is fine):
-- Print the critic's assessment
-- Continue the loop normally
-
-### 7. Auto-trigger Architect (on stagnation or regression)
-
-Check if the architect should be auto-spawned:
-- **Stagnation**: 3 consecutive iterations within 1% of each other
-- **Regression**: score dropped below parent score (even once)
+### 7. Auto-trigger Architect

-
-
-If triggered:
-
-```bash
-python3 $TOOLS/analyze_architecture.py \
-  --harness .harness-evolver/harnesses/{best_version}/harness.py \
-  --traces-dir .harness-evolver/harnesses/{best_version}/traces \
-  --summary .harness-evolver/summary.json \
-  -o .harness-evolver/architecture_signals.json
-```
-
-Dispatch the architect agent:
+If 3 consecutive iterations within 1% OR score dropped:

 ```
 Agent(
-  subagent_type: "
-  description: "Architect:
+  subagent_type: "evolver-architect",
+  description: "Architect: recommend topology change",
   prompt: |
     <objective>
-    The evolution loop has
-    Analyze the
+    The evolution loop has stagnated after {iterations} iterations.
+    Analyze the architecture and recommend changes.
     </objective>

     <files_to_read>
-    - .
-    - .
-    -
-    - .harness-evolver/config.json
-    - .harness-evolver/harnesses/{best_version}/harness.py
-    - .harness-evolver/harnesses/{best_version}/scores.json
-    - .harness-evolver/context7_docs.md (if exists)
+    - .evolver.json
+    - trace_insights.json
+    - {entry point and related source files}
     </files_to_read>
-
-    <output>
-    Write:
-    - .harness-evolver/architecture.json (structured recommendation)
-    - .harness-evolver/architecture.md (human-readable analysis)
-    </output>
-
-    <success_criteria>
-    - Recommendation includes concrete migration steps
-    - Each step is implementable in one proposer iteration
-    - Considers detected stack and available API keys
-    </success_criteria>
 )
 ```

-Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
-
-Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`
-
-Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.
-
 ### 8. Check Stop Conditions

-- **Target**: `
+- **Target**: `score >= target_score` → stop
 - **N reached**: done
-- **Stagnation post-architect**: 3 more iterations without improvement
+- **Stagnation post-architect**: 3 more iterations without improvement → stop
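The architect trigger and the stop conditions above both key off the same score history. A minimal sketch of the check, assuming `history` is the list of `{'version', 'experiment', 'score'}` entries appended to `.evolver.json` earlier (function name, `window`, and `tol` are illustrative; `tol=0.01` mirrors the "within 1%" stagnation rule):

```python
def should_stop(history, target_score, max_iterations, window=3, tol=0.01):
    """Return a stop reason, or None to keep iterating: target reached,
    iteration budget spent, or the last `window` scores within `tol`."""
    scores = [h['score'] for h in history]
    if scores and scores[-1] >= target_score:
        return 'target'
    if len(scores) >= max_iterations:
        return 'budget'
    if len(scores) >= window and max(scores[-window:]) - min(scores[-window:]) <= tol:
        return 'stagnation'
    return None
```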

 ## When Loop Ends — Final Report

 - Best version and score
 - Improvement over baseline (absolute and %)
 - Total iterations run
--
--
-- Suggest:
+- Key changes made (git log from baseline to current)
+- LangSmith experiment URLs for comparison
+- Suggest: `/evolver:deploy` to finalize
|