harness-evolver 2.9.1 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/README.md +62 -117
  2. package/agents/evolver-architect.md +53 -0
  3. package/agents/evolver-critic.md +44 -0
  4. package/agents/evolver-proposer.md +128 -0
  5. package/agents/evolver-testgen.md +67 -0
  6. package/bin/install.js +181 -171
  7. package/package.json +7 -7
  8. package/skills/deploy/SKILL.md +49 -56
  9. package/skills/evolve/SKILL.md +156 -687
  10. package/skills/setup/SKILL.md +182 -0
  11. package/skills/status/SKILL.md +23 -21
  12. package/tools/read_results.py +240 -0
  13. package/tools/run_eval.py +202 -0
  14. package/tools/seed_from_traces.py +36 -8
  15. package/tools/setup.py +393 -0
  16. package/tools/trace_insights.py +86 -14
  17. package/agents/harness-evolver-architect.md +0 -173
  18. package/agents/harness-evolver-critic.md +0 -132
  19. package/agents/harness-evolver-judge.md +0 -110
  20. package/agents/harness-evolver-proposer.md +0 -317
  21. package/agents/harness-evolver-testgen.md +0 -112
  22. package/examples/classifier/README.md +0 -25
  23. package/examples/classifier/config.json +0 -3
  24. package/examples/classifier/eval.py +0 -58
  25. package/examples/classifier/harness.py +0 -111
  26. package/examples/classifier/tasks/task_001.json +0 -1
  27. package/examples/classifier/tasks/task_002.json +0 -1
  28. package/examples/classifier/tasks/task_003.json +0 -1
  29. package/examples/classifier/tasks/task_004.json +0 -1
  30. package/examples/classifier/tasks/task_005.json +0 -1
  31. package/examples/classifier/tasks/task_006.json +0 -1
  32. package/examples/classifier/tasks/task_007.json +0 -1
  33. package/examples/classifier/tasks/task_008.json +0 -1
  34. package/examples/classifier/tasks/task_009.json +0 -1
  35. package/examples/classifier/tasks/task_010.json +0 -1
  36. package/skills/architect/SKILL.md +0 -93
  37. package/skills/compare/SKILL.md +0 -73
  38. package/skills/critic/SKILL.md +0 -67
  39. package/skills/diagnose/SKILL.md +0 -96
  40. package/skills/import-traces/SKILL.md +0 -102
  41. package/skills/init/SKILL.md +0 -293
  42. package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
  43. package/tools/__pycache__/init.cpython-313.pyc +0 -0
  44. package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
  45. package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
  46. package/tools/eval_llm_judge.py +0 -233
  47. package/tools/eval_passthrough.py +0 -55
  48. package/tools/evaluate.py +0 -255
  49. package/tools/import_traces.py +0 -229
  50. package/tools/init.py +0 -531
  51. package/tools/llm_api.py +0 -125
  52. package/tools/state.py +0 -219
  53. package/tools/test_growth.py +0 -230
  54. package/tools/trace_logger.py +0 -42
@@ -1,34 +1,31 @@
  ---
- name: harness-evolver:evolve
- description: "Use when the user wants to run the optimization loop, improve harness performance, evolve the harness, or iterate on harness quality. Requires .harness-evolver/ to exist (run harness-evolver:init first)."
+ name: evolver:evolve
+ description: "Use when the user wants to run the optimization loop, improve agent performance, evolve the agent, or iterate on quality. Requires .evolver.json to exist (run evolver:setup first)."
  argument-hint: "[--iterations N]"
  allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
  ---
 
- # /harness-evolver:evolve
+ # /evolver:evolve
 
- Run the autonomous propose-evaluate-iterate loop.
+ Run the autonomous propose-evaluate-iterate loop using LangSmith as the evaluation backend and git worktrees for isolation.
 
  ## Prerequisites
 
- `.harness-evolver/summary.json` must exist. If not, tell user to run `harness-evolver:init`.
+ `.evolver.json` must exist. If not, tell user to run `evolver:setup`.
 
  ## Resolve Tool Path
 
  ```bash
- TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")
  ```
 
  ## Parse Arguments
 
- - `--iterations N` (default: 10)
- - Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
+ - `--iterations N` (default: from interactive question or 5)
 
  ## Pre-Loop: Interactive Configuration
 
- If no `--iterations` argument was provided, ask the user interactively:
-
- Use AskUserQuestion with TWO questions in a single call (simple single-select, no preview needed):
+ If no `--iterations` argument was provided, ask the user:
 
  ```json
  {
@@ -38,7 +35,7 @@ Use AskUserQuestion with TWO questions in a single call (simple single-select, n
      "header": "Iterations",
      "multiSelect": false,
      "options": [
-       {"label": "3 (quick)", "description": "Fast exploration, good for testing setup. ~15 min."},
+       {"label": "3 (quick)", "description": "Fast exploration, good for testing. ~15 min."},
        {"label": "5 (balanced)", "description": "Good trade-off between speed and quality. ~30 min."},
        {"label": "10 (thorough)", "description": "Deep optimization with adaptive strategies. ~1 hour."}
      ]
@@ -48,7 +45,7 @@ Use AskUserQuestion with TWO questions in a single call (simple single-select, n
      "header": "Target",
      "multiSelect": false,
      "options": [
-       {"label": "0.8 (good enough)", "description": "Stop when the harness is reasonably good"},
+       {"label": "0.8 (good enough)", "description": "Stop when the agent is reasonably good"},
        {"label": "0.9 (high quality)", "description": "Stop when quality is high"},
        {"label": "0.95 (near perfect)", "description": "Push for near-perfect scores"},
        {"label": "No limit", "description": "Run all iterations regardless of score"}
@@ -58,797 +55,269 @@ Use AskUserQuestion with TWO questions in a single call (simple single-select, n
  }
  ```
 
- Apply the answers:
- - Set iterations from question 1 (3, 5, or 10)
- - Set target_score from question 2 (0.8, 0.9, 0.95, or None)
-
- If `--iterations` WAS provided as argument, skip these questions and use the argument value.
-
  ## The Loop
 
- For each iteration:
-
- ### 1. Get Next Version
-
+ Read config:
  ```bash
- python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
+ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Best: {c[\"best_experiment\"]} ({c[\"best_score\"]:.3f}), Iterations: {c[\"iterations\"]}')"
  ```
 
- ### 1.4. Gather Production Insights (first iteration only)
-
- On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
-
- ```bash
- PROD_PROJECT=$(python3 -c "
- import json, os
- c = json.load(open('.harness-evolver/config.json'))
- print(c.get('eval', {}).get('production_project', ''))
- " 2>/dev/null)
- if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
-   python3 $TOOLS/seed_from_traces.py \
-     --project "$PROD_PROJECT" \
-     --output-md .harness-evolver/production_seed.md \
-     --output-json .harness-evolver/production_seed.json \
-     --limit 100 2>/dev/null
- fi
- ```
-
- The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
-
- ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
-
- **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
-
- **Step 1: Find the actual LangSmith project name**
-
- ```bash
- langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
- ```
-
- This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
-
- ```bash
- LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
- ```
-
- If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
+ For each iteration:
 
- **Step 2: Gather raw traces from the discovered project**
+ ### 1. Get Next Version
 
  ```bash
- if [ -n "$LS_PROJECT" ]; then
-   langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
-   langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
-   echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
- else
-   echo "[]" > /tmp/langsmith_raw.json
-   echo "{}" > .harness-evolver/langsmith_stats.json
- fi
+ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"iterations\"]+1:03d}')"
  ```
 
- **Step 3: Process raw LangSmith data into a readable format for proposers**
+ ### 1.5. Gather Trace Insights
 
- The raw langsmith data has LangChain-serialized messages that are hard to read. Process it into a clean summary:
+ Run trace insights from the best experiment:
 
  ```bash
- python3 -c "
- import json, sys
-
- raw = json.load(open('/tmp/langsmith_raw.json'))
- if not raw:
-     json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
-     sys.exit(0)
-
- clean = []
- for r in raw:
-     entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
-
-     # Extract readable prompt from LangChain serialized inputs
-     inputs = r.get('inputs', {})
-     if isinstance(inputs, dict) and 'messages' in inputs:
-         msgs = inputs['messages']
-         for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
-             for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
-                 if isinstance(msg, dict):
-                     kwargs = msg.get('kwargs', msg)
-                     content = kwargs.get('content', '')
-                     msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
-                     if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
-                         entry['user_message'] = str(content)[:300]
-                     elif 'System' in str(msg_type):
-                         entry['system_prompt_preview'] = str(content)[:200]
-
-     # Extract readable output
-     outputs = r.get('outputs', {})
-     if isinstance(outputs, dict) and 'generations' in outputs:
-         gens = outputs['generations']
-         if gens and isinstance(gens, list) and gens[0]:
-             gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
-             if isinstance(gen, dict):
-                 msg = gen.get('message', gen)
-                 if isinstance(msg, dict):
-                     kwargs = msg.get('kwargs', msg)
-                     entry['llm_response'] = str(kwargs.get('content', ''))[:300]
-
-     clean.append(entry)
-
- json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
- print(f'Processed {len(clean)} LangSmith runs into readable format')
- " 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
- ```
-
- The resulting `langsmith_runs.json` has clean, readable entries:
- ```json
- [
-   {
-     "name": "ChatGoogleGenerativeAI",
-     "tokens": 1332,
-     "error": null,
-     "user_message": "Analise este texto: Bom dia pessoal...",
-     "system_prompt_preview": "Você é um moderador de conteúdo...",
-     "llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
-   }
- ]
+ BEST=$(python3 -c "import json; print(json.load(open('.evolver.json'))['best_experiment'])")
+ python3 $TOOLS/trace_insights.py \
+   --from-experiment "$BEST" \
+   --output trace_insights.json 2>/dev/null
  ```
 
- These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
-
- ### 1.6. Generate Trace Insights (systematic analysis)
-
- If LangSmith traces were gathered, run systematic analysis to cluster errors, analyze token usage, and cross-reference with scores:
+ If a production project is configured, also gather production insights:
 
  ```bash
- if [ -f ".harness-evolver/langsmith_runs.json" ]; then
-   BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s['best']['version'])")
-   SCORES_PATH=".harness-evolver/harnesses/$BEST/scores.json"
-   [ ! -f "$SCORES_PATH" ] && SCORES_PATH=".harness-evolver/baseline/scores.json"
-   python3 $TOOLS/trace_insights.py \
-     --langsmith-runs .harness-evolver/langsmith_runs.json \
-     --langsmith-stats .harness-evolver/langsmith_stats.json \
-     --scores "$SCORES_PATH" \
-     --tasks-dir .harness-evolver/eval/tasks/ \
-     --output .harness-evolver/trace_insights.json 2>/dev/null
+ PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
+ if [ -n "$PROD" ] && [ ! -f "production_seed.json" ]; then
+   python3 $TOOLS/seed_from_traces.py \
+     --project "$PROD" --use-sdk \
+     --output-md production_seed.md \
+     --output-json production_seed.json \
+     --limit 100 2>/dev/null
  fi
  ```
 
- The resulting `trace_insights.json` contains:
- - `error_clusters`: grouped error patterns with counts
- - `token_analysis`: score distribution by token usage bucket (low/medium/high)
- - `hypotheses`: data-driven theories about failure causes
- - `top_issues`: highest-impact problems sorted by severity
+ ### 1.8. Analyze Per-Task Failures
 
- This file is included in all proposers' `<files_to_read>` so they have structured diagnostic data.
-
- ### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
-
- Before spawning proposers, analyze which tasks are failing and cluster them:
+ Read the best experiment results and cluster failures:
 
  ```bash
- python3 -c "
- import json, os, sys
-
- # Find best version scores
- summary = json.load(open('.harness-evolver/summary.json'))
- best = summary['best']['version']
- scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
- if not os.path.exists(scores_path):
-     scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
-
- if not scores_path or not os.path.exists(scores_path):
-     print('NO_SCORES')
-     sys.exit(0)
-
- scores = json.load(open(scores_path))
- tasks_dir = '.harness-evolver/eval/tasks/'
- failures = {}
-
- for tid, tdata in scores.get('per_task', {}).items():
-     score = tdata.get('score', 0)
-     if score < 0.7:
-         tfile = os.path.join(tasks_dir, tid + '.json')
-         cat = 'unknown'
-         if os.path.exists(tfile):
-             task = json.load(open(tfile))
-             meta = task.get('metadata', {})
-             cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
-         failures.setdefault(cat, []).append({'id': tid, 'score': score})
-
- if not failures:
-     print('ALL_PASSING')
- else:
-     sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
-     for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
-         task_ids = [t['id'] for t in tasks]
-         avg_score = sum(t['score'] for t in tasks) / len(tasks)
-         print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
- " 2>/dev/null
+ python3 $TOOLS/read_results.py \
+   --experiment "$BEST" \
+   --config .evolver.json \
+   --output best_results.json 2>/dev/null
  ```
 
- Parse the output:
- - If `NO_SCORES` or `ALL_PASSING`: D gets "creative" brief, E gets "efficiency" brief
- - If clusters found: D targets cluster 1, E targets cluster 2
- - If only 1 cluster: D targets it, E gets "creative" brief
-
- Save clusters for use in step 2.
-
- ### 2. Propose (3 parallel candidates)
+ Parse `best_results.json` to find failing examples (score < 0.7). Group by metadata or error pattern.
+ Generate adaptive briefings for Candidates D and E (same logic as v2).
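The grouping step can be sketched as follows (a hedged illustration: the flat `{"example_id", "score", "metadata"}` record shape is an assumption about `best_results.json`, not the tool's documented schema):

```python
# Sketch: cluster failing examples by a metadata category so the largest
# cluster can brief Candidate D and the second-largest Candidate E.
from collections import defaultdict

def cluster_failures(results, threshold=0.7):
    """Group example IDs whose score falls below threshold, largest group first."""
    clusters = defaultdict(list)
    for r in results:
        if r.get("score", 0) < threshold:
            cat = r.get("metadata", {}).get("category", "unknown")
            clusters[cat].append(r["example_id"])
    return sorted(clusters.items(), key=lambda kv: -len(kv[1]))

sample = [
    {"example_id": "ex1", "score": 0.4, "metadata": {"category": "parsing"}},
    {"example_id": "ex2", "score": 0.9, "metadata": {"category": "parsing"}},
    {"example_id": "ex3", "score": 0.2, "metadata": {"category": "retrieval"}},
    {"example_id": "ex4", "score": 0.5, "metadata": {"category": "parsing"}},
]
print(cluster_failures(sample))  # → [('parsing', ['ex1', 'ex4']), ('retrieval', ['ex3'])]
```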
 
- Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
- This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
+ ### 2. Spawn 5 Proposers in Parallel
 
- Determine parents for each strategy:
- - **Exploiter parent**: current best version (from summary.json `best.version`)
- - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
- - **Crossover parents**:
-   - Parent A = current best version
-   - Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
-     If no champion file exists, fall back to a non-best version from the archive.
+ Each proposer runs in a **git worktree** via Claude Code's native `isolation: "worktree"` parameter.
 
- Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
+ **Candidate A (Exploit)** `run_in_background: true`:
 
- **Candidate A (Exploiter)** — `run_in_background: true`:
  ```
  Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer A (exploit): targeted fix for {version}",
+   subagent_type: "evolver-proposer",
+   description: "Proposer A: exploit best version",
+   isolation: "worktree",
    run_in_background: true,
    prompt: |
-     <strategy>
-     APPROACH: exploitation
-     You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
-     the highest-impact failing tasks. Base your work on the current best version.
-     Do NOT restructure the code. Do NOT change the architecture.
-     Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
-     </strategy>
-
      <objective>
-     Propose harness version {version}a that improves on {best_score}.
+     Improve the agent code to score higher on the evaluation dataset.
+     You are working in an isolated git worktree — modify any file freely.
      </objective>
 
-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/harnesses/{best_version}/proposal.md
-     - .harness-evolver/langsmith_diagnosis.json (if exists)
-     - .harness-evolver/langsmith_stats.json (if exists)
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
-
-     <output>
-     Create directory .harness-evolver/harnesses/{version}a/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
-
- **Candidate B (Explorer)** — `run_in_background: true`:
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer B (explore): bold change from {explorer_parent}",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: exploration
-     You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
-     Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
-     Consider: different retrieval strategy, different prompt structure,
-     different output parsing, different error handling philosophy.
-     Be bold. A creative failure teaches more than a timid success.
-     </strategy>
-
-     <objective>
-     Propose harness version {version}b that takes a different approach.
-     </objective>
-
-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/baseline/harness.py
-     - .harness-evolver/harnesses/{explorer_parent}/harness.py
-     - .harness-evolver/harnesses/{explorer_parent}/scores.json
-     - .harness-evolver/langsmith_diagnosis.json (if exists)
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
-
-     <output>
-     Create directory .harness-evolver/harnesses/{version}b/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
-
- **Candidate C (Crossover)** — blocks (last one):
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
-   prompt: |
      <strategy>
-     APPROACH: crossover
-     You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
-     - {parent_a} (score: {score_a}): {summary of what it does well}
-     - {parent_b} (score: {score_b}): {summary of what it does well}
-     Take the best elements from each and merge them into a single harness.
+     APPROACH: exploitation
+     Make targeted improvements to the current best version.
+     Focus on the specific failures identified in the results.
      </strategy>
 
-     <objective>
-     Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
-     </objective>
-
      <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{parent_a}/harness.py
-     - .harness-evolver/harnesses/{parent_a}/scores.json
-     - .harness-evolver/harnesses/{parent_b}/harness.py
-     - .harness-evolver/harnesses/{parent_b}/scores.json
-     - .harness-evolver/langsmith_diagnosis.json (if exists)
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
+     - .evolver.json
+     - trace_insights.json (if exists)
+     - production_seed.json (if exists)
+     - best_results.json (if exists)
+     - {entry point file from .evolver.json}
      </files_to_read>
 
-     <output>
-     Create directory .harness-evolver/harnesses/{version}c/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
-
- **Also spawn these additional candidates:**
-
- **Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
-
- If failure clusters were found in step 1.8:
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer D: fix {cluster_1_category} failures",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: failure-targeted
-     Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
-     They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
-     Read the traces of these specific tasks to understand WHY they fail.
-     Your changes should improve these tasks WITHOUT regressing others.
-     You are free to change anything — prompts, code, retrieval, architecture —
-     whatever is needed to fix THIS specific failure mode.
-     </strategy>
-
-     <objective>
-     Propose harness version {version}d targeting {cluster_1_category} failures.
-     </objective>
-
-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
+     <context>
+     Best experiment: {best_experiment} (score: {best_score})
+     Framework: {framework}
+     Entry point: {entry_point}
+     Evaluators: {evaluators}
+     Failing examples: {failing_example_summary}
+     </context>
 
      <output>
-     Create directory .harness-evolver/harnesses/{version}d/ containing:
-     - harness.py, config.json, proposal.md
+     1. Modify the code to improve performance
+     2. Commit your changes with a descriptive message
+     3. Write proposal.md explaining what you changed and why
      </output>
  )
  ```
 
- If ALL_PASSING (no failures):
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer D: creative approach",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: creative
-     All tasks are scoring well. Try something UNEXPECTED:
-     - Different algorithm or library
-     - Completely different prompt architecture
-     - Novel error handling or output validation
-     - Something no one would think of
-     The goal is to discover improvements that incremental fixes would miss.
-     </strategy>
-     ...same files_to_read and output as above...
- )
- ```
-
- **Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
-
- If a second failure cluster exists:
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer E: fix {cluster_2_category} failures",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: failure-targeted
-     Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
-     They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
-     Read the traces of these specific tasks to understand WHY they fail.
-     Your changes should improve these tasks WITHOUT regressing others.
-     You are free to change anything — prompts, code, retrieval, architecture —
-     whatever is needed to fix THIS specific failure mode.
-     </strategy>
-
-     <objective>
-     Propose harness version {version}e targeting {cluster_2_category} failures.
-     </objective>
-
-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
-
-     <output>
-     Create directory .harness-evolver/harnesses/{version}e/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
+ **Candidate B (Explorer)** — `run_in_background: true`:
+ Same structure but `APPROACH: exploration` — bold, fundamentally different approach.
 
- If no second cluster (or ALL_PASSING):
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer E: efficiency optimization",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: efficiency
-     Maintain the current quality but optimize for:
-     - Fewer LLM tokens (shorter prompts, less context)
-     - Faster execution (reduce unnecessary steps)
-     - Simpler code (remove redundant logic)
-     - Better error handling (graceful degradation)
-     Do NOT sacrifice accuracy for speed — same quality, less cost.
-     </strategy>
-     ...same files_to_read and output as above...
- )
- ```
+ **Candidate C (Crossover)** `run_in_background: true`:
+ Same structure but `APPROACH: crossover` — combine strengths from previous iterations.
+ Include git log of recent changes so it can see what was tried.
 
- Wait for all 5 to complete. The background agents will notify when done.
+ **Candidates D and E (Failure-Targeted)** `run_in_background: true`:
+ Same structure but `APPROACH: failure-targeted` with specific failing example clusters.
+ If ALL_PASSING: D gets `creative`, E gets `efficiency`.
 
- **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses baseline as both parents but with instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.
+ Wait for all 5 to complete.
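The briefing assignment for D and E can be sketched as (a hedged illustration of the fallback rules stated above; the function and label strings are mine, not the package's):

```python
# Sketch: map failure clusters to proposer briefings. Clusters drive D and E;
# with one cluster E falls back to "creative"; with none, D/E get
# "creative"/"efficiency", as the skill describes.
def briefings(clusters):
    """clusters: list of (category, example_ids) pairs, largest first; may be empty."""
    briefs = {"A": "exploitation", "B": "exploration", "C": "crossover"}
    if not clusters:  # ALL_PASSING
        briefs["D"], briefs["E"] = "creative", "efficiency"
    elif len(clusters) == 1:
        briefs["D"] = f"failure-targeted: {clusters[0][0]}"
        briefs["E"] = "creative"
    else:
        briefs["D"] = f"failure-targeted: {clusters[0][0]}"
        briefs["E"] = f"failure-targeted: {clusters[1][0]}"
    return briefs

print(briefings([("parsing", ["ex1", "ex4"]), ("retrieval", ["ex3"])]))
```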
 
- **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, step 1.8 will naturally shift D and E toward failure-targeted or creative strategies based on actual task performance.
+ ### 3. Evaluate Each Candidate
 
- ### 3. Validate All Candidates
+ For each worktree that has changes (proposer committed something):
 
- For each candidate (a, b, c, d, e):
  ```bash
- python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
+ python3 $TOOLS/run_eval.py \
+   --config .evolver.json \
+   --worktree-path {worktree_path} \
+   --experiment-prefix v{NNN}{suffix} \
+   --timeout 120
  ```
 
- Remove any that fail validation.
+ Each candidate becomes a separate LangSmith experiment.
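Iterating that command over the five candidates can be sketched as a dry run (hedged: the `.worktrees/` layout, version number, and tools path here are hypothetical; real worktree paths come from the Agent tool):

```python
# Sketch: build the run_eval.py command line for each candidate worktree.
# Nothing is executed; this only shows the per-candidate invocation shape.
def eval_commands(nnn, tools="$HOME/.evolver/tools"):
    cmds = []
    for suffix in "abcde":
        cmds.append([
            "python3", f"{tools}/run_eval.py",
            "--config", ".evolver.json",
            "--worktree-path", f".worktrees/v{nnn}{suffix}",  # hypothetical layout
            "--experiment-prefix", f"v{nnn}{suffix}",
            "--timeout", "120",
        ])
    return cmds

for cmd in eval_commands("004"):
    print(" ".join(cmd))
```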
551
185
 
552
- ### 4. Evaluate All Candidates
186
+ ### 4. Compare All Candidates
553
187
 
554
- For each valid candidate:
555
188
  ```bash
556
- python3 $TOOLS/evaluate.py run \
557
- --harness .harness-evolver/harnesses/{version}{suffix}/harness.py \
558
- --config .harness-evolver/harnesses/{version}{suffix}/config.json \
559
- --tasks-dir .harness-evolver/eval/tasks/ \
560
- --eval .harness-evolver/eval/eval.py \
561
- --traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
562
- --scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
563
- --timeout 60
189
+ python3 $TOOLS/read_results.py \
190
+ --experiments "v{NNN}a,v{NNN}b,v{NNN}c,v{NNN}d,v{NNN}e" \
191
+ --config .evolver.json \
192
+ --output comparison.json
564
193
  ```
565
194
 
566
- ### 4.5. Judge (if eval returned pending scores)
567
-
568
- For each evaluated candidate, read its scores.json. If `eval_type` is `"pending-judge"` (combined_score == -1), the eval was a passthrough and needs judge scoring.
195
+ Parse `comparison.json`:
196
+ - `comparison.winner` — highest combined score
197
+ - `comparison.champion` — per-task champion (for next crossover)
198
+ - `comparison.all_candidates` — all scores for reporting
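
A minimal sketch of consuming these fields, assuming `all_candidates` maps candidate names to objects carrying a `combined_score` (the exact schema is whatever `read_results.py` emits):

```python
import json

def pick_winner(comparison):
    """Return (winner, champion, all_candidates) from comparison.json data.

    Falls back to recomputing the winner from all_candidates when the
    winner field is missing.
    """
    candidates = comparison.get("all_candidates", {})
    winner = comparison.get("winner")
    if winner is None and candidates:
        winner = max(candidates,
                     key=lambda k: candidates[k].get("combined_score", 0.0))
    return winner, comparison.get("champion"), candidates

# comparison = json.load(open("comparison.json"))
# winner, champion, scores = pick_winner(comparison)
```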
569
199
 
570
- Spawn judge subagent with `subagent_type: "harness-evolver-judge"` for EACH candidate that needs judging:
571
-
572
- ```
573
- Agent(
574
- subagent_type: "harness-evolver-judge",
575
- description: "Judge: score {version}{suffix} outputs",
576
- prompt: |
577
- <objective>
578
- Score the outputs of harness version {version}{suffix} across all {N} tasks.
579
- </objective>
200
+ ### 5. Merge Winner
580
201
 
581
- <files_to_read>
582
- - .harness-evolver/harnesses/{version}{suffix}/scores.json
583
- - .harness-evolver/eval/tasks/ (read all task files)
584
- </files_to_read>
585
-
586
- <output>
587
- Overwrite .harness-evolver/harnesses/{version}{suffix}/scores.json with real scores.
588
- </output>
589
- )
590
- ```
591
-
592
- Wait for `## JUDGE COMPLETE`.
593
-
594
- If eval_type is NOT "pending-judge", the eval.py already produced real scores — skip this step.
595
-
596
- ### 5. Select Winner + Track Per-Task Champions
597
-
598
- **5a. Find overall winner (highest combined_score):**
599
-
600
- Compare all evaluated candidates. The winner is the one with highest combined_score.
601
-
602
- **5b. Find per-task champion (candidate that beats the winner on most individual tasks):**
202
+ If the winner scored higher than the current best:
603
203
 
604
204
  ```bash
605
- python3 -c "
606
- import json, os
607
-
608
- version = '{version}'
609
- candidates = {}
610
- for suffix in ['a', 'b', 'c', 'd', 'e']:
611
- path = f'.harness-evolver/harnesses/{version}{suffix}/scores.json'
612
- if os.path.exists(path):
613
- candidates[suffix] = json.load(open(path))
614
-
615
- if not candidates:
616
- print('NO_CANDIDATES')
617
- exit()
618
-
619
- # Overall winner
620
- winner_suffix = max(candidates, key=lambda s: candidates[s].get('combined_score', 0))
621
- winner_score = candidates[winner_suffix]['combined_score']
622
- print(f'WINNER: {winner_suffix} (score: {winner_score:.3f})')
623
-
624
- # Per-task champion: which NON-WINNER candidate beats the winner on the most tasks?
625
- task_wins = {}
626
- winner_tasks = candidates[winner_suffix].get('per_task', {})
627
- for suffix, data in candidates.items():
628
- if suffix == winner_suffix:
629
- continue
630
- wins = 0
631
- for tid, tdata in data.get('per_task', {}).items():
632
- winner_task_score = winner_tasks.get(tid, {}).get('score', 0)
633
- if tdata.get('score', 0) > winner_task_score:
634
- wins += 1
635
- if wins > 0:
636
- task_wins[suffix] = wins
637
-
638
- if task_wins:
639
- champion_suffix = max(task_wins, key=task_wins.get)
640
- print(f'PER_TASK_CHAMPION: {champion_suffix} (beats winner on {task_wins[champion_suffix]} tasks)')
641
- # Save champion info for next iteration's crossover parent
642
- with open('.harness-evolver/per_task_champion.json', 'w') as f:
643
- json.dump({'suffix': champion_suffix, 'version': f'{version}{champion_suffix}', 'task_wins': task_wins[champion_suffix]}, f)
644
- else:
645
- print('NO_CHAMPION: winner dominates all tasks')
646
- " 2>/dev/null
647
- ```
648
-
649
- **5c. Promote winner and report ALL candidates:**
205
+ # Get the winning worktree's branch
206
+ WINNER_BRANCH={winning_worktree_branch}
650
207
 
651
- Rename winner directory to official version:
652
- ```bash
653
- mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
208
+ # Merge into main
209
+ git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{suffix} (score: {score})"
654
210
  ```
655
211
 
656
- Update state:
657
- ```bash
658
- python3 $TOOLS/state.py update \
659
- --base-dir .harness-evolver \
660
- --version {version} \
661
- --scores .harness-evolver/harnesses/{version}/scores.json \
662
- --proposal .harness-evolver/harnesses/{version}/proposal.md
212
+ Update `.evolver.json`:
213
+ ```python
214
+ import json
215
+ c = json.load(open('.evolver.json'))
216
+ c['best_experiment'] = '{winner_experiment}'
217
+ c['best_score'] = {winner_score}
218
+ c['iterations'] = c['iterations'] + 1
219
+ c['history'].append({
220
+ 'version': 'v{NNN}',
221
+ 'experiment': '{winner_experiment}',
222
+ 'score': {winner_score}
223
+ })
224
+ json.dump(c, open('.evolver.json', 'w'), indent=2)
663
225
  ```
664
226
 
665
- Report ALL candidates with their scores and strategies:
666
- ```
667
- Iteration {i}/{N} — {num_candidates} candidates evaluated:
668
- {version}a (exploit): {score_a} — {summary}
669
- {version}b (explore): {score_b} — {summary}
670
- {version}c (crossover): {score_c} — {summary}
671
- {version}d ({strategy_d}): {score_d} — {summary}
672
- {version}e ({strategy_e}): {score_e} — {summary}
673
-
674
- Winner: {version}{suffix} ({score})
675
- Per-task champion: {champion_suffix} (beats winner on {N} tasks) — saved for next crossover
227
+ Report ALL candidates:
676
228
  ```
229
+ Iteration {i}/{N} — 5 candidates evaluated:
230
+ v{NNN}a (exploit): {score_a} — {summary}
231
+ v{NNN}b (explore): {score_b} — {summary}
232
+ v{NNN}c (crossover): {score_c} — {summary}
233
+ v{NNN}d ({strategy_d}): {score_d} — {summary}
234
+ v{NNN}e ({strategy_e}): {score_e} — {summary}
677
235
 
678
- Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
236
+ Winner: v{NNN}{suffix} ({score}) — merged into main
237
+ Per-task champion: {champion} (beats winner on {N} tasks)
238
+ ```
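
The per-task champion in the report can be derived from per-candidate, per-task scores. A sketch of the selection logic (the `per_task` score shape is an assumption carried over from earlier versions of this skill):

```python
def per_task_champion(candidates, winner):
    """Find the non-winner candidate that beats the winner on the most
    individual tasks.

    candidates: {name: {"per_task": {task_id: {"score": float}}}}.
    Returns (champion, task_wins), or (None, 0) if the winner dominates.
    """
    winner_tasks = candidates[winner].get("per_task", {})
    wins = {}
    for name, data in candidates.items():
        if name == winner:
            continue
        n = sum(1 for tid, t in data.get("per_task", {}).items()
                if t.get("score", 0) > winner_tasks.get(tid, {}).get("score", 0))
        if n:
            wins[name] = n
    if not wins:
        return None, 0
    best = max(wins, key=wins.get)
    return best, wins[best]
```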
679
239
 
680
- ### 5.5. Test Suite Growth (Durable Regression Gates)
240
+ ### 5.5. Test Suite Growth
681
241
 
682
- After the winner is promoted, check if any previously-failing tasks are now passing.
683
- Generate regression tasks to lock in improvements and prevent future regressions:
242
+ If previously-failing examples now pass, add regression examples to the dataset:
684
243
 
685
244
  ```bash
686
- PREV_BEST=$(python3 -c "
245
+ python3 -c "
246
+ from langsmith import Client
687
247
  import json
688
- s = json.load(open('.harness-evolver/summary.json'))
689
- versions = s.get('versions', [])
690
- print(versions[-2]['version'] if len(versions) >= 2 else '')
691
- " 2>/dev/null)
692
- if [ -n "$PREV_BEST" ] && [ -f ".harness-evolver/harnesses/$PREV_BEST/scores.json" ]; then
693
- python3 $TOOLS/test_growth.py \
694
- --current-scores .harness-evolver/harnesses/{version}/scores.json \
695
- --previous-scores ".harness-evolver/harnesses/$PREV_BEST/scores.json" \
696
- --tasks-dir .harness-evolver/eval/tasks/ \
697
- --output-dir .harness-evolver/eval/tasks/ \
698
- --max-total-tasks 60 2>/dev/null
699
- fi
700
- ```
701
248
 
702
- If new tasks were added, print: "Added {N} regression tasks to lock in improvements on: {task_ids}"
249
+ client = Client()
250
+ config = json.load(open('.evolver.json'))
703
251
 
704
- This is the "durable test gates" pattern: every fixed failure becomes a permanent regression test.
705
- New tasks are tagged with `metadata.type: "regression"` and `metadata.source: "regression"` so they
706
- can be distinguished from original tasks. The test suite only grows — regression tasks are never removed.
252
+ # Find examples that improved significantly
253
+ # (score went from <0.5 to >0.8 between iterations)
254
+ # Generate variations and add to dataset
255
+ # client.create_examples(dataset_id=config['dataset_id'], examples=[...])
256
+ print('Test suite growth: added N regression examples')
257
+ " 2>/dev/null
258
+ ```
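
The improved-example check sketched above can be made concrete. A minimal version, assuming per-example scores keyed by example id (the 0.5/0.8 thresholds mirror the comment in the snippet):

```python
def improved_examples(prev_scores, curr_scores, low=0.5, high=0.8):
    """Example ids whose score rose from below `low` to above `high`
    between two iterations: candidates for new regression examples.

    Examples absent from the previous iteration count as score 0.0.
    """
    return sorted(ex for ex, score in curr_scores.items()
                  if score > high and prev_scores.get(ex, 0.0) < low)
```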
707
259
 
708
260
  ### 6. Report
709
261
 
710
- Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
262
+ Print: `Iteration {i}/{N}: v{NNN} scored {score} (best: {best} at {best_score})`
711
263
 
712
- ### 6.5. Auto-trigger Critic (on eval gaming)
264
+ ### 6.5. Auto-trigger Critic
713
265
 
714
- Read `summary.json` and check:
715
- - Did the score jump >0.3 from parent version?
716
- - Did we reach 1.0 in fewer than 3 total iterations?
266
+ If score jumped >0.3 from previous iteration OR reached target in <3 iterations:
717
267
 
718
- If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
719
-
720
- ```bash
721
- python3 $TOOLS/evaluate.py run \
722
- --harness .harness-evolver/harnesses/{version}/harness.py \
723
- --tasks-dir .harness-evolver/eval/tasks/ \
724
- --eval .harness-evolver/eval/eval.py \
725
- --traces-dir /tmp/critic-check/ \
726
- --scores /tmp/critic-check-scores.json \
727
- --timeout 60
728
- ```
729
-
730
- Dispatch the critic agent:
268
+ Spawn the critic agent to analyze evaluator quality:
731
269
 
732
270
  ```
733
271
  Agent(
734
- subagent_type: "harness-evolver-critic",
735
- description: "Critic: analyze eval quality",
272
+ subagent_type: "evolver-critic",
273
+ description: "Critic: check evaluator gaming",
736
274
  prompt: |
737
275
  <objective>
738
- EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
739
- Analyze the eval quality and propose a stricter eval.
276
+ EVAL GAMING DETECTED: Score jumped from {prev_score} to {score}.
277
+ Check if the LangSmith evaluators are being gamed.
740
278
  </objective>
741
279
 
742
280
  <files_to_read>
743
- - .harness-evolver/eval/eval.py
744
- - .harness-evolver/summary.json
745
- - .harness-evolver/harnesses/{version}/scores.json
746
- - .harness-evolver/harnesses/{version}/harness.py
747
- - .harness-evolver/harnesses/{version}/proposal.md
748
- - .harness-evolver/config.json
749
- - .harness-evolver/langsmith_stats.json (if exists)
281
+ - .evolver.json
282
+ - comparison.json
283
+ - trace_insights.json
750
284
  </files_to_read>
751
-
752
- <output>
753
- Write:
754
- - .harness-evolver/critic_report.md
755
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
756
- </output>
757
-
758
- <success_criteria>
759
- - Identifies specific weaknesses in eval.py with task/output examples
760
- - If gaming detected, shows exact tasks that expose the weakness
761
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
762
- - Re-scores the best version with improved eval to show the difference
763
- </success_criteria>
764
285
  )
765
286
  ```
766
287
 
767
- Wait for `## CRITIC REPORT COMPLETE`.
768
-
769
- If critic wrote `eval_improved.py`:
770
- - Re-score the best harness with the improved eval
771
- - Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
772
- - **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
773
- - Re-run baseline with new eval and update `summary.json`
774
- - Print: "Eval upgraded. Resuming evolution with stricter eval."
775
- - **Continue the loop** with the new eval
776
-
777
- If critic did NOT write `eval_improved.py` (eval is fine):
778
- - Print the critic's assessment
779
- - Continue the loop normally
780
-
781
- ### 7. Auto-trigger Architect (on stagnation or regression)
782
-
783
- Check if the architect should be auto-spawned:
784
- - **Stagnation**: 3 consecutive iterations within 1% of each other
785
- - **Regression**: score dropped below parent score (even once)
288
+ ### 7. Auto-trigger Architect
786
289
 
787
- AND `.harness-evolver/architecture.json` does NOT already exist.
788
-
789
- If triggered:
790
-
791
- ```bash
792
- python3 $TOOLS/analyze_architecture.py \
793
- --harness .harness-evolver/harnesses/{best_version}/harness.py \
794
- --traces-dir .harness-evolver/harnesses/{best_version}/traces \
795
- --summary .harness-evolver/summary.json \
796
- -o .harness-evolver/architecture_signals.json
797
- ```
798
-
799
- Dispatch the architect agent:
290
+ If 3 consecutive iterations within 1% OR score dropped:
800
291
 
801
292
  ```
802
293
  Agent(
803
- subagent_type: "harness-evolver-architect",
804
- description: "Architect: analyze topology after {stagnation/regression}",
294
+ subagent_type: "evolver-architect",
295
+ description: "Architect: recommend topology change",
805
296
  prompt: |
806
297
  <objective>
807
- The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
808
- Analyze the harness architecture and recommend a topology change.
298
+ The evolution loop has stagnated after {iterations} iterations.
299
+ Analyze the architecture and recommend changes.
809
300
  </objective>
810
301
 
811
302
  <files_to_read>
812
- - .harness-evolver/architecture_signals.json
813
- - .harness-evolver/summary.json
814
- - .harness-evolver/PROPOSER_HISTORY.md
815
- - .harness-evolver/config.json
816
- - .harness-evolver/harnesses/{best_version}/harness.py
817
- - .harness-evolver/harnesses/{best_version}/scores.json
818
- - .harness-evolver/context7_docs.md (if exists)
303
+ - .evolver.json
304
+ - trace_insights.json
305
+ - {entry point and related source files}
819
306
  </files_to_read>
820
-
821
- <output>
822
- Write:
823
- - .harness-evolver/architecture.json (structured recommendation)
824
- - .harness-evolver/architecture.md (human-readable analysis)
825
- </output>
826
-
827
- <success_criteria>
828
- - Recommendation includes concrete migration steps
829
- - Each step is implementable in one proposer iteration
830
- - Considers detected stack and available API keys
831
- </success_criteria>
832
307
  )
833
308
  ```
834
309
 
835
- Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
836
-
837
- Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`
838
-
839
- Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.
840
-
841
310
  ### 8. Check Stop Conditions
842
311
 
843
- - **Target**: `combined_score >= target_score` → stop
312
+ - **Target**: `score >= target_score` → stop
844
313
  - **N reached**: done
845
- - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
314
+ - **Stagnation post-architect**: 3 more iterations without improvement → stop
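
The three conditions can be combined into a single check. A sketch — the score-history list and the architect-iteration index are bookkeeping assumptions, not fields the tools emit:

```python
def should_stop(scores, target, max_iters, post_architect=None):
    """Decide whether the evolution loop should stop.

    scores: winner score per iteration, oldest first.
    post_architect: index of the first iteration run after the architect,
    or None if the architect never ran.
    """
    if scores and scores[-1] >= target:
        return "target reached"
    if len(scores) >= max_iters:
        return "iteration budget exhausted"
    if post_architect is not None and len(scores) - post_architect >= 3:
        best_before = max(scores[:post_architect], default=0.0)
        if max(scores[post_architect:]) <= best_before:
            return "stagnation after architect"
    return None  # keep iterating
```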
846
315
 
847
316
  ## When Loop Ends — Final Report
848
317
 
849
318
  - Best version and score
850
319
  - Improvement over baseline (absolute and %)
851
320
  - Total iterations run
852
- - Whether critic was triggered and eval was upgraded
853
- - Whether architect was triggered and what it recommended
854
- - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
321
+ - Key changes made (git log from baseline to current)
322
+ - LangSmith experiment URLs for comparison
323
+ - Suggest: `/evolver:deploy` to finalize