harness-evolver 2.9.0 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/README.md +62 -117
  2. package/agents/evolver-architect.md +53 -0
  3. package/agents/evolver-critic.md +44 -0
  4. package/agents/evolver-proposer.md +128 -0
  5. package/agents/evolver-testgen.md +67 -0
  6. package/bin/install.js +181 -171
  7. package/package.json +7 -7
  8. package/skills/deploy/SKILL.md +49 -56
  9. package/skills/evolve/SKILL.md +180 -700
  10. package/skills/setup/SKILL.md +182 -0
  11. package/skills/status/SKILL.md +23 -21
  12. package/tools/read_results.py +240 -0
  13. package/tools/run_eval.py +202 -0
  14. package/tools/seed_from_traces.py +36 -8
  15. package/tools/setup.py +393 -0
  16. package/tools/trace_insights.py +86 -14
  17. package/agents/harness-evolver-architect.md +0 -173
  18. package/agents/harness-evolver-critic.md +0 -132
  19. package/agents/harness-evolver-judge.md +0 -110
  20. package/agents/harness-evolver-proposer.md +0 -317
  21. package/agents/harness-evolver-testgen.md +0 -112
  22. package/examples/classifier/README.md +0 -25
  23. package/examples/classifier/config.json +0 -3
  24. package/examples/classifier/eval.py +0 -58
  25. package/examples/classifier/harness.py +0 -111
  26. package/examples/classifier/tasks/task_001.json +0 -1
  27. package/examples/classifier/tasks/task_002.json +0 -1
  28. package/examples/classifier/tasks/task_003.json +0 -1
  29. package/examples/classifier/tasks/task_004.json +0 -1
  30. package/examples/classifier/tasks/task_005.json +0 -1
  31. package/examples/classifier/tasks/task_006.json +0 -1
  32. package/examples/classifier/tasks/task_007.json +0 -1
  33. package/examples/classifier/tasks/task_008.json +0 -1
  34. package/examples/classifier/tasks/task_009.json +0 -1
  35. package/examples/classifier/tasks/task_010.json +0 -1
  36. package/skills/architect/SKILL.md +0 -93
  37. package/skills/compare/SKILL.md +0 -73
  38. package/skills/critic/SKILL.md +0 -67
  39. package/skills/diagnose/SKILL.md +0 -96
  40. package/skills/import-traces/SKILL.md +0 -102
  41. package/skills/init/SKILL.md +0 -253
  42. package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
  43. package/tools/__pycache__/init.cpython-313.pyc +0 -0
  44. package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
  45. package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
  46. package/tools/eval_llm_judge.py +0 -233
  47. package/tools/eval_passthrough.py +0 -55
  48. package/tools/evaluate.py +0 -255
  49. package/tools/import_traces.py +0 -229
  50. package/tools/init.py +0 -531
  51. package/tools/llm_api.py +0 -125
  52. package/tools/state.py +0 -219
  53. package/tools/test_growth.py +0 -230
  54. package/tools/trace_logger.py +0 -42
@@ -1,843 +1,323 @@
  ---
- name: harness-evolver:evolve
- description: "Use when the user wants to run the optimization loop, improve harness performance, evolve the harness, or iterate on harness quality. Requires .harness-evolver/ to exist (run harness-evolver:init first)."
+ name: evolver:evolve
+ description: "Use when the user wants to run the optimization loop, improve agent performance, evolve the agent, or iterate on quality. Requires .evolver.json to exist (run evolver:setup first)."
  argument-hint: "[--iterations N]"
  allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
  ---

- # /harness-evolver:evolve
+ # /evolver:evolve

- Run the autonomous propose-evaluate-iterate loop.
+ Run the autonomous propose-evaluate-iterate loop using LangSmith as the evaluation backend and git worktrees for isolation.

  ## Prerequisites

- `.harness-evolver/summary.json` must exist. If not, tell user to run `harness-evolver:init`.
+ `.evolver.json` must exist. If not, tell user to run `evolver:setup`.

  ## Resolve Tool Path

  ```bash
- TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")
  ```

  ## Parse Arguments

- - `--iterations N` (default: 10)
- - Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
+ - `--iterations N` (default: from interactive question or 5)

  ## Pre-Loop: Interactive Configuration

- If no `--iterations` argument was provided, ask the user interactively:
+ If no `--iterations` argument was provided, ask the user:

- Use AskUserQuestion with TWO questions:
-
- ```
- Question 1: "How many evolution iterations?"
- Header: "Iterations"
- Options:
- - "3 (quick)" — Fast exploration, good for testing setup
- - "5 (balanced)" — Good trade-off between speed and quality
- - "10 (thorough)" — Deep optimization, takes longer
-
- Question 2: "Stop early if score reaches?"
- Header: "Target"
- Options:
- - "0.8 (good enough)" — Stop when the harness is reasonably good
- - "0.9 (high quality)" — Stop when quality is high
- - "0.95 (near perfect)" — Push for near-perfect scores
- - "No limit" — Run all iterations regardless of score
+ ```json
+ {
+   "questions": [
+     {
+       "question": "How many evolution iterations?",
+       "header": "Iterations",
+       "multiSelect": false,
+       "options": [
+         {"label": "3 (quick)", "description": "Fast exploration, good for testing. ~15 min."},
+         {"label": "5 (balanced)", "description": "Good trade-off between speed and quality. ~30 min."},
+         {"label": "10 (thorough)", "description": "Deep optimization with adaptive strategies. ~1 hour."}
+       ]
+     },
+     {
+       "question": "Stop early if score reaches?",
+       "header": "Target",
+       "multiSelect": false,
+       "options": [
+         {"label": "0.8 (good enough)", "description": "Stop when the agent is reasonably good"},
+         {"label": "0.9 (high quality)", "description": "Stop when quality is high"},
+         {"label": "0.95 (near perfect)", "description": "Push for near-perfect scores"},
+         {"label": "No limit", "description": "Run all iterations regardless of score"}
+       ]
+     }
+   ]
+ }
  ```

- Apply the answers:
- - Set iterations from question 1 (3, 5, or 10)
- - Set target_score from question 2 (0.8, 0.9, 0.95, or None)
-
- If `--iterations` WAS provided as argument, skip these questions and use the argument value.
-
  ## The Loop

- For each iteration:
-
- ### 1. Get Next Version
-
+ Read config:
  ```bash
- python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
+ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Best: {c[\"best_experiment\"]} ({c[\"best_score\"]:.3f}), Iterations: {c[\"iterations\"]}')"
  ```

- ### 1.4. Gather Production Insights (first iteration only)
-
- On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
-
- ```bash
- PROD_PROJECT=$(python3 -c "
- import json, os
- c = json.load(open('.harness-evolver/config.json'))
- print(c.get('eval', {}).get('production_project', ''))
- " 2>/dev/null)
- if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
-   python3 $TOOLS/seed_from_traces.py \
-     --project "$PROD_PROJECT" \
-     --output-md .harness-evolver/production_seed.md \
-     --output-json .harness-evolver/production_seed.json \
-     --limit 100 2>/dev/null
- fi
- ```
-
- The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
-
- ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
-
- **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
-
- **Step 1: Find the actual LangSmith project name**
-
- ```bash
- langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
- ```
-
- This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
-
- ```bash
- LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
- ```
-
- If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
+ For each iteration:

- **Step 2: Gather raw traces from the discovered project**
+ ### 1. Get Next Version

  ```bash
- if [ -n "$LS_PROJECT" ]; then
-   langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
-   langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
-   echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
- else
-   echo "[]" > /tmp/langsmith_raw.json
-   echo "{}" > .harness-evolver/langsmith_stats.json
- fi
+ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"iterations\"]+1:03d}')"
  ```

- **Step 3: Process raw LangSmith data into a readable format for proposers**
+ ### 1.5. Gather Trace Insights

- The raw langsmith data has LangChain-serialized messages that are hard to read. Process it into a clean summary:
+ Run trace insights from the best experiment:

  ```bash
- python3 -c "
- import json, sys
-
- raw = json.load(open('/tmp/langsmith_raw.json'))
- if not raw:
-     json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
-     sys.exit(0)
-
- clean = []
- for r in raw:
-     entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
-
-     # Extract readable prompt from LangChain serialized inputs
-     inputs = r.get('inputs', {})
-     if isinstance(inputs, dict) and 'messages' in inputs:
-         msgs = inputs['messages']
-         for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
-             for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
-                 if isinstance(msg, dict):
-                     kwargs = msg.get('kwargs', msg)
-                     content = kwargs.get('content', '')
-                     msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
-                     if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
-                         entry['user_message'] = str(content)[:300]
-                     elif 'System' in str(msg_type):
-                         entry['system_prompt_preview'] = str(content)[:200]
-
-     # Extract readable output
-     outputs = r.get('outputs', {})
-     if isinstance(outputs, dict) and 'generations' in outputs:
-         gens = outputs['generations']
-         if gens and isinstance(gens, list) and gens[0]:
-             gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
-             if isinstance(gen, dict):
-                 msg = gen.get('message', gen)
-                 if isinstance(msg, dict):
-                     kwargs = msg.get('kwargs', msg)
-                     entry['llm_response'] = str(kwargs.get('content', ''))[:300]
-
-     clean.append(entry)
-
- json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
- print(f'Processed {len(clean)} LangSmith runs into readable format')
- " 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
- ```
-
- The resulting `langsmith_runs.json` has clean, readable entries:
- ```json
- [
-   {
-     "name": "ChatGoogleGenerativeAI",
-     "tokens": 1332,
-     "error": null,
-     "user_message": "Analise este texto: Bom dia pessoal...",
-     "system_prompt_preview": "Você é um moderador de conteúdo...",
-     "llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
-   }
- ]
+ BEST=$(python3 -c "import json; print(json.load(open('.evolver.json'))['best_experiment'])")
+ python3 $TOOLS/trace_insights.py \
+   --from-experiment "$BEST" \
+   --output trace_insights.json 2>/dev/null
  ```

- These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
-
- ### 1.6. Generate Trace Insights (systematic analysis)
-
- If LangSmith traces were gathered, run systematic analysis to cluster errors, analyze token usage, and cross-reference with scores:
+ If a production project is configured, also gather production insights:

  ```bash
- if [ -f ".harness-evolver/langsmith_runs.json" ]; then
-   BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s['best']['version'])")
-   SCORES_PATH=".harness-evolver/harnesses/$BEST/scores.json"
-   [ ! -f "$SCORES_PATH" ] && SCORES_PATH=".harness-evolver/baseline/scores.json"
-   python3 $TOOLS/trace_insights.py \
-     --langsmith-runs .harness-evolver/langsmith_runs.json \
-     --langsmith-stats .harness-evolver/langsmith_stats.json \
-     --scores "$SCORES_PATH" \
-     --tasks-dir .harness-evolver/eval/tasks/ \
-     --output .harness-evolver/trace_insights.json 2>/dev/null
+ PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
+ if [ -n "$PROD" ] && [ ! -f "production_seed.json" ]; then
+   python3 $TOOLS/seed_from_traces.py \
+     --project "$PROD" --use-sdk \
+     --output-md production_seed.md \
+     --output-json production_seed.json \
+     --limit 100 2>/dev/null
  fi
  ```

- The resulting `trace_insights.json` contains:
- - `error_clusters`: grouped error patterns with counts
- - `token_analysis`: score distribution by token usage bucket (low/medium/high)
- - `hypotheses`: data-driven theories about failure causes
- - `top_issues`: highest-impact problems sorted by severity
-
- This file is included in all proposers' `<files_to_read>` so they have structured diagnostic data.
+ ### 1.8. Analyze Per-Task Failures

- ### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
-
- Before spawning proposers, analyze which tasks are failing and cluster them:
+ Read the best experiment results and cluster failures:

  ```bash
- python3 -c "
- import json, os, sys
-
- # Find best version scores
- summary = json.load(open('.harness-evolver/summary.json'))
- best = summary['best']['version']
- scores_path = f'.harness-evolver/harnesses/{best}/scores.json'
- if not os.path.exists(scores_path):
-     scores_path = '.harness-evolver/baseline/scores.json' if os.path.exists('.harness-evolver/baseline/scores.json') else None
-
- if not scores_path or not os.path.exists(scores_path):
-     print('NO_SCORES')
-     sys.exit(0)
-
- scores = json.load(open(scores_path))
- tasks_dir = '.harness-evolver/eval/tasks/'
- failures = {}
-
- for tid, tdata in scores.get('per_task', {}).items():
-     score = tdata.get('score', 0)
-     if score < 0.7:
-         tfile = os.path.join(tasks_dir, tid + '.json')
-         cat = 'unknown'
-         if os.path.exists(tfile):
-             task = json.load(open(tfile))
-             meta = task.get('metadata', {})
-             cat = meta.get('category', meta.get('type', meta.get('difficulty', 'unknown')))
-         failures.setdefault(cat, []).append({'id': tid, 'score': score})
-
- if not failures:
-     print('ALL_PASSING')
- else:
-     sorted_clusters = sorted(failures.items(), key=lambda x: -len(x[1]))
-     for i, (cat, tasks) in enumerate(sorted_clusters[:2]):
-         task_ids = [t['id'] for t in tasks]
-         avg_score = sum(t['score'] for t in tasks) / len(tasks)
-         print(f'CLUSTER_{i+1}|{cat}|{json.dumps(task_ids)}|{avg_score:.2f}')
- " 2>/dev/null
+ python3 $TOOLS/read_results.py \
+   --experiment "$BEST" \
+   --config .evolver.json \
+   --output best_results.json 2>/dev/null
  ```

- Parse the output:
- - If `NO_SCORES` or `ALL_PASSING`: D gets "creative" brief, E gets "efficiency" brief
- - If clusters found: D targets cluster 1, E targets cluster 2
- - If only 1 cluster: D targets it, E gets "creative" brief
-
- Save clusters for use in step 2.
-
- ### 2. Propose (3 parallel candidates)
-
- Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
- This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
-
- Determine parents for each strategy:
- - **Exploiter parent**: current best version (from summary.json `best.version`)
- - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
- - **Crossover parents**:
-   - Parent A = current best version
-   - Parent B = per-task champion from previous iteration (read `.harness-evolver/per_task_champion.json`).
-     If no champion file exists, fall back to a non-best version from the archive.
-
- Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
-
- **Candidate A (Exploiter)** — `run_in_background: true`:
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer A (exploit): targeted fix for {version}",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: exploitation
-     You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
-     the highest-impact failing tasks. Base your work on the current best version.
-     Do NOT restructure the code. Do NOT change the architecture.
-     Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
-     </strategy>
+ Parse `best_results.json` to find failing examples (score < 0.7). Group by metadata or error pattern.
+ Generate adaptive briefings for Candidates D and E (same logic as v2).

-     <objective>
-     Propose harness version {version}a that improves on {best_score}.
-     </objective>
+ ### 2. Spawn 5 Proposers in Parallel

-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/harnesses/{best_version}/proposal.md
-     - .harness-evolver/langsmith_diagnosis.json (if exists)
-     - .harness-evolver/langsmith_stats.json (if exists)
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
+ Each proposer runs in a **git worktree** via Claude Code's native `isolation: "worktree"` parameter.

-     <output>
-     Create directory .harness-evolver/harnesses/{version}a/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
+ **Candidate A (Exploit)** — `run_in_background: true`:

- **Candidate B (Explorer)** — `run_in_background: true`:
  ```
  Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer B (explore): bold change from {explorer_parent}",
+   subagent_type: "evolver-proposer",
+   description: "Proposer A: exploit best version",
+   isolation: "worktree",
    run_in_background: true,
    prompt: |
-     <strategy>
-     APPROACH: exploration
-     You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
-     Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
-     Consider: different retrieval strategy, different prompt structure,
-     different output parsing, different error handling philosophy.
-     Be bold. A creative failure teaches more than a timid success.
-     </strategy>
-
      <objective>
-     Propose harness version {version}b that takes a different approach.
+     Improve the agent code to score higher on the evaluation dataset.
+     You are working in an isolated git worktree — modify any file freely.
      </objective>

-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/baseline/harness.py
-     - .harness-evolver/harnesses/{explorer_parent}/harness.py
-     - .harness-evolver/harnesses/{explorer_parent}/scores.json
-     - .harness-evolver/langsmith_diagnosis.json (if exists)
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
-
-     <output>
-     Create directory .harness-evolver/harnesses/{version}b/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
-
- **Candidate C (Crossover)** — blocks (last one):
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
-   prompt: |
      <strategy>
-     APPROACH: crossover
-     You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
-     - {parent_a} (score: {score_a}): {summary of what it does well}
-     - {parent_b} (score: {score_b}): {summary of what it does well}
-     Take the best elements from each and merge them into a single harness.
+     APPROACH: exploitation
+     Make targeted improvements to the current best version.
+     Focus on the specific failures identified in the results.
      </strategy>

-     <objective>
-     Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
-     </objective>
-
      <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{parent_a}/harness.py
-     - .harness-evolver/harnesses/{parent_a}/scores.json
-     - .harness-evolver/harnesses/{parent_b}/harness.py
-     - .harness-evolver/harnesses/{parent_b}/scores.json
-     - .harness-evolver/langsmith_diagnosis.json (if exists)
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
+     - .evolver.json
+     - trace_insights.json (if exists)
+     - production_seed.json (if exists)
+     - best_results.json (if exists)
+     - {entry point file from .evolver.json}
      </files_to_read>

-     <output>
-     Create directory .harness-evolver/harnesses/{version}c/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
-
- **Also spawn these additional candidates:**
-
- **Candidate D (Failure-Targeted or Creative)** — `run_in_background: true`:
-
- If failure clusters were found in step 1.8:
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer D: fix {cluster_1_category} failures",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: failure-targeted
-     Focus on fixing these SPECIFIC failing tasks: {cluster_1_task_ids}
-     They share the pattern: {cluster_1_category} (avg score: {cluster_1_avg})
-     Read the traces of these specific tasks to understand WHY they fail.
-     Your changes should improve these tasks WITHOUT regressing others.
-     You are free to change anything — prompts, code, retrieval, architecture —
-     whatever is needed to fix THIS specific failure mode.
-     </strategy>
-
-     <objective>
-     Propose harness version {version}d targeting {cluster_1_category} failures.
-     </objective>
-
-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
+     <context>
+     Best experiment: {best_experiment} (score: {best_score})
+     Framework: {framework}
+     Entry point: {entry_point}
+     Evaluators: {evaluators}
+     Failing examples: {failing_example_summary}
+     </context>

      <output>
-     Create directory .harness-evolver/harnesses/{version}d/ containing:
-     - harness.py, config.json, proposal.md
+     1. Modify the code to improve performance
+     2. Commit your changes with a descriptive message
+     3. Write proposal.md explaining what you changed and why
      </output>
  )
  ```

- If ALL_PASSING (no failures):
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer D: creative approach",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: creative
-     All tasks are scoring well. Try something UNEXPECTED:
-     - Different algorithm or library
-     - Completely different prompt architecture
-     - Novel error handling or output validation
-     - Something no one would think of
-     The goal is to discover improvements that incremental fixes would miss.
-     </strategy>
-   ...same files_to_read and output as above...
- )
- ```
-
- **Candidate E (Failure-Targeted or Efficiency)** — `run_in_background: true`:
-
- If a second failure cluster exists:
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer E: fix {cluster_2_category} failures",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: failure-targeted
-     Focus on fixing these SPECIFIC failing tasks: {cluster_2_task_ids}
-     They share the pattern: {cluster_2_category} (avg score: {cluster_2_avg})
-     Read the traces of these specific tasks to understand WHY they fail.
-     Your changes should improve these tasks WITHOUT regressing others.
-     You are free to change anything — prompts, code, retrieval, architecture —
-     whatever is needed to fix THIS specific failure mode.
-     </strategy>
-
-     <objective>
-     Propose harness version {version}e targeting {cluster_2_category} failures.
-     </objective>
-
-     <files_to_read>
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/langsmith_runs.json (if exists)
-     - .harness-evolver/trace_insights.json (if exists)
-     - .harness-evolver/production_seed.json (if exists)
-     - .harness-evolver/architecture.json (if exists)
-     </files_to_read>
-
-     <output>
-     Create directory .harness-evolver/harnesses/{version}e/ containing:
-     - harness.py, config.json, proposal.md
-     </output>
- )
- ```
+ **Candidate B (Explorer)** — `run_in_background: true`:
+ Same structure but `APPROACH: exploration` — bold, fundamentally different approach.

- If no second cluster (or ALL_PASSING):
- ```
- Agent(
-   subagent_type: "harness-evolver-proposer",
-   description: "Proposer E: efficiency optimization",
-   run_in_background: true,
-   prompt: |
-     <strategy>
-     APPROACH: efficiency
-     Maintain the current quality but optimize for:
-     - Fewer LLM tokens (shorter prompts, less context)
-     - Faster execution (reduce unnecessary steps)
-     - Simpler code (remove redundant logic)
-     - Better error handling (graceful degradation)
-     Do NOT sacrifice accuracy for speed — same quality, less cost.
-     </strategy>
-   ...same files_to_read and output as above...
- )
- ```
162
+ **Candidate C (Crossover)** `run_in_background: true`:
163
+ Same structure but `APPROACH: crossover` — combine strengths from previous iterations.
164
+ Include git log of recent changes so it can see what was tried.
525
165
 
526
- Wait for all 5 to complete. The background agents will notify when done.
166
+ **Candidates D and E (Failure-Targeted)** `run_in_background: true`:
167
+ Same structure but `APPROACH: failure-targeted` with specific failing example clusters.
168
+ If ALL_PASSING: D gets `creative`, E gets `efficiency`.
527
169
 
528
- **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses baseline as both parents but with instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.
170
+ Wait for all 5 to complete.
529
171
 
530
- **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, step 1.8 will naturally shift D and E toward failure-targeted or creative strategies based on actual task performance.
172
+ ### 3. Evaluate Each Candidate
531
173
 
532
- ### 3. Validate All Candidates
174
+ For each worktree that has changes (proposer committed something):
533
175
 
534
- For each candidate (a, b, c, d, e):
535
176
  ```bash
536
- python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
177
+ python3 $TOOLS/run_eval.py \
178
+ --config .evolver.json \
179
+ --worktree-path {worktree_path} \
180
+ --experiment-prefix v{NNN}{suffix} \
181
+ --timeout 120
537
182
  ```
538
183
 
539
- Remove any that fail validation.
184
+ Each candidate becomes a separate LangSmith experiment.
 
- ### 4. Evaluate All Candidates
+ ### 4. Compare All Candidates
 
- For each valid candidate:
  ```bash
- python3 $TOOLS/evaluate.py run \
-   --harness .harness-evolver/harnesses/{version}{suffix}/harness.py \
-   --config .harness-evolver/harnesses/{version}{suffix}/config.json \
-   --tasks-dir .harness-evolver/eval/tasks/ \
-   --eval .harness-evolver/eval/eval.py \
-   --traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
-   --scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
-   --timeout 60
- ```
-
- ### 4.5. Judge (if eval returned pending scores)
-
- For each evaluated candidate, read its scores.json. If `eval_type` is `"pending-judge"` (combined_score == -1), the eval was a passthrough and needs judge scoring.
-
- Spawn judge subagent with `subagent_type: "harness-evolver-judge"` for EACH candidate that needs judging:
-
- ```
- Agent(
-   subagent_type: "harness-evolver-judge",
-   description: "Judge: score {version}{suffix} outputs",
-   prompt: |
-     <objective>
-     Score the outputs of harness version {version}{suffix} across all {N} tasks.
-     </objective>
-
-     <files_to_read>
-     - .harness-evolver/harnesses/{version}{suffix}/scores.json
-     - .harness-evolver/eval/tasks/ (read all task files)
-     </files_to_read>
-
-     <output>
-     Overwrite .harness-evolver/harnesses/{version}{suffix}/scores.json with real scores.
-     </output>
- )
+ python3 $TOOLS/read_results.py \
+   --experiments "v{NNN}a,v{NNN}b,v{NNN}c,v{NNN}d,v{NNN}e" \
+   --config .evolver.json \
+   --output comparison.json
  ```
 
- Wait for `## JUDGE COMPLETE`.
-
- If eval_type is NOT "pending-judge", the eval.py already produced real scores; skip this step.
-
- ### 5. Select Winner + Track Per-Task Champions
+ Parse `comparison.json`:
+ - `comparison.winner` — highest combined score
+ - `comparison.champion` — per-task champion (for next crossover)
+ - `comparison.all_candidates` — all scores for reporting
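Selecting the winner and the per-task champion from that output can be sketched as follows. The `comparison.json` shape shown here (an `all_candidates` map with `score` and `per_task` fields, and the `v007a`/`v007b` names) is an illustrative assumption based on the fields named above:

```python
# Assumed shape of comparison.json, reduced to two candidates.
comparison = {
    "all_candidates": {
        "v007a": {"score": 0.71, "per_task": {"t1": 0.9, "t2": 0.5}},
        "v007b": {"score": 0.64, "per_task": {"t1": 0.4, "t2": 0.95}},
    }
}
candidates = comparison["all_candidates"]

# Winner: highest combined score.
winner = max(candidates, key=lambda k: candidates[k]["score"])

def task_wins(name):
    # Count individual tasks on which this candidate beats the winner.
    return sum(
        1 for tid, score in candidates[name]["per_task"].items()
        if score > candidates[winner]["per_task"].get(tid, 0)
    )

# Champion: the non-winner that beats the winner on the most tasks.
others = [c for c in candidates if c != winner]
champion = max(others, key=task_wins) if others else None
print(winner, champion)  # prints: v007a v007b
```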
 
- **5a. Find overall winner (highest combined_score):**
+ ### 5. Merge Winner
 
- Compare all evaluated candidates. The winner is the one with highest combined_score.
-
- **5b. Find per-task champion (candidate that beats the winner on most individual tasks):**
+ If the winner scored higher than the current best:
 
  ```bash
- python3 -c "
- import json, os
-
- version = '{version}'
- candidates = {}
- for suffix in ['a', 'b', 'c', 'd', 'e']:
-     path = f'.harness-evolver/harnesses/{version}{suffix}/scores.json'
-     if os.path.exists(path):
-         candidates[suffix] = json.load(open(path))
-
- if not candidates:
-     print('NO_CANDIDATES')
-     exit()
-
- # Overall winner
- winner_suffix = max(candidates, key=lambda s: candidates[s].get('combined_score', 0))
- winner_score = candidates[winner_suffix]['combined_score']
- print(f'WINNER: {winner_suffix} (score: {winner_score:.3f})')
-
- # Per-task champion: which NON-WINNER candidate beats the winner on the most tasks?
- task_wins = {}
- winner_tasks = candidates[winner_suffix].get('per_task', {})
- for suffix, data in candidates.items():
-     if suffix == winner_suffix:
-         continue
-     wins = 0
-     for tid, tdata in data.get('per_task', {}).items():
-         winner_task_score = winner_tasks.get(tid, {}).get('score', 0)
-         if tdata.get('score', 0) > winner_task_score:
-             wins += 1
-     if wins > 0:
-         task_wins[suffix] = wins
-
- if task_wins:
-     champion_suffix = max(task_wins, key=task_wins.get)
-     print(f'PER_TASK_CHAMPION: {champion_suffix} (beats winner on {task_wins[champion_suffix]} tasks)')
-     # Save champion info for next iteration's crossover parent
-     with open('.harness-evolver/per_task_champion.json', 'w') as f:
-         json.dump({'suffix': champion_suffix, 'version': f'{version}{champion_suffix}', 'task_wins': task_wins[champion_suffix]}, f)
- else:
-     print('NO_CHAMPION: winner dominates all tasks')
- " 2>/dev/null
- ```
-
- **5c. Promote winner and report ALL candidates:**
+ # Get the winning worktree's branch
+ WINNER_BRANCH={winning_worktree_branch}
 
- Rename winner directory to official version:
- ```bash
- mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
+ # Merge into main
+ git merge $WINNER_BRANCH -m "evolve: merge v{NNN}{suffix} (score: {score})"
  ```
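The "only merge if better" guard can be sketched as a small predicate, assuming `best_score` in `.evolver.json` holds the current best (demonstrated here against a throwaway config file rather than the real one):

```python
import json
import os
import tempfile

def should_merge(winner_score, config_path):
    # Merge only if the iteration winner beats the recorded best score.
    with open(config_path) as f:
        best = json.load(f).get("best_score", 0.0)
    return winner_score > best

# Demo against a temporary stand-in for .evolver.json.
with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "evolver.json")
    with open(cfg, "w") as f:
        json.dump({"best_score": 0.62}, f)
    print(should_merge(0.71, cfg), should_merge(0.55, cfg))  # prints: True False
```

Ties go to the incumbent here; whether a tie should still merge is a policy choice.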
 
- Update state:
- ```bash
- python3 $TOOLS/state.py update \
-   --base-dir .harness-evolver \
-   --version {version} \
-   --scores .harness-evolver/harnesses/{version}/scores.json \
-   --proposal .harness-evolver/harnesses/{version}/proposal.md
+ Update `.evolver.json`:
+ ```python
+ import json
+ c = json.load(open('.evolver.json'))
+ c['best_experiment'] = '{winner_experiment}'
+ c['best_score'] = {winner_score}
+ c['iterations'] = c['iterations'] + 1
+ c['history'].append({
+     'version': 'v{NNN}',
+     'experiment': '{winner_experiment}',
+     'score': {winner_score}
+ })
+ json.dump(c, open('.evolver.json', 'w'), indent=2)
  ```
 
- Report ALL candidates with their scores and strategies:
- ```
- Iteration {i}/{N} — {num_candidates} candidates evaluated:
- {version}a (exploit): {score_a} — {summary}
- {version}b (explore): {score_b} — {summary}
- {version}c (crossover): {score_c} — {summary}
- {version}d ({strategy_d}): {score_d} — {summary}
- {version}e ({strategy_e}): {score_e} — {summary}
-
- Winner: {version}{suffix} ({score})
- Per-task champion: {champion_suffix} (beats winner on {N} tasks) — saved for next crossover
+ Report ALL candidates:
  ```
+ Iteration {i}/{N} — 5 candidates evaluated:
+ v{NNN}a (exploit): {score_a} — {summary}
+ v{NNN}b (explore): {score_b} — {summary}
+ v{NNN}c (crossover): {score_c} — {summary}
+ v{NNN}d ({strategy}): {score_d} — {summary}
+ v{NNN}e ({strategy}): {score_e} — {summary}
 
- Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
+ Winner: v{NNN}{suffix} ({score}) — merged into main
+ Per-task champion: {champion} (beats winner on {N} tasks)
+ ```
 
- ### 5.5. Test Suite Growth (Durable Regression Gates)
+ ### 5.5. Test Suite Growth
 
- After the winner is promoted, check if any previously-failing tasks are now passing.
- Generate regression tasks to lock in improvements and prevent future regressions:
+ If previously-failing examples now pass, add regression examples to the dataset:
 
  ```bash
- PREV_BEST=$(python3 -c "
+ python3 -c "
+ from langsmith import Client
  import json
- s = json.load(open('.harness-evolver/summary.json'))
- versions = s.get('versions', [])
- print(versions[-2]['version'] if len(versions) >= 2 else '')
- " 2>/dev/null)
- if [ -n "$PREV_BEST" ] && [ -f ".harness-evolver/harnesses/$PREV_BEST/scores.json" ]; then
-   python3 $TOOLS/test_growth.py \
-     --current-scores .harness-evolver/harnesses/{version}/scores.json \
-     --previous-scores ".harness-evolver/harnesses/$PREV_BEST/scores.json" \
-     --tasks-dir .harness-evolver/eval/tasks/ \
-     --output-dir .harness-evolver/eval/tasks/ \
-     --max-total-tasks 60 2>/dev/null
- fi
- ```
 
- If new tasks were added, print: "Added {N} regression tasks to lock in improvements on: {task_ids}"
+ client = Client()
+ config = json.load(open('.evolver.json'))
 
- This is the "durable test gates" pattern: every fixed failure becomes a permanent regression test.
- New tasks are tagged with `metadata.type: "regression"` and `metadata.source: "regression"` so they
- can be distinguished from original tasks. The test suite only grows — regression tasks are never removed.
+ # Find examples that improved significantly
+ # (score went from <0.5 to >0.8 between iterations)
+ # Generate variations and add to dataset
+ # client.create_examples(dataset_id=config['dataset_id'], examples=[...])
+ print('Test suite growth: added N regression examples')
+ " 2>/dev/null
+ ```
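The "improved significantly" filter from the comments above can be sketched concretely. The per-example score maps (example id to score) are assumed to be derived from `read_results.py` output; the thresholds mirror the <0.5 / >0.8 rule:

```python
def regression_candidates(prev_scores, curr_scores, low=0.5, high=0.8):
    """Example ids whose score crossed from failing (<low) to passing (>high).

    prev_scores/curr_scores: {example_id: score} maps (assumed shape).
    """
    return sorted(
        ex for ex, score in curr_scores.items()
        if score > high and prev_scores.get(ex, 0.0) < low
    )

prev = {"ex1": 0.2, "ex2": 0.9, "ex3": 0.4}
curr = {"ex1": 0.85, "ex2": 0.92, "ex3": 0.6}
print(regression_candidates(prev, curr))  # prints: ['ex1']
```

The resulting ids would then feed `client.create_examples(...)` to grow the dataset.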
 
  ### 6. Report
 
- Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
+ Print: `Iteration {i}/{N}: v{NNN} scored {score} (best: {best} at {best_score})`
 
- ### 6.5. Auto-trigger Critic (on eval gaming)
+ ### 6.5. Auto-trigger Critic
 
- Read `summary.json` and check:
- - Did the score jump >0.3 from parent version?
- - Did we reach 1.0 in fewer than 3 total iterations?
+ If score jumped >0.3 from previous iteration OR reached target in <3 iterations:
 
- If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
-
- ```bash
- python3 $TOOLS/evaluate.py run \
-   --harness .harness-evolver/harnesses/{version}/harness.py \
-   --tasks-dir .harness-evolver/eval/tasks/ \
-   --eval .harness-evolver/eval/eval.py \
-   --traces-dir /tmp/critic-check/ \
-   --scores /tmp/critic-check-scores.json \
-   --timeout 60
- ```
-
- Dispatch the critic agent:
+ Spawn the critic agent to analyze evaluator quality:
 
  ```
  Agent(
-   subagent_type: "harness-evolver-critic",
-   description: "Critic: analyze eval quality",
+   subagent_type: "evolver-critic",
+   description: "Critic: check evaluator gaming",
    prompt: |
      <objective>
-     EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
-     Analyze the eval quality and propose a stricter eval.
+     EVAL GAMING DETECTED: Score jumped from {prev_score} to {score}.
+     Check if the LangSmith evaluators are being gamed.
      </objective>
 
      <files_to_read>
-     - .harness-evolver/eval/eval.py
-     - .harness-evolver/summary.json
-     - .harness-evolver/harnesses/{version}/scores.json
-     - .harness-evolver/harnesses/{version}/harness.py
-     - .harness-evolver/harnesses/{version}/proposal.md
-     - .harness-evolver/config.json
-     - .harness-evolver/langsmith_stats.json (if exists)
+     - .evolver.json
+     - comparison.json
+     - trace_insights.json
      </files_to_read>
-
-     <output>
-     Write:
-     - .harness-evolver/critic_report.md
-     - .harness-evolver/eval/eval_improved.py (if weaknesses found)
-     </output>
-
-     <success_criteria>
-     - Identifies specific weaknesses in eval.py with task/output examples
-     - If gaming detected, shows exact tasks that expose the weakness
-     - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
-     - Re-scores the best version with improved eval to show the difference
-     </success_criteria>
  )
  ```
 
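The critic trigger condition can be written as a single predicate over the iteration history. The `history` shape (a list of `{'score': ...}` entries, oldest first, as might be kept in `.evolver.json`) is an assumption for illustration:

```python
def should_trigger_critic(history, target_score, jump_threshold=0.3):
    """Trigger the critic on a suspicious score jump or a suspiciously fast win.

    history: per-iteration entries with a 'score' field, oldest first
    (assumed bookkeeping shape).
    """
    scores = [h["score"] for h in history]
    # Jump: latest score improved by more than the threshold in one step.
    jumped = len(scores) >= 2 and scores[-1] - scores[-2] > jump_threshold
    # Fast: target reached in fewer than 3 iterations.
    fast = len(scores) < 3 and bool(scores) and scores[-1] >= target_score
    return jumped or fast

print(should_trigger_critic([{"score": 0.4}, {"score": 0.8}], target_score=0.95))
```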
 
756
- Wait for `## CRITIC REPORT COMPLETE`.
-
- If critic wrote `eval_improved.py`:
- - Re-score the best harness with the improved eval
- - Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
- - **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
- - Re-run baseline with new eval and update `summary.json`
- - Print: "Eval upgraded. Resuming evolution with stricter eval."
- - **Continue the loop** with the new eval
-
- If critic did NOT write `eval_improved.py` (eval is fine):
- - Print the critic's assessment
- - Continue the loop normally
-
- ### 7. Auto-trigger Architect (on stagnation or regression)
-
- Check if the architect should be auto-spawned:
- - **Stagnation**: 3 consecutive iterations within 1% of each other
- - **Regression**: score dropped below parent score (even once)
+ ### 7. Auto-trigger Architect
 
- AND `.harness-evolver/architecture.json` does NOT already exist.
-
- If triggered:
-
- ```bash
- python3 $TOOLS/analyze_architecture.py \
-   --harness .harness-evolver/harnesses/{best_version}/harness.py \
-   --traces-dir .harness-evolver/harnesses/{best_version}/traces \
-   --summary .harness-evolver/summary.json \
-   -o .harness-evolver/architecture_signals.json
- ```
-
- Dispatch the architect agent:
+ If 3 consecutive iterations within 1% OR score dropped:
 
  ```
  Agent(
-   subagent_type: "harness-evolver-architect",
-   description: "Architect: analyze topology after {stagnation/regression}",
+   subagent_type: "evolver-architect",
+   description: "Architect: recommend topology change",
    prompt: |
      <objective>
-     The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
-     Analyze the harness architecture and recommend a topology change.
+     The evolution loop has stagnated after {iterations} iterations.
+     Analyze the architecture and recommend changes.
      </objective>
 
      <files_to_read>
-     - .harness-evolver/architecture_signals.json
-     - .harness-evolver/summary.json
-     - .harness-evolver/PROPOSER_HISTORY.md
-     - .harness-evolver/config.json
-     - .harness-evolver/harnesses/{best_version}/harness.py
-     - .harness-evolver/harnesses/{best_version}/scores.json
-     - .harness-evolver/context7_docs.md (if exists)
+     - .evolver.json
+     - trace_insights.json
+     - {entry point and related source files}
      </files_to_read>
-
-     <output>
-     Write:
-     - .harness-evolver/architecture.json (structured recommendation)
-     - .harness-evolver/architecture.md (human-readable analysis)
-     </output>
-
-     <success_criteria>
-     - Recommendation includes concrete migration steps
-     - Each step is implementable in one proposer iteration
-     - Considers detected stack and available API keys
-     </success_criteria>
  )
  ```
 
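The stagnation/regression test for the architect trigger can be sketched as a predicate; `scores` is the per-iteration best score, oldest first (an assumed bookkeeping shape, as is the `tol` fraction for "within 1%"):

```python
def should_trigger_architect(scores, window=3, tol=0.01):
    """True on regression or stagnation.

    scores: best score per iteration, oldest first (assumed shape).
    """
    # Regression: the latest score dropped below the previous one.
    if len(scores) >= 2 and scores[-1] < scores[-2]:
        return True
    # Stagnation: the last `window` scores all sit within `tol` (1%) of each other.
    if len(scores) >= window:
        recent = scores[-window:]
        if max(recent) - min(recent) <= tol * max(recent):
            return True
    return False

print(should_trigger_architect([0.5, 0.61, 0.612, 0.615]))
```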
- Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
-
- Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`
-
- Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.
-
  ### 8. Check Stop Conditions
 
- - **Target**: `combined_score >= target_score` → stop
+ - **Target**: `score >= target_score` → stop
  - **N reached**: done
- - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
+ - **Stagnation post-architect**: 3 more iterations without improvement → stop
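The three stop conditions above can be collected into one predicate. The `scores` list and the `architect_at` index (the iteration at which the architect ran) are assumed bookkeeping, not part of the skill's recorded state:

```python
def check_stop(scores, target, max_iters, architect_at=None):
    """Return a stop reason, or None to keep iterating.

    scores: best score per iteration, oldest first; architect_at: index of
    the iteration where the architect ran (both shapes are assumptions).
    """
    if scores and scores[-1] >= target:
        return "target reached"
    if len(scores) >= max_iters:
        return "iteration budget reached"
    # Post-architect stagnation: 3+ iterations since the architect ran,
    # with no score beating the pre-architect best.
    if architect_at is not None and len(scores) - architect_at >= 3:
        if max(scores[architect_at:]) <= max(scores[:architect_at] or [0.0]):
            return "stagnation after architect"
    return None

print(check_stop([0.6, 0.96], target=0.95, max_iters=10))
```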
 
316
  ## When Loop Ends — Final Report
837
317
 
838
318
  - Best version and score
839
319
  - Improvement over baseline (absolute and %)
840
320
  - Total iterations run
841
- - Whether critic was triggered and eval was upgraded
842
- - Whether architect was triggered and what it recommended
843
- - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
321
+ - Key changes made (git log from baseline to current)
322
+ - LangSmith experiment URLs for comparison
323
+ - Suggest: `/evolver:deploy` to finalize