harness-evolver 4.0.3 → 4.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
3
  "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
4
- "version": "4.0.3",
4
+ "version": "4.2.0",
5
5
  "author": {
6
6
  "name": "Raphael Valdetaro"
7
7
  },
package/README.md CHANGED
@@ -67,8 +67,8 @@ claude
67
67
  <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
68
68
  </tr>
69
69
  <tr>
70
- <td><b>5 Adaptive Proposers</b></td>
71
- <td>Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis. Quality-diversity selection preserves per-task champions.</td>
70
+ <td><b>Self-Organizing Proposers</b></td>
71
+ <td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
72
72
  </tr>
73
73
  <tr>
74
74
  <td><b>Agent-Based Evaluation</b></td>
@@ -88,7 +88,7 @@ claude
88
88
  </tr>
89
89
  <tr>
90
90
  <td><b>Evolution Memory</b></td>
91
- <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which strategies win, which failures recur, and promotes insights after 2+ occurrences.</td>
91
+ <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
92
92
  </tr>
93
93
  <tr>
94
94
  <td><b>Smart Gating</b></td>
@@ -107,7 +107,7 @@ claude
107
107
  | Command | What it does |
108
108
  |---|---|
109
109
  | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
110
- | `/evolver:evolve` | Run the optimization loop (5 parallel proposers in worktrees) |
110
+ | `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
111
111
  | `/evolver:status` | Show progress, scores, history |
112
112
  | `/evolver:deploy` | Tag, push, clean up temporary files |
113
113
 
@@ -117,7 +117,7 @@ claude
117
117
 
118
118
  | Agent | Role | Color |
119
119
  |---|---|---|
120
- | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
120
+ | **Proposer** | Self-organizing — investigates a data-driven lens, decides its own approach, may abstain | Green |
121
121
  | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
122
122
  | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
123
123
  | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |
@@ -134,10 +134,10 @@ claude
134
134
  +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
135
135
  +- 1. Read state (.evolver.json + LangSmith experiments)
136
136
  +- 1.5 Gather trace insights (cluster errors, tokens, latency)
137
- +- 1.8 Analyze per-task failures (adaptive briefings)
138
- +- 1.8a Synthesize strategy document (coordinator synthesis)
137
+ +- 1.8 Analyze per-task failures
138
+ +- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
139
139
  +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
140
- +- 2. Spawn 5 proposers in parallel (each in a git worktree)
140
+ +- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
141
141
  +- 3. Run target for each candidate (code-based evaluators)
142
142
  +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
143
143
  +- 4. Compare experiments -> select winner + per-task champion
@@ -165,7 +165,7 @@ Skills (markdown)
165
165
  └── /evolver:deploy → tags and pushes
166
166
 
167
167
  Agents (markdown)
168
- ├── Proposer (x5) → modifies code in isolated git worktrees
168
+ ├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
169
169
  ├── Evaluator → LLM-as-judge via langsmith-cli
170
170
  ├── Critic → detects gaming + implements stricter evaluators
171
171
  ├── Architect → ULTRAPLAN deep analysis (opus model)
@@ -182,7 +182,7 @@ Tools (Python + langsmith SDK)
182
182
  ├── iteration_gate.py → three-gate iteration triggers
183
183
  ├── regression_tracker.py → tracks regressions, adds guard examples
184
184
  ├── consolidate.py → cross-iteration memory consolidation
185
- ├── synthesize_strategy.py→ generates strategy document for proposers
185
+ ├── synthesize_strategy.py→ generates strategy document + investigation lenses
186
186
  ├── add_evaluator.py → programmatically adds evaluators
187
187
  └── adversarial_inject.py → detects memorization, injects adversarial tests
188
188
  ```
@@ -217,6 +217,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
217
217
  ## References
218
218
 
219
219
  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
220
+ - [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
220
221
  - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
221
222
  - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
222
223
  - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
@@ -1,75 +1,56 @@
1
1
  ---
2
2
  name: evolver-proposer
3
3
  description: |
4
- Use this agent to propose improvements to an LLM agent's code.
5
- Works in an isolated git worktree — modifies real code, not a harness wrapper.
6
- Spawned by the evolve skill with a strategy (exploit/explore/crossover/failure-targeted).
4
+ Self-organizing agent optimizer. Investigates a data-driven lens (question),
5
+ decides its own approach, and modifies real code in an isolated git worktree.
6
+ May self-abstain if it cannot add meaningful value.
7
7
  tools: Read, Write, Edit, Bash, Glob, Grep
8
8
  color: green
9
9
  permissionMode: acceptEdits
10
10
  ---
11
11
 
12
- # Evolver — Proposer Agent (v3)
12
+ # Evolver — Self-Organizing Proposer (v4)
13
13
 
14
- You are an LLM agent optimizer. Your job is to modify the user's actual agent code to improve its performance on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.
14
+ You are an LLM agent optimizer. Your job is to improve the user's agent code to score higher on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.
15
15
 
16
16
  ## Bootstrap
17
17
 
18
- Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
18
+ Your prompt contains `<files_to_read>`, `<context>`, and `<lens>` blocks. You MUST:
19
19
  1. Read every file listed in `<files_to_read>` using the Read tool
20
20
  2. Parse the `<context>` block for current scores, failing examples, and framework info
21
- 3. Read the `<strategy>` block for your assigned approach
21
+ 3. Read the `<lens>` block — this is your investigation starting point
22
22
 
23
23
  ## Turn Budget
24
24
 
25
- You have a maximum of **16 turns** to complete your proposal. Budget them:
26
- - Turns 1-3: Orient (read files, understand codebase)
27
- - Turns 4-6: Diagnose (read insights, identify targets)
28
- - Turns 7-12: Implement (make changes, consult docs)
29
- - Turns 13-14: Test (verify changes don't break the entry point)
30
- - Turns 15-16: Commit and document
25
+ You have a maximum of **16 turns**. You decide how to allocate them. General guidance:
26
+ - Spend early turns reading context and investigating your lens question
27
+ - Spend middle turns implementing changes and consulting documentation
28
+ - Reserve final turns for committing and writing proposal.md
31
29
 
32
30
  **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.
33
31
 
34
32
  **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.
35
33
 
36
- ## Strategy Injection
34
+ ## Lens Protocol
37
35
 
38
- Your prompt contains a `<strategy>` block. Follow it:
39
- - **exploitation**: Conservative fix on current best. Focus on specific failing examples.
40
- - **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
41
- - **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
42
- - **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
43
- - **creative**: Try something unexpected — different libraries, architecture, algorithms.
44
- - **efficiency**: Same quality but fewer tokens, faster latency, simpler code.
36
+ Your prompt contains a `<lens>` block with an **investigation question**. This is your starting point, not your mandate.
45
37
 
46
- If no strategy block is present, default to exploitation.
38
+ 1. **Investigate** — dig into the data relevant to the lens question (trace insights, failing examples, code)
39
+ 2. **Hypothesize** — form your own theory about what to change
40
+ 3. **Decide** — choose your approach freely. You may end up solving something completely different from what the lens asks. That's fine.
41
+ 4. **Implement or Abstain** — if you can add meaningful value, implement and commit. If not, abstain.
47
42
 
48
- ## Your Workflow
49
-
50
- ### Phase 1: Orient
51
-
52
- Read .evolver.json to understand:
53
- - What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
54
- - What's the entry point?
55
- - What evaluators are active? (correctness, conciseness, latency, etc.)
56
- - What's the current best score?
43
+ You are NOT constrained to the lens topic. The lens gives you a starting perspective. Your actual approach is yours to decide.
57
44
 
58
- ### Phase 2: Diagnose
45
+ ## Your Workflow
59
46
 
60
- Read trace_insights.json and best_results.json to understand:
61
- - Which examples are failing and why?
62
- - What error patterns exist?
63
- - Are there token/latency issues?
47
+ There are no fixed phases. Use your judgment to allocate turns. A typical flow:
64
48
 
65
- If production_seed.json exists, read it to understand real-world usage:
66
- - What do real user inputs look like?
67
- - What are the common error patterns in production?
68
- - Which query types get the most traffic?
49
+ **Orient** — Read .evolver.json, strategy.md, evolution_memory.md. Understand the framework, entry point, evaluators, current score, and what has been tried before.
69
50
 
70
- ### Phase 3: Propose Changes
51
+ **Investigate** — Read trace_insights.json and best_results.json. Understand which examples fail and why. If production_seed.json exists, understand real-world usage patterns. Focus on data relevant to your lens question.
71
52
 
72
- Based on your strategy and diagnosis, modify the code:
53
+ **Decide** — Based on investigation, decide what to change. Consider:
73
54
  - **Prompts**: system prompts, few-shot examples, output format instructions
74
55
  - **Routing**: how queries are dispatched to different handlers
75
56
  - **Tools**: tool definitions, tool selection logic
@@ -77,7 +58,23 @@ Based on your strategy and diagnosis, modify the code:
77
58
  - **Error handling**: retry logic, fallback strategies, timeout handling
78
59
  - **Model selection**: which model for which task
79
60
 
80
- ### Phase 3.5: Consult Documentation (MANDATORY)
61
+ ## Self-Abstention
62
+
63
+ If after investigating your lens you conclude you cannot add meaningful value, you may **abstain**. This is a valued contribution — it saves evaluation tokens and signals confidence that the current code handles the lens topic adequately.
64
+
65
+ To abstain, skip implementation and write only a `proposal.md`:
66
+
67
+ ```
68
+ ## ABSTAIN
69
+ - **Lens**: {the question you investigated}
70
+ - **Finding**: {what you discovered during investigation}
71
+ - **Reason**: {why you're abstaining}
72
+ - **Suggested focus**: {optional — what future iterations should look at}
73
+ ```
74
+
75
+ Then end with the return protocol using `ABSTAIN` as your approach.
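A filled-in abstention might look like this (hypothetical lens and findings, shown for illustration only):

```
## ABSTAIN
- **Lens**: Are retry loops masking tool-call failures?
- **Finding**: Retries fire on only 2 of 40 examples, and both eventually succeed.
- **Reason**: Retry behavior is not a meaningful failure source; changing it would be churn.
- **Suggested focus**: The two lowest-scoring examples both involve multi-step date arithmetic.
```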
76
+
77
+ ### Consult Documentation (MANDATORY)
81
78
 
82
79
  **Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.
83
80
 
@@ -106,7 +103,7 @@ Ask about the SPECIFIC API you're going to use or change.
106
103
 
107
104
  **If Context7 MCP is not available:** Note in proposal.md "API patterns not verified against current docs — verify before deploying."
108
105
 
109
- ### Phase 4: Commit and Document
106
+ ### Commit and Document
110
107
 
111
108
  1. **Commit all changes** with a descriptive message:
112
109
  ```bash
@@ -143,7 +140,7 @@ Prioritize changes that fix real production failures over synthetic test failure
143
140
  ## Rules
144
141
 
145
142
  1. **Read before writing** — understand the code before changing it
146
- 2. **Minimal changes** — change only what's needed for your strategy
143
+ 2. **Focused changes** — change what's needed based on your investigation. Don't scatter changes across unrelated files.
147
144
  3. **Don't break the interface** — the agent must still be runnable with the same command
148
145
  4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
149
146
  5. **Write proposal.md** — the evolve skill reads this to understand what you did
@@ -153,8 +150,9 @@ Prioritize changes that fix real production failures over synthetic test failure
153
150
  When done, end your response with:
154
151
 
155
152
  ## PROPOSAL COMPLETE
156
- - **Version**: v{NNN}{suffix}
157
- - **Strategy**: {strategy}
153
+ - **Version**: v{NNN}-{id}
154
+ - **Lens**: {the investigation question}
155
+ - **Approach**: {what you chose to do and why — free text, your own words}
158
156
  - **Changes**: {brief list of files changed}
159
157
  - **Expected impact**: {which evaluators/examples should improve}
160
158
  - **Files modified**: {count}
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
- "version": "4.0.3",
3
+ "version": "4.2.0",
4
4
  "description": "LangSmith-native autonomous agent optimization for Claude Code",
5
5
  "author": "Raphael Valdetaro",
6
6
  "license": "MIT",
@@ -127,6 +127,122 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ
127
127
  - "Fix and retry" — attempt auto-fix with `--fix` flag
128
128
  - "Abort" — stop the evolution loop
129
129
 
130
+ ### 0.6. Dataset Health Check
131
+
132
+ Run the dataset health diagnostic:
133
+
134
+ ```bash
135
+ $EVOLVER_PY $TOOLS/dataset_health.py \
136
+ --config .evolver.json \
137
+ --production-seed production_seed.json \
138
+ --output health_report.json 2>/dev/null
139
+ ```
140
+
141
+ Read `health_report.json`. Print summary:
142
+ ```bash
143
+ python3 -c "
144
+ import json, os
145
+ if os.path.exists('health_report.json'):
146
+ r = json.load(open('health_report.json'))
147
+ print(f'Dataset Health: {r[\"health_score\"]}/10 ({r[\"example_count\"]} examples)')
148
+ for issue in r.get('issues', []):
149
+ print(f' [{issue[\"severity\"]}] {issue[\"message\"]}')
150
+ "
151
+ ```
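For reference, a `health_report.json` consistent with the fields read above and in Step 0.7 might look like this (illustrative values — the authoritative schema is whatever `dataset_health.py` writes):

```json
{
  "health_score": 6,
  "example_count": 42,
  "issues": [
    {"severity": "warn", "message": "No train/held_out splits assigned"}
  ],
  "corrections": [
    {"action": "create_splits"}
  ],
  "dead_examples": {"ids": ["a1b2c3"]}
}
```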
152
+
153
+ ### 0.7. Auto-Correct Dataset Issues
154
+
155
+ If `health_report.json` has corrections, apply them automatically:
156
+
157
+ ```bash
158
+ CORRECTIONS=$(python3 -c "
159
+ import json, os
160
+ if os.path.exists('health_report.json'):
161
+ r = json.load(open('health_report.json'))
162
+ for c in r.get('corrections', []):
163
+ print(c['action'])
164
+ " 2>/dev/null)
165
+ ```
166
+
167
+ For each correction:
168
+
169
+ **If `create_splits`**: Run inline Python to assign 70/30 splits:
170
+ ```bash
171
+ $EVOLVER_PY -c "
172
+ from langsmith import Client
173
+ import json, random
174
+ client = Client()
175
+ config = json.load(open('.evolver.json'))
176
+ examples = list(client.list_examples(dataset_name=config['dataset']))
177
+ random.shuffle(examples)
178
+ sp = int(len(examples) * 0.7)
179
+ for ex in examples[:sp]:
180
+ client.update_example(ex.id, split='train')
181
+ for ex in examples[sp:]:
182
+ client.update_example(ex.id, split='held_out')
183
+ print(f'Assigned splits: {sp} train, {len(examples)-sp} held_out')
184
+ "
185
+ ```
186
+
187
+ **If `generate_hard`**: Spawn testgen agent with hard-mode instruction:
188
+ ```
189
+ Agent(
190
+ subagent_type: "evolver-testgen",
191
+ description: "Generate hard examples to rebalance dataset",
192
+ prompt: |
193
+ <objective>
194
+ The dataset is skewed toward easy examples. Generate {count} HARD examples
195
+ that the current agent is likely to fail on.
196
+ Focus on: edge cases, adversarial inputs, complex multi-step queries,
197
+ ambiguous questions, and inputs that require deep reasoning.
198
+ </objective>
199
+ <files_to_read>
200
+ - .evolver.json
201
+ - strategy.md (if exists)
202
+ - production_seed.json (if exists)
203
+ </files_to_read>
204
+ )
205
+ ```
206
+
207
+ **If `fill_coverage`**: Spawn testgen agent with coverage-fill instruction:
208
+ ```
209
+ Agent(
210
+ subagent_type: "evolver-testgen",
211
+ description: "Generate examples for missing categories",
212
+ prompt: |
213
+ <objective>
214
+ The dataset is missing these production categories: {categories}.
215
+ Generate 5 examples per missing category.
216
+ Use production_seed.json for real-world patterns in these categories.
217
+ </objective>
218
+ <files_to_read>
219
+ - .evolver.json
220
+ - production_seed.json (if exists)
221
+ </files_to_read>
222
+ )
223
+ ```
224
+
225
+ **If `retire_dead`**: Move dead examples to retired split:
226
+ ```bash
227
+ $EVOLVER_PY -c "
228
+ from langsmith import Client
229
+ import json
230
+ client = Client()
231
+ report = json.load(open('health_report.json'))
232
+ dead_ids = report.get('dead_examples', {}).get('ids', [])
233
+ config = json.load(open('.evolver.json'))
234
+ examples = {str(e.id): e for e in client.list_examples(dataset_name=config['dataset'])}
235
+ retired = 0
236
+ for eid in dead_ids:
237
+ if eid in examples:
238
+ client.update_example(examples[eid].id, split='retired')
239
+ retired += 1
240
+ print(f'Retired {retired} dead examples')
241
+ "
242
+ ```
243
+
244
+ After corrections, log what was done. Do NOT re-run health check (corrections may need an experiment cycle to show effect).
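The correction handling above can be sketched as a small dispatcher (illustrative only — in the skill, `create_splits`/`retire_dead` run inline Python against LangSmith, while `generate_hard`/`fill_coverage` spawn testgen agents):

```python
# The four correction actions this skill version knows how to apply.
KNOWN_ACTIONS = {"create_splits", "generate_hard", "fill_coverage", "retire_dead"}

def corrections_to_apply(report: dict) -> list[str]:
    """Return known correction actions in report order, skipping unknowns."""
    return [c["action"] for c in report.get("corrections", [])
            if c["action"] in KNOWN_ACTIONS]

report = {"corrections": [{"action": "create_splits"},
                          {"action": "future_action"},
                          {"action": "retire_dead"}]}
print(corrections_to_apply(report))
```

Unknown actions are skipped rather than failing the loop, so older skill versions degrade gracefully as `dataset_health.py` grows new corrections.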
245
+
130
246
  For each iteration:
131
247
 
132
248
  ### 1. Get Next Version
@@ -170,12 +286,13 @@ if [ -n "$BEST" ]; then
170
286
  $EVOLVER_PY $TOOLS/read_results.py \
171
287
  --experiment "$BEST" \
172
288
  --config .evolver.json \
289
+ --split train \
173
290
  --output best_results.json 2>/dev/null
174
291
  fi
175
292
  ```
176
293
 
177
294
  If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
178
- Generate adaptive briefings for Candidates D and E (same logic as v2).
295
+ This failure data feeds into `synthesize_strategy.py`, which generates targeted lenses for proposers.
179
296
  If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
180
297
 
181
298
  ### 1.8a. Synthesize Strategy
@@ -189,17 +306,18 @@ $EVOLVER_PY $TOOLS/synthesize_strategy.py \
189
306
  --best-results best_results.json \
190
307
  --evolution-memory evolution_memory.json \
191
308
  --production-seed production_seed.json \
192
- --output strategy.md 2>/dev/null
309
+ --output strategy.md \
310
+ --lenses lenses.json 2>/dev/null
193
311
  ```
194
312
 
195
- The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). It synthesizes trace analysis, evolution memory, and production data into an actionable document. Proposers also receive `production_seed.json` directly for access to raw production traces.
313
+ The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). The `lenses.json` file contains dynamically generated investigation questions — one per proposer. Each lens directs a proposer's attention to a different aspect of the problem (failure cluster, architecture, production data, evolution memory, or open investigation).
196
314
 
197
315
  ### 1.9. Prepare Shared Proposer Context
198
316
 
199
- Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning 5 proposers costs barely more than 1.
317
+ Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning N proposers costs barely more than 1.
200
318
 
201
319
  ```bash
202
- # Build shared context block (identical for all 5 proposers)
320
+ # Build shared context block (identical for all proposers)
203
321
  SHARED_FILES_BLOCK="<files_to_read>
204
322
  - .evolver.json
205
323
  - strategy.md (if exists)
@@ -223,13 +341,19 @@ You are working in an isolated git worktree — modify any file freely.
223
341
  </objective>"
224
342
  ```
225
343
 
226
- **CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all 5 proposer prompts. Only the `<strategy>` block differs. Place the strategy block LAST in the prompt so the shared prefix is maximized.
344
+ **CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all proposer prompts. Only the `<lens>` block differs. Place the lens block LAST in the prompt so the shared prefix is maximized.
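A sketch of the intended prompt layout (names are illustrative; the real prompts are assembled by the skill as described above). The point is that every byte before the `<lens>` block is identical across proposers, so the provider can reuse the cached prefix:

```python
def build_proposer_prompts(shared_prefix: str, lenses: list[dict]) -> list[str]:
    """Append each proposer's unique <lens> block to the identical shared prefix."""
    return [
        shared_prefix
        + "\n<lens>\n"
        + f"Investigation question: {lens['question']}\n"
        + f"Source: {lens['source']}\n"
        + "</lens>\n"
        for lens in lenses
    ]

lenses = [
    {"question": "Why do multi-step queries fail?", "source": "failure_cluster"},
    {"question": "Is routing too coarse-grained?", "source": "architecture"},
]
prompts = build_proposer_prompts("<objective>...</objective>\n<context>...</context>", lenses)
# Every prompt shares the identical cacheable prefix; only the suffix differs.
assert all(p.split("<lens>")[0] == prompts[0].split("<lens>")[0] for p in prompts)
```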
345
+
346
+ ### 2. Spawn Proposers in Parallel (Dynamic Lenses)
347
+
348
+ Read `lenses.json` to get the list of investigation lenses:
227
349
 
228
- ### 2. Spawn 5 Proposers in Parallel
350
+ ```bash
351
+ LENS_COUNT=$(python3 -c "import json; print(json.load(open('lenses.json'))['lens_count'])")
352
+ ```
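The per-lens fields referenced below (`{lens.id}`, `{lens.question}`, `{lens.source}`) suggest a shape like the following (illustrative — the authoritative schema is whatever `synthesize_strategy.py` writes):

```json
{
  "lens_count": 2,
  "lenses": [
    {"id": 1, "question": "Why do multi-step queries fail?", "source": "failure_cluster"},
    {"id": 2, "question": "Is routing too coarse-grained?", "source": "architecture"}
  ]
}
```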
229
353
 
230
- Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique strategy suffix.
354
+ Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique lens.
231
355
 
232
- **All 5 candidates** — `run_in_background: true, isolation: "worktree"`:
356
+ **For each lens** — `run_in_background: true, isolation: "worktree"`:
233
357
 
234
358
  The prompt for EACH proposer follows this structure:
235
359
  ```
@@ -239,66 +363,54 @@ The prompt for EACH proposer follows this structure:
239
363
 
240
364
  {SHARED_CONTEXT_BLOCK}
241
365
 
242
- <strategy>
243
- {UNIQUE PER CANDIDATE — see below}
244
- </strategy>
366
+ <lens>
367
+ Investigation question: {lens.question}
245
368
 
246
- <output>
247
- 1. Modify the code to improve performance
248
- 2. Commit your changes with a descriptive message
249
- 3. Write proposal.md explaining what you changed and why
250
- </output>
251
- ```
369
+ This is your STARTING POINT, not your mandate. Investigate, form your
370
+ own hypothesis, and implement whatever you conclude will help most.
371
+ You may solve something entirely different — that's fine.
372
+ If you cannot add meaningful value, ABSTAIN.
252
373
 
253
- **Candidate A strategy block:**
254
- ```
255
- APPROACH: exploitation
256
- Make targeted improvements to the current best version.
257
- Focus on the specific failures identified in the results.
258
- ```
259
-
260
- **Candidate B strategy block:**
261
- ```
262
- APPROACH: exploration
263
- Try a fundamentally different approach. Change algorithms, prompts, routing, architecture.
264
- Don't be afraid to make big changes — this worktree is disposable.
265
- ```
374
+ Source: {lens.source}
375
+ </lens>
266
376
 
267
- **Candidate C strategy block:**
268
- ```
269
- APPROACH: crossover
270
- Combine strengths from previous iterations. Check git log for what was tried.
271
- Recent changes: {git_log_last_5}
377
+ <output>
378
+ 1. Investigate the lens question
379
+ 2. Decide your approach (or abstain)
380
+ 3. If proceeding: modify code, commit, write proposal.md
381
+ 4. proposal.md must include: what you chose to do, why, how it relates to the lens
382
+ </output>
272
383
  ```
273
384
 
274
- **Candidate D strategy block:**
275
- ```
276
- APPROACH: {failure_targeted_or_creative}
277
- {adaptive_briefing_d}
278
- ```
385
+ For each lens in `lenses.json`, spawn one proposer agent:
279
386
 
280
- **Candidate E strategy block:**
281
387
  ```
282
- APPROACH: {failure_targeted_or_efficiency}
283
- {adaptive_briefing_e}
388
+ Agent(
389
+ subagent_type: "evolver-proposer",
390
+ description: "Proposer {lens.id}: {lens.source} lens",
391
+ isolation: "worktree",
392
+ run_in_background: true,
393
+ prompt: {SHARED_PREFIX + LENS_BLOCK above, with lens fields filled in}
394
+ )
284
395
  ```
285
396
 
286
- Wait for all 5 to complete.
397
+ Wait for all proposers to complete.
287
398
 
288
399
  **Stuck proposer detection**: If any proposer hasn't completed after 10 minutes, it may be stuck in a loop. The Claude Code runtime handles this via the agent's turn limit. If a proposer returns without committing changes, skip it — don't retry.
289
400
 
290
- After all proposers complete, check which ones actually committed:
401
+ After all proposers complete, check which ones committed and which abstained:
291
402
 
292
403
  ```bash
293
404
  for WORKTREE in {worktree_paths}; do
294
- CHANGES=$(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l)
295
- if [ "$CHANGES" -eq 0 ]; then
405
+ if [ -f "$WORKTREE/proposal.md" ] && grep -q "## ABSTAIN" "$WORKTREE/proposal.md" 2>/dev/null; then
406
+ echo "Proposer in $WORKTREE abstained — skipping evaluation"
407
+ elif [ $(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l) -eq 0 ]; then
296
408
  echo "Proposer in $WORKTREE made no commits — skipping"
297
409
  fi
298
410
  done
299
411
  ```
300
412
 
301
- Only run evaluation (Step 3) for proposers that committed changes.
413
+ Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).
302
414
 
303
415
  ### 3. Run Target for Each Candidate
304
416
 
@@ -308,7 +420,7 @@ For each worktree that has changes (proposer committed something):
308
420
  $EVOLVER_PY $TOOLS/run_eval.py \
309
421
  --config .evolver.json \
310
422
  --worktree-path {worktree_path} \
311
- --experiment-prefix v{NNN}{suffix} \
423
+ --experiment-prefix v{NNN}-{lens_id} \
312
424
  --timeout 120
313
425
  ```
314
426
 
@@ -339,11 +451,7 @@ Agent(
339
451
  prompt: |
340
452
  <experiment>
341
453
  Evaluate the following experiments (one per candidate):
342
- - {experiment_name_a}
343
- - {experiment_name_b}
344
- - {experiment_name_c}
345
- - {experiment_name_d}
346
- - {experiment_name_e}
454
+ {list all experiment names from proposers that committed changes — skip abstained}
347
455
  </experiment>
348
456
 
349
457
  <evaluators>
@@ -370,14 +478,14 @@ Wait for the evaluator agent to complete before proceeding.
370
478
 
371
479
  ```bash
372
480
  $EVOLVER_PY $TOOLS/read_results.py \
373
- --experiments "v{NNN}a,v{NNN}b,v{NNN}c,v{NNN}d,v{NNN}e" \
481
+ --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
374
482
  --config .evolver.json \
375
483
  --output comparison.json
376
484
  ```
377
485
 
378
486
  Parse `comparison.json`:
379
487
  - `comparison.winner` — highest combined score
380
- - `comparison.champion` — per-task champion (for next crossover)
488
+ - `comparison.champion` — per-task champion (for next iteration's context)
381
489
  - `comparison.all_candidates` — all scores for reporting
382
490
 
383
491
  ### 5. Merge Winner
@@ -389,7 +497,7 @@ If the winner scored higher than the current best:
389
497
  WINNER_BRANCH={winning_worktree_branch}
390
498
 
391
499
  # Merge into main
392
- git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{suffix} (score: {score})"
500
+ git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}-{lens_id} (score: {score})"
393
501
  ```
394
502
 
395
503
  Update `.evolver.json`:
@@ -409,14 +517,14 @@ json.dump(c, open('.evolver.json', 'w'), indent=2)
409
517
 
410
518
  Report ALL candidates:
411
519
  ```
412
- Iteration {i}/{N} — 5 candidates evaluated:
413
- v{NNN}a (exploit): {score_a} — {summary}
414
- v{NNN}b (explore): {score_b} — {summary}
415
- v{NNN}c (crossover): {score_c} — {summary}
416
- v{NNN}d ({strategy}): {score_d} — {summary}
417
- v{NNN}e ({strategy}): {score_e} — {summary}
520
+ Iteration {i}/{N} — {lens_count} lenses, {evaluated_count} candidates evaluated ({abstained_count} abstained):
521
+ {For each proposer, read proposal.md and extract the Approach field}
522
+ v{NNN}-1 ({approach from proposal.md}): {score} — {summary}
523
+ v{NNN}-2 ({approach from proposal.md}): {score} — {summary}
524
+ v{NNN}-3 (ABSTAINED): -- — {reason from proposal.md}
525
+ ...
418
526
 
419
- Winner: v{NNN}{suffix} ({score}) — merged into main
527
+ Winner: v{NNN}-{id} ({score}) — merged into main
420
528
  Per-task champion: {champion} (beats winner on {N} tasks)
421
529
  ```
422
530
 
@@ -145,15 +145,20 @@ def generate_adversarial_inputs(client, dataset_name, num_inputs=5):
145
145
  return adversarial
146
146
 
147
147
 
148
- def inject_adversarial(client, dataset_id, adversarial_inputs):
148
+ def inject_adversarial(client, dataset_id, adversarial_inputs, config=None):
149
149
  """Add adversarial examples to dataset."""
150
+ config = config or {}
150
151
  added = 0
151
152
  for adv in adversarial_inputs:
152
153
  try:
154
+ split = "train" if random.random() < 0.7 else "held_out"
155
+ metadata = dict(adv["metadata"])
156
+ metadata["added_at_iteration"] = config.get("iterations", 0)
153
157
  client.create_example(
154
158
  inputs=adv["inputs"],
155
159
  dataset_id=dataset_id,
156
- metadata=adv["metadata"],
160
+ metadata=metadata,
161
+ split=split,
157
162
  )
158
163
  added += 1
159
164
  except Exception as e:
@@ -182,7 +187,7 @@ def main():
182
187
 
183
188
  injected = 0
184
189
  if args.inject and adversarial:
185
- injected = inject_adversarial(client, config["dataset_id"], adversarial)
190
+ injected = inject_adversarial(client, config["dataset_id"], adversarial, config=config)
186
191
 
187
192
  result = {
188
193
  "memorization_suspects": len(suspicious),