harness-evolver 4.0.3 → 4.1.0

@@ -1,7 +1,7 @@
  {
  "name": "harness-evolver",
  "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
- "version": "4.0.3",
+ "version": "4.1.0",
  "author": {
  "name": "Raphael Valdetaro"
  },
package/README.md CHANGED
@@ -67,8 +67,8 @@ claude
  <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
  </tr>
  <tr>
- <td><b>5 Adaptive Proposers</b></td>
- <td>Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis. Quality-diversity selection preserves per-task champions.</td>
+ <td><b>Self-Organizing Proposers</b></td>
+ <td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
  </tr>
  <tr>
  <td><b>Agent-Based Evaluation</b></td>
@@ -88,7 +88,7 @@ claude
  </tr>
  <tr>
  <td><b>Evolution Memory</b></td>
- <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which strategies win, which failures recur, and promotes insights after 2+ occurrences.</td>
+ <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
  </tr>
  <tr>
  <td><b>Smart Gating</b></td>
@@ -107,7 +107,7 @@ claude
  | Command | What it does |
  |---|---|
  | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
- | `/evolver:evolve` | Run the optimization loop (5 parallel proposers in worktrees) |
+ | `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
  | `/evolver:status` | Show progress, scores, history |
  | `/evolver:deploy` | Tag, push, clean up temporary files |

@@ -117,7 +117,7 @@ claude

  | Agent | Role | Color |
  |---|---|---|
- | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
+ | **Proposer** | Self-organizing — investigates a data-driven lens, decides its own approach, may abstain | Green |
  | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
  | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
  | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |
@@ -134,10 +134,10 @@ claude
  +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
  +- 1. Read state (.evolver.json + LangSmith experiments)
  +- 1.5 Gather trace insights (cluster errors, tokens, latency)
- +- 1.8 Analyze per-task failures (adaptive briefings)
- +- 1.8a Synthesize strategy document (coordinator synthesis)
+ +- 1.8 Analyze per-task failures
+ +- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
  +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
- +- 2. Spawn 5 proposers in parallel (each in a git worktree)
+ +- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
  +- 3. Run target for each candidate (code-based evaluators)
  +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
  +- 4. Compare experiments -> select winner + per-task champion
@@ -165,7 +165,7 @@ Skills (markdown)
  └── /evolver:deploy → tags and pushes

  Agents (markdown)
- ├── Proposer (x5) → modifies code in isolated git worktrees
+ ├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
  ├── Evaluator → LLM-as-judge via langsmith-cli
  ├── Critic → detects gaming + implements stricter evaluators
  ├── Architect → ULTRAPLAN deep analysis (opus model)
@@ -182,7 +182,7 @@ Tools (Python + langsmith SDK)
  ├── iteration_gate.py → three-gate iteration triggers
  ├── regression_tracker.py → tracks regressions, adds guard examples
  ├── consolidate.py → cross-iteration memory consolidation
- ├── synthesize_strategy.py → generates strategy document for proposers
+ ├── synthesize_strategy.py → generates strategy document + investigation lenses
  ├── add_evaluator.py → programmatically adds evaluators
  └── adversarial_inject.py → detects memorization, injects adversarial tests
  ```
@@ -217,6 +217,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
  ## References

  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
+ - [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
  - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
  - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
  - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
@@ -1,75 +1,56 @@
  ---
  name: evolver-proposer
  description: |
- Use this agent to propose improvements to an LLM agent's code.
- Works in an isolated git worktree — modifies real code, not a harness wrapper.
- Spawned by the evolve skill with a strategy (exploit/explore/crossover/failure-targeted).
+ Self-organizing agent optimizer. Investigates a data-driven lens (question),
+ decides its own approach, and modifies real code in an isolated git worktree.
+ May self-abstain if it cannot add meaningful value.
  tools: Read, Write, Edit, Bash, Glob, Grep
  color: green
  permissionMode: acceptEdits
  ---

- # Evolver — Proposer Agent (v3)
+ # Evolver — Self-Organizing Proposer (v4)

- You are an LLM agent optimizer. Your job is to modify the user's actual agent code to improve its performance on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.
+ You are an LLM agent optimizer. Your job is to improve the user's agent code so it scores higher on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.

  ## Bootstrap

- Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
+ Your prompt contains `<files_to_read>`, `<context>`, and `<lens>` blocks. You MUST:
  1. Read every file listed in `<files_to_read>` using the Read tool
  2. Parse the `<context>` block for current scores, failing examples, and framework info
- 3. Read the `<strategy>` block for your assigned approach
+ 3. Read the `<lens>` block — this is your investigation starting point

  ## Turn Budget

- You have a maximum of **16 turns** to complete your proposal. Budget them:
- - Turns 1-3: Orient (read files, understand codebase)
- - Turns 4-6: Diagnose (read insights, identify targets)
- - Turns 7-12: Implement (make changes, consult docs)
- - Turns 13-14: Test (verify changes don't break the entry point)
- - Turns 15-16: Commit and document
+ You have a maximum of **16 turns**. You decide how to allocate them. General guidance:
+ - Spend early turns reading context and investigating your lens question
+ - Spend middle turns implementing changes and consulting documentation
+ - Reserve final turns for committing and writing proposal.md

  **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.

  **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.

- ## Strategy Injection
+ ## Lens Protocol

- Your prompt contains a `<strategy>` block. Follow it:
- - **exploitation**: Conservative fix on current best. Focus on specific failing examples.
- - **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
- - **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
- - **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
- - **creative**: Try something unexpected — different libraries, architecture, algorithms.
- - **efficiency**: Same quality but fewer tokens, faster latency, simpler code.
+ Your prompt contains a `<lens>` block with an **investigation question**. This is your starting point, not your mandate.

- If no strategy block is present, default to exploitation.
+ 1. **Investigate** — dig into the data relevant to the lens question (trace insights, failing examples, code)
+ 2. **Hypothesize** — form your own theory about what to change
+ 3. **Decide** — choose your approach freely. You may end up solving something completely different from what the lens asks. That's fine.
+ 4. **Implement or Abstain** — if you can add meaningful value, implement and commit. If not, abstain.

- ## Your Workflow
-
- ### Phase 1: Orient
-
- Read .evolver.json to understand:
- - What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
- - What's the entry point?
- - What evaluators are active? (correctness, conciseness, latency, etc.)
- - What's the current best score?
+ You are NOT constrained to the lens topic. The lens gives you a starting perspective. Your actual approach is yours to decide.

- ### Phase 2: Diagnose
+ ## Your Workflow

- Read trace_insights.json and best_results.json to understand:
- - Which examples are failing and why?
- - What error patterns exist?
- - Are there token/latency issues?
+ There are no fixed phases. Use your judgment to allocate turns. A typical flow:

- If production_seed.json exists, read it to understand real-world usage:
- - What do real user inputs look like?
- - What are the common error patterns in production?
- - Which query types get the most traffic?
+ **Orient** — Read .evolver.json, strategy.md, evolution_memory.md. Understand the framework, entry point, evaluators, current score, and what has been tried before.

- ### Phase 3: Propose Changes
+ **Investigate** — Read trace_insights.json and best_results.json. Understand which examples fail and why. If production_seed.json exists, understand real-world usage patterns. Focus on data relevant to your lens question.

- Based on your strategy and diagnosis, modify the code:
+ **Decide** — Based on investigation, decide what to change. Consider:
  - **Prompts**: system prompts, few-shot examples, output format instructions
  - **Routing**: how queries are dispatched to different handlers
  - **Tools**: tool definitions, tool selection logic
@@ -77,7 +58,23 @@ Based on your strategy and diagnosis, modify the code:
  - **Error handling**: retry logic, fallback strategies, timeout handling
  - **Model selection**: which model for which task

- ### Phase 3.5: Consult Documentation (MANDATORY)
+ ## Self-Abstention
+
+ If, after investigating your lens, you conclude you cannot add meaningful value, you may **abstain**. This is a valued contribution — it saves evaluation tokens and signals confidence that the current code handles the lens topic adequately.
+
+ To abstain, skip implementation and write only a `proposal.md`:
+
+ ```
+ ## ABSTAIN
+ - **Lens**: {the question you investigated}
+ - **Finding**: {what you discovered during investigation}
+ - **Reason**: {why you're abstaining}
+ - **Suggested focus**: {optional — what future iterations should look at}
+ ```
+
+ Then end with the return protocol using `ABSTAIN` as your approach.
+
+ ### Consult Documentation (MANDATORY)

  **Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.

@@ -106,7 +103,7 @@ Ask about the SPECIFIC API you're going to use or change.

  **If Context7 MCP is not available:** Note in proposal.md "API patterns not verified against current docs — verify before deploying."

- ### Phase 4: Commit and Document
+ ### Commit and Document

  1. **Commit all changes** with a descriptive message:
  ```bash
@@ -143,7 +140,7 @@ Prioritize changes that fix real production failures over synthetic test failure
  ## Rules

  1. **Read before writing** — understand the code before changing it
- 2. **Minimal changes** — change only what's needed for your strategy
+ 2. **Focused changes** — change what's needed based on your investigation. Don't scatter changes across unrelated files.
  3. **Don't break the interface** — the agent must still be runnable with the same command
  4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
  5. **Write proposal.md** — the evolve skill reads this to understand what you did
@@ -153,8 +150,9 @@ Prioritize changes that fix real production failures over synthetic test failure
  When done, end your response with:

  ## PROPOSAL COMPLETE
- - **Version**: v{NNN}{suffix}
- - **Strategy**: {strategy}
+ - **Version**: v{NNN}-{id}
+ - **Lens**: {the investigation question}
+ - **Approach**: {what you chose to do and why — free text, your own words}
  - **Changes**: {brief list of files changed}
  - **Expected impact**: {which evaluators/examples should improve}
  - **Files modified**: {count}
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "4.0.3",
+ "version": "4.1.0",
  "description": "LangSmith-native autonomous agent optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -175,7 +175,7 @@ fi
  ```

  If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
- Generate adaptive briefings for Candidates D and E (same logic as v2).
+ This failure data feeds into `synthesize_strategy.py`, which generates targeted lenses for proposers.
  If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.

  ### 1.8a. Synthesize Strategy
@@ -189,17 +189,18 @@ $EVOLVER_PY $TOOLS/synthesize_strategy.py \
  --best-results best_results.json \
  --evolution-memory evolution_memory.json \
  --production-seed production_seed.json \
- --output strategy.md 2>/dev/null
+ --output strategy.md \
+ --lenses lenses.json 2>/dev/null
  ```

- The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). It synthesizes trace analysis, evolution memory, and production data into an actionable document. Proposers also receive `production_seed.json` directly for access to raw production traces.
+ The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). The `lenses.json` file contains dynamically generated investigation questions, one per proposer. Each lens directs a proposer's attention to a different aspect of the problem (failure cluster, architecture, production data, evolution memory, or open investigation).

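For concreteness, here is a sketch of what a `lenses.json` payload might look like. The field names follow the `generate_lenses()` implementation shown later in this diff; the specific questions, example IDs, and timestamp are invented for illustration:

```python
import json

# Hypothetical lenses.json payload — the shape mirrors generate_lenses(),
# but all concrete values here are made up.
lenses_output = {
    "generated_at": "2026-04-01T12:00:00+00:00",
    "lens_count": 2,
    "lenses": [
        {
            "id": 1,
            "question": "3 examples fail with JSON parse errors — what code change would fix this?",
            "source": "failure_cluster",
            "severity": "high",
            "context": {"examples": ["ex-12", "ex-31", "ex-44"]},
        },
        {
            "id": 2,
            "question": "Open investigation — read all context and investigate what stands out most to you.",
            "source": "open",
            "severity": "medium",
            "context": {},
        },
    ],
}
print(json.dumps(lenses_output, indent=2))
```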
  ### 1.9. Prepare Shared Proposer Context

- Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning 5 proposers costs barely more than 1.
+ Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning N proposers costs barely more than 1.

  ```bash
- # Build shared context block (identical for all 5 proposers)
+ # Build shared context block (identical for all proposers)
  SHARED_FILES_BLOCK="<files_to_read>
  - .evolver.json
  - strategy.md (if exists)
@@ -223,13 +224,19 @@ You are working in an isolated git worktree — modify any file freely.
  </objective>"
  ```

- **CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all 5 proposer prompts. Only the `<strategy>` block differs. Place the strategy block LAST in the prompt so the shared prefix is maximized.
+ **CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all proposer prompts. Only the `<lens>` block differs. Place the lens block LAST in the prompt so the shared prefix is maximized.
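The prefix-sharing rule can be sketched in Python (illustrative only — `build_prompts` and the exact tag layout are assumptions, not part of the package):

```python
import json

def build_prompts(shared_prefix: str, lenses_path: str = "lenses.json") -> list[str]:
    """Assemble one prompt per lens: identical shared prefix, unique <lens> suffix.

    Placing the per-proposer suffix LAST means every prompt shares the longest
    possible byte-identical prefix, which is what makes KV cache reuse effective.
    """
    with open(lenses_path) as f:
        lenses = json.load(f)["lenses"]
    prompts = []
    for lens in lenses:
        suffix = (
            "<lens>\n"
            f"Investigation question: {lens['question']}\n"
            f"Source: {lens['source']}\n"
            "</lens>"
        )
        prompts.append(shared_prefix + "\n\n" + suffix)
    return prompts
```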
 
- ### 2. Spawn 5 Proposers in Parallel
+ ### 2. Spawn Proposers in Parallel (Dynamic Lenses)

- Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique strategy suffix.
+ Read `lenses.json` to get the list of investigation lenses:

- **All 5 candidates** — `run_in_background: true, isolation: "worktree"`:
+ ```bash
+ LENS_COUNT=$(python3 -c "import json; print(json.load(open('lenses.json'))['lens_count'])")
+ ```
+
+ Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique lens.
+
+ **For each lens** — `run_in_background: true, isolation: "worktree"`:

  The prompt for EACH proposer follows this structure:
  ```
@@ -239,66 +246,54 @@ The prompt for EACH proposer follows this structure:

  {SHARED_CONTEXT_BLOCK}

- <strategy>
- {UNIQUE PER CANDIDATE — see below}
- </strategy>
-
- <output>
- 1. Modify the code to improve performance
- 2. Commit your changes with a descriptive message
- 3. Write proposal.md explaining what you changed and why
- </output>
- ```
+ <lens>
+ Investigation question: {lens.question}

- **Candidate A strategy block:**
- ```
- APPROACH: exploitation
- Make targeted improvements to the current best version.
- Focus on the specific failures identified in the results.
- ```
+ This is your STARTING POINT, not your mandate. Investigate, form your
+ own hypothesis, and implement whatever you conclude will help most.
+ You may solve something entirely different — that's fine.
+ If you cannot add meaningful value, ABSTAIN.

- **Candidate B strategy block:**
- ```
- APPROACH: exploration
- Try a fundamentally different approach. Change algorithms, prompts, routing, architecture.
- Don't be afraid to make big changes — this worktree is disposable.
- ```
+ Source: {lens.source}
+ </lens>

- **Candidate C strategy block:**
- ```
- APPROACH: crossover
- Combine strengths from previous iterations. Check git log for what was tried.
- Recent changes: {git_log_last_5}
+ <output>
+ 1. Investigate the lens question
+ 2. Decide your approach (or abstain)
+ 3. If proceeding: modify code, commit, write proposal.md
+ 4. proposal.md must include: what you chose to do, why, how it relates to the lens
+ </output>
  ```

- **Candidate D strategy block:**
- ```
- APPROACH: {failure_targeted_or_creative}
- {adaptive_briefing_d}
- ```
+ For each lens in `lenses.json`, spawn one proposer agent:

- **Candidate E strategy block:**
  ```
- APPROACH: {failure_targeted_or_efficiency}
- {adaptive_briefing_e}
+ Agent(
+   subagent_type: "evolver-proposer",
+   description: "Proposer {lens.id}: {lens.source} lens",
+   isolation: "worktree",
+   run_in_background: true,
+   prompt: {SHARED_PREFIX + LENS_BLOCK above, with lens fields filled in}
+ )
  ```

- Wait for all 5 to complete.
+ Wait for all proposers to complete.

289
283
 
290
- After all proposers complete, check which ones actually committed:
284
+ After all proposers complete, check which ones committed and which abstained:
291
285
 
292
286
  ```bash
293
287
  for WORKTREE in {worktree_paths}; do
294
- CHANGES=$(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l)
295
- if [ "$CHANGES" -eq 0 ]; then
288
+ if [ -f "$WORKTREE/proposal.md" ] && grep -q "## ABSTAIN" "$WORKTREE/proposal.md" 2>/dev/null; then
289
+ echo "Proposer in $WORKTREE abstained skipping evaluation"
290
+ elif [ $(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l) -eq 0 ]; then
296
291
  echo "Proposer in $WORKTREE made no commits — skipping"
297
292
  fi
298
293
  done
299
294
  ```
300
295
 
301
- Only run evaluation (Step 3) for proposers that committed changes.
296
+ Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).
302
297
 
303
298
  ### 3. Run Target for Each Candidate
304
299
 
@@ -308,7 +303,7 @@ For each worktree that has changes (proposer committed something):
308
303
  $EVOLVER_PY $TOOLS/run_eval.py \
309
304
  --config .evolver.json \
310
305
  --worktree-path {worktree_path} \
311
- --experiment-prefix v{NNN}{suffix} \
306
+ --experiment-prefix v{NNN}-{lens_id} \
312
307
  --timeout 120
313
308
  ```
314
309
 
@@ -339,11 +334,7 @@ Agent(
  prompt: |
  <experiment>
  Evaluate the following experiments (one per candidate):
- - {experiment_name_a}
- - {experiment_name_b}
- - {experiment_name_c}
- - {experiment_name_d}
- - {experiment_name_e}
+ {list all experiment names from proposers that committed changes — skip abstained}
  </experiment>

  <evaluators>
@@ -370,14 +361,14 @@ Wait for the evaluator agent to complete before proceeding.

  ```bash
  $EVOLVER_PY $TOOLS/read_results.py \
- --experiments "v{NNN}a,v{NNN}b,v{NNN}c,v{NNN}d,v{NNN}e" \
+ --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
  --config .evolver.json \
  --output comparison.json
  ```

  Parse `comparison.json`:
  - `comparison.winner` — highest combined score
- - `comparison.champion` — per-task champion (for next crossover)
+ - `comparison.champion` — per-task champion (for next iteration's context)
  - `comparison.all_candidates` — all scores for reporting

  ### 5. Merge Winner
@@ -389,7 +380,7 @@ If the winner scored higher than the current best:
  WINNER_BRANCH={winning_worktree_branch}

  # Merge into main
- git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{suffix} (score: {score})"
+ git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}-{lens_id} (score: {score})"
  ```

  Update `.evolver.json`:
@@ -409,14 +400,14 @@ json.dump(c, open('.evolver.json', 'w'), indent=2)

  Report ALL candidates:
  ```
- Iteration {i}/{N} — 5 candidates evaluated:
- v{NNN}a (exploit): {score_a} — {summary}
- v{NNN}b (explore): {score_b} — {summary}
- v{NNN}c (crossover): {score_c} — {summary}
- v{NNN}d ({strategy}): {score_d} — {summary}
- v{NNN}e ({strategy}): {score_e} — {summary}
+ Iteration {i}/{N} — {lens_count} lenses, {evaluated_count} candidates evaluated ({abstained_count} abstained):
+ {For each proposer, read proposal.md and extract the Approach field}
+ v{NNN}-1 ({approach from proposal.md}): {score} — {summary}
+ v{NNN}-2 ({approach from proposal.md}): {score} — {summary}
+ v{NNN}-3 (ABSTAINED): -- — {reason from proposal.md}
+ ...

- Winner: v{NNN}{suffix} ({score}) — merged into main
+ Winner: v{NNN}-{id} ({score}) — merged into main
  Per-task champion: {champion} (beats winner on {N} tasks)
  ```
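Reading the reported approach back out of each `proposal.md` could look like this (a hypothetical helper; the `**Approach**`, `## ABSTAIN`, and `**Reason**` field names follow the proposer's return protocol defined earlier in this diff):

```python
import re

def extract_approach(proposal_text: str) -> str:
    """Pull the reported approach (or abstain reason) out of a proposal.md.

    Illustrative helper, not part of the package — it only assumes the
    field layout of the proposer's return protocol.
    """
    if "## ABSTAIN" in proposal_text:
        m = re.search(r"\*\*Reason\*\*:\s*(.+)", proposal_text)
        return "ABSTAINED: " + (m.group(1).strip() if m else "no reason given")
    m = re.search(r"\*\*Approach\*\*:\s*(.+)", proposal_text)
    return m.group(1).strip() if m else "unknown"
```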
 
@@ -94,24 +94,16 @@ def consolidate(orientation, signals, existing_memory=None):
      """Phase 3: Merge signals into consolidated memory."""
      insights = []

-     # Strategy effectiveness
+     # Winning approach tracking (from comparison data)
      winning = signals.get("winning_strategies", [])
-     strategy_map = {"a": "exploit", "b": "explore", "c": "crossover", "d": "failure-targeted-1", "e": "failure-targeted-2"}
-     win_counts = {}
-     for w in winning:
-         exp = w.get("experiment", "")
-         if exp:
-             suffix = exp[-1]
-             name = strategy_map.get(suffix, suffix)
-             win_counts[name] = win_counts.get(name, 0) + 1
-
-     if win_counts:
-         best_strategy = max(win_counts, key=win_counts.get)
+     if winning:
+         win_count = len(winning)
+         best_score = max(w.get("score", 0) for w in winning)
          insights.append({
              "type": "strategy_effectiveness",
-             "insight": f"Most winning strategy: {best_strategy} ({win_counts[best_strategy]} wins)",
-             "recurrence": win_counts[best_strategy],
-             "data": win_counts,
+             "insight": f"Best candidate score: {best_score:.3f} across {win_count} iterations",
+             "recurrence": win_count,
+             "data": {"win_count": win_count, "best_score": best_score},
          })

      # Recurring failures (only promote if seen 2+ times)
@@ -1,10 +1,15 @@
  #!/usr/bin/env python3
- """Synthesize evolution strategy document from trace analysis.
+ """Synthesize evolution strategy document and investigation lenses.

  Reads trace_insights.json, best_results.json, evolution_memory.json,
  and production_seed.json to produce a targeted strategy document with
  specific file paths and concrete change recommendations for proposers.

+ When --lenses is specified, also generates a lenses.json file containing
+ investigation questions derived from failure clusters, architecture issues,
+ production data, and evolution memory. Each lens becomes a focused brief
+ for one proposer agent.
+
  Usage:
      python3 synthesize_strategy.py \
          --config .evolver.json \
@@ -12,7 +17,8 @@ Usage:
          --best-results best_results.json \
          --evolution-memory evolution_memory.json \
          --production-seed production_seed.json \
-         --output strategy.md
+         --output strategy.md \
+         --lenses lenses.json
  """

  import argparse
@@ -120,6 +126,118 @@ def synthesize(config, insights, results, memory, production=None):
      return strategy


+ def generate_lenses(strategy, config, insights, results, memory, production, max_lenses=5):
+     """Generate investigation lenses from available data sources."""
+     lenses = []
+     lens_id = 0
+
+     # Failure cluster lenses (one per distinct cluster, max 3)
+     for cluster in strategy.get("failure_clusters", [])[:3]:
+         lens_id += 1
+         desc = cluster["description"]
+         severity = cluster["severity"]
+         examples = []
+         for ex in strategy.get("failing_examples", []):
+             if ex.get("error") and cluster.get("type", "") in str(ex.get("error", "")):
+                 examples.append(ex["example_id"])
+         if not examples:
+             examples = [ex["example_id"] for ex in strategy.get("failing_examples", [])[:3]]
+         lenses.append({
+             "id": lens_id,
+             "question": f"{desc} — what code change would fix this?",
+             "source": "failure_cluster",
+             "severity": severity,
+             "context": {"examples": examples[:5]},
+         })
+
+     # Architecture lens from trace insights
+     if insights:
+         for issue in insights.get("top_issues", []):
+             if issue.get("severity") == "high" and issue.get("type") in (
+                 "architecture", "routing", "topology", "structure",
+             ):
+                 lens_id += 1
+                 lenses.append({
+                     "id": lens_id,
+                     "question": f"Architectural issue: {issue['description']} — what structural change would help?",
+                     "source": "architecture",
+                     "severity": "high",
+                     "context": {"issue_type": issue["type"]},
+                 })
+                 break  # at most 1 architecture lens
+
+     # Production lens
+     if production:
+         prod_issues = []
+         neg = production.get("negative_feedback_inputs", [])
+         if neg:
+             prod_issues.append(f"Users gave negative feedback on {len(neg)} queries")
+         errors = production.get("error_patterns", production.get("errors", []))
+         if errors and isinstance(errors, list) and len(errors) > 0:
+             prod_issues.append(f"Production errors: {str(errors[0])[:100]}")
+         slow = production.get("slow_queries", [])
+         if slow:
+             prod_issues.append(f"{len(slow)} slow queries detected")
+         if prod_issues:
+             lens_id += 1
+             lenses.append({
+                 "id": lens_id,
+                 "question": f"Production data shows: {'; '.join(prod_issues)}. How should the agent handle these real-world patterns?",
+                 "source": "production",
+                 "severity": "high",
+                 "context": {},
+             })
+
+     # Evolution memory lens — winning patterns
+     if memory:
+         for insight in memory.get("insights", []):
+             if insight.get("type") == "strategy_effectiveness" and insight.get("recurrence", 0) >= 2:
+                 lens_id += 1
+                 lenses.append({
+                     "id": lens_id,
+                     "question": f"{insight['insight']} — what further improvements in this direction are possible?",
+                     "source": "evolution_memory",
+                     "severity": "medium",
+                     "context": {"recurrence": insight["recurrence"]},
+                 })
+                 break  # at most 1 memory lens
+
+     # Evolution memory lens — persistent failures
+     if memory:
+         for insight in memory.get("insights", []):
+             if insight.get("type") == "recurring_failure" and insight.get("recurrence", 0) >= 3:
+                 lens_id += 1
+                 lenses.append({
+                     "id": lens_id,
+                     "question": f"{insight['insight']} — this has persisted for {insight['recurrence']} iterations. Why?",
+                     "source": "persistent_failure",
+                     "severity": "critical",
+                     "context": {"recurrence": insight["recurrence"]},
+                 })
+                 break  # at most 1 persistent failure lens
+
+     # Open lens (always included)
+     lens_id += 1
+     lenses.append({
+         "id": lens_id,
+         "question": "Open investigation — read all context and investigate what stands out most to you.",
+         "source": "open",
+         "severity": "medium",
+         "context": {},
+     })
+
+     # Sort by severity, take top max_lenses
+     severity_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
+     lenses.sort(key=lambda l: severity_order.get(l["severity"], 2))
+     lenses = lenses[:max_lenses]
+
+     # Reassign sequential IDs after sorting/truncating
+     for i, lens in enumerate(lenses):
+         lens["id"] = i + 1
+
+     return lenses
+
+
  def format_strategy_md(strategy, config):
      """Format strategy as markdown document."""
      lines = [
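The final ordering rule in `generate_lenses()` can be checked in isolation. This sketch reproduces just that step (sort by severity, truncate, reassign IDs); Python's sort is stable, so lenses of equal severity keep their generation order:

```python
def order_lenses(lenses: list[dict], max_lenses: int = 5) -> list[dict]:
    """Sort lenses critical-first, keep at most max_lenses, renumber IDs.

    Standalone restatement of the tail of generate_lenses() shown above,
    for illustration only.
    """
    severity_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    ordered = sorted(lenses, key=lambda l: severity_order.get(l["severity"], 2))[:max_lenses]
    for i, lens in enumerate(ordered):
        lens["id"] = i + 1
    return ordered
```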
@@ -197,6 +315,7 @@ def main():
      parser.add_argument("--evolution-memory", default="evolution_memory.json")
      parser.add_argument("--production-seed", default="production_seed.json")
      parser.add_argument("--output", default="strategy.md")
+     parser.add_argument("--lenses", default=None, help="Output path for lenses JSON")
      args = parser.parse_args()

      with open(args.config) as f:
@@ -217,6 +336,23 @@ def main():
      with open(json_path, "w") as f:
          json.dump(strategy, f, indent=2)

+     # Generate lenses if requested
+     if args.lenses:
+         max_proposers = config.get("max_proposers", 5)
+         lens_list = generate_lenses(
+             strategy, config, insights, results, memory, production,
+             max_lenses=max_proposers,
+         )
+         from datetime import datetime, timezone
+         lenses_output = {
+             "generated_at": datetime.now(timezone.utc).isoformat(),
+             "lens_count": len(lens_list),
+             "lenses": lens_list,
+         }
+         with open(args.lenses, "w") as f:
+             json.dump(lenses_output, f, indent=2)
+         print(f"Generated {len(lens_list)} lenses → {args.lenses}", file=sys.stderr)
+
      print(md)