harness-evolver 4.0.2 → 4.1.0
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +11 -10
- package/agents/evolver-proposer.md +45 -47
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +56 -65
- package/tools/consolidate.py +7 -15
- package/tools/synthesize_strategy.py +138 -2
package/.claude-plugin/plugin.json
CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.0.2",
+  "version": "4.1.0",
   "author": {
     "name": "Raphael Valdetaro"
   },
package/README.md
CHANGED

@@ -67,8 +67,8 @@ claude
 <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
 </tr>
 <tr>
-<td><b>
-<td>Each iteration
+<td><b>Self-Organizing Proposers</b></td>
+<td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
 </tr>
 <tr>
 <td><b>Agent-Based Evaluation</b></td>

@@ -88,7 +88,7 @@ claude
 </tr>
 <tr>
 <td><b>Evolution Memory</b></td>
-<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which
+<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
 </tr>
 <tr>
 <td><b>Smart Gating</b></td>

@@ -107,7 +107,7 @@ claude
 | Command | What it does |
 |---|---|
 | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
-| `/evolver:evolve` | Run the optimization loop (
+| `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
 | `/evolver:status` | Show progress, scores, history |
 | `/evolver:deploy` | Tag, push, clean up temporary files |

@@ -117,7 +117,7 @@ claude
 | Agent | Role | Color |
 |---|---|---|
-| **Proposer** |
+| **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
 | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
 | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |

@@ -134,10 +134,10 @@ claude
 - 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
 - 1. Read state (.evolver.json + LangSmith experiments)
 - 1.5 Gather trace insights (cluster errors, tokens, latency)
-- 1.8 Analyze per-task failures
-- 1.8a Synthesize strategy document (
+- 1.8 Analyze per-task failures
+- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
 - 1.9 Prepare shared proposer context (KV cache-optimized prefix)
-- 2. Spawn
+- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
 - 3. Run target for each candidate (code-based evaluators)
 - 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
 - 4. Compare experiments -> select winner + per-task champion

@@ -165,7 +165,7 @@ Skills (markdown)
 └── /evolver:deploy → tags and pushes

 Agents (markdown)
-├── Proposer (
+├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
 ├── Evaluator → LLM-as-judge via langsmith-cli
 ├── Critic → detects gaming + implements stricter evaluators
 ├── Architect → ULTRAPLAN deep analysis (opus model)

@@ -182,7 +182,7 @@ Tools (Python + langsmith SDK)
 ├── iteration_gate.py → three-gate iteration triggers
 ├── regression_tracker.py → tracks regressions, adds guard examples
 ├── consolidate.py → cross-iteration memory consolidation
-├── synthesize_strategy.py→ generates strategy document
+├── synthesize_strategy.py→ generates strategy document + investigation lenses
 ├── add_evaluator.py → programmatically adds evaluators
 └── adversarial_inject.py → detects memorization, injects adversarial tests
 ```

@@ -217,6 +217,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
 ## References

 - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
+- [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
 - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
 - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
 - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
package/agents/evolver-proposer.md
CHANGED

@@ -1,75 +1,56 @@
 ---
 name: evolver-proposer
 description: |
-
-
-
+  Self-organizing agent optimizer. Investigates a data-driven lens (question),
+  decides its own approach, and modifies real code in an isolated git worktree.
+  May self-abstain if it cannot add meaningful value.
 tools: Read, Write, Edit, Bash, Glob, Grep
 color: green
 permissionMode: acceptEdits
 ---

-# Evolver — Proposer
+# Evolver — Self-Organizing Proposer (v4)

-You are an LLM agent optimizer. Your job is to
+You are an LLM agent optimizer. Your job is to improve the user's agent code to score higher on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.

 ## Bootstrap

-Your prompt contains `<files_to_read
+Your prompt contains `<files_to_read>`, `<context>`, and `<lens>` blocks. You MUST:
 1. Read every file listed in `<files_to_read>` using the Read tool
 2. Parse the `<context>` block for current scores, failing examples, and framework info
-3. Read the `<
+3. Read the `<lens>` block — this is your investigation starting point

 ## Turn Budget

-You have a maximum of **16 turns
--
--
--
-- Turns 13-14: Test (verify changes don't break the entry point)
-- Turns 15-16: Commit and document
+You have a maximum of **16 turns**. You decide how to allocate them. General guidance:
+- Spend early turns reading context and investigating your lens question
+- Spend middle turns implementing changes and consulting documentation
+- Reserve final turns for committing and writing proposal.md

 **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.

 **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.

-##
+## Lens Protocol

-Your prompt contains a `<
-- **exploitation**: Conservative fix on current best. Focus on specific failing examples.
-- **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
-- **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
-- **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
-- **creative**: Try something unexpected — different libraries, architecture, algorithms.
-- **efficiency**: Same quality but fewer tokens, faster latency, simpler code.
+Your prompt contains a `<lens>` block with an **investigation question**. This is your starting point, not your mandate.

+1. **Investigate** — dig into the data relevant to the lens question (trace insights, failing examples, code)
+2. **Hypothesize** — form your own theory about what to change
+3. **Decide** — choose your approach freely. You may end up solving something completely different from what the lens asks. That's fine.
+4. **Implement or Abstain** — if you can add meaningful value, implement and commit. If not, abstain.

-### Phase 1: Orient
-
-Read .evolver.json to understand:
-- What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
-- What's the entry point?
-- What evaluators are active? (correctness, conciseness, latency, etc.)
-- What's the current best score?
+You are NOT constrained to the lens topic. The lens gives you a starting perspective. Your actual approach is yours to decide.

+## Your Workflow

-- Which examples are failing and why?
-- What error patterns exist?
-- Are there token/latency issues?
+There are no fixed phases. Use your judgment to allocate turns. A typical flow:

-- What do real user inputs look like?
-- What are the common error patterns in production?
-- Which query types get the most traffic?
+**Orient** — Read .evolver.json, strategy.md, evolution_memory.md. Understand the framework, entry point, evaluators, current score, and what has been tried before.

+**Investigate** — Read trace_insights.json and best_results.json. Understand which examples fail and why. If production_seed.json exists, understand real-world usage patterns. Focus on data relevant to your lens question.

-Based on
+**Decide** — Based on investigation, decide what to change. Consider:
 - **Prompts**: system prompts, few-shot examples, output format instructions
 - **Routing**: how queries are dispatched to different handlers
 - **Tools**: tool definitions, tool selection logic

@@ -77,7 +58,23 @@ Based on your strategy and diagnosis, modify the code:
 - **Error handling**: retry logic, fallback strategies, timeout handling
 - **Model selection**: which model for which task

-
+## Self-Abstention
+
+If after investigating your lens you conclude you cannot add meaningful value, you may **abstain**. This is a valued contribution — it saves evaluation tokens and signals confidence that the current code handles the lens topic adequately.
+
+To abstain, skip implementation and write only a `proposal.md`:
+
+```
+## ABSTAIN
+- **Lens**: {the question you investigated}
+- **Finding**: {what you discovered during investigation}
+- **Reason**: {why you're abstaining}
+- **Suggested focus**: {optional — what future iterations should look at}
+```
+
+Then end with the return protocol using `ABSTAIN` as your approach.
+
 ### Consult Documentation (MANDATORY)

 **Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.

@@ -106,7 +103,7 @@ Ask about the SPECIFIC API you're going to use or change.

 **If Context7 MCP is not available:** Note in proposal.md "API patterns not verified against current docs — verify before deploying."

-###
+### Commit and Document

 1. **Commit all changes** with a descriptive message:
 ```bash

@@ -143,7 +140,7 @@ Prioritize changes that fix real production failures over synthetic test failures
 ## Rules

 1. **Read before writing** — understand the code before changing it
-2. **
+2. **Focused changes** — change what's needed based on your investigation. Don't scatter changes across unrelated files.
 3. **Don't break the interface** — the agent must still be runnable with the same command
 4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
 5. **Write proposal.md** — the evolve skill reads this to understand what you did

@@ -153,8 +150,9 @@ Prioritize changes that fix real production failures over synthetic test failures
 When done, end your response with:

 ## PROPOSAL COMPLETE
-- **Version**: v{NNN}{
-- **
+- **Version**: v{NNN}-{id}
+- **Lens**: {the investigation question}
+- **Approach**: {what you chose to do and why — free text, your own words}
 - **Changes**: {brief list of files changed}
 - **Expected impact**: {which evaluators/examples should improve}
 - **Files modified**: {count}
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED

@@ -175,7 +175,7 @@ fi
 ```

 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
-
+This failure data feeds into `synthesize_strategy.py` which generates targeted lenses for proposers.
 If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.

 ### 1.8a. Synthesize Strategy

@@ -189,17 +189,18 @@ $EVOLVER_PY $TOOLS/synthesize_strategy.py \
   --best-results best_results.json \
   --evolution-memory evolution_memory.json \
   --production-seed production_seed.json \
-  --output strategy.md
+  --output strategy.md \
+  --lenses lenses.json 2>/dev/null
 ```

-The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9).
+The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). The `lenses.json` file contains dynamically generated investigation questions — one per proposer. Each lens directs a proposer's attention to a different aspect of the problem (failure cluster, architecture, production data, evolution memory, or open investigation).

 ### 1.9. Prepare Shared Proposer Context

-Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning
+Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning N proposers costs barely more than 1.

 ```bash
-# Build shared context block (identical for all
+# Build shared context block (identical for all proposers)
 SHARED_FILES_BLOCK="<files_to_read>
 - .evolver.json
 - strategy.md (if exists)

@@ -223,13 +224,19 @@ You are working in an isolated git worktree — modify any file freely.
 </objective>"
 ```

-**CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all
+**CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all proposer prompts. Only the `<lens>` block differs. Place the lens block LAST in the prompt so the shared prefix is maximized.

-### 2. Spawn
+### 2. Spawn Proposers in Parallel (Dynamic Lenses)

+Read `lenses.json` to get the list of investigation lenses:

+```bash
+LENS_COUNT=$(python3 -c "import json; print(json.load(open('lenses.json'))['lens_count'])")
+```
+
+Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique lens.
+
+**For each lens** — `run_in_background: true, isolation: "worktree"`:

 The prompt for EACH proposer follows this structure:
 ```

@@ -239,66 +246,54 @@ The prompt for EACH proposer follows this structure:

 {SHARED_CONTEXT_BLOCK}

-<
-
-</strategy>
-
-<output>
-1. Modify the code to improve performance
-2. Commit your changes with a descriptive message
-3. Write proposal.md explaining what you changed and why
-</output>
-```
+<lens>
+Investigation question: {lens.question}

-Focus on the specific failures identified in the results.
-```
+This is your STARTING POINT, not your mandate. Investigate, form your
+own hypothesis, and implement whatever you conclude will help most.
+You may solve something entirely different — that's fine.
+If you cannot add meaningful value, ABSTAIN.

-APPROACH: exploration
-Try a fundamentally different approach. Change algorithms, prompts, routing, architecture.
-Don't be afraid to make big changes — this worktree is disposable.
-```
+Source: {lens.source}
+</lens>

+<output>
+1. Investigate the lens question
+2. Decide your approach (or abstain)
+3. If proceeding: modify code, commit, write proposal.md
+4. proposal.md must include: what you chose to do, why, how it relates to the lens
+</output>
 ```

-
-```
-APPROACH: {failure_targeted_or_creative}
-{adaptive_briefing_d}
-```
+For each lens in `lenses.json`, spawn one proposer agent:

-**Candidate E strategy block:**
 ```
+Agent(
+  subagent_type: "evolver-proposer",
+  description: "Proposer {lens.id}: {lens.source} lens",
+  isolation: "worktree",
+  run_in_background: true,
+  prompt: {SHARED_PREFIX + LENS_BLOCK above, with lens fields filled in}
+)
 ```

-Wait for all
+Wait for all proposers to complete.

 **Stuck proposer detection**: If any proposer hasn't completed after 10 minutes, it may be stuck in a loop. The Claude Code runtime handles this via the agent's turn limit. If a proposer returns without committing changes, skip it — don't retry.

-After all proposers complete, check which ones
+After all proposers complete, check which ones committed and which abstained:

 ```bash
 for WORKTREE in {worktree_paths}; do
-
-
+  if [ -f "$WORKTREE/proposal.md" ] && grep -q "## ABSTAIN" "$WORKTREE/proposal.md" 2>/dev/null; then
+    echo "Proposer in $WORKTREE abstained — skipping evaluation"
+  elif [ $(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l) -eq 0 ]; then
     echo "Proposer in $WORKTREE made no commits — skipping"
   fi
 done
 ```

-Only run evaluation (Step 3) for proposers that committed changes.
+Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).

 ### 3. Run Target for Each Candidate

@@ -308,7 +303,7 @@ For each worktree that has changes (proposer committed something):
 $EVOLVER_PY $TOOLS/run_eval.py \
   --config .evolver.json \
   --worktree-path {worktree_path} \
-  --experiment-prefix v{NNN}{
+  --experiment-prefix v{NNN}-{lens_id} \
   --timeout 120
 ```

@@ -339,11 +334,7 @@ Agent(
   prompt: |
     <experiment>
     Evaluate the following experiments (one per candidate):
-
-    - {experiment_name_b}
-    - {experiment_name_c}
-    - {experiment_name_d}
-    - {experiment_name_e}
+    {list all experiment names from proposers that committed changes — skip abstained}
     </experiment>

     <evaluators>

@@ -370,14 +361,14 @@ Wait for the evaluator agent to complete before proceeding.

 ```bash
 $EVOLVER_PY $TOOLS/read_results.py \
-  --experiments "
+  --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
   --config .evolver.json \
   --output comparison.json
 ```

 Parse `comparison.json`:
 - `comparison.winner` — highest combined score
-- `comparison.champion` — per-task champion (for next
+- `comparison.champion` — per-task champion (for next iteration's context)
 - `comparison.all_candidates` — all scores for reporting

 ### 5. Merge Winner

@@ -389,7 +380,7 @@ If the winner scored higher than the current best:
 WINNER_BRANCH={winning_worktree_branch}

 # Merge into main
-git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{
+git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}-{lens_id} (score: {score})"
 ```

 Update `.evolver.json`:

@@ -409,14 +400,14 @@ json.dump(c, open('.evolver.json', 'w'), indent=2)

 Report ALL candidates:
 ```
-Iteration {i}/{N} —
-
-v{NNN}
-v{NNN}
-v{NNN}
-
+Iteration {i}/{N} — {lens_count} lenses, {evaluated_count} candidates evaluated ({abstained_count} abstained):
+{For each proposer, read proposal.md and extract the Approach field}
+v{NNN}-1 ({approach from proposal.md}): {score} — {summary}
+v{NNN}-2 ({approach from proposal.md}): {score} — {summary}
+v{NNN}-3 (ABSTAINED): -- — {reason from proposal.md}
+...

-Winner: v{NNN}{
+Winner: v{NNN}-{id} ({score}) — merged into main
 Per-task champion: {champion} (beats winner on {N} tasks)
 ```
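The abstention/commit check in Step 2 above can be mirrored in Python for clarity. This is a minimal sketch, not part of the package: the function name `classify_worktrees` is illustrative, it assumes the `proposal.md` convention described in the skill, and it omits the git-commit freshness check that the bash version performs.

```python
import os

def classify_worktrees(worktree_paths):
    """Partition worktrees into abstained vs. evaluation candidates,
    following the SKILL.md step-2 convention: a proposal.md containing
    '## ABSTAIN' means the proposer opted out of evaluation.
    (Simplified: the git-commit check from the bash loop is omitted.)"""
    abstained, candidates = [], []
    for wt in worktree_paths:
        proposal = os.path.join(wt, "proposal.md")
        if os.path.exists(proposal):
            with open(proposal) as f:
                text = f.read()
            if "## ABSTAIN" in text:
                abstained.append(wt)  # skip evaluation entirely
                continue
        candidates.append(wt)  # evaluate only these in Step 3
    return abstained, candidates
```

Only the `candidates` list would proceed to `run_eval.py`; abstained worktrees still contribute their `proposal.md` findings to the iteration report.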
package/tools/consolidate.py
CHANGED

@@ -94,24 +94,16 @@ def consolidate(orientation, signals, existing_memory=None):
     """Phase 3: Merge signals into consolidated memory."""
     insights = []

-    #
+    # Winning approach tracking (from comparison data)
     winning = signals.get("winning_strategies", [])
-
-
-
-        exp = w.get("experiment", "")
-        if exp:
-            suffix = exp[-1]
-            name = strategy_map.get(suffix, suffix)
-            win_counts[name] = win_counts.get(name, 0) + 1
-
-    if win_counts:
-        best_strategy = max(win_counts, key=win_counts.get)
+    if winning:
+        win_count = len(winning)
+        best_score = max(w.get("score", 0) for w in winning)
         insights.append({
             "type": "strategy_effectiveness",
-            "insight": f"
-            "recurrence":
-            "data":
+            "insight": f"Best candidate score: {best_score:.3f} across {win_count} iterations",
+            "recurrence": win_count,
+            "data": {"win_count": win_count, "best_score": best_score},
         })

     # Recurring failures (only promote if seen 2+ times)
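The rewritten winning-approach block in `consolidate.py` reduces to a small pure function. A sketch under the assumption (visible in the diff) that `signals["winning_strategies"]` is a list of dicts with a `score` key; `winning_insight` is a hypothetical helper for illustration, not part of the package:

```python
def winning_insight(signals):
    """Build the strategy_effectiveness insight the way the new
    consolidate() does: count winning iterations and take the best score."""
    winning = signals.get("winning_strategies", [])
    if not winning:
        return None  # no comparison data yet — nothing to consolidate
    win_count = len(winning)
    best_score = max(w.get("score", 0) for w in winning)
    return {
        "type": "strategy_effectiveness",
        "insight": f"Best candidate score: {best_score:.3f} across {win_count} iterations",
        "recurrence": win_count,
        "data": {"win_count": win_count, "best_score": best_score},
    }
```

Note the design change the diff makes: instead of mapping experiment-name suffixes back to fixed strategy names (which no longer exist in v4.1's lens model), the insight now tracks only win count and best score.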
package/tools/synthesize_strategy.py
CHANGED

@@ -1,10 +1,15 @@
 #!/usr/bin/env python3
-"""Synthesize evolution strategy document
+"""Synthesize evolution strategy document and investigation lenses.

 Reads trace_insights.json, best_results.json, evolution_memory.json,
 and production_seed.json to produce a targeted strategy document with
 specific file paths and concrete change recommendations for proposers.

+When --lenses is specified, also generates a lenses.json file containing
+investigation questions derived from failure clusters, architecture issues,
+production data, and evolution memory. Each lens becomes a focused brief
+for one proposer agent.
+
 Usage:
     python3 synthesize_strategy.py \
         --config .evolver.json \

@@ -12,7 +17,8 @@ Usage:
         --best-results best_results.json \
         --evolution-memory evolution_memory.json \
         --production-seed production_seed.json \
-        --output strategy.md
+        --output strategy.md \
+        --lenses lenses.json
 """

 import argparse

@@ -120,6 +126,118 @@ def synthesize(config, insights, results, memory, production=None):
     return strategy


+def generate_lenses(strategy, config, insights, results, memory, production, max_lenses=5):
+    """Generate investigation lenses from available data sources."""
+    lenses = []
+    lens_id = 0
+
+    # Failure cluster lenses (one per distinct cluster, max 3)
+    for cluster in strategy.get("failure_clusters", [])[:3]:
+        lens_id += 1
+        desc = cluster["description"]
+        severity = cluster["severity"]
+        examples = []
+        for ex in strategy.get("failing_examples", []):
+            if ex.get("error") and cluster.get("type", "") in str(ex.get("error", "")):
+                examples.append(ex["example_id"])
+        if not examples:
+            examples = [ex["example_id"] for ex in strategy.get("failing_examples", [])[:3]]
+        lenses.append({
+            "id": lens_id,
+            "question": f"{desc} — what code change would fix this?",
+            "source": "failure_cluster",
+            "severity": severity,
+            "context": {"examples": examples[:5]},
+        })
+
+    # Architecture lens from trace insights
+    if insights:
+        for issue in insights.get("top_issues", []):
+            if issue.get("severity") == "high" and issue.get("type") in (
+                "architecture", "routing", "topology", "structure",
+            ):
+                lens_id += 1
+                lenses.append({
+                    "id": lens_id,
+                    "question": f"Architectural issue: {issue['description']} — what structural change would help?",
+                    "source": "architecture",
+                    "severity": "high",
+                    "context": {"issue_type": issue["type"]},
+                })
+                break  # at most 1 architecture lens
+
+    # Production lens
+    if production:
+        prod_issues = []
+        neg = production.get("negative_feedback_inputs", [])
+        if neg:
+            prod_issues.append(f"Users gave negative feedback on {len(neg)} queries")
+        errors = production.get("error_patterns", production.get("errors", []))
+        if errors and isinstance(errors, list) and len(errors) > 0:
+            prod_issues.append(f"Production errors: {str(errors[0])[:100]}")
+        slow = production.get("slow_queries", [])
+        if slow:
+            prod_issues.append(f"{len(slow)} slow queries detected")
+        if prod_issues:
+            lens_id += 1
+            lenses.append({
+                "id": lens_id,
+                "question": f"Production data shows: {'; '.join(prod_issues)}. How should the agent handle these real-world patterns?",
+                "source": "production",
+                "severity": "high",
+                "context": {},
+            })
+
+    # Evolution memory lens — winning patterns
+    if memory:
+        for insight in memory.get("insights", []):
+            if insight.get("type") == "strategy_effectiveness" and insight.get("recurrence", 0) >= 2:
+                lens_id += 1
+                lenses.append({
+                    "id": lens_id,
+                    "question": f"{insight['insight']} — what further improvements in this direction are possible?",
+                    "source": "evolution_memory",
+                    "severity": "medium",
+                    "context": {"recurrence": insight["recurrence"]},
+                })
+                break  # at most 1 memory lens
+
+    # Evolution memory lens — persistent failures
+    if memory:
+        for insight in memory.get("insights", []):
+            if insight.get("type") == "recurring_failure" and insight.get("recurrence", 0) >= 3:
+                lens_id += 1
+                lenses.append({
+                    "id": lens_id,
+                    "question": f"{insight['insight']} — this has persisted {insight['recurrence']} iterations. Why?",
+                    "source": "persistent_failure",
+                    "severity": "critical",
+                    "context": {"recurrence": insight["recurrence"]},
+                })
+                break  # at most 1 persistent failure lens
+
+    # Open lens (always included)
+    lens_id += 1
+    lenses.append({
+        "id": lens_id,
+        "question": "Open investigation — read all context and investigate what stands out most to you.",
+        "source": "open",
+        "severity": "medium",
+        "context": {},
+    })
+
+    # Sort by severity, take top max_lenses
+    severity_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
+    lenses.sort(key=lambda l: severity_order.get(l["severity"], 2))
+    lenses = lenses[:max_lenses]
+
+    # Reassign sequential IDs after sorting/truncating
+    for i, lens in enumerate(lenses):
+        lens["id"] = i + 1
+
+    return lenses
+
+
 def format_strategy_md(strategy, config):
     """Format strategy as markdown document."""
     lines = [

@@ -197,6 +315,7 @@ def main():
     parser.add_argument("--evolution-memory", default="evolution_memory.json")
     parser.add_argument("--production-seed", default="production_seed.json")
     parser.add_argument("--output", default="strategy.md")
+    parser.add_argument("--lenses", default=None, help="Output path for lenses JSON")
     args = parser.parse_args()

     with open(args.config) as f:

@@ -217,6 +336,23 @@ def main():
     with open(json_path, "w") as f:
         json.dump(strategy, f, indent=2)

+    # Generate lenses if requested
+    if args.lenses:
+        max_proposers = config.get("max_proposers", 5)
+        lens_list = generate_lenses(
+            strategy, config, insights, results, memory, production,
+            max_lenses=max_proposers,
+        )
+        from datetime import datetime, timezone
+        lenses_output = {
+            "generated_at": datetime.now(timezone.utc).isoformat(),
+            "lens_count": len(lens_list),
+            "lenses": lens_list,
+        }
+        with open(args.lenses, "w") as f:
+            json.dump(lenses_output, f, indent=2)
+        print(f"Generated {len(lens_list)} lenses → {args.lenses}", file=sys.stderr)
+
     print(md)
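Downstream, each lens from `lenses.json` becomes the unique suffix of a proposer prompt whose prefix is byte-identical across proposers (the KV cache-sharing rule in the evolve skill's steps 1.9 and 2). A sketch of that assembly, assuming the lens fields shown in the diff; `build_prompts` and `shared_prefix` are illustrative names, not package APIs:

```python
# Mirrors generate_lenses(): critical lenses sort first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def build_prompts(shared_prefix, lenses):
    """Assemble one prompt per lens: the byte-identical shared prefix
    comes first, the unique <lens> block comes last, so the common
    prefix (and its KV cache) is maximized across proposers."""
    prompts = []
    for lens in sorted(lenses, key=lambda l: SEVERITY_ORDER.get(l["severity"], 2)):
        block = (
            "<lens>\n"
            f"Investigation question: {lens['question']}\n"
            f"Source: {lens['source']}\n"
            "</lens>"
        )
        prompts.append(shared_prefix + "\n\n" + block)
    return prompts
```

Because only the trailing `<lens>` block differs, an inference provider can reuse the cached prefix for every proposer, which is why the skill claims spawning N proposers "costs barely more than 1".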