harness-evolver 4.0.3 → 4.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +11 -10
- package/agents/evolver-proposer.md +45 -47
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +173 -65
- package/tools/__pycache__/adversarial_inject.cpython-313.pyc +0 -0
- package/tools/__pycache__/regression_tracker.cpython-313.pyc +0 -0
- package/tools/__pycache__/setup.cpython-313.pyc +0 -0
- package/tools/adversarial_inject.py +8 -3
- package/tools/consolidate.py +7 -15
- package/tools/dataset_health.py +385 -0
- package/tools/read_results.py +21 -2
- package/tools/regression_tracker.py +17 -4
- package/tools/setup.py +23 -0
- package/tools/synthesize_strategy.py +138 -2
- package/tools/trace_insights.py +7 -1
package/.claude-plugin/plugin.json
CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.0.3",
+  "version": "4.2.0",
   "author": {
     "name": "Raphael Valdetaro"
   },
package/README.md
CHANGED

@@ -67,8 +67,8 @@ claude
 <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
 </tr>
 <tr>
-<td><b>
-<td>Each iteration
+<td><b>Self-Organizing Proposers</b></td>
+<td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
 </tr>
 <tr>
 <td><b>Agent-Based Evaluation</b></td>

@@ -88,7 +88,7 @@ claude
 </tr>
 <tr>
 <td><b>Evolution Memory</b></td>
-<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which
+<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
 </tr>
 <tr>
 <td><b>Smart Gating</b></td>

@@ -107,7 +107,7 @@ claude
 | Command | What it does |
 |---|---|
 | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
-| `/evolver:evolve` | Run the optimization loop (
+| `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
 | `/evolver:status` | Show progress, scores, history |
 | `/evolver:deploy` | Tag, push, clean up temporary files |

@@ -117,7 +117,7 @@ claude

 | Agent | Role | Color |
 |---|---|---|
-| **Proposer** |
+| **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
 | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
 | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |

@@ -134,10 +134,10 @@ claude
 +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
 +- 1. Read state (.evolver.json + LangSmith experiments)
 +- 1.5 Gather trace insights (cluster errors, tokens, latency)
-+- 1.8 Analyze per-task failures
-+- 1.8a Synthesize strategy document (
++- 1.8 Analyze per-task failures
++- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
 +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
-+- 2. Spawn
++- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
 +- 3. Run target for each candidate (code-based evaluators)
 +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
 +- 4. Compare experiments -> select winner + per-task champion

@@ -165,7 +165,7 @@ Skills (markdown)
 └── /evolver:deploy → tags and pushes

 Agents (markdown)
-├── Proposer (
+├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
 ├── Evaluator → LLM-as-judge via langsmith-cli
 ├── Critic → detects gaming + implements stricter evaluators
 ├── Architect → ULTRAPLAN deep analysis (opus model)

@@ -182,7 +182,7 @@ Tools (Python + langsmith SDK)
 ├── iteration_gate.py → three-gate iteration triggers
 ├── regression_tracker.py → tracks regressions, adds guard examples
 ├── consolidate.py → cross-iteration memory consolidation
-├── synthesize_strategy.py→ generates strategy document
+├── synthesize_strategy.py→ generates strategy document + investigation lenses
 ├── add_evaluator.py → programmatically adds evaluators
 └── adversarial_inject.py → detects memorization, injects adversarial tests
 ```

@@ -217,6 +217,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
 ## References

 - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
+- [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
 - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
 - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
 - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
package/agents/evolver-proposer.md
CHANGED

@@ -1,75 +1,56 @@
 ---
 name: evolver-proposer
 description: |
-
-
-
+  Self-organizing agent optimizer. Investigates a data-driven lens (question),
+  decides its own approach, and modifies real code in an isolated git worktree.
+  May self-abstain if it cannot add meaningful value.
 tools: Read, Write, Edit, Bash, Glob, Grep
 color: green
 permissionMode: acceptEdits
 ---

-# Evolver — Proposer
+# Evolver — Self-Organizing Proposer (v4)

-You are an LLM agent optimizer. Your job is to
+You are an LLM agent optimizer. Your job is to improve the user's agent code to score higher on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.

 ## Bootstrap

-Your prompt contains `<files_to_read
+Your prompt contains `<files_to_read>`, `<context>`, and `<lens>` blocks. You MUST:
 1. Read every file listed in `<files_to_read>` using the Read tool
 2. Parse the `<context>` block for current scores, failing examples, and framework info
-3. Read the `<
+3. Read the `<lens>` block — this is your investigation starting point

 ## Turn Budget

-You have a maximum of **16 turns
--
--
--
-- Turns 13-14: Test (verify changes don't break the entry point)
-- Turns 15-16: Commit and document
+You have a maximum of **16 turns**. You decide how to allocate them. General guidance:
+- Spend early turns reading context and investigating your lens question
+- Spend middle turns implementing changes and consulting documentation
+- Reserve final turns for committing and writing proposal.md

 **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.

 **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.

-##
+## Lens Protocol

-Your prompt contains a `<
-- **exploitation**: Conservative fix on current best. Focus on specific failing examples.
-- **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
-- **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
-- **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
-- **creative**: Try something unexpected — different libraries, architecture, algorithms.
-- **efficiency**: Same quality but fewer tokens, faster latency, simpler code.
+Your prompt contains a `<lens>` block with an **investigation question**. This is your starting point, not your mandate.

-
+1. **Investigate** — dig into the data relevant to the lens question (trace insights, failing examples, code)
+2. **Hypothesize** — form your own theory about what to change
+3. **Decide** — choose your approach freely. You may end up solving something completely different from what the lens asks. That's fine.
+4. **Implement or Abstain** — if you can add meaningful value, implement and commit. If not, abstain.

-
-
-### Phase 1: Orient
-
-Read .evolver.json to understand:
-- What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
-- What's the entry point?
-- What evaluators are active? (correctness, conciseness, latency, etc.)
-- What's the current best score?
+You are NOT constrained to the lens topic. The lens gives you a starting perspective. Your actual approach is yours to decide.

-
+## Your Workflow

-
-- Which examples are failing and why?
-- What error patterns exist?
-- Are there token/latency issues?
+There are no fixed phases. Use your judgment to allocate turns. A typical flow:

-
-- What do real user inputs look like?
-- What are the common error patterns in production?
-- Which query types get the most traffic?
+**Orient** — Read .evolver.json, strategy.md, evolution_memory.md. Understand the framework, entry point, evaluators, current score, and what has been tried before.

-
+**Investigate** — Read trace_insights.json and best_results.json. Understand which examples fail and why. If production_seed.json exists, understand real-world usage patterns. Focus on data relevant to your lens question.

-Based on
+**Decide** — Based on investigation, decide what to change. Consider:
 - **Prompts**: system prompts, few-shot examples, output format instructions
 - **Routing**: how queries are dispatched to different handlers
 - **Tools**: tool definitions, tool selection logic

@@ -77,7 +58,23 @@ Based on your strategy and diagnosis, modify the code:
 - **Error handling**: retry logic, fallback strategies, timeout handling
 - **Model selection**: which model for which task

-
+## Self-Abstention
+
+If after investigating your lens you conclude you cannot add meaningful value, you may **abstain**. This is a valued contribution — it saves evaluation tokens and signals confidence that the current code handles the lens topic adequately.
+
+To abstain, skip implementation and write only a `proposal.md`:
+
+```
+## ABSTAIN
+- **Lens**: {the question you investigated}
+- **Finding**: {what you discovered during investigation}
+- **Reason**: {why you're abstaining}
+- **Suggested focus**: {optional — what future iterations should look at}
+```
+
+Then end with the return protocol using `ABSTAIN` as your approach.
+
+### Consult Documentation (MANDATORY)

 **Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.

@@ -106,7 +103,7 @@ Ask about the SPECIFIC API you're going to use or change.

 **If Context7 MCP is not available:** Note in proposal.md "API patterns not verified against current docs — verify before deploying."

-###
+### Commit and Document

 1. **Commit all changes** with a descriptive message:
 ```bash

@@ -143,7 +140,7 @@ Prioritize changes that fix real production failures over synthetic test failure
 ## Rules

 1. **Read before writing** — understand the code before changing it
-2. **
+2. **Focused changes** — change what's needed based on your investigation. Don't scatter changes across unrelated files.
 3. **Don't break the interface** — the agent must still be runnable with the same command
 4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
 5. **Write proposal.md** — the evolve skill reads this to understand what you did

@@ -153,8 +150,9 @@ Prioritize changes that fix real production failures over synthetic test failure
 When done, end your response with:

 ## PROPOSAL COMPLETE
-- **Version**: v{NNN}{
-- **
+- **Version**: v{NNN}-{id}
+- **Lens**: {the investigation question}
+- **Approach**: {what you chose to do and why — free text, your own words}
 - **Changes**: {brief list of files changed}
 - **Expected impact**: {which evaluators/examples should improve}
 - **Files modified**: {count}
package/package.json
CHANGED

package/skills/evolve/SKILL.md
CHANGED

@@ -127,6 +127,122 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ
 - "Fix and retry" — attempt auto-fix with `--fix` flag
 - "Abort" — stop the evolution loop

+### 0.6. Dataset Health Check
+
+Run the dataset health diagnostic:
+
+```bash
+$EVOLVER_PY $TOOLS/dataset_health.py \
+  --config .evolver.json \
+  --production-seed production_seed.json \
+  --output health_report.json 2>/dev/null
+```
+
+Read `health_report.json`. Print summary:
+```bash
+python3 -c "
+import json, os
+if os.path.exists('health_report.json'):
+    r = json.load(open('health_report.json'))
+    print(f'Dataset Health: {r[\"health_score\"]}/10 ({r[\"example_count\"]} examples)')
+    for issue in r.get('issues', []):
+        print(f'  [{issue[\"severity\"]}] {issue[\"message\"]}')
+"
+```
+
+### 0.7. Auto-Correct Dataset Issues
+
+If `health_report.json` has corrections, apply them automatically:
+
+```bash
+CORRECTIONS=$(python3 -c "
+import json, os
+if os.path.exists('health_report.json'):
+    r = json.load(open('health_report.json'))
+    for c in r.get('corrections', []):
+        print(c['action'])
+" 2>/dev/null)
+```
+
+For each correction:
+
+**If `create_splits`**: Run inline Python to assign 70/30 splits:
+```bash
+$EVOLVER_PY -c "
+from langsmith import Client
+import json, random
+client = Client()
+config = json.load(open('.evolver.json'))
+examples = list(client.list_examples(dataset_name=config['dataset']))
+random.shuffle(examples)
+sp = int(len(examples) * 0.7)
+for ex in examples[:sp]:
+    client.update_example(ex.id, split='train')
+for ex in examples[sp:]:
+    client.update_example(ex.id, split='held_out')
+print(f'Assigned splits: {sp} train, {len(examples)-sp} held_out')
+"
+```
+
+**If `generate_hard`**: Spawn testgen agent with hard-mode instruction:
+```
+Agent(
+  subagent_type: "evolver-testgen",
+  description: "Generate hard examples to rebalance dataset",
+  prompt: |
+    <objective>
+    The dataset is skewed toward easy examples. Generate {count} HARD examples
+    that the current agent is likely to fail on.
+    Focus on: edge cases, adversarial inputs, complex multi-step queries,
+    ambiguous questions, and inputs that require deep reasoning.
+    </objective>
+    <files_to_read>
+    - .evolver.json
+    - strategy.md (if exists)
+    - production_seed.json (if exists)
+    </files_to_read>
+)
+```
+
+**If `fill_coverage`**: Spawn testgen agent with coverage-fill instruction:
+```
+Agent(
+  subagent_type: "evolver-testgen",
+  description: "Generate examples for missing categories",
+  prompt: |
+    <objective>
+    The dataset is missing these production categories: {categories}.
+    Generate 5 examples per missing category.
+    Use production_seed.json for real-world patterns in these categories.
+    </objective>
+    <files_to_read>
+    - .evolver.json
+    - production_seed.json (if exists)
+    </files_to_read>
+)
+```
+
+**If `retire_dead`**: Move dead examples to retired split:
+```bash
+$EVOLVER_PY -c "
+from langsmith import Client
+import json
+client = Client()
+report = json.load(open('health_report.json'))
+dead_ids = report.get('dead_examples', {}).get('ids', [])
+config = json.load(open('.evolver.json'))
+examples = {str(e.id): e for e in client.list_examples(dataset_name=config['dataset'])}
+retired = 0
+for eid in dead_ids:
+    if eid in examples:
+        client.update_example(examples[eid].id, split='retired')
+        retired += 1
+print(f'Retired {retired} dead examples')
+"
+```
+
+After corrections, log what was done. Do NOT re-run health check (corrections may need an experiment cycle to show effect).
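The `create_splits` correction reduces to a shuffled 70/30 partition. A minimal pure-Python sketch of that logic, with a hypothetical `assign_splits` helper standing in for the LangSmith `update_example` calls:

```python
import random

def assign_splits(example_ids, train_fraction=0.7, seed=None):
    """Shuffle example ids and partition them into train / held_out splits."""
    rng = random.Random(seed)
    ids = list(example_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    # In the real correction each id is sent to client.update_example(id, split=...)
    return {"train": ids[:cut], "held_out": ids[cut:]}

splits = assign_splits(range(10), seed=42)
print(len(splits["train"]), len(splits["held_out"]))  # 7 3
```

Seeding the shuffle is optional; the skill's inline version shuffles non-deterministically, which is fine since splits are written back once.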
+
 For each iteration:

 ### 1. Get Next Version

@@ -170,12 +286,13 @@ if [ -n "$BEST" ]; then
 $EVOLVER_PY $TOOLS/read_results.py \
   --experiment "$BEST" \
   --config .evolver.json \
+  --split train \
   --output best_results.json 2>/dev/null
 fi
 ```

 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
-
+This failure data feeds into `synthesize_strategy.py` which generates targeted lenses for proposers.
 If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.

 ### 1.8a. Synthesize Strategy

@@ -189,17 +306,18 @@ $EVOLVER_PY $TOOLS/synthesize_strategy.py \
   --best-results best_results.json \
   --evolution-memory evolution_memory.json \
   --production-seed production_seed.json \
-  --output strategy.md
+  --output strategy.md \
+  --lenses lenses.json 2>/dev/null
 ```

-The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9).
+The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). The `lenses.json` file contains dynamically generated investigation questions — one per proposer. Each lens directs a proposer's attention to a different aspect of the problem (failure cluster, architecture, production data, evolution memory, or open investigation).

 ### 1.9. Prepare Shared Proposer Context

-Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning
+Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning N proposers costs barely more than 1.

 ```bash
-# Build shared context block (identical for all
+# Build shared context block (identical for all proposers)
 SHARED_FILES_BLOCK="<files_to_read>
 - .evolver.json
 - strategy.md (if exists)

@@ -223,13 +341,19 @@ You are working in an isolated git worktree — modify any file freely.
 </objective>"
 ```

-**CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all
+**CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all proposer prompts. Only the `<lens>` block differs. Place the lens block LAST in the prompt so the shared prefix is maximized.
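The cache-sharing rule amounts to plain string assembly: shared blocks first, the per-proposer lens last, so the common prefix (and thus the reusable KV cache) is as long as possible. A sketch with an illustrative `build_prompt` helper and made-up block contents; only `lens.question` and `lens.source` come from the skill:

```python
SHARED_PREFIX = (
    "<objective>improve the agent</objective>\n"
    "<files_to_read>.evolver.json, strategy.md</files_to_read>\n"
    "<context>current best score and failing examples go here</context>\n"
)

def build_prompt(shared_prefix, lens):
    # Byte-identical prefix for every proposer; only the trailing lens differs.
    return (shared_prefix
            + "<lens>\n"
            + f"Investigation question: {lens['question']}\n"
            + f"Source: {lens['source']}\n"
            + "</lens>")

lenses = [
    {"question": "Why do multi-step queries fail?", "source": "failure_cluster"},
    {"question": "Is routing too coarse-grained?", "source": "architecture"},
]
prompts = [build_prompt(SHARED_PREFIX, lens) for lens in lenses]
print(all(p.startswith(SHARED_PREFIX) for p in prompts))  # True
```

If the shared blocks were interleaved with per-lens text, every prompt would diverge at the first differing byte and the cache win would be lost.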

+### 2. Spawn Proposers in Parallel (Dynamic Lenses)
+
+Read `lenses.json` to get the list of investigation lenses:

-
+```bash
+LENS_COUNT=$(python3 -c "import json; print(json.load(open('lenses.json'))['lens_count'])")
+```

-Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique
+Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique lens.

-**
+**For each lens** — `run_in_background: true, isolation: "worktree"`:

 The prompt for EACH proposer follows this structure:
 ```

@@ -239,66 +363,54 @@ The prompt for EACH proposer follows this structure:

 {SHARED_CONTEXT_BLOCK}

-<
-
-</strategy>
+<lens>
+Investigation question: {lens.question}

-
-
-
-
-</output>
-```
+This is your STARTING POINT, not your mandate. Investigate, form your
+own hypothesis, and implement whatever you conclude will help most.
+You may solve something entirely different — that's fine.
+If you cannot add meaningful value, ABSTAIN.

-
-
-APPROACH: exploitation
-Make targeted improvements to the current best version.
-Focus on the specific failures identified in the results.
-```
-
-**Candidate B strategy block:**
-```
-APPROACH: exploration
-Try a fundamentally different approach. Change algorithms, prompts, routing, architecture.
-Don't be afraid to make big changes — this worktree is disposable.
-```
+Source: {lens.source}
+</lens>

-
-
-
-
-
+<output>
+1. Investigate the lens question
+2. Decide your approach (or abstain)
+3. If proceeding: modify code, commit, write proposal.md
+4. proposal.md must include: what you chose to do, why, how it relates to the lens
+</output>
 ```

-
-```
-APPROACH: {failure_targeted_or_creative}
-{adaptive_briefing_d}
-```
+For each lens in `lenses.json`, spawn one proposer agent:

-**Candidate E strategy block:**
 ```
-
-
+Agent(
+  subagent_type: "evolver-proposer",
+  description: "Proposer {lens.id}: {lens.source} lens",
+  isolation: "worktree",
+  run_in_background: true,
+  prompt: {SHARED_PREFIX + LENS_BLOCK above, with lens fields filled in}
+)
 ```

-Wait for all
+Wait for all proposers to complete.

 **Stuck proposer detection**: If any proposer hasn't completed after 10 minutes, it may be stuck in a loop. The Claude Code runtime handles this via the agent's turn limit. If a proposer returns without committing changes, skip it — don't retry.

-After all proposers complete, check which ones
+After all proposers complete, check which ones committed and which abstained:

 ```bash
 for WORKTREE in {worktree_paths}; do
-
-
+  if [ -f "$WORKTREE/proposal.md" ] && grep -q "## ABSTAIN" "$WORKTREE/proposal.md" 2>/dev/null; then
+    echo "Proposer in $WORKTREE abstained — skipping evaluation"
+  elif [ $(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l) -eq 0 ]; then
     echo "Proposer in $WORKTREE made no commits — skipping"
   fi
 done
 ```
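The worktree check has three outcomes per proposer. A hypothetical Python equivalent makes the classification explicit; `proposal_text` and `has_commits` stand in for the proposal.md read and the `git log` probe:

```python
def classify_proposer(proposal_text, has_commits):
    """Mirror the worktree check: abstained, stuck, or ready to evaluate."""
    if proposal_text is not None and "## ABSTAIN" in proposal_text:
        return "abstained"   # valid signal, but skip evaluation
    if not has_commits:
        return "stuck"       # no commits: skip, don't retry
    return "evaluate"        # run this candidate through Step 3

print(classify_proposer("## ABSTAIN\n- **Lens**: ...", has_commits=False))  # abstained
print(classify_proposer(None, has_commits=True))                            # evaluate
```

Note that abstention is checked before commits: an abstaining proposer legitimately has no commits, so the order of the two tests matters.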

-Only run evaluation (Step 3) for proposers that committed changes.
+Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).

 ### 3. Run Target for Each Candidate

@@ -308,7 +420,7 @@ For each worktree that has changes (proposer committed something):
 $EVOLVER_PY $TOOLS/run_eval.py \
   --config .evolver.json \
   --worktree-path {worktree_path} \
-  --experiment-prefix v{NNN}{
+  --experiment-prefix v{NNN}-{lens_id} \
   --timeout 120
 ```

@@ -339,11 +451,7 @@ Agent(
   prompt: |
     <experiment>
     Evaluate the following experiments (one per candidate):
-
-    - {experiment_name_b}
-    - {experiment_name_c}
-    - {experiment_name_d}
-    - {experiment_name_e}
+    {list all experiment names from proposers that committed changes — skip abstained}
     </experiment>

     <evaluators>

@@ -370,14 +478,14 @@ Wait for the evaluator agent to complete before proceeding.

 ```bash
 $EVOLVER_PY $TOOLS/read_results.py \
-  --experiments "
+  --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
   --config .evolver.json \
   --output comparison.json
 ```

 Parse `comparison.json`:
 - `comparison.winner` — highest combined score
-- `comparison.champion` — per-task champion (for next
+- `comparison.champion` — per-task champion (for next iteration's context)
 - `comparison.all_candidates` — all scores for reporting
490
|
|
|
383
491
|
### 5. Merge Winner
|
|
@@ -389,7 +497,7 @@ If the winner scored higher than the current best:
|
|
|
389
497
|
WINNER_BRANCH={winning_worktree_branch}
|
|
390
498
|
|
|
391
499
|
# Merge into main
|
|
392
|
-
git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{
|
|
500
|
+
git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}-{lens_id} (score: {score})"
|
|
393
501
|
```
|
|
394
502
|
|
|
395
503
|
Update `.evolver.json`:
|
|
@@ -409,14 +517,14 @@ json.dump(c, open('.evolver.json', 'w'), indent=2)
|
|
|
409
517
|
|
|
410
518
|
Report ALL candidates:
|
|
411
519
|
```
|
|
412
|
-
Iteration {i}/{N} —
|
|
413
|
-
|
|
414
|
-
v{NNN}
|
|
415
|
-
v{NNN}
|
|
416
|
-
v{NNN}
|
|
417
|
-
|
|
520
|
+
Iteration {i}/{N} — {lens_count} lenses, {evaluated_count} candidates evaluated ({abstained_count} abstained):
|
|
521
|
+
{For each proposer, read proposal.md and extract the Approach field}
|
|
522
|
+
v{NNN}-1 ({approach from proposal.md}): {score} — {summary}
|
|
523
|
+
v{NNN}-2 ({approach from proposal.md}): {score} — {summary}
|
|
524
|
+
v{NNN}-3 (ABSTAINED): -- — {reason from proposal.md}
|
|
525
|
+
...
|
|
418
526
|
|
|
419
|
-
Winner: v{NNN}{
|
|
527
|
+
Winner: v{NNN}-{id} ({score}) — merged into main
|
|
420
528
|
Per-task champion: {champion} (beats winner on {N} tasks)
|
|
421
529
|
```
|
|
422
530
|
|
|
Binary file
|
|
Binary file
|
|
Binary file
|
|
package/tools/adversarial_inject.py
CHANGED

@@ -145,15 +145,20 @@ def generate_adversarial_inputs(client, dataset_name, num_inputs=5):
     return adversarial


-def inject_adversarial(client, dataset_id, adversarial_inputs):
+def inject_adversarial(client, dataset_id, adversarial_inputs, config=None):
     """Add adversarial examples to dataset."""
+    config = config or {}
     added = 0
     for adv in adversarial_inputs:
         try:
+            split = "train" if random.random() < 0.7 else "held_out"
+            metadata = dict(adv["metadata"])
+            metadata["added_at_iteration"] = config.get("iterations", 0)
             client.create_example(
                 inputs=adv["inputs"],
                 dataset_id=dataset_id,
-                metadata=
+                metadata=metadata,
+                split=split,
             )
             added += 1
         except Exception as e:

@@ -182,7 +187,7 @@ def main():

     injected = 0
     if args.inject and adversarial:
-        injected = inject_adversarial(client, config["dataset_id"], adversarial)
+        injected = inject_adversarial(client, config["dataset_id"], adversarial, config=config)

     result = {
         "memorization_suspects": len(suspicious),