@evo-hq/pi-evo 0.4.2-alpha.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,309 @@
1
+ ---
2
+ name: optimize
3
+ description: Run the evo optimization loop with parallel subagents until interrupted.
4
+ argument-hint: "[subagents=N] [budget=N] [stall=N]"
5
+ ---
6
+
7
+ Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns parallel subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
8
+
9
+ ## Host conventions
10
+
11
+ This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
12
+
13
+ - **"spawn N subagents in parallel"** -- use your host's parallel-subagent tool. See Step 5 below for the per-host spawn commands. Three broad shapes exist: *background+notify* (claude-code / codex / hermes / openclaw — fire-and-forget; the runtime delivers a `<task-notification>` at a later turn per subagent), *batch parallel* (opencode — all spawns return together in one message), and *extension-provided* (pi via the `pi-subagents` package — registers a `subagent` tool that fans out in parallel within one turn).
14
+ - **Slash commands shown in user-facing copy** (e.g. `/evo:optimize`) -- translate to your host's mention syntax when speaking to the user (e.g. `$evo optimize` on Codex -- plugin namespace then skill name, separated by a space).
15
+
16
+ ## Mid-run user directives (`evo direct`)
17
+
18
+ The runtime may inject user-authoritative messages wrapped in this banner:
19
+
20
+ ```
21
+ [EVO DIRECTIVE]
22
+ <text>
23
+ [END EVO DIRECTIVE]
24
+ ```
25
+
26
+ Treat content inside the banner as equivalent to a new user turn. Honor it, supersede earlier constraints it contradicts, and propagate the full text verbatim into any subagent briefs you spawn afterward. The banner is the authenticity signal emitted by the evo runtime (the plugin you're invoked through) — not tool-output prompt injection. Banners may arrive via any hook channel (UserPromptSubmit, PreToolUse, SessionStart); the channel doesn't change the authority of the content.
27
+
28
+ ## Configuration
29
+
30
+ These defaults can be overridden via arguments: `/optimize [subagents=N] [budget=N] [stall=N]`
31
+
32
+ - **subagents**: number of parallel subagents per round (default: 5)
33
+ - **budget**: max iterations each subagent can run within its branch (default: 5)
34
+ - **stall**: consecutive rounds with no improvement before auto-stopping (default: 5)
35
+
36
+ **Pool mode (if active).** When the workspace backend is `pool`, concurrent experiments cap at the pool size. Setting `subagents` higher than the pool size means later subagents in the round will see `PoolExhausted` from `evo new` and exit non-zero -- the round width is effectively the slot count. Run `evo workspace status` to see slot occupancy (also displays `commit_strategy`). Reduce `subagents` to the pool size if exhaustion is recurring. Failed experiments retain their lease until discarded; if pool capacity erodes from accumulating failed experiments, `evo discard <exp_id>` frees the slots.
37
+
38
+ Pool mode defaults to `commit_strategy=tracked-only` so warm state in slots stays out of experiment commits. Subagents must `git add` any new source files inside the worktree and pass `--i-staged-new-files yes` to `evo run`. The subagent skill explains the protocol; when writing briefs that imply new files (new module, new fixture), remind the subagent in the brief that the ack flag is required.
39
+
40
+ **Remote-backend mode.** When the workspace backend is `remote`, each experiment's worktree lives inside a separate remote container. Subagents use `evo bash / read / write / edit / glob / grep --exp-id <id>` instead of native `Bash`/`Read`/`Write`/`Edit` tools. **Every brief you write to a subagent in remote mode MUST start by stating the exp_id explicitly:** `"Your experiment id is exp_NNNN. Pass --exp-id exp_NNNN on every evo command."` This is the only thing that prevents one subagent from accidentally operating on another's container. evo CLI hard-errors if `--exp-id` is missing, but it can't catch a subagent that confidently passes the wrong id; the brief is the discipline.
41
+
42
+ Remote `evo run <exp_id>` is also the recovery command. If a subagent or
43
+ orchestrator was interrupted while an experiment was active, tell the subagent
44
+ to run the same `evo run <exp_id>` again and wait if it prints
45
+ `RECOVERING <exp_id> attempt=N process=... state=...`. That means evo is
46
+ reattaching to the existing remote process and finalizing the original attempt;
47
+ starting a new experiment or discarding the active one is only appropriate after
48
+ evo reports the attempt is unrecoverable.
49
+
50
+ For expensive benchmarks, design recovery around `EVO_CHECKPOINT_DIR`, not
51
+ process checkpoint/restore. evo mirrors checkpoint files into
52
+ `attempts/NNN/checkpoints/` during remote runs and writes `attempt_state.json`
53
+ for phase-level recovery. If the remote container itself dies, arbitrary process
54
+ memory is gone; the benchmark must know how to continue from its checkpoint
55
+ files or the attempt should be treated as `remote_infra_failure`.
56
+
57
+ **Infra setup is not user-invocable.** If a remote provider is missing SDKs, auth, or setup details, read `plugins/evo/skills/infra-setup/references/provider-matrix.md`. It summarizes what each provider actually needs and replaces the old per-provider prompt files.
58
+
59
+ **Runtime recipe/env.** Benchmark runtime is evo configuration, not something subagents should rediscover or copy into worktrees. Use `evo config runtime show` for prepare/before-run/prefix and `evo env show` for redacted env sources. If a run fails because expected runtime setup or env is missing, report it as setup failure or configure it from the orchestrator; do not patch benchmark code to bake in secrets or local paths. Use `evo run <exp_id> --check` for non-committing wiring validation; do not invent ad-hoc validation wrappers.
60
+
61
+ **CLI reference.** If you are unsure which command to use, read `plugins/evo/skills/references/cli-quick-reference.md`. It is the canonical command map; this skill only repeats the high-frequency commands.
62
+
63
+ ## Prerequisites
64
+
65
+ - Workspace must be initialized (`evo status` should succeed)
66
+ - A baseline experiment must be committed (run `/discover` first)
67
+ - All benchmark dependencies must be available in the environment
68
+
69
+ ## Architecture
70
+
71
+ ```
72
+ Orchestrator (this agent):
73
+ - Reads state, identifies failure patterns cross-cutting the tree
74
+ - Writes one brief per subagent: objective + parent + boundaries + pointer traces
75
+ - Verifies briefs are diverse (no two attacking the same surface)
76
+ - Collects results, prunes dead branches, adjusts strategy
77
+
78
+ Subagent A (brief, budget: N iterations):
79
+ - Reads its pointer traces, forms the concrete edit
80
+ - Creates experiment, edits target, runs benchmark, analyzes
81
+ - If budget remains and sees a promising follow-up, continues
82
+ - Can run up to N serial experiments on its own branch
83
+ - Returns: what it tried, what worked, what it learned
84
+
85
+ Subagent B (different brief, budget: N iterations):
86
+ - Same protocol, non-overlapping objective
87
+ ...
88
+ ```
89
+
90
+ Both layers read traces; the depth differs. The orchestrator scans for cross-cutting patterns (which failures are common, which branches plateau) -- enough to pick N non-overlapping briefs. Subagents read their pointer traces in depth, enough to commit to a concrete edit. Structured briefs are what prevent parallel subagents from duplicating each other's work.
91
+
92
+ **Session lineage.** Dispatching a child of a committed experiment forks the parent experiment's own session, not a separately-warmed explorer. The child inherits the parent's full transcript -- reads, edits, benchmark output -- so the prefix cache carries through generations. Explorer warming runs only for children of root, where the parent has no session.
93
+
94
+ **Trace instrumentation style**: `.evo/meta.json`'s `instrumentation_mode` records `sdk` vs `inline`. Subagents must stay consistent with it (see the **evo subagent skill** for details).
95
+
96
+ ## The Loop
97
+
98
+ Repeat until interrupted or stall limit reached:
99
+
100
+ ### 1. Read current state
101
+
102
+ **Step 0 (always, before anything else): ensure you have read `.evo/project.md`.** Its contents define the optimization metric and goal; every pattern you report MUST be tied to a failure mode of that goal. If you haven't read it yet in this session, read it now -- before listing experiments, opening any `outcome.json`, or spawning sub-agents. If the file is missing, stop and tell the user.
103
+
104
+ **Step 0.1 (one-time auto-migration): ensure the workspace knows what runtime you are.** Run `evo host show`. If it prints `<not set>`, the workspace pre-dates the host signature field — declare your runtime now: `evo host set <one of: claude-code, codex, opencode, openclaw, hermes, pi, generic>`. You know what runtime you're operating in (your system prompt, env vars like `CLAUDECODE=1`, your own self-identity); pick the matching value. If genuinely unsure, use `generic`. Do not ask the user — this is a one-time silent migration. Subsequent rounds skip this step (`evo host show` returns the recorded value).
105
+
106
+ ```bash
107
+ evo scratchpad # bounded state summary (tree, frontier, awaiting decision, gates, annotations, what-not-to-try, notes)
108
+ evo status # one-line summary
109
+ evo frontier # explorable nodes ranked by the configured strategy (JSON envelope: {strategy, nodes[{id,score,rank,...}], generated_at})
110
+ evo show <id> # full state of one node (attempts, diffs, annotations, notes, effective gates) -- the cleanest one-node getter
111
+ evo awaiting # evaluated nodes awaiting commit/discard decision
112
+ evo discards [--like <text>] # discarded nodes; useful for "have we tried this before"
113
+ evo notes # all notes (per-node + workspace), recent first
114
+ evo annotations # all annotations (filterable with --task/--exp)
115
+ evo path <id> # root-to-node chain with scores
116
+ evo diff <id> [<other>] # diff vs parent (or between two experiments)
117
+ evo gate list <id> # effective gates for a node (inherited from ancestors)
118
+ evo gate check <id> # run effective gates without benchmark or state mutation
119
+ evo infra log # recorded infra/strategy events (epoch bumps, harness changes)
120
+
121
+ # Settings (read)
122
+ evo config show # everything; use the next three for narrower views
123
+ evo config get <field> # one field
124
+ evo config backend show # current execution backend + provider config
125
+ evo config runtime show # runtime prepare/before-run/prefix recipe
126
+ evo env show # redacted runtime env metadata
127
+ ```
128
+
129
+ ### 2. Analyze state and do structural aggregation
130
+
131
+ From the scratchpad, frontier, traces, and annotations, determine:
132
+ - Which frontier nodes are most promising (`evo frontier` returns them already ranked under the configured strategy -- use its ordering rather than re-ranking; override with `evo frontier --strategy ...` only if you have a specific reason)
133
+ - What failure patterns are most common and impactful
134
+ - What strategies have been tried and their outcomes
135
+ - Which branches are plateauing or exhausted
136
+ - What gates exist on each frontier node (`evo gate list <id>`) -- subagents must satisfy these
137
+
138
+ **Read the "Awaiting Decision" section of the scratchpad.** Evaluated nodes (ran, bad outcome, not yet discarded) are a cross-agent signal: if three subagents in the last round produced evaluated nodes that all failed the same gate, surface the pattern -- maybe the gate is too tight, maybe the approach has a shared flaw. Either tell the next round to avoid it, or propose a brief that attacks it directly. Without this cross-cutting read, each subagent rediscovers the same wall independently.
139
+
140
+ **Structural pass.** For the evaluated nodes this round, load their `outcome.json` files into Python and aggregate: co-occurring `gate_failures`, shared zero-score task IDs in `benchmark.result.tasks`, recurring substrings across `error` fields. (Bulk-reading attempt artifacts under `.evo/run_*/experiments/<exp>/attempts/<NNN>/` is the right tool for this — `evo show <id>` is for one-node introspection, not batch aggregation.)
141
+
142
+ **Emit intersections explicitly.** After computing the per-pattern sets (call them A, B, ...), MUST emit each pairwise intersection `A ∩ B` as a distinct pattern entry whenever at least 2 experiments exhibit both. Intersections carry different strategic implications from their components (compound failures warrant different briefs than single-failure clusters) and do not reconstruct from sub-agent summaries -- this is a parent-level aggregation that must happen inline.
143
+
144
+ **Improvers are a pattern too.** Enumerate the committed improvers (experiments with `outcome=committed` and `score > parent_score`) as a distinct pattern entry: they are candidate parent nodes for next-round branching and feed the brief's *Parent node* field.
145
+
146
+ Hold all these findings; step 4's brief-writing combines them with the scan sub-agents' findings from step 3.
147
+
148
+ ### 3. Spawn scan sub-agents for cross-cutting free-text analysis
149
+
150
+ **Hard rule (primary delegation).** The orchestrator MUST spawn at least one scan sub-agent via your host's parallel-subagent tool in every round before emitting any pattern. This applies to all scan input -- `outcome.json`, `traces/task_*.json`, annotations, and `error` fields alike -- regardless of file size, structure, or whether the orchestrator believes a script would be faster. An inline Python aggregation over `outcome.json` does NOT substitute for delegation; it may supplement sub-agent findings (step 2's structural pass still runs), but step 3's scan sub-agents MUST still run. If you reach step 4 without a completed scan sub-agent call in step 3, you have violated this rule -- stop and spawn one.
151
+
152
+ **Narrow exception (verification).** After scan sub-agents have returned findings, the orchestrator MAY read individual trace files to: verify a specific finding before citing it in a brief, spot-check a pattern the orchestrator is unsure about, or pull a short quote for a brief's Objective or Pointer Traces field. These verification reads must be narrow (<=3 trace files per round, targeted at experiment IDs already surfaced by sub-agents). This exception does NOT let you skip the hard rule above -- it only governs what you may do after sub-agents have already run.
153
+
154
+ Partition the evaluated experiments into batches small enough that each sub-agent can read its batch's traces in one pass. Spawn one scan sub-agent per batch in a **single batch** using your host's parallel-subagent tool (see "Host conventions"). They must execute in parallel, not sequentially.
155
+
156
+ Pass this brief verbatim as the sub-agent's prompt:
157
+
158
+ > You are a read-only evo scan sub-agent. Do not run experiments or edit code.
159
+ >
160
+ > Start by reading `.evo/project.md` to understand the optimization goal and metric. All your findings should be relevant to this goal.
161
+ >
162
+ > Your batch: `[exp_IDs]`.
163
+ >
164
+ > For each experiment, read `outcome.json` and `traces/task_*.json`. Also consider `hypothesis` and prose `error` text.
165
+ >
166
+ > Find patterns that will populate the next round's subagent briefs:
167
+ > - **Shared failure causes** -- root-cause reasons recurring across 2+ experiments (the *why*, not the surface gate name). Feeds brief objectives.
168
+ > - **Wall patterns** -- approaches or gates multiple experiments consistently fail on. Feeds brief boundaries / anti-patterns.
169
+ > - **Compound-failure standouts** -- single experiments hitting multiple failure modes. Feeds brief pointer traces.
170
+ >
171
+ > Prioritize patterns tied to the goal's core failure modes or critical tasks. Deprioritize incidental observations. Skip: trace-shape statistics, fixture-structural facts, hypothesis-string-reuse, or anything the orchestrator can't act on in a brief.
172
+ >
173
+ > If your batch is still too heavy, partition further and spawn scan sub-agents recursively (same brief, smaller batch).
174
+ >
175
+ > Return JSON only: `{"findings": [{"description": "<short>", "experiment_ids": ["exp_XXXX", ...], "evidence": ["<short snippet>", ...]}]}`
176
+ >
177
+ > **Evidence must be verbatim quotes** from outcome.json fields, trace `messages`, or `error` text -- not paraphrases. Each description must be supported by the quoted evidence. **Do not speculate about causal chains** (e.g., "approach X regresses because it removes Y") unless a specific trace message or error field directly states that mechanism. If you cannot cite verbatim evidence for a finding, drop it -- err on under-reporting.
178
+ >
179
+ > Evidence: short quotes (<200 chars each), max 3 per finding.
180
+
181
+ Wait for all scan sub-agents to return. Reconcile near-duplicate findings (`timeout_error` ≈ `error_timeout`) by judgment and combine with the structural-pass findings from step 2.
182
+
183
+ **Verify every pattern before emitting it.** For each pattern in your final output, confirm that at least one reported experiment's outcome.json or trace content contains evidence that directly supports the pattern's description. If you cannot cite a specific field value or quoted message as evidence, drop the pattern. Do not emit speculative causal attributions ("approach X regresses because it removes Y") unless the trace or error text explicitly states that mechanism. This filter applies to both sub-agent findings and your own inline observations.
184
+
185
+ These unified, verified cross-cutting findings feed step 4's brief-writing.
186
+
187
+ ### 4. Write subagent briefs
188
+
189
+ Write **one brief per subagent** with these four fields:
190
+
191
+ 1. **Objective** -- one sentence describing the bottleneck to attack and the evidence for it. Should name *where in the system's behavior* the gain is hiding (e.g., "tool-use error recovery fails after the first bad call across tasks 2, 5, 7") but **must not name specific files, functions, or concrete edits** -- that's the subagent's job after it reads the code.
192
+ 2. **Parent node** -- which experiment to branch from.
193
+ 3. **Boundaries / anti-patterns** -- what this subagent should NOT try, explicitly called out with reasons. Include approaches already tried and discarded (from "What Not To Try"), gates it must not regress, and anything adjacent subagents in this round are doing (so it doesn't duplicate).
194
+ 4. **Pointer traces** -- task IDs the subagent should study first, with a one-line reason each.
195
+
196
+ Be specific and bounded. Vague briefs like "improve accuracy" cause subagents to duplicate each other's work; structured briefs prevent it.
197
+
198
+ **Diversity check (before spawning).** Re-read the N briefs side by side. If two briefs:
199
+ - point at the same objective phrased differently, OR
200
+ - cite overlapping pointer traces without meaningfully different framings, OR
201
+ - attack the same area of the system,
202
+
203
+ merge or re-scope one of them. The frontier/pruning logic handles tree-level exploration vs exploitation algorithmically -- the orchestrator's job is just to make sure the round's N briefs don't collapse onto each other.
204
+
205
+ ### 5. Spawn parallel optimization subagents
206
+
207
+ Spawn all subagents in a **single batch** using your host's parallel-subagent tool. They must execute in parallel, not sequentially -- serial execution defeats the per-round width.
208
+
209
+ Per host, the spawn shape matters because evo's loop depends on *completion notifications* arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):
210
+
211
+ - **claude-code** — fire one `Bash(run_in_background=true)` call per brief. The bash invokes the subagent (the host's `Task` tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a `<task-notification>` at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
212
+ - **codex** — non-blocking subagent invocation; notifications delivered similarly.
213
+ - **hermes** — `terminal(background=true)`; notifications delivered similarly.
214
+ - **openclaw** — `sessions_spawn deliver:false`; notifications delivered similarly.
215
+ - **opencode** — *batch-parallel only* (no background notifications). Fire N `task` calls in ONE assistant message; all `tool_result`s return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
216
+ - **pi** — *batch-parallel via extension*. Pi's default toolkit has no subagent primitive; `evo install pi` ensures the `pi-subagents` package is present, which registers a `subagent` tool. Fire N `subagent` calls in ONE assistant message; all results return together when the slowest finishes (same shape as opencode). If the `subagent` tool isn't available, fall back to running experiments sequentially in your own turn (`evo new` → `evo run` per attempt) and tell the user to `pi install npm:pi-subagents` for proper fanout.
217
+
218
+ Respect the host's concurrency cap; batch if N exceeds it.
219
+
220
+ Pick a faster model for straightforward briefs and a stronger model for harder ones requiring deeper trace analysis, if your host exposes per-call model selection.
221
+
222
+ Each subagent prompt MUST start with the literal sentence:
223
+
224
+ > "First, load and follow the **evo subagent skill** (named `subagent` under the evo plugin in your host's skill registry — use your host's skill loader, not a filesystem path). Allocate your experiment via `evo new --parent <id>`, edit inside the returned worktree, evaluate via `evo run <exp_id>`. Do not skip these steps even if the brief looks simple."
225
+
226
+ Then append:
227
+ - The four-field brief verbatim (objective, parent, boundaries/anti-patterns, pointer traces)
228
+ - The iteration budget
229
+ - A one-paragraph scratchpad summary (current best score, frontier nodes, recent failures) for context
230
+
231
+ The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.
232
+
233
+ ### 6. Collect results and update state
234
+
235
+ After all subagents complete:
236
+
237
+ - Review each subagent's summary
238
+ - Record the round's best score and compare to the previous best
239
+ - If no subagent improved the score, increment the stall counter
240
+ - If any improved, reset the stall counter
241
+ - Check if subagents added new gates -- note these in your state tracking
242
+ - If multiple experiments failed the same gate, consider whether the gate is too restrictive or the briefs were aimed at the wrong surface
243
+
244
+ **Cross-cut the round's evaluated nodes.** Before moving on, read `experiments/<id>/attempts/NNN/outcome.json` for each evaluated node from this round. The structured `gates[]` entries and `benchmark.result` let you spot shared failure modes the subagent summaries may have glossed over (e.g., three different subagents produced evaluated nodes whose gate_failures all included `refund_flow` -- that's a structural constraint the next round must confront, not three independent bad hypotheses).
245
+
246
+ Prune dead branches where 3+ children all regressed:
247
+ ```bash
248
+ evo prune <exp_id> --reason "exhausted: N children all regressed"
249
+ ```
250
+
251
+ `evo prune` accepts `committed` or `evaluated` nodes. Use it when you want
252
+ to mark a lineage exhausted while preserving the result for later review or
253
+ reference. Prune keeps the git commit alive (anchored at `refs/evo-anchor/<run>/<exp>`)
254
+ so the node can be restored if needed. **Never `evo discard` a committed
255
+ node** — it would orphan the branch ref and risk losing the commit.
256
+
257
+ If a previously-pruned (or discarded-then-restored) node is worth revisiting:
258
+ ```bash
259
+ evo restore <exp_id>
260
+ ```
261
+ Flips status back to committed; recreates the regular branch from the anchor
262
+ ref so future `evo new --parent <id>` works. For discarded nodes whose commit
263
+ is no longer reachable in git (rare; needs `git gc --prune=now` after the
264
+ discard), restore errors and points at `experiments/<id>/attempts/NNN/diff.patch`
265
+ for manual replay.
266
+
267
+ Update notes with cross-cutting learnings:
268
+ ```bash
269
+ evo set <exp_id> --note "key insight from round N"
270
+ ```
271
+
272
+ ### 7. Continue or stop
273
+
274
+ **Continue** if:
275
+ - Stall counter < stall limit
276
+ - User hasn't interrupted
277
+ - Score hasn't reached the theoretical maximum
278
+
279
+ **Stop** if:
280
+ - Stall counter >= stall limit (N consecutive rounds with no improvement)
281
+ - Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
282
+ - User interrupted
283
+
284
+ On stop, print a final summary:
285
+ - Best score achieved and experiment ID
286
+ - Total experiments run across all rounds
287
+ - The winning diff: `evo diff <best_exp_id>`
288
+ - Suggested next steps if the score hasn't converged
289
+
290
+ Go back to step 1.
291
+
292
+ ## Resetting the eval epoch
293
+
294
+ `evo infra event -m "<reason>" --breaking` bumps `current_eval_epoch` and blocks
295
+ non-root `evo run` calls until a new root baseline commits. Old experiments
296
+ stay in the tree but are excluded from frontier and best-score lookups via
297
+ their epoch tag.
298
+
299
+ Use it when the benchmark itself is wrong epoch-wide -- score formula bug,
300
+ held-out gate revealing systematic gaming, propagated instrumentation drift.
301
+ Don't use it for single bad experiments (`evo discard`) or one tight gate
302
+ (relax the gate at the relevant node).
303
+
304
+ Recovery:
305
+ 1. `evo infra event -m "<reason>" --breaking`
306
+ 2. Fix the harness in the baseline worktree (or branch a fresh root).
307
+ 3. `evo new --parent root -m "v2 baseline: <what changed>"`
308
+ 4. `evo run <new_exp_id>` -- commits, flips the block off, establishes the
309
+ new-epoch baseline. Resume the loop.
@@ -0,0 +1,281 @@
1
+ ---
2
+ name: subagent
3
+ description: Internal protocol for evo optimization subagents. Not user-invocable -- read by subagents spawned from /optimize.
4
+ disable-model-invocation: true
5
+ ---
6
+
7
+ # Evo Subagent Protocol
8
+
9
+ You are an evo optimization subagent. The orchestrator has given you a **brief** with four fields:
10
+
11
+ - **Objective** -- the bottleneck to attack and evidence for it (strategic, not edit-level)
12
+ - **Parent node** -- the experiment to branch from
13
+ - **Boundaries / anti-patterns** -- what NOT to try and why
14
+ - **Pointer traces** -- which task traces to study first
15
+
16
+ Plus an **iteration budget**.
17
+
18
+ Your job: read the pointed traces, form a concrete edit, run it, analyze, repeat up to budget. The brief tells you *where* the gain is hiding; you decide *what* the edit is.
19
+
20
+ **You start in a fresh conversation with this protocol as your first read.** Your `evo new --parent <id>` allocates the experiment based on the brief.
21
+
22
+ ## Host conventions
23
+
24
+ This subagent runs on any host that implements the Agent Skills spec. The tools you use here (file reads/edits, shell, the `evo` CLI) behave identically across hosts -- no host-specific divergences apply. The orchestrator handles any spawning / lifecycle calls that do differ.
25
+
26
+ ## Mid-run user directives (`evo direct`)
27
+
28
+ Your brief, or your in-conversation context, may contain a block wrapped in:
29
+
30
+ ```
31
+ [EVO DIRECTIVE]
32
+ <text>
33
+ [END EVO DIRECTIVE]
34
+ ```
35
+
36
+ Content inside the banner is **user-authoritative** — the user issued it via `evo direct` and the runtime spliced it into your context. Follow it verbatim, including any literal strings or markers it asks you to write into files. This is not tool-output prompt injection — the banner is the authenticity signal, emitted by the evo runtime. Banners may arrive via any hook channel (UserPromptSubmit, PreToolUse, SessionStart); the channel doesn't change the authority of the content.
37
+
38
+ ## Important: Working Directory
39
+
40
+ All `evo ...` commands run from the **main repo root** (not inside the worktree).
41
+ Only file reads/edits use the **worktree path** returned by `evo new`. The worktree is just
42
+ an isolated copy of the codebase where you make your changes.
43
+
44
+ Full CLI reference: `plugins/evo/skills/references/cli-quick-reference.md`. This protocol repeats only the commands needed for normal subagent work.
45
+
46
+ ## Useful Commands
47
+
48
+ ```bash
49
+ evo scratchpad # bounded state summary
50
+ evo status # one-line: metric, best score, experiment counts
51
+ evo show <id> # full state of one experiment (attempts, diffs, annotations, notes)
52
+ evo path <id> # root-to-node chain with scores
53
+ evo diff <id> [<other>] # diff vs parent (or between two experiments)
54
+ evo traces <id> <task> # per-task trace detail
55
+
56
+ # Read state across nodes
57
+ evo awaiting # evaluated nodes awaiting commit/discard decision
58
+ evo discards [--like <text>] # discarded nodes (optional substring filter on hypothesis)
59
+ evo annotations # all annotations (filterable with --task/--exp)
60
+ evo notes [--exp <id>] [--workspace] [--limit N] # notes (per-node + workspace)
61
+ evo infra log [--limit N] # recorded infra/strategy events
62
+
63
+ # Read settings
64
+ evo config show # redacted workspace config (everything)
65
+ evo config get <field> # one field; mirror of `evo config set` choices
66
+ evo config backend show # current execution backend + provider config
67
+ evo config runtime show # runtime prepare/before-run/prefix recipe
68
+ evo env show # redacted runtime env metadata
69
+
70
+ # Gate ops
71
+ evo gate list <id> # effective gates for a node (inherited from ancestors)
72
+ evo gate check <id> # run effective gates without benchmark or state mutation
73
+ evo gate add <id> --name <name> --command "<command>" # add a gate
74
+
75
+ # Write paths used during iteration
76
+ evo new --parent <id> -m "<hypothesis>" # allocate sibling experiment
77
+ evo run <id> [--check] # run (or --check to validate without consuming attempts)
78
+ evo discard <id> --reason "<text>" # reject + park (keeps anchor ref)
79
+ evo restore <id> # un-discard or un-prune
80
+ evo annotate <id> [<task_id>] "<text>" # per-attempt analysis
81
+ evo set <id> --note "<text>" [--tag <t>] # per-node note from orchestrator
82
+ evo note "<text>" # workspace-level cross-cutting note
83
+ ```
84
+
85
+ For the read/write policy across worktree files, `.evo/` artifacts, and config,
86
+ see `references/cli-quick-reference.md` "Reading workspace state".
87
+
88
+ ## First Steps
89
+
90
+ 1. Read `.evo/project.md` to understand the target, what can be changed, and how to interpret results.
91
+ 2. Read the scratchpad for current state: `evo scratchpad`
92
+ It surfaces: best path (★-marked in the tree), frontier (strategy-ranked branchable nodes), evaluated nodes awaiting decision, gates, annotations, what not to try, infra events, and notes. The Drill-downs section at the bottom lists the read-only commands for going deeper on any section.
93
+ 3. Study the pointer traces from your brief:
94
+ ```bash
95
+ evo traces <exp_id> <task_id>
96
+ ```
97
+ Understand the failure patterns your objective points at.
98
+
99
+ ## Iteration Loop
100
+
101
+ Repeat up to **budget** times:
102
+
103
+ ### 0. Re-read shared state (skip on first iteration)
104
+
105
+ Before formulating your next edit, refresh your view of what other agents have done:
106
+
107
+ ```bash
108
+ evo status
109
+ evo scratchpad
110
+ ```
111
+
112
+ Check for:
113
+ - **Best score reached ceiling** (1.0 for max, 0.0 for min) -- if so, stop and report.
114
+ - **New "What Not To Try" entries** -- avoid duplicating failed approaches from other agents.
115
+ - **New "Awaiting Decision" entries** (evaluated nodes from other agents) -- if a sibling agent already hit the same gate or regression pattern you were about to try, read their `attempts/NNN/outcome.json` and diff before duplicating the attempt.
116
+ - **New annotations** -- learn from others' findings on failing tasks.
117
+ - **Score changes** -- another branch may have fixed the task you were about to work on. Adjust or stop.
118
+
119
+ ### 1. Formulate the edit
120
+
121
+ Starting from the brief's objective and the traces you read, form a concrete edit hypothesis. It must name:
122
+ - **Where** in the code: file, function, or behavior to change.
123
+ - **What** changes: the minimal specific edit (not "improve X" but "inject the last error into the next turn prefixed with 'Previous attempt failed:', cap 2 retries").
124
+ - **Predicted effect**: which task or behavior this should change and why.
125
+
126
+ If your edit hypothesis reads like the orchestrator's objective (no file, no concrete change), you haven't done the work -- keep reading traces and code. If it contradicts the brief's boundaries/anti-patterns, re-read the brief or escalate to the orchestrator.
127
+
128
+ ### 2. Create experiment
129
+
130
+ ```bash
131
+ evo new --parent <parent_id> -m "<your hypothesis>"
132
+ ```
133
+
134
+ Parse the JSON output to get the experiment ID and worktree path.
135
+
136
+ If you only need to validate benchmark/gate wiring before a real attempt, use `evo run <exp_id> --check`. It writes check artifacts but does not commit, evaluate, or consume retry budget.
137
+
138
+ ### 3. Edit the target
139
+
140
+ How you edit depends on the workspace's execution backend (the `"worktree"` path returned by `evo new` tells you which case you're in):
141
+
142
+ **Local backends (`--backend worktree` or `--backend pool`):** the worktree is a real path on this machine. Use your native `Read`/`Write`/`Edit` tools on that path directly. Example: `"target": "/path/to/.evo/run_0000/worktrees/exp_0005/src/agent.py"` -- read and edit that exact path.
143
+
144
+ **Remote backend (`--backend remote`):** the worktree path looks like `/workspace/repo` and lives **inside a remote container**, not on this machine. Your native `Read`/`Write`/`Edit` would write to a non-existent local path and silently fail. Use `evo` workspace-op subcommands instead:
145
+
146
+ ```bash
147
+ evo bash --exp-id <YOUR_EXP_ID> "<command>"
148
+ evo read --exp-id <YOUR_EXP_ID> <path>
149
+ evo write --exp-id <YOUR_EXP_ID> <path> --content "<text>" # or pipe via stdin
150
+ evo edit --exp-id <YOUR_EXP_ID> <path> --old "<s>" --new "<s>" [--replace-all]
151
+ evo glob --exp-id <YOUR_EXP_ID> "<pattern>" [--path <dir>]
152
+ evo grep --exp-id <YOUR_EXP_ID> "<pattern>" [--path <dir>]
153
+ ```
154
+
155
+ `--exp-id` is **required** on every workspace op. The orchestrator gives you your exp_id at the start of the brief; pass it on every call. The check is strict by design: multiple subagents run concurrent experiments in different containers, and a silent default would let one subagent operate on another's container by accident.
156
+
157
+ For multi-line edits, `evo edit --json-stdin` reads `{"old":...,"new":...,"replace_all":bool}` from stdin (avoids shell escaping for newlines / quotes).
158
+
159
+ You may edit anything within the target scope. Do NOT modify benchmark, gate, or framework code.
160
+
161
+ ### 4. Run the experiment
162
+
163
+ ```bash
164
+ evo run <exp_id>
165
+ ```
166
+
167
+ This runs benchmark + gate and prints the result.
168
+
169
+ In remote-backend workspaces, if a prior `evo run <exp_id>` was interrupted
170
+ or the experiment is still `active`, run `evo run <exp_id>` again first. That
171
+ is the recovery path: evo will try to attach to the existing remote process and
172
+ finalize the same attempt instead of starting attempt 002. If the output prints
173
+ `RECOVERING <exp_id> attempt=N process=... state=...`, wait for that command to
174
+ finish. Do not discard the active experiment or create a replacement unless evo
175
+ reports it is unrecoverable or the orchestrator explicitly tells you to.
176
+
177
+ Benchmarks also receive `EVO_CHECKPOINT_DIR`. Expensive benchmarks should write
178
+ portable progress files there. evo mirrors that directory back into
179
+ `attempts/NNN/checkpoints/` during remote runs and records phase progress in
180
+ `attempt_state.json`. This is the recovery boundary for container death: evo can
181
+ restart from benchmark-owned checkpoint files, but it does not freeze/restore an
182
+ arbitrary Linux process.
183
+
184
+ **If the workspace was initialized with `commit_strategy=tracked-only` (the default for `--backend pool`):** `evo run` only commits modifications to *tracked* files. New files require an explicit `git add` from inside the worktree, then a shisa-kanko ack on the run command:
185
+
186
+ ```bash
187
+ # inside the worktree -- only for new SOURCE files you want in the commit:
188
+ cd <worktree_path> && git add path/to/new_file.py
189
+
190
+ # then, from the main repo:
191
+ evo run <exp_id> --i-staged-new-files yes
192
+ ```
193
+
194
+ The ack flag is required when the worktree has any untracked, non-gitignored file. Without it, `evo run` errors closed and lists the files. For each file, decide: source (then `git add`) or warm state (leave untracked -- it persists in the slot for future experiments). Then re-run with `--i-staged-new-files yes`. The flag value must be exactly `yes`. In `commit_strategy=all` workspaces (default for `--backend worktree`) the flag is a silent no-op; safe to always pass.
195
+
196
+ ### 5. Analyze the result
197
+
198
+ `evo run` prints one of three outcomes:
199
+
200
+ - **`COMMITTED`** (score improved + gates passed): node locked in. Read failing task traces to find the next weakness. Use this experiment as the parent for your next iteration.
201
+
202
+ - **`EVALUATED`** (score regressed or gate failed): ran cleanly but bad outcome. **You decide next step.** Read:
203
+ - `experiments/<id>/attempts/NNN/outcome.json` -- structured record: `score` vs `parent_score`, per-gate `passed`/`returncode`, benchmark result, error. Tells you *what* broke.
204
+ - `experiments/<id>/attempts/NNN/diff.patch` and `benchmark.log` -- tell you *why*.
205
+
206
+ Then either:
207
+ - Fixable edit-bug (off-by-one, wrong signature): edit the worktree and `evo run <id>` again. Bounded by `max_attempts` (default 3). Before retrying, compare your planned edit against the previous attempts' `outcome.json` on this same node -- if two earlier attempts hit the same gate, a small tweak won't fix it. When the cap is hit, run is refused -- you must discard.
208
+ - Hypothesis is wrong, no fix: `evo discard <id> --reason "..."` and branch a new experiment from the **original parent**.
209
+
210
+ - **`FAILED`** (infra error, non-zero exit, timeout): couldn't evaluate. Doesn't consume the retry budget.
211
+ - Transient / fixable locally: retry.
212
+ - `remote_infra_failure:...`: remote container or agent infrastructure failed. Report it to the orchestrator unless your brief explicitly says to retry infra failures.
213
+ - Structural (benchmark broken, evo misconfigured): report to orchestrator and stop.
214
+ - Not worth fixing: `evo discard <id> --reason "..."`.
215
+
216
+ ### 6. Annotate
217
+
218
+ ```bash
219
+ evo annotate <exp_id> "<what you changed, what happened, and why>"
220
+ ```
221
+
222
+ Always annotate so other agents can learn from your experiments.
223
+
224
+ ### 6b. Add gates for fixed behaviors
225
+
226
+ When you fix a critical, easy-to-regress behavior, lock it in as a gate so future experiments on this branch can't break it:
227
+
228
+ ```bash
229
+ evo gate add <exp_id> --name "social_eng_resistance" --command "python3 {worktree}/benchmark.py --target {target} --task-ids 3 --min-score 0.9"
230
+ ```
231
+
232
+ Good candidates: a specific benchmark task that was hard to fix, a test for a critical policy rule, a smoke test for a fragile behavior. The gate command must exit non-zero when the protected behavior regresses; a bare benchmark invocation that prints a low score but exits 0 is decorative and should not be registered. Do NOT gate every passing task -- that over-constrains the search.
233
+
234
+ ### 7. Decide: continue or stop
235
+
236
+ Continue if budget remains AND (last outcome was committed, OR you have a meaningfully different idea after an evaluated/discarded outcome). When continuing after a committed experiment, update your parent to the newly committed ID.
237
+
238
+ Stop if budget exhausted, infra failure, or you've exhausted variations with no improvement.
239
+
240
+ ## Enriching traces
241
+
242
+ Check `.evo/meta.json` for `"instrumentation_mode"` (`"sdk"` or `"inline"`) to see which style the benchmark uses -- **stay consistent with that choice across iterations; do not flip styles mid-run.**
243
+
244
+ Trace quality is part of the benchmark contract. After a failed baseline or failed task, the orchestrator should be able to reconstruct what happened using only `evo traces <exp_id> <task_id>`. If not, the trace logging is too thin.
245
+
246
+ - **SDK mode** (`from evo_agent import Run`): read `plugins/evo/skills/references/agent-sdk-reference.md`, then enrich traces by adding `run.log(task_id, ...)` calls or extra fields to `run.report()`.
247
+ - **Inline mode** (benchmark has local `log_task`/`logTask` helpers): add fields to the trace dict built inside `log_task()`.
248
+ - **LLM / agent benchmarks**: log the task input, observation/frame summary, prompt or message summary, model/tool response, selected action, retries/errors, and final task outcome. If the project already has a separate recorder, decide whether evo traces mirror the important fields or whether the recorder artifact is explicitly linked from the evo trace.
249
+
250
+ The trace format is forward-compatible -- extra fields are preserved. Do NOT change the score computation or gate logic -- only add observability.
251
+
252
+ ## Rules
253
+
254
+ - Do NOT run `evo init` or `evo reset`
255
+ - `evo discard <your_exp_id> --reason "..."` is your explicit "abandon" action — use it for any *non-committed* node you've decided not to pursue further (pre-run realization, evaluated with a bad hypothesis, or unfixable infra failure). Discard deletes the worktree and branch; the node and its per-attempt artifacts stay in `.evo/` as a record of what was tried.
256
+ - If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune`.
257
+ - If `evo discard` errors with **"cannot discard active node ... pass --force"** — the run is still in flight. Wait for it to finish; don't `--force` unless you know what you're doing (the running process can still write a final outcome that contradicts the discard).
258
+ - If `evo discard` errors with **"cannot discard ... has non-discarded children"** — sibling/child experiments depend on this node's parent reference. Discard or commit-and-prune those first.
259
+ - Do NOT copy `.env` files, bake secrets into source, or hard-code local runtime paths. Runtime setup/env is configured by the orchestrator (`evo config runtime ...`, `evo env ...`) and injected into benchmark/gate processes. If a missing dependency, setup step, or key blocks evaluation, report setup failure.
260
+ - Always annotate your experiments, especially before discarding — the annotation is what persists after the worktree is gone.
261
+ - Stay within your brief's objective and boundaries -- don't drift into unrelated changes
262
+
263
+ ## When Done
264
+
265
+ Return a structured summary:
266
+
267
+ ```
268
+ ## Results
269
+ - Experiments: <list of exp IDs with scores and status>
270
+ - Best: <exp_id> with score <N>
271
+
272
+ ## Changes
273
+ - <what you changed in each experiment, briefly>
274
+
275
+ ## Learnings
276
+ - <what failure patterns you observed>
277
+ - <what worked and what didn't>
278
+
279
+ ## Suggestions
280
+ - <ideas for the next round that you didn't get to try>
281
+ ```