@evo-hq/pi-evo 0.5.0-alpha.11 → 0.5.0-alpha.13

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@evo-hq/pi-evo",
- "version": "0.5.0-alpha.11",
+ "version": "0.5.0-alpha.13",
  "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
  "publishConfig": {
  "access": "public"
@@ -2,7 +2,7 @@
  name: discover
  description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
  argument-hint: <optional context about what to optimize>
- evo_version: 0.5.0-alpha.11
+ evo_version: 0.5.0-alpha.13
  ---
 
  # Discover
@@ -116,20 +116,20 @@ evo --version
  The output must be exactly:
 
  ```
- evo-hq-cli 0.5.0-alpha.11
+ evo-hq-cli 0.5.0-alpha.13
  ```
 
  Three outcomes:
 
  1. **Matches exactly** — continue to step 1.
  2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
- > Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.11`). Run:
+ > Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.13`). Run:
  > ```
- > uv tool install --force evo-hq-cli==0.5.0-alpha.11
+ > uv tool install --force evo-hq-cli==0.5.0-alpha.13
  > ```
  > Then re-invoke this skill.
  3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
- > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.11` (or `pipx install evo-hq-cli==0.5.0-alpha.11`). Then re-invoke this skill.
+ > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.13` (or `pipx install evo-hq-cli==0.5.0-alpha.13`). Then re-invoke this skill.
 
  Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
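
The three outcomes of the version check above can be sketched as a small classifier. This is an illustration only: `classifyEvoVersion` and its return labels are hypothetical names, not part of the evo CLI or the skill, and trimming surrounding whitespace before comparing is an assumption.

```javascript
// Hypothetical helper (not part of evo): classify the text printed by `evo --version`.
// `expected` is the version pinned by the skill, e.g. "0.5.0-alpha.13".
function classifyEvoVersion(output, expected) {
  const line = String(output || '').trim(); // assumption: ignore surrounding whitespace
  if (line === `evo-hq-cli ${expected}`) return 'match';   // outcome 1: continue
  if (line.startsWith('evo-hq-cli ')) return 'drift';      // outcome 2: version mismatch
  return 'missing'; // outcome 3: command not found, or an unrelated `evo` package
}
```

Only the `match` label lets the skill continue; the other two map to the user-facing messages quoted above, with no auto-install attempt.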
135
135
 
@@ -308,10 +308,10 @@ Dashboard live: http://127.0.0.1:8080 (pid 12345)
  **Benchmark commands must be eval-only.** Do NOT wrap training and evaluation into a single benchmark command. If your benchmark command runs training before scoring, every gate revalidation and every `evo run --check` retrains from scratch, and the experiment budget burns on duplicated training instead of new experiments. Training is a separate step the agent invokes BEFORE `evo run`:
 
  1. The agent makes changes (data curation, hyperparameter selection, technique choice, training code edits).
- 2. The agent runs training to produce a checkpoint at the experiment worktree's `final_model/` (or wherever the technique's recipe in `evo:finetuning/references/glue.md` specifies).
- 3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced checkpoint and emits a score.
+ 2. The agent runs the build/training step to produce its artifact — a checkpoint, adapter, index, whatever the recipe makes. Write it to `EVO_CHECKPOINT_DIR` (durable: survives between-attempt cleanup and discard) and declare it in the benchmark result's `artifacts` field so it's preserved + reusable; the per-technique I/O contract is in `evo:finetuning/references/glue.md`. Never hardcode a fixed name like `final_model/`.
+ 3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced artifact and emits a score.
 
- The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs train+eval together with no eval-only mode, wrap it: add a `--skip-train` flag, or have the wrapper detect an existing checkpoint at `final_model/` and short-circuit the train step. Without this, evo's gate-recheck and re-score mechanics retrain repeatedly and the budget evaporates.
+ The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs build+eval together with no eval-only mode, wrap it: add a `--skip-build` flag, or have the wrapper detect an existing artifact (under `EVO_CHECKPOINT_DIR`) and short-circuit the build step. Without this, evo's gate-recheck and re-score mechanics rebuild repeatedly and the budget evaporates.
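
A minimal sketch of the short-circuit described in the new text, assuming a Node-based wrapper. `shouldSkipBuild` and the `artifactName` parameter are hypothetical; only `EVO_CHECKPOINT_DIR` and the `--skip-build` flag come from the text above. The artifact name is supplied by the caller so nothing is hardcoded.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical wrapper logic: decide whether the build/training step can be skipped.
// EVO_CHECKPOINT_DIR comes from the text above; `artifactName` is an assumed,
// recipe-specific name passed in by the caller rather than hardcoded.
function shouldSkipBuild(env, artifactName, argv = []) {
  if (argv.includes('--skip-build')) return true; // explicit eval-only request
  const dir = env.EVO_CHECKPOINT_DIR;
  if (!dir) return false; // no durable dir configured: cannot detect an artifact
  return fs.existsSync(path.join(dir, artifactName));
}
```

A wrapper would call this with `process.env` and `process.argv.slice(2)` and go straight to scoring when it returns `true`.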
 
  **Runtime environment.** If the benchmark needs keys or other runtime variables, configure them through evo rather than copying `.env` into worktrees or hand-editing `config.json`:
 
@@ -323,6 +323,16 @@ evo env show
 
  Values are resolved fresh by the orchestrator on each `evo run`. Config stores dotenv source metadata and key names, not secret values. The benchmark and gates receive the resolved env; gates do not receive `EVO_*` artifact variables.
 
+ ### 7a. Record the task category (`task-skills`)
+
+ You already decided what's being optimized (steps 2–4). Record the evo category skill(s) a builder should load for it, so every executing agent — prose subagent or workflow lane — loads the right method knowledge instead of rediscovering it each round:
+
+ ```bash
+ evo config set task-skills finetuning # any weight-update / training task
+ ```
+
+ Rule of thumb: if the optimization updates model weights (SFT / LoRA / DPO / RL / continued-pretraining), set `task-skills finetuning`. **Leave it unset** for prompt / code / config / harness optimization — the subagent protocol already covers those; only set it when a dedicated category skill applies. Use a comma-separated list if more than one applies. Mirror the choice in `.evo/project.md` (step 12) so it's human-readable and survives as the fallback if config is ever cleared.
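
For illustration, a comma-separated `task-skills` value could be turned into the string array that the workflow state expects roughly as follows. `parseTaskSkills` is a hypothetical helper, and treating an unset value as blank text is an assumption about what `evo config get task-skills` prints.

```javascript
// Hypothetical parser (not part of evo) for the comma-separated `task-skills` value.
// Assumption: an unset key yields blank output, which maps to an empty list.
function parseTaskSkills(raw) {
  return String(raw || '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

An empty result corresponds to the "leave it unset" case, where the agent falls back to inferring the category from `.evo/project.md`.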
+
  ## 8. Set up gates
 
  Gates inherit down the experiment tree -- children automatically get all ancestor gates.
@@ -1,7 +1,7 @@
  ---
  name: infra-setup
  description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
- evo_version: 0.5.0-alpha.11
+ evo_version: 0.5.0-alpha.13
  ---
 
  # Infra Setup
@@ -2,7 +2,7 @@
  name: optimize
  description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
  argument-hint: "[subagents=N] [budget=N] [stall=N]"
- evo_version: 0.5.0-alpha.11
+ evo_version: 0.5.0-alpha.13
  ---
 
  Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
@@ -102,17 +102,19 @@ evo defaults get subagents-only --json
 
  As your **very first actions, before the loop**, resolve each and arm it: run `evo autonomous on` / `evo subagents-only on` when it resolves on, or `evo autonomous off` / `evo subagents-only off` when an explicit instruction or stored default turned it off. If a behavior resolves off — whether from the user's instruction this run or a stored default — say so in your opening message (e.g. "autonomous off — running one round at a time, as you asked") so it's never invisible.
 
- **Orchestrator driver (Claude Code only).** evo can drive the loop two ways: the prose loop below (default, every host), or a deterministic **dynamic workflow** (Claude Code only, opt-in). Resolve which as part of your very first actions:
+ **Orchestrator driver.** evo drives the loop two ways: a deterministic **dynamic workflow** (Claude Code only) or the **prose loop** below (every host). **On Claude Code the workflow is the DEFAULT: use it whenever it's available.** Resolve which as part of your very first actions:
 
- 1. `evo host show` — must be `claude-code` for the workflow driver. If it prints `<not set>` (a pre-host workspace), determine your actual runtime from your own context (system prompt, env such as `CLAUDECODE=1`, self-identity): **only if you are genuinely Claude Code**, do the one-time host migration now (`evo host set claude-code`) and continue; if you are any other runtime, do NOT stamp the host here — leave it for Step 0.1 to record and use the prose loop. Any non-`claude-code` host uses the prose loop.
- 2. `evo config get default-orchestrator` — `workflow` selects the workflow driver; anything else (including unset) resolves to `prose`. An explicit user instruction this run still wins.
+ 1. `evo host show` — the workflow driver requires `claude-code`. If it prints `<not set>` (a pre-host workspace), determine your actual runtime from your own context (system prompt, env such as `CLAUDECODE=1`, self-identity): **only if you are genuinely Claude Code**, do the one-time host migration now (`evo host set claude-code`) and continue; if you are any other runtime, do NOT stamp the host here — leave it for Step 0.1 and use the prose loop.
+ 2. `evo config get default-orchestrator` — `prose` is an explicit **opt-out** (honor it: use the prose loop). `workflow` **or unset** resolves to the workflow driver on Claude Code. An explicit user instruction this run still wins.
 
- If host is `claude-code` **and** the value is `workflow` **and** the Workflow tool is available in this session, do NOT drive the loop turn-by-turn. Launch the bundled workflow once instead:
+ **Use the workflow** when host is `claude-code`, the value is not explicitly `prose`, AND the **Workflow tool is actually present in your available tools this session** — this is the default path, not opt-in. The availability check is load-bearing: **older Claude Code builds do not ship the Workflow tool**, so verify it's really in your toolset; do not assume it exists from the host alone. When (and only when) you will actually launch it, FIRST persist the choice so the rest of evo agrees (`evo config get` reflects it, and the autonomous stop-nudge auto-suppresses under the workflow): run `evo config set default-orchestrator workflow`. Then launch it once — do NOT drive the loop turn-by-turn:
 
  - Call the **Workflow** tool with `scriptPath: ${CLAUDE_PLUGIN_ROOT}/skills/optimize/workflows/evo-optimize.js` and `args: {pluginRoot: "${CLAUDE_PLUGIN_ROOT}", subagents: <N>, budget: <N>, stall: <N>}`, using the round sizing you resolved above. **Pass all four keys explicitly — never omit one.** For `stall`, use the user's `/optimize stall=N` override if given, else the default 5. (The workflow's stop condition is the stall limit, so a dropped `stall` silently reverts it to 5.)
- - Report the returned `runId` and tell the user to watch progress with `/workflows`. The workflow runs the round loop itself (orient → mandatory scan + cross-history axis check → ideators on stall/periodic → briefs → fan-out + verify → collect → frontier-select → stall); you do **not** execute "The Loop" section below.
+ - Report the returned `runId` and tell the user to watch progress with `/workflows`. The workflow runs the round loop itself (orient → mandatory scan + cross-history axis check → ideators on stall/periodic → briefs → fan-out + verify → collect → frontier-select → stall) plus the concurrent meta controller; you do **not** execute "The Loop" section below, and you do **not** need autonomous mode (the workflow self-drives; its stall limit is the stop).
 
- Otherwise (any non-`claude-code` host, `default-orchestrator` unset/`prose`, or the Workflow tool unavailable), ignore this and follow **The Loop** below as the canonical driver. The workflow is only an execution strategy over the same `evo` CLI; gates, frontier, dashboard, and recovery are identical either way.
+ Use **The Loop** below only when the workflow can't drive: host is not `claude-code`, `default-orchestrator` is explicitly `prose`, or the Workflow tool is unavailable (e.g. an older Claude Code build). The workflow is only an execution strategy over the same `evo` CLI; gates, frontier, dashboard, and recovery are identical either way.
+
+ **Reconcile config when you fall back to prose.** The stop-nudge that drives the prose loop is auto-suppressed whenever `default-orchestrator` is `workflow`. So if you fall back to the prose loop on Claude Code because the Workflow tool isn't available (older build) while `default-orchestrator` is still `workflow` from a prior run, you MUST set it back — `evo config set default-orchestrator prose` — and arm autonomous as usual. Otherwise the prose loop's stop-nudge stays suppressed and the run stalls after one round. Invariant to preserve: `default-orchestrator=workflow` in config iff the workflow is actually the driver this run.
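
The driver-resolution rules in the added text can be read as a small decision function. The sketch below is one interpretation, not evo's implementation: `resolveDriver`, its parameter names, and the `persist` field are hypothetical, and it assumes an explicit user instruction can force the prose loop but cannot force the workflow when the host or tool check fails.

```javascript
// Hypothetical decision helper (not part of evo) mirroring the driver rules above.
// Inputs: `host` from `evo host show`, `defaultOrchestrator` from
// `evo config get default-orchestrator`, `workflowToolAvailable` from the session's
// tool list, and an optional explicit `userChoice` ('prose' | 'workflow').
function resolveDriver({ host, defaultOrchestrator, workflowToolAvailable, userChoice }) {
  const canWorkflow = host === 'claude-code' && workflowToolAvailable === true;
  // Unset or `workflow` resolves to the workflow; only explicit `prose` opts out.
  const wanted = userChoice || (defaultOrchestrator === 'prose' ? 'prose' : 'workflow');
  const driver = wanted === 'workflow' && canWorkflow ? 'workflow' : 'prose';
  // Invariant: config says `workflow` iff the workflow is actually the driver.
  let persist = null; // value to pass to `evo config set default-orchestrator`, if any
  if (driver === 'workflow') persist = 'workflow';
  else if (defaultOrchestrator === 'workflow') persist = 'prose';
  return { driver, persist };
}
```

The `persist` value captures the reconcile step: a stale `workflow` setting is reset to `prose` on fallback so the stop-nudge is not left suppressed.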
 
  **Autonomous mode.** Off lets you stop naturally at a turn boundary — finish a round, report, and stop. On arms the stop-nudge: at every turn boundary you are re-prompted to keep driving the loop until the **stall** limit is hit or the user interrupts. Without it, the loop does NOT force-continue across turn boundaries. To stop an autonomous run, the user runs `evo autonomous off` or `evo exit-optimize-mode`.
 
@@ -5,7 +5,7 @@
  * opt-in, Claude-Code-only driver; the prose skill remains the canonical, host-agnostic
  * default. The workflow encodes the loop CONTROL: while/stall, mandatory scan + cross-history
  * axis check, research escalation (ideators on stall / every ~5 commits), brief + diversity,
- * fan-out + verify, collect + frontier-select. A concurrent ANALYST thread (Opus, self-paced,
+ * fan-out + verify, collect + frontier-select. A concurrent META thread (Opus, self-paced,
  * read-only) runs alongside the round loop via Promise.all — host + cross-history checks during
  * rounds, feeding hints into the next brief. All domain work goes through the `evo` CLI inside
  * agents — the script itself never touches the filesystem/shell.
@@ -27,7 +27,7 @@
 
  export const meta = {
  name: 'evo-optimize',
- description: 'Deterministic evo tree-search loop over the evo CLI (orient, scan, ideate-on-stall, brief, fan-out, verify, collect).',
+ description: 'evo tree-search loop over the evo CLI (orient, scan, ideate-on-stall, brief, fan-out, verify, collect) with a concurrent meta controller that stops doomed experiments and restructures the workflow (logic flow + prompts) live.',
  phases: [
  { title: 'Orient', detail: 'read state + select frontier parents to extend' },
  { title: 'Scan', detail: 'mandatory parallel cross-cutting scan + structural aggregation (incl. cross-history axis check)' },
@@ -36,7 +36,8 @@ export const meta = {
  { title: 'Optimize', detail: 'parallel optimization subagents (evo new/run)' },
  { title: 'Verify', detail: 'validity audit + benchmark-noise confirm' },
  { title: 'Collect', detail: 'prune dead lineages, record cross-cutting notes' },
- { title: 'Analyst', detail: 'concurrent independent observer (Opus) — host + cross-history checks during rounds' },
+ { title: 'Meta', detail: 'concurrent controller (Opus) — host/cross-history checks, STOP doomed runs, suggest directions, AND restructure the workflow live (logic flow + prompts)' },
+ { title: 'Meta-step', detail: 'extra agent step the meta injected into the round via an inject-step harness edit' },
  ],
  }
 
@@ -65,6 +66,8 @@ const STATE = {
  committedCount: { type: 'integer' }, // total committed experiments (drives the periodic ideator trigger)
  verifyRepeats: { type: 'integer' }, // benchmark noise profile (1 = deterministic, no confirm-loop)
  summary: { type: 'string' }, // short scratchpad summary for subagent context
+ taskSkills: { type: 'array', items: { type: 'string' } }, // category skills a builder should load (e.g. ["finetuning"]) — resolved from config/project, never hardcoded in the workflow
+ knownLearnings: { type: 'array', items: { type: 'string' } }, // durable lessons to apply up front (drained from annotations) so a fresh stateless lane doesn't rediscover them
  },
  }
 
@@ -178,13 +181,59 @@ const PREVERDICT = {
  },
  }
 
- // Analyst tick output: work-quality hints (fed into the next brief) + runtime/host alerts (surfaced).
- const ANALYST_FINDINGS = {
+ // Meta tick output: work-quality hints (fed into the next brief) + runtime/host alerts (surfaced).
+ const META_FINDINGS = {
  type: 'object',
- required: ['briefHints', 'alerts'],
+ required: ['briefHints', 'alerts', 'stops'],
  properties: {
  briefHints: { type: 'array', items: { type: 'string' } },
  alerts: { type: 'array', items: { type: 'string' } },
+ // STOP recommendations for in-flight experiments that are clearly doomed.
+ // A stop is NOT a crash: each carries the diagnosis + a fix so the gated
+ // enforcer can abort, annotate (lesson outlives the worktree), and classify+
+ // preserve — and the fix feeds the next round (and, if general, a skill prior).
+ stops: {
+ type: 'array',
+ items: {
+ type: 'object',
+ required: ['expId', 'failureClass', 'reason', 'fixHint'],
+ properties: {
+ expId: { type: 'string' },
+ failureClass: { enum: ['build', 'eval', 'hypothesis'] },
+ reason: { type: 'string' }, // what's wrong + the concrete evidence
+ fixHint: { type: 'string' }, // what the NEXT experiment must change
+ },
+ },
+ },
+ // FREE-WILL harness edits: the meta agent restructures the WORKFLOW itself, live
+ // (its logic flow + prompts). Applied directly each tick — no allow-list, no caps —
+ // and audited (editLog + run log + returned state). It edits only the search harness;
+ // it never touches the benchmark / grader / verifier (those are not part of the
+ // workflow it controls, so they stay fixed and the score stays comparable).
+ harnessEdits: {
+ type: 'array',
+ items: {
+ type: 'object',
+ required: ['op', 'rationale'],
+ properties: {
+ op: { enum: ['set-knob', 'toggle-phase', 'set-prompt', 'inject-step'] },
+ rationale: { type: 'string' }, // why this edit, with evidence
+ // set-knob — retune the loop's control flow
+ knob: { enum: ['width', 'budget', 'stall', 'ideateEvery', 'ideateStall'] },
+ value: { type: 'number' },
+ // toggle-phase — turn a discretionary phase on/off
+ phaseName: { enum: ['scan', 'ideate'] },
+ enabled: { type: 'boolean' },
+ // set-prompt — edit a prompt the workflow uses (append a directive or replace it wholesale)
+ target: { enum: ['state', 'scan', 'aggregate', 'brief', 'implement', 'run', 'ideator', 'collect'] },
+ mode: { enum: ['append', 'replace'] },
+ text: { type: 'string' },
+ // inject-step — add an extra agent step at a fixed seam each round
+ at: { enum: ['before-scan', 'after-scan', 'before-brief', 'after-collect'] },
+ label: { type: 'string' },
+ },
+ },
+ },
  },
  }
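
To make the `META_FINDINGS` shape concrete, here is a hypothetical tick result together with a minimal structural check. Every value in `exampleFindings` is invented for illustration (including the experiment id format), and `checkFindings` is a hand-rolled sketch that covers only the `required` lists and enums shown in the schema; it is not the validator the workflow host uses.

```javascript
// Hypothetical tick result; every value below is illustrative, not from a real run.
const exampleFindings = {
  briefHints: ['learning-rate axis looks saturated; try data curation'],
  alerts: [],
  stops: [{
    expId: 'exp_0042', // assumed id format
    failureClass: 'build',
    reason: 'training log shows loss is NaN from step 10',
    fixHint: 'lower the learning rate before retrying',
  }],
  harnessEdits: [
    { op: 'set-knob', rationale: 'three rounds without a commit', knob: 'ideateEvery', value: 3 },
  ],
};

// Minimal structural check covering only the `required` lists and enums shown above.
function checkFindings(f) {
  const errors = [];
  for (const key of ['briefHints', 'alerts', 'stops']) {
    if (!Array.isArray(f[key])) errors.push(`missing array: ${key}`);
  }
  for (const s of f.stops || []) {
    for (const key of ['expId', 'failureClass', 'reason', 'fixHint']) {
      if (typeof s[key] !== 'string') errors.push(`stop missing ${key}`);
    }
    if (!['build', 'eval', 'hypothesis'].includes(s.failureClass)) errors.push('bad failureClass');
  }
  for (const e of f.harnessEdits || []) {
    if (!['set-knob', 'toggle-phase', 'set-prompt', 'inject-step'].includes(e.op)) errors.push('bad op');
    if (typeof e.rationale !== 'string') errors.push('edit missing rationale');
  }
  return errors;
}
```

Note that `harnessEdits` is optional in the schema while `stops` is now required, so a tick with nothing to stop still returns an empty `stops` array.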
 
@@ -206,17 +255,17 @@ const LIMIT = Number(A.stall) || 5
  const IDEATE_STALL = Math.max(1, Math.min(3, LIMIT - 1))
  const IDEATE_EVERY_COMMITS = 5 // periodic research cadence (matches prose step 6b)
  const PREVERIFY_MAX = 3 // pre-run verify <-> revise attempts before discarding a rigged edit
- // Concurrent analyst thread (runs alongside the round loop, NOT per-round).
- const ANALYST_ENABLED = true
- const ANALYST_MODEL = 'opus' // the analyst always reasons with Opus (judgment-heavy)
- const ANALYST_INTERVAL_S = 300 // self-pace: observe ~every 5 min, during rounds
- const ANALYST_HOP_S = 15 // the wait is INTERRUPTIBLE in hops of this size: when the optimize loop
- // ends mid-wait it drops a sentinel the analyst polls, so the in-flight
- // tick exits within ~ANALYST_HOP_S instead of stalling the run for the
+ // Concurrent meta thread (runs alongside the round loop, NOT per-round).
+ const META_ENABLED = true
+ const META_MODEL = 'opus' // the meta always reasons with Opus (judgment-heavy)
+ const META_INTERVAL_S = 300 // self-pace: observe ~every 5 min, during rounds
+ const META_HOP_S = 15 // the wait is INTERRUPTIBLE in hops of this size: when the optimize loop
+ // ends mid-wait it drops a sentinel the meta polls, so the in-flight
+ // tick exits within ~META_HOP_S instead of stalling the run for the
  // full interval (the script can't interrupt an agent's `sleep` directly).
- const DONE_SENTINEL = '.evo/.wf_optimize_done' // optimize -> analyst "loop is over" signal (a file,
+ const DONE_SENTINEL = '.evo/.wf_optimize_done' // optimize -> meta "loop is over" signal (a file,
  // since the in-memory `done` flag isn't visible to the agent's process)
- const ANALYST_MAX_FAILS = 3 // consecutive failed ticks before the advisory analyst self-disables
+ const META_MAX_FAILS = 3 // consecutive failed ticks before the advisory meta self-disables
  // (guards against a hot-spin when ticks fail instantly, e.g. a bad schema)
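
The hop-based wait that the `META_HOP_S` comment describes can be modelled as below. This is a synchronous sketch under stated assumptions, not the workflow's code: `waitInHops` is a hypothetical name, `isDone` stands in for "the `DONE_SENTINEL` file exists", and `sleep` stands in for whatever blocking wait the agent actually performs.

```javascript
// Hypothetical, synchronous model of the hop-based wait. `isDone` stands in for
// "the DONE_SENTINEL file exists" and `sleep` for the agent's real sleep call.
function waitInHops(intervalS, hopS, isDone, sleep) {
  let waited = 0;
  while (waited < intervalS) {
    if (isDone()) return { interrupted: true, waited }; // loop ended: exit early
    const step = Math.min(hopS, intervalS - waited);    // never overshoot the interval
    sleep(step);
    waited += step;
  }
  return { interrupted: false, waited };
}
```

With an interval of 300 and a hop of 15, the sentinel is checked before each hop, so an ended loop is noticed after at most one further hop rather than after the full interval.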
  // Experiments per scan agent. Heuristic for the prose "small enough to read in one pass" rule —
  // the workflow can't recursively self-partition like the prose loop, so this is fixed up front.
@@ -281,11 +330,14 @@ function statePrompt() {
  'Read-only. Do NOT edit files or run experiments. Run these and parse their output:',
  '`evo scratchpad`, `evo frontier` (already prints a JSON envelope), `evo status`, `evo awaiting`.',
  'Also read `.evo/project.md` for the metric goal, direction, and the benchmark-determinism line.',
+ 'Also run `evo config get task-skills` (if unset/blank, infer the task category from project.md) and read recent `evo annotations` for durable learnings already found this run.',
  'Return: bestScore + bestExpId; the theoretical ceiling (1.0 for max metric, 0.0 for min)',
  'and direction; the frontier nodes ALREADY ranked by the configured strategy',
  '(id, score, rank) — preserve evo\'s ordering, do not re-rank; the list of evaluated-but-',
  'undecided experiment ids; committedCount (number of committed experiments, from `evo status`);',
  'verifyRepeats (from project.md: 1 if deterministic, 3 if sampling-based / variance-expected);',
+ 'taskSkills (category skills a builder should load, e.g. ["finetuning"] — from `evo config get task-skills`, else inferred from project.md);',
+ 'knownLearnings (short durable lessons from annotations to apply up front: trainer-API quirks, device placement, eval-side config, etc.);',
  'and a 2-3 sentence scratchpad summary for subagent context.',
  ].join(' ')
  }
@@ -305,6 +357,7 @@ function scanBrief(batch) {
  '- Shared failure causes -- root-cause reasons recurring across 2+ experiments (the *why*, not the surface gate name).',
  '- Wall patterns -- approaches or gates multiple experiments consistently fail on.',
  '- Compound-failure standouts -- single experiments hitting multiple failure modes.',
+ '- Axis exhaustion vs fixable plumbing -- read each node\'s `failure_class` (build|eval|hypothesis) from outcome.json. A cluster of `hypothesis` failures across STRUCTURALLY DISTINCT approaches means the axis itself is unpromising (flag it so the next briefs diverge); `build`/`eval` failures are fixable plumbing (recipe/scoring) and must NOT be read as axis exhaustion.',
  '',
  'Evidence must be VERBATIM quotes from outcome.json fields, trace messages, or error text -- not paraphrases. If you cannot cite verbatim evidence for a finding, drop it. Evidence: short quotes (<200 chars), max 3 per finding.',
  'Return JSON only: {"findings":[{"description","experiment_ids":[],"evidence":[]}]}.',
@@ -326,11 +379,15 @@ function aggregatePrompt(ids) {
  '~the same score (a plateau), the bottleneck is not where those hypotheses aimed — emit kind:"axis-warning" whose',
  'label names the saturated axis AND suggests the orthogonal axis (harness, score definition, input data, or a',
  'different mechanism) the next briefs should pivot to. At most one axis-warning.',
+ 'Also read DISCARDED nodes (`evo discards` + their outcome.json `failure_class`): a cluster of 3+ failure_class="hypothesis"',
+ 'discards across STRUCTURALLY DISTINCT approaches is itself an axis-warning (the direction keeps not helping). IGNORE',
+ 'failure_class="build"/"eval" discards for axis purposes — those are fixable plumbing (retry/resume or eval-retest), not',
+ 'evidence the axis is wrong.',
  'Return JSON only.',
  ].join(' ')
  }
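
The discard rule added to the aggregate prompt is applied by an agent reading outcomes, not by code, but its logic can be sketched. In this illustration `isAxisExhausted` is hypothetical, `approach` is an assumed label for an experiment's structural family (only `failure_class` appears in the prompt), and "3+ across structurally distinct approaches" is interpreted as at least three distinct approaches.

```javascript
// Hypothetical check (not part of evo): `approach` is an assumed label for the
// structural family of an experiment; only `failure_class` comes from the prompt.
function isAxisExhausted(discards, minCluster = 3) {
  const approaches = new Set(
    discards
      .filter((d) => d.failure_class === 'hypothesis') // build/eval are fixable plumbing
      .map((d) => d.approach),
  );
  return approaches.size >= minCluster;
}
```

Filtering first means `build` and `eval` discards never contribute to the count, which matches the instruction to ignore them for axis purposes.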
 
- function briefPrompt(state, findings, patterns, parents, ideated, analystHints) {
+ function briefPrompt(state, findings, patterns, parents, ideated, metaHints) {
  return [
  'You are the evo orchestrator\'s brief writer.',
  'State summary:', state.summary || '',
@@ -341,10 +398,10 @@ function briefPrompt(state, findings, patterns, parents, ideated, analystHints)
  ? '\nFRESH IDEATOR PROPOSALS may be available — read `.evo/run_*/ideator/proposals.jsonl` and reconcile BEFORE writing: skip any whose technique was already tried (`evo discards --like "<keyword>"`); score the rest by expected_score_uplift x confidence (frontier_extrapolation > failure_analysis > literature, all else equal); let the top 1-2 become brief objectives, citing the proposal\'s hypothesis/technique. Proposals are advisory — if none beat the in-graph scan findings, ignore them.'
  : '',
  '\nIf the patterns include an "axis-warning", the current axis is saturated — target the ORTHOGONAL axis it names rather than iterating the plateaued one.',
- (analystHints && analystHints.length)
- ? '\nLIVE ANALYST SIGNALS (from the concurrent observer — fold relevant ones into objectives/boundaries, e.g. switch off a saturated axis, avoid a flagged dead direction): ' + JSON.stringify(analystHints)
+ (metaHints && metaHints.length)
+ ? '\nLIVE META SIGNALS (from the concurrent observer — fold relevant ones into objectives/boundaries, e.g. switch off a saturated axis, avoid a flagged dead direction): ' + JSON.stringify(metaHints)
  : '',
- `\nWrite up to ${WIDTH} briefs (use the full round width of ${WIDTH} whenever you can find that many genuinely DISTINCT objectives — multiple briefs MAY branch from the SAME parent when fewer than ${WIDTH} frontier parents exist, as long as each attacks a different surface; do not pad with redundant briefs). One per subagent, each with four fields:`,
+ `\nWrite up to ${harness.width} briefs (use the full round width of ${harness.width} whenever you can find that many genuinely DISTINCT objectives — multiple briefs MAY branch from the SAME parent when fewer than ${harness.width} frontier parents exist, as long as each attacks a different surface; do not pad with redundant briefs). One per subagent, each with four fields:`,
  '1. objective -- one sentence naming WHERE in system behavior the gain hides, with evidence; NO file/function/edit names.',
  '2. parent -- which experiment id to branch from (choose from the selected parents).',
  '3. boundaries -- what NOT to try and why (discarded approaches, gates not to regress, what adjacent briefs this round do).',
@@ -356,9 +413,24 @@ function briefPrompt(state, findings, patterns, parents, ideated, analystHints)
  }
 
  // IMPLEMENT — allocate + edit, but do NOT run (a pre-run verifier audits the edit first).
+ // Context capsule shared by every builder/runner lane: which category skills to load on demand,
+ // and the durable learnings to apply up front — so a fresh stateless agent inherits the priors and
+ // hard-won lessons a prose single-subagent would have had, instead of rediscovering them.
+ function capsuleLines(state) {
+ const skills = (state && state.taskSkills) || []
+ const learnings = (state && state.knownLearnings) || []
+ const lines = []
+ lines.push(skills.length
+ ? `Task-category skills — load IN FULL via your host skill loader if the work needs them (they carry this category's priors, recipes, and pre-run checks): ${skills.join(', ')}.`
+ : "Identify the task category from `.evo/project.md` and load the matching evo skill (e.g. evo:finetuning for a training move) IN FULL before you build — it carries this category's priors and pre-run checks.")
+ if (learnings.length) lines.push(`KNOWN LEARNINGS to apply before acting (already found this run — do not rediscover): ${JSON.stringify(learnings)}.`)
+ return lines
+ }
+
  function implementPrompt(brief, parent, state) {
  return [
  `First, load and follow the evo subagent skill: Read ${pr}/skills/subagent/SKILL.md IN FULL and follow it as your operating protocol — do not skip it even if the brief looks simple.`,
+ ...capsuleLines(state),
  `Allocate your experiment via \`evo new --parent ${parent}\`, then edit inside the returned worktree to implement the brief.`,
  'IMPORTANT: do NOT run `evo run` yet — a pre-run verifier audits your change first. Stop once the edit is complete.',
  'Do NOT edit benchmark, gate, or framework code; do NOT weaken/bypass any gate.',
@@ -397,12 +469,14 @@ function revisePrompt(expId, worktree, findings) {
  }
 
  // RUN — evaluate + commit the (pre-verified) experiment.
- function runPrompt(expId) {
+ function runPrompt(expId, state) {
  return [
  `First, load the evo subagent skill: Read ${pr}/skills/subagent/SKILL.md IN FULL and follow its run protocol (it covers \`evo run ${expId} --check\` for non-committing wiring validation that does not consume the attempt budget).`,
- `CRITICAL ordering: if this experiment produces its artifact through a build or training step (e.g. a finetune that writes final_model/), run that step to COMPLETION and confirm the artifact exists BEFORE the real run. Never call \`evo run\` while training is still in flight or before final_model/ exists — evaluating a not-yet-produced model is the "final_model not found" failure and it wastes the attempt. If the experiment trains, the parent checkpoint is in EVO_PARENT_POLICY (warm-start from it; do not retrain from base).`,
+ ...capsuleLines(state),
+ `CRITICAL ordering: if this experiment produces an output artifact through a build or training step (whatever your recipe declares — a checkpoint dir, adapter, merged model, index, etc.), run that step to COMPLETION and confirm the artifact exists BEFORE the real run. Never call \`evo run\` while that step is still in flight or before its output exists — evaluating a not-yet-produced artifact wastes the attempt. If the experiment warm-starts, the parent's reusable artifact is in EVO_PARENT_POLICY (start from it; do not redo from scratch).`,
  `Then run \`evo run ${expId}\` to evaluate and (if it improves and passes gates) commit it.`,
  'If it exits GATE_FAILED, do not fight the gate — report status=evaluated.',
+ 'If `evo run` is terminated externally mid-flight (the concurrent meta can STOP a doomed experiment — it aborts the run and discards this node with a diagnosis), do NOT retry: report status:none and stop. The diagnosis is already recorded as an annotation and will steer the next brief.',
  `Return: expIds:["${expId}"]; status (committed|evaluated|failed|none); committedImprover = true ONLY if evo printed COMMITTED;`,
  'bestExpId + bestScore (required when committedImprover is true); any gates added; learnings.',
  ].join(' ')
@@ -450,16 +524,16 @@ function collectPrompt(results, round) {
450
524
  ].join(' ')
451
525
  }
452
526
 
453
- // One analyst tick (a FRESH Opus agent each call — no memory across ticks, so `reported` carries
527
+ // One meta tick (a FRESH Opus agent each call — no memory across ticks, so `reported` carries
454
528
  // the dedup state in the loop's closure). Read-only: observes host + cross-history signals DURING
455
529
  // rounds, returns work-quality briefHints (folded into the next brief) + runtime alerts (surfaced).
456
- function analystPrompt(ctx, intervalS, reported) {
530
+ function metaPrompt(ctx, intervalS, reported) {
457
531
  return [
458
- 'You are the evo ANALYST — an independent observer running CONCURRENTLY with the optimize loop.',
459
- 'Read-only: do NOT edit code, run experiments, or mutate evo state.',
532
+ 'You are the evo META agent — an independent controller running CONCURRENTLY with the optimize loop.',
533
+ 'You do NOT edit experiment code, run experiments, or touch the benchmark/grader/verifier. But you DO shape the optimize WORKFLOW: stop doomed experiments, suggest next directions (briefHints), AND restructure the running workflow itself — its logic flow + prompts — via harnessEdits (your distinctive power, detailed below).',
460
534
  `FIRST pace yourself with an INTERRUPTIBLE wait, so you stop promptly when the optimize loop ends. Run this single Bash command with a tool timeout of at least ${(intervalS + 30) * 1000} ms:`,
461
- ` \`if [ -f ${DONE_SENTINEL} ]; then echo OPTIMIZE_DONE; else for i in $(seq 1 ${Math.ceil(intervalS / ANALYST_HOP_S)}); do sleep ${ANALYST_HOP_S}; [ -f ${DONE_SENTINEL} ] && { echo OPTIMIZE_DONE; break; }; done; fi\``,
462
- `If that prints OPTIMIZE_DONE, the optimize loop has finished — return {"briefHints":[],"alerts":[]} immediately WITHOUT gathering any signals. Otherwise the full interval elapsed: now gather signals and report.`,
535
+ ` \`if [ -f ${DONE_SENTINEL} ]; then echo OPTIMIZE_DONE; else for i in $(seq 1 ${Math.ceil(intervalS / META_HOP_S)}); do sleep ${META_HOP_S}; [ -f ${DONE_SENTINEL} ] && { echo OPTIMIZE_DONE; break; }; done; fi\``,
536
+ `If that prints OPTIMIZE_DONE, the optimize loop has finished — return {"briefHints":[],"alerts":[],"stops":[],"harnessEdits":[]} immediately WITHOUT gathering any signals. Otherwise the full interval elapsed: now gather signals and report.`,
463
537
  `Current loop state: round=${ctx.round}, stall=${ctx.stall}/${LIMIT}, best=${ctx.bestScore}.`,
464
538
  `Already reported (do NOT repeat — only emit findings NEW since these): ${JSON.stringify(reported || [])}.`,
465
539
  'Walk these checks (skip any whose inputs are unavailable; cite evidence; nothing speculative):',
@@ -468,7 +542,13 @@ function analystPrompt(ctx, intervalS, reported) {
468
542
  '- Stuck experiment / time-budget overrun: from `evo status`/`evo show`, an experiment active far longer than its peers, or a round overrunning the others. ALERT.',
469
543
  '- Stuck axis: from `evo tree`, 3+ structurally-distinct committed hypotheses plateaued at ~the same score → name the saturated axis + one orthogonal axis. BRIEF HINT.',
470
544
  '- Dead direction / ignored mechanism: annotations repeatedly naming a mechanism the recent work ignores, or a direction that keeps regressing. BRIEF HINT.',
471
- 'Return {briefHints:[...], alerts:[...]}. briefHints feed the NEXT round\'s briefs (work-quality redirections); alerts surface to the user (runtime/host issues). Empty arrays are fine — most ticks should be quiet.',
545
+ '- Heading toward failure (STOP): an in-flight experiment that is CLEARLY doomed or wasting the budget — a divergent / NaN / flatlined progress metric; projected completion beyond the remaining time budget; or a known-fatal signature (e.g. output the scorer cannot parse; a silent resource mis-placement that tanks throughput with no error; a corrupt input/format that invalidates the result). HIGH PRECISION ONLY: default to NOT stopping — recommend a STOP only with concrete evidence that finishing is wasted, and only for an experiment still `active`. Emit a stop with: expId; failureClass (build = the build/produce step is broken; eval = artifact is fine but scoring/serving is wrong; hypothesis = it runs but won\'t help); reason (the diagnosis + the evidence you saw); fixHint (what the NEXT experiment must change).',
546
+ 'For STOPs you stay READ-ONLY: do NOT run `evo abort` / `evo discard` yourself. A gated enforcer acts on each stop — it aborts the run + its subprocess tree, annotates your diagnosis (so it outlives the worktree and feeds the next round), and discards with the failureClass so the partial artifact is preserved. A STOP is a diagnosed, recoverable stop, never a silent kill.',
547
+ '',
548
+ 'HARNESS CONTROL (your distinctive power): you may restructure the optimize workflow itself, live, when you judge it will help — edits apply directly (free will) and take effect next round. Current harness state: ' + JSON.stringify(harnessSummary()) + '.',
549
+ 'harnessEdits ops: (1) set-knob {knob: width|budget|stall|ideateEvery|ideateStall, value} — retune the loop (widen the round, deepen branches, change the stall limit or ideation cadence). (2) toggle-phase {phaseName: scan|ideate, enabled} — turn a phase off/on (e.g. skip scan when traces are uninformative; force ideation early). (3) set-prompt {target: state|scan|aggregate|brief|implement|run|ideator|collect, mode: append|replace, text} — edit the prompt that step uses (append a directive, or replace it wholesale). (4) inject-step {at: before-scan|after-scan|before-brief|after-collect, text, label} — add an extra agent step at that seam each round. Every edit needs a rationale citing the evidence.',
550
+ 'HARD CONSTRAINT: edit ONLY the search harness above. NEVER propose edits to the benchmark, grader, scorer, held-out test, or any gate — those define how results are judged; if you change them the score stops meaning anything. Emit harnessEdits ONLY with concrete evidence the current workflow SHAPE is the bottleneck; most ticks should emit none.',
551
+ 'Return {briefHints:[...], alerts:[...], stops:[...], harnessEdits:[...]}. briefHints feed the NEXT round\'s briefs; alerts surface to the user; each stop triggers the gated enforcer; each harnessEdit is applied directly to the running workflow. All-empty is fine — most ticks should be quiet.',
472
552
  ].join('\n')
473
553
  }
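As a rough illustration of the return contract the meta prompt above describes, the sketch below shows one non-empty tick and a defensive normalizer. The field names (`briefHints`, `alerts`, `stops`, `harnessEdits`, and the per-op fields) are taken from the prompt text; the `normalizeTick` helper and every sample value are hypothetical and are not part of the package.

```javascript
// Hypothetical helper (not in the package): coerce a meta tick into the
// four-array shape the prompt asks for, replacing anything that is not an
// array with an empty array.
function normalizeTick(tick) {
  const asArray = (v) => (Array.isArray(v) ? v : [])
  const t = tick || {}
  return {
    briefHints: asArray(t.briefHints),
    alerts: asArray(t.alerts),
    stops: asArray(t.stops),
    harnessEdits: asArray(t.harnessEdits),
  }
}

// Sample values are invented for illustration only.
const sampleTick = {
  briefHints: ['Learning-rate axis looks saturated; try data mixing instead.'],
  alerts: [],
  stops: [{
    expId: 'exp_0007',
    failureClass: 'build', // one of: build | eval | hypothesis
    reason: 'training loss has been NaN since step 40',
    fixHint: 'lower the learning rate and enable gradient clipping',
  }],
  harnessEdits: [
    { op: 'set-knob', knob: 'width', value: 4, rationale: 'frontier has 4 distinct parents' },
    { op: 'toggle-phase', phaseName: 'scan', enabled: false, rationale: 'traces are uninformative' },
  ],
}

const quiet = normalizeTick(null)       // a failed or empty tick
const busy = normalizeTick(sampleTick)  // a tick with findings
console.log(quiet, busy.stops.length, busy.harnessEdits.length)
```

A normalizer like this is only one possible design; the loop shown later in this diff instead guards each field inline with `tick.x || []`.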
474
554
 
@@ -476,11 +556,21 @@ function analystPrompt(ctx, intervalS, reported) {
476
556
  // iteration budget (deepening the branch each time a committed improver lands). The independent
477
557
  // evo:verifier gates EACH run for design-time cheating BEFORE the experiment is evaluated; its
478
558
  // findings are fed back to a revise agent on the same experiment until it passes or is discarded.
559
+ //
560
+ // Lane decomposition (decompose only at CONTEXT SEAMS): build+eval are a SINGLE agent — `run`
561
+ // produces the artifact and evaluates it end-to-end (one coherent context, no handoff mid-build).
562
+ // The only split is `implement` (write the edit) vs `run`, separated by the read-only evo:verifier
563
+ // seam — a genuinely different concern (adversarial diff audit, different agentType/model) that has
564
+ // to interpose between the edit and the expensive run. The two share state by REFERENCE (the
565
+ // worktree on disk), not by passing a context window, and BOTH receive the capsule (category skills
566
+ // + known learnings via capsuleLines), so neither reverts to base-model defaults. Merging implement
567
+ // into run would erase the verifier gate for no real gain, since the code already lives in the
568
+ // worktree the run agent reads.
479
569
  async function runBrief(brief, state) {
480
570
  let parent = brief.parent
481
571
  let best = null
482
- for (let depth = 0; depth < ITER; depth++) {
483
- const impl = await agent(implementPrompt(brief, parent, state), {
572
+ for (let depth = 0; depth < harness.budget; depth++) {
573
+ const impl = await agent(withHarnessPrompt('implement', implementPrompt(brief, parent, state)), {
484
574
  schema: IMPL_RESULT, model: brief.hard ? 'opus' : 'sonnet', phase: 'Optimize', label: `impl:${parent}#${depth}`,
485
575
  })
486
576
  if (!impl || !impl.expId) break
@@ -504,7 +594,7 @@ async function runBrief(brief, state) {
504
594
  }
505
595
 
506
596
  // run -> evaluate + commit
507
- const r = await agent(runPrompt(impl.expId), { schema: SUBAGENT_RESULT, phase: 'Optimize', label: `run:${impl.expId}` })
597
+ const r = await agent(withHarnessPrompt('run', runPrompt(impl.expId, state)), { schema: SUBAGENT_RESULT, phase: 'Optimize', label: `run:${impl.expId}` })
508
598
  if (!r) break
509
599
 
510
600
  // post-run validity audit (evo:verifier, post-phase) on committed improvers
@@ -539,51 +629,131 @@ let stall = 0
539
629
  let round = 0
540
630
  let lastIdeatedCommit = 0 // committedCount at the last ideator dispatch (periodic cadence)
541
631
  let ideatedThisStall = false // fire ideators once per stall episode, not every stalled round
542
- let lastBestScore = null // latest best score, surfaced to the concurrent analyst thread
543
- let done = false // set when the optimize loop ends -> stops the analyst thread
544
- const analystSignals = [] // briefHints the analyst pushes; drained into the next round's brief
632
+ let lastBestScore = null // latest best score, surfaced to the concurrent meta thread
633
+ let done = false // set when the optimize loop ends -> stops the meta thread
634
+ const metaSignals = [] // briefHints the meta pushes; drained into the next round's brief
545
635
 
546
- log(`evo-optimize start: subagents=${WIDTH} budget=${ITER} stall=${LIMIT} analyst=${ANALYST_ENABLED ? ANALYST_MODEL : 'off'} | argsType=${typeof args} A.subagents=${A.subagents} A.budget=${A.budget} A.stall=${A.stall}`)
636
+ // ---------------------------------------------------------------------------
637
+ // Mutable HARNESS (the round-plan + prompts the meta agent edits live, free-will).
638
+ // Initialized to the resolved defaults, so a run where the meta emits no edits behaves
639
+ // byte-identically to before. The optimize loop READS this object each round; the meta
640
+ // thread WRITES it via applyHarnessEdit. Single-threaded event loop => edits applied in a
641
+ // meta tick land between optimize-loop awaits and take effect at the next read (next round).
642
+ // Every edit is audited: harness.editLog + a run-log line + the workflow's return value.
643
+ // Scope boundary: this controls the SEARCH harness only — never the grader/verifier.
644
+ // ---------------------------------------------------------------------------
645
+ const harness = {
646
+ width: WIDTH,
647
+ budget: ITER,
648
+ stall: LIMIT,
649
+ ideateEvery: IDEATE_EVERY_COMMITS,
650
+ ideateStall: IDEATE_STALL,
651
+ phases: { scan: true, ideate: true },
652
+ prompts: {}, // target -> { mode: 'append'|'replace', text }
653
+ injectedSteps: [], // { at, prompt, label }
654
+ editLog: [], // audit trail: { round, op, ...spec, rationale }
655
+ }
656
+
657
+ // Apply a meta prompt override (append a directive, or replace wholesale) to a base prompt.
658
+ function withHarnessPrompt(target, baseText) {
659
+ const o = harness.prompts[target]
660
+ if (!o || !o.text) return baseText
661
+ return o.mode === 'replace'
662
+ ? o.text
663
+ : baseText + '\n\n[META-ADDED DIRECTIVE — injected live by the meta agent]: ' + o.text
664
+ }
547
665
 
548
- // The optimize round loop (runs concurrently with analystLoop via Promise.all).
666
+ // Run any meta-injected extra steps registered at a given seam (inject-step op).
667
+ async function runInjected(at, ctxLabel) {
668
+ for (const s of harness.injectedSteps.filter((x) => x.at === at)) {
669
+ try {
670
+ await agent(s.prompt, { phase: 'Meta-step', label: s.label || `injected:${at}:${ctxLabel}` })
671
+ } catch (e) {
672
+ log(`META injected step (${at}) errored (ignored): ${(e && e.message) || e}`)
673
+ }
674
+ }
675
+ }
676
+
677
+ // Apply ONE harness edit with free will (no validation gate, no caps) — then audit it.
678
+ function applyHarnessEdit(e, atRound) {
679
+ if (!e || !e.op) return
680
+ const rec = { round: atRound, op: e.op, rationale: e.rationale || '' }
681
+ if (e.op === 'set-knob' && e.knob && typeof e.value === 'number') {
682
+ harness[e.knob] = e.value; rec.knob = e.knob; rec.value = e.value
683
+ } else if (e.op === 'toggle-phase' && e.phaseName) {
684
+ harness.phases[e.phaseName] = e.enabled !== false; rec.phaseName = e.phaseName; rec.enabled = harness.phases[e.phaseName]
685
+ } else if (e.op === 'set-prompt' && e.target && e.text) {
686
+ harness.prompts[e.target] = { mode: e.mode === 'replace' ? 'replace' : 'append', text: e.text }
687
+ rec.target = e.target; rec.mode = harness.prompts[e.target].mode
688
+ } else if (e.op === 'inject-step' && e.at && e.text) {
689
+ harness.injectedSteps.push({ at: e.at, prompt: e.text, label: e.label || `meta:${e.at}` }); rec.at = e.at; rec.label = e.label || `meta:${e.at}`
690
+ } else {
691
+ log(`META harness edit IGNORED (incomplete spec for op=${e.op}): ${JSON.stringify(e)}`); return
692
+ }
693
+ harness.editLog.push(rec)
694
+ log(`META HARNESS EDIT [r${atRound}] ${JSON.stringify(rec)}`)
695
+ }
696
+
697
+ function harnessSummary() {
698
+ return {
699
+ width: harness.width, budget: harness.budget, stall: harness.stall,
700
+ ideateEvery: harness.ideateEvery, ideateStall: harness.ideateStall,
701
+ phases: harness.phases,
702
+ promptsOverridden: Object.entries(harness.prompts).map(([k, v]) => `${k}:${v.mode}`),
703
+ injectedSteps: harness.injectedSteps.map((s) => `${s.at}:${s.label}`),
704
+ edits: harness.editLog.length,
705
+ }
706
+ }
707
+
708
+ log(`evo-optimize start: subagents=${WIDTH} budget=${ITER} stall=${LIMIT} meta=${META_ENABLED ? META_MODEL : 'off'} | argsType=${typeof args} A.subagents=${A.subagents} A.budget=${A.budget} A.stall=${A.stall}`)
709
+
710
+ // The optimize round loop (runs concurrently with metaLoop via Promise.all).
549
711
  async function optimizeLoop() {
550
- while (stall < LIMIT) {
712
+ while (stall < harness.stall) {
551
713
  round += 1
552
714
 
553
715
  phase('Orient')
554
- const state = await agent(statePrompt(), { schema: STATE, agentType: 'Explore', model: 'sonnet', phase: 'Orient', label: `state:r${round}` })
716
+ await runInjected('before-scan', `r${round}`) // meta seam (pre-orient/scan)
717
+ const state = await agent(withHarnessPrompt('state', statePrompt()), { schema: STATE, agentType: 'Explore', model: 'sonnet', phase: 'Orient', label: `state:r${round}` })
555
718
  lastBestScore = state.bestScore
556
719
  if (state.bestScore === state.ceiling) { log(`ceiling reached (best=${state.bestScore}) — stopping`); break }
557
- const parents = (state.frontier || []).slice(0, WIDTH)
720
+ const parents = (state.frontier || []).slice(0, harness.width)
558
721
  if (parents.length === 0) { log('no explorable frontier nodes — stopping'); break }
559
722
 
560
- // N1 + N1.5 — mandatory parallel scan + structural aggregation (barrier). Scan runs EVERY round
561
- // (hard rule); when there are no evaluated-undecided nodes yet (round 1) it falls back to the
562
- // committed frontier so at least one scan agent still runs before briefs.
563
- phase('Scan')
564
- const evaluatedIds = state.evaluatedIds || []
565
- const frontierIds = (state.frontier || []).map((f) => f.id).filter(Boolean)
566
- const scanTargets = evaluatedIds.length ? evaluatedIds : frontierIds
567
- const batches = chunk(scanTargets, SCAN_BATCH)
568
- const scanThunks = batches.map((b) => () => agent(scanBrief(b), { schema: FINDINGS, agentType: 'Explore', phase: 'Scan', label: `scan ${b.length}: ${batchLabel(b)}` }))
569
- const aggregateIds = [...new Set([...evaluatedIds, ...frontierIds])]
570
- const aggThunk = aggregateIds.length
571
- ? [() => agent(aggregatePrompt(aggregateIds), { schema: PATTERNS, agentType: 'Explore', phase: 'Scan', label: 'aggregate' })]
572
- : []
573
- const scanResults = (await parallel([...scanThunks, ...aggThunk])).filter(Boolean)
574
- const findings = scanResults.flatMap((r) => (r && r.findings) ? r.findings : [])
575
- const patterns = scanResults.flatMap((r) => (r && r.patterns) ? r.patterns : [])
576
-
577
- // N1.7 — research escalation (6b): on stall (before the hard limit) or every ~5 commits, fire the
578
- // three ideators in parallel. parallel() blocks until all return (proposals land before briefing).
723
+ // N1 + N1.5 — parallel scan + structural aggregation (barrier). Scan normally runs EVERY round
724
+ // (hard rule), but the meta agent MAY disable it via a toggle-phase edit (free will) — when off,
725
+ // the round briefs from prior signals only. Round 1 falls back to the committed frontier.
726
+ let findings = []
727
+ let patterns = []
728
+ if (harness.phases.scan) {
729
+ phase('Scan')
730
+ const evaluatedIds = state.evaluatedIds || []
731
+ const frontierIds = (state.frontier || []).map((f) => f.id).filter(Boolean)
732
+ const scanTargets = evaluatedIds.length ? evaluatedIds : frontierIds
733
+ const batches = chunk(scanTargets, SCAN_BATCH)
734
+ const scanThunks = batches.map((b) => () => agent(withHarnessPrompt('scan', scanBrief(b)), { schema: FINDINGS, agentType: 'Explore', phase: 'Scan', label: `scan ${b.length}: ${batchLabel(b)}` }))
735
+ const aggregateIds = [...new Set([...evaluatedIds, ...frontierIds])]
736
+ const aggThunk = aggregateIds.length
737
+ ? [() => agent(withHarnessPrompt('aggregate', aggregatePrompt(aggregateIds)), { schema: PATTERNS, agentType: 'Explore', phase: 'Scan', label: 'aggregate' })]
738
+ : []
739
+ const scanResults = (await parallel([...scanThunks, ...aggThunk])).filter(Boolean)
740
+ findings = scanResults.flatMap((r) => (r && r.findings) ? r.findings : [])
741
+ patterns = scanResults.flatMap((r) => (r && r.patterns) ? r.patterns : [])
742
+ } else {
743
+ log('scan phase disabled by meta — briefing from prior signals only')
744
+ }
745
+ await runInjected('after-scan', `r${round}`)
746
+
747
+ // N1.7 — research escalation (6b): on stall (before the hard limit) or every ~N commits, fire the
748
+ // three ideators in parallel. Gated by harness.phases.ideate + the harness cadence knobs (meta-tunable).
579
749
  const commits = Number(state.committedCount) || 0
580
- const stalledTrigger = stall >= IDEATE_STALL && !ideatedThisStall
581
- const periodicTrigger = commits - lastIdeatedCommit >= IDEATE_EVERY_COMMITS
750
+ const stalledTrigger = stall >= harness.ideateStall && !ideatedThisStall
751
+ const periodicTrigger = commits - lastIdeatedCommit >= harness.ideateEvery
582
752
  let ideated = false
583
- if (stalledTrigger || periodicTrigger) {
753
+ if (harness.phases.ideate && (stalledTrigger || periodicTrigger)) {
584
754
  phase('Ideate')
585
755
  await parallel(['frontier_extrapolation', 'failure_analysis', 'literature'].map((b) => () =>
586
- agent(ideatorPrompt(b), { agentType: 'evo:ideator', phase: 'Ideate', label: `ideate:${b}` })))
756
+ agent(withHarnessPrompt('ideator', ideatorPrompt(b)), { agentType: 'evo:ideator', phase: 'Ideate', label: `ideate:${b}` })))
587
757
  lastIdeatedCommit = commits
588
758
  if (stalledTrigger) ideatedThisStall = true
589
759
  ideated = true
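The harness edit path in this hunk can be exercised in isolation. The sketch below re-creates `withHarnessPrompt` and `applyHarnessEdit` as standalone functions, assuming a hypothetical `makeHarness` factory, made-up default knob values, a shortened directive marker, and a boolean return value in place of the original's log lines. It is meant to show the append/replace and ignore-on-incomplete-spec behaviour, not to stand in for the package code.

```javascript
// Standalone re-creation of the harness edit logic shown in the hunk above.
// makeHarness, the default values, and the boolean return are assumptions.
function makeHarness() {
  const harness = {
    width: 3, budget: 2, stall: 5, ideateEvery: 5, ideateStall: 2,
    phases: { scan: true, ideate: true },
    prompts: {},        // target -> { mode, text }
    injectedSteps: [],  // { at, prompt, label }
    editLog: [],        // audit trail
  }

  // Append a directive to the base prompt, or replace the prompt wholesale.
  function withHarnessPrompt(target, baseText) {
    const o = harness.prompts[target]
    if (!o || !o.text) return baseText
    return o.mode === 'replace' ? o.text : baseText + '\n\n[META-ADDED DIRECTIVE]: ' + o.text
  }

  // Apply one edit; returns false when the spec is missing or incomplete.
  function applyHarnessEdit(e, atRound) {
    if (!e || !e.op) return false
    const rec = { round: atRound, op: e.op, rationale: e.rationale || '' }
    if (e.op === 'set-knob' && e.knob && typeof e.value === 'number') {
      harness[e.knob] = e.value; rec.knob = e.knob; rec.value = e.value
    } else if (e.op === 'toggle-phase' && e.phaseName) {
      harness.phases[e.phaseName] = e.enabled !== false
      rec.phaseName = e.phaseName; rec.enabled = harness.phases[e.phaseName]
    } else if (e.op === 'set-prompt' && e.target && e.text) {
      harness.prompts[e.target] = { mode: e.mode === 'replace' ? 'replace' : 'append', text: e.text }
      rec.target = e.target; rec.mode = harness.prompts[e.target].mode
    } else if (e.op === 'inject-step' && e.at && e.text) {
      const label = e.label || `meta:${e.at}`
      harness.injectedSteps.push({ at: e.at, prompt: e.text, label }); rec.at = e.at; rec.label = label
    } else {
      return false
    }
    harness.editLog.push(rec)
    return true
  }

  return { harness, withHarnessPrompt, applyHarnessEdit }
}

const h = makeHarness()
h.applyHarnessEdit({ op: 'set-knob', knob: 'width', value: 5, rationale: 'wide frontier' }, 2)
h.applyHarnessEdit({ op: 'set-prompt', target: 'brief', text: 'Prefer orthogonal axes.' }, 2)
const ignored = h.applyHarnessEdit({ op: 'set-knob', knob: 'width', value: '5' }, 2) // string value is rejected
const briefText = h.withHarnessPrompt('brief', 'BASE')
console.log(h.harness.width, JSON.stringify(briefText), ignored)
```

One thing the sketch makes visible: as in the hunk, `set-knob` writes `harness[e.knob]` without checking the knob name against the five knobs listed in the meta prompt, so the name is trusted as given.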
@@ -591,10 +761,11 @@ async function optimizeLoop() {
591
761
  }
592
762
 
593
763
  // N2 — brief writer: reconciles ideator proposals (6c), acts on axis-warning, and folds in any
594
- // live analyst hints accumulated since the last round; JS diversity dedupe afterwards.
764
+ // live meta hints accumulated since the last round; JS diversity dedupe afterwards.
765
+ await runInjected('before-brief', `r${round}`)
595
766
  phase('Brief')
596
- const analystHints = analystSignals.splice(0)
597
- const briefOut = await agent(briefPrompt(state, findings, patterns, parents, ideated, analystHints), { schema: BRIEFS, phase: 'Brief', label: `briefs:r${round}` })
767
+ const metaHints = metaSignals.splice(0)
768
+ const briefOut = await agent(withHarnessPrompt('brief', briefPrompt(state, findings, patterns, parents, ideated, metaHints)), { schema: BRIEFS, phase: 'Brief', label: `briefs:r${round}` })
598
769
  const briefs = dedupeBriefs((briefOut && briefOut.briefs) || [])
599
770
  if (briefs.length === 0) { log('no briefs produced — stopping'); break }
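The ideation gating earlier in `optimizeLoop` reduces to a small predicate over the loop counters and the harness knobs. A minimal sketch, assuming a hypothetical `shouldIdeate` helper and invented counter values (the real loop keeps these as closure variables rather than passing an object):

```javascript
// Hypothetical helper: mirrors the stalledTrigger / periodicTrigger logic
// from the loop, with loop state passed in explicitly.
function shouldIdeate(s, harness) {
  const stalledTrigger = s.stall >= harness.ideateStall && !s.ideatedThisStall
  const periodicTrigger = s.commits - s.lastIdeatedCommit >= harness.ideateEvery
  const fire = harness.phases.ideate && (stalledTrigger || periodicTrigger)
  return { stalledTrigger, periodicTrigger, fire }
}

// Invented knob values for illustration.
const knobs = { ideateStall: 2, ideateEvery: 5, phases: { ideate: true } }

// Stalled for 2 rounds, ideators not yet fired in this stall episode.
const onStall = shouldIdeate({ stall: 2, ideatedThisStall: false, commits: 3, lastIdeatedCommit: 0 }, knobs)

// Not stalled, but 5 commits have landed since the last ideator dispatch.
const onCadence = shouldIdeate({ stall: 0, ideatedThisStall: false, commits: 5, lastIdeatedCommit: 0 }, knobs)

// Same stall, but the meta agent has toggled the ideate phase off.
const phaseOff = shouldIdeate(
  { stall: 2, ideatedThisStall: false, commits: 3, lastIdeatedCommit: 0 },
  { ...knobs, phases: { ideate: false } },
)
console.log(onStall.fire, onCadence.fire, phaseOff.fire)
```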
600
771
 
@@ -603,7 +774,8 @@ async function optimizeLoop() {
603
774
 
604
775
  // N5 — collect: prune dead lineages, record notes.
605
776
  phase('Collect')
606
- await agent(collectPrompt(results, round), { phase: 'Collect', label: `collect:r${round}` })
777
+ await agent(withHarnessPrompt('collect', collectPrompt(results, round)), { phase: 'Collect', label: `collect:r${round}` })
778
+ await runInjected('after-collect', `r${round}`)
607
779
 
608
780
  // Loop control: stall resets only when this round produced a VERIFIED committed score that beats
609
781
  // the PRIOR BEST in the metric direction (a beat-its-own-parent commit is branch progress, not a
@@ -619,56 +791,95 @@ async function optimizeLoop() {
619
791
  log(`round ${round}: improved=${improved} roundBest=${roundBest} prevBest=${state.bestScore} stall=${stall}/${LIMIT} spent=${budget.spent()}`)
620
792
  }
621
793
  done = true
622
- // Wake any in-flight analyst tick now (its `sleep` can't see the in-memory `done`): the sentinel
623
- // makes the tick's interruptible wait exit within ~ANALYST_HOP_S instead of running the full interval.
624
- if (ANALYST_ENABLED) await agent(`mkdir -p .evo && : > ${DONE_SENTINEL} && echo signalled`, { phase: 'Collect', label: 'signal:optimize-done' })
794
+ // Wake any in-flight meta tick now (its `sleep` can't see the in-memory `done`): the sentinel
795
+ // makes the tick's interruptible wait exit within ~META_HOP_S instead of running the full interval.
796
+ if (META_ENABLED) await agent(`mkdir -p .evo && : > ${DONE_SENTINEL} && echo signalled`, { phase: 'Collect', label: 'signal:optimize-done' })
625
797
  log(`optimize loop finished after ${round} round(s), final stall=${stall}/${LIMIT}`)
626
798
  return { rounds: round, finalStall: stall }
627
799
  }
628
800
 
629
- // Concurrent analyst thread (P1-sliver/P2-P5/P7): an independent, self-paced Opus observer that runs
801
+ // Concurrent meta thread (P1-sliver/P2-P5/P7): an independent, self-paced Opus observer that runs
630
802
  // DURING rounds (not per-round). Each tick is a FRESH agent (no cross-tick memory), so `reported`
631
- // holds the dedup state in this closure. Work-quality findings -> analystSignals (next brief);
803
+ // holds the dedup state in this closure. Work-quality findings -> metaSignals (next brief);
632
804
  // runtime/host alerts -> the run log. Stops when optimizeLoop sets `done`.
633
- async function analystLoop() {
634
- if (!ANALYST_ENABLED) return
805
+ // Gated ENFORCER for a meta STOP: detect (meta) and act (this agent) stay separate. Verifies
806
+ // the experiment is still active, then aborts its run (driver + subprocess tree), annotates the
807
+ // diagnosis (survives the worktree + feeds the next round via knownLearnings), and discards with the
808
+ // failure class so the partial artifact is preserved + classified. A STOP is a diagnosed, recoverable
809
+ // stop — never a silent kill.
810
+ function enforceStopPrompt(s) {
811
+ return [
812
+ `A concurrent meta flagged experiment ${s.expId} as heading toward failure and recommends STOPPING it. You are the gated ENFORCER — read-only except for the three evo commands below; do NOT edit code or run training.`,
813
+ `First VERIFY: run \`evo show ${s.expId}\`. Only proceed if its status is still \`active\`. If it is committed / evaluated / discarded / not found, do NOTHING and report skipped (it already resolved).`,
814
+ `If still active, run in order:`,
815
+ ` 1. \`evo abort ${s.expId}\` — stop the evo run driver and its subprocess tree.`,
816
+ ` 2. annotate the diagnosis so it outlives the worktree and feeds the next round: \`evo annotate ${s.expId} "STOPPED (${s.failureClass}): ${s.reason} | FIX: ${s.fixHint}"\` (quote carefully).`,
817
+ ` 3. classify + preserve: \`evo discard ${s.expId} --force --failure-class ${s.failureClass} --reason "meta stop: ${s.reason}"\` (--force because abort already killed the driver; declared artifacts are preserved).`,
818
+ `Report what you did (aborted / annotated / discarded) or that you skipped because it was no longer active. This is a diagnosed, recoverable stop, not a crash.`,
819
+ ].join('\n')
820
+ }
821
+
822
+ async function metaLoop() {
823
+ if (!META_ENABLED) return
635
824
  const reported = [] // closure memory across the stateless ticks (caps re-alerting)
636
825
  let t = 0
637
826
  let fails = 0 // consecutive tick failures; trips the self-disable below
638
827
  while (!done) {
639
828
  t += 1
640
- // The analyst is purely advisory and read-only: a failed tick must NEVER reject this loop and
829
+ // The meta agent itself is read-only and optional: a failed tick must NEVER reject this loop and
641
830
  // abort the optimizer. Swallow any tick error, log it, and continue (or exit if `done` flipped).
642
831
  let tick = null
643
832
  try {
644
- tick = await agent(analystPrompt({ round, stall, bestScore: lastBestScore }, ANALYST_INTERVAL_S, reported.slice(-30)), {
645
- agentType: 'Explore', model: ANALYST_MODEL, schema: ANALYST_FINDINGS, phase: 'Analyst', label: `analyst#${t}`,
833
+ tick = await agent(metaPrompt({ round, stall, bestScore: lastBestScore }, META_INTERVAL_S, reported.slice(-30)), {
834
+ agentType: 'Explore', model: META_MODEL, schema: META_FINDINGS, phase: 'Meta', label: `meta#${t}`,
646
835
  })
647
836
  } catch (e) {
648
- log(`ANALYST tick #${t} errored (ignored, optimize unaffected): ${(e && e.message) || e}`)
837
+ log(`META tick #${t} errored (ignored, optimize unaffected): ${(e && e.message) || e}`)
649
838
  }
650
839
  if (tick) {
651
840
  fails = 0 // a real tick resets the failure streak
652
- for (const h of (tick.briefHints || [])) { analystSignals.push(h); reported.push(h) }
653
- for (const a of (tick.alerts || [])) { log(`ANALYST ALERT: ${a}`); reported.push(a) }
654
- } else if (++fails >= ANALYST_MAX_FAILS) {
841
+ for (const h of (tick.briefHints || [])) { metaSignals.push(h); reported.push(h) }
842
+ for (const a of (tick.alerts || [])) { log(`META ALERT: ${a}`); reported.push(a) }
843
+ // HARNESS EDITS (new ability): the meta restructures the workflow itself live — applied
844
+ // directly with free will (no gate, no caps), audited via harness.editLog + the run log.
845
+ // Takes effect at the next round (the optimize loop reads `harness` at each round start).
846
+ for (const e of (tick.harnessEdits || [])) applyHarnessEdit(e, round)
847
+ // STOP recommendations: hand each to a gated enforcer (detect/act separation). The fix also
848
+ // feeds the next round's brief so the loop corrects rather than just abandons.
849
+ for (const s of (tick.stops || [])) {
850
+ if (!s || !s.expId) continue
851
+ const stopKey = `stop:${s.expId}`
852
+ if (reported.includes(stopKey)) continue // never re-enforce the same experiment
853
+ reported.push(stopKey)
854
+ log(`META STOP: ${s.expId} [${s.failureClass}] ${s.reason}`)
855
+ metaSignals.push(`Experiment ${s.expId} was stopped (${s.failureClass}): ${s.reason} — next: ${s.fixHint}`)
856
+ try {
857
+ await agent(enforceStopPrompt(s), { phase: 'Meta', label: `enforce-stop:${s.expId}` })
858
+ } catch (e) {
859
+ log(`META enforce-stop ${s.expId} errored (ignored): ${(e && e.message) || e}`)
860
+ }
861
+ }
862
+ } else if (++fails >= META_MAX_FAILS) {
655
863
  // The pacing wait lives INSIDE the agent, so a tick that fails before sleeping (e.g. a schema
656
864
  // reject) leaves nothing to pace the retry — left unchecked the loop hot-spins agents. The
657
- // analyst is optional, so after a short streak of failures, disable it for the rest of the run.
658
- log(`ANALYST disabled after ${fails} consecutive failed ticks — optimize continues without it.`)
865
+ // meta is optional, so after a short streak of failures, disable it for the rest of the run.
866
+ log(`META disabled after ${fails} consecutive failed ticks — optimize continues without it.`)
659
867
  return
660
868
  }
661
869
  }
662
870
  }
663
871
 
664
- // Clear any stale sentinel from a prior run BEFORE the threads start, else the analyst's first wait
872
+ // Clear any stale sentinel from a prior run BEFORE the threads start, else the meta's first wait
665
873
  // would see it and exit instantly. The script can't touch the filesystem itself, so an agent does it.
666
- if (ANALYST_ENABLED) await agent(`rm -f ${DONE_SENTINEL}; echo cleared`, { phase: 'Orient', label: 'init:clear-sentinel' })
874
+ if (META_ENABLED) await agent(`rm -f ${DONE_SENTINEL}; echo cleared`, { phase: 'Orient', label: 'init:clear-sentinel' })
667
875
 
668
- // optimizeLoop is the run's result; analystLoop is advisory. The `.catch` is the definitive guard that
876
+ // optimizeLoop is the run's result; metaLoop is advisory. The `.catch` is the definitive guard that
669
877
  // the observer thread can NEVER reject the combined promise and fail an otherwise-good optimize run.
670
878
  const [optimizeResult] = await Promise.all([
671
879
  optimizeLoop(),
672
- analystLoop().catch((e) => log(`ANALYST thread exited abnormally (ignored): ${(e && e.message) || e}`)),
880
+ metaLoop().catch((e) => log(`META thread exited abnormally (ignored): ${(e && e.message) || e}`)),
673
881
  ])
674
- return optimizeResult
882
+ // Surface the harness audit alongside the optimize result: final round-plan + every live edit the
883
+ // meta agent applied (knobs, phase toggles, prompt overrides, injected steps), so the run is fully
884
+ // reconstructable from the return value + the run log.
885
+ return { ...optimizeResult, harness: harnessSummary(), harnessEditLog: harness.editLog }
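The stop bookkeeping inside `metaLoop` (skip malformed stops, never enforce the same experiment twice, queue a hint for the next brief) can be sketched synchronously. The `handleStops` helper and the `enforced` array below are hypothetical stand-ins: in the diff the enforcement is an awaited agent call using `enforceStopPrompt`, and the hint text uses a slightly different separator.

```javascript
// Hypothetical, synchronous re-creation of the stop handling in metaLoop.
// `enforced` stands in for the awaited enforcer agent call.
function handleStops(stops, reported, metaSignals, enforced) {
  for (const s of stops || []) {
    if (!s || !s.expId) continue               // malformed stop: skip
    const stopKey = `stop:${s.expId}`
    if (reported.includes(stopKey)) continue   // already enforced once
    reported.push(stopKey)
    metaSignals.push(`Experiment ${s.expId} was stopped (${s.failureClass}): ${s.reason}; next: ${s.fixHint}`)
    enforced.push(s.expId)
  }
}

const reported = []
const metaSignals = []
const enforced = []

// Invented stop record for illustration.
const stop = { expId: 'exp_0007', failureClass: 'eval', reason: 'scorer cannot parse output', fixHint: 'emit JSON lines' }

handleStops([stop, null, { reason: 'no id' }], reported, metaSignals, enforced) // tick 1
handleStops([stop], reported, metaSignals, enforced)                            // tick 2 repeats the same stop
console.log(enforced, reported, metaSignals.length)
```

Keying the dedup on `stop:<expId>` rather than on the reason text means a later tick that rephrases its diagnosis of the same experiment is still skipped.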
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: report
3
3
  description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
4
- evo_version: 0.5.0-alpha.11
4
+ evo_version: 0.5.0-alpha.13
5
5
  ---
6
6
 
7
7
  # Report
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  name: subagent
3
3
  description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
4
- evo_version: 0.5.0-alpha.11
4
+ evo_version: 0.5.0-alpha.13
5
5
  ---
6
6
 
7
7
  # Evo Subagent Protocol
@@ -124,8 +124,10 @@ evo gate add <id> --name <name> --command "<command>" # add a gate
124
124
 
125
125
  # Write paths used during iteration
126
126
  evo new --parent <id> -m "<hypothesis>" # allocate sibling experiment
127
+ evo new --parent <id> -m "<h>" --from-artifact <exp[:label]> # seed from a preserved artifact (EVO_SEED_ARTIFACT)
  evo run <id> [--check] # run (or --check to validate without consuming attempts)
- evo discard <id> --reason "<text>" # reject + park (keeps anchor ref)
+ evo abort <id> # stop a mid-run experiment (driver + its subprocess tree)
+ evo discard <id> --reason "<text>" [--failure-class build|eval|hypothesis] # reject + park (keeps anchor ref)
  evo restore <id> # un-discard or un-prune
  evo annotate <id> [<task_id>] "<text>" # per-attempt analysis
  evo set <id> --note "<text>" [--tag <t>] # per-node note from orchestrator
@@ -138,6 +140,7 @@ see `references/cli-quick-reference.md` "Reading workspace state".
  ## First Steps
 
  1. Read `.evo/project.md` to understand the target, what can be changed, and how to interpret results.
+ 1a. **Load this task's category skill(s).** Run `evo config get task-skills`; for each name returned (e.g. `finetuning`), load that evo skill IN FULL via your host's skill loader before you form an edit — it carries this category's method priors, recipes, and pre-run checks. If it returns blank, infer the category from `.evo/project.md` and load the matching skill if one applies (the subagent protocol alone covers prompt/code/config tasks). Skipping this is how a builder reverts to base-model defaults and reintroduces known mistakes (wrong device placement, stale trainer APIs, eval-before-build).
  2. Read the scratchpad for current state: `evo scratchpad`
  It surfaces: best path (★-marked in the tree), frontier (strategy-ranked branchable nodes), evaluated nodes awaiting decision, gates, annotations, what not to try, infra events, and notes. The Drill-downs section at the bottom lists the read-only commands for going deeper on any section.
  3. Study the pointer traces from your brief:
@@ -246,6 +249,20 @@ portable progress files there. evo mirrors that directory back into
  restart from benchmark-owned checkpoint files, but it does not freeze/restore an
  arbitrary Linux process.
 
+ **Declare reusable outputs as artifacts (any category).** If your experiment
+ produces an expensive, reusable output — a checkpoint, an adapter, a built
+ index, a compiled prompt, anything — write it to `EVO_CHECKPOINT_DIR` (durable:
+ it survives between-attempt cleanup and discard) and name it in the benchmark
+ result's `artifacts` field: `{"score": ..., "artifacts": {"<label>": "<path>"}}`.
+ Declared artifacts are preserved when the node is discarded, so a later
+ experiment can reuse them via `evo new --from-artifact <exp[:label]>` (the path
+ arrives as `EVO_SEED_ARTIFACT`). Never hardcode a name like `final_model/` — the
+ label is whatever your recipe declares. If a run is clearly heading toward
+ failure mid-flight (divergent loss, projected budget blow-out, a known-failure
+ signature), it can be stopped with `evo abort <id>` — that kills the driver and
+ its subprocess tree, and a partial artifact already written to
+ `EVO_CHECKPOINT_DIR` survives for reuse.
+
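The artifact declaration described in the added paragraph could look like the following sketch, assuming a Node benchmark script. The `buildResult` helper, the `adapter` label, and the `adapter.bin` file name are hypothetical, and how the result JSON reaches evo (stdout, a file, something else) is not specified in this hunk. Only `EVO_CHECKPOINT_DIR`, `EVO_SEED_ARTIFACT`, and the `{"score": ..., "artifacts": {"<label>": "<path>"}}` shape come from the text.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Fall back to a temp dir so the sketch runs outside evo; under evo the
// variable is assumed to be set by the runner.
const checkpointDir = process.env.EVO_CHECKPOINT_DIR || fs.mkdtempSync(path.join(os.tmpdir(), 'evo-ckpt-'))

// If the experiment was created with --from-artifact, the seed path arrives
// here; it is undefined otherwise.
const seedArtifact = process.env.EVO_SEED_ARTIFACT

// Hypothetical helper: write the expensive output into the durable directory
// and name it in the result's `artifacts` field under a recipe-chosen label.
function buildResult(score, label, fileName, contents) {
  fs.mkdirSync(checkpointDir, { recursive: true })
  const artifactPath = path.join(checkpointDir, fileName)
  fs.writeFileSync(artifactPath, contents)
  return { score, artifacts: { [label]: artifactPath } }
}

const contents = seedArtifact ? `seeded from ${seedArtifact}` : 'fresh build'
const result = buildResult(0.42, 'adapter', 'adapter.bin', contents)
console.log(JSON.stringify(result))
```

A later experiment allocated with `evo new --parent <id> -m "<h>" --from-artifact <exp[:label]>` would then find that path in `EVO_SEED_ARTIFACT` rather than rebuilding the output.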
  **If the workspace was initialized with `commit_strategy=tracked-only` (the default for `--backend pool`):** `evo run` only commits modifications to *tracked* files. New files require an explicit `git add` from inside the worktree, then a shisa-kanko ack on the run command:
 
  ```bash
@@ -270,7 +287,10 @@ The ack flag is required when the worktree has any untracked, non-gitignored fil
 
  Then either:
  - Fixable edit-bug (off-by-one, wrong signature): edit the worktree and `evo run <id>` again. Bounded by `max_attempts` (default 3). Before retrying, compare your planned edit against the previous attempts' `outcome.json` on this same node -- if two earlier attempts hit the same gate, a small tweak won't fix it. When the cap is hit, run is refused -- you must discard.
- - Hypothesis is wrong, no fix: `evo discard <id> --reason "..."` and branch a new experiment from the **original parent**.
+ - Otherwise discard, and **classify why** with `--failure-class` so the orchestrator can route reuse vs branch:
+ - **`eval`** — the produced artifact is good but the scoring / serving / decode config is wrong. Make sure the artifact was declared + preserved; a sibling can **retest it in seconds** via `evo new --from-artifact <id>` (arrives as `EVO_SEED_ARTIFACT`) instead of rebuilding. `evo discard <id> --reason "..." --failure-class eval`.
+ - **`build`** — the artifact-production step itself broke. Fix it, then retry/resume *from the last checkpoint in `EVO_CHECKPOINT_DIR`* rather than rebuilding from scratch. `evo discard <id> --reason "..." --failure-class build` only if you're abandoning this node.
+ - **`hypothesis`** — it ran clean but didn't help. `evo discard <id> --reason "..." --failure-class hypothesis` and branch a new experiment from the **original parent** (a different direction, not a retry of the same idea).
 
  - **`FAILED`** (infra error, non-zero exit, timeout): couldn't evaluate. Doesn't consume the retry budget.
  - Transient / fixable locally: retry.