@evo-hq/pi-evo 0.5.0-alpha.11 → 0.5.0-alpha.12
package/package.json
CHANGED
package/skills/discover/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: discover
|
|
3
3
|
description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
|
|
4
4
|
argument-hint: <optional context about what to optimize>
|
|
5
|
-
evo_version: 0.5.0-alpha.
|
|
5
|
+
evo_version: 0.5.0-alpha.12
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# Discover
|
|
@@ -116,20 +116,20 @@ evo --version
|
|
|
116
116
|
The output must be exactly:
|
|
117
117
|
|
|
118
118
|
```
|
|
119
|
-
evo-hq-cli 0.5.0-alpha.
|
|
119
|
+
evo-hq-cli 0.5.0-alpha.12
|
|
120
120
|
```
|
|
121
121
|
|
|
122
122
|
Three outcomes:
|
|
123
123
|
|
|
124
124
|
1. **Matches exactly** — continue to step 1.
|
|
125
125
|
2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
|
|
126
|
-
> Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.
|
|
126
|
+
> Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.12`). Run:
|
|
127
127
|
> ```
|
|
128
|
-
> uv tool install --force evo-hq-cli==0.5.0-alpha.
|
|
128
|
+
> uv tool install --force evo-hq-cli==0.5.0-alpha.12
|
|
129
129
|
> ```
|
|
130
130
|
> Then re-invoke this skill.
|
|
131
131
|
3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
|
|
132
|
-
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.
|
|
132
|
+
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.12` (or `pipx install evo-hq-cli==0.5.0-alpha.12`). Then re-invoke this skill.
|
|
133
133
|
|
|
134
134
|
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
|
|
135
135
|
|
|
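The three outcomes of the version check above can be sketched as a small classifier. This is only an illustration: the function name `classifyVersionOutput` and the outcome labels are hypothetical, and the sketch assumes the caller passes `null` when the shell reported `command not found`.

```javascript
// Hypothetical helper: classify the text printed by `evo --version`.
// EXPECTED is the exact line the skill requires; the labels are illustrative.
const EXPECTED = 'evo-hq-cli 0.5.0-alpha.12'

function classifyVersionOutput(output) {
  // Assumption: `output` is null/undefined when the binary was not found.
  const line = (output || '').trim()
  if (line === EXPECTED) return 'match'
  // Same package, different version: the CLI and the skill bundle have drifted.
  if (/^evo-hq-cli\s+\S+$/.test(line)) return 'version-drift'
  // Missing binary, or an unrelated `evo` package on PATH.
  return 'not-installed'
}
```

Only `match` continues to step 1. The other two results map to the two user-facing messages, and neither should trigger an automatic install.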
@@ -308,10 +308,10 @@ Dashboard live: http://127.0.0.1:8080 (pid 12345)
|
|
|
308
308
|
**Benchmark commands must be eval-only.** Do NOT wrap training and evaluation into a single benchmark command. If your benchmark command runs training before scoring, every gate revalidation and every `evo run --check` retrains from scratch, and the experiment budget burns on duplicated training instead of new experiments. Training is a separate step the agent invokes BEFORE `evo run`:
|
|
309
309
|
|
|
310
310
|
1. The agent makes changes (data curation, hyperparameter selection, technique choice, training code edits).
|
|
311
|
-
2. The agent runs training to produce a checkpoint
|
|
312
|
-
3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced
|
|
311
|
+
2. The agent runs the build/training step to produce its artifact — a checkpoint, adapter, index, whatever the recipe makes. Write it to `EVO_CHECKPOINT_DIR` (durable: survives between-attempt cleanup and discard) and declare it in the benchmark result's `artifacts` field so it's preserved + reusable; the per-technique I/O contract is in `evo:finetuning/references/glue.md`. Never hardcode a fixed name like `final_model/`.
|
|
312
|
+
3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced artifact and emits a score.
|
|
313
313
|
|
|
314
|
-
The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs
|
|
314
|
+
The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs build+eval together with no eval-only mode, wrap it: add a `--skip-build` flag, or have the wrapper detect an existing artifact (under `EVO_CHECKPOINT_DIR`) and short-circuit the build step. Without this, evo's gate-recheck and re-score mechanics rebuild repeatedly and the budget evaporates.
|
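A minimal sketch of the wrapper decision described above, assuming a tool that only offers build+eval together. The function name, the `exists` callback, and `artifactName` are hypothetical; `EVO_CHECKPOINT_DIR` and `--skip-build` are the names used in the skill text.

```javascript
// Hypothetical wrapper logic: decide which steps the benchmark command runs.
// `exists` is injected so the decision can be checked without touching the disk.
function planBenchmarkSteps(env, argv, exists, artifactName) {
  const dir = env.EVO_CHECKPOINT_DIR
  const artifactPath = dir ? `${dir}/${artifactName}` : null
  const skipRequested = argv.includes('--skip-build')
  const artifactReady = artifactPath !== null && exists(artifactPath)
  // Eval-only when the caller asks for it or a built artifact is already present.
  const steps = (skipRequested || artifactReady) ? ['eval'] : ['build', 'eval']
  return { steps, artifactPath }
}
```

With this shape, repeated gate rechecks reuse the artifact produced by the separate build step and skip the expensive part.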
|
315
315
|
|
|
316
316
|
**Runtime environment.** If the benchmark needs keys or other runtime variables, configure them through evo rather than copying `.env` into worktrees or hand-editing `config.json`:
|
|
317
317
|
|
|
@@ -323,6 +323,16 @@ evo env show
|
|
|
323
323
|
|
|
324
324
|
Values are resolved fresh by the orchestrator on each `evo run`. Config stores dotenv source metadata and key names, not secret values. The benchmark and gates receive the resolved env; gates do not receive `EVO_*` artifact variables.
|
|
325
325
|
|
|
326
|
+
### 7a. Record the task category (`task-skills`)
|
|
327
|
+
|
|
328
|
+
You already decided what's being optimized (steps 2–4). Record the evo category skill(s) a builder should load for it, so every executing agent — prose subagent or workflow lane — loads the right method knowledge instead of rediscovering it each round:
|
|
329
|
+
|
|
330
|
+
```bash
|
|
331
|
+
evo config set task-skills finetuning # any weight-update / training task
|
|
332
|
+
```
|
|
333
|
+
|
|
334
|
+
Rule of thumb: if the optimization updates model weights (SFT / LoRA / DPO / RL / continued-pretraining), set `task-skills finetuning`. **Leave it unset** for prompt / code / config / harness optimization — the subagent protocol already covers those; only set it when a dedicated category skill applies. Use a comma-separated list if more than one applies. Mirror the choice in `.evo/project.md` (step 12) so it's human-readable and survives as the fallback if config is ever cleared.
|
|
335
|
+
|
|
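As an illustration of the comma-separated form mentioned above, a consumer might turn the stored value into a list like this. The output format of `evo config get task-skills` is an assumption here (one comma-separated line, blank when unset), and `parseTaskSkills` is a hypothetical name.

```javascript
// Hypothetical parser for the value stored under `task-skills`.
// Assumption: the value is a single comma-separated line, blank when unset.
function parseTaskSkills(raw) {
  return (raw || '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
}
```

An empty result corresponds to the "leave it unset" case, where the subagent protocol alone applies.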
326
336
|
## 8. Set up gates
|
|
327
337
|
|
|
328
338
|
Gates inherit down the experiment tree -- children automatically get all ancestor gates.
|
package/skills/optimize/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: optimize
|
|
3
3
|
description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
|
|
4
4
|
argument-hint: "[subagents=N] [budget=N] [stall=N]"
|
|
5
|
-
evo_version: 0.5.0-alpha.
|
|
5
|
+
evo_version: 0.5.0-alpha.12
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
|
|
@@ -65,6 +65,8 @@ const STATE = {
|
|
|
65
65
|
committedCount: { type: 'integer' }, // total committed experiments (drives the periodic ideator trigger)
|
|
66
66
|
verifyRepeats: { type: 'integer' }, // benchmark noise profile (1 = deterministic, no confirm-loop)
|
|
67
67
|
summary: { type: 'string' }, // short scratchpad summary for subagent context
|
|
68
|
+
taskSkills: { type: 'array', items: { type: 'string' } }, // category skills a builder should load (e.g. ["finetuning"]) — resolved from config/project, never hardcoded in the workflow
|
|
69
|
+
knownLearnings: { type: 'array', items: { type: 'string' } }, // durable lessons to apply up front (drained from annotations) so a fresh stateless lane doesn't rediscover them
|
|
68
70
|
},
|
|
69
71
|
}
|
|
70
72
|
|
|
@@ -181,10 +183,27 @@ const PREVERDICT = {
|
|
|
181
183
|
// Analyst tick output: work-quality hints (fed into the next brief) + runtime/host alerts (surfaced).
|
|
182
184
|
const ANALYST_FINDINGS = {
|
|
183
185
|
type: 'object',
|
|
184
|
-
required: ['briefHints', 'alerts'],
|
|
186
|
+
required: ['briefHints', 'alerts', 'stops'],
|
|
185
187
|
properties: {
|
|
186
188
|
briefHints: { type: 'array', items: { type: 'string' } },
|
|
187
189
|
alerts: { type: 'array', items: { type: 'string' } },
|
|
190
|
+
// STOP recommendations for in-flight experiments that are clearly doomed.
|
|
191
|
+
// A stop is NOT a crash: each carries the diagnosis + a fix so the gated
|
|
192
|
+
// enforcer can abort, annotate (lesson outlives the worktree), and classify+
|
|
193
|
+
// preserve — and the fix feeds the next round (and, if general, a skill prior).
|
|
194
|
+
stops: {
|
|
195
|
+
type: 'array',
|
|
196
|
+
items: {
|
|
197
|
+
type: 'object',
|
|
198
|
+
required: ['expId', 'failureClass', 'reason', 'fixHint'],
|
|
199
|
+
properties: {
|
|
200
|
+
expId: { type: 'string' },
|
|
201
|
+
failureClass: { enum: ['build', 'eval', 'hypothesis'] },
|
|
202
|
+
reason: { type: 'string' }, // what's wrong + the concrete evidence
|
|
203
|
+
fixHint: { type: 'string' }, // what the NEXT experiment must change
|
|
204
|
+
},
|
|
205
|
+
},
|
|
206
|
+
},
|
|
188
207
|
},
|
|
189
208
|
}
|
|
190
209
|
|
|
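To make the new `stops` shape concrete, here is a hand-rolled check that mirrors the item schema above, plus a made-up tick result. This is a sketch only: how the host actually enforces the schema is not shown in the diff, and the experiment id and messages are invented.

```javascript
// Illustrative check mirroring the `stops` item schema; not the workflow's validator.
const FAILURE_CLASSES = ['build', 'eval', 'hypothesis']

function isValidStop(s) {
  if (!s || typeof s !== 'object') return false
  const stringsOk = ['expId', 'reason', 'fixHint'].every((k) => typeof s[k] === 'string')
  return stringsOk && FAILURE_CLASSES.includes(s.failureClass)
}

// A made-up analyst tick result with one stop; all values are placeholders.
const exampleTick = {
  briefHints: [],
  alerts: [],
  stops: [{
    expId: 'exp_0007',
    failureClass: 'build',
    reason: 'loss is NaN from step 40 onward',
    fixHint: 'lower the learning rate before retrying',
  }],
}
```

Because `stops` is now in `required`, a quiet tick still returns the key, with an empty array.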
@@ -281,11 +300,14 @@ function statePrompt() {
|
|
|
281
300
|
'Read-only. Do NOT edit files or run experiments. Run these and parse their output:',
|
|
282
301
|
'`evo scratchpad`, `evo frontier` (already prints a JSON envelope), `evo status`, `evo awaiting`.',
|
|
283
302
|
'Also read `.evo/project.md` for the metric goal, direction, and the benchmark-determinism line.',
|
|
303
|
+
'Also run `evo config get task-skills` (if unset/blank, infer the task category from project.md) and read recent `evo annotations` for durable learnings already found this run.',
|
|
284
304
|
'Return: bestScore + bestExpId; the theoretical ceiling (1.0 for max metric, 0.0 for min)',
|
|
285
305
|
'and direction; the frontier nodes ALREADY ranked by the configured strategy',
|
|
286
306
|
'(id, score, rank) — preserve evo\'s ordering, do not re-rank; the list of evaluated-but-',
|
|
287
307
|
'undecided experiment ids; committedCount (number of committed experiments, from `evo status`);',
|
|
288
308
|
'verifyRepeats (from project.md: 1 if deterministic, 3 if sampling-based / variance-expected);',
|
|
309
|
+
'taskSkills (category skills a builder should load, e.g. ["finetuning"] — from `evo config get task-skills`, else inferred from project.md);',
|
|
310
|
+
'knownLearnings (short durable lessons from annotations to apply up front: trainer-API quirks, device placement, eval-side config, etc.);',
|
|
289
311
|
'and a 2-3 sentence scratchpad summary for subagent context.',
|
|
290
312
|
].join(' ')
|
|
291
313
|
}
|
|
@@ -305,6 +327,7 @@ function scanBrief(batch) {
|
|
|
305
327
|
'- Shared failure causes -- root-cause reasons recurring across 2+ experiments (the *why*, not the surface gate name).',
|
|
306
328
|
'- Wall patterns -- approaches or gates multiple experiments consistently fail on.',
|
|
307
329
|
'- Compound-failure standouts -- single experiments hitting multiple failure modes.',
|
|
330
|
+
'- Axis exhaustion vs fixable plumbing -- read each node\'s `failure_class` (build|eval|hypothesis) from outcome.json. A cluster of `hypothesis` failures across STRUCTURALLY DISTINCT approaches means the axis itself is unpromising (flag it so the next briefs diverge); `build`/`eval` failures are fixable plumbing (recipe/scoring) and must NOT be read as axis exhaustion.',
|
|
308
331
|
'',
|
|
309
332
|
'Evidence must be VERBATIM quotes from outcome.json fields, trace messages, or error text -- not paraphrases. If you cannot cite verbatim evidence for a finding, drop it. Evidence: short quotes (<200 chars), max 3 per finding.',
|
|
310
333
|
'Return JSON only: {"findings":[{"description","experiment_ids":[],"evidence":[]}]}.',
|
|
@@ -326,6 +349,10 @@ function aggregatePrompt(ids) {
|
|
|
326
349
|
'~the same score (a plateau), the bottleneck is not where those hypotheses aimed — emit kind:"axis-warning" whose',
|
|
327
350
|
'label names the saturated axis AND suggests the orthogonal axis (harness, score definition, input data, or a',
|
|
328
351
|
'different mechanism) the next briefs should pivot to. At most one axis-warning.',
|
|
352
|
+
'Also read DISCARDED nodes (`evo discards` + their outcome.json `failure_class`): a cluster of 3+ failure_class="hypothesis"',
|
|
353
|
+
'discards across STRUCTURALLY DISTINCT approaches is itself an axis-warning (the direction keeps not helping). IGNORE',
|
|
354
|
+
'failure_class="build"/"eval" discards for axis purposes — those are fixable plumbing (retry/resume or eval-retest), not',
|
|
355
|
+
'evidence the axis is wrong.',
|
|
329
356
|
'Return JSON only.',
|
|
330
357
|
].join(' ')
|
|
331
358
|
}
|
|
@@ -356,9 +383,24 @@ function briefPrompt(state, findings, patterns, parents, ideated, analystHints)
|
|
|
356
383
|
}
|
|
357
384
|
|
|
358
385
|
// IMPLEMENT — allocate + edit, but do NOT run (a pre-run verifier audits the edit first).
|
|
386
|
+
// Context capsule shared by every builder/runner lane: which category skills to load on demand,
|
|
387
|
+
// and the durable learnings to apply up front — so a fresh stateless agent inherits the priors and
|
|
388
|
+
// hard-won lessons a prose single-subagent would have had, instead of rediscovering them.
|
|
389
|
+
function capsuleLines(state) {
|
|
390
|
+
const skills = (state && state.taskSkills) || []
|
|
391
|
+
const learnings = (state && state.knownLearnings) || []
|
|
392
|
+
const lines = []
|
|
393
|
+
lines.push(skills.length
|
|
394
|
+
? `Task-category skills — load IN FULL via your host skill loader if the work needs them (they carry this category's priors, recipes, and pre-run checks): ${skills.join(', ')}.`
|
|
395
|
+
: "Identify the task category from `.evo/project.md` and load the matching evo skill (e.g. evo:finetuning for a training move) IN FULL before you build — it carries this category's priors and pre-run checks.")
|
|
396
|
+
if (learnings.length) lines.push(`KNOWN LEARNINGS to apply before acting (already found this run — do not rediscover): ${JSON.stringify(learnings)}.`)
|
|
397
|
+
return lines
|
|
398
|
+
}
|
|
399
|
+
|
|
359
400
|
function implementPrompt(brief, parent, state) {
|
|
360
401
|
return [
|
|
361
402
|
`First, load and follow the evo subagent skill: Read ${pr}/skills/subagent/SKILL.md IN FULL and follow it as your operating protocol — do not skip it even if the brief looks simple.`,
|
|
403
|
+
...capsuleLines(state),
|
|
362
404
|
`Allocate your experiment via \`evo new --parent ${parent}\`, then edit inside the returned worktree to implement the brief.`,
|
|
363
405
|
'IMPORTANT: do NOT run `evo run` yet — a pre-run verifier audits your change first. Stop once the edit is complete.',
|
|
364
406
|
'Do NOT edit benchmark, gate, or framework code; do NOT weaken/bypass any gate.',
|
|
@@ -397,12 +439,14 @@ function revisePrompt(expId, worktree, findings) {
|
|
|
397
439
|
}
|
|
398
440
|
|
|
399
441
|
// RUN — evaluate + commit the (pre-verified) experiment.
|
|
400
|
-
function runPrompt(expId) {
|
|
442
|
+
function runPrompt(expId, state) {
|
|
401
443
|
return [
|
|
402
444
|
`First, load the evo subagent skill: Read ${pr}/skills/subagent/SKILL.md IN FULL and follow its run protocol (it covers \`evo run ${expId} --check\` for non-committing wiring validation that does not consume the attempt budget).`,
|
|
403
|
-
|
|
445
|
+
...capsuleLines(state),
|
|
446
|
+
`CRITICAL ordering: if this experiment produces an output artifact through a build or training step (whatever your recipe declares — a checkpoint dir, adapter, merged model, index, etc.), run that step to COMPLETION and confirm the artifact exists BEFORE the real run. Never call \`evo run\` while that step is still in flight or before its output exists — evaluating a not-yet-produced artifact wastes the attempt. If the experiment warm-starts, the parent's reusable artifact is in EVO_PARENT_POLICY (start from it; do not redo from scratch).`,
|
|
404
447
|
`Then run \`evo run ${expId}\` to evaluate and (if it improves and passes gates) commit it.`,
|
|
405
448
|
'If it exits GATE_FAILED, do not fight the gate — report status=evaluated.',
|
|
449
|
+
'If `evo run` is terminated externally mid-flight (the concurrent analyst can STOP a doomed experiment — it aborts the run and discards this node with a diagnosis), do NOT retry: report status:none and stop. The diagnosis is already recorded as an annotation and will steer the next brief.',
|
|
406
450
|
`Return: expIds:["${expId}"]; status (committed|evaluated|failed|none); committedImprover = true ONLY if evo printed COMMITTED;`,
|
|
407
451
|
'bestExpId + bestScore (required when committedImprover is true); any gates added; learnings.',
|
|
408
452
|
].join(' ')
|
|
@@ -468,7 +512,9 @@ function analystPrompt(ctx, intervalS, reported) {
|
|
|
468
512
|
'- Stuck experiment / time-budget overrun: from `evo status`/`evo show`, an experiment active far longer than its peers, or a round overrunning the others. ALERT.',
|
|
469
513
|
'- Stuck axis: from `evo tree`, 3+ structurally-distinct committed hypotheses plateaued at ~the same score → name the saturated axis + one orthogonal axis. BRIEF HINT.',
|
|
470
514
|
'- Dead direction / ignored mechanism: annotations repeatedly naming a mechanism the recent work ignores, or a direction that keeps regressing. BRIEF HINT.',
|
|
471
|
-
'
|
|
515
|
+
'- Heading toward failure (STOP): an in-flight experiment that is CLEARLY doomed or wasting the budget — a divergent / NaN / flatlined progress metric; projected completion beyond the remaining time budget; or a known-fatal signature (e.g. output the scorer cannot parse; a silent resource mis-placement that tanks throughput with no error; a corrupt input/format that invalidates the result). HIGH PRECISION ONLY: default to NOT stopping — recommend a STOP only with concrete evidence that finishing is wasted, and only for an experiment still `active`. Emit a stop with: expId; failureClass (build = the build/produce step is broken; eval = artifact is fine but scoring/serving is wrong; hypothesis = it runs but won\'t help); reason (the diagnosis + the evidence you saw); fixHint (what the NEXT experiment must change).',
|
|
516
|
+
'You stay READ-ONLY: do NOT run `evo abort` / `evo discard` yourself. A gated enforcer acts on each stop — it aborts the run + its subprocess tree, annotates your diagnosis (so it outlives the worktree and feeds the next round), and discards with the failureClass so the partial artifact is preserved. A STOP is a diagnosed, recoverable stop, never a silent kill.',
|
|
517
|
+
'Return {briefHints:[...], alerts:[...], stops:[...]}. briefHints feed the NEXT round\'s briefs; alerts surface to the user; each stop triggers the gated enforcer (and its fixHint also feeds the next brief). All-empty is fine — most ticks should be quiet.',
|
|
472
518
|
].join('\n')
|
|
473
519
|
}
|
|
474
520
|
|
|
@@ -476,6 +522,16 @@ function analystPrompt(ctx, intervalS, reported) {
|
|
|
476
522
|
// iteration budget (deepening the branch each time a committed improver lands). The independent
|
|
477
523
|
// evo:verifier gates EACH run for design-time cheating BEFORE the experiment is evaluated; its
|
|
478
524
|
// findings are fed back to a revise agent on the same experiment until it passes or is discarded.
|
|
525
|
+
//
|
|
526
|
+
// Lane decomposition (decompose only at CONTEXT SEAMS): build+eval are a SINGLE agent — `run`
|
|
527
|
+
// produces the artifact and evaluates it end-to-end (one coherent context, no handoff mid-build).
|
|
528
|
+
// The only split is `implement` (write the edit) vs `run`, separated by the read-only evo:verifier
|
|
529
|
+
// seam — a genuinely different concern (adversarial diff audit, different agentType/model) that has
|
|
530
|
+
// to interpose between the edit and the expensive run. The two share state by REFERENCE (the
|
|
531
|
+
// worktree on disk), not by passing a context window, and BOTH receive the capsule (category skills
|
|
532
|
+
// + known learnings via capsuleLines), so neither reverts to base-model defaults. Merging implement
|
|
533
|
+
// into run would erase the verifier gate for no real gain, since the code already lives in the
|
|
534
|
+
// worktree the run agent reads.
|
|
479
535
|
async function runBrief(brief, state) {
|
|
480
536
|
let parent = brief.parent
|
|
481
537
|
let best = null
|
|
@@ -504,7 +560,7 @@ async function runBrief(brief, state) {
|
|
|
504
560
|
}
|
|
505
561
|
|
|
506
562
|
// run -> evaluate + commit
|
|
507
|
-
const r = await agent(runPrompt(impl.expId), { schema: SUBAGENT_RESULT, phase: 'Optimize', label: `run:${impl.expId}` })
|
|
563
|
+
const r = await agent(runPrompt(impl.expId, state), { schema: SUBAGENT_RESULT, phase: 'Optimize', label: `run:${impl.expId}` })
|
|
508
564
|
if (!r) break
|
|
509
565
|
|
|
510
566
|
// post-run validity audit (evo:verifier, post-phase) on committed improvers
|
|
@@ -630,6 +686,23 @@ async function optimizeLoop() {
|
|
|
630
686
|
// DURING rounds (not per-round). Each tick is a FRESH agent (no cross-tick memory), so `reported`
|
|
631
687
|
// holds the dedup state in this closure. Work-quality findings -> analystSignals (next brief);
|
|
632
688
|
// runtime/host alerts -> the run log. Stops when optimizeLoop sets `done`.
|
|
689
|
+
// Gated ENFORCER for an analyst STOP: detect (analyst) and act (this agent) stay separate. Verifies
|
|
690
|
+
// the experiment is still active, then aborts its run (driver + subprocess tree), annotates the
|
|
691
|
+
// diagnosis (survives the worktree + feeds the next round via knownLearnings), and discards with the
|
|
692
|
+
// failure class so the partial artifact is preserved + classified. A STOP is a diagnosed, recoverable
|
|
693
|
+
// stop — never a silent kill.
|
|
694
|
+
function enforceStopPrompt(s) {
|
|
695
|
+
return [
|
|
696
|
+
`A concurrent analyst flagged experiment ${s.expId} as heading toward failure and recommends STOPPING it. You are the gated ENFORCER — read-only except for the three evo commands below; do NOT edit code or run training.`,
|
|
697
|
+
`First VERIFY: run \`evo show ${s.expId}\`. Only proceed if its status is still \`active\`. If it is committed / evaluated / discarded / not found, do NOTHING and report skipped (it already resolved).`,
|
|
698
|
+
`If still active, run in order:`,
|
|
699
|
+
` 1. \`evo abort ${s.expId}\` — stop the evo run driver and its subprocess tree.`,
|
|
700
|
+
` 2. annotate the diagnosis so it outlives the worktree and feeds the next round: \`evo annotate ${s.expId} "STOPPED (${s.failureClass}): ${s.reason} | FIX: ${s.fixHint}"\` (quote carefully).`,
|
|
701
|
+
` 3. classify + preserve: \`evo discard ${s.expId} --force --failure-class ${s.failureClass} --reason "analyst stop: ${s.reason}"\` (--force because abort already killed the driver; declared artifacts are preserved).`,
|
|
702
|
+
`Report what you did (aborted / annotated / discarded) or that you skipped because it was no longer active. This is a diagnosed, recoverable stop, not a crash.`,
|
|
703
|
+
].join('\n')
|
|
704
|
+
}
|
|
705
|
+
|
|
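The enforcer prompt asks the agent to "quote carefully" because `reason` and `fixHint` are free text interpolated into shell commands. One way to do that, sketched here under the assumption of a POSIX shell, is single-quote escaping; `shellQuote` is a hypothetical helper and is not part of the workflow.

```javascript
// Hypothetical helper: wrap arbitrary text as one POSIX-shell single-quoted word.
// Inside single quotes nothing is expanded, so only the quote character itself
// needs handling: close the quote, emit an escaped quote, reopen.
function shellQuote(text) {
  return `'${String(text).replace(/'/g, `'\\''`)}'`
}
```

Text quoted this way passes double quotes, `$`, and backticks through to `evo annotate` literally.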
633
706
|
async function analystLoop() {
|
|
634
707
|
if (!ANALYST_ENABLED) return
|
|
635
708
|
const reported = [] // closure memory across the stateless ticks (caps re-alerting)
|
|
@@ -651,6 +724,21 @@ async function analystLoop() {
|
|
|
651
724
|
fails = 0 // a real tick resets the failure streak
|
|
652
725
|
for (const h of (tick.briefHints || [])) { analystSignals.push(h); reported.push(h) }
|
|
653
726
|
for (const a of (tick.alerts || [])) { log(`ANALYST ALERT: ${a}`); reported.push(a) }
|
|
727
|
+
// STOP recommendations: hand each to a gated enforcer (detect/act separation). The fix also
|
|
728
|
+
// feeds the next round's brief so the loop corrects rather than just abandons.
|
|
729
|
+
for (const s of (tick.stops || [])) {
|
|
730
|
+
if (!s || !s.expId) continue
|
|
731
|
+
const stopKey = `stop:${s.expId}`
|
|
732
|
+
if (reported.includes(stopKey)) continue // never re-enforce the same experiment
|
|
733
|
+
reported.push(stopKey)
|
|
734
|
+
log(`ANALYST STOP: ${s.expId} [${s.failureClass}] ${s.reason}`)
|
|
735
|
+
analystSignals.push(`Experiment ${s.expId} was stopped (${s.failureClass}): ${s.reason} — next: ${s.fixHint}`)
|
|
736
|
+
try {
|
|
737
|
+
await agent(enforceStopPrompt(s), { phase: 'Analyst', label: `enforce-stop:${s.expId}` })
|
|
738
|
+
} catch (e) {
|
|
739
|
+
log(`ANALYST enforce-stop ${s.expId} errored (ignored): ${(e && e.message) || e}`)
|
|
740
|
+
}
|
|
741
|
+
}
|
|
654
742
|
} else if (++fails >= ANALYST_MAX_FAILS) {
|
|
655
743
|
// The pacing wait lives INSIDE the agent, so a tick that fails before sleeping (e.g. a schema
|
|
656
744
|
// reject) leaves nothing to pace the retry — left unchecked the loop hot-spins agents. The
|
package/skills/report/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: report
|
|
3
3
|
description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
|
|
4
|
-
evo_version: 0.5.0-alpha.
|
|
4
|
+
evo_version: 0.5.0-alpha.12
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Report
|
package/skills/subagent/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: subagent
|
|
3
3
|
description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
|
|
4
|
-
evo_version: 0.5.0-alpha.
|
|
4
|
+
evo_version: 0.5.0-alpha.12
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Evo Subagent Protocol
|
|
@@ -124,8 +124,10 @@ evo gate add <id> --name <name> --command "<command>" # add a gate
|
|
|
124
124
|
|
|
125
125
|
# Write paths used during iteration
|
|
126
126
|
evo new --parent <id> -m "<hypothesis>" # allocate sibling experiment
|
|
127
|
+
evo new --parent <id> -m "<h>" --from-artifact <exp[:label]> # seed from a preserved artifact (EVO_SEED_ARTIFACT)
|
|
127
128
|
evo run <id> [--check] # run (or --check to validate without consuming attempts)
|
|
128
|
-
evo
|
|
129
|
+
evo abort <id> # stop a mid-run experiment (driver + its subprocess tree)
|
|
130
|
+
evo discard <id> --reason "<text>" [--failure-class build|eval|hypothesis] # reject + park (keeps anchor ref)
|
|
129
131
|
evo restore <id> # un-discard or un-prune
|
|
130
132
|
evo annotate <id> [<task_id>] "<text>" # per-attempt analysis
|
|
131
133
|
evo set <id> --note "<text>" [--tag <t>] # per-node note from orchestrator
|
|
@@ -138,6 +140,7 @@ see `references/cli-quick-reference.md` "Reading workspace state".
|
|
|
138
140
|
## First Steps
|
|
139
141
|
|
|
140
142
|
1. Read `.evo/project.md` to understand the target, what can be changed, and how to interpret results.
|
|
143
|
+
1a. **Load this task's category skill(s).** Run `evo config get task-skills`; for each name returned (e.g. `finetuning`), load that evo skill IN FULL via your host's skill loader before you form an edit — it carries this category's method priors, recipes, and pre-run checks. If it returns blank, infer the category from `.evo/project.md` and load the matching skill if one applies (the subagent protocol alone covers prompt/code/config tasks). Skipping this is how a builder reverts to base-model defaults and reintroduces known mistakes (wrong device placement, stale trainer APIs, eval-before-build).
|
|
141
144
|
2. Read the scratchpad for current state: `evo scratchpad`
|
|
142
145
|
It surfaces: best path (★-marked in the tree), frontier (strategy-ranked branchable nodes), evaluated nodes awaiting decision, gates, annotations, what not to try, infra events, and notes. The Drill-downs section at the bottom lists the read-only commands for going deeper on any section.
|
|
143
146
|
3. Study the pointer traces from your brief:
|
|
@@ -246,6 +249,20 @@ portable progress files there. evo mirrors that directory back into
|
|
|
246
249
|
restart from benchmark-owned checkpoint files, but it does not freeze/restore an
|
|
247
250
|
arbitrary Linux process.
|
|
248
251
|
|
|
252
|
+
**Declare reusable outputs as artifacts (any category).** If your experiment
|
|
253
|
+
produces an expensive, reusable output — a checkpoint, an adapter, a built
|
|
254
|
+
index, a compiled prompt, anything — write it to `EVO_CHECKPOINT_DIR` (durable:
|
|
255
|
+
it survives between-attempt cleanup and discard) and name it in the benchmark
|
|
256
|
+
result's `artifacts` field: `{"score": ..., "artifacts": {"<label>": "<path>"}}`.
|
|
257
|
+
Declared artifacts are preserved when the node is discarded, so a later
|
|
258
|
+
experiment can reuse them via `evo new --from-artifact <exp[:label]>` (the path
|
|
259
|
+
arrives as `EVO_SEED_ARTIFACT`). Never hardcode a name like `final_model/` — the
|
|
260
|
+
label is whatever your recipe declares. If a run is clearly heading toward
|
|
261
|
+
failure mid-flight (divergent loss, projected budget blow-out, a known-failure
|
|
262
|
+
signature), it can be stopped with `evo abort <id>` — that kills the driver and
|
|
263
|
+
its subprocess tree, and a partial artifact already written to
|
|
264
|
+
`EVO_CHECKPOINT_DIR` survives for reuse.
|
|
265
|
+
|
|
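The result shape described above can be sketched as follows. The label `adapter`, the relative path, and the helper name `buildResult` are placeholders, and how the benchmark hands this JSON to evo (stdout, file, or otherwise) is not specified here.

```javascript
// Sketch: build a benchmark result that declares reusable artifacts.
// Paths are resolved under EVO_CHECKPOINT_DIR so they survive cleanup and discard.
function buildResult(score, env, artifacts) {
  const dir = env.EVO_CHECKPOINT_DIR
  if (!dir) throw new Error('EVO_CHECKPOINT_DIR is not set')
  const declared = {}
  for (const [label, relPath] of Object.entries(artifacts)) {
    declared[label] = `${dir}/${relPath}`
  }
  return JSON.stringify({ score, artifacts: declared })
}
```

Keeping the label in the recipe's hands, not in a fixed name like `final_model/`, is what lets a later `--from-artifact <exp[:label]>` pick the right output.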
249
266
|
**If the workspace was initialized with `commit_strategy=tracked-only` (the default for `--backend pool`):** `evo run` only commits modifications to *tracked* files. New files require an explicit `git add` from inside the worktree, then a shisa-kanko ack on the run command:
|
|
250
267
|
|
|
251
268
|
```bash
|
|
@@ -270,7 +287,10 @@ The ack flag is required when the worktree has any untracked, non-gitignored fil
|
|
|
270
287
|
|
|
271
288
|
Then either:
|
|
272
289
|
- Fixable edit-bug (off-by-one, wrong signature): edit the worktree and `evo run <id>` again. Bounded by `max_attempts` (default 3). Before retrying, compare your planned edit against the previous attempts' `outcome.json` on this same node -- if two earlier attempts hit the same gate, a small tweak won't fix it. When the cap is hit, run is refused -- you must discard.
|
|
273
|
-
-
|
|
290
|
+
- Otherwise discard, and **classify why** with `--failure-class` so the orchestrator can route reuse vs branch:
|
|
291
|
+
- **`eval`** — the produced artifact is good but the scoring / serving / decode config is wrong. Make sure the artifact was declared + preserved; a sibling can **retest it in seconds** via `evo new --from-artifact <id>` (arrives as `EVO_SEED_ARTIFACT`) instead of rebuilding. `evo discard <id> --reason "..." --failure-class eval`.
|
|
292
|
+
- **`build`** — the artifact-production step itself broke. Fix it, then retry/resume *from the last checkpoint in `EVO_CHECKPOINT_DIR`* rather than rebuilding from scratch. `evo discard <id> --reason "..." --failure-class build` only if you're abandoning this node.
|
|
293
|
+
- **`hypothesis`** — it ran clean but didn't help. `evo discard <id> --reason "..." --failure-class hypothesis` and branch a new experiment from the **original parent** (a different direction, not a retry of the same idea).
|
|
274
294
|
|
|
275
295
|
- **`FAILED`** (infra error, non-zero exit, timeout): couldn't evaluate. Doesn't consume the retry budget.
|
|
276
296
|
- Transient / fixable locally: retry.
|
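The three `--failure-class` values described in the discard bullets above imply different follow-up moves. A rough routing sketch, assuming the orchestrator tracks each experiment's parent id; the function name and the action labels are hypothetical summaries, not evo output.

```javascript
// Illustrative routing for a discarded experiment, keyed on its failure class.
function routeDiscard(failureClass, expId, parentId) {
  switch (failureClass) {
    case 'eval':
      // Artifact is good, scoring was wrong: retest by seeding from the preserved artifact.
      return { action: 'retest-from-artifact', seedFrom: expId }
    case 'build':
      // Production step broke: fix it and resume from the last checkpoint.
      return { action: 'fix-and-resume', resumeFrom: 'EVO_CHECKPOINT_DIR' }
    case 'hypothesis':
      // Ran clean but did not help: branch a different direction from the original parent.
      return { action: 'branch-new-direction', parent: parentId }
    default:
      throw new Error(`unknown failure class: ${failureClass}`)
  }
}
```

Only `hypothesis` outcomes count toward axis exhaustion; `build` and `eval` outcomes are treated as fixable plumbing.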