npm - @evo-hq/pi-evo - Versions diffs - 0.4.5 → 0.5.0-alpha.10 - Mend

@evo-hq/pi-evo 0.4.5 → 0.5.0-alpha.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12) hide show

package/package.json +1 -1
package/skills/discover/SKILL.md +145 -9
package/skills/discover/references/inline_instrumentation.js +45 -0
package/skills/discover/references/inline_instrumentation.py +50 -0
package/skills/discover/references/sdk_node.js +19 -0
package/skills/discover/references/sdk_python.py +25 -0
package/skills/{optimize → discover}/references/sizing-the-round.md +8 -0
package/skills/infra-setup/SKILL.md +1 -2
package/skills/optimize/SKILL.md +167 -8
package/skills/optimize/workflows/evo-optimize.js +672 -0
package/skills/report/SKILL.md +1 -1
package/skills/subagent/SKILL.md +89 -7

package/skills/optimize/SKILL.md CHANGED Viewed

@@ -1,11 +1,47 @@
 ---
 name: optimize
-description: Run the evo optimization loop with parallel subagents until interrupted.
+description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
 argument-hint: "[subagents=N] [budget=N] [stall=N]"
-evo_version: 0.4.5
+evo_version: 0.5.0-alpha.10
 ---
-Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns parallel subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
+Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
+**This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.
+## Evo surface -- loop-relevant
+You're inside `/evo:optimize`. Things you'll pull/dispatch during the loop:
+```
+main thread (you)
+├── Skills (Skill tool)
+│   └── evo:finetuning     before writing or changing any train.py
+│
+└── Subagents to dispatch (Task tool, subagent_type=...)
+    └── evo:ideator        stalled, or every ~5 committed experiments.
+                           One subagent per brief:
+                           failure_analysis, literature, frontier_extrapolation
+subagent thread (each subagent spawned by step 5)
+├── evo:subagent skill     loaded by the subagent on first turn -- the brief's
+│                          first sentence mandates it (not auto-loaded)
+└── evo:verifier subagent  MANDATORY pre AND post every evo run.
+                           Pre: ~30s static analysis before the experiment runs.
+                           Post: result-validity audit after it commits.
+references (Read tool, on demand)
+├── discover/references/sizing-the-round.md      pick subagents=N
+├── references/evo-wait.md                       waiting without burning context
+├── finetuning/references/glue.md                train.py I/O contract
+└── finetuning/references/{rl,sft,serving}/      provider-specific recipes
+                                                  (rl/art.md, sft/tinker.md,
+                                                  serving/vllm.md)
+```
+Full surface tree (orchestrator entry-point view, including benchmark-reviewer,
+infra-setup, and the complete references catalogue) lives in `evo:discover`'s
+"Evo surface" section.
 ## Host conventions
@@ -30,13 +66,23 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
 ## Configuration
-The orchestrator **sizes the round to the benchmark's resource profile** instead of using a flat default. Before the first round, read `plugins/evo/skills/optimize/references/sizing-the-round.md` and apply it. Short version:
+The orchestrator's three round-shape knobs are **subagents** (round width), **budget** (per-branch depth), and **stall** (consecutive rounds with no improvement before auto-stopping; default 5).
+A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over what's below.
-- **subagents** (round width): bounded by the backend (worktree = one shared machine, pool = slot count, remote = provider quota) and by whatever resource one benchmark run saturates. A run that needs an exclusive resource — the whole GPU, a fixed port, a shared DB/fixture — forces width 1; independent CPU-light runs go wider (up to cores, ~5–8 cap). Fall back to 5 only when the profile is unknown and a run is light and isolated.
-- **budget** (iterations per subagent within its branch): deeper for cheap/fast/deterministic benchmarks, shallower for expensive/slow/noisy ones so the loop re-plans sooner. ~5 is a reasonable midpoint.
-- **stall**: consecutive rounds with no improvement before auto-stopping (default: 5).
+**Picking `subagents` and `budget` is load-bearing -- do not skim.**
-A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over the heuristic.
+Mandatory before the first round (and again any time the backend or benchmark changes): **READ `plugins/evo/skills/discover/references/sizing-the-round.md` IN FULL.** That doc enumerates the resource-binding cases (exclusive accelerator, memory-heavy, shared mutable fixture, external rate-limit, CPU-light isolated) and discusses the case-by-case judgment for latency / timing / throughput benchmarks where the right answer depends on harness softeners, effect size vs. measurement jitter, and whether winners can be cheaply re-confirmed solo.
+Under-subscribing wastes wall-clock. Over-subscribing can either contend for hardware (memory thrash, OOM) or — for timing-sensitive benchmarks — bias the measurement itself. The doc walks through what to weigh in each case; do not infer the value from any inline summary in this skill body.
+Common ways agents get this wrong by skimming:
+- "8-core machine, CPU-light → width 5" sounds right but skips the question of whether the metric is corruptible by sibling-process pressure. The doc has the judgment framing.
+- "Worktree backend has no slot cap so I can go higher" — worktree just shifts the cap from infrastructure to the binding resource. Same hardware, no safety net.
+If `.evo/project.md` records a resource profile (it should, after `/evo:discover`), START from that. The reference doc is what you use to APPLY it. If the profile is missing or thin, that's a discover-step bug — fix it (write a resource profile that names the binding resource explicitly) before continuing.
+In your opening message, state the width/budget you chose AND a one-line reason that references the binding-resource framing FROM THE DOC (e.g. "width 1 — exclusive GPU; budget 8 — runs deterministic"; or "width 3 — CPU-light isolated, but harness reports stable jitter at this concurrency so promoting solo-confirm gate; budget 5"). If your reason doesn't connect to the doc's framing, go back and read it.
 - **autonomous**: the keep-going loop. **Default: on** — evo is autoresearch; it runs unattended. Turn off for a run with `evo autonomous off`.
 - **subagents-only**: gate orchestrator edits, pushing all edits through subagents. **Default: on**. Turn off for a run with `evo subagents-only off`.
@@ -56,6 +102,18 @@ evo defaults get subagents-only --json
 As your **very first actions, before the loop**, resolve each and arm it: run `evo autonomous on` / `evo subagents-only on` when it resolves on, or `evo autonomous off` / `evo subagents-only off` when an explicit instruction or stored default turned it off. If a behavior resolves off — whether from the user's instruction this run or a stored default — say so in your opening message (e.g. "autonomous off — running one round at a time, as you asked") so it's never invisible.
+**Orchestrator driver (Claude Code only).** evo can drive the loop two ways: the prose loop below (default, every host), or a deterministic **dynamic workflow** (Claude Code only, opt-in). Resolve which as part of your very first actions:
+1. `evo host show` — must be `claude-code` for the workflow driver. If it prints `<not set>` (a pre-host workspace), determine your actual runtime from your own context (system prompt, env such as `CLAUDECODE=1`, self-identity): **only if you are genuinely Claude Code**, do the one-time host migration now (`evo host set claude-code`) and continue; if you are any other runtime, do NOT stamp the host here — leave it for Step 0.1 to record and use the prose loop. Any non-`claude-code` host uses the prose loop.
+2. `evo config get default-orchestrator` — `workflow` selects the workflow driver; anything else (including unset) resolves to `prose`. An explicit user instruction this run still wins.
+If host is `claude-code` **and** the value is `workflow` **and** the Workflow tool is available in this session, do NOT drive the loop turn-by-turn. Launch the bundled workflow once instead:
+- Call the **Workflow** tool with `scriptPath: ${CLAUDE_PLUGIN_ROOT}/skills/optimize/workflows/evo-optimize.js` and `args: {pluginRoot: "${CLAUDE_PLUGIN_ROOT}", subagents: <N>, budget: <N>, stall: <N>}`, using the round sizing you resolved above. **Pass all four keys explicitly — never omit one.** For `stall`, use the user's `/optimize stall=N` override if given, else the default 5. (The workflow's stop condition is the stall limit, so a dropped `stall` silently reverts it to 5.)
+- Report the returned `runId` and tell the user to watch progress with `/workflows`. The workflow runs the round loop itself (orient → mandatory scan + cross-history axis check → ideators on stall/periodic → briefs → fan-out + verify → collect → frontier-select → stall); you do **not** execute "The Loop" section below.
+Otherwise — any non-`claude-code` host, `default-orchestrator` unset/`prose`, or the Workflow tool unavailable — ignore this and follow **The Loop** below as the canonical driver. The workflow is only an execution strategy over the same `evo` CLI; gates, frontier, dashboard, and recovery are identical either way.
 **Autonomous mode.** Off lets you stop naturally at a turn boundary — finish a round, report, and stop. On arms the stop-nudge: at every turn boundary you are re-prompted to keep driving the loop until the **stall** limit is hit or the user interrupts. Without it, the loop does NOT force-continue across turn boundaries. To stop an autonomous run, the user runs `evo autonomous off` or `evo exit-optimize-mode`.
 **Subagents-only mode.** Off, the orchestrator may edit files directly — the optimization protocol still pushes edits through subagents (you write briefs; they edit in their worktrees), but a one-off orchestrator edit is not blocked. On arms the deny-gate: orchestrator file-mutation tools (Edit/Write, mutating Bash) are denied on an alternating cadence — 1st violation blocked, 2nd allowed, 3rd blocked, and so on — each block nudging you to delegate the edit to a subagent. It is a nudge, not a hard block: an edit can still land on an even-numbered attempt. Subagent edits (sessions with an `exp_id`) are never gated. To lift it, the user runs `evo subagents-only off` or `evo exit-optimize-mode`.
@@ -297,6 +355,86 @@ Update notes with cross-cutting learnings:
   evo set <exp_id> --note "key insight from round N"
   ```
+### 6a. Pattern recognition across history (objective, not narrative)
+Step 6 cross-cuts a single round. This step looks across ALL committed experiments in the run, not just this round's. The orchestrator's failure mode is tunnel vision -- iterating on the visible axis (whatever knob the recent rounds touched) while missing the orthogonal axis (the harness itself, the score definition, the environment, the input data, plumbing). The check is cheap, runs between rounds, and is the most reliable signal that you're on the wrong axis.
+Four checks via `evo show` + `evo tree`:
+1. **Score plateaus across structurally distinct hypotheses.** Read the `hypothesis` strings of committed experiments. If 3+ experiments with materially different hypotheses (not minor parameter sweeps of the same idea) all commit at the same score, the bottleneck is not where the hypotheses were aiming. The next move belongs on an axis none of those hypotheses touched.
+2. **Repeated failure class.** Tally failure indicators across discarded + failed nodes in the run so far: `gate_failures` names, non-zero exit codes, shared error-message fragments in `benchmark_err.log`. If 2+ failures share a class, that class is structural -- fix the cause, rather than queuing more experiments that will hit the same wall.
+3. **Internal-vs-benchmark delta.** Compare each evaluated node's *internal* indicators (progress signals the experiment's own process produces during a run -- intermediate test pass-rates, training loss, build success, agent self-report metrics, whatever the trace stream carries) to its *committed* benchmark score. Healthy internal signal + flat benchmark score = the experiment is optimizing something the benchmark does not reward. The fix is usually in the harness, output format, or score definition -- not in another hypothesis on the same axis.
+4. **Annotate facts, not narratives.** Annotations via `evo annotate <exp_id>` and `evo set <exp_id> --note` should record what HAPPENED -- scores, exact error messages, surprising observations, sources used. Not what you hoped would happen, what you plan to try next, or how you feel about the result. Annotations get loaded into future decision context and into ideator briefs; narrative noise contaminates them. State facts; leave plans to TodoWrite or `evo set --note` on the round itself.
+If any check surfaces a structural issue, the next round's subagent briefs should target the orthogonal axis the pattern identifies. Another iteration on a plateaued or systematically-failing axis produces another data point with the same conclusion.
+### 6b. Periodically spawn ideators (in parallel)
+The optimize loop's scan sub-agents (step 3) read the CURRENT round's evaluated experiments for failure patterns. They don't do deep cross-graph analysis or external literature scans -- that work belongs to the ideator skill (`evo:ideator`).
+Spawn ideators in parallel when ANY of these triggers fire:
+- **Periodic**: every N=5 committed experiments since the last ideator round
+- **Stall**: best score unchanged for M=3 consecutive committed experiments (the stall counter from step 6)
+- **Failure cluster**: M=3 consecutive discards with related root causes (use the `evo discards` output)
+- **User-triggered**: a directive (`evo direct`) asks for fresh ideas
+When a trigger fires, spawn three parallel **evo ideator subagents** via your host's Task tool -- one per brief:
+```
+Task(subagent_type="evo:ideator", prompt="workspace=<path>\nbrief=failure_analysis")
+Task(subagent_type="evo:ideator", prompt="workspace=<path>\nbrief=literature")
+Task(subagent_type="evo:ideator", prompt="workspace=<path>\nbrief=frontier_extrapolation")
+```
+| Brief | What it does |
+|---|---|
+| `failure_analysis` | Cross-graph clustering of discards/failures |
+| `literature` | Web/arXiv scan for untried techniques in the workspace domain |
+| `frontier_extrapolation` | Deeper variants of the steepest score gradient on the best path |
+Each subagent runs the brief in its own context, appends proposals as JSONL lines to `.evo/run_<run_id>/ideator/proposals.jsonl` (single final write), and returns a JSON summary. See `plugins/evo/agents/ideator.md` for the full procedure each ideator follows.
+Ideators take 5-10 min while the optimize loop's next round is typically 1-2 min away. If you fire and continue, proposals miss the next round's brief-writing every time. Two patterns work:
+- **Block here briefly.** If the trigger was a STALL or FAILURE CLUSTER, the next round's quality depends on fresh ideas -- block until enough proposals land:
+  ```bash
+  evo wait --for ideators --count 3 --timeout 900   # 15 min cap, fail-open
+  ```
+  Exit 0 means the proposals are ready; exit 124 (timeout) means proceed with whatever's available -- proposals.jsonl may have partial results.
+- **Fire and continue for periodic spawns** (every-N-commits trigger). The next round can run without proposals; the round after that will read them once they land. Use this when there's plenty of in-graph signal still to extract.
+### 6c. Reconcile ideator proposals at brief-writing time
+Before writing the next round's briefs (step 4 of the next iteration), check for new ideator proposals:
+```bash
+# Read proposals newer than the last round
+test -f .evo/run_*/ideator/proposals.jsonl && \
+  tail -n +1 .evo/run_*/ideator/proposals.jsonl | \
+  jq -s --argjson cutoff "$LAST_ROUND_END_TS" \
+    'map(select(.generated_at > $cutoff))'
+```
+If you fired ideators in 6b WITHOUT blocking (periodic trigger) and they haven't landed yet, you can also wait here -- but only a short timeout, since brief-writing should not stall indefinitely:
+```bash
+evo wait --for ideators --count 1 --timeout 120   # 2 min cap, fail-open
+```
+For each new proposal:
+1. Check the workspace graph -- has the proposed config already been tried? Use `evo discards --like "<keyword>"` to scan. If yes, skip.
+2. Score each remaining proposal by `expected_score_uplift × confidence`. Confidence ranking: `frontier_extrapolation > failure_analysis > literature`, all else equal.
+3. The top 1-2 proposals become objectives in the next round's briefs (step 4). Cite the proposal's `hypothesis` and `mechanism` in the brief's *Objective* field.
+4. Leave the rest in the queue -- they may surface as winners after a few more rounds when the frontier shifts.
+Proposals are advisory, not mandatory. If none look better than what step 3's scan sub-agents surfaced from in-graph signal, ignore them and proceed with the in-graph briefs. Ideator output complements, doesn't replace.
 ### 7. Continue or stop
 **Continue** if:
@@ -317,6 +455,27 @@ On stop, print a final summary:
 Go back to step 1.
+## Polling discipline
+When waiting on a long-running background process (a subagent's training subprocess, a long evaluation, a batch job), do NOT use `while true; do sleep N; tail file; done`. That loop never exits when the underlying process crashes -- the tail keeps reading the same dead file, the agent interprets "no growth" as "still working," and the agent blocks indefinitely.
+Use `evo wait`. The CLI is the bounded, structured replacement:
+```bash
+# wait until the training subprocess exits, OR its log stalls, OR the GPU goes idle,
+# whichever first; 60-minute ceiling; structured JSON on stdout
+evo wait --for process=$TRAIN_PID \
+         --for log-growth=$TRAIN_LOG \
+         --for gpu-idle \
+         --timeout 60m --stall-threshold 5m --json
+```
+Multiple `--for` flags combine; the wait returns on the first matching condition. The JSON output's `exit_reason` and `triggered_by` identify which condition fired. Process / log-growth / gpu-* watches do not require an evo workspace context; the workspace-anchored watches (`--for experiments`, `--for ideators`) cover the ideator + commit waits described elsewhere in this skill.
+Full surface, exit codes, JSON shape, examples: `references/evo-wait.md` (under `plugins/evo/skills/references/`).
+If `evo wait` is not available for some reason (older CLI on PATH, sandbox constraint), fall back to a bounded poll loop that checks all three signals -- process liveness via `kill -0 $PID`, log growth via `wc -c` delta, GPU via `nvidia-smi --query-gpu=utilization.gpu` -- and exits on any one going negative. NEVER unbounded `while true`.
 ## Resetting the eval epoch
 `evo infra event -m "<reason>" --breaking` bumps `current_eval_epoch` and blocks