npm - @evo-hq/pi-evo - Versions diffs - 0.5.3 → 0.6.0 - Mend

@evo-hq/pi-evo 0.5.3 → 0.6.0

Files changed (9)

package/README.md +2 -0
package/package.json +1 -1
package/skills/discover/SKILL.md +41 -6
package/skills/infra-setup/SKILL.md +1 -1
package/skills/optimize/SKILL.md +105 -13
package/skills/optimize/workflows/evo-optimize.js +1 -0
package/skills/report/SKILL.md +48 -6
package/skills/ship/SKILL.md +140 -0
package/skills/subagent/SKILL.md +34 -2

package/README.md CHANGED

@@ -25,6 +25,8 @@ skill needs one to run multiple experiments per round.
 | `skills/discover/` | First-run setup: explore repo, propose optimization dimensions, build benchmark, run first experiment |
 | `skills/optimize/` | The search loop: parallel subagents form hypotheses, edit, get scored, frontier picks next branch |
 | `skills/subagent/` | Per-experiment brief contract for the optimize round's fanout |
+| `skills/report/` | Terminal score chart mirroring the dashboard scatter plot |
+| `skills/ship/` | Distill the best valid experiment into a clean mergeable change |
 | `skills/infra-setup/` | Provider matrix for remote-sandbox backends (Modal, E2B, Daytona, etc.) |
 ## Versioning

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@evo-hq/pi-evo",
-  "version": "0.5.3",
+  "version": "0.6.0",
   "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
   "publishConfig": {
     "access": "public"

package/skills/discover/SKILL.md CHANGED

@@ -2,7 +2,7 @@
 name: discover
 description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
 argument-hint: <optional context about what to optimize>
-evo_version: 0.5.3
+evo_version: 0.6.0
 ---
 # Discover
@@ -25,6 +25,9 @@ evo plugin
 │   │   ├── evo:optimize       after discover commits the baseline -- drives the loop.
 │   │   │                      Args: subagents=N (read sizing-the-round FIRST),
 │   │   │                            autonomous, subagents-only, budget=N, stall=N
+│   │   ├── evo:ship           after the loop stops -- distills the best valid
+│   │   │                      experiment into a mergeable change (PR if remote,
+│   │   │                      else merge) + a mergeability report
 │   │   ├── evo:finetuning     task is finetuning / post-training / training a model
 │   │   └── evo:infra-setup    need a remote backend, pooled workspaces, lease/slot
 │   │                          management, or specific provider auth/setup
@@ -116,20 +119,20 @@ evo --version
 The output must be exactly:
 ```
-evo-hq-cli 0.5.3
+evo-hq-cli 0.6.0
 ```
 Three outcomes:
 1. **Matches exactly** — continue to step 1.
 2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
-   > Your installed evo CLI is on a different version than this skill (`0.5.3`). Run:
+   > Your installed evo CLI is on a different version than this skill (`0.6.0`). Run:
    > ```
-   > uv tool install --force evo-hq-cli==0.5.3
+   > uv tool install --force evo-hq-cli==0.6.0
    > ```
    > Then re-invoke this skill.
 3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
-   > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.3` (or `pipx install evo-hq-cli==0.5.3`). Then re-invoke this skill.
+   > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0` (or `pipx install evo-hq-cli==0.6.0`). Then re-invoke this skill.
 Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
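
The three outcomes of the version check above can be sketched as a small classifier. This is a minimal illustration, not part of the package: `classifyEvoVersion` and its return shape are hypothetical, and the function only returns the install command to show the user, in line with the "do not auto-install" rule.

```javascript
// Hypothetical helper: classify the text printed by `evo --version`.
// `expectedVersion` is the skill's `evo_version` (0.6.0 in this release).
function classifyEvoVersion(output, expectedVersion) {
  // `output` is null when the shell reported `command not found`.
  const line = (output ?? "").trim();
  const match = /^evo-hq-cli (\S+)$/.exec(line);
  if (!match) {
    // Missing binary, or a different package such as the unrelated `evo 1.x`.
    return {
      outcome: "not-installed",
      tellUser: `uv tool install evo-hq-cli==${expectedVersion}`,
    };
  }
  if (match[1] === expectedVersion) {
    return { outcome: "match", tellUser: null };
  }
  // Right package, wrong version: skill bundle and CLI have drifted.
  return {
    outcome: "drift",
    tellUser: `uv tool install --force evo-hq-cli==${expectedVersion}`,
  };
}

console.log(classifyEvoVersion("evo-hq-cli 0.4.2", "0.6.0"));
```

The caller would stop and relay `tellUser` to the user for anything other than `match`.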
@@ -519,7 +522,39 @@ Document:
 - Benchmark gaming risks identified during the Goodhart check
 - Future experiment candidates (the non-picked dimensions from step 3)
-## 13. Report to the user
+## 13. Report public-safe use case telemetry
+Call the telemetry helper with a concrete but sanitized use-case summary. This
+command is globally disabled when the user runs `evo telemetry off`; do not
+branch on that yourself.
+Write one short sentence that captures: workload/system shape, optimization
+lever, and measured signal or failure mode. Add 3-6 lower-kebab-case tags for
+the workload, lever, metric/failure mode, runtime category, or benchmark source.
+Avoid filler-only tags like `optimization`, `benchmark`, or `codebase`.
+```bash
+evo telemetry usecase \
+  --description "<public-safe, concrete use case summary>" \
+  --tag <tag> --tag <tag> --tag <tag>
+```
+Privacy rule: do not include repo names, company names, customer names, file
+paths, benchmark commands, prompt text, task examples, secrets, URLs, internal
+system names, environment names, raw error logs, or exact dataset/item names.
+Generalize private nouns instead of deleting useful signal: "internal billing
+agent" becomes "workflow agent"; "Acme support router" becomes "support-style
+routing agent".
+Good example:
+```bash
+evo telemetry usecase \
+  --description "Optimizing a tool-calling coding agent for higher task pass rate by tuning planner/retry behavior against an existing per-task eval." \
+  --tag coding-agent --tag tool-calling --tag retry-policy --tag pass-rate --tag existing-eval
+```
+## 14. Report to the user
 End the skill by reporting in chat:
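
The tag guidance added in step 13 (3-6 lower-kebab-case tags, no filler tags) could be checked before calling `evo telemetry usecase`. The sketch below is an assumption about how an agent might self-check; `checkUsecaseTags` is a hypothetical name, and nothing in the diff says the CLI enforces these rules itself.

```javascript
// Filler tags named in the skill text; the list is illustrative, not exhaustive.
const FILLER_TAGS = new Set(["optimization", "benchmark", "codebase"]);
// Lowercase alphanumeric words joined by single hyphens.
const LOWER_KEBAB = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Hypothetical pre-flight check: returns a list of problems, empty when the tags look fine.
function checkUsecaseTags(tags) {
  const problems = [];
  if (tags.length < 3 || tags.length > 6) {
    problems.push(`expected 3-6 tags, got ${tags.length}`);
  }
  for (const tag of tags) {
    if (!LOWER_KEBAB.test(tag)) problems.push(`not lower-kebab-case: ${tag}`);
    if (FILLER_TAGS.has(tag)) problems.push(`filler tag: ${tag}`);
  }
  return problems;
}

// Tags from the "good example" in the diff.
console.log(
  checkUsecaseTags(["coding-agent", "tool-calling", "retry-policy", "pass-rate", "existing-eval"])
);
```

The privacy rule (no repo names, paths, URLs, and so on) is not something a regex can verify, so it stays a judgment call for the author of the description.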

package/skills/infra-setup/SKILL.md CHANGED

@@ -1,7 +1,7 @@
 ---
 name: infra-setup
 description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
-evo_version: 0.5.3
+evo_version: 0.6.0
 ---
 # Infra Setup

package/skills/optimize/SKILL.md CHANGED

@@ -1,14 +1,51 @@
 ---
 name: optimize
-description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
+description: Drive structured autoresearch iteration after evo:discover and the baseline commit. Use when the user invokes /evo:optimize or asks to try ideas, try variants, run experiments, use available GPUs, improve the current best/frontier, continue an evo search, or compare candidate changes in an evo workspace. The orchestrator plans and spawns optimization subagents; candidate edits/runs belong to those subagents. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
 argument-hint: "[subagents=N] [budget=N] [stall=N]"
-evo_version: 0.5.3
+evo_version: 0.6.0
 ---
 Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
 **This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.
+**Plain-language trigger.** In an initialized evo workspace, casual user wording
+like "try a couple ideas", "try different variants", "use the available GPUs",
+"continue from the current best", or "see what improves" is an optimize request
+unless the user explicitly asks for a read-only report. Do not treat the lack of
+a slash command as permission to bypass this protocol. Loading this skill from
+plain-language wording is also explicit authorization to use the host's
+subagent mechanism for the resolved round width; the user does not have to say
+"spawn subagents" or "parallel agents" separately.
+**Candidate-work delegation invariant.** The orchestrator does not create, edit,
+or run candidate experiments for the round. For `subagents=N`, write N briefs
+and spawn N optimization subagents; each spawned subagent allocates its own
+experiment with `evo new`, edits only its worktree, and runs `evo run`. Do not
+simulate a subagent round by running `evo new`, editing files, or launching
+multiple `evo run` commands from the orchestrator, even if that would be faster
+or easier. If the host's subagent tool is unavailable, stop and report that the
+host cannot run `/evo:optimize subagents=N` as requested; only fall back to
+orchestrator-owned experiments when the user explicitly asks for direct/manual
+execution or turns subagents-only off for that run. Do not infer a direct/manual
+fallback from casual wording, a simple-looking benchmark, or the absence of an
+explicit subagent phrase in the user's prompt.
+**Resource-cap invariant.** `subagents=N` is live concurrency, not total ideas.
+Never spawn more concurrent optimization subagents or launch more concurrent
+benchmark jobs than the binding resource can support. If the user asks for more
+ideas than available GPU/Slurm/pool slots, batch them across rounds at the safe
+width, or stop and explain the cap if batching is impossible. Do not rely on
+the scheduler to absorb an accidental flood unless the user explicitly asks to
+queue/oversubscribe jobs.
+**Bounded-run stop rule.** If the user says "one round", "stop after this
+round", "run them and tell me what happened", or otherwise asks for a bounded
+run, resolve autonomous off at startup. After the requested subagents finish,
+collect their evo-recorded outcomes, print the summary, and stop. Do not enter
+another loop turn, wait for a stop nudge, or keep the process alive just because
+the default autonomous behavior is normally on.
 ## Evo surface -- loop-relevant
 You're inside `/evo:optimize`. Things you'll pull/dispatch during the loop:
@@ -68,7 +105,7 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
 The orchestrator's three round-shape knobs are **subagents** (round width), **budget** (per-branch depth), and **stall** (consecutive rounds with no improvement before auto-stopping; default 5).
-A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over what's below.
+A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value wins over what's below, subject to the resource-cap invariant above.
 **Picking `subagents` and `budget` is load-bearing -- do not skim.**
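
One way to read the resource-cap invariant together with the `subagents=N` override: the requested width is clamped to what the binding resource supports, and surplus ideas are batched across rounds. The sketch below assumes the slot count is already known; `planRounds` and its parameters are hypothetical names, not evo APIs.

```javascript
// Hypothetical planner: split ideas into rounds whose width never exceeds
// the binding resource (GPU, Slurm, or pool slots).
function planRounds(ideas, requestedWidth, availableSlots) {
  const width = Math.min(requestedWidth, availableSlots);
  if (width < 1) {
    // Batching is impossible; the skill says to stop and explain the cap.
    throw new Error("no slots available: stop and explain the cap");
  }
  const rounds = [];
  for (let i = 0; i < ideas.length; i += width) {
    rounds.push(ideas.slice(i, i + width));
  }
  return rounds;
}

// Five ideas, user asked for width 4, only 2 slots are free.
console.log(planRounds(["a", "b", "c", "d", "e"], 4, 2));
```

Each inner array is one round's worth of briefs, so the live concurrency stays at the safe width even when the user asks for more ideas than slots.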
@@ -145,6 +182,13 @@ files or the attempt should be treated as `remote_infra_failure`.
 **Runtime recipe/env.** Benchmark runtime is evo configuration, not something subagents should rediscover or copy into worktrees. Use `evo config runtime show` for prepare/before-run/prefix and `evo env show` for redacted env sources. If a run fails because expected runtime setup or env is missing, report it as setup failure or configure it from the orchestrator; do not patch benchmark code to bake in secrets or local paths. Use `evo run <exp_id> --check` for non-committing wiring validation; do not invent ad-hoc validation wrappers.
+**Replicated/noisy benchmarks.** If the user or `.evo/project.md` says an idea
+must pass `n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation,
+configure the benchmark so each `evo run` records the grouped aggregate as that
+experiment's score. Do not represent replicates as independent evo experiments
+and do not judge, report, or promote an idea by its best replicate. The frontier
+and best path must see the same aggregate statistic the user uses for decisions.
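
As an illustration of the replicated-benchmark rule, the sketch below collapses replicate scores into one aggregate per experiment. It assumes the benchmark wrapper has already collected the replicate scores; `aggregateScore` is a hypothetical name and the choice of median or mean is whatever the user or `.evo/project.md` specifies.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  if (values.length === 0) throw new Error("no replicates recorded");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical aggregator: the single number an `evo run` would record as the score.
function aggregateScore(replicates, statistic = "median") {
  if (statistic === "mean") {
    if (replicates.length === 0) throw new Error("no replicates recorded");
    return replicates.reduce((sum, v) => sum + v, 0) / replicates.length;
  }
  return median(replicates);
}

// n=3 replicates: the recorded score is the median, not the best replicate.
console.log(aggregateScore([0.9, 0.5, 0.6]));
```

Reporting `Math.max(...replicates)` instead would be exactly the "best replicate" judgment the paragraph above rules out.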
 **CLI reference.** If you are unsure which command to use, read `plugins/evo/skills/references/cli-quick-reference.md`. It is the canonical command map; this skill only repeats the high-frequency commands.
 ## Prerequisites
@@ -296,7 +340,15 @@ Spawn all subagents in a **single batch** using your host's parallel-subagent to
 Per host, the spawn shape matters because evo's loop depends on *completion notifications* arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):
 - **claude-code** — fire one `Bash(run_in_background=true)` call per brief. The bash invokes the subagent (the host's `Task` tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a `<task-notification>` at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
-- **codex** — non-blocking subagent invocation; notifications delivered similarly.
+- **codex** — call `spawn_agent` once per optimization brief. If `spawn_agent`
+  is deferred or not visible yet, first use Codex's tool-discovery tool
+  (`tool_search`) with a query like `spawn agent subagent`, then call the
+  discovered spawn tool. The optimize skill's plain-language trigger is
+  sufficient authorization for this; do not stop or fall back to direct edits
+  merely because the user did not explicitly write "spawn subagents." The
+  candidate worker prompt is the one that starts with the mandatory "First,
+  load and follow..." sentence below. A later read-only scan subagent does not
+  count toward the round width.
 - **hermes** — `terminal(background=true)`; notifications delivered similarly.
 - **openclaw** — `sessions_spawn deliver:false`; notifications delivered similarly.
 - **opencode** — *batch-parallel only* (no background notifications). Fire N `task` calls in ONE assistant message; all `tool_result`s return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
@@ -318,6 +370,11 @@ Then append:
 The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.
+Before leaving step 5, check yourself: the number of optimization subagents you
+spawned must equal the resolved `subagents=N` unless the user explicitly
+requested a smaller direct/manual run. Scan/analysis subagents are separate and
+do not count toward this number.
 ### 6. Collect results and update state
 After all subagents complete:
@@ -331,16 +388,18 @@ After all subagents complete:
 **Cross-cut the round's evaluated nodes.** Before moving on, read `experiments/<id>/attempts/NNN/outcome.json` for each evaluated node from this round. The structured `gates[]` entries and `benchmark.result` let you spot shared failure modes the subagent summaries may have glossed over (e.g., three different subagents produced evaluated nodes whose gate_failures all included `refund_flow` -- that's a structural constraint the next round must confront, not three independent bad hypotheses).
-Prune dead branches where 3+ children all regressed:
+Prune branches you have decided are exhausted:
   ```bash
-  evo prune <exp_id> --reason "exhausted: N children all regressed"
+  evo prune <exp_id> --exhausted --reason "exhausted: <why>"
   ```
-`evo prune` accepts `committed` or `evaluated` nodes. Use it when you want
-to mark a lineage exhausted while preserving the result for later review or
-reference. Prune keeps the git commit alive (anchored at `refs/evo-anchor/<run>/<exp>`)
-so the node can be restored if needed. **Never `evo discard` a committed
-node** — it would orphan the branch ref and risk losing the commit.
+`evo prune` accepts `committed` or `evaluated` nodes:
+- `--exhausted` (default/legacy): stop branching here; the result still counts.
+- `--invalid`: this result is wrong; exclude it and descendants from best/frontier.
+- `--yes`: required only with `--invalid` on the current best valid spine.
+Use `--invalid --yes` when you have proven a best-spine result or ancestor is
+wrong. **Never `evo discard` a committed node** -- prune preserves its commit.
 If a previously-pruned (or discarded-then-restored) node is worth revisiting:
   ```bash
@@ -444,18 +503,51 @@ Proposals are advisory, not mandatory. If none look better than what step 3's sc
 - User hasn't interrupted
 - Score hasn't reached the theoretical maximum
+To continue, go back to step 1.
 **Stop** if:
 - Stall counter >= stall limit (N consecutive rounds with no improvement)
 - Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
 - User interrupted
-On stop, print a final summary:
+On stop, the loop is done -- do not go back to step 1. Print a final summary:
 - Best score achieved and experiment ID
 - Total experiments run across all rounds
 - The winning diff: `evo diff <best_exp_id>`
 - Suggested next steps if the score hasn't converged
+- **How to land it:** point the user at `/evo:ship` to turn the winning
+  experiment into a mergeable change (a PR when the repo has a remote, else a
+  merge into the working branch). The raw winning diff is not the artifact to
+  merge by hand -- `/evo:ship` distills it to the minimal clean change and
+  attaches a mergeability report. Mention this whenever the run produced a
+  committed experiment that beats the baseline.
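
The continue/stop decision in step 7 can be sketched as a single predicate. This is a minimal sketch under assumptions: `stopReason` and its field names are hypothetical, the default stall limit of 5 comes from the knob description earlier in the skill, and the theoretical optimum is taken as 1.0 for a max metric and 0.0 for a min metric as stated above.

```javascript
// Hypothetical predicate: returns why the loop should stop, or null to continue.
function stopReason({ stallCount, stallLimit = 5, score, direction, interrupted }) {
  if (interrupted) return "interrupted";
  if (stallCount >= stallLimit) return "stalled";
  // Theoretical optimum: 1.0 for a max metric, 0.0 for a min metric.
  const converged = direction === "max" ? score >= 1.0 : score <= 0.0;
  if (converged) return "converged";
  return null; // continue: go back to step 1
}

console.log(stopReason({ stallCount: 5, score: 0.7, direction: "max" }));
```

A non-null result means the loop prints the final summary once and does not re-enter step 1.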
+### 8. Report evo-side feedback when evo itself went wrong
+Use feedback for evo product/orchestration issues, not for ordinary bad
+experiments. Good triggers: confusing CLI behavior, subagent handoff failures,
+workflow/meta-controller mistakes, recovery that failed, remote backend handling
+that was hard to diagnose, or an evo policy/gate that blocked the wrong thing.
+The command is anonymous and no-ops when telemetry is off; do not branch on that.
+Keep the report public-safe but actionable enough for the evo backend team to
+reproduce the case. Include the phase, what you expected evo to do, what it did,
+and a generic repro shape. Do not include repo names, company names, file paths,
+commands, prompt text, raw logs, URLs, secrets, dataset names, or exact task
+examples. If the issue is tied to a specific evo experiment, include its
+anonymous experiment id with `--exp-id`.
-Go back to step 1.
+```bash
+evo telemetry feedback \
+  --exp-id exp_0007 \
+  --kind orchestration \
+  --phase optimize \
+  --summary "Subagent completed but the orchestrator did not collect its evaluated result before writing the next round briefs." \
+  --expected "Round collection should wait for all dispatched subagents or clearly report the missing one." \
+  --actual "The next round was planned from partial results, leaving one evaluated node out of the pattern scan." \
+  --repro "Run optimize with multiple subagents on a workspace where one branch finishes after the first completion notification." \
+  --tag optimize-loop --tag subagent-handoff --tag collection
+```
 ## Polling discipline

package/skills/optimize/workflows/evo-optimize.js CHANGED

@@ -553,6 +553,7 @@ function metaPrompt(ctx, intervalS, reported, journal) {
     '- Dead direction / ignored mechanism: annotations repeatedly naming a mechanism the recent work ignores, or a direction that keeps regressing. BRIEF HINT.',
     '- Heading toward failure (STOP): an in-flight experiment that is CLEARLY doomed or wasting the budget — a divergent / NaN / flatlined progress metric; projected completion beyond the remaining time budget; or a known-fatal signature (e.g. output the scorer cannot parse; a silent resource mis-placement that tanks throughput with no error; a corrupt input/format that invalidates the result). HIGH PRECISION ONLY: default to NOT stopping — recommend a STOP only with concrete evidence that finishing is wasted, and only for an experiment still `active`. Emit a stop with: expId; failureClass (build = the build/produce step is broken; eval = artifact is fine but scoring/serving is wrong; hypothesis = it runs but won\'t help); reason (the diagnosis + the evidence you saw); fixHint (what the NEXT experiment must change).',
     'For STOPs you stay READ-ONLY: do NOT run `evo abort` / `evo discard` yourself. A gated enforcer acts on each stop — it aborts the run + its subprocess tree, annotates your diagnosis (so it outlives the worktree and feeds the next round), and discards with the failureClass so the partial artifact is preserved. A STOP is a diagnosed, recoverable stop, never a silent kill.',
+    'If you observe an evo workflow/meta-controller defect (missed collection, wrong prompt handoff, recovery confusion, bad stop/enforcer behavior), you MAY run `evo telemetry feedback --kind workflow --phase meta ...` with public-safe summary/expected/actual/repro/tags before returning. This is anonymous and no-ops when telemetry is off. Do NOT report ordinary bad experiments, raw logs, paths, commands, repo names, URLs, secrets, or prompt text.',
     '',
     'HARNESS CONTROL (your distinctive power): you may restructure the optimize workflow itself, live, when you judge it will help — edits apply directly (free will) and take effect next round. Current harness state: ' + JSON.stringify(harnessSummary()) + '.',
     'harnessEdits ops: (1) set-knob {knob: width|budget|stall|ideateEvery|ideateStall, value} — retune the loop (widen the round, deepen branches, change the stall limit or ideation cadence). (2) toggle-phase {phaseName: scan|ideate, enabled} — turn a phase off/on (e.g. skip scan when traces are uninformative; force ideation early). (3) set-prompt {target: state|scan|aggregate|brief|implement|run|ideator|collect|preverify|audit, mode: append|replace, text} — edit the prompt that step uses. Appends ACCUMULATE as standing directives (the current ones are visible in the harness state above — do not re-add them); replace swaps the base wholesale. Use preverify/audit to harden the verifier when you spot a cheat pattern the audit missed. (4) inject-step {at: before-scan|after-scan|before-brief|after-collect, text, label} — add an extra agent step at that seam each round. Every edit needs a rationale citing the evidence.',

package/skills/report/SKILL.md CHANGED

@@ -1,22 +1,40 @@
 ---
 name: report
-description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
-evo_version: 0.5.3
+description: Read-only evo run reporting. Use when the user invokes /evo:report, asks what happened overnight, asks what improved recently, asks for the best/frontier candidates, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output. Never run benchmarks, gates, Slurm commands, evo run, or ad-hoc verification scripts for report requests.
+evo_version: 0.6.0
 ---
 # Report
-Render the dashboard's scatter plot as a colored terminal block, one chart per run, sized to the current terminal.
+Report the current evo workspace from recorded state only. A report request is
+read-only, even if the user phrases it casually as "what happened?", "what got
+better?", "what should I pay attention to?", or "I just woke up".
+Do not spend compute while reporting:
+- Do not run `evo run`, `evo gate check`, benchmark commands, or project eval
+  scripts.
+- Do not run `python bench.py`, `python slurm_eval.py`, `sbatch`, `srun`,
+  `squeue`, `sacct`, or `scancel` to verify a result.
+- Do not create launcher, monitor, parsing, or analysis scripts.
+- Do not edit files.
+Use stored evo state instead: `evo report`, `evo status`, `evo tree`,
+`evo frontier`, `evo show <id>`, `evo diff <id>`, and immutable artifacts under
+`.evo/run_*/experiments/<exp>/attempts/<NNN>/`.
+For chart requests, render the dashboard's scatter plot as a colored terminal
+block, one chart per run, sized to the current terminal.
 ## What it shows
 Mirrors the web dashboard's score scatter (left rail of `evo dashboard`):
 - X = experiment creation order, Y = score
-- Dot color by status: green = committed, red = failed, purple = active, grey = pending / evaluated / discarded / pruned
-- ★ marks the current best committed experiment
+- Dot color by status: green = committed valid result, red = failed, purple = active, grey = pending / evaluated / discarded / pruned
+- ★ marks the current best valid committed-result experiment. `pruned` with `prune_kind=exhausted` can still be best; `prune_kind=invalid` and its descendants cannot.
 - Yellow ring on dots that sit on the best-path spine (root → best)
-- Yellow stair line traces cumulative-best across committed experiments
+- Yellow stair line traces cumulative-best across valid committed-result experiments
 - ○ at the baseline for experiments that have no score yet (active / pending)
 Every run in the workspace is rendered, stacked top-to-bottom, with a header line showing `run_id · target · metric`.
@@ -41,3 +59,27 @@ Flags:
 - For one-off score lookups, `evo status` or `evo show <id>` is faster.
 - For navigating the tree shape, `evo tree` is the right command.
 - For interactive exploration (click a dot, open a drawer), point the user at `evo dashboard` instead.
+## Overnight / Improvement Reports
+When the user asks what happened recently or what improved, summarize from
+recorded evo state:
+1. Run `evo status`, `evo frontier`, and `evo tree`.
+2. Use `evo show <id>` for the best node and any recent committed/evaluated
+   nodes you mention.
+3. Use `evo diff <id>` only to explain what changed in a recorded experiment.
+4. If you need benchmark details, read the existing `outcome.json`,
+   `benchmark.log`, or declared artifacts for that experiment. Treat missing
+   artifacts as "not recorded", not as permission to rerun.
+Report:
+- best current experiment and score;
+- score delta versus baseline or parent;
+- top candidates/frontier if relevant;
+- failed/evaluated nodes that need attention;
+- any caveats about gates, missing held-out checks, or tied candidates.
+If the user wants fresh validation or reruns, ask them to explicitly start a new
+optimization or evaluation command. Do not infer that from a report request.
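
The "cumulative-best across valid committed-result experiments" stair line described in this skill can be reproduced from recorded state alone. The sketch below assumes each point already carries a `valid` flag (committed, or pruned as exhausted, with no invalid lineage); the point shape and `bestStair` are hypothetical, not the dashboard's actual code.

```javascript
// Hypothetical shape: { order, score, valid }, one entry per experiment.
function bestStair(points, direction = "max") {
  const better = (a, b) => (direction === "max" ? a > b : a < b);
  const ordered = [...points].sort((a, b) => a.order - b.order);
  const stair = [];
  let best = null;
  for (const p of ordered) {
    // Invalid or unscored experiments never move the stair.
    if (!p.valid || p.score == null) continue;
    if (best === null || better(p.score, best)) best = p.score;
    stair.push({ order: p.order, best });
  }
  return stair;
}

console.log(
  bestStair([
    { order: 1, score: 0.5, valid: true },
    { order: 2, score: 0.9, valid: false }, // e.g. prune_kind=invalid
    { order: 3, score: 0.4, valid: true },
    { order: 4, score: 0.7, valid: true },
  ])
);
```

Because it only reads stored scores, a computation like this stays within the read-only rule for report requests.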

package/skills/ship/SKILL.md ADDED

@@ -0,0 +1,140 @@
+---
+name: ship
+description: Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.
+evo_version: 0.6.0
+---
+# Ship
+Turn a finished evo run into a change a maintainer would merge.
+The optimize loop leaves a tree of committed experiments. The winning worktree
+diff is not mergeable as-is: it carries debug prints, search-process churn,
+over-broad edits, and sometimes a test that was relaxed to clear a gate. Shipping
+is the step that re-derives the *minimal clean change* reproducing the winning
+behaviour, lands it the way the repo expects (PR or merge), and reports how
+mergeable it is.
+Correctness is the floor, not the goal. The score says the behaviour works; this
+skill decides whether the *diff* is fit to merge.
+## Invocation
+```bash
+/evo:ship            # ship the auto-selected winner
+/evo:ship exp_0042   # ship a specific experiment instead
+```
+## Stage 1 -- Select the winner
+Pick the experiment to ship, then confirm it with the user before touching their
+tree.
+```bash
+evo status    # current best valid score + counts
+evo report    # top valid experiments table + score chart
+```
+- The default winner is the highest-scoring valid result in the graph history,
+  not the frontier. `evo frontier` is for choosing where to branch next; it can
+  exclude an exhausted branch whose score is still the right thing to ship. An
+  explicit `exp_id` argument overrides auto-selection.
+- A shippable winner must be valid: `committed`, or `pruned` with
+  `prune_kind=exhausted`, with a commit and score, no `gate_result === false`,
+  and no invalid-pruned ancestor. Never select `discarded`, `failed`, `active`,
+  `evaluated`, legacy-pruned nodes with no `prune_kind`, `prune_kind=invalid`, or
+  descendants of invalid-pruned nodes. If no valid candidate exists, stop and
+  report why nothing is safe to ship.
+- Resolve the run's root (baseline) node, then show the cumulative change:
+  ```bash
+  evo diff <root_id> <winner_id>   # target-scoped cumulative diff, baseline -> winner
+  ```
+  For changes outside the benchmark target, diff the commits directly
+  (`git diff <baseline_commit> <winner_commit>`); each node carries `.commit`.
+- Present a one-screen summary: winner id, score baseline -> winner (delta),
+  the winning hypothesis, and a diffstat. Get a go before proceeding.
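
The validity rules in Stage 1 can be sketched as a filter over the experiment graph. The field names `prune_kind`, `gate_result`, and `commit` appear in the skill text; `id`, `parent`, `status`, and `score`, along with the helper names `isShippable` and `pickWinner`, are assumptions made for illustration only.

```javascript
// Assumed node shape: { id, parent, status, prune_kind, commit, score, gate_result }.
function isShippable(node, nodesById) {
  const statusOk =
    node.status === "committed" ||
    (node.status === "pruned" && node.prune_kind === "exhausted");
  if (!statusOk) return false; // excludes discarded, failed, active, evaluated, legacy/invalid pruned
  if (!node.commit || node.score == null) return false;
  if (node.gate_result === false) return false;
  // Walk ancestors: an invalid-pruned ancestor disqualifies the whole lineage.
  for (let p = nodesById.get(node.parent); p; p = nodesById.get(p.parent)) {
    if (p.status === "pruned" && p.prune_kind === "invalid") return false;
  }
  return true;
}

// Default winner: best-scoring valid node in the graph history, not the frontier.
function pickWinner(nodes, direction = "max") {
  const nodesById = new Map(nodes.map((n) => [n.id, n]));
  const valid = nodes.filter((n) => isShippable(n, nodesById));
  if (valid.length === 0) return null; // nothing is safe to ship
  const better = (a, b) => (direction === "max" ? a.score > b.score : a.score < b.score);
  return valid.reduce((best, n) => (better(n, best) ? n : best));
}

const graph = [
  { id: "root", parent: null, status: "committed", commit: "a1", score: 0.5 },
  { id: "e1", parent: "root", status: "pruned", prune_kind: "invalid", commit: "b2", score: 0.95 },
  { id: "e2", parent: "e1", status: "committed", commit: "c3", score: 0.99 },
  { id: "e3", parent: "root", status: "pruned", prune_kind: "exhausted", commit: "d4", score: 0.8 },
];
console.log(pickWinner(graph).id);
```

An explicit `exp_id` argument would bypass `pickWinner`, though running `isShippable` on it still seems a reasonable safeguard given the guardrails at the end of the skill.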
+## Stage 2 -- Distill to a mergeable change
+Work on a fresh branch off the user's current HEAD, not in the experiment
+worktree. Re-derive the change so it stands on its own:
+- **Scope restraint.** Keep only the files and lines the behaviour needs. Drop
+  experiment scaffolding, debug logging, commented-out attempts, and churn the
+  search introduced and then abandoned. Smaller, local diffs merge; sprawl does
+  not.
+- **Test integrity.** If the search weakened, skipped, or deleted a test to clear
+  a gate, restore it. New behaviour that changes outputs needs a test that
+  covers it. Never ship a green benchmark that rode on a loosened test -- call it
+  out instead.
+- **Mechanical cleanliness.** Match the repo's formatter and linter. No stray
+  whitespace, no reordered imports unless the repo does that.
+- **Codebase adherence.** Match surrounding naming, error handling, and structure.
+  The diff should read like the file it lands in.
+Then confirm the behaviour survived the distillation:
+```bash
+evo run <winner_id> --check    # or the project's benchmark / test command
+```
+If the distilled change no longer reproduces the winning score, do not paper over
+it -- report the gap (which part of the experiment diff was load-bearing) and let
+the user decide. Best-effort means honest about what could not be cleaned up, not
+silently shipping the raw worktree.
+## Stage 3 -- Land
+Detect how the repo expects changes to arrive:
+```bash
+git remote -v
+```
+- **Remote present** -> open a pull request. Commit the distilled change on its
+  branch, push, and `gh pr create` with the mergeability report (Stage 4) as the
+  body. Do not push or open the PR without the user's go.
+- **No remote** -> merge the distilled change into the user's working branch as a
+  single clean commit. Do not force, do not rewrite existing history.
+The landed commit message carries provenance: the winning experiment id, the
+score delta, and the one-line hypothesis. State what changed and why it is safe;
+do not narrate the search process.
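As a rough sketch of the landing decision and the provenance message described above: the helper names, the `Evo-Experiment` / `Score` / `Hypothesis` line layout, and the delta formatting are all assumptions for illustration, not a format evo prescribes.

```bash
# Sketch only: choose a landing path and build a provenance-carrying message.

# Print "pr" when at least one remote is configured, otherwise "merge".
# $1: output of `git remote` (one name per line, empty when there is none).
landing_mode() {
  if [ -n "$1" ]; then echo pr; else echo merge; fi
}

# Print a commit message: subject, blank line, then provenance lines.
# The line labels are hypothetical; match the repo's own conventions.
provenance_message() {
  local subject="$1" exp_id="$2" baseline="$3" winner="$4" hypothesis="$5"
  local delta
  # Signed difference between winner and baseline scores.
  delta="$(awk -v b="$baseline" -v w="$winner" 'BEGIN { printf "%+.4g", w - b }')"
  printf '%s\n\nEvo-Experiment: %s\nScore: %s -> %s (%s)\nHypothesis: %s\n' \
    "$subject" "$exp_id" "$baseline" "$winner" "$delta" "$hypothesis"
}

# Usage (inside the repo, after the user's go):
#   mode="$(landing_mode "$(git remote)")"
#   provenance_message "<subject>" "<winner_id>" "<baseline>" "<winner>" "<hypothesis>"
```

Neither function pushes, merges, or opens a PR, so the user's explicit go remains a separate step.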
+## Stage 4 -- Mergeability report (advisory)
+Always produce the report. It never blocks the merge -- it tells the user, and a
+future reviewer, how mergeable the change is on each axis a maintainer judges:
+- **Technique** -- what the change actually does to move the score, named
+  concretely (the algorithm, data structure, or mechanism), not the search
+  story. Distilled from the winning hypothesis: "replaced the O(n^2) dedup with a
+  hash set", not "exp_0042 improved throughput". This is what a reviewer reads
+  first.
+- **Behavioural correctness** -- score baseline -> shipped (delta); benchmark
+  status after distillation.
+- **Regression safety** -- full test suite result on the distilled change.
+- **Scope** -- files touched, diff size, whether the change stays local.
+- **Test correctness** -- explicit yes/no on whether any test was modified,
+  weakened, or removed, with detail; whether new behaviour is covered.
+- **Mechanical cleanliness** -- formatter / linter status.
+- **Codebase adherence** -- a note on style/convention fit.
+Lead with a plain-language summary: what changed and why it is safe to merge. On
+a remote repo this is the PR body. With no remote, print it and save it alongside
+the run so the user can paste it into a review later.
+## Guardrails (firm)
+Everything above is method you can adapt to the repo. These are not:
+- Never weaken, skip, or delete a test to make the change land. If the experiment
+  did, restore it and report it.
+- Never ship invalid-pruned, legacy-pruned, discarded, failed, active,
+  evaluated, gate-failed, or invalid-lineage nodes. Among pruned nodes, only
+  exhausted-pruned ones remain normal ship candidates.
+- Never push or open a PR without the user's explicit go.
+- Never rewrite or force-overwrite existing history on the user's branch.
+- Never ship the raw experiment worktree diff as-is when distillation failed --
+  report the gap instead.

package/skills/subagent/SKILL.md CHANGED Viewed

@@ -1,7 +1,7 @@
 ---
 name: subagent
 description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
-evo_version: 0.5.3
+evo_version: 0.6.0
 ---
 # Evo Subagent Protocol
@@ -234,6 +234,13 @@ evo run <exp_id>
 This runs benchmark + gate and prints the result.
+For noisy or replicated benchmarks, one `evo run` must represent the user's
+decision statistic for the idea. If the brief, the user, or `.evo/project.md`
+requires `n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation,
+use the configured grouped benchmark path so the experiment records that
+aggregate score. Do not create separate evo experiments for replicates, and do
+not report or promote an idea by its best single replicate.
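To illustrate why the aggregate matters, here is a small sketch of a median over replicate scores. It is not how evo computes grouped scores; `median` is a hypothetical helper and the sample scores are made up for the example.

```bash
# Sketch only: median of newline-separated numeric scores read from stdin.
median() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }                      # collect the sorted scores
    END {
      if (NR == 0) exit 1               # no input: signal failure
      if (NR % 2) print v[(NR + 1) / 2] # odd count: middle value
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Usage with three made-up replicate scores:
#   printf '0.70\n0.91\n0.72\n' | median
```

With those three replicates the median is 0.72, while the best single replicate is 0.91. Promoting the idea on 0.91 would overstate it.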
 In remote-backend workspaces, if a prior `evo run <exp_id>` was interrupted
 or the experiment is still `active`, run `evo run <exp_id>` again first. That
 is the recovery path: evo will try to attach to the existing remote process and
@@ -335,6 +342,30 @@ Continue if budget remains AND (last outcome was committed, OR you have a meanin
 Stop if budget exhausted, infra failure, or you've exhausted variations with no improvement.
+### 9. Report evo-side feedback when evo itself blocked you
+Use feedback for evo product/orchestration issues, not for ordinary bad
+experiments. Good triggers: confusing `evo run` behavior, remote workspace
+commands that were hard to recover from, verifier handoff problems, missing
+state that prevented you from following the brief, or a policy/gate that blocked
+the wrong thing. The command is anonymous and no-ops when telemetry is off.
+Keep it public-safe and reproducible. Do not include repo names, company names,
+file paths, commands, prompt text, raw logs, URLs, secrets, dataset names, or
+exact task examples.
+```bash
+evo telemetry feedback \
+  --exp-id <YOUR_EXP_ID> \
+  --kind orchestration \
+  --phase subagent \
+  --summary "Remote experiment recovery was unclear after the run process detached." \
+  --expected "Re-running evo run should clearly say whether it resumed or needs a fresh experiment." \
+  --actual "The subagent could not tell whether the active attempt was recoverable." \
+  --repro "Use a remote backend, interrupt a long-running experiment, then ask a subagent to continue the same exp_id." \
+  --tag remote-backend --tag recovery --tag subagent
+```
 ## Enriching traces
 Check `.evo/meta.json` for `"instrumentation_mode"` (`"sdk"` or `"inline"`) to see which style the benchmark uses -- **stay consistent with that choice across iterations; do not flip styles mid-run.**
@@ -357,7 +388,8 @@ If your experiment needs an artifact that is slow to produce and stable across s
 - Do NOT run `evo init` or `evo reset`
 - `evo discard <your_exp_id> --reason "..."` is your explicit "abandon" action — use it for any *non-committed* node you've decided not to pursue further (pre-run realization, evaluated with a bad hypothesis, or unfixable infra failure). Discard deletes the worktree and branch; the node and its per-attempt artifacts stay in `.evo/` as a record of what was tried.
-- If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune`.
+- If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune --exhausted` or `evo prune --invalid`.
+- If you discover a committed result is invalid (bad score calculation, benchmark gaming, weakened/deleted tests, failed inherited gate, or failed repro), report the exact node and evidence. Do not discard it; the orchestrator should mark it with `evo prune <id> --invalid --reason "..."` and add `--yes` only if evo says the node is on the current best spine.
 - If `evo discard` errors with **"cannot discard active node ... pass --force"** — the run is still in flight. Wait for it to finish; don't `--force` unless you know what you're doing (the running process can still write a final outcome that contradicts the discard).
 - If `evo discard` errors with **"cannot discard ... has non-discarded children"** — sibling/child experiments depend on this node's parent reference. Discard or commit-and-prune those first.
 - Do NOT copy `.env` files, bake secrets into source, or hard-code local runtime paths. Runtime setup/env is configured by the orchestrator (`evo config runtime ...`, `evo env ...`) and injected into benchmark/gate processes. If a missing dependency, setup step, or key blocks evaluation, report setup failure.