npm - @evo-hq/pi-evo - Versions diffs - 0.6.0-alpha.1 → 0.6.0 - Mend

@evo-hq/pi-evo 0.6.0-alpha.1 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7) hide show

package/package.json +1 -1
package/skills/discover/SKILL.md +5 -5
package/skills/infra-setup/SKILL.md +1 -1
package/skills/optimize/SKILL.md +64 -5
package/skills/report/SKILL.md +45 -3
package/skills/ship/SKILL.md +1 -1
package/skills/subagent/SKILL.md +9 -1

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@evo-hq/pi-evo",
-  "version": "0.6.0-alpha.1",
+  "version": "0.6.0",
   "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
   "publishConfig": {
     "access": "public"

package/skills/discover/SKILL.md CHANGED Viewed

@@ -2,7 +2,7 @@
 name: discover
 description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
 argument-hint: <optional context about what to optimize>
-evo_version: 0.6.0-alpha.1
+evo_version: 0.6.0
 ---
 # Discover
@@ -119,20 +119,20 @@ evo --version
 The output must be exactly:
 ```
-evo-hq-cli 0.6.0-alpha.1
+evo-hq-cli 0.6.0
 ```
 Three outcomes:
 1. **Matches exactly** — continue to step 1.
 2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
-   > Your installed evo CLI is on a different version than this skill (`0.6.0-alpha.1`). Run:
+   > Your installed evo CLI is on a different version than this skill (`0.6.0`). Run:
    > ```
-   > uv tool install --force evo-hq-cli==0.6.0-alpha.1
+   > uv tool install --force evo-hq-cli==0.6.0
    > ```
    > Then re-invoke this skill.
 3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
-   > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0-alpha.1` (or `pipx install evo-hq-cli==0.6.0-alpha.1`). Then re-invoke this skill.
+   > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0` (or `pipx install evo-hq-cli==0.6.0`). Then re-invoke this skill.
 Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.

package/skills/infra-setup/SKILL.md CHANGED Viewed

@@ -1,7 +1,7 @@
 ---
 name: infra-setup
 description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
-evo_version: 0.6.0-alpha.1
+evo_version: 0.6.0
 ---
 # Infra Setup

package/skills/optimize/SKILL.md CHANGED Viewed

@@ -1,14 +1,51 @@
 ---
 name: optimize
-description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
+description: Drive structured autoresearch iteration after evo:discover and the baseline commit. Use when the user invokes /evo:optimize or asks to try ideas, try variants, run experiments, use available GPUs, improve the current best/frontier, continue an evo search, or compare candidate changes in an evo workspace. The orchestrator plans and spawns optimization subagents; candidate edits/runs belong to those subagents. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
 argument-hint: "[subagents=N] [budget=N] [stall=N]"
-evo_version: 0.6.0-alpha.1
+evo_version: 0.6.0
 ---
 Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
 **This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.
+**Plain-language trigger.** In an initialized evo workspace, casual user wording
+like "try a couple ideas", "try different variants", "use the available GPUs",
+"continue from the current best", or "see what improves" is an optimize request
+unless the user explicitly asks for a read-only report. Do not treat the lack of
+a slash command as permission to bypass this protocol. Loading this skill from
+plain-language wording is also explicit authorization to use the host's
+subagent mechanism for the resolved round width; the user does not have to say
+"spawn subagents" or "parallel agents" separately.
+**Candidate-work delegation invariant.** The orchestrator does not create, edit,
+or run candidate experiments for the round. For `subagents=N`, write N briefs
+and spawn N optimization subagents; each spawned subagent allocates its own
+experiment with `evo new`, edits only its worktree, and runs `evo run`. Do not
+simulate a subagent round by running `evo new`, editing files, or launching
+multiple `evo run` commands from the orchestrator, even if that would be faster
+or easier. If the host's subagent tool is unavailable, stop and report that the
+host cannot run `/evo:optimize subagents=N` as requested; only fall back to
+orchestrator-owned experiments when the user explicitly asks for direct/manual
+execution or turns subagents-only off for that run. Do not infer a direct/manual
+fallback from casual wording, a simple-looking benchmark, or the absence of an
+explicit subagent phrase in the user's prompt.
+**Resource-cap invariant.** `subagents=N` is live concurrency, not total ideas.
+Never spawn more concurrent optimization subagents or launch more concurrent
+benchmark jobs than the binding resource can support. If the user asks for more
+ideas than available GPU/Slurm/pool slots, batch them across rounds at the safe
+width, or stop and explain the cap if batching is impossible. Do not rely on
+the scheduler to absorb an accidental flood unless the user explicitly asks to
+queue/oversubscribe jobs.
+**Bounded-run stop rule.** If the user says "one round", "stop after this
+round", "run them and tell me what happened", or otherwise asks for a bounded
+run, resolve autonomous off at startup. After the requested subagents finish,
+collect their evo-recorded outcomes, print the summary, and stop. Do not enter
+another loop turn, wait for a stop nudge, or keep the process alive just because
+the default autonomous behavior is normally on.
 ## Evo surface -- loop-relevant
 You're inside `/evo:optimize`. Things you'll pull/dispatch during the loop:
@@ -68,7 +105,7 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
 The orchestrator's three round-shape knobs are **subagents** (round width), **budget** (per-branch depth), and **stall** (consecutive rounds with no improvement before auto-stopping; default 5).
-A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over what's below.
+A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value wins over what's below, subject to the resource-cap invariant above.
 **Picking `subagents` and `budget` is load-bearing -- do not skim.**
@@ -145,6 +182,13 @@ files or the attempt should be treated as `remote_infra_failure`.
 **Runtime recipe/env.** Benchmark runtime is evo configuration, not something subagents should rediscover or copy into worktrees. Use `evo config runtime show` for prepare/before-run/prefix and `evo env show` for redacted env sources. If a run fails because expected runtime setup or env is missing, report it as setup failure or configure it from the orchestrator; do not patch benchmark code to bake in secrets or local paths. Use `evo run <exp_id> --check` for non-committing wiring validation; do not invent ad-hoc validation wrappers.
+**Replicated/noisy benchmarks.** If the user or `.evo/project.md` says an idea
+must pass `n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation,
+configure the benchmark so each `evo run` records the grouped aggregate as that
+experiment's score. Do not represent replicates as independent evo experiments
+and do not judge, report, or promote an idea by its best replicate. The frontier
+and best path must see the same aggregate statistic the user uses for decisions.
 **CLI reference.** If you are unsure which command to use, read `plugins/evo/skills/references/cli-quick-reference.md`. It is the canonical command map; this skill only repeats the high-frequency commands.
 ## Prerequisites
@@ -296,7 +340,15 @@ Spawn all subagents in a **single batch** using your host's parallel-subagent to
 Per host, the spawn shape matters because evo's loop depends on *completion notifications* arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):
 - **claude-code** — fire one `Bash(run_in_background=true)` call per brief. The bash invokes the subagent (the host's `Task` tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a `<task-notification>` at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
-- **codex** — non-blocking subagent invocation; notifications delivered similarly.
+- **codex** — call `spawn_agent` once per optimization brief. If `spawn_agent`
+  is deferred or not visible yet, first use Codex's tool-discovery tool
+  (`tool_search`) with a query like `spawn agent subagent`, then call the
+  discovered spawn tool. The optimize skill's plain-language trigger is
+  sufficient authorization for this; do not stop or fall back to direct edits
+  merely because the user did not explicitly write "spawn subagents." The
+  candidate worker prompt is the one that starts with the mandatory "First,
+  load and follow..." sentence below. A later read-only scan subagent does not
+  count toward the round width.
 - **hermes** — `terminal(background=true)`; notifications delivered similarly.
 - **openclaw** — `sessions_spawn deliver:false`; notifications delivered similarly.
 - **opencode** — *batch-parallel only* (no background notifications). Fire N `task` calls in ONE assistant message; all `tool_result`s return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
@@ -318,6 +370,11 @@ Then append:
 The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.
+Before leaving step 5, check yourself: the number of optimization subagents you
+spawned must equal the resolved `subagents=N` unless the user explicitly
+requested a smaller direct/manual run. Scan/analysis subagents are separate and
+do not count toward this number.
 ### 6. Collect results and update state
 After all subagents complete:
@@ -477,10 +534,12 @@ Keep the report public-safe but actionable enough for the evo backend team to
 reproduce the case. Include the phase, what you expected evo to do, what it did,
 and a generic repro shape. Do not include repo names, company names, file paths,
 commands, prompt text, raw logs, URLs, secrets, dataset names, or exact task
-examples.
+examples. If the issue is tied to a specific evo experiment, include its
+anonymous experiment id with `--exp-id`.
 ```bash
 evo telemetry feedback \
+  --exp-id exp_0007 \
   --kind orchestration \
   --phase optimize \
   --summary "Subagent completed but the orchestrator did not collect its evaluated result before writing the next round briefs." \

package/skills/report/SKILL.md CHANGED Viewed

@@ -1,12 +1,30 @@
 ---
 name: report
-description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
-evo_version: 0.6.0-alpha.1
+description: Read-only evo run reporting. Use when the user invokes /evo:report, asks what happened overnight, asks what improved recently, asks for the best/frontier candidates, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output. Never run benchmarks, gates, Slurm commands, evo run, or ad-hoc verification scripts for report requests.
+evo_version: 0.6.0
 ---
 # Report
-Render the dashboard's scatter plot as a colored terminal block, one chart per run, sized to the current terminal.
+Report the current evo workspace from recorded state only. A report request is
+read-only, even if the user phrases it casually as "what happened?", "what got
+better?", "what should I pay attention to?", or "I just woke up".
+Do not spend compute while reporting:
+- Do not run `evo run`, `evo gate check`, benchmark commands, or project eval
+  scripts.
+- Do not run `python bench.py`, `python slurm_eval.py`, `sbatch`, `srun`,
+  `squeue`, `sacct`, or `scancel` to verify a result.
+- Do not create launcher, monitor, parsing, or analysis scripts.
+- Do not edit files.
+Use stored evo state instead: `evo report`, `evo status`, `evo tree`,
+`evo frontier`, `evo show <id>`, `evo diff <id>`, and immutable artifacts under
+`.evo/run_*/experiments/<exp>/attempts/<NNN>/`.
+For chart requests, render the dashboard's scatter plot as a colored terminal
+block, one chart per run, sized to the current terminal.
 ## What it shows
@@ -41,3 +59,27 @@ Flags:
 - For one-off score lookups, `evo status` or `evo show <id>` is faster.
 - For navigating the tree shape, `evo tree` is the right command.
 - For interactive exploration (click a dot, open a drawer), point the user at `evo dashboard` instead.
+## Overnight / Improvement Reports
+When the user asks what happened recently or what improved, summarize from
+recorded evo state:
+1. Run `evo status`, `evo frontier`, and `evo tree`.
+2. Use `evo show <id>` for the best node and any recent committed/evaluated
+   nodes you mention.
+3. Use `evo diff <id>` only to explain what changed in a recorded experiment.
+4. If you need benchmark details, read the existing `outcome.json`,
+   `benchmark.log`, or declared artifacts for that experiment. Treat missing
+   artifacts as "not recorded", not as permission to rerun.
+Report:
+- best current experiment and score;
+- score delta versus baseline or parent;
+- top candidates/frontier if relevant;
+- failed/evaluated nodes that need attention;
+- any caveats about gates, missing held-out checks, or tied candidates.
+If the user wants fresh validation or reruns, ask them to explicitly start a new
+optimization or evaluation command. Do not infer that from a report request.

package/skills/ship/SKILL.md CHANGED Viewed

@@ -1,7 +1,7 @@
 ---
 name: ship
 description: Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.
-evo_version: 0.6.0-alpha.1
+evo_version: 0.6.0
 ---
 # Ship

package/skills/subagent/SKILL.md CHANGED Viewed

@@ -1,7 +1,7 @@
 ---
 name: subagent
 description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
-evo_version: 0.6.0-alpha.1
+evo_version: 0.6.0
 ---
 # Evo Subagent Protocol
@@ -234,6 +234,13 @@ evo run <exp_id>
 This runs benchmark + gate and prints the result.
+For noisy or replicated benchmarks, one `evo run` must represent the user's
+decision statistic for the idea. If the brief, user, or `.evo/project.md` says
+`n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation is required,
+use the configured grouped benchmark path so the experiment records that
+aggregate score. Do not create separate evo experiments for replicates, and do
+not report or promote an idea by its best single replicate.
 In remote-backend workspaces, if a prior `evo run <exp_id>` was interrupted
 or the experiment is still `active`, run `evo run <exp_id>` again first. That
 is the recovery path: evo will try to attach to the existing remote process and
@@ -349,6 +356,7 @@ exact task examples.
 ```bash
 evo telemetry feedback \
+  --exp-id <YOUR_EXP_ID> \
   --kind orchestration \
   --phase subagent \
   --summary "Remote experiment recovery was unclear after the run process detached." \