@evo-hq/pi-evo 0.6.0-alpha.1 → 0.6.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@evo-hq/pi-evo",
- "version": "0.6.0-alpha.1",
+ "version": "0.6.0",
  "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
  "publishConfig": {
  "access": "public"
@@ -2,7 +2,7 @@
  name: discover
  description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
  argument-hint: <optional context about what to optimize>
- evo_version: 0.6.0-alpha.1
+ evo_version: 0.6.0
  ---

  # Discover
@@ -119,20 +119,20 @@ evo --version
  The output must be exactly:

  ```
- evo-hq-cli 0.6.0-alpha.1
+ evo-hq-cli 0.6.0
  ```

  Three outcomes:

  1. **Matches exactly** — continue to step 1.
  2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
- > Your installed evo CLI is on a different version than this skill (`0.6.0-alpha.1`). Run:
+ > Your installed evo CLI is on a different version than this skill (`0.6.0`). Run:
  > ```
- > uv tool install --force evo-hq-cli==0.6.0-alpha.1
+ > uv tool install --force evo-hq-cli==0.6.0
  > ```
  > Then re-invoke this skill.
  3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
- > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0-alpha.1` (or `pipx install evo-hq-cli==0.6.0-alpha.1`). Then re-invoke this skill.
+ > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0` (or `pipx install evo-hq-cli==0.6.0`). Then re-invoke this skill.

  Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
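Reviewer sketch (not part of the package): the three outcomes in this hunk can be written as a small pure function. It assumes the caller has already captured the text printed by `evo --version` and passes `None` when the shell reported `command not found`. The function name, the outcome labels, and the trimming of surrounding whitespace are illustrative choices, not something the skill defines.

```python
from typing import Optional

PACKAGE = "evo-hq-cli"
EXPECTED_VERSION = "0.6.0"  # the evo_version pinned by the skill in this release


def classify_version_output(output: Optional[str],
                            expected: str = EXPECTED_VERSION) -> str:
    """Map captured `evo --version` output to "match", "drift", or "missing".

    `output` is None when the command could not be found at all.
    """
    if output is None:
        return "missing"
    line = output.strip()  # assumption: a trailing newline is not a mismatch
    if line == f"{PACKAGE} {expected}":
        return "match"      # outcome 1: continue
    if line.startswith(f"{PACKAGE} "):
        return "drift"      # outcome 2: right package, wrong version
    return "missing"        # outcome 3: some other package named `evo`


if __name__ == "__main__":
    for sample in ("evo-hq-cli 0.6.0", "evo-hq-cli 0.4.2", "evo 1.26.0", None):
        print(repr(sample), "->", classify_version_output(sample))
```

Only the classification is sketched here. In line with the hunk's last paragraph, the sketch never attempts an install; on "drift" or "missing" the caller would stop and show the user the install command.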
 
@@ -1,7 +1,7 @@
  ---
  name: infra-setup
  description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
- evo_version: 0.6.0-alpha.1
+ evo_version: 0.6.0
  ---

  # Infra Setup
@@ -1,14 +1,51 @@
  ---
  name: optimize
- description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
+ description: Drive structured autoresearch iteration after evo:discover and the baseline commit. Use when the user invokes /evo:optimize or asks to try ideas, try variants, run experiments, use available GPUs, improve the current best/frontier, continue an evo search, or compare candidate changes in an evo workspace. The orchestrator plans and spawns optimization subagents; candidate edits/runs belong to those subagents. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
  argument-hint: "[subagents=N] [budget=N] [stall=N]"
- evo_version: 0.6.0-alpha.1
+ evo_version: 0.6.0
  ---

  Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.

  **This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.

+ **Plain-language trigger.** In an initialized evo workspace, casual user wording
+ like "try a couple ideas", "try different variants", "use the available GPUs",
+ "continue from the current best", or "see what improves" is an optimize request
+ unless the user explicitly asks for a read-only report. Do not treat the lack of
+ a slash command as permission to bypass this protocol. Loading this skill from
+ plain-language wording is also explicit authorization to use the host's
+ subagent mechanism for the resolved round width; the user does not have to say
+ "spawn subagents" or "parallel agents" separately.
+
+ **Candidate-work delegation invariant.** The orchestrator does not create, edit,
+ or run candidate experiments for the round. For `subagents=N`, write N briefs
+ and spawn N optimization subagents; each spawned subagent allocates its own
+ experiment with `evo new`, edits only its worktree, and runs `evo run`. Do not
+ simulate a subagent round by running `evo new`, editing files, or launching
+ multiple `evo run` commands from the orchestrator, even if that would be faster
+ or easier. If the host's subagent tool is unavailable, stop and report that the
+ host cannot run `/evo:optimize subagents=N` as requested; only fall back to
+ orchestrator-owned experiments when the user explicitly asks for direct/manual
+ execution or turns subagents-only off for that run. Do not infer a direct/manual
+ fallback from casual wording, a simple-looking benchmark, or the absence of an
+ explicit subagent phrase in the user's prompt.
+
+ **Resource-cap invariant.** `subagents=N` is live concurrency, not total ideas.
+ Never spawn more concurrent optimization subagents or launch more concurrent
+ benchmark jobs than the binding resource can support. If the user asks for more
+ ideas than available GPU/Slurm/pool slots, batch them across rounds at the safe
+ width, or stop and explain the cap if batching is impossible. Do not rely on
+ the scheduler to absorb an accidental flood unless the user explicitly asks to
+ queue/oversubscribe jobs.
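Reviewer sketch (not part of the package): one way to read the resource-cap invariant is as a batching rule. The helper below is hypothetical; `free_slots` stands for however many GPU/Slurm/pool slots the binding resource can support, which the skill leaves to the orchestrator to determine. It clamps the requested width to that cap and spreads the ideas across rounds, raising when no safe width exists.

```python
from typing import List, Sequence


def plan_rounds(ideas: Sequence[str], requested_width: int,
                free_slots: int) -> List[List[str]]:
    """Split `ideas` into rounds whose size never exceeds the resource cap."""
    if requested_width < 1:
        raise ValueError("subagents=N must be at least 1")
    # Live concurrency is bounded by the binding resource, not by the request.
    safe_width = min(requested_width, free_slots)
    if safe_width < 1:
        # Batching is impossible: stop and explain the cap instead of flooding.
        raise RuntimeError("no free slots; cannot run any subagent safely")
    return [list(ideas[i:i + safe_width])
            for i in range(0, len(ideas), safe_width)]


if __name__ == "__main__":
    # Five ideas, subagents=4 requested, but only two slots are free.
    for round_no, batch in enumerate(plan_rounds(list("abcde"), 4, 2), start=1):
        print(round_no, batch)
```

With five ideas, `subagents=4`, and two free slots, the plan is three rounds of sizes 2, 2, and 1, so the request for four concurrent subagents is narrowed rather than handed to the scheduler.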
+
+ **Bounded-run stop rule.** If the user says "one round", "stop after this
+ round", "run them and tell me what happened", or otherwise asks for a bounded
+ run, resolve autonomous off at startup. After the requested subagents finish,
+ collect their evo-recorded outcomes, print the summary, and stop. Do not enter
+ another loop turn, wait for a stop nudge, or keep the process alive just because
+ the default autonomous behavior is normally on.
+
  ## Evo surface -- loop-relevant

  You're inside `/evo:optimize`. Things you'll pull/dispatch during the loop:
@@ -68,7 +105,7 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
 
  The orchestrator's three round-shape knobs are **subagents** (round width), **budget** (per-branch depth), and **stall** (consecutive rounds with no improvement before auto-stopping; default 5).

- A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over what's below.
+ A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value wins over what's below, subject to the resource-cap invariant above.

  **Picking `subagents` and `budget` is load-bearing -- do not skim.**
 
@@ -145,6 +182,13 @@ files or the attempt should be treated as `remote_infra_failure`.
 
  **Runtime recipe/env.** Benchmark runtime is evo configuration, not something subagents should rediscover or copy into worktrees. Use `evo config runtime show` for prepare/before-run/prefix and `evo env show` for redacted env sources. If a run fails because expected runtime setup or env is missing, report it as setup failure or configure it from the orchestrator; do not patch benchmark code to bake in secrets or local paths. Use `evo run <exp_id> --check` for non-committing wiring validation; do not invent ad-hoc validation wrappers.

+ **Replicated/noisy benchmarks.** If the user or `.evo/project.md` says an idea
+ must pass `n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation,
+ configure the benchmark so each `evo run` records the grouped aggregate as that
+ experiment's score. Do not represent replicates as independent evo experiments
+ and do not judge, report, or promote an idea by its best replicate. The frontier
+ and best path must see the same aggregate statistic the user uses for decisions.
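Reviewer sketch (not part of the package): the paragraph added here separates the aggregate a decision is based on from the best single replicate. The helper below illustrates that distinction with made-up scores. How a benchmark actually hands its score to evo is not shown in this diff, so `aggregate_score` is a hypothetical function that a benchmark script might call before emitting one number per `evo run`.

```python
import statistics
from typing import Sequence


def aggregate_score(replicates: Sequence[float], statistic: str = "median") -> float:
    """Reduce replicate scores to the single decision statistic for an idea."""
    if not replicates:
        raise ValueError("need at least one replicate")
    if statistic == "median":
        return float(statistics.median(replicates))
    if statistic == "mean":
        return statistics.fmean(replicates)
    raise ValueError(f"unsupported statistic: {statistic!r}")


if __name__ == "__main__":
    scores = [0.71, 0.64, 0.90]  # illustrative n=3 replicate scores
    print("recorded score (median):", aggregate_score(scores))
    print("best replicate (do not report):", max(scores))
```

For these illustrative scores the median is 0.71 while the best replicate is 0.90. Recording 0.90 would overstate the idea on the frontier, which is the failure this paragraph guards against.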
+
  **CLI reference.** If you are unsure which command to use, read `plugins/evo/skills/references/cli-quick-reference.md`. It is the canonical command map; this skill only repeats the high-frequency commands.

  ## Prerequisites
@@ -296,7 +340,15 @@ Spawn all subagents in a **single batch** using your host's parallel-subagent to
  Per host, the spawn shape matters because evo's loop depends on *completion notifications* arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):

  - **claude-code** — fire one `Bash(run_in_background=true)` call per brief. The bash invokes the subagent (the host's `Task` tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a `<task-notification>` at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
- - **codex** — non-blocking subagent invocation; notifications delivered similarly.
+ - **codex** — call `spawn_agent` once per optimization brief. If `spawn_agent`
+ is deferred or not visible yet, first use Codex's tool-discovery tool
+ (`tool_search`) with a query like `spawn agent subagent`, then call the
+ discovered spawn tool. The optimize skill's plain-language trigger is
+ sufficient authorization for this; do not stop or fall back to direct edits
+ merely because the user did not explicitly write "spawn subagents." The
+ candidate worker prompt is the one that starts with the mandatory "First,
+ load and follow..." sentence below. A later read-only scan subagent does not
+ count toward the round width.
  - **hermes** — `terminal(background=true)`; notifications delivered similarly.
  - **openclaw** — `sessions_spawn deliver:false`; notifications delivered similarly.
  - **opencode** — *batch-parallel only* (no background notifications). Fire N `task` calls in ONE assistant message; all `tool_result`s return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
@@ -318,6 +370,11 @@ Then append:
 
  The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.

+ Before leaving step 5, check yourself: the number of optimization subagents you
+ spawned must equal the resolved `subagents=N` unless the user explicitly
+ requested a smaller direct/manual run. Scan/analysis subagents are separate and
+ do not count toward this number.
+
  ### 6. Collect results and update state

  After all subagents complete:
@@ -477,10 +534,12 @@ Keep the report public-safe but actionable enough for the evo backend team to
  reproduce the case. Include the phase, what you expected evo to do, what it did,
  and a generic repro shape. Do not include repo names, company names, file paths,
  commands, prompt text, raw logs, URLs, secrets, dataset names, or exact task
- examples.
+ examples. If the issue is tied to a specific evo experiment, include its
+ anonymous experiment id with `--exp-id`.

  ```bash
  evo telemetry feedback \
+ --exp-id exp_0007 \
  --kind orchestration \
  --phase optimize \
  --summary "Subagent completed but the orchestrator did not collect its evaluated result before writing the next round briefs." \
@@ -1,12 +1,30 @@
  ---
  name: report
- description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
- evo_version: 0.6.0-alpha.1
+ description: Read-only evo run reporting. Use when the user invokes /evo:report, asks what happened overnight, asks what improved recently, asks for the best/frontier candidates, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output. Never run benchmarks, gates, Slurm commands, evo run, or ad-hoc verification scripts for report requests.
+ evo_version: 0.6.0
  ---

  # Report

- Render the dashboard's scatter plot as a colored terminal block, one chart per run, sized to the current terminal.
+ Report the current evo workspace from recorded state only. A report request is
+ read-only, even if the user phrases it casually as "what happened?", "what got
+ better?", "what should I pay attention to?", or "I just woke up".
+
+ Do not spend compute while reporting:
+
+ - Do not run `evo run`, `evo gate check`, benchmark commands, or project eval
+ scripts.
+ - Do not run `python bench.py`, `python slurm_eval.py`, `sbatch`, `srun`,
+ `squeue`, `sacct`, or `scancel` to verify a result.
+ - Do not create launcher, monitor, parsing, or analysis scripts.
+ - Do not edit files.
+
+ Use stored evo state instead: `evo report`, `evo status`, `evo tree`,
+ `evo frontier`, `evo show <id>`, `evo diff <id>`, and immutable artifacts under
+ `.evo/run_*/experiments/<exp>/attempts/<NNN>/`.
+
+ For chart requests, render the dashboard's scatter plot as a colored terminal
+ block, one chart per run, sized to the current terminal.

  ## What it shows
 
@@ -41,3 +59,27 @@ Flags:
  - For one-off score lookups, `evo status` or `evo show <id>` is faster.
  - For navigating the tree shape, `evo tree` is the right command.
  - For interactive exploration (click a dot, open a drawer), point the user at `evo dashboard` instead.
+
+ ## Overnight / Improvement Reports
+
+ When the user asks what happened recently or what improved, summarize from
+ recorded evo state:
+
+ 1. Run `evo status`, `evo frontier`, and `evo tree`.
+ 2. Use `evo show <id>` for the best node and any recent committed/evaluated
+ nodes you mention.
+ 3. Use `evo diff <id>` only to explain what changed in a recorded experiment.
+ 4. If you need benchmark details, read the existing `outcome.json`,
+ `benchmark.log`, or declared artifacts for that experiment. Treat missing
+ artifacts as "not recorded", not as permission to rerun.
+
+ Report:
+
+ - best current experiment and score;
+ - score delta versus baseline or parent;
+ - top candidates/frontier if relevant;
+ - failed/evaluated nodes that need attention;
+ - any caveats about gates, missing held-out checks, or tied candidates.
+
+ If the user wants fresh validation or reruns, ask them to explicitly start a new
+ optimization or evaluation command. Do not infer that from a report request.
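Reviewer sketch (not part of the package): the read-only rule in the report skill amounts to an allowlist over shell commands. The filter below is a hypothetical guard, not something evo ships. It accepts only the `evo` subcommands the skill names as stored-state readers and rejects everything else, including the Slurm and benchmark commands the skill forbids. Reading existing artifact files, which the skill also permits, is outside its scope.

```python
import shlex

# Subcommands the report skill lists as reading stored evo state.
READ_ONLY_EVO_SUBCOMMANDS = {"report", "status", "tree", "frontier", "show", "diff"}


def is_report_safe(command: str) -> bool:
    """Return True only for `evo <subcommand>` calls on the read-only allowlist."""
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "evo":
        return False  # anything that is not an evo call is rejected outright
    return argv[1] in READ_ONLY_EVO_SUBCOMMANDS


if __name__ == "__main__":
    for cmd in ("evo status", "evo show exp_0007", "evo run exp_0007",
                "evo gate check", "sbatch job.sh", "python bench.py"):
        print(f"{cmd!r}: {'allow' if is_report_safe(cmd) else 'deny'}")
```

An allowlist is used instead of a denylist so that any command the skill does not name is denied by default, in keeping with the instruction not to infer permission to rerun from a report request.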
@@ -1,7 +1,7 @@
  ---
  name: ship
  description: Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.
- evo_version: 0.6.0-alpha.1
+ evo_version: 0.6.0
  ---

  # Ship
@@ -1,7 +1,7 @@
  ---
  name: subagent
  description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
- evo_version: 0.6.0-alpha.1
+ evo_version: 0.6.0
  ---

  # Evo Subagent Protocol
@@ -234,6 +234,13 @@ evo run <exp_id>
 
  This runs benchmark + gate and prints the result.

+ For noisy or replicated benchmarks, one `evo run` must represent the user's
+ decision statistic for the idea. If the brief, user, or `.evo/project.md` says
+ `n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation is required,
+ use the configured grouped benchmark path so the experiment records that
+ aggregate score. Do not create separate evo experiments for replicates, and do
+ not report or promote an idea by its best single replicate.
+
  In remote-backend workspaces, if a prior `evo run <exp_id>` was interrupted
  or the experiment is still `active`, run `evo run <exp_id>` again first. That
  is the recovery path: evo will try to attach to the existing remote process and
@@ -349,6 +356,7 @@ exact task examples.
 
  ```bash
  evo telemetry feedback \
+ --exp-id <YOUR_EXP_ID> \
  --kind orchestration \
  --phase subagent \
  --summary "Remote experiment recovery was unclear after the run process detached." \