@evo-hq/pi-evo 0.6.0-alpha.1 → 0.6.1-alpha.1
package/package.json
CHANGED
package/skills/discover/SKILL.md
CHANGED
@@ -2,7 +2,7 @@
 name: discover
 description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
 argument-hint: <optional context about what to optimize>
-evo_version: 0.6.
+evo_version: 0.6.1-alpha.1
 ---
 
 # Discover
@@ -119,20 +119,20 @@ evo --version
 The output must be exactly:
 
 ```
-evo-hq-cli 0.6.
+evo-hq-cli 0.6.1-alpha.1
 ```
 
 Three outcomes:
 
 1. **Matches exactly** — continue to step 1.
 2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
-> Your installed evo CLI is on a different version than this skill (`0.6.
+> Your installed evo CLI is on a different version than this skill (`0.6.1-alpha.1`). Run:
 > ```
-> uv tool install --force evo-hq-cli==0.6.
+> uv tool install --force evo-hq-cli==0.6.1-alpha.1
 > ```
 > Then re-invoke this skill.
 3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
-> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.
+> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.1-alpha.1` (or `pipx install evo-hq-cli==0.6.1-alpha.1`). Then re-invoke this skill.
 
 Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
 
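The three-outcome version check in the hunk above can be sketched as a small classifier. This is an illustrative sketch only: the function name `classify_version_output` and the outcome labels are hypothetical and not part of evo, and stripping surrounding whitespace before comparing is an assumption about how "exactly" would be applied in practice.

```python
import re

# Pinned string the skill expects from `evo --version` (taken from the diff).
EXPECTED = "evo-hq-cli 0.6.1-alpha.1"


def classify_version_output(output, expected=EXPECTED):
    """Classify `evo --version` output into one of the skill's three outcomes.

    `output` is the captured stdout, or None when the command was not found.
    The labels returned here are hypothetical names for illustration.
    """
    if output is None:
        return "not_installed"          # outcome 3: command not found
    line = output.strip()               # assumption: ignore a trailing newline
    if line == expected:
        return "match"                  # outcome 1: continue
    if re.fullmatch(r"evo-hq-cli \S+", line):
        return "version_drift"          # outcome 2: same package, other version
    return "not_installed"              # outcome 3: a different package


if __name__ == "__main__":
    for sample in ("evo-hq-cli 0.6.1-alpha.1\n", "evo-hq-cli 0.4.2", "evo 1.2.3", None):
        print(repr(sample), "->", classify_version_output(sample))
```

Both non-matching outcomes end with a message to the user rather than an install attempt, which matches the skill's rule against auto-installing.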
package/skills/optimize/SKILL.md
CHANGED
@@ -1,14 +1,51 @@
 ---
 name: optimize
-description: Drive structured autoresearch iteration after evo:discover and the baseline commit
+description: Drive structured autoresearch iteration after evo:discover and the baseline commit. Use when the user invokes /evo:optimize or asks to try ideas, try variants, run experiments, use available GPUs, improve the current best/frontier, continue an evo search, or compare candidate changes in an evo workspace. The orchestrator plans and spawns optimization subagents; candidate edits/runs belong to those subagents. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
 argument-hint: "[subagents=N] [budget=N] [stall=N]"
-evo_version: 0.6.
+evo_version: 0.6.1-alpha.1
 ---
 
 Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
 
 **This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.
 
+**Plain-language trigger.** In an initialized evo workspace, casual user wording
+like "try a couple ideas", "try different variants", "use the available GPUs",
+"continue from the current best", or "see what improves" is an optimize request
+unless the user explicitly asks for a read-only report. Do not treat the lack of
+a slash command as permission to bypass this protocol. Loading this skill from
+plain-language wording is also explicit authorization to use the host's
+subagent mechanism for the resolved round width; the user does not have to say
+"spawn subagents" or "parallel agents" separately.
+
+**Candidate-work delegation invariant.** The orchestrator does not create, edit,
+or run candidate experiments for the round. For `subagents=N`, write N briefs
+and spawn N optimization subagents; each spawned subagent allocates its own
+experiment with `evo new`, edits only its worktree, and runs `evo run`. Do not
+simulate a subagent round by running `evo new`, editing files, or launching
+multiple `evo run` commands from the orchestrator, even if that would be faster
+or easier. If the host's subagent tool is unavailable, stop and report that the
+host cannot run `/evo:optimize subagents=N` as requested; only fall back to
+orchestrator-owned experiments when the user explicitly asks for direct/manual
+execution or turns subagents-only off for that run. Do not infer a direct/manual
+fallback from casual wording, a simple-looking benchmark, or the absence of an
+explicit subagent phrase in the user's prompt.
+
+**Resource-cap invariant.** `subagents=N` is live concurrency, not total ideas.
+Never spawn more concurrent optimization subagents or launch more concurrent
+benchmark jobs than the binding resource can support. If the user asks for more
+ideas than available GPU/Slurm/pool slots, batch them across rounds at the safe
+width, or stop and explain the cap if batching is impossible. Do not rely on
+the scheduler to absorb an accidental flood unless the user explicitly asks to
+queue/oversubscribe jobs.
+
+**Bounded-run stop rule.** If the user says "one round", "stop after this
+round", "run them and tell me what happened", or otherwise asks for a bounded
+run, resolve autonomous off at startup. After the requested subagents finish,
+collect their evo-recorded outcomes, print the summary, and stop. Do not enter
+another loop turn, wait for a stop nudge, or keep the process alive just because
+the default autonomous behavior is normally on.
+
 ## Evo surface -- loop-relevant
 
 You're inside `/evo:optimize`. Things you'll pull/dispatch during the loop:
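The resource-cap invariant added in the hunk above (batch ideas across rounds at the safe width rather than flooding the scheduler) can be illustrated with a short sketch. The helper `plan_rounds` and its parameters are hypothetical names chosen for illustration; evo does not necessarily expose anything like this, and how the slot count is discovered is left out entirely.

```python
def plan_rounds(ideas, requested_width, available_slots):
    """Split ideas into rounds whose width never exceeds the binding resource.

    requested_width stands in for the user's subagents=N; available_slots is
    the number of GPU/Slurm/pool slots that can run concurrently. Both names
    are assumptions made for this sketch.
    """
    if requested_width < 1:
        raise ValueError("requested_width must be at least 1")
    if available_slots < 1:
        # Batching is impossible: the caller should stop and explain the cap.
        raise ValueError("no capacity available; explain the cap to the user")
    safe_width = min(requested_width, available_slots)
    return [ideas[i:i + safe_width] for i in range(0, len(ideas), safe_width)]


if __name__ == "__main__":
    # Five ideas, user asked for width 4, only 2 slots exist: three rounds.
    for round_no, batch in enumerate(plan_rounds(list("abcde"), 4, 2), start=1):
        print(f"round {round_no}: {batch}")
```

Live concurrency is capped per round, while the total number of ideas only affects how many rounds are planned.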
@@ -68,7 +105,7 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
 
 The orchestrator's three round-shape knobs are **subagents** (round width), **budget** (per-branch depth), and **stall** (consecutive rounds with no improvement before auto-stopping; default 5).
 
-A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value
+A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value wins over what's below, subject to the resource-cap invariant above.
 
 **Picking `subagents` and `budget` is load-bearing -- do not skim.**
 
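The **stall** knob described in the hunk above (consecutive rounds with no improvement before auto-stopping, default 5) can be sketched as a simple counter. The diff does not say how evo defines "improvement", so this sketch assumes a strict improvement over the best score seen so far; `rounds_until_stall` and its parameters are hypothetical names.

```python
def rounds_until_stall(round_best_scores, baseline, stall=5, higher_is_better=True):
    """Return the 1-based round at which the loop would auto-stop, else None.

    round_best_scores holds the best score produced in each round. A round
    counts as an improvement only if it strictly beats the best so far
    (an assumption; the skill text does not define this precisely).
    """
    best = baseline
    stalled = 0
    for round_no, score in enumerate(round_best_scores, start=1):
        improved = score > best if higher_is_better else score < best
        if improved:
            best = score
            stalled = 0          # any improvement resets the stall counter
        else:
            stalled += 1
            if stalled >= stall:
                return round_no
    return None


if __name__ == "__main__":
    scores = [1.0, 1.2, 1.2, 1.1, 1.2]
    print(rounds_until_stall(scores, baseline=1.0, stall=3))
    print(rounds_until_stall(scores, baseline=1.0))  # default stall=5
```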
@@ -145,6 +182,13 @@ files or the attempt should be treated as `remote_infra_failure`.
 
 **Runtime recipe/env.** Benchmark runtime is evo configuration, not something subagents should rediscover or copy into worktrees. Use `evo config runtime show` for prepare/before-run/prefix and `evo env show` for redacted env sources. If a run fails because expected runtime setup or env is missing, report it as setup failure or configure it from the orchestrator; do not patch benchmark code to bake in secrets or local paths. Use `evo run <exp_id> --check` for non-committing wiring validation; do not invent ad-hoc validation wrappers.
 
+**Replicated/noisy benchmarks.** If the user or `.evo/project.md` says an idea
+must pass `n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation,
+configure the benchmark so each `evo run` records the grouped aggregate as that
+experiment's score. Do not represent replicates as independent evo experiments
+and do not judge, report, or promote an idea by its best replicate. The frontier
+and best path must see the same aggregate statistic the user uses for decisions.
+
 **CLI reference.** If you are unsure which command to use, read `plugins/evo/skills/references/cli-quick-reference.md`. It is the canonical command map; this skill only repeats the high-frequency commands.
 
 ## Prerequisites
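The replicated-benchmark rule added in the hunk above says a single `evo run` should record the grouped aggregate, never the best replicate. A minimal sketch of what a benchmark script might compute before reporting its score follows; `aggregate_score` is a hypothetical helper, the replicate values are made-up illustration data, and how the score is actually handed to evo is not shown because the diff does not specify it.

```python
from statistics import mean, median


def aggregate_score(replicates, statistic="median"):
    """Collapse replicate scores into the single decision statistic.

    Only "median" and "mean" are sketched here; held-out or cross-dataset
    grouping would need project-specific logic.
    """
    if not replicates:
        raise ValueError("at least one replicate score is required")
    if statistic == "median":
        return median(replicates)
    if statistic == "mean":
        return mean(replicates)
    raise ValueError(f"unsupported statistic: {statistic}")


if __name__ == "__main__":
    replicates = [0.71, 0.64, 0.90]            # illustrative n=3 scores
    print("recorded score (median):", aggregate_score(replicates))
    print("best replicate (do not report):", max(replicates))
```

Reporting the median here rather than the lucky third replicate keeps the frontier comparable with the statistic the user makes decisions on.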
@@ -296,7 +340,15 @@ Spawn all subagents in a **single batch** using your host's parallel-subagent to
 Per host, the spawn shape matters because evo's loop depends on *completion notifications* arriving turn-by-turn (so the orchestrator can review each subagent's outcome and decide round 2):
 
 - **claude-code** — fire one `Bash(run_in_background=true)` call per brief. The bash invokes the subagent (the host's `Task` tool, or any equivalent that runs the brief to completion). Each backgrounded bash returns immediately and the runtime delivers a `<task-notification>` at a later turn when each subagent finishes. Do NOT wait on subagents inline; fan them out, then exit your current turn — notifications arrive in subsequent turns.
-- **codex** —
+- **codex** — call `spawn_agent` once per optimization brief. If `spawn_agent`
+is deferred or not visible yet, first use Codex's tool-discovery tool
+(`tool_search`) with a query like `spawn agent subagent`, then call the
+discovered spawn tool. The optimize skill's plain-language trigger is
+sufficient authorization for this; do not stop or fall back to direct edits
+merely because the user did not explicitly write "spawn subagents." The
+candidate worker prompt is the one that starts with the mandatory "First,
+load and follow..." sentence below. A later read-only scan subagent does not
+count toward the round width.
 - **hermes** — `terminal(background=true)`; notifications delivered similarly.
 - **openclaw** — `sessions_spawn deliver:false`; notifications delivered similarly.
 - **opencode** — *batch-parallel only* (no background notifications). Fire N `task` calls in ONE assistant message; all `tool_result`s return together when the slowest finishes. Plan all parallel work (including non-task tools) in that single message — opencode cannot interleave reasoning across turns while subagents run.
@@ -318,6 +370,11 @@ Then append:
 
 The opening sentence is non-negotiable — without it small models often skip the evo CLI and edit files directly, which produces no committed experiments and breaks the round.
 
+Before leaving step 5, check yourself: the number of optimization subagents you
+spawned must equal the resolved `subagents=N` unless the user explicitly
+requested a smaller direct/manual run. Scan/analysis subagents are separate and
+do not count toward this number.
+
 ### 6. Collect results and update state
 
 After all subagents complete:
@@ -477,10 +534,12 @@ Keep the report public-safe but actionable enough for the evo backend team to
 reproduce the case. Include the phase, what you expected evo to do, what it did,
 and a generic repro shape. Do not include repo names, company names, file paths,
 commands, prompt text, raw logs, URLs, secrets, dataset names, or exact task
-examples.
+examples. If the issue is tied to a specific evo experiment, include its
+anonymous experiment id with `--exp-id`.
 
 ```bash
 evo telemetry feedback \
+--exp-id exp_0007 \
 --kind orchestration \
 --phase optimize \
 --summary "Subagent completed but the orchestrator did not collect its evaluated result before writing the next round briefs." \
package/skills/report/SKILL.md
CHANGED
@@ -1,12 +1,30 @@
 ---
 name: report
-description:
-evo_version: 0.6.
+description: Read-only evo run reporting. Use when the user invokes /evo:report, asks what happened overnight, asks what improved recently, asks for the best/frontier candidates, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output. Never run benchmarks, gates, Slurm commands, evo run, or ad-hoc verification scripts for report requests.
+evo_version: 0.6.1-alpha.1
 ---
 
 # Report
 
-
+Report the current evo workspace from recorded state only. A report request is
+read-only, even if the user phrases it casually as "what happened?", "what got
+better?", "what should I pay attention to?", or "I just woke up".
+
+Do not spend compute while reporting:
+
+- Do not run `evo run`, `evo gate check`, benchmark commands, or project eval
+scripts.
+- Do not run `python bench.py`, `python slurm_eval.py`, `sbatch`, `srun`,
+`squeue`, `sacct`, or `scancel` to verify a result.
+- Do not create launcher, monitor, parsing, or analysis scripts.
+- Do not edit files.
+
+Use stored evo state instead: `evo report`, `evo status`, `evo tree`,
+`evo frontier`, `evo show <id>`, `evo diff <id>`, and immutable artifacts under
+`.evo/run_*/experiments/<exp>/attempts/<NNN>/`.
+
+For chart requests, render the dashboard's scatter plot as a colored terminal
+block, one chart per run, sized to the current terminal.
 
 ## What it shows
 
@@ -41,3 +59,27 @@ Flags:
 - For one-off score lookups, `evo status` or `evo show <id>` is faster.
 - For navigating the tree shape, `evo tree` is the right command.
 - For interactive exploration (click a dot, open a drawer), point the user at `evo dashboard` instead.
+
+## Overnight / Improvement Reports
+
+When the user asks what happened recently or what improved, summarize from
+recorded evo state:
+
+1. Run `evo status`, `evo frontier`, and `evo tree`.
+2. Use `evo show <id>` for the best node and any recent committed/evaluated
+nodes you mention.
+3. Use `evo diff <id>` only to explain what changed in a recorded experiment.
+4. If you need benchmark details, read the existing `outcome.json`,
+`benchmark.log`, or declared artifacts for that experiment. Treat missing
+artifacts as "not recorded", not as permission to rerun.
+
+Report:
+
+- best current experiment and score;
+- score delta versus baseline or parent;
+- top candidates/frontier if relevant;
+- failed/evaluated nodes that need attention;
+- any caveats about gates, missing held-out checks, or tied candidates.
+
+If the user wants fresh validation or reruns, ask them to explicitly start a new
+optimization or evaluation command. Do not infer that from a report request.
package/skills/ship/SKILL.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 name: ship
 description: Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.
-evo_version: 0.6.
+evo_version: 0.6.1-alpha.1
 ---
 
 # Ship
package/skills/subagent/SKILL.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 name: subagent
 description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
-evo_version: 0.6.
+evo_version: 0.6.1-alpha.1
 ---
 
 # Evo Subagent Protocol
@@ -234,6 +234,13 @@ evo run <exp_id>
 
 This runs benchmark + gate and prints the result.
 
+For noisy or replicated benchmarks, one `evo run` must represent the user's
+decision statistic for the idea. If the brief, user, or `.evo/project.md` says
+`n=3`, `n=10`, median, mean, held-out, or cross-dataset evaluation is required,
+use the configured grouped benchmark path so the experiment records that
+aggregate score. Do not create separate evo experiments for replicates, and do
+not report or promote an idea by its best single replicate.
+
 In remote-backend workspaces, if a prior `evo run <exp_id>` was interrupted
 or the experiment is still `active`, run `evo run <exp_id>` again first. That
 is the recovery path: evo will try to attach to the existing remote process and
@@ -349,6 +356,7 @@ exact task examples.
 
 ```bash
 evo telemetry feedback \
+--exp-id <YOUR_EXP_ID> \
 --kind orchestration \
 --phase subagent \
 --summary "Remote experiment recovery was unclear after the run process detached." \