@evo-hq/pi-evo 0.5.2 → 0.6.0-alpha.1
- package/README.md +2 -0
- package/package.json +1 -1
- package/skills/discover/SKILL.md +41 -6
- package/skills/infra-setup/SKILL.md +1 -1
- package/skills/optimize/SKILL.md +47 -14
- package/skills/optimize/workflows/evo-optimize.js +1 -0
- package/skills/report/SKILL.md +4 -4
- package/skills/ship/SKILL.md +140 -0
- package/skills/subagent/SKILL.md +26 -2
package/README.md
CHANGED
|
@@ -25,6 +25,8 @@ skill needs one to run multiple experiments per round.
|
|
|
25
25
|
| `skills/discover/` | First-run setup: explore repo, propose optimization dimensions, build benchmark, run first experiment |
|
|
26
26
|
| `skills/optimize/` | The search loop: parallel subagents form hypotheses, edit, get scored, frontier picks next branch |
|
|
27
27
|
| `skills/subagent/` | Per-experiment brief contract for the optimize round's fanout |
|
|
28
|
+
| `skills/report/` | Terminal score chart mirroring the dashboard scatter plot |
|
|
29
|
+
| `skills/ship/` | Distill the best valid experiment into a clean mergeable change |
|
|
28
30
|
| `skills/infra-setup/` | Provider matrix for remote-sandbox backends (Modal, E2B, Daytona, etc.) |
|
|
29
31
|
|
|
30
32
|
## Versioning
|
package/package.json
CHANGED
package/skills/discover/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: discover
|
|
3
3
|
description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
|
|
4
4
|
argument-hint: <optional context about what to optimize>
|
|
5
|
-
evo_version: 0.
|
|
5
|
+
evo_version: 0.6.0-alpha.1
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# Discover
|
|
@@ -25,6 +25,9 @@ evo plugin
|
|
|
25
25
|
│ │ ├── evo:optimize after discover commits the baseline -- drives the loop.
|
|
26
26
|
│ │ │ Args: subagents=N (read sizing-the-round FIRST),
|
|
27
27
|
│ │ │ autonomous, subagents-only, budget=N, stall=N
|
|
28
|
+
│ │ ├── evo:ship after the loop stops -- distills the best valid
|
|
29
|
+
│ │ │ experiment into a mergeable change (PR if remote,
|
|
30
|
+
│ │ │ else merge) + a mergeability report
|
|
28
31
|
│ │ ├── evo:finetuning task is finetuning / post-training / training a model
|
|
29
32
|
│ │ └── evo:infra-setup need a remote backend, pooled workspaces, lease/slot
|
|
30
33
|
│ │ management, or specific provider auth/setup
|
|
@@ -116,20 +119,20 @@ evo --version
|
|
|
116
119
|
The output must be exactly:
|
|
117
120
|
|
|
118
121
|
```
|
|
119
|
-
evo-hq-cli 0.
|
|
122
|
+
evo-hq-cli 0.6.0-alpha.1
|
|
120
123
|
```
|
|
121
124
|
|
|
122
125
|
Three outcomes:
|
|
123
126
|
|
|
124
127
|
1. **Matches exactly** — continue to step 1.
|
|
125
128
|
2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
|
|
126
|
-
> Your installed evo CLI is on a different version than this skill (`0.
|
|
129
|
+
> Your installed evo CLI is on a different version than this skill (`0.6.0-alpha.1`). Run:
|
|
127
130
|
> ```
|
|
128
|
-
> uv tool install --force evo-hq-cli==0.
|
|
131
|
+
> uv tool install --force evo-hq-cli==0.6.0-alpha.1
|
|
129
132
|
> ```
|
|
130
133
|
> Then re-invoke this skill.
|
|
131
134
|
3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
|
|
132
|
-
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.
|
|
135
|
+
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0-alpha.1` (or `pipx install evo-hq-cli==0.6.0-alpha.1`). Then re-invoke this skill.
|
|
133
136
|
|
|
134
137
|
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
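The three outcomes above can be sketched as a small classifier. This is a minimal illustration, not part of the evo CLI: `classifyEvoVersion` and its return shape are hypothetical names, and the helper only produces the command to show the user (it never installs anything, in line with the no-auto-install rule).

```javascript
// Hypothetical helper: classify the text printed by `evo --version`.
// `expected` is the version pinned in the skill front matter.
// Pass null/undefined as `output` to represent "command not found".
function classifyEvoVersion(output, expected) {
  const text = (output ?? "").trim();
  const match = /^evo-hq-cli\s+(\S+)$/.exec(text);
  if (!match) {
    // Missing binary, or a different package (e.g. the unrelated `evo 1.x`).
    return {
      outcome: "not-installed",
      tellUser: `uv tool install evo-hq-cli==${expected}`,
    };
  }
  if (match[1] === expected) {
    return { outcome: "match", tellUser: null };
  }
  // Same package, different version: skill/CLI drift.
  return {
    outcome: "drift",
    found: match[1],
    tellUser: `uv tool install --force evo-hq-cli==${expected}`,
  };
}
```

Only the `match` outcome continues to step 1; the other two stop and hand the printed command to the user.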
|
|
135
138
|
|
|
@@ -519,7 +522,39 @@ Document:
|
|
|
519
522
|
- Benchmark gaming risks identified during the Goodhart check
|
|
520
523
|
- Future experiment candidates (the non-picked dimensions from step 3)
|
|
521
524
|
|
|
522
|
-
## 13. Report
|
|
525
|
+
## 13. Report public-safe use case telemetry
|
|
526
|
+
|
|
527
|
+
Call the telemetry helper with a concrete but sanitized use-case summary. This
|
|
528
|
+
command is globally disabled when the user runs `evo telemetry off`; do not
|
|
529
|
+
branch on that yourself.
|
|
530
|
+
|
|
531
|
+
Write one short sentence that captures: workload/system shape, optimization
|
|
532
|
+
lever, and measured signal or failure mode. Add 3-6 lower-kebab-case tags for
|
|
533
|
+
the workload, lever, metric/failure mode, runtime category, or benchmark source.
|
|
534
|
+
Avoid filler-only tags like `optimization`, `benchmark`, or `codebase`.
|
|
535
|
+
|
|
536
|
+
```bash
|
|
537
|
+
evo telemetry usecase \
|
|
538
|
+
--description "<public-safe, concrete use case summary>" \
|
|
539
|
+
--tag <tag> --tag <tag> --tag <tag>
|
|
540
|
+
```
|
|
541
|
+
|
|
542
|
+
Privacy rule: do not include repo names, company names, customer names, file
|
|
543
|
+
paths, benchmark commands, prompt text, task examples, secrets, URLs, internal
|
|
544
|
+
system names, environment names, raw error logs, or exact dataset/item names.
|
|
545
|
+
Generalize private nouns instead of deleting useful signal: "internal billing
|
|
546
|
+
agent" becomes "workflow agent"; "Acme support router" becomes "support-style
|
|
547
|
+
routing agent".
|
|
548
|
+
|
|
549
|
+
Good example:
|
|
550
|
+
|
|
551
|
+
```bash
|
|
552
|
+
evo telemetry usecase \
|
|
553
|
+
--description "Optimizing a tool-calling coding agent for higher task pass rate by tuning planner/retry behavior against an existing per-task eval." \
|
|
554
|
+
--tag coding-agent --tag tool-calling --tag retry-policy --tag pass-rate --tag existing-eval
|
|
555
|
+
```
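The tag rules above (3-6 tags, lower-kebab-case, no filler-only tags) are mechanical enough to check before calling the helper. The sketch below is an assumption-laden illustration: `checkUsecaseTags` is a hypothetical function, and the filler set contains only the three examples the text names. It does not attempt the privacy rule, which needs human or agent judgment.

```javascript
// Filler tags named in the text; the real list may be longer (assumption).
const FILLER_TAGS = new Set(["optimization", "benchmark", "codebase"]);
// Lower-kebab-case: lowercase alphanumeric words joined by single hyphens.
const KEBAB = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Hypothetical pre-flight check; returns a list of problems (empty means OK).
function checkUsecaseTags(tags) {
  const problems = [];
  if (tags.length < 3 || tags.length > 6) {
    problems.push(`expected 3-6 tags, got ${tags.length}`);
  }
  for (const tag of tags) {
    if (!KEBAB.test(tag)) problems.push(`not lower-kebab-case: ${tag}`);
    if (FILLER_TAGS.has(tag)) problems.push(`filler tag: ${tag}`);
  }
  return problems;
}
```

The tags in the good example pass this check with no problems reported.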
|
|
556
|
+
|
|
557
|
+
## 14. Report to the user
|
|
523
558
|
|
|
524
559
|
End the skill by reporting in chat:
|
|
525
560
|
|
package/skills/optimize/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: optimize
|
|
3
3
|
description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
|
|
4
4
|
argument-hint: "[subagents=N] [budget=N] [stall=N]"
|
|
5
|
-
evo_version: 0.
|
|
5
|
+
evo_version: 0.6.0-alpha.1
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
|
|
@@ -102,17 +102,17 @@ evo defaults get subagents-only --json
|
|
|
102
102
|
|
|
103
103
|
As your **very first actions, before the loop**, resolve each and arm it: run `evo autonomous on` / `evo subagents-only on` when it resolves on, or `evo autonomous off` / `evo subagents-only off` when an explicit instruction or stored default turned it off. If a behavior resolves off — whether from the user's instruction this run or a stored default — say so in your opening message (e.g. "autonomous off — running one round at a time, as you asked") so it's never invisible.
|
|
104
104
|
|
|
105
|
-
**Orchestrator driver.** evo drives the loop two ways: a deterministic **dynamic workflow** (Claude Code only)
|
|
105
|
+
**Orchestrator driver.** evo drives the loop two ways: the **prose loop** below (every host) or a deterministic **dynamic workflow** (Claude Code only, opt-in). **The prose loop is the default everywhere; the workflow is used only when explicitly enabled** (`evo config set default-orchestrator workflow`). Resolve which as part of your very first actions:
|
|
106
106
|
|
|
107
107
|
1. `evo host show` — the workflow driver requires `claude-code`. If it prints `<not set>` (a pre-host workspace), determine your actual runtime from your own context (system prompt, env such as `CLAUDECODE=1`, self-identity): **only if you are genuinely Claude Code**, do the one-time host migration now (`evo host set claude-code`) and continue; if you are any other runtime, do NOT stamp the host here — leave it for Step 0.1 and use the prose loop.
|
|
108
|
-
2. `evo config get default-orchestrator` — `
|
|
108
|
+
2. `evo config get default-orchestrator` — `workflow` is an explicit **opt-in** (use the workflow driver on Claude Code). `prose` **or unset** resolves to the prose loop. An explicit user instruction this run still wins.
|
|
109
109
|
|
|
110
|
-
**Use the workflow** when
|
|
110
|
+
**Use the workflow** only when `default-orchestrator` is explicitly `workflow`, host is `claude-code`, AND the **Workflow tool is actually present in your available tools this session** — it is opt-in, never the default. The availability check is load-bearing: **older Claude Code builds do not ship the Workflow tool**, so verify it's really in your toolset; do not assume it exists from the host alone. Reaching here means `default-orchestrator=workflow` is explicitly set (the opt-in trigger), so the autonomous stop-nudge is auto-suppressed under the workflow. Launch it once, do NOT drive the loop turn-by-turn:
|
|
111
111
|
|
|
112
112
|
- Call the **Workflow** tool with `scriptPath: ${CLAUDE_PLUGIN_ROOT}/skills/optimize/workflows/evo-optimize.js` and `args: {pluginRoot: "${CLAUDE_PLUGIN_ROOT}", subagents: <N>, budget: <N>, stall: <N>}`, using the round sizing you resolved above. **Pass all four keys explicitly — never omit one.** For `stall`, use the user's `/optimize stall=N` override if given, else the default 5. (The workflow's stop condition is the stall limit, so a dropped `stall` silently reverts it to 5.)
|
|
113
113
|
- Report the returned `runId` and tell the user to watch progress with `/workflows`. The workflow runs the round loop itself (orient → mandatory scan + cross-history axis check → ideators on stall/periodic → briefs → fan-out + verify → collect → frontier-select → stall) plus the concurrent meta controller; you do **not** execute "The Loop" section below, and you do **not** need autonomous mode (the workflow self-drives; its stall limit is the stop).
|
|
114
114
|
|
|
115
|
-
Use **The Loop** below
|
|
115
|
+
Use **The Loop** below by default — it is the prose driver on every host, and the path whenever the workflow is not explicitly enabled (`default-orchestrator` unset or `prose`), the host is not `claude-code`, or the Workflow tool is unavailable (e.g. an older Claude Code build). The workflow is only an execution strategy over the same `evo` CLI; gates, frontier, dashboard, and recovery are identical either way.
|
|
116
116
|
|
|
117
117
|
**Reconcile config when you fall back to prose.** The stop-nudge that drives the prose loop is auto-suppressed whenever `default-orchestrator` is `workflow`. So if you fall back to the prose loop on Claude Code because the Workflow tool isn't available (older build) while `default-orchestrator` is still `workflow` from a prior run, you MUST set it back — `evo config set default-orchestrator prose` — and arm autonomous as usual. Otherwise the prose loop's stop-nudge stays suppressed and the run stalls after one round. Invariant to preserve: `default-orchestrator=workflow` in config iff the workflow is actually the driver this run.
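The driver rules and the reconcile invariant can be read as one decision function. The following is a minimal sketch under stated assumptions: `resolveDriver` and its field names are hypothetical, the inputs are assumed to come from `evo config get default-orchestrator`, `evo host show`, and the session's own tool list, and an explicit user instruction for the run (which still wins) is not modeled.

```javascript
// Hypothetical decision helper mirroring the orchestrator-driver rules.
function resolveDriver({ defaultOrchestrator, host, hasWorkflowTool }) {
  // The workflow is opt-in: unset or "prose" always means the prose loop.
  const optedIn = defaultOrchestrator === "workflow";
  const useWorkflow =
    optedIn && host === "claude-code" && hasWorkflowTool === true;
  return {
    driver: useWorkflow ? "workflow" : "prose",
    // Invariant: config says `workflow` iff the workflow actually drives.
    // If we fall back while config still says `workflow`, the stop-nudge
    // stays suppressed, so the config must be set back to `prose`.
    resetConfigToProse: optedIn && !useWorkflow,
  };
}
```

When `resetConfigToProse` is true, the corresponding action from the text is `evo config set default-orchestrator prose`, followed by arming autonomous as usual.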
|
|
118
118
|
|
|
@@ -331,16 +331,18 @@ After all subagents complete:
|
|
|
331
331
|
|
|
332
332
|
**Cross-cut the round's evaluated nodes.** Before moving on, read `experiments/<id>/attempts/NNN/outcome.json` for each evaluated node from this round. The structured `gates[]` entries and `benchmark.result` let you spot shared failure modes the subagent summaries may have glossed over (e.g., three different subagents produced evaluated nodes whose gate_failures all included `refund_flow` -- that's a structural constraint the next round must confront, not three independent bad hypotheses).
|
|
333
333
|
|
|
334
|
-
Prune
|
|
334
|
+
Prune branches you have decided are exhausted:
|
|
335
335
|
```bash
|
|
336
|
-
evo prune <exp_id> --reason "exhausted:
|
|
336
|
+
evo prune <exp_id> --exhausted --reason "exhausted: <why>"
|
|
337
337
|
```
|
|
338
338
|
|
|
339
|
-
`evo prune` accepts `committed` or `evaluated` nodes
|
|
340
|
-
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
|
|
339
|
+
`evo prune` accepts `committed` or `evaluated` nodes:
|
|
340
|
+
- `--exhausted` (default/legacy): stop branching here; the result still counts.
|
|
341
|
+
- `--invalid`: this result is wrong; exclude it and descendants from best/frontier.
|
|
342
|
+
- `--yes`: required only with `--invalid` on the current best valid spine.
|
|
343
|
+
|
|
344
|
+
Use `--invalid --yes` when you have proven a best-spine result or ancestor is
|
|
345
|
+
wrong. **Never `evo discard` a committed node** -- prune preserves its commit.
|
|
344
346
|
|
|
345
347
|
If a previously-pruned (or discarded-then-restored) node is worth revisiting:
|
|
346
348
|
```bash
|
|
@@ -444,18 +446,49 @@ Proposals are advisory, not mandatory. If none look better than what step 3's sc
|
|
|
444
446
|
- User hasn't interrupted
|
|
445
447
|
- Score hasn't reached the theoretical maximum
|
|
446
448
|
|
|
449
|
+
To continue, go back to step 1.
|
|
450
|
+
|
|
447
451
|
**Stop** if:
|
|
448
452
|
- Stall counter >= stall limit (N consecutive rounds with no improvement)
|
|
449
453
|
- Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
|
|
450
454
|
- User interrupted
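As an illustration, the stop conditions above can be sketched as a single check run at the end of each round. `stopReason` and its parameter names are hypothetical, and the sketch assumes scores are normalized so that 1.0 (max metric) or 0.0 (min metric) is the theoretical best, as the list states.

```javascript
// Hypothetical round-end check; `direction` is "max" or "min".
// Returns a reason string when the loop should stop, or null to continue.
function stopReason({ stallCount, stallLimit, score, direction, interrupted }) {
  if (interrupted) return "user-interrupted";
  // Theoretical best: 1.0 for a max metric, 0.0 for a min metric.
  const atBest = direction === "max" ? score >= 1.0 : score <= 0.0;
  if (atBest) return "theoretical-maximum";
  // N consecutive rounds with no improvement.
  if (stallCount >= stallLimit) return "stall-limit";
  return null; // keep going: back to step 1
}
```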
|
|
451
455
|
|
|
452
|
-
On stop,
|
|
456
|
+
On stop, the loop is done -- do not go back to step 1. Print a final summary:
|
|
453
457
|
- Best score achieved and experiment ID
|
|
454
458
|
- Total experiments run across all rounds
|
|
455
459
|
- The winning diff: `evo diff <best_exp_id>`
|
|
456
460
|
- Suggested next steps if the score hasn't converged
|
|
461
|
+
- **How to land it:** point the user at `/evo:ship` to turn the winning
|
|
462
|
+
experiment into a mergeable change (a PR when the repo has a remote, else a
|
|
463
|
+
merge into the working branch). The raw winning diff is not the artifact to
|
|
464
|
+
merge by hand -- `/evo:ship` distills it to the minimal clean change and
|
|
465
|
+
attaches a mergeability report. Mention this whenever the run produced a
|
|
466
|
+
committed experiment that beats the baseline.
|
|
467
|
+
|
|
468
|
+
### 8. Report evo-side feedback when evo itself went wrong
|
|
469
|
+
|
|
470
|
+
Use feedback for evo product/orchestration issues, not for ordinary bad
|
|
471
|
+
experiments. Good triggers: confusing CLI behavior, subagent handoff failures,
|
|
472
|
+
workflow/meta-controller mistakes, recovery that failed, remote backend handling
|
|
473
|
+
that was hard to diagnose, or an evo policy/gate that blocked the wrong thing.
|
|
474
|
+
The command is anonymous and no-ops when telemetry is off; do not branch on that.
|
|
475
|
+
|
|
476
|
+
Keep the report public-safe but actionable enough for the evo backend team to
|
|
477
|
+
reproduce the case. Include the phase, what you expected evo to do, what it did,
|
|
478
|
+
and a generic repro shape. Do not include repo names, company names, file paths,
|
|
479
|
+
commands, prompt text, raw logs, URLs, secrets, dataset names, or exact task
|
|
480
|
+
examples.
|
|
457
481
|
|
|
458
|
-
|
|
482
|
+
```bash
|
|
483
|
+
evo telemetry feedback \
|
|
484
|
+
--kind orchestration \
|
|
485
|
+
--phase optimize \
|
|
486
|
+
--summary "Subagent completed but the orchestrator did not collect its evaluated result before writing the next round briefs." \
|
|
487
|
+
--expected "Round collection should wait for all dispatched subagents or clearly report the missing one." \
|
|
488
|
+
--actual "The next round was planned from partial results, leaving one evaluated node out of the pattern scan." \
|
|
489
|
+
--repro "Run optimize with multiple subagents on a workspace where one branch finishes after the first completion notification." \
|
|
490
|
+
--tag optimize-loop --tag subagent-handoff --tag collection
|
|
491
|
+
```
|
|
459
492
|
|
|
460
493
|
## Polling discipline
|
|
461
494
|
|
package/skills/optimize/workflows/evo-optimize.js
CHANGED
|
@@ -553,6 +553,7 @@ function metaPrompt(ctx, intervalS, reported, journal) {
|
|
|
553
553
|
'- Dead direction / ignored mechanism: annotations repeatedly naming a mechanism the recent work ignores, or a direction that keeps regressing. BRIEF HINT.',
|
|
554
554
|
'- Heading toward failure (STOP): an in-flight experiment that is CLEARLY doomed or wasting the budget — a divergent / NaN / flatlined progress metric; projected completion beyond the remaining time budget; or a known-fatal signature (e.g. output the scorer cannot parse; a silent resource mis-placement that tanks throughput with no error; a corrupt input/format that invalidates the result). HIGH PRECISION ONLY: default to NOT stopping — recommend a STOP only with concrete evidence that finishing is wasted, and only for an experiment still `active`. Emit a stop with: expId; failureClass (build = the build/produce step is broken; eval = artifact is fine but scoring/serving is wrong; hypothesis = it runs but won\'t help); reason (the diagnosis + the evidence you saw); fixHint (what the NEXT experiment must change).',
|
|
555
555
|
'For STOPs you stay READ-ONLY: do NOT run `evo abort` / `evo discard` yourself. A gated enforcer acts on each stop — it aborts the run + its subprocess tree, annotates your diagnosis (so it outlives the worktree and feeds the next round), and discards with the failureClass so the partial artifact is preserved. A STOP is a diagnosed, recoverable stop, never a silent kill.',
|
|
556
|
+
'If you observe an evo workflow/meta-controller defect (missed collection, wrong prompt handoff, recovery confusion, bad stop/enforcer behavior), you MAY run `evo telemetry feedback --kind workflow --phase meta ...` with public-safe summary/expected/actual/repro/tags before returning. This is anonymous and no-ops when telemetry is off. Do NOT report ordinary bad experiments, raw logs, paths, commands, repo names, URLs, secrets, or prompt text.',
|
|
556
557
|
'',
|
|
557
558
|
'HARNESS CONTROL (your distinctive power): you may restructure the optimize workflow itself, live, when you judge it will help — edits apply directly (free will) and take effect next round. Current harness state: ' + JSON.stringify(harnessSummary()) + '.',
|
|
558
559
|
'harnessEdits ops: (1) set-knob {knob: width|budget|stall|ideateEvery|ideateStall, value} — retune the loop (widen the round, deepen branches, change the stall limit or ideation cadence). (2) toggle-phase {phaseName: scan|ideate, enabled} — turn a phase off/on (e.g. skip scan when traces are uninformative; force ideation early). (3) set-prompt {target: state|scan|aggregate|brief|implement|run|ideator|collect|preverify|audit, mode: append|replace, text} — edit the prompt that step uses. Appends ACCUMULATE as standing directives (the current ones are visible in the harness state above — do not re-add them); replace swaps the base wholesale. Use preverify/audit to harden the verifier when you spot a cheat pattern the audit missed. (4) inject-step {at: before-scan|after-scan|before-brief|after-collect, text, label} — add an extra agent step at that seam each round. Every edit needs a rationale citing the evidence.',
|
package/skills/report/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: report
|
|
3
3
|
description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
|
|
4
|
-
evo_version: 0.
|
|
4
|
+
evo_version: 0.6.0-alpha.1
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Report
|
|
@@ -13,10 +13,10 @@ Render the dashboard's scatter plot as a colored terminal block, one chart per r
|
|
|
13
13
|
Mirrors the web dashboard's score scatter (left rail of `evo dashboard`):
|
|
14
14
|
|
|
15
15
|
- X = experiment creation order, Y = score
|
|
16
|
-
- Dot color by status: green = committed, red = failed, purple = active, grey = pending / evaluated / discarded / pruned
|
|
17
|
-
- ★ marks the current best committed experiment
|
|
16
|
+
- Dot color by status: green = committed valid result, red = failed, purple = active, grey = pending / evaluated / discarded / pruned
|
|
17
|
+
- ★ marks the current best valid committed-result experiment. `pruned` with `prune_kind=exhausted` can still be best; `prune_kind=invalid` and its descendants cannot.
|
|
18
18
|
- Yellow ring on dots that sit on the best-path spine (root → best)
|
|
19
|
-
- Yellow stair line traces cumulative-best across committed experiments
|
|
19
|
+
- Yellow stair line traces cumulative-best across valid committed-result experiments
|
|
20
20
|
- ○ at the baseline for experiments that have no score yet (active / pending)
|
|
21
21
|
|
|
22
22
|
Every run in the workspace is rendered, stacked top-to-bottom, with a header line showing `run_id · target · metric`.
|
package/skills/ship/SKILL.md
ADDED
|
@@ -0,0 +1,140 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ship
|
|
3
|
+
description: Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.
|
|
4
|
+
evo_version: 0.6.0-alpha.1
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Ship
|
|
8
|
+
|
|
9
|
+
Turn a finished evo run into a change a maintainer would merge.
|
|
10
|
+
|
|
11
|
+
The optimize loop leaves a tree of committed experiments. The winning worktree
|
|
12
|
+
diff is not mergeable as-is: it carries debug prints, search-process churn,
|
|
13
|
+
over-broad edits, and sometimes a test that was relaxed to clear a gate. Shipping
|
|
14
|
+
is the step that re-derives the *minimal clean change* reproducing the winning
|
|
15
|
+
behaviour, lands it the way the repo expects (PR or merge), and reports how
|
|
16
|
+
mergeable it is.
|
|
17
|
+
|
|
18
|
+
Correctness is the floor, not the goal. The score says the behaviour works; this
|
|
19
|
+
skill decides whether the *diff* is fit to merge.
|
|
20
|
+
|
|
21
|
+
## Invocation
|
|
22
|
+
|
|
23
|
+
```bash
|
|
24
|
+
/evo:ship # ship the auto-selected winner
|
|
25
|
+
/evo:ship exp_0042 # ship a specific experiment instead
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
## Stage 1 -- Select the winner
|
|
29
|
+
|
|
30
|
+
Pick the experiment to ship, then confirm it with the user before touching their
|
|
31
|
+
tree.
|
|
32
|
+
|
|
33
|
+
```bash
|
|
34
|
+
evo status # current best valid score + counts
|
|
35
|
+
evo report # top valid experiments table + score chart
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
- The default winner is the highest-scoring valid result in the graph history,
|
|
39
|
+
not the frontier. `evo frontier` is for choosing where to branch next; it can
|
|
40
|
+
exclude an exhausted branch whose score is still the right thing to ship. An
|
|
41
|
+
explicit `exp_id` argument overrides auto-selection.
|
|
42
|
+
- A shippable winner must be valid: `committed`, or `pruned` with
|
|
43
|
+
`prune_kind=exhausted`, with a commit and score, no `gate_result === false`,
|
|
44
|
+
and no invalid-pruned ancestor. Never select `discarded`, `failed`, `active`,
|
|
45
|
+
`evaluated`, legacy-pruned nodes with no `prune_kind`, `prune_kind=invalid`, or
|
|
46
|
+
descendants of invalid-pruned nodes. If no valid candidate exists, stop and
|
|
47
|
+
report why nothing is safe to ship.
|
|
48
|
+
- Resolve the run's root (baseline) node, then show the cumulative change:
|
|
49
|
+
```bash
|
|
50
|
+
evo diff <root_id> <winner_id> # target-scoped cumulative diff, baseline -> winner
|
|
51
|
+
```
|
|
52
|
+
For changes outside the benchmark target, diff the commits directly
|
|
53
|
+
(`git diff <baseline_commit> <winner_commit>`); each node carries `.commit`.
|
|
54
|
+
- Present a one-screen summary: winner id, score baseline -> winner (delta),
|
|
55
|
+
the winning hypothesis, and a diffstat. Get a go before proceeding.
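The validity rules in Stage 1 can be sketched as a predicate plus a selector. This is an illustration only, not evo's implementation: the node shape is hypothetical, and only `prune_kind`, `gate_result`, and `.commit` are named in the skill text, so `id`, `status`, `score`, and `parent` are assumed field names.

```javascript
// Hypothetical node shape:
//   { id, status, prune_kind, commit, score, gate_result, parent }
function isShippable(node, byId) {
  // Valid statuses: committed, or pruned as exhausted. Legacy-pruned nodes
  // (no prune_kind) and invalid-pruned nodes fail this test.
  const validStatus =
    node.status === "committed" ||
    (node.status === "pruned" && node.prune_kind === "exhausted");
  if (!validStatus) return false;
  if (!node.commit || typeof node.score !== "number") return false;
  if (node.gate_result === false) return false;
  // Walk ancestors: an invalid-pruned ancestor excludes the whole lineage.
  for (let p = byId.get(node.parent); p; p = byId.get(p.parent)) {
    if (p.status === "pruned" && p.prune_kind === "invalid") return false;
  }
  return true;
}

// Default winner: best valid score across the whole graph, not the frontier.
function pickWinner(nodes, direction = "max") {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const better = (a, b) =>
    direction === "max" ? a.score > b.score : a.score < b.score;
  let best = null;
  for (const n of nodes) {
    if (isShippable(n, byId) && (best === null || better(n, best))) best = n;
  }
  return best; // null means nothing is safe to ship
}
```

A `null` result corresponds to the "stop and report why nothing is safe to ship" case; an explicit `exp_id` argument would bypass `pickWinner` but should presumably still pass `isShippable`.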
|
|
56
|
+
|
|
57
|
+
## Stage 2 -- Distill to a mergeable change
|
|
58
|
+
|
|
59
|
+
Work on a fresh branch off the user's current HEAD, not in the experiment
|
|
60
|
+
worktree. Re-derive the change so it stands on its own:
|
|
61
|
+
|
|
62
|
+
- **Scope restraint.** Keep only the files and lines the behaviour needs. Drop
|
|
63
|
+
experiment scaffolding, debug logging, commented-out attempts, and churn the
|
|
64
|
+
search introduced and then abandoned. Smaller, local diffs merge; sprawl does
|
|
65
|
+
not.
|
|
66
|
+
- **Test integrity.** If the search weakened, skipped, or deleted a test to clear
|
|
67
|
+
a gate, restore it. New behaviour that changes outputs needs a test that
|
|
68
|
+
covers it. Never ship a green benchmark that rode on a loosened test -- call it
|
|
69
|
+
out instead.
|
|
70
|
+
- **Mechanical cleanliness.** Match the repo's formatter and linter. No stray
|
|
71
|
+
whitespace, no reordered imports unless the repo does that.
|
|
72
|
+
- **Codebase adherence.** Match surrounding naming, error handling, and structure.
|
|
73
|
+
The diff should read like the file it lands in.
|
|
74
|
+
|
|
75
|
+
Then confirm the behaviour survived the distillation:
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
evo run <winner_id> --check # or the project's benchmark / test command
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
If the distilled change no longer reproduces the winning score, do not paper over
|
|
82
|
+
it -- report the gap (which part of the experiment diff was load-bearing) and let
|
|
83
|
+
the user decide. Best-effort means honest about what could not be cleaned up, not
|
|
84
|
+
silently shipping the raw worktree.
|
|
85
|
+
|
|
86
|
+
## Stage 3 -- Land
|
|
87
|
+
|
|
88
|
+
Detect how the repo expects changes to arrive:
|
|
89
|
+
|
|
90
|
+
```bash
|
|
91
|
+
git remote -v
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
- **Remote present** -> open a pull request. Commit the distilled change on its
|
|
95
|
+
branch, push, and `gh pr create` with the mergeability report (Stage 4) as the
|
|
96
|
+
body. Do not push or open the PR without the user's go.
|
|
97
|
+
- **No remote** -> merge the distilled change into the user's working branch as a
|
|
98
|
+
single clean commit. Do not force, do not rewrite existing history.
|
|
99
|
+
|
|
100
|
+
The landed commit message carries provenance: the winning experiment id, the
|
|
101
|
+
score delta, and the one-line hypothesis. State what changed and why it is safe;
|
|
102
|
+
do not narrate the search process.
|
|
103
|
+
|
|
104
|
+
## Stage 4 -- Mergeability report (advisory)
|
|
105
|
+
|
|
106
|
+
Always produce the report. It never blocks the merge -- it tells the user, and a
|
|
107
|
+
future reviewer, how mergeable the change is across the axes a maintainer judges
|
|
108
|
+
on:
|
|
109
|
+
|
|
110
|
+
- **Technique** -- what the change actually does to move the score, named
|
|
111
|
+
concretely (the algorithm, data structure, or mechanism), not the search
|
|
112
|
+
story. Distilled from the winning hypothesis: "replaced the O(n^2) dedup with a
|
|
113
|
+
hash set", not "exp_0042 improved throughput". This is what a reviewer reads
|
|
114
|
+
first.
|
|
115
|
+
- **Behavioural correctness** -- score baseline -> shipped (delta); benchmark
|
|
116
|
+
status after distillation.
|
|
117
|
+
- **Regression safety** -- full test suite result on the distilled change.
|
|
118
|
+
- **Scope** -- files touched, diff size, whether the change stays local.
|
|
119
|
+
- **Test correctness** -- explicit yes/no on whether any test was modified,
|
|
120
|
+
weakened, or removed, with detail; whether new behaviour is covered.
|
|
121
|
+
- **Mechanical cleanliness** -- formatter / linter status.
|
|
122
|
+
- **Codebase adherence** -- a note on style/convention fit.
|
|
123
|
+
|
|
124
|
+
Lead with a plain-language summary: what changed and why it is safe to merge. On
|
|
125
|
+
a remote repo this is the PR body. With no remote, print it and save it alongside
|
|
126
|
+
the run so the user can paste it into a review later.
|
|
127
|
+
|
|
128
|
+
## Guardrails (firm)
|
|
129
|
+
|
|
130
|
+
Everything above is method you can adapt to the repo. These are not:
|
|
131
|
+
|
|
132
|
+
- Never weaken, skip, or delete a test to make the change land. If the experiment
|
|
133
|
+
did, restore it and report it.
|
|
134
|
+
- Never ship invalid-pruned, legacy-pruned, discarded, failed, active,
|
|
135
|
+
evaluated, gate-failed, or invalid-lineage nodes. Only exhausted pruned nodes
|
|
136
|
+
remain normal ship candidates.
|
|
137
|
+
- Never push or open a PR without the user's explicit go.
|
|
138
|
+
- Never rewrite or force-overwrite existing history on the user's branch.
|
|
139
|
+
- Never ship the raw experiment worktree diff as-is when distillation failed --
|
|
140
|
+
report the gap instead.
|
package/skills/subagent/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: subagent
|
|
3
3
|
description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
|
|
4
|
-
evo_version: 0.
|
|
4
|
+
evo_version: 0.6.0-alpha.1
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Evo Subagent Protocol
|
|
@@ -335,6 +335,29 @@ Continue if budget remains AND (last outcome was committed, OR you have a meanin
|
|
|
335
335
|
|
|
336
336
|
Stop if budget exhausted, infra failure, or you've exhausted variations with no improvement.
|
|
337
337
|
|
|
338
|
+
### 9. Report evo-side feedback when evo itself blocked you
|
|
339
|
+
|
|
340
|
+
Use feedback for evo product/orchestration issues, not for ordinary bad
|
|
341
|
+
experiments. Good triggers: confusing `evo run` behavior, remote workspace
|
|
342
|
+
commands that were hard to recover from, verifier handoff problems, missing
|
|
343
|
+
state that prevented you from following the brief, or a policy/gate that blocked
|
|
344
|
+
the wrong thing. The command is anonymous and no-ops when telemetry is off.
|
|
345
|
+
|
|
346
|
+
Keep it public-safe and reproducible. Do not include repo names, company names,
|
|
347
|
+
file paths, commands, prompt text, raw logs, URLs, secrets, dataset names, or
|
|
348
|
+
exact task examples.
|
|
349
|
+
|
|
350
|
+
```bash
|
|
351
|
+
evo telemetry feedback \
|
|
352
|
+
--kind orchestration \
|
|
353
|
+
--phase subagent \
|
|
354
|
+
--summary "Remote experiment recovery was unclear after the run process detached." \
|
|
355
|
+
--expected "Re-running evo run should clearly say whether it resumed or needs a fresh experiment." \
|
|
356
|
+
--actual "The subagent could not tell whether the active attempt was recoverable." \
|
|
357
|
+
--repro "Use a remote backend, interrupt a long-running experiment, then ask a subagent to continue the same exp_id." \
|
|
358
|
+
--tag remote-backend --tag recovery --tag subagent
|
|
359
|
+
```
|
|
360
|
+
|
|
338
361
|
## Enriching traces
|
|
339
362
|
|
|
340
363
|
Check `.evo/meta.json` for `"instrumentation_mode"` (`"sdk"` or `"inline"`) to see which style the benchmark uses -- **stay consistent with that choice across iterations; do not flip styles mid-run.**
|
|
@@ -357,7 +380,8 @@ If your experiment needs an artifact that is slow to produce and stable across s
|
|
|
357
380
|
|
|
358
381
|
- Do NOT run `evo init` or `evo reset`
|
|
359
382
|
- `evo discard <your_exp_id> --reason "..."` is your explicit "abandon" action — use it for any *non-committed* node you've decided not to pursue further (pre-run realization, evaluated with a bad hypothesis, or unfixable infra failure). Discard deletes the worktree and branch; the node and its per-attempt artifacts stay in `.evo/` as a record of what was tried.
|
|
360
|
-
- If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune`.
|
|
383
|
+
- If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune --exhausted` or `evo prune --invalid`.
|
|
384
|
+
- If you discover a committed result is invalid (bad score calculation, benchmark gaming, weakened/deleted tests, failed inherited gate, or failed repro), report the exact node and evidence. Do not discard it; the orchestrator should mark it with `evo prune <id> --invalid --reason "..."` and add `--yes` only if evo says the node is on the current best spine.
|
|
361
385
|
- If `evo discard` errors with **"cannot discard active node ... pass --force"** — the run is still in flight. Wait for it to finish; don't `--force` unless you know what you're doing (the running process can still write a final outcome that contradicts the discard).
|
|
362
386
|
- If `evo discard` errors with **"cannot discard ... has non-discarded children"** — sibling/child experiments depend on this node's parent reference. Discard or commit-and-prune those first.
|
|
363
387
|
- Do NOT copy `.env` files, bake secrets into source, or hard-code local runtime paths. Runtime setup/env is configured by the orchestrator (`evo config runtime ...`, `evo env ...`) and injected into benchmark/gate processes. If a missing dependency, setup step, or key blocks evaluation, report setup failure.
|