@evo-hq/pi-evo 0.5.2 → 0.6.0-alpha.1

package/README.md CHANGED
@@ -25,6 +25,8 @@ skill needs one to run multiple experiments per round.
  | `skills/discover/` | First-run setup: explore repo, propose optimization dimensions, build benchmark, run first experiment |
  | `skills/optimize/` | The search loop: parallel subagents form hypotheses, edit, get scored, frontier picks next branch |
  | `skills/subagent/` | Per-experiment brief contract for the optimize round's fanout |
+ | `skills/report/` | Terminal score chart mirroring the dashboard scatter plot |
+ | `skills/ship/` | Distill the best valid experiment into a clean mergeable change |
  | `skills/infra-setup/` | Provider matrix for remote-sandbox backends (Modal, E2B, Daytona, etc.) |

  ## Versioning
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@evo-hq/pi-evo",
-   "version": "0.5.2",
+   "version": "0.6.0-alpha.1",
    "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
    "publishConfig": {
      "access": "public"
@@ -2,7 +2,7 @@
  name: discover
  description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
  argument-hint: <optional context about what to optimize>
- evo_version: 0.5.2
+ evo_version: 0.6.0-alpha.1
  ---

  # Discover
@@ -25,6 +25,9 @@ evo plugin
  │ │ ├── evo:optimize after discover commits the baseline -- drives the loop.
  │ │ │ Args: subagents=N (read sizing-the-round FIRST),
  │ │ │ autonomous, subagents-only, budget=N, stall=N
+ │ │ ├── evo:ship after the loop stops -- distills the best valid
+ │ │ │ experiment into a mergeable change (PR if remote,
+ │ │ │ else merge) + a mergeability report
  │ │ ├── evo:finetuning task is finetuning / post-training / training a model
  │ │ └── evo:infra-setup need a remote backend, pooled workspaces, lease/slot
  │ │ management, or specific provider auth/setup
@@ -116,20 +119,20 @@ evo --version
  The output must be exactly:

  ```
- evo-hq-cli 0.5.2
+ evo-hq-cli 0.6.0-alpha.1
  ```

  Three outcomes:
 
  1. **Matches exactly** — continue to step 1.
  2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
- > Your installed evo CLI is on a different version than this skill (`0.5.2`). Run:
+ > Your installed evo CLI is on a different version than this skill (`0.6.0-alpha.1`). Run:
  > ```
- > uv tool install --force evo-hq-cli==0.5.2
+ > uv tool install --force evo-hq-cli==0.6.0-alpha.1
  > ```
  > Then re-invoke this skill.
  3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
- > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.2` (or `pipx install evo-hq-cli==0.5.2`). Then re-invoke this skill.
+ > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.6.0-alpha.1` (or `pipx install evo-hq-cli==0.6.0-alpha.1`). Then re-invoke this skill.

  Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
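
The three outcomes in this hunk reduce to a small classification over whatever `evo --version` prints. A minimal JavaScript sketch, assuming the caller has already captured the command's stdout (or `null` when the shell reported `command not found`). `classifyEvoVersion` and its return labels are illustrative names, not part of the evo CLI:

```javascript
// Hypothetical helper (not an evo API): classify `evo --version` output
// against the version the skill is pinned to.
// `output` is the captured stdout, or null when the command was not found.
function classifyEvoVersion(output, expected) {
  if (output === null) return "not-installed";
  const line = output.trim();
  // Outcome 1: exact match, continue with the skill.
  if (line === `evo-hq-cli ${expected}`) return "match";
  // Outcome 2: right package, wrong version, so bundle and CLI have drifted.
  if (line.startsWith("evo-hq-cli ")) return "version-drift";
  // Outcome 3: a different package answered (e.g. the unrelated `evo 1.x` tool).
  return "not-installed";
}

console.log(classifyEvoVersion("evo-hq-cli 0.6.0-alpha.1", "0.6.0-alpha.1")); // match
console.log(classifyEvoVersion("evo-hq-cli 0.4.2", "0.6.0-alpha.1"));         // version-drift
console.log(classifyEvoVersion("evo 1.x", "0.6.0-alpha.1"));                  // not-installed
```

In every non-matching case the skill text asks the agent to stop and hand the install command to the user rather than installing anything itself.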
 
@@ -519,7 +522,39 @@ Document:
  - Benchmark gaming risks identified during the Goodhart check
  - Future experiment candidates (the non-picked dimensions from step 3)

- ## 13. Report to the user
+ ## 13. Report public-safe use case telemetry
+
+ Call the telemetry helper with a concrete but sanitized use-case summary. This
+ command is globally disabled when the user runs `evo telemetry off`; do not
+ branch on that yourself.
+
+ Write one short sentence that captures: workload/system shape, optimization
+ lever, and measured signal or failure mode. Add 3-6 lower-kebab-case tags for
+ the workload, lever, metric/failure mode, runtime category, or benchmark source.
+ Avoid filler-only tags like `optimization`, `benchmark`, or `codebase`.
+
+ ```bash
+ evo telemetry usecase \
+ --description "<public-safe, concrete use case summary>" \
+ --tag <tag> --tag <tag> --tag <tag>
+ ```
+
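The tag rules in the added text (3-6 tags, lower-kebab-case, no filler-only tags) are easy to check before calling the command. The sketch below is an assumption-laden illustration: `checkUsecaseTags`, the filler list (taken from the three examples named above), and the exact kebab-case pattern are hypothetical, and nothing here implies the CLI performs the same validation.

```javascript
// Hypothetical pre-flight check for `evo telemetry usecase` tags.
// The filler list holds only the three examples the skill text names.
const FILLER_TAGS = new Set(["optimization", "benchmark", "codebase"]);
// Assumed reading of "lower-kebab-case": lowercase alphanumerics joined by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function checkUsecaseTags(tags) {
  const problems = [];
  if (tags.length < 3 || tags.length > 6) {
    problems.push(`expected 3-6 tags, got ${tags.length}`);
  }
  for (const tag of tags) {
    if (!KEBAB_CASE.test(tag)) problems.push(`not lower-kebab-case: ${tag}`);
    if (FILLER_TAGS.has(tag)) problems.push(`filler-only tag: ${tag}`);
  }
  return problems; // an empty array means the tags follow the stated rules
}

console.log(checkUsecaseTags(["coding-agent", "tool-calling", "retry-policy"]));
console.log(checkUsecaseTags(["benchmark", "Fast_API"]));
```
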
+ Privacy rule: do not include repo names, company names, customer names, file
+ paths, benchmark commands, prompt text, task examples, secrets, URLs, internal
+ system names, environment names, raw error logs, or exact dataset/item names.
+ Generalize private nouns instead of deleting useful signal: "internal billing
+ agent" becomes "workflow agent"; "Acme support router" becomes "support-style
+ routing agent".
+
+ Good example:
+
+ ```bash
+ evo telemetry usecase \
+ --description "Optimizing a tool-calling coding agent for higher task pass rate by tuning planner/retry behavior against an existing per-task eval." \
+ --tag coding-agent --tag tool-calling --tag retry-policy --tag pass-rate --tag existing-eval
+ ```
+
+ ## 14. Report to the user

  End the skill by reporting in chat:
 
@@ -1,7 +1,7 @@
  ---
  name: infra-setup
  description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
- evo_version: 0.5.2
+ evo_version: 0.6.0-alpha.1
  ---

  # Infra Setup
@@ -2,7 +2,7 @@
  name: optimize
  description: Drive structured autoresearch iteration after evo:discover and the baseline commit -- scan-subagent cross-cutting analysis between rounds, frontier-based parent selection, ideator dispatch on stall, verifier pre/post hooks, annotation discipline. Width is set via subagents=N (1 for serial workloads, larger for parallel); the loop's structural value applies at any width.
  argument-hint: "[subagents=N] [budget=N] [stall=N]"
- evo_version: 0.5.2
+ evo_version: 0.6.0-alpha.1
  ---

  Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
@@ -102,17 +102,17 @@ evo defaults get subagents-only --json
 
  As your **very first actions, before the loop**, resolve each and arm it: run `evo autonomous on` / `evo subagents-only on` when it resolves on, or `evo autonomous off` / `evo subagents-only off` when an explicit instruction or stored default turned it off. If a behavior resolves off — whether from the user's instruction this run or a stored default — say so in your opening message (e.g. "autonomous off — running one round at a time, as you asked") so it's never invisible.
 
- **Orchestrator driver.** evo drives the loop two ways: a deterministic **dynamic workflow** (Claude Code only) or the **prose loop** below (every host). **On Claude Code the workflow is the DEFAULT use it whenever it's available.** Resolve which as part of your very first actions:
+ **Orchestrator driver.** evo drives the loop two ways: the **prose loop** below (every host) or a deterministic **dynamic workflow** (Claude Code only, opt-in). **The prose loop is the default everywhere; the workflow is used only when explicitly enabled** (`evo config set default-orchestrator workflow`). Resolve which as part of your very first actions:
 
  1. `evo host show` — the workflow driver requires `claude-code`. If it prints `<not set>` (a pre-host workspace), determine your actual runtime from your own context (system prompt, env such as `CLAUDECODE=1`, self-identity): **only if you are genuinely Claude Code**, do the one-time host migration now (`evo host set claude-code`) and continue; if you are any other runtime, do NOT stamp the host here — leave it for Step 0.1 and use the prose loop.
- 2. `evo config get default-orchestrator` — `prose` is an explicit **opt-out** (honor it: use the prose loop). `workflow` **or unset** resolves to the workflow driver on Claude Code. An explicit user instruction this run still wins.
+ 2. `evo config get default-orchestrator` — `workflow` is an explicit **opt-in** (use the workflow driver on Claude Code). `prose` **or unset** resolves to the prose loop. An explicit user instruction this run still wins.
 
- **Use the workflow** when host is `claude-code`, the value is not explicitly `prose`, AND the **Workflow tool is actually present in your available tools this session** — this is the default path, not opt-in. The availability check is load-bearing: **older Claude Code builds do not ship the Workflow tool**, so verify it's really in your toolset; do not assume it exists from the host alone. When (and only when) you will actually launch it, FIRST persist the choice so the rest of evo agrees (`evo config get` reflects it, and the autonomous stop-nudge auto-suppresses under the workflow): run `evo config set default-orchestrator workflow`. Then launch it once do NOT drive the loop turn-by-turn:
+ **Use the workflow** only when `default-orchestrator` is explicitly `workflow`, host is `claude-code`, AND the **Workflow tool is actually present in your available tools this session** — it is opt-in, never the default. The availability check is load-bearing: **older Claude Code builds do not ship the Workflow tool**, so verify it's really in your toolset; do not assume it exists from the host alone. Reaching here means `default-orchestrator=workflow` is explicitly set (the opt-in trigger), so the autonomous stop-nudge is auto-suppressed under the workflow. Launch it once, do NOT drive the loop turn-by-turn:
 
  - Call the **Workflow** tool with `scriptPath: ${CLAUDE_PLUGIN_ROOT}/skills/optimize/workflows/evo-optimize.js` and `args: {pluginRoot: "${CLAUDE_PLUGIN_ROOT}", subagents: <N>, budget: <N>, stall: <N>}`, using the round sizing you resolved above. **Pass all four keys explicitly — never omit one.** For `stall`, use the user's `/optimize stall=N` override if given, else the default 5. (The workflow's stop condition is the stall limit, so a dropped `stall` silently reverts it to 5.)
  - Report the returned `runId` and tell the user to watch progress with `/workflows`. The workflow runs the round loop itself (orient → mandatory scan + cross-history axis check → ideators on stall/periodic → briefs → fan-out + verify → collect → frontier-select → stall) plus the concurrent meta controller; you do **not** execute "The Loop" section below, and you do **not** need autonomous mode (the workflow self-drives; its stall limit is the stop).
 
- Use **The Loop** below only when the workflow can't drive: host is not `claude-code`, `default-orchestrator` is explicitly `prose`, or the Workflow tool is unavailable (e.g. an older Claude Code build). The workflow is only an execution strategy over the same `evo` CLI; gates, frontier, dashboard, and recovery are identical either way.
+ Use **The Loop** below by default — it is the prose driver on every host, and the path whenever the workflow is not explicitly enabled (`default-orchestrator` unset or `prose`), the host is not `claude-code`, or the Workflow tool is unavailable (e.g. an older Claude Code build). The workflow is only an execution strategy over the same `evo` CLI; gates, frontier, dashboard, and recovery are identical either way.
 
  **Reconcile config when you fall back to prose.** The stop-nudge that drives the prose loop is auto-suppressed whenever `default-orchestrator` is `workflow`. So if you fall back to the prose loop on Claude Code because the Workflow tool isn't available (older build) while `default-orchestrator` is still `workflow` from a prior run, you MUST set it back — `evo config set default-orchestrator prose` — and arm autonomous as usual. Otherwise the prose loop's stop-nudge stays suppressed and the run stalls after one round. Invariant to preserve: `default-orchestrator=workflow` in config iff the workflow is actually the driver this run.
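
Read together, the added lines of this hunk amount to a three-condition gate plus a config reconciliation step. The sketch below is one way to express that, assuming the three inputs have already been gathered from `evo host show`, `evo config get default-orchestrator`, and an inspection of the session's toolset. `resolveDriver` and its field names are hypothetical, and an explicit user instruction for the run would still override the result:

```javascript
// Hypothetical sketch of the 0.6.0-alpha.1 driver decision; not evo source code.
function resolveDriver({ host, defaultOrchestrator, workflowToolAvailable }) {
  // The workflow is opt-in: all three conditions must hold.
  const useWorkflow =
    defaultOrchestrator === "workflow" &&
    host === "claude-code" &&
    workflowToolAvailable === true;
  if (useWorkflow) return { driver: "workflow", reconcile: null };

  // Falling back to prose while config still says `workflow` would leave the
  // stop-nudge suppressed, so the stored value has to be set back to `prose`.
  const reconcile =
    defaultOrchestrator === "workflow"
      ? "evo config set default-orchestrator prose"
      : null;
  return { driver: "prose", reconcile };
}

console.log(resolveDriver({ host: "claude-code", defaultOrchestrator: "workflow", workflowToolAvailable: false }));
```

The `reconcile` field carries the invariant stated above: config says `workflow` only when the workflow is actually the driver for this run.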
 
@@ -331,16 +331,18 @@ After all subagents complete:
 
  **Cross-cut the round's evaluated nodes.** Before moving on, read `experiments/<id>/attempts/NNN/outcome.json` for each evaluated node from this round. The structured `gates[]` entries and `benchmark.result` let you spot shared failure modes the subagent summaries may have glossed over (e.g., three different subagents produced evaluated nodes whose gate_failures all included `refund_flow` -- that's a structural constraint the next round must confront, not three independent bad hypotheses).

- Prune dead branches where 3+ children all regressed:
+ Prune branches you have decided are exhausted:
  ```bash
- evo prune <exp_id> --reason "exhausted: N children all regressed"
+ evo prune <exp_id> --exhausted --reason "exhausted: <why>"
  ```

- `evo prune` accepts `committed` or `evaluated` nodes. Use it when you want
- to mark a lineage exhausted while preserving the result for later review or
- reference. Prune keeps the git commit alive (anchored at `refs/evo-anchor/<run>/<exp>`)
- so the node can be restored if needed. **Never `evo discard` a committed
- node** — it would orphan the branch ref and risk losing the commit.
+ `evo prune` accepts `committed` or `evaluated` nodes:
+ - `--exhausted` (default/legacy): stop branching here; the result still counts.
+ - `--invalid`: this result is wrong; exclude it and descendants from best/frontier.
+ - `--yes`: required only with `--invalid` on the current best valid spine.
+
+ Use `--invalid --yes` when you have proven a best-spine result or ancestor is
+ wrong. **Never `evo discard` a committed node** -- prune preserves its commit.

  If a previously-pruned (or discarded-then-restored) node is worth revisiting:
  ```bash
@@ -444,18 +446,49 @@ Proposals are advisory, not mandatory. If none look better than what step 3's sc
  - User hasn't interrupted
  - Score hasn't reached the theoretical maximum

+ To continue, go back to step 1.
+
  **Stop** if:
  - Stall counter >= stall limit (N consecutive rounds with no improvement)
  - Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
  - User interrupted
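
The stop list above can be read as a single predicate evaluated at the end of each round. A minimal sketch, assuming a loop-state object whose field names (`stallCounter`, `stallLimit`, `bestScore`, `direction`, `userInterrupted`) are invented for this illustration:

```javascript
// Hypothetical sketch of the end-of-round stop check; not evo source code.
// Returns a reason string when the loop should stop, or null to go back to step 1.
function shouldStop({ stallCounter, stallLimit, bestScore, direction, userInterrupted }) {
  if (userInterrupted) return "user-interrupted";
  // Theoretical best is 1.0 for a max metric and 0.0 for a min metric.
  const reachedBest = direction === "max" ? bestScore >= 1.0 : bestScore <= 0.0;
  if (reachedBest) return "theoretical-maximum";
  // N consecutive rounds with no improvement.
  if (stallCounter >= stallLimit) return "stall-limit";
  return null;
}

console.log(shouldStop({ stallCounter: 5, stallLimit: 5, bestScore: 0.7, direction: "max", userInterrupted: false })); // stall-limit
```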
 
- On stop, print a final summary:
+ On stop, the loop is done -- do not go back to step 1. Print a final summary:
  - Best score achieved and experiment ID
  - Total experiments run across all rounds
  - The winning diff: `evo diff <best_exp_id>`
  - Suggested next steps if the score hasn't converged
+ - **How to land it:** point the user at `/evo:ship` to turn the winning
+ experiment into a mergeable change (a PR when the repo has a remote, else a
+ merge into the working branch). The raw winning diff is not the artifact to
+ merge by hand -- `/evo:ship` distills it to the minimal clean change and
+ attaches a mergeability report. Mention this whenever the run produced a
+ committed experiment that beats the baseline.
+
+ ### 8. Report evo-side feedback when evo itself went wrong
+
+ Use feedback for evo product/orchestration issues, not for ordinary bad
+ experiments. Good triggers: confusing CLI behavior, subagent handoff failures,
+ workflow/meta-controller mistakes, recovery that failed, remote backend handling
+ that was hard to diagnose, or an evo policy/gate that blocked the wrong thing.
+ The command is anonymous and no-ops when telemetry is off; do not branch on that.
+
+ Keep the report public-safe but actionable enough for the evo backend team to
+ reproduce the case. Include the phase, what you expected evo to do, what it did,
+ and a generic repro shape. Do not include repo names, company names, file paths,
+ commands, prompt text, raw logs, URLs, secrets, dataset names, or exact task
+ examples.

- Go back to step 1.
+ ```bash
+ evo telemetry feedback \
+ --kind orchestration \
+ --phase optimize \
+ --summary "Subagent completed but the orchestrator did not collect its evaluated result before writing the next round briefs." \
+ --expected "Round collection should wait for all dispatched subagents or clearly report the missing one." \
+ --actual "The next round was planned from partial results, leaving one evaluated node out of the pattern scan." \
+ --repro "Run optimize with multiple subagents on a workspace where one branch finishes after the first completion notification." \
+ --tag optimize-loop --tag subagent-handoff --tag collection
+ ```

  ## Polling discipline
 
@@ -553,6 +553,7 @@ function metaPrompt(ctx, intervalS, reported, journal) {
  '- Dead direction / ignored mechanism: annotations repeatedly naming a mechanism the recent work ignores, or a direction that keeps regressing. BRIEF HINT.',
  '- Heading toward failure (STOP): an in-flight experiment that is CLEARLY doomed or wasting the budget — a divergent / NaN / flatlined progress metric; projected completion beyond the remaining time budget; or a known-fatal signature (e.g. output the scorer cannot parse; a silent resource mis-placement that tanks throughput with no error; a corrupt input/format that invalidates the result). HIGH PRECISION ONLY: default to NOT stopping — recommend a STOP only with concrete evidence that finishing is wasted, and only for an experiment still `active`. Emit a stop with: expId; failureClass (build = the build/produce step is broken; eval = artifact is fine but scoring/serving is wrong; hypothesis = it runs but won\'t help); reason (the diagnosis + the evidence you saw); fixHint (what the NEXT experiment must change).',
  'For STOPs you stay READ-ONLY: do NOT run `evo abort` / `evo discard` yourself. A gated enforcer acts on each stop — it aborts the run + its subprocess tree, annotates your diagnosis (so it outlives the worktree and feeds the next round), and discards with the failureClass so the partial artifact is preserved. A STOP is a diagnosed, recoverable stop, never a silent kill.',
+ 'If you observe an evo workflow/meta-controller defect (missed collection, wrong prompt handoff, recovery confusion, bad stop/enforcer behavior), you MAY run `evo telemetry feedback --kind workflow --phase meta ...` with public-safe summary/expected/actual/repro/tags before returning. This is anonymous and no-ops when telemetry is off. Do NOT report ordinary bad experiments, raw logs, paths, commands, repo names, URLs, secrets, or prompt text.',
  '',
  'HARNESS CONTROL (your distinctive power): you may restructure the optimize workflow itself, live, when you judge it will help — edits apply directly (free will) and take effect next round. Current harness state: ' + JSON.stringify(harnessSummary()) + '.',
  'harnessEdits ops: (1) set-knob {knob: width|budget|stall|ideateEvery|ideateStall, value} — retune the loop (widen the round, deepen branches, change the stall limit or ideation cadence). (2) toggle-phase {phaseName: scan|ideate, enabled} — turn a phase off/on (e.g. skip scan when traces are uninformative; force ideation early). (3) set-prompt {target: state|scan|aggregate|brief|implement|run|ideator|collect|preverify|audit, mode: append|replace, text} — edit the prompt that step uses. Appends ACCUMULATE as standing directives (the current ones are visible in the harness state above — do not re-add them); replace swaps the base wholesale. Use preverify/audit to harden the verifier when you spot a cheat pattern the audit missed. (4) inject-step {at: before-scan|after-scan|before-brief|after-collect, text, label} — add an extra agent step at that seam each round. Every edit needs a rationale citing the evidence.',
@@ -1,7 +1,7 @@
  ---
  name: report
  description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
- evo_version: 0.5.2
+ evo_version: 0.6.0-alpha.1
  ---

  # Report
@@ -13,10 +13,10 @@ Render the dashboard's scatter plot as a colored terminal block, one chart per r
  Mirrors the web dashboard's score scatter (left rail of `evo dashboard`):

  - X = experiment creation order, Y = score
- - Dot color by status: green = committed, red = failed, purple = active, grey = pending / evaluated / discarded / pruned
- - ★ marks the current best committed experiment
+ - Dot color by status: green = committed valid result, red = failed, purple = active, grey = pending / evaluated / discarded / pruned
+ - ★ marks the current best valid committed-result experiment. `pruned` with `prune_kind=exhausted` can still be best; `prune_kind=invalid` and its descendants cannot.
  - Yellow ring on dots that sit on the best-path spine (root → best)
- - Yellow stair line traces cumulative-best across committed experiments
+ - Yellow stair line traces cumulative-best across valid committed-result experiments
  - ○ at the baseline for experiments that have no score yet (active / pending)

  Every run in the workspace is rendered, stacked top-to-bottom, with a header line showing `run_id · target · metric`.
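
The "cumulative-best stair" mentioned in this hunk is a running best over the valid results in creation order. The sketch below shows that computation only; how the actual renderer builds the line is not shown in the diff, and `cumulativeBest`, the `{ score }` point shape, and the `direction` parameter are assumptions for illustration.

```javascript
// Hypothetical sketch: running best over valid committed-result experiments.
// `points` must already be filtered to valid results and sorted by creation order.
function cumulativeBest(points, direction = "max") {
  const stair = [];
  let best = null;
  for (const p of points) {
    const improves =
      best === null || (direction === "max" ? p.score > best : p.score < best);
    if (improves) best = p.score;
    stair.push(best); // the stair stays flat until a new best appears
  }
  return stair;
}

console.log(cumulativeBest([{ score: 0.4 }, { score: 0.6 }, { score: 0.5 }, { score: 0.7 }]));
// [ 0.4, 0.6, 0.6, 0.7 ]
```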
@@ -0,0 +1,140 @@
+ ---
+ name: ship
+ description: Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.
+ evo_version: 0.6.0-alpha.1
+ ---
+
+ # Ship
+
+ Turn a finished evo run into a change a maintainer would merge.
+
+ The optimize loop leaves a tree of committed experiments. The winning worktree
+ diff is not mergeable as-is: it carries debug prints, search-process churn,
+ over-broad edits, and sometimes a test that was relaxed to clear a gate. Shipping
+ is the step that re-derives the *minimal clean change* reproducing the winning
+ behaviour, lands it the way the repo expects (PR or merge), and reports how
+ mergeable it is.
+
+ Correctness is the floor, not the goal. The score says the behaviour works; this
+ skill decides whether the *diff* is fit to merge.
+
+ ## Invocation
+
+ ```bash
+ /evo:ship # ship the auto-selected winner
+ /evo:ship exp_0042 # ship a specific experiment instead
+ ```
+
+ ## Stage 1 -- Select the winner
+
+ Pick the experiment to ship, then confirm it with the user before touching their
+ tree.
+
+ ```bash
+ evo status # current best valid score + counts
+ evo report # top valid experiments table + score chart
+ ```
+
+ - The default winner is the highest-scoring valid result in the graph history,
+ not the frontier. `evo frontier` is for choosing where to branch next; it can
+ exclude an exhausted branch whose score is still the right thing to ship. An
+ explicit `exp_id` argument overrides auto-selection.
+ - A shippable winner must be valid: `committed`, or `pruned` with
+ `prune_kind=exhausted`, with a commit and score, no `gate_result === false`,
+ and no invalid-pruned ancestor. Never select `discarded`, `failed`, `active`,
+ `evaluated`, legacy-pruned nodes with no `prune_kind`, `prune_kind=invalid`, or
+ descendants of invalid-pruned nodes. If no valid candidate exists, stop and
+ report why nothing is safe to ship.
+ - Resolve the run's root (baseline) node, then show the cumulative change:
+ ```bash
+ evo diff <root_id> <winner_id> # target-scoped cumulative diff, baseline -> winner
+ ```
+ For changes outside the benchmark target, diff the commits directly
+ (`git diff <baseline_commit> <winner_commit>`); each node carries `.commit`.
+ - Present a one-screen summary: winner id, score baseline -> winner (delta),
+ the winning hypothesis, and a diffstat. Get a go before proceeding.
+
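The validity rules in Stage 1 can be stated as a filter over the experiment graph followed by a best-score pick. The sketch below assumes a node shape of `{ id, parent, status, prune_kind, commit, score, gate_result }`; only `status`, `prune_kind`, `gate_result`, and `.commit` are named in the skill text, so `parent`, the `Map`-based lookup, the `direction` parameter, and all function names are hypothetical.

```javascript
// Hypothetical sketch of Stage 1 winner selection; not evo source code.
function isInvalidPruned(node) {
  return node.status === "pruned" && node.prune_kind === "invalid";
}

// Walk parent links up to the root looking for an invalid-pruned ancestor.
function hasInvalidAncestor(node, byId) {
  for (let p = byId.get(node.parent); p; p = byId.get(p.parent)) {
    if (isInvalidPruned(p)) return true;
  }
  return false;
}

function isShippable(node, byId) {
  // Legacy-pruned nodes (no prune_kind) fail this check on purpose.
  const statusOk =
    node.status === "committed" ||
    (node.status === "pruned" && node.prune_kind === "exhausted");
  return (
    statusOk &&
    Boolean(node.commit) &&
    typeof node.score === "number" &&
    node.gate_result !== false &&
    !hasInvalidAncestor(node, byId)
  );
}

// Highest-scoring valid result in the whole graph history, not the frontier.
function pickWinner(nodes, direction = "max") {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  const better = (a, b) => (direction === "max" ? a.score > b.score : a.score < b.score);
  let best = null;
  for (const n of nodes) {
    if (isShippable(n, byId) && (best === null || better(n, best))) best = n;
  }
  return best; // null means nothing is safe to ship
}
```

An explicit `exp_id` argument would bypass `pickWinner`, though running `isShippable` on the chosen node still seems prudent given the guardrails later in the skill.
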
+ ## Stage 2 -- Distill to a mergeable change
+
+ Work on a fresh branch off the user's current HEAD, not in the experiment
+ worktree. Re-derive the change so it stands on its own:
+
+ - **Scope restraint.** Keep only the files and lines the behaviour needs. Drop
+ experiment scaffolding, debug logging, commented-out attempts, and churn the
+ search introduced and then abandoned. Smaller, local diffs merge; sprawl does
+ not.
+ - **Test integrity.** If the search weakened, skipped, or deleted a test to clear
+ a gate, restore it. New behaviour that changes outputs needs a test that
+ covers it. Never ship a green benchmark that rode on a loosened test -- call it
+ out instead.
+ - **Mechanical cleanliness.** Match the repo's formatter and linter. No stray
+ whitespace, no reordered imports unless the repo does that.
+ - **Codebase adherence.** Match surrounding naming, error handling, and structure.
+ The diff should read like the file it lands in.
+
+ Then confirm the behaviour survived the distillation:
+
+ ```bash
+ evo run <winner_id> --check # or the project's benchmark / test command
+ ```
+
+ If the distilled change no longer reproduces the winning score, do not paper over
+ it -- report the gap (which part of the experiment diff was load-bearing) and let
+ the user decide. Best-effort means honest about what could not be cleaned up, not
+ silently shipping the raw worktree.
+
+ ## Stage 3 -- Land
+
+ Detect how the repo expects changes to arrive:
+
+ ```bash
+ git remote -v
+ ```
+
+ - **Remote present** -> open a pull request. Commit the distilled change on its
+ branch, push, and `gh pr create` with the mergeability report (Stage 4) as the
+ body. Do not push or open the PR without the user's go.
+ - **No remote** -> merge the distilled change into the user's working branch as a
+ single clean commit. Do not force, do not rewrite existing history.
+
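The Stage 3 branch depends only on whether `git remote -v` printed anything, plus the user's approval before any push. A small sketch of that decision, assuming the command's stdout has already been captured as a string; `planLanding` and its return shape are hypothetical names for this illustration.

```javascript
// Hypothetical sketch of the Stage 3 decision; not evo source code.
// `gitRemoteOutput` is the captured stdout of `git remote -v`.
function planLanding(gitRemoteOutput, userApprovedPush) {
  const hasRemote = gitRemoteOutput
    .split("\n")
    .some((line) => line.trim() !== "");
  if (!hasRemote) {
    // No remote: merge into the working branch as a single clean commit.
    return { mode: "merge", blocked: false };
  }
  // Remote present: a PR, but never pushed or opened without the user's go.
  return { mode: "pull-request", blocked: !userApprovedPush };
}

console.log(planLanding("", false)); // { mode: 'merge', blocked: false }
```
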
+ The landed commit message carries provenance: the winning experiment id, the
+ score delta, and the one-line hypothesis. State what changed and why it is safe;
+ do not narrate the search process.
+
+ ## Stage 4 -- Mergeability report (advisory)
+
+ Always produce the report. It never blocks the merge -- it tells the user, and a
+ future reviewer, how mergeable the change is across the axes a maintainer judges
+ on:
+
+ - **Technique** -- what the change actually does to move the score, named
+ concretely (the algorithm, data structure, or mechanism), not the search
+ story. Distilled from the winning hypothesis: "replaced the O(n^2) dedup with a
+ hash set", not "exp_0042 improved throughput". This is what a reviewer reads
+ first.
+ - **Behavioural correctness** -- score baseline -> shipped (delta); benchmark
+ status after distillation.
+ - **Regression safety** -- full test suite result on the distilled change.
+ - **Scope** -- files touched, diff size, whether the change stays local.
+ - **Test correctness** -- explicit yes/no on whether any test was modified,
+ weakened, or removed, with detail; whether new behaviour is covered.
+ - **Mechanical cleanliness** -- formatter / linter status.
+ - **Codebase adherence** -- a note on style/convention fit.
+
+ Lead with a plain-language summary: what changed and why it is safe to merge. On
+ a remote repo this is the PR body. With no remote, print it and save it alongside
+ the run so the user can paste it into a review later.
+
+ ## Guardrails (firm)
+
+ Everything above is method you can adapt to the repo. These are not:
+
+ - Never weaken, skip, or delete a test to make the change land. If the experiment
+ did, restore it and report it.
+ - Never ship invalid-pruned, legacy-pruned, discarded, failed, active,
+ evaluated, gate-failed, or invalid-lineage nodes. Only exhausted pruned nodes
+ remain normal ship candidates.
+ - Never push or open a PR without the user's explicit go.
+ - Never rewrite or force-overwrite existing history on the user's branch.
+ - Never ship the raw experiment worktree diff as-is when distillation failed --
+ report the gap instead.
@@ -1,7 +1,7 @@
  ---
  name: subagent
  description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
- evo_version: 0.5.2
+ evo_version: 0.6.0-alpha.1
  ---

  # Evo Subagent Protocol
@@ -335,6 +335,29 @@ Continue if budget remains AND (last outcome was committed, OR you have a meanin
 
  Stop if budget exhausted, infra failure, or you've exhausted variations with no improvement.

+ ### 9. Report evo-side feedback when evo itself blocked you
+
+ Use feedback for evo product/orchestration issues, not for ordinary bad
+ experiments. Good triggers: confusing `evo run` behavior, remote workspace
+ commands that were hard to recover from, verifier handoff problems, missing
+ state that prevented you from following the brief, or a policy/gate that blocked
+ the wrong thing. The command is anonymous and no-ops when telemetry is off.
+
+ Keep it public-safe and reproducible. Do not include repo names, company names,
+ file paths, commands, prompt text, raw logs, URLs, secrets, dataset names, or
+ exact task examples.
+
+ ```bash
+ evo telemetry feedback \
+ --kind orchestration \
+ --phase subagent \
+ --summary "Remote experiment recovery was unclear after the run process detached." \
+ --expected "Re-running evo run should clearly say whether it resumed or needs a fresh experiment." \
+ --actual "The subagent could not tell whether the active attempt was recoverable." \
+ --repro "Use a remote backend, interrupt a long-running experiment, then ask a subagent to continue the same exp_id." \
+ --tag remote-backend --tag recovery --tag subagent
+ ```
+
  ## Enriching traces

  Check `.evo/meta.json` for `"instrumentation_mode"` (`"sdk"` or `"inline"`) to see which style the benchmark uses -- **stay consistent with that choice across iterations; do not flip styles mid-run.**
@@ -357,7 +380,8 @@ If your experiment needs an artifact that is slow to produce and stable across s
 
  - Do NOT run `evo init` or `evo reset`
  - `evo discard <your_exp_id> --reason "..."` is your explicit "abandon" action — use it for any *non-committed* node you've decided not to pursue further (pre-run realization, evaluated with a bad hypothesis, or unfixable infra failure). Discard deletes the worktree and branch; the node and its per-attempt artifacts stay in `.evo/` as a record of what was tried.
- - If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune`.
+ - If `evo discard` errors with **"cannot discard committed node ... use prune"** — the experiment cleared the gate and improved the score. You shouldn't be discarding it. Don't fight the error; the orchestrator owns committed-lineage decisions via `evo prune --exhausted` or `evo prune --invalid`.
+ - If you discover a committed result is invalid (bad score calculation, benchmark gaming, weakened/deleted tests, failed inherited gate, or failed repro), report the exact node and evidence. Do not discard it; the orchestrator should mark it with `evo prune <id> --invalid --reason "..."` and add `--yes` only if evo says the node is on the current best spine.
  - If `evo discard` errors with **"cannot discard active node ... pass --force"** — the run is still in flight. Wait for it to finish; don't `--force` unless you know what you're doing (the running process can still write a final outcome that contradicts the discard).
  - If `evo discard` errors with **"cannot discard ... has non-discarded children"** — sibling/child experiments depend on this node's parent reference. Discard or commit-and-prune those first.
  - Do NOT copy `.env` files, bake secrets into source, or hard-code local runtime paths. Runtime setup/env is configured by the orchestrator (`evo config runtime ...`, `evo env ...`) and injected into benchmark/gate processes. If a missing dependency, setup step, or key blocks evaluation, report setup failure.