npm - @evo-hq/pi-evo - Versions diffs - 0.5.0-alpha.5 → 0.5.0-alpha.7 - Mend

@evo-hq/pi-evo 0.5.0-alpha.5 → 0.5.0-alpha.7

Files changed (7)

package/package.json +1 -1
package/skills/discover/SKILL.md +79 -5
package/skills/infra-setup/SKILL.md +1 -1
package/skills/optimize/SKILL.md +36 -2
package/skills/report/SKILL.md +1 -1
package/skills/subagent/SKILL.md +43 -2
/package/skills/{optimize → discover}/references/sizing-the-round.md +0 -0

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@evo-hq/pi-evo",
-  "version": "0.5.0-alpha.5",
+  "version": "0.5.0-alpha.7",
   "description": "Evo plugin for pi-coding-agent: optimize/discover/subagent skills + mid-run inject extension.",
   "publishConfig": {
     "access": "public"

package/skills/discover/SKILL.md CHANGED

@@ -2,13 +2,87 @@
 name: discover
 description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
 argument-hint: <optional context about what to optimize>
-evo_version: 0.5.0-alpha.5
+evo_version: 0.5.0-alpha.7
 ---
 # Discover
 Internal procedure for `evo:discover`. The user only sees the user-facing prompts, the dashboard URL, and the baseline score -- everything else is the agent's choreography.
+## Evo surface
+What you can invoke / dispatch / read. Each line is a triggering condition: if you're about to do X, pull/dispatch/read this. Don't preload -- act when the trigger fires.
+```
+evo plugin
+│
+├── Main thread  (the orchestrator -- you, inside /evo:discover or /evo:optimize)
+│   │
+│   ├── Skills (Skill tool)
+│   │   ├── evo:discover       starting a new evo workspace / instrumenting a project
+│   │   ├── evo:optimize       after discover commits the baseline -- drives the loop.
+│   │   │                      Args: subagents=N (read sizing-the-round FIRST),
+│   │   │                            autonomous, subagents-only, budget=N, stall=N
+│   │   ├── evo:finetuning     task is finetuning / post-training / training a model
+│   │   └── evo:infra-setup    need a remote backend, pooled workspaces, lease/slot
+│   │                          management, or specific provider auth/setup
+│   │
+│   └── Subagents to dispatch (Task tool, subagent_type=...)
+│       ├── evo:benchmark-reviewer  before the baseline run, or whenever the
+│       │                           benchmark command / harness changes
+│       └── evo:ideator             stalled, or every ~5 committed experiments.
+│                                   One subagent per brief:
+│                                   failure_analysis, literature, frontier_extrapolation
+│
+├── Subagent thread  (each subagent spawned by /optimize step 5)
+│   │
+│   ├── Skills  (the subagent loads this on first turn -- the brief's first
+│   │            sentence mandates it; not auto-loaded by the host)
+│   │   └── evo:subagent     load FIRST -- defines the iteration protocol
+│   │                        + brief field shape the subagent operates under
+│   │
+│   └── Subagents to dispatch (Task tool, subagent_type=...)
+│       └── evo:verifier      ALWAYS dispatch pre AND post every evo run.
+│                             Pre: ~30s static analysis before the experiment runs.
+│                             Post: result-validity audit after it commits.
+│                             Not optional. Not ad-hoc.
+│
+└── Key references (Read tool, on demand)
+    ├── discover/references/
+    │   ├── constructing-benchmark.md      designing + assembling a benchmark from scratch
+    │   ├── sdk_python.py / sdk_node.js    wiring per-task instrumentation -- preferred path
+    │   ├── inline_instrumentation.py      inline fallback when SDK can't be used.
+    │   │                                  Copy as-is; do not reimplement (file header
+    │   │                                  explains why)
+    │   ├── sizing-the-round.md            BEFORE invoking /evo:optimize with any
+    │   │                                  specific subagents=N. Single-GPU /
+    │   │                                  single-exclusive-resource -> subagents=1
+    │   ├── proposing-dimensions.md        choosing what to optimize when not obvious
+    │   └── instrumentation-contract.md    the format evo reads (result + traces shapes)
+    │
+    ├── finetuning/references/
+    │   ├── glue.md                         writing train.py -- I/O contract evo expects
+    │   ├── diagnostics.md                  per-failure-mode diagnostics
+    │   ├── false-progress.md               what doesn't count as improvement
+    │   ├── trace-schema.md                 per-task trace JSON schema for training runs
+    │   ├── rl/                             RL framework references
+    │   │   └── art.md                       ART (Algorithm-Refined Training)
+    │   ├── sft/                            SFT framework references
+    │   │   └── tinker.md                    Tinker SFT
+    │   └── serving/                        eval-time inference references
+    │       └── vllm.md                      vLLM serving config + LoRA-multi
+    │
+    ├── infra-setup/references/
+    │   └── provider-matrix.md              provider/backend summary (auth, setup, costs)
+    │
+    └── references/                          (shared across skills)
+        ├── evo-wait.md                      any time you need to wait without burning
+        │                                    context (subagent completion, training,
+        │                                    ideators, GPU activity, any long-running)
+        ├── agent-sdk-reference.md           SDK API surface
+        └── cli-quick-reference.md           CLI subcommand cheat sheet
+```
 ## Host conventions
 This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
@@ -40,20 +114,20 @@ evo --version
 The output must be exactly:
 ```
-evo-hq-cli 0.5.0-alpha.5
+evo-hq-cli 0.5.0-alpha.7
 ```
 Three outcomes:
 1. **Matches exactly** — continue to step 1.
 2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
-   > Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.5`). Run:
+   > Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.7`). Run:
    > ```
-   > uv tool install --force evo-hq-cli==0.5.0-alpha.5
+   > uv tool install --force evo-hq-cli==0.5.0-alpha.7
    > ```
    > Then re-invoke this skill.
 3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
-   > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.5` (or `pipx install evo-hq-cli==0.5.0-alpha.5`). Then re-invoke this skill.
+   > `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.7` (or `pipx install evo-hq-cli==0.5.0-alpha.7`). Then re-invoke this skill.
 Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
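
The three outcomes of the version check above can be sketched as a small classifier. This is a minimal illustration, not code shipped in the package: the function names and the outcome labels (`match`, `version_drift`, `not_installed`) are hypothetical, and stripping surrounding whitespace before the exact comparison is an assumption.

```python
import re
import subprocess

# Exact string the skill body requires from `evo --version`.
EXPECTED = "evo-hq-cli 0.5.0-alpha.7"


def read_version_output():
    """Run `evo --version`; return its stdout, or None if the command is not found."""
    try:
        result = subprocess.run(["evo", "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return result.stdout


def classify_version_output(output, expected=EXPECTED):
    """Map the command output to one of the three outcomes (labels are hypothetical)."""
    if output is None:
        return "not_installed"  # outcome 3: command not found
    line = output.strip()  # assumption: ignore the trailing newline
    if line == expected:
        return "match"  # outcome 1: continue to step 1
    if re.fullmatch(r"evo-hq-cli \S+", line):
        return "version_drift"  # outcome 2: same package, different version
    # Anything else (for example `evo 1.x`, the unrelated SLAM tool) is a
    # different package, which the skill treats like a missing CLI.
    return "not_installed"
```

A caller would pass `read_version_output()` to `classify_version_output` and, on anything other than `match`, stop and show the user the matching install instruction rather than installing automatically.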

package/skills/infra-setup/SKILL.md CHANGED

@@ -1,7 +1,7 @@
 ---
 name: infra-setup
 description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
-evo_version: 0.5.0-alpha.5
+evo_version: 0.5.0-alpha.7
 ---
 # Infra Setup
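
Every SKILL.md changed in this diff carries an `evo_version` field that is bumped in step with the `version` in `package/package.json`. A check along the following lines could catch a skill that was missed during a release. It is a sketch under stated assumptions: the helper name is hypothetical, the `skills/*/SKILL.md` layout is inferred from the paths in this diff, and the regex scans the whole file rather than parsing the frontmatter block strictly.

```python
import json
import re
from pathlib import Path

# Matches a line such as `evo_version: 0.5.0-alpha.7` anywhere in the file.
FRONTMATTER_VERSION = re.compile(r"^evo_version:\s*(\S+)\s*$", re.MULTILINE)


def find_version_mismatches(package_root):
    """Return {relative SKILL.md path: found version or None} for every skill
    whose evo_version differs from the version in package.json."""
    root = Path(package_root)
    expected = json.loads((root / "package.json").read_text())["version"]
    mismatches = {}
    for skill_file in sorted(root.glob("skills/*/SKILL.md")):
        match = FRONTMATTER_VERSION.search(skill_file.read_text())
        found = match.group(1) if match else None
        if found != expected:
            mismatches[skill_file.relative_to(root).as_posix()] = found
    return mismatches
```

Run against the unpacked `package/` directory, an empty result would mean every skill agrees with package.json.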

package/skills/optimize/SKILL.md CHANGED

@@ -2,13 +2,47 @@
 name: optimize
 description: Run the evo optimization loop with parallel subagents until interrupted.
 argument-hint: "[subagents=N] [budget=N] [stall=N]"
-evo_version: 0.5.0-alpha.5
+evo_version: 0.5.0-alpha.7
 ---
 Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
 **This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.
+## Evo surface -- loop-relevant
+You're inside `/evo:optimize`. Things you'll pull/dispatch during the loop:
+```
+main thread (you)
+├── Skills (Skill tool)
+│   └── evo:finetuning     before writing or changing any train.py
+│
+└── Subagents to dispatch (Task tool, subagent_type=...)
+    └── evo:ideator        stalled, or every ~5 committed experiments.
+                           One subagent per brief:
+                           failure_analysis, literature, frontier_extrapolation
+subagent thread (each subagent spawned by step 5)
+├── evo:subagent skill     loaded by the subagent on first turn -- the brief's
+│                          first sentence mandates it (not auto-loaded)
+└── evo:verifier subagent  MANDATORY pre AND post every evo run.
+                           Pre: ~30s static analysis before the experiment runs.
+                           Post: result-validity audit after it commits.
+references (Read tool, on demand)
+├── discover/references/sizing-the-round.md      pick subagents=N
+├── references/evo-wait.md                       waiting without burning context
+├── finetuning/references/glue.md                train.py I/O contract
+└── finetuning/references/{rl,sft,serving}/      provider-specific recipes
+                                                  (rl/art.md, sft/tinker.md,
+                                                  serving/vllm.md)
+```
+Full surface tree (orchestrator entry-point view, including benchmark-reviewer,
+infra-setup, and the complete references catalogue) lives in `evo:discover`'s
+"Evo surface" section.
 ## Host conventions
 This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
@@ -38,7 +72,7 @@ A user can override any of these with `/optimize [subagents=N] [budget=N] [stall
 **Picking `subagents` and `budget` is load-bearing -- do not skim.**
-Mandatory before the first round (and again any time the backend or benchmark changes): **READ `plugins/evo/skills/optimize/references/sizing-the-round.md` IN FULL.** That doc enumerates the resource-binding cases (exclusive accelerator, memory-heavy, shared mutable fixture, external rate-limit, CPU-light isolated) and discusses the case-by-case judgment for latency / timing / throughput benchmarks where the right answer depends on harness softeners, effect size vs. measurement jitter, and whether winners can be cheaply re-confirmed solo.
+Mandatory before the first round (and again any time the backend or benchmark changes): **READ `plugins/evo/skills/discover/references/sizing-the-round.md` IN FULL.** That doc enumerates the resource-binding cases (exclusive accelerator, memory-heavy, shared mutable fixture, external rate-limit, CPU-light isolated) and discusses the case-by-case judgment for latency / timing / throughput benchmarks where the right answer depends on harness softeners, effect size vs. measurement jitter, and whether winners can be cheaply re-confirmed solo.
 Under-subscribing wastes wall-clock. Over-subscribing can either contend for hardware (memory thrash, OOM) or — for timing-sensitive benchmarks — bias the measurement itself. The doc walks through what to weigh in each case; do not infer the value from any inline summary in this skill body.
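
The override syntax `/optimize [subagents=N] [budget=N] [stall=N]`, together with the bare `autonomous` and `subagents-only` arguments listed in the discover surface tree, could be parsed roughly as follows. This is a hypothetical helper for illustration only; how the host actually parses these arguments is not shown in the diff, and no range checks are applied because the source does not state any.

```python
# Integer-valued overrides named in the argument hint.
INT_KEYS = {"subagents", "budget", "stall"}
# Bare flags named in the discover surface tree.
FLAGS = {"autonomous", "subagents-only"}


def parse_optimize_args(arg_string):
    """Parse a string such as 'subagents=1 budget=5 autonomous' into a dict."""
    parsed = {}
    for token in arg_string.split():
        if token in FLAGS:
            parsed[token] = True
            continue
        key, sep, value = token.partition("=")
        if not sep or key not in INT_KEYS:
            raise ValueError(f"unrecognized argument: {token!r}")
        parsed[key] = int(value)  # raises ValueError when N is not an integer
    return parsed
```

For a workspace limited to a single GPU, `parse_optimize_args("subagents=1")` yields `{"subagents": 1}`. The value itself should still come from reading sizing-the-round.md, not from a parser default.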

package/skills/report/SKILL.md CHANGED

@@ -1,7 +1,7 @@
 ---
 name: report
 description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
-evo_version: 0.5.0-alpha.5
+evo_version: 0.5.0-alpha.7
 ---
 # Report

package/skills/subagent/SKILL.md CHANGED

@@ -1,11 +1,52 @@
 ---
 name: subagent
-description: Internal protocol for evo optimization subagents. Loaded by subagents spawned from /optimize via their host's skill loader. Not for orchestrator use.
-evo_version: 0.5.0-alpha.5
+description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
+evo_version: 0.5.0-alpha.7
 ---
 # Evo Subagent Protocol
+**Orchestrators reading for context**: this is the protocol your dispatched subagents follow. You don't act on it yourself -- write briefs that satisfy the four required fields described below, and rely on each spawned subagent to drive the loop on its end. Stop reading at "Host conventions" if you only need the brief shape; the rest is for the subagent.
+## Evo surface -- subagent perspective
+What you can pull/dispatch/read as a subagent. Each line is a triggering condition.
+```
+skills you may pull (Skill tool)
+└── evo:finetuning     before writing or changing any train.py
+subagents you dispatch (Task tool, subagent_type=...)
+└── evo:verifier       MANDATORY pre AND post every evo run.
+                       Pre: ~30s static analysis BEFORE the experiment runs
+                            (block on failure -- fix and retry).
+                       Post: result-validity audit AFTER it commits.
+references (Read tool, on demand)
+├── discover/references/
+│   ├── sdk_python.py / sdk_node.js     wiring per-task instrumentation -- preferred
+│   ├── inline_instrumentation.py       inline fallback. Copy as-is; do not reimplement
+│   └── instrumentation-contract.md     the format evo reads (result + traces shapes)
+│
+├── references/evo-wait.md              any time you need to wait -- training, eval,
+│                                       any long-running condition. Use this instead
+│                                       of `sleep N`; doesn't burn context.
+│
+└── finetuning/references/
+    ├── glue.md                          train.py I/O contract evo expects
+    ├── diagnostics.md                   per-failure-mode diagnostics
+    ├── false-progress.md                what doesn't count as improvement
+    ├── trace-schema.md                  per-task trace JSON schema
+    ├── rl/art.md                        ART (Algorithm-Refined Training)
+    ├── sft/tinker.md                    Tinker SFT
+    └── serving/vllm.md                  vLLM serving config + LoRA-multi
+```
+Orchestrator entry-point view (benchmark-reviewer, ideator, infra-setup, full
+references catalogue) lives in `evo:discover`'s "Evo surface" section.
+---
 You are an evo optimization subagent. The orchestrator has given you a **brief** with four fields:
 - **Objective** -- the bottleneck to attack and evidence for it (strategic, not edit-level)
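
The verifier rule in the subagent surface tree (a pre-run check that blocks on failure, then the experiment, then a post-run audit) amounts to a fixed control flow. The sketch below models it with plain callables. All four callables, the `ok` key on the pre-run report, and the `max_attempts` cap are assumptions made for illustration; in the skill itself the verifier is dispatched as an `evo:verifier` subagent, not called as a function.

```python
def run_with_verifier(pre_verify, run_experiment, post_verify, fix, max_attempts=3):
    """Run one experiment wrapped in the mandatory pre and post verifier steps."""
    for attempt in range(1, max_attempts + 1):
        pre_report = pre_verify()  # static analysis before the experiment runs
        if pre_report["ok"]:
            break
        fix(pre_report)  # block on failure: fix, then retry the pre-check
    else:
        raise RuntimeError("pre-run verification still failing; experiment not run")
    result = run_experiment()  # stands in for running and committing the experiment
    audit = post_verify(result)  # result-validity audit after the commit
    return {"result": result, "audit": audit, "pre_attempts": attempt}
```

The point of the shape is that `run_experiment` is unreachable without a passing pre-check, and a result is never returned without its audit attached.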

/package/skills/{optimize → discover}/references/sizing-the-round.md RENAMED

File without changes