npm - @evo-hq/pi-evo - Versions diffs - 0.4.4 → 0.5.0-alpha.10 - Mend

Files changed (12)

package/package.json +1 -1
package/skills/discover/SKILL.md +145 -9
package/skills/discover/references/inline_instrumentation.js +45 -0
package/skills/discover/references/inline_instrumentation.py +50 -0
package/skills/discover/references/sdk_node.js +19 -0
package/skills/discover/references/sdk_python.py +25 -0
package/skills/{optimize → discover}/references/sizing-the-round.md +8 -0
package/skills/infra-setup/SKILL.md +1 -2
package/skills/optimize/SKILL.md +167 -8
package/skills/optimize/workflows/evo-optimize.js +672 -0
package/skills/report/SKILL.md +1 -1
package/skills/subagent/SKILL.md +89 -7

package/skills/report/SKILL.md CHANGED

@@ -1,7 +1,7 @@
 ---
 name: report
 description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
-evo_version: 0.4.4
+evo_version: 0.5.0-alpha.10
 ---
 # Report

package/skills/subagent/SKILL.md CHANGED

@@ -1,11 +1,59 @@
 ---
 name: subagent
-description: Internal protocol for evo optimization subagents. Loaded by subagents spawned from /optimize via their host's skill loader. Not for orchestrator use.
-evo_version: 0.4.4
+description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
+evo_version: 0.5.0-alpha.10
 ---
 # Evo Subagent Protocol
+**Orchestrators reading for context**: this is the protocol your dispatched subagents follow. You don't act on it yourself -- write briefs that satisfy the four required fields described below, and rely on each spawned subagent to drive the loop on its end. Stop reading at "Host conventions" if you only need the brief shape; the rest is for the subagent.
+## Evo surface -- subagent perspective
+What you can pull/dispatch/read as a subagent. Each line is a triggering condition.
+```
+skills you may pull (Skill tool)
+└── evo:finetuning     before writing or changing any train.py -- technique
+                       choice, training recipe, observability, retry discipline.
+subagents you dispatch (Task tool, subagent_type=...)
+├── evo:verifier              MANDATORY pre AND post every `evo run`.
+│                             Pre: static analysis before the experiment runs
+│                                  (block on failure -- fix and retry).
+│                             Post: result-validity audit after it commits.
+└── evo:benchmark-reviewer    POST-COMMIT only, mode=review-experiment --
+                              per-task failure classification + annotations.
+                              Skip on evaluated/discarded/failed outcomes.
+references (Read tool, on demand)
+├── discover/references/
+│   ├── sdk_python.py / sdk_node.js     wiring per-task instrumentation -- preferred
+│   ├── inline_instrumentation.py       inline fallback. Copy as-is; do not reimplement
+│   └── instrumentation-contract.md     the format evo reads (result + traces shapes)
+│
+├── references/evo-wait.md              any time you need to wait -- training, eval,
+│                                       any long-running condition. Use this instead
+│                                       of `sleep N`; doesn't burn context.
+│
+└── finetuning/references/
+    ├── glue.md                          train.py I/O contract evo expects
+    ├── observability.md                 wandb/trackio/mlflow wiring -- env-driven
+    │                                    detection, TRL report_to options, custom-loop
+    │                                    patterns. Read when writing a training script.
+    ├── diagnostics.md                   per-failure-mode diagnostics
+    ├── false-progress.md                what doesn't count as improvement
+    ├── trace-schema.md                  per-task trace JSON schema
+    ├── rl/art.md                        ART (Algorithm-Refined Training)
+    ├── sft/tinker.md                    Tinker SFT
+    └── serving/vllm.md                  vLLM serving config + LoRA-multi
+```
+Orchestrator entry-point view (benchmark-reviewer, ideator, infra-setup, full
+references catalogue) lives in `evo:discover`'s "Evo surface" section.
+---
 You are an evo optimization subagent. The orchestrator has given you a **brief** with four fields:
 - **Objective** -- the bottleneck to attack and evidence for it (strategic, not edit-level)
@@ -160,7 +208,22 @@ For multi-line edits, `evo edit --json-stdin` reads `{"old":...,"new":...,"repla
 You may edit anything within the target scope. Do NOT modify benchmark, gate, or framework code.
-### 4. Run the experiment
+### 4. Verify the experiment design (pre-`evo run`)
+Before `evo run` burns compute, invoke the **evo verifier subagent** via your host's Task tool. Static analysis, ~30s.
+```
+Task(subagent_type="evo:verifier",
+     prompt="workspace=<workspace abs path>\nexperiment_id=<your exp_id>\nphase=pre")
+```
+The verifier checks for test-set leakage in your training data, subsetted eval commands, missing gates for new artifacts, generic hypotheses, and concurrent-resource conflicts. It returns a JSON report (`{passed, verdict, findings}`) and writes the same verdict as an `evo annotation` on the experiment. See `plugins/evo/agents/verifier.md` for the full check list.
+If the verifier returns `passed: false` (verdict `fail`), address every flagged `block` finding and re-invoke until it returns `passed: true`. Skipping or fudging a `fail` verdict is a stop-the-line bug -- the verdict is the precondition for compute spend.
+If the verifier returns verdict `warn`, you may proceed but address the warnings in your annotation (step 7).
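The pass/fail/warn handling described in this step can be sketched in Python. This is an illustration only, not the package's code. The diff confirms the report's top-level keys (`passed`, `verdict`, `findings`) and the `block` finding label. The per-finding `severity` key, the action names, and the function name `decide_next_step` are assumptions made for this example.

```python
import json


def decide_next_step(report_json: str) -> dict:
    """Map a verifier report to the action the protocol prescribes.

    Assumption: each entry in `findings` is a dict with a "severity" key.
    The source only confirms that some findings are flagged `block`.
    """
    report = json.loads(report_json)
    findings = report.get("findings", [])
    blocking = [f for f in findings if f.get("severity") == "block"]

    # A failed verdict blocks compute spend: fix every block finding, re-verify.
    if not report.get("passed", False):
        return {"action": "fix_and_reverify", "must_fix": blocking}

    # A warn verdict lets the run proceed, but the warnings go in the annotation.
    if report.get("verdict") == "warn":
        return {"action": "run_and_note_warnings", "notes": findings}

    return {"action": "run", "notes": []}
```

The function treats a missing `passed` key as a failure, so a malformed report never unblocks `evo run` by accident.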
+### 5. Run the experiment
 ```bash
 evo run <exp_id>
@@ -195,7 +258,7 @@ evo run <exp_id> --i-staged-new-files yes
 The ack flag is required when the worktree has any untracked, non-gitignored file. Without it, `evo run` errors closed and lists the files. For each file, decide: source (then `git add`) or warm state (leave untracked -- it persists in the slot for future experiments). Then re-run with `--i-staged-new-files yes`. The flag value must be exactly `yes`. In `commit_strategy=all` workspaces (default for `--backend worktree`) the flag is a silent no-op; safe to always pass.
-### 5. Analyze the result
+### 6. Analyze the result
 `evo run` prints one of three outcomes:
@@ -215,7 +278,20 @@ The ack flag is required when the worktree has any untracked, non-gitignored fil
   - Structural (benchmark broken, evo misconfigured): report to orchestrator and stop.
   - Not worth fixing: `evo discard <id> --reason "..."`.
-### 6. Annotate
+### 6b. Review your own failures (committed experiments only)
+After a `COMMITTED` outcome, before annotating yourself, spawn `evo:benchmark-reviewer` in review-experiment mode. It reads the per-task traces and the eval-runner log you just produced, classifies failures into a small taxonomy, and writes per-task annotations via `evo annotate <exp> --task K`. This is the data the next experiment's hypothesis is built on -- skip it and the orchestrator picks a frontier from `passed/failed` booleans with no diagnosis.
+```
+Task(subagent_type="evo:benchmark-reviewer",
+     prompt="mode=review-experiment\nworkspace=<workspace path>\nexperiment_id=<your exp_id>")
+```
+The returned JSON includes `failure_breakdown`, `top_failure_pattern`, and `next_step_signal`. Read it, include the breakdown + top pattern in your final handoff message, but **do not act on `next_step_signal` yourself** -- it's a hint for the next experiment, which isn't yours to design.
+Skip this step for `EVALUATED` (regressed, will be discarded), `FAILED` (infra error), or `DISCARDED` outcomes -- there's no meaningful per-task data worth classifying.
+### 7. Annotate
 ```bash
 evo annotate <exp_id> "<what you changed, what happened, and why>"
@@ -223,7 +299,7 @@ evo annotate <exp_id> "<what you changed, what happened, and why>"
 Always annotate so other agents can learn from your experiments.
-### 6b. Add gates for fixed behaviors
+### 7b. Add gates for fixed behaviors
 When you fix a critical, easy-to-regress behavior, lock it in as a gate so future experiments on this branch can't break it:
@@ -233,7 +309,7 @@ evo gate add <exp_id> --name "social_eng_resistance" --command "python3 {worktre
 Good candidates: a specific benchmark task that was hard to fix, a test for a critical policy rule, a smoke test for a fragile behavior. The gate command must exit non-zero when the protected behavior regresses; a bare benchmark invocation that prints a low score but exits 0 is decorative and should not be registered. Do NOT gate every passing task -- that over-constrains the search.
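The exit-code requirement above can be illustrated with a small Python sketch. It is not a script shipped by the package. The function names, the `scores` mapping, and the threshold are hypothetical, chosen to show a gate check that reports failure through its return code instead of only printing a score.

```python
import sys


def gate_exit_code(score: float, threshold: float) -> int:
    """Return 0 when the protected behavior holds, 1 when it has regressed."""
    return 0 if score >= threshold else 1


def check_task(scores: dict, task_id: str, threshold: float) -> int:
    """Look up one task's score and convert it to a process exit code.

    Hypothetical shape: `scores` maps task ids to floats. A missing task
    counts as a regression so the gate fails closed.
    """
    if task_id not in scores:
        print(f"gate: task {task_id!r} missing from results", file=sys.stderr)
        return 1
    code = gate_exit_code(scores[task_id], threshold)
    status = "ok" if code == 0 else "REGRESSED"
    print(f"gate: {task_id} score={scores[task_id]} threshold={threshold} {status}")
    return code
```

A wrapper script would pass the result of `check_task(...)` to `sys.exit`, so the gate command fails whenever the score drops below the threshold.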
-### 7. Decide: continue or stop
+### 8. Decide: continue or stop
 Continue if budget remains AND (last outcome was committed, OR you have a meaningfully different idea after an evaluated/discarded outcome). When continuing after a committed experiment, update your parent to the newly committed ID.
@@ -251,6 +327,12 @@ Trace quality is part of the benchmark contract. After a failed baseline or fail
 The trace format is forward-compatible -- extra fields are preserved. Do NOT change the score computation or gate logic -- only add observability.
+## Reuse expensive intermediates
+If your experiment needs an artifact that is slow to produce and stable across sibling/descendant experiments -- curated/tokenized datasets, fine-tuned weights or adapters, embeddings, retrieval indexes, precomputed eval generations, large compiled assets -- check `.evo/cache/` (workspace-level, sibling to `run_<NNNN>/`) before recomputing. Write back what you compute, keyed by every input that changes the artifact (recipe version, source, parameters). The next experiment that asks for the same artifact reads from disk instead of rebuilding from scratch.
+`.evo/cache/` is already gitignored via the workspace's `.evo/` exclude and is not touched by `evo new` / `evo run` / `evo reset`. Anti-pattern: writing the artifact inside your experiment's worktree -- it's worktree-local, doesn't propagate to descendants, and disappears on cleanup. The full read-or-compute pattern (workspace-root lookup, cache-key construction, deferring to the per-user HF cache where relevant) is in the **finetuning skill** under "Cache expensive intermediates." Apply it in any domain where the artifact shape fits.
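A minimal read-or-compute sketch in Python, assuming the artifact is JSON-serializable. It is not the pattern from the finetuning skill, which this diff does not show. The helper names `cache_key` and `read_or_compute`, the SHA-256 key scheme, and the `<name>/<key>.json` layout are assumptions for illustration. Only the `.evo/cache/` location and the rule of keying by every input come from the text above.

```python
import hashlib
import json
from pathlib import Path


def cache_key(inputs: dict) -> str:
    """Derive a stable key from every input that changes the artifact."""
    blob = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def read_or_compute(cache_root: Path, name: str, inputs: dict, compute):
    """Return the cached artifact if present; otherwise compute and store it.

    `cache_root` is expected to be the workspace-level `.evo/cache/` directory,
    not a path inside the experiment's worktree.
    """
    path = cache_root / name / f"{cache_key(inputs)}.json"
    if path.exists():
        return json.loads(path.read_text())

    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file, then rename, so a concurrent reader never
    # sees a half-written artifact.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)
    return result
```

Changing any entry in `inputs` (for example the recipe version) produces a different key, so a stale artifact is never returned for a new recipe.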
 ## Rules
 - Do NOT run `evo init` or `evo reset`