@evo-hq/pi-evo 0.4.5 → 0.5.0-alpha.10

@@ -1,7 +1,7 @@
  ---
  name: report
  description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
- evo_version: 0.4.5
+ evo_version: 0.5.0-alpha.10
  ---

  # Report
@@ -1,11 +1,59 @@
  ---
  name: subagent
- description: Internal protocol for evo optimization subagents. Loaded by subagents spawned from /optimize via their host's skill loader. Not for orchestrator use.
- evo_version: 0.4.5
+ description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
+ evo_version: 0.5.0-alpha.10
  ---

  # Evo Subagent Protocol

+ **Orchestrators reading for context**: this is the protocol your dispatched subagents follow. You don't act on it yourself -- write briefs that satisfy the four required fields described below, and rely on each spawned subagent to drive the loop on its end. Stop reading at "Host conventions" if you only need the brief shape; the rest is for the subagent.
+
+ ## Evo surface -- subagent perspective
+
+ What you can pull/dispatch/read as a subagent. Each line is a triggering condition.
+
+ ```
16
+ skills you may pull (Skill tool)
17
+ └── evo:finetuning before writing or changing any train.py -- technique
18
+ choice, training recipe, observability, retry discipline.
19
+
20
+ subagents you dispatch (Task tool, subagent_type=...)
21
+ ├── evo:verifier MANDATORY pre AND post every `evo run`.
22
+ │ Pre: static analysis before the experiment runs
23
+ │ (block on failure -- fix and retry).
24
+ │ Post: result-validity audit after it commits.
25
+ └── evo:benchmark-reviewer POST-COMMIT only, mode=review-experiment --
26
+ per-task failure classification + annotations.
27
+ Skip on evaluated/discarded/failed outcomes.
28
+
29
+ references (Read tool, on demand)
30
+ ├── discover/references/
31
+ │ ├── sdk_python.py / sdk_node.js wiring per-task instrumentation -- preferred
32
+ │ ├── inline_instrumentation.py inline fallback. Copy as-is; do not reimplement
33
+ │ └── instrumentation-contract.md the format evo reads (result + traces shapes)
34
+
35
+ ├── references/evo-wait.md any time you need to wait -- training, eval,
36
+ │ any long-running condition. Use this instead
37
+ │ of `sleep N`; doesn't burn context.
38
+
39
+ └── finetuning/references/
40
+ ├── glue.md train.py I/O contract evo expects
41
+ ├── observability.md wandb/trackio/mlflow wiring -- env-driven
42
+ │ detection, TRL report_to options, custom-loop
43
+ │ patterns. Read when writing a training script.
44
+ ├── diagnostics.md per-failure-mode diagnostics
45
+ ├── false-progress.md what doesn't count as improvement
46
+ ├── trace-schema.md per-task trace JSON schema
47
+ ├── rl/art.md ART (Algorithm-Refined Training)
48
+ ├── sft/tinker.md Tinker SFT
49
+ └── serving/vllm.md vLLM serving config + LoRA-multi
50
+ ```
+
+ Orchestrator entry-point view (benchmark-reviewer, ideator, infra-setup, full
+ references catalogue) lives in `evo:discover`'s "Evo surface" section.
+
+ ---
+
  You are an evo optimization subagent. The orchestrator has given you a **brief** with four fields:

  - **Objective** -- the bottleneck to attack and evidence for it (strategic, not edit-level)
@@ -160,7 +208,22 @@ For multi-line edits, `evo edit --json-stdin` reads `{"old":...,"new":...,"repla

  You may edit anything within the target scope. Do NOT modify benchmark, gate, or framework code.

- ### 4. Run the experiment
+ ### 4. Verify the experiment design (pre-`evo run`)
+
+ Before `evo run` burns compute, invoke the **evo verifier subagent** via your host's Task tool. Static analysis, ~30s.
+
+ ```
+ Task(subagent_type="evo:verifier",
+      prompt="workspace=<workspace abs path>\nexperiment_id=<your exp_id>\nphase=pre")
+ ```
+
+ The verifier checks for test-set leakage in your training data, subsetted eval commands, missing gates for new artifacts, generic hypotheses, and concurrent-resource conflicts. It returns a JSON report (`{passed, verdict, findings}`) and writes the same verdict as an `evo annotation` on the experiment. See `plugins/evo/agents/verifier.md` for the full check list.
+
+ If the verifier returns `passed: false` (verdict `fail`), address every flagged `block` finding and re-invoke until it returns `passed: true`. Skipping or fudging a `fail` verdict is a stop-the-line bug -- the verdict is the precondition for compute spend.
+
+ If the verifier returns verdict `warn`, you may proceed but address the warnings in your annotation (step 7).
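The pre-run rule in step 4 can be sketched as a small decision function. This is an illustrative sketch, not evo's implementation: only the top-level keys `passed`, `verdict`, `findings`, the verdict values `fail` and `warn`, and the `block` label come from the text above. The per-finding `severity` and `message` fields and the name `decide_pre_run` are assumptions made for the example.

```python
import json


def decide_pre_run(report_json: str) -> tuple[str, list[str]]:
    """Map a verifier report to 'run', 'run_with_warnings', or 'fix_and_retry'.

    Assumed finding shape (hypothetical): {"severity": "block" | "warn", "message": str}
    """
    report = json.loads(report_json)
    findings = report.get("findings", [])

    if not report.get("passed", False) or report.get("verdict") == "fail":
        # A failing verdict blocks compute spend: list every block finding to fix.
        blockers = [f["message"] for f in findings if f.get("severity") == "block"]
        return "fix_and_retry", blockers

    if report.get("verdict") == "warn":
        # Proceeding is allowed, but the warnings belong in the step 7 annotation.
        warnings = [f["message"] for f in findings if f.get("severity") == "warn"]
        return "run_with_warnings", warnings

    return "run", []
```

A missing `passed` key is treated as a failure here, so a malformed report never unlocks `evo run` by accident.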
+
+ ### 5. Run the experiment

  ```bash
  evo run <exp_id>
@@ -195,7 +258,7 @@ evo run <exp_id> --i-staged-new-files yes

  The ack flag is required when the worktree has any untracked, non-gitignored file. Without it, `evo run` errors closed and lists the files. For each file, decide: source (then `git add`) or warm state (leave untracked -- it persists in the slot for future experiments). Then re-run with `--i-staged-new-files yes`. The flag value must be exactly `yes`. In `commit_strategy=all` workspaces (default for `--backend worktree`) the flag is a silent no-op; safe to always pass.
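To preview which files would trigger the ack requirement, something like the following may help. It is a sketch under two assumptions: Git is on `PATH`, and "untracked, non-gitignored" corresponds to what `git ls-files --others --exclude-standard` reports. The helper names are hypothetical, and this is not how `evo run` itself performs the check.

```python
import subprocess
from pathlib import Path


def parse_untracked(ls_files_output: str) -> list[str]:
    """Split `git ls-files` output into paths, skipping blank lines."""
    return [line for line in ls_files_output.splitlines() if line.strip()]


def list_untracked(worktree: Path) -> list[str]:
    """List untracked files in the worktree that .gitignore does not cover."""
    result = subprocess.run(
        ["git", "ls-files", "--others", "--exclude-standard"],
        cwd=worktree,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_untracked(result.stdout)
```

Each returned path still needs the manual decision described above: `git add` it if it is source, or leave it untracked if it is warm state.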

- ### 5. Analyze the result
+ ### 6. Analyze the result

  `evo run` prints one of three outcomes:

@@ -215,7 +278,20 @@ The ack flag is required when the worktree has any untracked, non-gitignored fil
  - Structural (benchmark broken, evo misconfigured): report to orchestrator and stop.
  - Not worth fixing: `evo discard <id> --reason "..."`.

- ### 6. Annotate
+ ### 6b. Review your own failures (committed experiments only)
+
+ After a `COMMITTED` outcome, before annotating yourself, spawn `evo:benchmark-reviewer` in review-experiment mode. It reads the per-task traces and the eval-runner log you just produced, classifies failures into a small taxonomy, and writes per-task annotations via `evo annotate <exp> --task K`. This is the data the next experiment's hypothesis is built on -- skip it and the orchestrator picks a frontier from `passed/failed` booleans with no diagnosis.
+
+ ```
+ Task(subagent_type="evo:benchmark-reviewer",
+      prompt="mode=review-experiment\nworkspace=<workspace path>\nexperiment_id=<your exp_id>")
+ ```
+
+ The returned JSON includes `failure_breakdown`, `top_failure_pattern`, and `next_step_signal`. Read it, include the breakdown + top pattern in your final handoff message, but **do not act on `next_step_signal` yourself** -- it's a hint for the next experiment, which isn't yours to design.
+
+ Skip this step for `EVALUATED` (regressed, will be discarded), `FAILED` (infra error), or `DISCARDED` outcomes -- there's no meaningful per-task data worth classifying.
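The routing in step 6b can be sketched as two small helpers. This is an illustration under assumptions: the outcome names and the three JSON keys come from the text above, while the function names and the value shapes inside the review JSON are hypothetical.

```python
import json

# Per step 6b, only COMMITTED outcomes go to the benchmark reviewer.
REVIEWABLE_OUTCOMES = {"COMMITTED"}


def should_review(outcome: str) -> bool:
    """True when the outcome warrants a review-experiment pass."""
    return outcome.upper() in REVIEWABLE_OUTCOMES


def handoff_summary(review_json: str) -> dict:
    """Keep the fields meant for the handoff message.

    next_step_signal is dropped on purpose: it is a hint for the next
    experiment, which this subagent does not design.
    """
    review = json.loads(review_json)
    return {key: review.get(key) for key in ("failure_breakdown", "top_failure_pattern")}
```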
+
+ ### 7. Annotate

  ```bash
  evo annotate <exp_id> "<what you changed, what happened, and why>"
@@ -223,7 +299,7 @@ evo annotate <exp_id> "<what you changed, what happened, and why>"

  Always annotate so other agents can learn from your experiments.

- ### 6b. Add gates for fixed behaviors
+ ### 7b. Add gates for fixed behaviors

  When you fix a critical, easy-to-regress behavior, lock it in as a gate so future experiments on this branch can't break it:

@@ -233,7 +309,7 @@ evo gate add <exp_id> --name "social_eng_resistance" --command "python3 {worktre

  Good candidates: a specific benchmark task that was hard to fix, a test for a critical policy rule, a smoke test for a fragile behavior. The gate command must exit non-zero when the protected behavior regresses; a bare benchmark invocation that prints a low score but exits 0 is decorative and should not be registered. Do NOT gate every passing task -- that over-constrains the search.
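The exit-code requirement can be illustrated with a small wrapper that turns a score into a pass/fail status. This is a sketch only: it assumes the protected check writes a JSON file with a numeric `score` field, and the file layout, threshold, and function names are hypothetical rather than part of evo.

```python
import json
from pathlib import Path


def gate_exit_code(result_path: Path, min_score: float) -> int:
    """Return 0 when the protected behavior still holds, 1 when it regressed."""
    try:
        score = float(json.loads(result_path.read_text())["score"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or unreadable results count as a regression, never as a pass.
        return 1
    return 0 if score >= min_score else 1


def main(argv: list[str]) -> int:
    """Hypothetical CLI shape: gate_check.py <result.json> <min_score>."""
    return gate_exit_code(Path(argv[1]), float(argv[2]))

# In a real script, finish with: import sys; sys.exit(main(sys.argv))
```

Failing closed on a missing result file keeps the gate from silently passing when the benchmark itself did not run.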

- ### 7. Decide: continue or stop
+ ### 8. Decide: continue or stop

  Continue if budget remains AND (last outcome was committed, OR you have a meaningfully different idea after an evaluated/discarded outcome). When continuing after a committed experiment, update your parent to the newly committed ID.

@@ -251,6 +327,12 @@ Trace quality is part of the benchmark contract. After a failed baseline or fail

  The trace format is forward-compatible -- extra fields are preserved. Do NOT change the score computation or gate logic -- only add observability.

+ ## Reuse expensive intermediates
+
+ If your experiment needs an artifact that is slow to produce and stable across sibling/descendant experiments -- curated/tokenized datasets, fine-tuned weights or adapters, embeddings, retrieval indexes, precomputed eval generations, large compiled assets -- check `.evo/cache/` (workspace-level, sibling to `run_<NNNN>/`) before recomputing. Write back what you compute, keyed by every input that changes the artifact (recipe version, source, parameters). The next experiment that asks for the same artifact reads from disk instead of rebuilding from scratch.
+
+ `.evo/cache/` is already gitignored via the workspace's `.evo/` exclude and is not touched by `evo new` / `evo run` / `evo reset`. Anti-pattern: writing the artifact inside your experiment's worktree -- it's worktree-local, doesn't propagate to descendants, and disappears on cleanup. The full read-or-compute pattern (workspace-root lookup, cache-key construction, deferring to the per-user HF cache where relevant) is in the **finetuning skill** under "Cache expensive intermediates." Apply it in any domain where the artifact shape fits.
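As a rough illustration of the read-or-compute idea (not the finetuning skill's own pattern, which is not reproduced here), the sketch below keys a single-file artifact by a hash of its inputs. The helper names, the key format, and the single-file assumption are all hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Callable


def cache_key(name: str, inputs: dict) -> str:
    """Derive a stable key from every input that changes the artifact."""
    payload = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{name}-{digest}"


def read_or_compute(cache_dir: Path, name: str, inputs: dict,
                    compute: Callable[[Path], None]) -> Path:
    """Return the cached artifact path, calling compute(tmp_path) only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / cache_key(name, inputs)
    if target.exists():
        return target
    # Write to a temporary sibling first, then rename, so a reader
    # never observes a partially written artifact.
    tmp = target.with_name(f"{target.name}.tmp-{os.getpid()}")
    compute(tmp)
    os.replace(tmp, target)
    return target
```

A call might pass `cache_dir=<workspace root>/.evo/cache` and `inputs={"recipe": "v1", "source": "...", "max_len": 2048}`; changing any input yields a different key, so stale artifacts are never reused.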
+
  ## Rules

  - Do NOT run `evo init` or `evo reset`