@evo-hq/pi-evo 0.4.4 → 0.5.0-alpha.10
- package/package.json +1 -1
- package/skills/discover/SKILL.md +145 -9
- package/skills/discover/references/inline_instrumentation.js +45 -0
- package/skills/discover/references/inline_instrumentation.py +50 -0
- package/skills/discover/references/sdk_node.js +19 -0
- package/skills/discover/references/sdk_python.py +25 -0
- package/skills/{optimize → discover}/references/sizing-the-round.md +8 -0
- package/skills/infra-setup/SKILL.md +1 -2
- package/skills/optimize/SKILL.md +167 -8
- package/skills/optimize/workflows/evo-optimize.js +672 -0
- package/skills/report/SKILL.md +1 -1
- package/skills/subagent/SKILL.md +89 -7
package/skills/report/SKILL.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 name: report
 description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
-evo_version: 0.
+evo_version: 0.5.0-alpha.10
 ---

 # Report
package/skills/subagent/SKILL.md
CHANGED
@@ -1,11 +1,59 @@
 ---
 name: subagent
-description:
-evo_version: 0.
+description: Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.
+evo_version: 0.5.0-alpha.10
 ---

 # Evo Subagent Protocol

+**Orchestrators reading for context**: this is the protocol your dispatched subagents follow. You don't act on it yourself -- write briefs that satisfy the four required fields described below, and rely on each spawned subagent to drive the loop on its end. Stop reading at "Host conventions" if you only need the brief shape; the rest is for the subagent.
+
+## Evo surface -- subagent perspective
+
+What you can pull/dispatch/read as a subagent. Each line is a triggering condition.
+
+```
+skills you may pull (Skill tool)
+└── evo:finetuning             before writing or changing any train.py -- technique
+                               choice, training recipe, observability, retry discipline.
+
+subagents you dispatch (Task tool, subagent_type=...)
+├── evo:verifier               MANDATORY pre AND post every `evo run`.
+│                              Pre: static analysis before the experiment runs
+│                              (block on failure -- fix and retry).
+│                              Post: result-validity audit after it commits.
+└── evo:benchmark-reviewer     POST-COMMIT only, mode=review-experiment --
+                               per-task failure classification + annotations.
+                               Skip on evaluated/discarded/failed outcomes.
+
+references (Read tool, on demand)
+├── discover/references/
+│   ├── sdk_python.py / sdk_node.js      wiring per-task instrumentation -- preferred
+│   ├── inline_instrumentation.py        inline fallback. Copy as-is; do not reimplement
+│   └── instrumentation-contract.md      the format evo reads (result + traces shapes)
+│
+├── references/evo-wait.md               any time you need to wait -- training, eval,
+│                                        any long-running condition. Use this instead
+│                                        of `sleep N`; doesn't burn context.
+│
+└── finetuning/references/
+    ├── glue.md                          train.py I/O contract evo expects
+    ├── observability.md                 wandb/trackio/mlflow wiring -- env-driven
+    │                                    detection, TRL report_to options, custom-loop
+    │                                    patterns. Read when writing a training script.
+    ├── diagnostics.md                   per-failure-mode diagnostics
+    ├── false-progress.md                what doesn't count as improvement
+    ├── trace-schema.md                  per-task trace JSON schema
+    ├── rl/art.md                        ART (Algorithm-Refined Training)
+    ├── sft/tinker.md                    Tinker SFT
+    └── serving/vllm.md                  vLLM serving config + LoRA-multi
+```
+
+Orchestrator entry-point view (benchmark-reviewer, ideator, infra-setup, full
+references catalogue) lives in `evo:discover`'s "Evo surface" section.
+
+---
+
 You are an evo optimization subagent. The orchestrator has given you a **brief** with four fields:

 - **Objective** -- the bottleneck to attack and evidence for it (strategic, not edit-level)
@@ -160,7 +208,22 @@ For multi-line edits, `evo edit --json-stdin` reads `{"old":...,"new":...,"repla

 You may edit anything within the target scope. Do NOT modify benchmark, gate, or framework code.

-### 4.
+### 4. Verify the experiment design (pre-`evo run`)
+
+Before `evo run` burns compute, invoke the **evo verifier subagent** via your host's Task tool. Static analysis, ~30s.
+
+```
+Task(subagent_type="evo:verifier",
+     prompt="workspace=<workspace abs path>\nexperiment_id=<your exp_id>\nphase=pre")
+```
+
+The verifier checks for test-set leakage in your training data, subsetted eval commands, missing gates for new artifacts, generic hypotheses, and concurrent-resource conflicts. It returns a JSON report (`{passed, verdict, findings}`) and writes the same verdict as an `evo annotation` on the experiment. See `plugins/evo/agents/verifier.md` for the full check list.
+
+If the verifier returns `passed: false` (verdict `fail`), address every flagged `block` finding and re-invoke until it returns `passed: true`. Skipping or fudging a `fail` verdict is a stop-the-line bug -- the verdict is the precondition for compute spend.
+
+If the verifier returns verdict `warn`, you may proceed but address the warnings in your annotation (step 7).
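
The branching on the verifier report described in step 4 can be sketched roughly as follows. This is a minimal illustration, not the verifier's actual contract: it assumes the report parses to a dict with the `passed`, `verdict`, and `findings` keys named in the added text, and that each finding is a dict with `severity` and `message` keys. That per-finding shape, the `next_action` helper, and its return labels are all hypothetical.

```python
import json


def next_action(report_json: str) -> tuple[str, list[str]]:
    """Map a verifier report to a next step plus the findings to address.

    Returns one of:
      ("fix_and_retry", blockers)  -- fail verdict: fix, then re-invoke the verifier
      ("run_and_note", warnings)   -- warn verdict: proceed, mention in the annotation
      ("run", [])                  -- clean pass
    """
    report = json.loads(report_json)
    findings = report.get("findings", [])

    if not report.get("passed", False) or report.get("verdict") == "fail":
        # Hypothetical finding shape: {"severity": "block" | "warn", "message": str}
        blockers = [f.get("message", "") for f in findings
                    if f.get("severity") == "block"]
        return "fix_and_retry", blockers

    if report.get("verdict") == "warn":
        return "run_and_note", [f.get("message", "") for f in findings]

    return "run", []
```

A missing `passed` key is treated as a failure here, so a malformed report blocks compute spend instead of silently allowing it.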
+
+### 5. Run the experiment

 ```bash
 evo run <exp_id>
@@ -195,7 +258,7 @@ evo run <exp_id> --i-staged-new-files yes

 The ack flag is required when the worktree has any untracked, non-gitignored file. Without it, `evo run` errors closed and lists the files. For each file, decide: source (then `git add`) or warm state (leave untracked -- it persists in the slot for future experiments). Then re-run with `--i-staged-new-files yes`. The flag value must be exactly `yes`. In `commit_strategy=all` workspaces (default for `--backend worktree`) the flag is a silent no-op; safe to always pass.

-###
+### 6. Analyze the result

 `evo run` prints one of three outcomes:

@@ -215,7 +278,20 @@ The ack flag is required when the worktree has any untracked, non-gitignored fil
 - Structural (benchmark broken, evo misconfigured): report to orchestrator and stop.
 - Not worth fixing: `evo discard <id> --reason "..."`.

-###
+### 6b. Review your own failures (committed experiments only)
+
+After a `COMMITTED` outcome, before annotating yourself, spawn `evo:benchmark-reviewer` in review-experiment mode. It reads the per-task traces and the eval-runner log you just produced, classifies failures into a small taxonomy, and writes per-task annotations via `evo annotate <exp> --task K`. This is the data the next experiment's hypothesis is built on -- skip it and the orchestrator picks a frontier from `passed/failed` booleans with no diagnosis.
+
+```
+Task(subagent_type="evo:benchmark-reviewer",
+     prompt="mode=review-experiment\nworkspace=<workspace path>\nexperiment_id=<your exp_id>")
+```
+
+The returned JSON includes `failure_breakdown`, `top_failure_pattern`, and `next_step_signal`. Read it, include the breakdown + top pattern in your final handoff message, but **do not act on `next_step_signal` yourself** -- it's a hint for the next experiment, which isn't yours to design.
+
+Skip this step for `EVALUATED` (regressed, will be discarded), `FAILED` (infra error), or `DISCARDED` outcomes -- there's no meaningful per-task data worth classifying.
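
As a rough illustration of step 6b's handoff, the sketch below folds the reviewer's result into a message. Only the three key names come from the added text; the assumption that `failure_breakdown` maps a category name to a count, the `handoff_summary` helper, and the example experiment id are hypothetical. The function never reads `next_step_signal`, mirroring the instruction not to act on it.

```python
def handoff_summary(exp_id: str, review: dict) -> str:
    """Build the failure section of a final handoff message.

    Assumes (hypothetically) that review["failure_breakdown"] maps a
    failure category to the number of tasks in that category.
    """
    breakdown = review.get("failure_breakdown", {})
    lines = [f"Experiment {exp_id}: failure review"]

    # Largest categories first; ties broken alphabetically for stable output.
    for category, count in sorted(breakdown.items(),
                                  key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"- {category}: {count}")

    lines.append(f"Top pattern: {review.get('top_failure_pattern', 'n/a')}")
    # next_step_signal is intentionally ignored: it is a hint for whoever
    # designs the next experiment, not for this subagent.
    return "\n".join(lines)
```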
+
+### 7. Annotate

 ```bash
 evo annotate <exp_id> "<what you changed, what happened, and why>"
@@ -223,7 +299,7 @@ evo annotate <exp_id> "<what you changed, what happened, and why>"

 Always annotate so other agents can learn from your experiments.

-###
+### 7b. Add gates for fixed behaviors

 When you fix a critical, easy-to-regress behavior, lock it in as a gate so future experiments on this branch can't break it:

@@ -233,7 +309,7 @@ evo gate add <exp_id> --name "social_eng_resistance" --command "python3 {worktre

 Good candidates: a specific benchmark task that was hard to fix, a test for a critical policy rule, a smoke test for a fragile behavior. The gate command must exit non-zero when the protected behavior regresses; a bare benchmark invocation that prints a low score but exits 0 is decorative and should not be registered. Do NOT gate every passing task -- that over-constrains the search.
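
The exit-code requirement for gates could look something like the sketch below: a small wrapper that turns a score into a process exit status. The result-file shape (`{"tasks": {task_id: score}}`), the function names, and the argument order are assumptions made for illustration; they are not evo's result format.

```python
import json
import sys


def gate_exit_code(result: dict, task_id: str, min_score: float) -> int:
    """Return 0 only if the protected task still meets its score floor.

    Hypothetical result shape: {"tasks": {"<task_id>": <float score>}}.
    """
    score = result.get("tasks", {}).get(task_id)
    if score is None:
        return 2  # a missing task counts as a regression, not a pass
    return 0 if score >= min_score else 1


def main(argv: list[str]) -> int:
    """Usage: gate.py <result.json> <task_id> <min_score>"""
    result_path, task_id, min_score = argv[0], argv[1], float(argv[2])
    with open(result_path, encoding="utf-8") as fh:
        return gate_exit_code(json.load(fh), task_id, min_score)
```

Wired up as a script, the last line would be `sys.exit(main(sys.argv[1:]))`, so that a regression produces a non-zero status instead of only a low printed score.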

-###
+### 8. Decide: continue or stop

 Continue if budget remains AND (last outcome was committed, OR you have a meaningfully different idea after an evaluated/discarded outcome). When continuing after a committed experiment, update your parent to the newly committed ID.
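
Read literally, the continue rule in step 8 is a small boolean predicate. The sketch below encodes it under stated assumptions: the lowercase outcome strings, the integer budget, and both function names are hypothetical, and any outcome the rule does not mention (for example a failed run) is treated as "stop".

```python
def should_continue(budget_remaining: int, last_outcome: str,
                    has_different_idea: bool) -> bool:
    """budget AND (committed OR (different idea after evaluated/discarded))."""
    if budget_remaining <= 0:
        return False
    if last_outcome == "committed":
        return True
    if last_outcome in {"evaluated", "discarded"}:
        return has_different_idea
    return False  # outcomes the rule does not cover: stop (assumption)


def next_parent(current_parent: str, exp_id: str, last_outcome: str) -> str:
    """After a committed experiment, the new experiment's parent is that commit."""
    return exp_id if last_outcome == "committed" else current_parent
```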

@@ -251,6 +327,12 @@ Trace quality is part of the benchmark contract. After a failed baseline or fail

 The trace format is forward-compatible -- extra fields are preserved. Do NOT change the score computation or gate logic -- only add observability.

+## Reuse expensive intermediates
+
+If your experiment needs an artifact that is slow to produce and stable across sibling/descendant experiments -- curated/tokenized datasets, fine-tuned weights or adapters, embeddings, retrieval indexes, precomputed eval generations, large compiled assets -- check `.evo/cache/` (workspace-level, sibling to `run_<NNNN>/`) before recomputing. Write back what you compute, keyed by every input that changes the artifact (recipe version, source, parameters). The next experiment that asks for the same artifact reads from disk instead of rebuilding from scratch.
+
+`.evo/cache/` is already gitignored via the workspace's `.evo/` exclude and is not touched by `evo new` / `evo run` / `evo reset`. Anti-pattern: writing the artifact inside your experiment's worktree -- it's worktree-local, doesn't propagate to descendants, and disappears on cleanup. The full read-or-compute pattern (workspace-root lookup, cache-key construction, deferring to the per-user HF cache where relevant) is in the **finetuning skill** under "Cache expensive intermediates." Apply it in any domain where the artifact shape fits.
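
A minimal read-or-compute sketch for the cache described above is shown here. It is not the finetuning skill's version (that file is not part of this diff); the helper names, the SHA-256-based key, and the single-file artifact layout are assumptions. Only the `.evo/cache/` location and the "key by every input" rule come from the added text.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(name: str, inputs: dict) -> str:
    """Derive a key from every input that changes the artifact."""
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return f"{name}-{hashlib.sha256(canonical).hexdigest()[:16]}"


def read_or_compute(workspace_root: Path, name: str, inputs: dict,
                    compute: Callable[[], bytes]) -> bytes:
    """Return the cached artifact if present, else compute and write it back."""
    cache_dir = workspace_root / ".evo" / "cache"  # workspace-level, not the worktree
    path = cache_dir / cache_key(name, inputs)
    if path.exists():
        return path.read_bytes()

    data = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename, so readers never see a partial artifact.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)
    return data
```

Sorting the keys before hashing makes the cache key independent of dict ordering, so `{"recipe": 2, "source": "x"}` and `{"source": "x", "recipe": 2}` resolve to the same artifact. Concurrent writers of the same key would share the temp filename in this sketch; a real implementation would likely want a unique temp name or a lock.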
+
 ## Rules

 - Do NOT run `evo init` or `evo reset`