@evo-hq/pi-evo 0.4.4-alpha.6 → 0.5.0-alpha.3
- package/package.json +1 -1
- package/skills/discover/SKILL.md +69 -9
- package/skills/discover/references/inline_instrumentation.js +45 -0
- package/skills/discover/references/inline_instrumentation.py +50 -0
- package/skills/discover/references/sdk_node.js +19 -0
- package/skills/discover/references/sdk_python.py +25 -0
- package/skills/infra-setup/SKILL.md +1 -2
- package/skills/optimize/SKILL.md +120 -7
- package/skills/optimize/references/sizing-the-round.md +8 -0
- package/skills/report/SKILL.md +1 -1
- package/skills/subagent/SKILL.md +21 -6
package/package.json
CHANGED
package/skills/discover/SKILL.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
name: discover
|
|
3
3
|
description: Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.
|
|
4
4
|
argument-hint: <optional context about what to optimize>
|
|
5
|
-
evo_version: 0.
|
|
5
|
+
evo_version: 0.5.0-alpha.3
|
|
6
6
|
---
|
|
7
7
|
|
|
8
8
|
# Discover
|
|
@@ -40,20 +40,20 @@ evo --version
|
|
|
40
40
|
The output must be exactly:
|
|
41
41
|
|
|
42
42
|
```
|
|
43
|
-
evo-hq-cli 0.
|
|
43
|
+
evo-hq-cli 0.5.0-alpha.3
|
|
44
44
|
```
|
|
45
45
|
|
|
46
46
|
Three outcomes:
|
|
47
47
|
|
|
48
48
|
1. **Matches exactly** — continue to step 1.
|
|
49
49
|
2. **Reports a different version** (`evo-hq-cli 0.4.2`, etc.) — the host refetched a newer/older skill bundle than the CLI on PATH. Drift breaks skills silently. Stop and tell the user:
|
|
50
|
-
> Your installed evo CLI is on a different version than this skill (`0.
|
|
50
|
+
> Your installed evo CLI is on a different version than this skill (`0.5.0-alpha.3`). Run:
|
|
51
51
|
> ```
|
|
52
|
-
> uv tool install --force evo-hq-cli==0.
|
|
52
|
+
> uv tool install --force evo-hq-cli==0.5.0-alpha.3
|
|
53
53
|
> ```
|
|
54
54
|
> Then re-invoke this skill.
|
|
55
55
|
3. **`command not found`, or reports a different package** (commonly `evo 1.x` — the unrelated SLAM tool) — the CLI isn't installed. Tell the user:
|
|
56
|
-
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.
|
|
56
|
+
> `evo-hq-cli` isn't on your PATH. Install it: `uv tool install evo-hq-cli==0.5.0-alpha.3` (or `pipx install evo-hq-cli==0.5.0-alpha.3`). Then re-invoke this skill.
|
|
57
57
|
|
|
58
58
|
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
|
|
59
59
|
|
|
@@ -180,7 +180,8 @@ build/
|
|
|
180
180
|
evo init --name "<short project name>" \
|
|
181
181
|
--target <file> --benchmark "<command using {worktree} and {target}>" --metric <max|min> \
|
|
182
182
|
--host <claude-code|codex|opencode|openclaw|hermes|pi|generic> \
|
|
183
|
-
--instrumentation-mode <sdk|inline>
|
|
183
|
+
--instrumentation-mode <sdk|inline> \
|
|
184
|
+
--per-exp-timeout <seconds> [--gate "<gate command>"] \
|
|
184
185
|
[--commit-strategy <all|tracked-only>]
|
|
185
186
|
```
|
|
186
187
|
|
|
@@ -188,6 +189,8 @@ evo init --name "<short project name>" \
|
|
|
188
189
|
|
|
189
190
|
**`--name` should be a short human-readable project label** for dashboard display, chosen from the repository/product context. Existing workspaces without a name fall back to the repo directory name; do not hand-edit config just to migrate them.
|
|
190
191
|
|
|
192
|
+
**`--per-exp-timeout` is required.** Wall-clock seconds for each `evo run` invocation. Becomes the workspace default; override per-call with `evo run --timeout N`. Pick based on what the benchmark actually costs end-to-end on this hardware -- if you don't know yet, time the benchmark once locally and use ~2x that. Typical ranges: a unit-test-style benchmark is 300-900s; a small-model SFT + eval cycle is 1800-3600s; a large-model train run is several hours. Set conservatively -- a too-tight value kills experiments mid-flight; a too-loose value wastes budget only when something actually hangs. Update later with `evo config set per-exp-timeout <seconds>`.
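The "time the benchmark once and use ~2x" advice can be sketched as a small helper. This is only an illustration: `round_up_timeout` and `suggest_timeout` are hypothetical names, not part of evo, and rounding up to a whole minute with a 60-second floor is an assumption made here for convenience.

```python
import math
import subprocess
import sys
import time


def round_up_timeout(elapsed_s: float, factor: float = 2.0, floor_s: int = 60) -> int:
    """Scale a measured duration by `factor` and round up to a whole minute."""
    scaled = elapsed_s * factor
    return max(floor_s, math.ceil(scaled / 60) * 60)


def suggest_timeout(benchmark_cmd: list[str], factor: float = 2.0) -> int:
    """Run the benchmark command once and suggest a per-experiment timeout."""
    start = time.monotonic()
    # check=True: a failing benchmark should raise rather than yield a misleading timing.
    subprocess.run(benchmark_cmd, check=True)
    return round_up_timeout(time.monotonic() - start, factor)


if __name__ == "__main__":
    # Stand-in command; substitute the real benchmark invocation.
    print(suggest_timeout([sys.executable, "-c", "pass"]))
```

The printed value is a starting point for `--per-exp-timeout`; the typical ranges above still apply when the measured run is not representative.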
|
|
193
|
+
|
|
191
194
|
**`--commit-strategy` is optional.** Default is `all`. Override with `--commit-strategy tracked-only` only when you want the stricter shisa-kanko flow where new files must be staged explicitly and acknowledged at `evo run` time.
|
|
192
195
|
|
|
193
196
|
**Placeholder semantics.** Benchmark and gate commands support two placeholders, resolved lazily at run time by `evo run` / gate evaluation:
|
|
@@ -205,7 +208,8 @@ evo init \
|
|
|
205
208
|
--target agent/solve.py \
|
|
206
209
|
--benchmark "python3 {worktree}/benchmark.py --target {target}" \
|
|
207
210
|
--metric max \
|
|
208
|
-
--host claude-code
|
|
211
|
+
--host claude-code \
|
|
212
|
+
--per-exp-timeout 1800
|
|
209
213
|
```
|
|
210
214
|
|
|
211
215
|
Use the same runtime entry point the project already uses, but make sure the command does not assume uncommitted runtime state exists inside the worktree. Worktrees are git checkouts; untracked directories such as local virtualenvs, build caches, and downloaded models are usually not present there. If the benchmark needs setup or a package-manager runner, configure evo's runtime recipe instead of baking local paths into the benchmark command:
|
|
@@ -225,6 +229,14 @@ Dashboard live: http://127.0.0.1:8080 (pid 12345)
|
|
|
225
229
|
|
|
226
230
|
**Relay that line back to the user verbatim.** If port 8080 is busy, evo auto-increments -- show whatever port prints. The URL is how the user watches the run.
|
|
227
231
|
|
|
232
|
+
**Benchmark commands must be eval-only.** Do NOT wrap training and evaluation into a single benchmark command. If your benchmark command runs training before scoring, every gate revalidation and every `evo run --check` retrains from scratch, and the experiment budget burns on duplicated training instead of new experiments. Training is a separate step the agent invokes BEFORE `evo run`:
|
|
233
|
+
|
|
234
|
+
1. The agent makes changes (data curation, hyperparameter selection, technique choice, training code edits).
|
|
235
|
+
2. The agent runs training to produce a checkpoint at the experiment worktree's `final_model/` (or wherever the technique's recipe in `evo:finetuning/references/glue.md` specifies).
|
|
236
|
+
3. THEN `evo run <exp_id>` invokes the registered benchmark command, which loads the produced checkpoint and emits a score.
|
|
237
|
+
|
|
238
|
+
The registered benchmark command should call `evaluate.py`, `run_eval.py`, or equivalent -- NOT `train.py`. If the project's only existing evaluation tool runs train+eval together with no eval-only mode, wrap it: add a `--skip-train` flag, or have the wrapper detect an existing checkpoint at `final_model/` and short-circuit the train step. Without this, evo's gate-recheck and re-score mechanics retrain repeatedly and the budget evaporates.
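A minimal sketch of such a wrapper, assuming the project exposes separate `train` and `evaluate` callables (both hypothetical here) and that the checkpoint lives in `final_model/`. The `--checkpoint-dir` flag is an illustrative addition, not an evo option.

```python
import argparse
from pathlib import Path
from typing import Callable, Optional, Sequence


def has_checkpoint(checkpoint_dir: Path) -> bool:
    """A checkpoint counts as present when the directory exists and is non-empty."""
    return checkpoint_dir.is_dir() and any(checkpoint_dir.iterdir())


def run(
    argv: Optional[Sequence[str]],
    train: Callable[[Path], None],
    evaluate: Callable[[Path], float],
) -> float:
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip-train", action="store_true")
    parser.add_argument("--checkpoint-dir", default="final_model")
    args = parser.parse_args(argv)
    ckpt = Path(args.checkpoint_dir)

    # Short-circuit: train only when not skipped AND no checkpoint exists yet.
    if not args.skip_train and not has_checkpoint(ckpt):
        train(ckpt)

    # Fail loudly instead of scoring a missing model.
    if not has_checkpoint(ckpt):
        raise FileNotFoundError(f"no checkpoint found at {ckpt}")
    return evaluate(ckpt)
```

With this shape, repeated invocations against the same worktree reuse the existing checkpoint, so gate revalidation only pays for evaluation.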
|
|
239
|
+
|
|
228
240
|
**Runtime environment.** If the benchmark needs keys or other runtime variables, configure them through evo rather than copying `.env` into worktrees or hand-editing `config.json`:
|
|
229
241
|
|
|
230
242
|
```bash
|
|
@@ -318,7 +330,31 @@ Paths below are relative to this `SKILL.md` file (resolve them against the skill
|
|
|
318
330
|
- Benchmark is Python or Node: paste `references/inline_instrumentation.py` (or `.js`) and call `log_task` / `logTask` per task, `write_result` / `writeResult` once at the end.
|
|
319
331
|
- Benchmark is any other language: port the contract from `references/instrumentation-contract.md` directly into that language (~10-15 lines: read the env vars, write each task trace, atomically publish the result). Do not add a Python/JS wrapper around it.
|
|
320
332
|
|
|
321
|
-
|
|
333
|
+
**Per-task emission is load-bearing.** If your benchmark evaluates N independent items (per-question math, per-test-case unit tests, per-document QA, per-sample reasoning trace), emit ONE `log_task` / `report` / trace file PER ITEM -- not one aggregate. Include the item's input, expected output, model output, and any per-item metadata as extras; that detail is what the dashboard's per-task panel + the verifier's reproducibility spot-check + the ideator's failure-clustering all rely on. Wrappers that compute the average score themselves and emit a single aggregate task entry look like they work but lose every diagnostic capability evo provides. The reference files have explicit USAGE EXAMPLES showing the per-item loop AND an ANTI-PATTERN block showing what NOT to do (see `references/inline_instrumentation.{py,js}` and `references/sdk_{python,node}.{py,js}`). Follow them. Single-aggregate emission is only valid when the benchmark really is one indivisible measurement (one e2e workflow, one perf number) -- and even then, attach every observable as extras.
|
|
334
|
+
|
|
335
|
+
**Runner-library wrappers are the common failure mode.** When the selected benchmark wraps a runner library (e.g. `inspect_evals`, `lm-eval-harness`, `evals`), the per-item loop is hidden inside the runner. The wrapper script's natural shape is to call the runner, read its aggregate output, and write a single `{"score": X}` to `result.json`. This is the anti-pattern. **Even when the runner library handles the per-sample loop internally, the wrapper script MUST parse the runner's per-sample output JSON and emit one `log_task(item_id, score=..., extras={...})` per item.** The runner typically already writes per-sample data to disk (most of these libraries do — `inspect_evals` writes per-sample logs, `lm-eval-harness` writes per-doc records); the wrapper just hasn't been forwarding it to evo. Without per-item forwarding, the dashboard's Tasks tab is empty, the verifier can't spot-check, and the agent has no way to diagnose WHICH items the model fails on — which is required for any RL-on-failures or curriculum strategy.
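The forwarding step might look like the sketch below. It assumes the runner wrote one JSON object per line with `id`, `score`, `input`, `target`, and `output` fields; real runners use their own layouts, so the field names are placeholders. `log_task` is passed in as a callable and is called with keyword extras, following the usage example in `references/inline_instrumentation.py`.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def iter_samples(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def forward_samples(path: Path, log_task: Callable[..., None]) -> int:
    """Emit one log_task call per runner sample; return how many were forwarded."""
    count = 0
    for sample in iter_samples(path):
        log_task(
            str(sample["id"]),                  # stable per-item id
            score=float(sample["score"]),       # per-item score, not the aggregate
            input=sample.get("input"),          # assumed field names from here down
            expected=sample.get("target"),
            model_output=sample.get("output"),
        )
        count += 1
    if count == 0:
        # Crash rather than silently publishing an empty run.
        raise ValueError(f"no per-sample records found in {path}")
    return count
```

After forwarding, the final score can come from the instrumentation helper's own aggregation instead of the runner's summary number.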
|
|
336
|
+
|
|
337
|
+
**Write traces for the future reader.** The only consumers of a committed experiment's traces look BACK at them -- a future orchestrator picking the next move, a verifier auditing for false progress, an ideator clustering failure modes. A trace that records `{score: 0}` and nothing else is unrecoverable: the future reader cannot tell whether the model produced wrong reasoning, unparseable format, or no output at all. Two rules of thumb:
|
|
338
|
+
|
|
339
|
+
- **`log_task` extras carry context** -- the item's input, expected output, the model's actual output (or first ~500 chars of it), parse outcome, any error. Cost: a few KB per task. Payoff: diagnosis is possible later without re-running the eval.
|
|
340
|
+
- **`evo annotate <exp_id>` is the human-readable summary** -- one or two factual lines after a run commits: which data sources were used, what the score actually represented, the failure mode if any. Annotations get loaded into future orchestrator decisions and ideator briefs, so write them as facts ("model emitted `\boxed{}`, eval prompt requested `ANSWER:`, format mismatch suspected"), not as plans or feelings.
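As an illustration of the first rule, a helper along these lines could assemble the extras for one item. `build_extras` and its key names are hypothetical; the 500-character cap mirrors the "first ~500 chars" guidance above.

```python
from typing import Any, Optional


def build_extras(
    item_input: Any,
    expected: Any,
    model_output: Optional[str],
    error: Optional[str] = None,
    max_chars: int = 500,
) -> dict:
    """Collect per-item context so a later reader can diagnose a score of 0."""
    text = "" if model_output is None else str(model_output)
    return {
        "input": item_input,
        "expected": expected,
        "model_output": text[:max_chars],          # keep traces to a few KB
        "output_truncated": len(text) > max_chars,
        "produced_output": bool(text.strip()),     # separates "no output" from "wrong output"
        "error": error,
    }
```

The resulting dict can be splatted into `log_task` as keyword extras alongside the item's score.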
|
|
341
|
+
|
|
342
|
+
### 10d. Audit with benchmark-reviewer (mandatory before baseline)
|
|
343
|
+
|
|
344
|
+
Before `evo run` is invoked for the first time, audit the harness via the bundled subagent:
|
|
345
|
+
|
|
346
|
+
```
|
|
347
|
+
Task(subagent_type="evo:benchmark-reviewer",
|
|
348
|
+
prompt="workspace=<absolute path to dir containing .evo/>\nbenchmark_command=<the literal --benchmark string from evo init>\nunit=<one-line description of an item, e.g. 'AIME problem', 'HumanEval task', 'BFCL turn'>")
|
|
349
|
+
```
|
|
350
|
+
|
|
351
|
+
The subagent returns a JSON report with two fields, `passed` and `findings[]`. Each finding has `severity: block|warn|note`.
|
|
352
|
+
|
|
353
|
+
**Gate the baseline on this.** If `passed: false`, address every `block` finding before re-invoking the reviewer. Do **not** proceed to `evo run` until the report comes back clean. Typical `block` findings: aggregate-only emission (most common -- see step 10c), training source that overlaps the held-out set, no real gate registered for a constructed benchmark, harness silently writes `{"score": 0.0}` on error instead of crashing.
|
|
354
|
+
|
|
355
|
+
`warn` and `note` findings are informational -- record them in `.evo/project.md` and proceed.
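The gating rule can be expressed as a short check over the report. Only `passed`, `findings`, and `severity` come from the description above; any other field a finding carries (such as the `message` key used in the test data) is an assumption.

```python
import json


def blocking_findings(report_json: str) -> list:
    """Return the findings whose severity is `block`."""
    report = json.loads(report_json)
    return [f for f in report.get("findings", []) if f.get("severity") == "block"]


def may_run_baseline(report_json: str) -> bool:
    """Proceed only when the report passed and no block finding remains."""
    report = json.loads(report_json)
    return bool(report.get("passed")) and not blocking_findings(report_json)
```

`warn` and `note` findings do not affect the result; they are left for recording in `.evo/project.md`.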
|
|
356
|
+
|
|
357
|
+
### 10e. Cheap validation run
|
|
322
358
|
|
|
323
359
|
Before the full baseline, validate the toolchain with the cheapest possible end-to-end run (single task, smallest split, dry-run flag -- whatever is fastest). Run the check from the main repo root:
|
|
324
360
|
|
|
@@ -339,7 +375,7 @@ This is the authoritative wiring check, and it is language-agnostic -- it runs t
|
|
|
339
375
|
|
|
340
376
|
Fix any issues and re-validate before proceeding.
|
|
341
377
|
|
|
342
|
-
###
|
|
378
|
+
### 10f. Commit inside the worktree
|
|
343
379
|
|
|
344
380
|
Logical commits are ideal but not required. Minimal acceptable:
|
|
345
381
|
|
|
@@ -405,9 +441,33 @@ End the skill by reporting in chat:
|
|
|
405
441
|
- The baseline experiment ID and score
|
|
406
442
|
- The chosen optimization dimension and why
|
|
407
443
|
- A one-liner on next steps: "Run `/evo:optimize` to start the optimization loop."
|
|
444
|
+
|
|
445
|
+
**Do not run experiments outside `/evo:optimize`.** Even if the workspace's resource profile forces serial execution (e.g. exclusive-GPU, width 1), you still go through `/evo:optimize` with `subagents=1`. The optimize loop's value isn't just parallelism -- it's the structured loop around every experiment (scan-subagent cross-cutting analysis, brief writing, verifier pre/post hooks, ideator spawning on stall, frontier reconciliation). Bypassing optimize to "drive experiments directly" loses all of that and reverts to ad-hoc iteration. If you are tempted to skip optimize because the workload is serial, read `plugins/evo/skills/optimize/SKILL.md` for how to configure it for serial work -- the answer is `subagents=1`, NOT "don't run optimize."
|
|
446
|
+
|
|
408
447
|
- **Resume after crash:** if the host, the shell, or the machine restarts mid-flow, re-invoke `evo:optimize`. Evo reads `.evo/` and resumes from the last committed experiment -- no special restore procedure.
|
|
409
448
|
- **State is local to this machine:** experiment commits on branches like `evo/run_0000/exp_*` survive `git push --all`, but orchestration state (graph, annotations, project notes) lives only in `.evo/`. If that history matters to you, back up `.evo/` separately (e.g., `tar -czf evo-state-$(date +%F).tar.gz .evo/`).
|
|
410
449
|
|
|
450
|
+
## Polling discipline
|
|
451
|
+
|
|
452
|
+
When waiting on a long-running background process (training, evaluation, batch generation, any externally-spawned long task), do NOT use `while true; do sleep N; tail file; done`. That loop never exits when the underlying process crashes -- the tail keeps reading the same dead file, the agent interprets "no growth" as "still working," and the agent blocks indefinitely.
|
|
453
|
+
|
|
454
|
+
Use `evo wait`. The CLI is the bounded, structured replacement:
|
|
455
|
+
|
|
456
|
+
```bash
|
|
457
|
+
# wait for a training subprocess to exit, OR its log to stall, OR the GPU to go idle,
|
|
458
|
+
# whichever first; 60-minute ceiling; structured JSON on stdout
|
|
459
|
+
evo wait --for process=$TRAIN_PID \
|
|
460
|
+
--for log-growth=$TRAIN_LOG \
|
|
461
|
+
--for gpu-idle \
|
|
462
|
+
--timeout 60m --stall-threshold 5m --json
|
|
463
|
+
```
|
|
464
|
+
|
|
465
|
+
Multiple `--for` flags combine; the wait returns on the first matching condition. The JSON output's `exit_reason` and `triggered_by` identify which condition fired (process-exited / log-stalled / gpu-idle / timed-out). Process / log-growth / gpu-* watches do not require an evo workspace; the workspace-anchored watches (`--for experiments`, `--for ideators`) still work for ideator-proposal and commit waits.
|
|
466
|
+
|
|
467
|
+
The full surface, exit codes, JSON shape, and examples are in `references/evo-wait.md`.
|
|
468
|
+
|
|
469
|
+
If `evo wait` is not available for some reason (older CLI on PATH), fall back to a bounded poll loop that checks all three signals -- process liveness via `kill -0 $PID`, log growth via `wc -c` delta, GPU via `nvidia-smi --query-gpu=utilization.gpu` -- and exits as soon as any one of them goes negative (process gone, log no longer growing, GPU idle) or the time bound is reached. NEVER use an unbounded `while true`.
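A bounded fallback of this kind might be sketched as follows. It is POSIX-only (signal 0 probes a PID without affecting it), covers only the process and log-growth signals so it runs without a GPU, and reuses the exit-reason names from the `evo wait` description purely as labels. `bounded_wait` is a hypothetical helper, not an evo API.

```python
import os
import time


def process_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX); nothing is delivered to the process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def bounded_wait(pid: int, log_path: str, timeout_s: float,
                 stall_s: float, poll_s: float = 5.0) -> str:
    """Poll until the process exits, the log stops growing, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    last_size = -1
    last_growth = time.monotonic()
    while time.monotonic() < deadline:  # hard ceiling: never an unbounded loop
        if not process_alive(pid):
            return "process-exited"
        size = os.path.getsize(log_path) if os.path.exists(log_path) else 0
        now = time.monotonic()
        if size != last_size:
            last_size, last_growth = size, now
        elif now - last_growth >= stall_s:
            return "log-stalled"
        time.sleep(poll_s)
    return "timed-out"
```

A GPU check would slot into the same loop as a third early return; it is left out here because its output parsing depends on the local `nvidia-smi` setup.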
|
|
470
|
+
|
|
411
471
|
## Inspection commands (for debugging, reference only)
|
|
412
472
|
|
|
413
473
|
```bash
|
|
package/skills/discover/references/inline_instrumentation.js
CHANGED
|
@@ -6,6 +6,13 @@
|
|
|
6
6
|
* - Reads EVO_TRACES_DIR, EVO_EXPERIMENT_ID, EVO_RESULT_PATH from process.env.
|
|
7
7
|
* - Writes traces/task_<id>.json per task.
|
|
8
8
|
* - Writes the final result JSON to EVO_RESULT_PATH, or stdout if unset.
|
|
9
|
+
*
|
|
10
|
+
* **Per-task emission is the load-bearing discipline.** If your benchmark
|
|
11
|
+
* evaluates N independent items (per-test-case, per-document, per-sample),
|
|
12
|
+
* call logTask ONCE PER ITEM with as much detail as you have (input,
|
|
13
|
+
* expected, model output, timings). Do NOT roll up to one aggregate call --
|
|
14
|
+
* the dashboard's per-task panel and the verifier's reproducibility check
|
|
15
|
+
* both rely on per-item traces. See the USAGE EXAMPLE at the bottom.
|
|
9
16
|
*/
|
|
10
17
|
|
|
11
18
|
import {
|
|
@@ -95,3 +102,41 @@ export function writeResult(score) {
|
|
|
95
102
|
}
|
|
96
103
|
return score;
|
|
97
104
|
}
|
|
105
|
+
|
|
106
|
+
|
|
107
|
+
// === USAGE EXAMPLE (copy + adapt) ===
|
|
108
|
+
//
|
|
109
|
+
// For a benchmark scoring N independent items, emit one logTask per item.
|
|
110
|
+
// writeResult() with no arg aggregates from the per-task scores.
|
|
111
|
+
//
|
|
112
|
+
// async function main() {
|
|
113
|
+
// const cases = await loadTestCases(); // N items
|
|
114
|
+
// for (const [i, c] of cases.entries()) {
|
|
115
|
+
// const output = await runUnderTest(c.input);
|
|
116
|
+
// const correct = output === c.expected;
|
|
117
|
+
// logTask(
|
|
118
|
+
// `case_${String(i).padStart(2, "0")}`, // stable id per item
|
|
119
|
+
// correct ? 1.0 : 0.0, // per-item score
|
|
120
|
+
// {
|
|
121
|
+
// input: c.input, // extras land in the trace JSON
|
|
122
|
+
// expected: c.expected, // for diagnosis later
|
|
123
|
+
// actual: output,
|
|
124
|
+
// elapsed_ms: c.elapsed,
|
|
125
|
+
// }
|
|
126
|
+
// );
|
|
127
|
+
// }
|
|
128
|
+
// // No arg -> writeResult averages the per-task scores
|
|
129
|
+
// const final = writeResult();
|
|
130
|
+
// console.log(`final: ${final.toFixed(4)}`);
|
|
131
|
+
// }
|
|
132
|
+
//
|
|
133
|
+
// Anti-pattern: one logTask or writeResult with the aggregate score. Loses
|
|
134
|
+
// the per-item detail the dashboard and verifier rely on. Don't do this:
|
|
135
|
+
//
|
|
136
|
+
// const score = passed / total;
|
|
137
|
+
// logTask("eval_total", score); // <-- aggregate; useless
|
|
138
|
+
// writeResult(score);
|
|
139
|
+
//
|
|
140
|
+
// Exception: if the benchmark really is a single indivisible measurement
|
|
141
|
+
// (one e2e workflow, one perf number), emit one task with that score AND
|
|
142
|
+
// pass every observable as extras (timings, allocations, error log).
|
|
package/skills/discover/references/inline_instrumentation.py
CHANGED
|
@@ -5,6 +5,18 @@ Contract:
|
|
|
5
5
|
- Reads EVO_TRACES_DIR, EVO_EXPERIMENT_ID, EVO_RESULT_PATH from env.
|
|
6
6
|
- Writes traces/task_<id>.json per task.
|
|
7
7
|
- Writes the final result JSON to EVO_RESULT_PATH, or stdout if unset.
|
|
8
|
+
|
|
9
|
+
**Per-task emission is the load-bearing discipline.** If your benchmark
|
|
10
|
+
evaluates N independent items (per-question math, per-test-case unit
|
|
11
|
+
tests, per-document QA, per-sample reasoning trace), call `log_task`
|
|
12
|
+
ONCE PER ITEM with as much detail as you have (problem text, model
|
|
13
|
+
output, expected answer, intermediate reasoning) -- not one aggregate
|
|
14
|
+
call with the rolled-up score. The dashboard's per-task panel and the
|
|
15
|
+
verifier's reproducibility spot-check both rely on per-item traces;
|
|
16
|
+
emitting one aggregate `log_task("eval_total", score)` makes both
|
|
17
|
+
useless and loses the diagnostic value of the run. `write_result()`
|
|
18
|
+
aggregates from the per-task scores automatically -- you don't need to
|
|
19
|
+
roll up yourself. See the USAGE EXAMPLE at the bottom of this file.
|
|
8
20
|
"""
|
|
9
21
|
|
|
10
22
|
from __future__ import annotations
|
|
@@ -107,3 +119,41 @@ def write_result(score: float | None = None) -> float:
|
|
|
107
119
|
else:
|
|
108
120
|
print(payload)
|
|
109
121
|
return score
|
|
122
|
+
|
|
123
|
+
|
|
124
|
+
# === USAGE EXAMPLE (copy + adapt) ===
|
|
125
|
+
#
|
|
126
|
+
# For a benchmark scoring N independent items, emit one log_task per item.
|
|
127
|
+
# write_result() with no arg aggregates _SCORES into the final score.
|
|
128
|
+
#
|
|
129
|
+
# def main():
|
|
130
|
+
# problems = load_aime_problems() # list of N items
|
|
131
|
+
# model = load_model()
|
|
132
|
+
# for i, problem in enumerate(problems):
|
|
133
|
+
# output = model.generate(problem.question)
|
|
134
|
+
# correct = extract_answer(output) == problem.expected
|
|
135
|
+
# log_task(
|
|
136
|
+
# f"aime_q{i:02d}", # stable id per item
|
|
137
|
+
# score=1.0 if correct else 0.0, # per-item score
|
|
138
|
+
# question=problem.question, # everything else goes as **extra
|
|
139
|
+
# expected=problem.expected, # and lands in the trace JSON
|
|
140
|
+
# model_output=output, # for diagnosis later
|
|
141
|
+
# tokens_used=len(model.last_tokens),
|
|
142
|
+
# )
|
|
143
|
+
# # write_result() with no arg returns mean(_SCORES) over all logged tasks --
|
|
144
|
+
# # no need to compute the average yourself
|
|
145
|
+
# final_score = write_result()
|
|
146
|
+
# print(f"final: {final_score:.4f}")
|
|
147
|
+
#
|
|
148
|
+
# Anti-pattern: ONE log_task or write_result call with the aggregate. The
|
|
149
|
+
# dashboard's per-task panel + the verifier's reproducibility spot-check
|
|
150
|
+
# both need per-item traces; aggregate-only emission makes them useless.
|
|
151
|
+
#
|
|
152
|
+
# # DO NOT do this:
|
|
153
|
+
# score = sum(per_problem_correct) / len(per_problem_correct)
|
|
154
|
+
# log_task("eval_total", score) # <-- aggregate; loses detail
|
|
155
|
+
# write_result(score)
|
|
156
|
+
#
|
|
157
|
+
# Exception: if the benchmark really is a single indivisible measurement
|
|
158
|
+
# (one e2e workflow, one perf number), emit one task with that score AND
|
|
159
|
+
# include every observable as **extra (timings, allocations, error log).
|
|
package/skills/discover/references/sdk_node.js
CHANGED
|
@@ -3,6 +3,12 @@
|
|
|
3
3
|
// The SDK auto-reads $EVO_TRACES_DIR, $EVO_EXPERIMENT_ID, and
|
|
4
4
|
// $EVO_RESULT_PATH. Traces flush on each report() so the dashboard can
|
|
5
5
|
// stream progress live.
|
|
6
|
+
//
|
|
7
|
+
// **Per-task emission is the load-bearing discipline.** Loop over your N
|
|
8
|
+
// independent items and call run.report(id, {score, ...}) ONCE PER ITEM.
|
|
9
|
+
// Do NOT roll up to one aggregate `run.report("eval_total", ...)` call --
|
|
10
|
+
// dashboard panel + verifier reproducibility check both rely on per-item
|
|
11
|
+
// traces. The Anti-pattern at the bottom is what to avoid.
|
|
6
12
|
|
|
7
13
|
import { Run, Gate } from '@evo-hq/evo-agent';
|
|
8
14
|
|
|
@@ -26,3 +32,16 @@ for (const task of criticalTasks) {
|
|
|
26
32
|
gate.check(task.id, { score: result.score });
|
|
27
33
|
}
|
|
28
34
|
await gate.finish();
|
|
35
|
+
|
|
36
|
+
// ---- ANTI-PATTERN (do NOT do this) ----
|
|
37
|
+
//
|
|
38
|
+
// Reporting one aggregate task entry loses all the diagnostic value of
|
|
39
|
+
// per-item traces. The SDK aggregates from per-task reports automatically.
|
|
40
|
+
//
|
|
41
|
+
// // WRONG:
|
|
42
|
+
// const scores = await Promise.all(tasks.map(t => evaluate(t).then(r => r.score)));
|
|
43
|
+
// run.report("eval_total", { score: scores.reduce((a, b) => a + b) / scores.length });
|
|
44
|
+
// await run.finish();
|
|
45
|
+
//
|
|
46
|
+
// Exception: if your benchmark really is a single indivisible measurement,
|
|
47
|
+
// report one task AND attach every observable as extras.
|
|
package/skills/discover/references/sdk_python.py
CHANGED
|
@@ -5,6 +5,13 @@ Install `evo-hq-agent` with this project's package manager/runtime, for example
|
|
|
5
5
|
|
|
6
6
|
The SDK auto-reads $EVO_TRACES_DIR, $EVO_EXPERIMENT_ID, and $EVO_RESULT_PATH.
|
|
7
7
|
Traces flush on each report() so the dashboard can stream progress live.
|
|
8
|
+
|
|
9
|
+
**Per-task emission is the load-bearing discipline.** Loop over your N
|
|
10
|
+
independent items and call `run.report(task_id, score=..., ...)` ONCE PER
|
|
11
|
+
ITEM. Do NOT roll up to one aggregate `run.report("eval_total", score=avg)`
|
|
12
|
+
call -- the dashboard's per-task panel and the verifier's reproducibility
|
|
13
|
+
spot-check both rely on per-item traces. The example below shows the right
|
|
14
|
+
pattern. The Anti-pattern at the bottom is what to avoid.
|
|
8
15
|
"""
|
|
9
16
|
|
|
10
17
|
from evo_agent import Run, Gate
|
|
@@ -41,3 +48,21 @@ with Gate() as gate:
|
|
|
41
48
|
for task in critical_tasks:
|
|
42
49
|
result = evaluate(task, agent)
|
|
43
50
|
gate.check(task["id"], score=result.score)
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
# ---- ANTI-PATTERN (do NOT do this) ----
|
|
54
|
+
#
|
|
55
|
+
# Computing the aggregate yourself and reporting it as a single task
|
|
56
|
+
# entry loses all the diagnostic value of the per-item traces -- dashboard
|
|
57
|
+
# can't show which problems failed, verifier can't spot-check, ideator
|
|
58
|
+
# can't cluster failure modes. The SDK aggregates from per-task reports
|
|
59
|
+
# automatically.
|
|
60
|
+
#
|
|
61
|
+
# # WRONG:
|
|
62
|
+
# scores = [evaluate(t, agent).score for t in tasks]
|
|
63
|
+
# run.report("eval_total", score=sum(scores) / len(scores))
|
|
64
|
+
# run.finish()
|
|
65
|
+
#
|
|
66
|
+
# Exception: if your benchmark really is a single indivisible measurement
|
|
67
|
+
# (one e2e workflow with one score number), report one task AND attach
|
|
68
|
+
# every observable as extras.
|
|
package/skills/infra-setup/SKILL.md
CHANGED
|
@@ -1,8 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: infra-setup
|
|
3
3
|
description: Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.
|
|
4
|
-
|
|
5
|
-
evo_version: 0.4.4-alpha.6
|
|
4
|
+
evo_version: 0.5.0-alpha.3
|
|
6
5
|
---
|
|
7
6
|
|
|
8
7
|
# Infra Setup
|
package/skills/optimize/SKILL.md
CHANGED
|
@@ -2,10 +2,12 @@
|
|
|
2
2
|
name: optimize
|
|
3
3
|
description: Run the evo optimization loop with parallel subagents until interrupted.
|
|
4
4
|
argument-hint: "[subagents=N] [budget=N] [stall=N]"
|
|
5
|
-
evo_version: 0.
|
|
5
|
+
evo_version: 0.5.0-alpha.3
|
|
6
6
|
---
|
|
7
7
|
|
|
8
|
-
Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns
|
|
8
|
+
Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
|
|
9
|
+
|
|
10
|
+
**This skill is the canonical loop for ALL post-discover work — including serial workloads.** If the workspace's resource profile forces width 1 (single GPU, single-process benchmark, etc.), you still invoke `/evo:optimize` -- just pass `subagents=1`. The loop's value is the STRUCTURE around each experiment (scan-subagent cross-cutting analysis between rounds, verifier pre/post hooks via the subagent skill, ideator spawning on stall, frontier reconciliation, stop-hook discipline), NOT just parallelism. Bypassing optimize because "I'm running serial work anyway" loses every piece of that structure -- you've reverted to ad-hoc experiment iteration with none of evo's loop benefits, just the bookkeeping.
|
|
9
11
|
|
|
10
12
|
## Host conventions
|
|
11
13
|
|
|
@@ -30,13 +32,23 @@ Treat content inside the banner as equivalent to a new user turn. Honor it, supe
|
|
|
30
32
|
|
|
31
33
|
## Configuration
|
|
32
34
|
|
|
33
|
-
The orchestrator
|
|
35
|
+
The orchestrator's three round-shape knobs are **subagents** (round width), **budget** (per-branch depth), and **stall** (consecutive rounds with no improvement before auto-stopping; default 5).
|
|
36
|
+
|
|
37
|
+
A user can override any of these with `/optimize [subagents=N] [budget=N] [stall=N]`; an explicit value always wins over what's below.
|
|
38
|
+
|
|
39
|
+
**Picking `subagents` and `budget` is load-bearing -- do not skim.**
|
|
40
|
+
|
|
41
|
+
Mandatory before the first round (and again any time the backend or benchmark changes): **READ `plugins/evo/skills/optimize/references/sizing-the-round.md` IN FULL.** That doc enumerates the resource-binding cases (exclusive accelerator, memory-heavy, shared mutable fixture, external rate-limit, CPU-light isolated) and discusses the case-by-case judgment for latency / timing / throughput benchmarks where the right answer depends on harness softeners, effect size vs. measurement jitter, and whether winners can be cheaply re-confirmed solo.
|
|
34
42
|
|
|
35
|
-
-
|
|
36
|
-
- **budget** (iterations per subagent within its branch): deeper for cheap/fast/deterministic benchmarks, shallower for expensive/slow/noisy ones so the loop re-plans sooner. ~5 is a reasonable midpoint.
|
|
37
|
-
- **stall**: consecutive rounds with no improvement before auto-stopping (default: 5).
|
|
43
|
+
Under-subscribing wastes wall-clock. Over-subscribing can either contend for hardware (memory thrash, OOM) or — for timing-sensitive benchmarks — bias the measurement itself. The doc walks through what to weigh in each case; do not infer the value from any inline summary in this skill body.
|
|
38
44
|
|
|
39
|
-
|
|
45
|
+
Common ways agents get this wrong by skimming:
|
|
46
|
+
- "8-core machine, CPU-light → width 5" sounds right but skips the question of whether the metric is corruptible by sibling-process pressure. The doc has the judgment framing.
|
|
47
|
+
- "Worktree backend has no slot cap so I can go higher" — worktree just shifts the cap from infrastructure to the binding resource. Same hardware, no safety net.
|
|
48
|
+
|
|
49
|
+
If `.evo/project.md` records a resource profile (it should, after `/evo:discover`), START from that. The reference doc is what you use to APPLY it. If the profile is missing or thin, that's a discover-step bug — fix it (write a resource profile that names the binding resource explicitly) before continuing.
|
|
50
|
+
|
|
51
|
+
In your opening message, state the width/budget you chose AND a one-line reason that references the binding-resource framing FROM THE DOC (e.g. "width 1 — exclusive GPU; budget 8 — runs deterministic"; or "width 3 — CPU-light isolated, but harness reports stable jitter at this concurrency so promoting solo-confirm gate; budget 5"). If your reason doesn't connect to the doc's framing, go back and read it.
|
|
40
52
|
|
|
41
53
|
- **autonomous**: the keep-going loop. **Default: on** — evo is autoresearch; it runs unattended. Turn off for a run with `evo autonomous off`.
|
|
42
54
|
- **subagents-only**: gate orchestrator edits, pushing all edits through subagents. **Default: on**. Turn off for a run with `evo subagents-only off`.
|
|
@@ -297,6 +309,86 @@ Update notes with cross-cutting learnings:
|
|
|
297
309
|
evo set <exp_id> --note "key insight from round N"
|
|
298
310
|
```
|
|
299
311
|
|
|
312
|
+
### 6a. Pattern recognition across history (objective, not narrative)
|
|
313
|
+
|
|
314
|
+
Step 6 records cross-cutting learnings from a single round. This step looks across ALL committed experiments in the run, not just this round's. The orchestrator's failure mode is tunnel vision -- iterating on the visible axis (whatever knob the recent rounds touched) while missing the orthogonal axis (the harness itself, the score definition, the environment, the input data, plumbing). The check is cheap, runs between rounds, and is the most reliable signal that you're on the wrong axis.
|
|
315
|
+
|
|
316
|
+
Four checks via `evo show` + `evo tree`:
|
|
317
|
+
|
|
318
|
+
1. **Score plateaus across structurally distinct hypotheses.** Read the `hypothesis` strings of committed experiments. If 3+ experiments with materially different hypotheses (not minor parameter sweeps of the same idea) all commit at the same score, the bottleneck is not where the hypotheses were aiming. The next move belongs on an axis none of those hypotheses touched.
|
|
319
|
+
|
|
320
|
+
2. **Repeated failure class.** Tally failure indicators across discarded + failed nodes in the run so far: `gate_failures` names, non-zero exit codes, shared error-message fragments in `benchmark_err.log`. If 2+ failures share a class, that class is structural -- fix the cause, rather than queuing more experiments that will hit the same wall.
|
|
321
|
+
|
|
322
|
+
3. **Internal-vs-benchmark delta.** Compare each evaluated node's *internal* indicators (progress signals the experiment's own process produces during a run -- intermediate test pass-rates, training loss, build success, agent self-report metrics, whatever the trace stream carries) to its *committed* benchmark score. Healthy internal signal + flat benchmark score = the experiment is optimizing something the benchmark does not reward. The fix is usually in the harness, output format, or score definition -- not in another hypothesis on the same axis.
|
|
323
|
+
|
|
324
|
+
4. **Annotate facts, not narratives.** Annotations via `evo annotate <exp_id>` and `evo set <exp_id> --note` should record what HAPPENED -- scores, exact error messages, surprising observations, sources used. Not what you hoped would happen, what you plan to try next, or how you feel about the result. Annotations get loaded into future decision context and into ideator briefs; narrative noise contaminates them. State facts; leave plans to TodoWrite or `evo set --note` on the round itself.
|
|
325
|
+
|
|
326
|
+
If any check surfaces a structural issue, the next round's subagent briefs should target the orthogonal axis the pattern identifies. Another iteration on a plateaued or systematically-failing axis produces another data point with the same conclusion.
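Checks 1 and 2 can be sketched in a few lines. This is a minimal illustration, not evo's own logic: the record shape (`score`, `hypothesis`, `failure_classes`) and both function names are assumptions, and deciding whether two hypotheses are "materially different" still needs a human or agent read of the strings.

```python
from collections import Counter


def find_plateau(committed, min_distinct=3, ndigits=6):
    """Return a score shared by >= min_distinct distinct hypotheses, else None.

    `committed` is a hypothetical list of dicts with "score" and "hypothesis".
    Distinct strings only approximate "materially different hypotheses".
    """
    hypotheses_by_score = {}
    for exp in committed:
        key = round(exp["score"], ndigits)
        hypotheses_by_score.setdefault(key, set()).add(exp["hypothesis"])
    for score, hypotheses in hypotheses_by_score.items():
        if len(hypotheses) >= min_distinct:
            return score
    return None


def repeated_failure_classes(failed, min_count=2):
    """Return failure classes seen in >= min_count discarded/failed nodes.

    `failure_classes` is a hypothetical per-node list of labels you extracted
    by hand (gate names, exit codes, error fragments).
    """
    counts = Counter()
    for exp in failed:
        # Count each class once per node so one noisy node cannot trip the check.
        counts.update(set(exp["failure_classes"]))
    return sorted(cls for cls, n in counts.items() if n >= min_count)
```

A non-`None` plateau or a non-empty class list is the cue to move the next round onto an axis those experiments did not touch.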
|
|
327
|
+
|
|
328
|
+
### 6b. Periodically spawn ideators (in parallel)
|
|
329
|
+
|
|
330
|
+
The optimize loop's scan sub-agents (step 3) read the CURRENT round's evaluated experiments for failure patterns. They don't do deep cross-graph analysis or external literature scans -- that work belongs to the ideator skill (`evo:ideator`).
|
|
331
|
+
|
|
332
|
+
Spawn ideators in parallel when ANY of these triggers fire:
|
|
333
|
+
|
|
334
|
+
- **Periodic**: every N=5 committed experiments since the last ideator round
|
|
335
|
+
- **Stall**: best score unchanged for M=3 consecutive committed experiments (the stall counter from step 6)
|
|
336
|
+
- **Failure cluster**: M=3 consecutive discards with related root causes (use the `evo discards` output)
|
|
337
|
+
- **User-triggered**: a directive (`evo direct`) asks for fresh ideas
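The trigger list above can be read as a small predicate. The sketch below assumes you already track the four counters yourself; the function and parameter names are illustrative, not part of the evo CLI.

```python
def ideator_triggers(commits_since_last_ideators, stall_count,
                     related_discard_streak, user_requested, n=5, m=3):
    """Return the names of the triggers that currently fire (possibly empty).

    All inputs are hypothetical bookkeeping values kept by the orchestrator:
    commits since the last ideator round, consecutive commits without a new
    best score, consecutive related discards, and whether a directive asked
    for fresh ideas.
    """
    fired = []
    if commits_since_last_ideators >= n:
        fired.append("periodic")
    if stall_count >= m:
        fired.append("stall")
    if related_discard_streak >= m:
        fired.append("failure_cluster")
    if user_requested:
        fired.append("user")
    return fired
```

Any non-empty result means ideators should be spawned; keeping the trigger names lets later steps treat a stall differently from a periodic spawn.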
|
|
338
|
+
|
|
339
|
+
When a trigger fires, spawn three parallel **evo ideator subagents** via your host's Task tool -- one per brief:
|
|
340
|
+
|
|
341
|
+
```
|
|
342
|
+
Task(subagent_type="evo:ideator", prompt="workspace=<path>\nbrief=failure_analysis")
|
|
343
|
+
Task(subagent_type="evo:ideator", prompt="workspace=<path>\nbrief=literature")
|
|
344
|
+
Task(subagent_type="evo:ideator", prompt="workspace=<path>\nbrief=frontier_extrapolation")
|
|
345
|
+
```
|
|
346
|
+
|
|
347
|
+
| Brief | What it does |
|
|
348
|
+
|---|---|
|
|
349
|
+
| `failure_analysis` | Cross-graph clustering of discards/failures |
|
|
350
|
+
| `literature` | Web/arXiv scan for untried techniques in the workspace domain |
|
|
351
|
+
| `frontier_extrapolation` | Deeper variants of the steepest score gradient on the best path |
|
|
352
|
+
|
|
353
|
+
Each subagent runs the brief in its own context, appends proposals as JSONL lines to `.evo/run_<run_id>/ideator/proposals.jsonl` (single final write), and returns a JSON summary. See `plugins/evo/agents/ideator.md` for the full procedure each ideator follows.
|
|
354
|
+
|
|
355
|
+
Ideators take 5-10 min while the optimize loop's next round is typically 1-2 min away. If you fire and continue, proposals miss the next round's brief-writing every time. Two patterns work:
|
|
356
|
+
|
|
357
|
+
- **Block here briefly.** If the trigger was a STALL or FAILURE CLUSTER, the next round's quality depends on fresh ideas -- block until enough proposals land:
|
|
358
|
+
```bash
|
|
359
|
+
evo wait --for ideators --count 3 --timeout 900 # 15 min cap, fail-open
|
|
360
|
+
```
|
|
361
|
+
Exit 0 means the proposals are ready; exit 124 (timeout) means proceed with whatever's available -- `proposals.jsonl` may have partial results.
|
|
362
|
+
|
|
363
|
+
- **Fire and continue for periodic spawns** (every-N-commits trigger). The next round can run without proposals; the round after that will read them once they land. Use this when there's plenty of in-graph signal still to extract.
|
|
364
|
+
|
|
365
|
+
### 6c. Reconcile ideator proposals at brief-writing time
|
|
366
|
+
|
|
367
|
+
Before writing the next round's briefs (step 4 of the next iteration), check for new ideator proposals:
|
|
368
|
+
|
|
369
|
+
```bash
|
|
370
|
+
# Read proposals newer than the last round
|
|
371
|
+
test -f .evo/run_*/ideator/proposals.jsonl && \
|
|
372
|
+
cat .evo/run_*/ideator/proposals.jsonl | \
|
|
373
|
+
jq -s --argjson cutoff "$LAST_ROUND_END_TS" \
|
|
374
|
+
'map(select(.generated_at > $cutoff))'
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
If you fired ideators in 6b WITHOUT blocking (periodic trigger) and they haven't landed yet, you can also wait here -- but only with a short timeout, since brief-writing should not stall indefinitely:
|
|
378
|
+
|
|
379
|
+
```bash
|
|
380
|
+
evo wait --for ideators --count 1 --timeout 120 # 2 min cap, fail-open
|
|
381
|
+
```
|
|
382
|
+
|
|
383
|
+
For each new proposal:
|
|
384
|
+
|
|
385
|
+
1. Check the workspace graph -- has the proposed config already been tried? Use `evo discards --like "<keyword>"` to scan. If yes, skip.
|
|
386
|
+
2. Score each remaining proposal by `expected_score_uplift × confidence`. Confidence ranking: `frontier_extrapolation > failure_analysis > literature`, all else equal.
|
|
387
|
+
3. The top 1-2 proposals become objectives in the next round's briefs (step 4). Cite the proposal's `hypothesis` and `mechanism` in the brief's *Objective* field.
|
|
388
|
+
4. Leave the rest in the queue -- they may surface as winners after a few more rounds when the frontier shifts.
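The four steps above can be sketched as one ranking function. This is a hedged illustration: the `brief` field and the `already_tried` set (a stand-in for the `evo discards --like` check) are assumptions, and the brief ordering is applied only as a tie-breaker, which is one reading of "all else equal".

```python
# Tie-break order taken from the text: frontier > failure analysis > literature.
BRIEF_RANK = {"frontier_extrapolation": 2, "failure_analysis": 1, "literature": 0}


def pick_proposals(proposals, already_tried, top_k=2):
    """Split proposals into (chosen, queued) for the next round's briefs.

    `proposals` is a list of dicts with "hypothesis", "expected_score_uplift",
    "confidence" and (assumed) "brief"; `already_tried` is a set of hypothesis
    strings you have confirmed are already in the workspace graph.
    """
    fresh = [p for p in proposals if p["hypothesis"] not in already_tried]
    ranked = sorted(
        fresh,
        key=lambda p: (
            p["expected_score_uplift"] * p["confidence"],
            BRIEF_RANK.get(p.get("brief"), -1),
        ),
        reverse=True,
    )
    return ranked[:top_k], ranked[top_k:]
```

The second return value is the queue that stays available for later rounds.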
|
|
389
|
+
|
|
390
|
+
Proposals are advisory, not mandatory. If none look better than what step 3's scan sub-agents surfaced from in-graph signal, ignore them and proceed with the in-graph briefs. Ideator output complements the in-graph signal; it does not replace it.
|
|
391
|
+
|
|
300
392
|
### 7. Continue or stop
|
|
301
393
|
|
|
302
394
|
**Continue** if:
|
|
@@ -317,6 +409,27 @@ On stop, print a final summary:
|
|
|
317
409
|
|
|
318
410
|
Go back to step 1.
|
|
319
411
|
|
|
412
|
+
## Polling discipline
|
|
413
|
+
|
|
414
|
+
When waiting on a long-running background process (a subagent's training subprocess, a long evaluation, a batch job), do NOT use `while true; do sleep N; tail file; done`. That loop never exits when the underlying process crashes -- the tail keeps reading the same dead file, the agent interprets "no growth" as "still working," and the agent blocks indefinitely.
|
|
415
|
+
|
|
416
|
+
Use `evo wait`. The CLI is the bounded, structured replacement:
|
|
417
|
+
|
|
418
|
+
```bash
|
|
419
|
+
# wait until the training subprocess exits, OR its log stalls, OR the GPU goes idle,
|
|
420
|
+
# whichever first; 60-minute ceiling; structured JSON on stdout
|
|
421
|
+
evo wait --for process=$TRAIN_PID \
|
|
422
|
+
--for log-growth=$TRAIN_LOG \
|
|
423
|
+
--for gpu-idle \
|
|
424
|
+
--timeout 60m --stall-threshold 5m --json
|
|
425
|
+
```
|
|
426
|
+
|
|
427
|
+
Multiple `--for` flags combine; the wait returns on the first matching condition. The JSON output's `exit_reason` and `triggered_by` identify which condition fired. Process / log-growth / gpu-* watches do not require an evo workspace context; the workspace-anchored watches (`--for experiments`, `--for ideators`) cover the ideator + commit waits described elsewhere in this skill.
|
|
428
|
+
|
|
429
|
+
Full surface, exit codes, JSON shape, examples: `references/evo-wait.md` (under `plugins/evo/skills/references/`).
|
|
430
|
+
|
|
431
|
+
If `evo wait` is not available for some reason (older CLI on PATH, sandbox constraint), fall back to a bounded poll loop that checks all three signals -- process liveness via `kill -0 $PID`, log growth via `wc -c` delta, GPU via `nvidia-smi --query-gpu=utilization.gpu` -- and exits as soon as any one of them goes negative. NEVER use an unbounded `while true`.
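A bounded fallback loop might look like the sketch below. It is an illustration under assumptions, not a substitute for `evo wait`: the function names, return strings, and default intervals are made up here, it assumes a POSIX host, and the GPU check is off by default because an idle reading before the job has warmed up would end the wait early.

```python
import os
import subprocess
import time


def process_alive(pid):
    """Equivalent of `kill -0 $PID`: True while the process exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def gpu_busy():
    """True/False from nvidia-smi utilization; None if it cannot be read."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
        return any(int(value) > 0 for value in out.split())
    except (OSError, subprocess.SubprocessError, ValueError):
        return None


def bounded_wait(pid, log_path, timeout_s=3600, stall_s=300, poll_s=5,
                 check_gpu=False):
    """Poll until one signal goes negative or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    last_size = -1
    last_growth = time.monotonic()
    while time.monotonic() < deadline:  # hard ceiling, never `while True`
        if not process_alive(pid):
            return "process_exited"
        size = os.path.getsize(log_path) if os.path.exists(log_path) else 0
        if size != last_size:
            last_size, last_growth = size, time.monotonic()
        elif time.monotonic() - last_growth >= stall_s:
            return "log_stalled"
        if check_gpu and gpu_busy() is False:
            return "gpu_idle"
        time.sleep(poll_s)
    return "timeout"
```

Every exit path returns a reason string, so the caller can tell a crash from a stall from a timeout instead of guessing from a silent log.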
|
|
432
|
+
|
|
320
433
|
## Resetting the eval epoch
|
|
321
434
|
|
|
322
435
|
`evo infra event -m "<reason>" --breaking` bumps `current_eval_epoch` and blocks
|
|
@@ -30,6 +30,14 @@ Read `.evo/project.md`'s resource-profile line first (discover records it). If i
|
|
|
30
30
|
|
|
31
31
|
When a run needs an exclusive resource, serializing benchmark *execution* (width 1) is correct even though the *edits* are independent — on the worktree backend `evo run` executes the benchmark in-place, so concurrent `evo run` means concurrent benchmark processes on that one resource.
|
|
32
32
|
|
|
33
|
+
**Latency / timing / throughput benchmarks deserve a per-workspace judgment call, not a fixed answer.** When the metric IS time, jitter, or rate, sibling-process CPU/cache/memory-bandwidth pressure can BIAS the measurement (not just add noise) — and the orchestrator may then promote a "winner" that's just a contention artifact. But this doesn't always happen, and harness softeners (warmup, min-over-N batches, outlier rejection) reduce the risk. Things to weigh case-by-case before picking width:
|
|
34
|
+
- How big is the optimization's expected effect vs. the variance the harness reports under parallel runs? If the effect is much larger than measurement jitter, modest parallelism is fine.
|
|
35
|
+
- How much of the benchmark's wall-clock is the actual timed section? Long edit/compile phases overlap safely; only the timed section needs isolation.
|
|
36
|
+
- Can a winner be cheaply re-confirmed solo before being promoted? If yes, going wider for exploration with a solo-confirm gate is reasonable.
|
|
37
|
+
- Does the harness already filter contention (e.g., reject batches with outlier jitter)?
|
|
38
|
+
|
|
39
|
+
If unsure, start narrower and widen once you've confirmed measurements are stable. Width 1 is the safe default for *unknown* timing-sensitive benchmarks; don't apply it reflexively when the workspace has data that says otherwise.
|
|
40
|
+
|
|
33
41
|
## 3. Depth — `budget` (iterations per subagent within its branch)
|
|
34
42
|
|
|
35
43
|
Depth trades exploration against spend, and keys off cost per run, not concurrency:
|
package/skills/report/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: report
|
|
3
3
|
description: Print the dashboard's dot chart (score over experiment order, status colors, best-path stair) inline in the terminal for every run in the workspace. Use when the user invokes /evo:report, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output.
|
|
4
|
-
evo_version: 0.
|
|
4
|
+
evo_version: 0.5.0-alpha.3
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Report
|
package/skills/subagent/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: subagent
|
|
3
3
|
description: Internal protocol for evo optimization subagents. Loaded by subagents spawned from /optimize via their host's skill loader. Not for orchestrator use.
|
|
4
|
-
evo_version: 0.
|
|
4
|
+
evo_version: 0.5.0-alpha.3
|
|
5
5
|
---
|
|
6
6
|
|
|
7
7
|
# Evo Subagent Protocol
|
|
@@ -160,7 +160,22 @@ For multi-line edits, `evo edit --json-stdin` reads `{"old":...,"new":...,"repla
|
|
|
160
160
|
|
|
161
161
|
You may edit anything within the target scope. Do NOT modify benchmark, gate, or framework code.
|
|
162
162
|
|
|
163
|
-
### 4.
|
|
163
|
+
### 4. Verify the experiment design (pre-`evo run`)
|
|
164
|
+
|
|
165
|
+
Before `evo run` burns compute, invoke the **evo verifier subagent** via your host's Task tool. Static analysis, ~30s.
|
|
166
|
+
|
|
167
|
+
```
|
|
168
|
+
Task(subagent_type="evo:verifier",
|
|
169
|
+
prompt="workspace=<workspace abs path>\nexperiment_id=<your exp_id>\nphase=pre")
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
The verifier checks for test-set leakage in your training data, subsetted eval commands, missing gates for new artifacts, generic hypotheses, and concurrent-resource conflicts. It returns a JSON report (`{passed, verdict, findings}`) and writes the same verdict as an `evo annotation` on the experiment. See `plugins/evo/agents/verifier.md` for the full check list.
|
|
173
|
+
|
|
174
|
+
If the verifier returns `passed: false` (verdict `fail`), address every flagged `block` finding and re-invoke until it returns `passed: true`. Skipping or fudging a `fail` verdict is a stop-the-line bug -- the verdict is the precondition for compute spend.
|
|
175
|
+
|
|
176
|
+
If the verifier returns verdict `warn`, you may proceed but address the warnings in your annotation (step 7).
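The fail/warn handling can be expressed as a small decision helper. The top-level keys `passed`, `verdict`, and `findings` come from the report described above; the per-finding `severity` field and the function name are assumptions made for illustration, so check `plugins/evo/agents/verifier.md` for the real finding shape.

```python
def may_run(report):
    """Return (proceed, items_to_address) for a verifier report dict.

    Assumes each finding is a dict with a hypothetical "severity" key whose
    value is "block" for blocking findings.
    """
    findings = report.get("findings", [])
    if not report.get("passed", False):
        # Fail verdict: do not spend compute; fix every blocking finding first.
        blocking = [f for f in findings if f.get("severity") == "block"]
        return False, blocking
    # Passed: warnings do not stop the run but belong in the annotation.
    warnings = findings if report.get("verdict") == "warn" else []
    return True, warnings
```

When `proceed` is `False`, the returned list is the work to do before re-invoking the verifier; when it is `True`, the list is what the step 7 annotation should address.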
|
|
177
|
+
|
|
178
|
+
### 5. Run the experiment
|
|
164
179
|
|
|
165
180
|
```bash
|
|
166
181
|
evo run <exp_id>
|
|
@@ -195,7 +210,7 @@ evo run <exp_id> --i-staged-new-files yes
|
|
|
195
210
|
|
|
196
211
|
The ack flag is required when the worktree has any untracked, non-gitignored file. Without it, `evo run` errors closed and lists the files. For each file, decide: source (then `git add`) or warm state (leave untracked -- it persists in the slot for future experiments). Then re-run with `--i-staged-new-files yes`. The flag value must be exactly `yes`. In `commit_strategy=all` workspaces (default for `--backend worktree`) the flag is a silent no-op; safe to always pass.
|
|
197
212
|
|
|
198
|
-
###
|
|
213
|
+
### 6. Analyze the result
|
|
199
214
|
|
|
200
215
|
`evo run` prints one of three outcomes:
|
|
201
216
|
|
|
@@ -215,7 +230,7 @@ The ack flag is required when the worktree has any untracked, non-gitignored fil
|
|
|
215
230
|
- Structural (benchmark broken, evo misconfigured): report to orchestrator and stop.
|
|
216
231
|
- Not worth fixing: `evo discard <id> --reason "..."`.
|
|
217
232
|
|
|
218
|
-
###
|
|
233
|
+
### 7. Annotate
|
|
219
234
|
|
|
220
235
|
```bash
|
|
221
236
|
evo annotate <exp_id> "<what you changed, what happened, and why>"
|
|
@@ -223,7 +238,7 @@ evo annotate <exp_id> "<what you changed, what happened, and why>"
|
|
|
223
238
|
|
|
224
239
|
Always annotate so other agents can learn from your experiments.
|
|
225
240
|
|
|
226
|
-
###
|
|
241
|
+
### 7b. Add gates for fixed behaviors
|
|
227
242
|
|
|
228
243
|
When you fix a critical, easy-to-regress behavior, lock it in as a gate so future experiments on this branch can't break it:
|
|
229
244
|
|
|
@@ -233,7 +248,7 @@ evo gate add <exp_id> --name "social_eng_resistance" --command "python3 {worktre
|
|
|
233
248
|
|
|
234
249
|
Good candidates: a specific benchmark task that was hard to fix, a test for a critical policy rule, a smoke test for a fragile behavior. The gate command must exit non-zero when the protected behavior regresses; a bare benchmark invocation that prints a low score but exits 0 is decorative and should not be registered. Do NOT gate every passing task -- that over-constrains the search.
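To make the exit-code requirement concrete, here is a minimal sketch of the check a gate script could wrap around a benchmark result. The result JSON shape (`{"tasks": {<task_id>: <score>}}`), the function name, and the exit-code choices are hypothetical; adapt them to whatever your benchmark actually emits.

```python
import json


def gate_exit_code(result_json, task_id, min_score):
    """Map a benchmark result to a process exit code for use as a gate.

    0 = protected behavior still holds, 1 = regression, 2 = task missing
    from the result (treated as a failure rather than a silent pass).
    """
    result = json.loads(result_json)
    score = result.get("tasks", {}).get(task_id)
    if score is None:
        return 2
    return 0 if score >= min_score else 1
```

A gate script would read the benchmark output, call this function, and pass the return value to `sys.exit`, so a low score can no longer exit 0.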
|
|
235
250
|
|
|
236
|
-
###
|
|
251
|
+
### 8. Decide: continue or stop
|
|
237
252
|
|
|
238
253
|
Continue if budget remains AND (last outcome was committed, OR you have a meaningfully different idea after an evaluated/discarded outcome). When continuing after a committed experiment, update your parent to the newly committed ID.
|
|
239
254
|
|